Ankan's Computer Vision Blog: 2016

Friday, December 9, 2016

A short summary of the paper - Generative adversarial nets

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.

This paper [1] proposes a method for estimating generative models using an adversarial process. The authors train two networks simultaneously: a generative model, $G$, and a discriminative model, $D$. The generative model learns the data distribution, and the discriminative model estimates the probability that a sample came from the training data rather than from $G$. The two networks are trained as a two-player minimax game. The discriminative network, $D$, is trained to maximise the probability of correctly classifying samples from both the training data and the data generated by $G$. Simultaneously, $G$ is trained to maximise the probability of $D$ making a mistake, i.e., so that the data generated by $G$ is indistinguishable from the training data. The authors train these two networks iteratively, alternating between $k$ steps of optimising $D$ and one step of optimising $G$.
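
To make the game concrete, here is a minimal sketch of the value function from [1], $V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$, estimated from a handful of discriminator outputs, together with the alternating update order. The function names `gan_value` and `update_schedule` are my own and purely illustrative. No networks are trained here; the discriminator outputs are made-up numbers.

```python
import math
from typing import Iterator, Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real: D(x) for samples x drawn from the training data.
    d_fake: D(G(z)) for samples produced by the generator.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def update_schedule(k: int, n_iterations: int) -> Iterator[str]:
    """Yield which network to update: k steps of D, then one step of G."""
    for _ in range(n_iterations):
        for _ in range(k):
            yield "D"
        yield "G"


# Hypothetical outputs: a discriminator that cannot tell the two sources
# apart returns 0.5 everywhere, giving log(0.5) + log(0.5) = -log(4).
confused = gan_value([0.5, 0.5], [0.5, 0.5])
schedule = list(update_schedule(k=2, n_iterations=2))
```

The value $-\log 4$ is the one the paper identifies as the global minimum of the generator's objective, reached when the generated distribution matches the data distribution.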

The authors also present theoretical guarantees on the convergence of the algorithm to the optimal value (in the sense of the global minimum of their objective function). However, the guarantees are not applicable to the case presented in the paper because they rely on assumptions that are infeasible to implement using neural networks. Nevertheless, the paper argues that since deep neural networks perform very well in several domains, they are reasonable models to use here.

In this paper, the authors use $k = 1$ to minimize the training cost. I think that it would be interesting to try higher values of $k$. This would bring the framework closer to one of the conditions of their guarantees; the condition that $D$ is allowed to reach its optimum given $G$. The authors don't mention whether they tried this. This could lead to better convergence even though it will be more expensive to train.

A disadvantage of this approach is that $D$ should be well synchronised with $G$, i.e., $G$ cannot be trained too much without updating $D$. Even so, adversarial nets offer several advantages over traditional generative models. This paper is an important step towards unsupervised learning and has already inspired some work in that direction [2].

References:

[1] Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In Advances in Neural Information Processing Systems, pp. 2672-2680. 2014.
[2] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

Wednesday, December 7, 2016

A short summary of the paper - On the Number of Linear Regions of Deep Neural Networks

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.

Abstract

This paper presents theoretical results about the advantages of deep networks over shallow networks. The authors calculate bounds on the number of linear regions that deep networks with piecewise linear activation functions (e.g. ReLU and Maxout) can use to approximate functions. The number of linear regions used for representing a function can be thought of as a measure of the complexity of the representation model. The paper shows that deep networks produce exponentially more linear regions than shallow networks.

Recently, [4] showed that shallow networks require exponentially more sum-product hidden units than deep networks to represent certain functions. For several years people used smooth activation functions as non-linearities in neural networks. But [2] showed that Rectified Linear Units (ReLU) are much faster to train. In 2013, [3] introduced a new form of piecewise linear activation called Maxout. This paper [1] aims to present a theoretical analysis of the advantages of deep networks with such activations over shallow networks. Such an analysis was also done in [5], which showed that, while approximating a function, deep networks are able to produce exponentially more linear regions than shallow networks with the same number of hidden units. This paper presents a tighter lower bound on the maximal number of linear regions of functions computed by neural networks than [5]. This lends deep networks an advantage over shallow networks in terms of representation power. A higher number of linear pieces means an ability to represent more complex functions.

This paper also describes how intermediate layers of deep neural networks "map several pieces of their inputs into the same output". The authors present the hypothesis that deep networks re-use and compose features from lower layers to higher layers exponentially often with the increase in the number of layers. This gives them the ability to compute highly complex functions.

Theoretical bounds on the number of linear regions produced by shallow networks were presented in [5]. They show that the maximal number of linear regions of functions computed by shallow rectifier networks with $n_0$ inputs and $n_1$ hidden units is $\sum_{j=0}^{n_0}\binom{n_1}{j}$. They also obtain a bound of $\Omega ((\frac{n}{n_0})^{(L-1)}n^{n_0})$ on the number of linear regions of a function that can be computed by a rectifier neural network with $n_{0}$ inputs and $L$ hidden layers of width $n \geq n_0$.

The main result presented in this paper is the following (Corollary 6 in [1]):
A rectifier neural network with $n_0$ input units and $L$ hidden layers of width $n \geq n_0$ can compute functions that have $\Omega ((\frac{n}{n_0})^{(L-1)n_0}n^{n_0})$ linear regions.
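
To get a feel for how these expressions grow, the following sketch evaluates the shallow count from [5] and the two expressions inside the $\Omega(\cdot)$ bounds quoted above. The function names are my own. Since $\Omega$ hides constant factors, the last two functions give growth rates rather than actual region counts, and the choice $n_0 = 2$, $n = 4$ is an arbitrary example.

```python
from math import comb


def shallow_regions(n0: int, n1: int) -> int:
    """Maximal number of linear regions of a shallow rectifier network [5]."""
    return sum(comb(n1, j) for j in range(n0 + 1))


def growth_term_pascanu(n0: int, n: int, L: int) -> float:
    """Expression inside the Omega(...) bound of [5]; constants are ignored."""
    return (n / n0) ** (L - 1) * n ** n0


def growth_term_montufar(n0: int, n: int, L: int) -> float:
    """Expression inside the Omega(...) bound of Corollary 6 in [1]."""
    return (n / n0) ** ((L - 1) * n0) * n ** n0


n0, n = 2, 4  # arbitrary example sizes
for L in (2, 3, 5):
    # Give the shallow network the same total number of hidden units, L * n.
    print(L, shallow_regions(n0, L * n),
          growth_term_pascanu(n0, n, L), growth_term_montufar(n0, n, L))
```

The only difference between the two deep bounds is the extra factor of $n_0$ in the exponent, which is what makes the bound in [1] tighter as $L$ grows.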

The main contribution of this paper is a tighter bound on the complexity of the functions that deep networks can represent. The authors theoretically show that deep networks are exponentially more efficient than shallow networks. This is a step forward in developing a theoretical understanding of deep networks. However, an important question that needs to be answered is whether we actually need an exponentially higher capacity for the tasks that we are currently interested in. Recently, [6] showed that shallow networks with a similar number of parameters as deep networks mimic the accuracies obtained by deep networks on several tasks (TIMIT and CIFAR-10). Although they used deep networks to train their shallow models, this result shows that shallow networks have the capacity to perform as well as deep networks at least on some tasks. We might just have to figure out ways to train them efficiently to ensure that their capacity is fully exploited.

References:

[1] Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. "On the number of linear regions of deep neural networks." In Advances in neural information processing systems, pp. 2924-2932. 2014.
[2] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep Sparse Rectifier Neural Networks." In Aistats, vol. 15, no. 106, p. 275. 2011.
[3] Goodfellow, Ian J., David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. "Maxout networks." ICML (3) 28 (2013): 1319-1327.
[4] Delalleau, Olivier, and Yoshua Bengio. "Shallow vs. deep sum-product networks." In Advances in Neural Information Processing Systems, pp. 666-674. 2011.
[5] Pascanu, Razvan, Guido Montufar, and Yoshua Bengio. "On the number of response regions of deep feed forward networks with piece-wise linear activations." arXiv preprint arXiv:1312.6098 (2013).
[6] Ba, Jimmy, and Rich Caruana. "Do deep nets really need to be deep?." In Advances in neural information processing systems, pp. 2654-2662. 2014.

Saturday, December 3, 2016

Face Detection with YOLO (You Only Look Once)

Recent face detection systems are achieving near-human performance. However, most of them are built on slow R-CNN [2] style detectors.

Recently, I had to detect the faces in several million images (about 14 million). The state-of-the-art face detectors operate at around 1-2 frames per second. Detecting faces in my 14 million images using these methods would have taken about 6 months using 1 TitanX GPU. So I decided to use the recently published YOLO [1] method for training a network for face detection. Although the accuracy of the YOLO method is lower than that of other detection methods like Fast R-CNN [3] and Faster R-CNN [4], it achieves a rate of about 45 frames per second, which is more than six times that achieved by Faster R-CNN.
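
The back-of-the-envelope arithmetic behind these numbers is sketched below. It counts only pure detector time on a single GPU at the frame rates quoted above; image loading, disk I/O, and restarts are deliberately ignored, so real wall-clock time will be longer. The helper name `days_to_process` is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_process(n_images: int, frames_per_second: float) -> float:
    """Days of pure detector time on one GPU, ignoring I/O and restarts."""
    return n_images / frames_per_second / SECONDS_PER_DAY


n_images = 14_000_000
for name, fps in [("slow detector, 1 fps", 1.0),
                  ("slow detector, 2 fps", 2.0),
                  ("YOLO, 45 fps", 45.0)]:
    print(f"{name}: {days_to_process(n_images, fps):.1f} days")
```

At 1 frame per second this comes to roughly 162 days, which is in the same ballpark as the 6 months mentioned above, while at 45 frames per second the detector time alone is under 4 days.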

I was able to complete the detection task in about a week, though I had to compromise a bit on the accuracy.

In this post, I first give a brief overview of the YOLO method and then explain the training procedure for faces.

Overview of YOLO
The YOLO method reframes the detection problem as a single regression problem to bounding boxes and class probabilities. It requires just a single neural network evaluation for predicting multiple bounding boxes and class probabilities. The image is first resized to the input size of the network and divided into an $S \times S$ grid. If the center of an object falls into a grid cell, then that grid cell is responsible for detecting that object. Each of the $S \times S$ grid cells predicts $B$ bounding boxes $(x,y,w,h)$ along with the objectness scores for those boxes. Each grid cell also predicts class conditional probabilities (i.e. the probability of each class given that the cell contains an object) for the $C$ classes. So the final output of the network is a tensor of size $S \times S \times ((4 + 1) \times B + C)$.
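
As a quick sanity check of the output size and the grid assignment, here is a small sketch. The function names are hypothetical, and I assume box centres are given as fractions of the image width and height in $[0, 1)$.

```python
from typing import Tuple


def yolo_output_shape(S: int, B: int, C: int) -> Tuple[int, int, int]:
    """Shape of the prediction tensor: S x S cells, each with B boxes and C classes."""
    return (S, S, (4 + 1) * B + C)


def responsible_cell(x_c: float, y_c: float, S: int) -> Tuple[int, int]:
    """(row, column) of the grid cell containing a box centre.

    Assumes x_c and y_c are normalized by the image width and height.
    """
    col = min(int(x_c * S), S - 1)  # clamp so a centre at exactly 1.0 stays inside
    row = min(int(y_c * S), S - 1)
    return (row, col)


voc_shape = yolo_output_shape(S=7, B=2, C=20)   # setting used in [1] for PASCAL VOC
face_shape = yolo_output_shape(S=7, B=2, C=1)   # single "face" class, same S and B assumed
cell = responsible_cell(0.5, 0.2, S=7)
```

With $S = 7$ and $B = 2$, the 20-class PASCAL VOC setting from [1] gives a $7 \times 7 \times 30$ output, and a single face class would give $7 \times 7 \times 11$.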

The objectness score associated with each bounding box is the product of the confidence of the model that the box contains an object and the intersection over union (IOU) between the predicted box and the ground truth box. At test time the class conditional probabilities and the individual box confidence predictions are multiplied to get the class-specific confidence scores for each class. This product encodes both the probability of that class appearing in the box and how well the box fits the object.
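
The two quantities in this paragraph are easy to write down. Below is a minimal sketch of IOU for boxes in centre format $(x_c, y_c, w, h)$ and of the test-time product. The names `iou` and `class_confidence` are mine, not taken from the YOLO code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_c, y_c, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes in centre format."""
    # Convert both boxes to corner coordinates.
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    # Overlap is zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def class_confidence(p_class_given_object: float, box_confidence: float) -> float:
    """Class-specific score: conditional class probability times box confidence."""
    return p_class_given_object * box_confidence
```

For example, two $2 \times 2$ boxes whose centres are offset by one unit in each direction overlap in a $1 \times 1$ square, so their IOU is $1/7$.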

The loss function jointly penalizes localization errors and errors in the confidence predictions. The method does have a few limitations: the model struggles with small objects, and it also struggles to generalize to objects in unusual aspect ratios or configurations. However, the speed somewhat compensates for these limitations.

Adapting YOLO to face detection
I trained the YOLO detector on the WIDER FACE [5] dataset by making minimal changes to the code. I had to generate the labels in the format required by the YOLO code. Each image requires a separate label file, with each line in the file representing a ground truth bounding box along with the class (which is just 1 in our case). The bounding boxes are in the format $(x_{c}, y_{c}, w, h)$, where $(x_{c},y_{c})$ is the center of the bounding box and $w$ and $h$ are its width and height respectively. In the network definition file, I had to change the number of classes and the dimensions of the output. Finally, in the main yolo.c file in src/, I had to change the source, the destination, and the number of classes accordingly.
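
A minimal sketch of the label conversion is given below. It rests on several assumptions that you should check against your own copy of the data and code: that the WIDER FACE annotation gives the top-left corner plus width and height in pixels, that the label file expects the class index first followed by the box, and that the box values are normalized by the image size. The function name `to_yolo_label` is hypothetical, and `class_id` should be whatever index your version of the code expects.

```python
def to_yolo_label(x1: float, y1: float, w: float, h: float,
                  img_w: int, img_h: int, class_id: int) -> str:
    """Turn a top-left pixel box into one line of a label file.

    Assumed output layout: "class x_c y_c w h", all box values divided by
    the image width or height.
    """
    x_c = (x1 + w / 2.0) / img_w
    y_c = (y1 + h / 2.0) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w / img_w:.6f} {h / img_h:.6f}"


# Hypothetical face box of 40 x 60 pixels at (100, 50) in a 400 x 200 image.
line = to_yolo_label(100, 50, 40, 60, img_w=400, img_h=200, class_id=0)
```

Writing one such line per face into a text file per image gives the per-image label files described above.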

The bounding boxes provided with the WIDER FACE dataset are very small, but I needed larger boxes to incorporate context in the detector. So, after convergence on the WIDER FACE dataset, I fine-tuned the YOLO detector on the FDDB [6] dataset. For this, I had to convert the ellipse annotations provided by the authors of [6] into rectangular ones.
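
One common way to do such a conversion is to take the tightest axis-aligned rectangle around each ellipse, sketched below. This is an illustration rather than the exact script I used. It assumes the ellipse is described by its major and minor axis radii, the angle of the major axis with the horizontal axis in radians, and the centre coordinates; the function name `ellipse_to_box` is made up.

```python
import math
from typing import Tuple


def ellipse_to_box(major_r: float, minor_r: float, angle: float,
                   c_x: float, c_y: float) -> Tuple[float, float, float, float]:
    """Tightest axis-aligned box (x_c, y_c, w, h) around a rotated ellipse.

    angle is assumed to be the rotation of the major axis from the
    horizontal axis, in radians.
    """
    # Half-extents of a rotated ellipse along the x and y axes.
    half_w = math.hypot(major_r * math.cos(angle), minor_r * math.sin(angle))
    half_h = math.hypot(major_r * math.sin(angle), minor_r * math.cos(angle))
    return (c_x, c_y, 2.0 * half_w, 2.0 * half_h)


# Hypothetical upright face: major axis vertical (angle = pi / 2).
box = ellipse_to_box(major_r=20.0, minor_r=10.0, angle=math.pi / 2,
                     c_x=50.0, c_y=60.0)
```

The resulting box is already in the $(x_c, y_c, w, h)$ centre format; if more context around the face is wanted, the width and height can be scaled up by a chosen factor afterwards.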

The final detector achieves good recall. But the most important advantage over other recent detectors is the speed (I did this before SSD [7]). I was able to process the 14 million images within a week.

References:
[1] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).
[2] Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
[3] Girshick, Ross. "Fast r-cnn." In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448. 2015.
[4] Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in neural information processing systems, pp. 91-99. 2015.
[5] Yang, Shuo, Ping Luo, Chen Change Loy, and Xiaoou Tang. "WIDER FACE: A Face Detection Benchmark." arXiv preprint arXiv:1511.06523 (2015).
[6] Jain, Vidit, and Erik G. Learned-Miller. "Fddb: A benchmark for face detection in unconstrained settings." UMass Amherst Technical Report (2010).
[7] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015).

Sunday, July 3, 2016

Biking

This is a post about my biking experience. I mostly bike alone (at least for now) on trails and bike friendly roads. I will update this post whenever I have anything new to share instead of creating a new post.

3 July 2016
I recently bought a bike (Schwinn Volare 1200) and was very excited about all the possibilities and adventures that are now possible. So I decided to put the bike (and myself) through its paces today. There is a trail (Paint Branch Trail) which passes right behind my office and has always seemed very attractive. I decided to bike this trail today. The weather was very pleasant, with complete cloud cover but very little chance of rain. It was pretty cool and not humid. The weather couldn't have been more perfect for biking. I started from home at around 10 a.m. and reached the lake at around 10:20. I did a loop around the lake and then started on the trail. I completed the trail, returned on the same trail, and went to the office. I covered a total of about 10 miles in 1 hour.

The bike held up pretty well and I was happy with my performance too. I think this was a very good start for a beginner and I hope this good start bodes well for a lot of biking over the next few years.

10 July 2016
I did another 10 mile ride this Sunday and discovered a hidden gem in College Park. The 10 mile ride was mostly on the Northeast Branch Trail of the Anacostia River Trail system. It is a nicely maintained trail but crosses a few roads with traffic along the way. A small inconvenience was that a major portion of the trail isn't surrounded by trees, so it gets a bit hot biking in the sun. Apart from that, the trail is mostly quiet, with few joggers and bikers. I have now come to realise that 10 miles is a bit short for me. I will do 15 miles next week.

Also, I decided to try out the Shortcake Bakery in Hyattsville. I had a pineapple and coconut pie, a blueberry scone, and a pineapple cake. All three were amazing.

I should have discovered this place a lot earlier. They have found a new regular customer in me.

17 July 2016
Had an amazing ride today. I did the whole Northwest Branch Trail along with the Northeast Branch Trail for a total length of over 16 miles. It took me about 90 minutes to do this. But I was extremely tired after the ride and realised that I need to get fitter. The trail itself is quite beautiful. It crosses into Montgomery County from PG County. There were some walkers and joggers along the whole trail. It was a good experience and I hope to continue having a lot more of such experiences.

6 August 2016
Tried mountain biking for the first time and I loved it. Just watching mountain biking videos on YouTube, I would have never realised that it could be this taxing on the body. A group of 6 people from the university went to a nearby beginners' trail (Rosaryville State Park). We just did the easier outer loop, which is about 9 miles of mountain biking. We took several breaks on the way and completed the loop in about 2.5 hours. But all of us were completely out of gas at the end. None of us had the energy to even think about We ate some food and came back.

It was a very good experience and I will definitely go mountain biking again.

21 August 2016
29 miles. 2 hours 40 minutes (SLOW!). I did the full round trip of the nearby Sligo Creek trail and other parts of the Anacostia trail system (mainly parts of the Northwest Branch, Northeast Branch and the Paint Branch trails). The trail was fairly well maintained except for a few spots where maintenance work was being done and I had to get down and walk the bike. Also, the signage is very sparse and it is easy to stray from the trail at some forks and intersections with the road. But overall the experience was very good. The weather was awesome. But I am a little bit disappointed with the time it took me to do the trip. I was extremely tired around the 20 mile mark and it became quite difficult to carry on. So yes, I have to improve a lot.

13 November 2016
I did about 14 miles today on the nearby trail system, and it took me only about 1 hour. I am quite satisfied with the progress, though I don't think I am going to get much better than this.

August 2017
I sold my trusty Schwinn and bought a new bike: a Giant Contend SL 2 Disc.

Learning a new art

I have been teaching myself how to play the violin on and off for about 7 months now. One thing I learned from this is that it is very difficult to commit to learning something by yourself. There is always something or the other that feels more important than the thing you are trying to learn. This is particularly true if the skill that you are trying to learn doesn't directly relate to your area of study/work.

But when I do practice, it is extremely satisfying and it makes me happy. Until now I had never understood why people would devote their whole lives to playing an instrument. But now I realise that this satisfaction and happiness is what drives people to commit their whole lives to an art. The sense of achievement when I overcome a particularly difficult exercise or a part of a music piece is extraordinary. I feel the same sense of achievement when I try to learn a new language or when I answer a quiz question that stumped everyone in the room. I think that these moments of happiness and satisfaction are worth the effort it takes to make any kind of progress while learning anything new.

Tuesday, January 26, 2016

First semester

The first few months of a new phase in life. It started in August with the orientation for new students. The orientation program itself was interesting and useful. But the semester started in earnest with the start of classes in the first week of September and, simply put, it was the most fun I have had since my first semester in college 5 years ago, except for the friends.

The three courses were very interesting and a bit challenging too. Some of the professors were among the best I have taken courses with. They knew their stuff and knew how to make us understand and enjoy the course material. But the research experience was minimal. Everyone expected us to just focus on the courses. That was an issue I wish to rectify in the second semester. I believe that the mind-set about first year PhD students just focussing on courses in the first year is flawed. Graduate students have to delve into research as soon as possible and the focus on courses should be lower.

The trips to Shenandoah and Alexandria and the several trips to D.C. were very enjoyable experiences. Participating in various graduate student get-togethers provided some much needed stress relief. I found new hobbies and am really enjoying them. I realised the need to stay fit and also how difficult it is to find someone interested in playing a sport together. But I am working on it.

The only major problem I felt was the lack of good friends. Maybe it's the age. Maybe it's the culture. It was easy to make good friends 5 years ago but here and now it seems that it is very difficult. Now it seems improbable that we will make friends whom we will call friends, not just acquaintances, 10 years down the line. Well you can't have everything, can you?

As I write this on the day before the start of the second semester, here's hoping it is as good as the first one, and perhaps even better.