Ankan's Computer Vision Blog: December 2016

Friday, December 9, 2016

A short summary of the paper - Generative adversarial nets

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.

This paper [1] proposes a method for estimating generative models using an adversarial process. The authors train two networks simultaneously: a generative model, $G$, and a discriminative model, $D$. The generative model learns the data distribution and the discriminative model estimates whether the probability that a sample came from the data rather than $G$. The two networks are trained using a two-player minimax game. The discriminative network, $D$, is trained to maximise the probability of correctly classifying samples from both training data and data generated by $G$. Simultaneously, $G$ is trained to maximise the probability of $D$ making a mistake, i.e., in a way that the data generated by G is indistinguishable from the training data. The authors train these two networks iteratively, alternating between $k$ steps of optimising $D$ and one step of optimising $G$.

The authors also present theoretical guarantees on the convergence of the algorithm to the optimal value (in the sense of global minimum of their objective function). However, the guarantees are not applicable to the case presented in the paper because they make some assumptions which are infeasible to implement using neural networks. However, the paper argues that since deep neural networks perform very well in several domains, they are reasonable models to use here.

In this paper, the authors use $k = 1$ to minimize the training cost. I think that it would be interesting to try higher values of $k$. This would bring the framework closer to one of the conditions of their guarantees; the condition that $D$ is allowed to reach its optimum given $G$. The authors don't mention whether they tried this. This could lead to better convergence even though it will be more expensive to train.

A disadvantage of this approach is that $D$ should be well synchronised with $G$, i.e., $G$ cannot be trained too much without updating $D$. Though, adversarial nets offer several advantages over traditional generative models. This paper is an important step towards unsupervised learning and has already inspired some work in that direction [2].

References:

[1] Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In Advances in Neural Information Processing Systems, pp. 2672-2680. 2014.
[2] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

Wednesday, December 7, 2016

A short summary of the paper - On the Number of Linear Regions of Deep Neural Networks

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.

Abstract

This paper presents theoretical results about the advantages of deep networks over shallow networks. The authors calculate bounds on the number of linear regions that deep networks with piecewise linear activation functions (e.g. ReLU and Maxout) can use to approximate functions. The number of linear regions used for representing a function can be thought of as a measure of the complexity of the representation model. The paper shows that deep networks produce exponentially more linear regions than shallow networks.

Recently, [4] showed that shallow networks require exponentially more sum-product hidden units that deep networks to represent certain functions. For several years people used smoothed activation functions as non-linearities in the neural networks. But [2] showed that Rectified Linear Units (ReLU) are much faster to train. In 2013, [3] introduced a new form of piecewise linear activation called Maxout. This paper [1] aims to present a theoretical analysis of the advantages of deep networks with such activations over shallow networks. Such an analysis was also done in [5] which showed that, while approximating a function, deep networks are able to produce exponentially more linear regions than shallow networks with the same number of hidden units. This paper presents a tighter lower bound on the maximal number of linear regions of functions computed by neural networks than [5]. This lends deep networks an advantage over shallow networks in terms of representation power. A higher number of linear pieces means an ability to represent more complex functions.

This paper also describes how intermediate layers of deep neural networks "map several pieces of their inputs into the same output". The authors present the hypothesis that deep networks re-use and compose features from lower layers to higher layers exponentially often with the increase in the number of layers. This gives them the ability to compute highly complex functions.

Theoretical bounds on the number of linear regions produced by shallow networks were presented in [5]. They propose that the maximal number of linear regions of functions computed by shallow rectifier networks with $n_0$ inputs and $n_1$ hidden units is $\sum_{j=0}^{n_0}\binom{n_1}{j}$. They also obtain a bound of $\Omega ((\frac{n}{n_0})^{(L-1)}n^{n_0})$ on the number of linear regions of a function that can be computed by a rectifier neural network with $n_{0}$ inputs and $L$ hidden units of width $n \geq n_0$.

The main result presented in this paper is the following (Corollary 6 in [1]):
A rectifier neural network with $n_0$ input units and $L$ hidden layers of width $n \geq n_0$ can compute functions that have $\Omega ((\frac{n}{n_0})^{(L-1)n_0}n^{n_0})$ linear regions.

The main contribution of this paper is a tighter bound on the complexity of the functions that deep networks can represent. The authors theoretically show that deep networks are exponentially more efficient than shallow networks. This is a step forward in developing a theoretical understanding of deep networks. However, an important question that needs to be answered is whether we actually need an exponentially higher capacity for the tasks that we are currently interested in. Recently, [6] showed that shallow networks with a similar number of parameters as deep networks mimic the accuracies obtained by deep networks on several tasks (TIMIT and CIFAR-10). Although they used deep networks to train their shallow models, this result shows that shallow networks have the capacity to perform as well as deep networks at least on some tasks. We might just have to figure out ways to train them efficiently to ensure that their capacity is fully exploited.

References:

[1] Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. "On the number of linear regions of deep neural networks." In Advances in neural information processing systems, pp. 2924-2932. 2014.
[2] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep Sparse Rectifier Neural Networks." In Aistats, vol. 15, no. 106, p. 275. 2011.
[3] Goodfellow, Ian J., David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. "Maxout networks." ICML (3) 28 (2013): 1319-1327.
[4] Delalleau, Olivier, and Yoshua Bengio. "Shallow vs. deep sum-product networks." In Advances in Neural Information Processing Systems, pp. 666-674. 2011.
[5] Pascanu, Razvan, Guido Montufar, and Yoshua Bengio. "On the number of response regions of deep feed forward networks with piece-wise linear activations." arXiv preprint arXiv:1312.6098 (2013).
[6] Ba, Jimmy, and Rich Caruana. "Do deep nets really need to be deep?." In Advances in neural information processing systems, pp. 2654-2662. 2014.

Saturday, December 3, 2016

Face Detection with YOLO (You Only Look Once)

Recent face detection systems are achieving near-human performance. However, most of these methods are based on slow RCNN [2] based methods.

Recently, I had to detect the faces in several millions of images (about 14 million). The state-of-the-art face detectors operate at around 1-2 frames per second. Detecting faces in my 14 million images using these methods would have taken about 6 months using 1 TitanX GPU. So I decided to use the recently published YOLO [1] method for training a network for face detection. Although the performance of the YOLO method is lower than other detection methods like Fast R-CNN [3] and Faster R-CNN [4], it achieves a rate of about 45 frames per second which is more than six times than that achieved by Faster R-CNN.

I was able to complete the detection task in about a week, though I had to compromise a bit on the accuracy.

In this post, I first give a brief overview of the YOLO method. Then I will explain the training procedure for faces.

Overview of YOLO
The YOLO method reframes the detection problem as a single regression problem to bounding boxes and class probabilities. It requires just a single neural network evaluation for predicting multiple bounding boxes class probabilities. The image is first resized to the input size of the network and divided into an $ S \times S$ grid. If the center of an object falls into a grid cell, then that grid cell is responsible for detecting that object. Each of the $S \times S$ grid cells predicts $B$ bounding boxes $(x,y,w,h)$ along with the objectness scores for those boxes. Each grid cell also predicts class conditional probabilities (i.e. the probability of each class given that ) for the $C$ classes. So the final output of the network is $S \times S \times ((4 + 1) \times B + C)$.

The objectness score associated with each bounding box is the product of the confidence of the model that the box contains an object and the intersection over union (IOU) between the predicted box and the ground truth box. At test time the class conditional probabilities and the individual box confidence predictions are multiplied to get the class-specific confidence scores for each class. This product encodes both the probability of that class appearing in the box and how well the box fits the object.

The loss function is designed to optimize the loss from location accuracy and the loss from confidence predictions. However, the method suffers from a few limitations. The model struggles with small objects. It also struggles to generalize to objects in unusual aspect ratios or configurations. However, the speed somewhat compensates for these limitations.

Adapting YOLO to face detection
I trained the YOLO detector on the WIDER FACE [5] dataset by making minimal changes to the code. I had to generate the labels in the same format as required by the YOLO code. Each image requires a separate label file with each line in the file representing a ground truth bounding box along with the class (which is just 1 in our case). The bounding boxes are in the format: $(x_{c}, y_{c}, w, h)$, where $(x_{c},y_{c})$ is the center of the bounding box and $w$ and $h$ are the width and height of the bounding box respectively. Also, in the network definition file, I had to change the number of classes and the dimensions of the output. Also, in the main yolo.c file in src/, I had to change the source, the destination, and the number of classes accordingly.

The bounding boxes provided with WIDER dataset are very small. But I needed larger boxes to incorporate context in the detector. So, after convergence on the WIDER FACE dataset, I fine-tuned the YOLO detector on FDDB [6] dataset. Though, I had to convert the ellipse annotations provided by the authors of [6] into rectangular ones.

The final detector achieves good recall. But the most important advantage over other recent detectors is the speed (I did this before SSD [7]). I was able to process the 14 million images within a week.

References:
[1] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).
[2] Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
[3] Girshick, Ross. "Fast r-cnn." In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448. 2015.
[4] Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in neural information processing systems, pp. 91-99. 2015.
[5] Yang, Shuo, Ping Luo, Chen Change Loy, and Xiaoou Tang. "WIDER FACE: A Face Detection Benchmark." arXiv preprint arXiv:1511.06523 (2015).
[6] Jain, Vidit, and Erik G. Learned-Miller. "Fddb: A benchmark for face detection in unconstrained settings." UMass Amherst Technical Report (2010).
[7] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015).