Showing posts with label deep learning. Show all posts

Tuesday, April 30, 2019

Detecting Human-Object Interactions via Functional Generalization


The problem of object detection has received significant attention, and rapid progress has been made in the area. Many recent object detection systems achieve excellent performance [1,2,3,4].

Object detection

However, a deeper understanding of the scene also involves finding the interactions between objects. In human-centric images, the most important interactions are the ones between humans and objects. For example, in the image shown below, in addition to detecting the humans and objects, knowing their interactions provides a better understanding of the scene.

Human-object interaction detection

In this post, I will briefly discuss our recent work on detecting human-object interactions (HOIs). An interaction between a human and an object is usually represented as the triplet $\texttt{<human, predicate, object>}$. Detecting an HOI involves localizing the human and the object and correctly predicting the predicate, i.e., the type of interaction between them. In this work, we let a well-performing object detector do the heavy lifting for detecting the entities involved. Such a detector gives bounding boxes, RoI-pooled features, and class labels for each object/human in the image. We instead focus on correctly predicting the predicate.

The lack of annotated training data is a major issue for HOI detection. The popular HICO-Det dataset [5] contains interactions involving 80 objects and 117 types of interactions. This means that there are over 9,300 possible HOI classes. (Note that this is the maximum. In practice, this number will be lower because not every predicate can be applied to every object.) Collecting annotated data for such a large number of HOI categories is time-consuming and might be prohibitively expensive. The HICO-Det dataset provides annotations for only 600 HOI triplet categories. Our work tries to deal with this issue.

Our approach is based on the idea of functional similarity between objects. Humans appear similar while interacting with functionally similar objects. For example, in the second row in the image below, all three persons might be drinking from a can, a glass, or a cup. Similarly, in the last row, the three people might be eating either a donut, a muffin, or a slice of cake. Any of these objects could be involved in the interaction. We call such groups of objects functionally similar.

Functional similarity between objects

The core of the idea is that annotated data for an object can be generalized to functionally similar objects. To achieve this, we introduce the functional generalization module. It is a simple multi-layer perceptron (MLP) containing 2 fully-connected layers. It takes as input the human and object bounding boxes and the corresponding classes from the object detector. It also uses the human RoI-pooled visual features as a representation of the appearance of the human involved in an HOI. The final output is the probability of each predicate.
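
To make the inputs and outputs of such a module concrete, here is a minimal NumPy sketch. It is not our actual implementation: the feature size, hidden width, one-hot class encoding, ReLU non-linearity, sigmoid output, and random weights are all assumptions chosen for illustration. Only the 80 object classes and 117 predicates come from HICO-Det.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_OBJECT_CLASSES = 80   # HICO-Det object categories
NUM_PREDICATES = 117      # HICO-Det interaction types
ROI_DIM = 1024            # assumed size of the human RoI-pooled feature
HIDDEN_DIM = 512          # assumed hidden width


def one_hot(index, size):
    """Encode a class index as a one-hot vector (an assumed class encoding)."""
    v = np.zeros(size)
    v[index] = 1.0
    return v


def build_input(human_box, object_box, object_class, human_roi_feature):
    """Concatenate box coordinates, object class, and human appearance."""
    return np.concatenate([
        np.asarray(human_box, dtype=float),
        np.asarray(object_box, dtype=float),
        one_hot(object_class, NUM_OBJECT_CLASSES),
        np.asarray(human_roi_feature, dtype=float),
    ])


# Randomly initialised weights stand in for trained parameters.
INPUT_DIM = 4 + 4 + NUM_OBJECT_CLASSES + ROI_DIM
W1 = rng.normal(scale=0.01, size=(INPUT_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.01, size=(HIDDEN_DIM, NUM_PREDICATES))
b2 = np.zeros(NUM_PREDICATES)


def predict_predicates(x):
    """Two fully-connected layers; a sigmoid gives one probability per predicate."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU (assumed non-linearity)
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))
```

Calling `predict_predicates(build_input(...))` returns a vector of 117 values in (0, 1), one per predicate.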

Network architecture

Given an annotated HOI, we can replace the object with functionally similar objects and generate more data. For example, consider the triplet $\texttt{<human, drink\_with, glass>}$. Here, the object ($\texttt{glass}$) can be replaced by $\texttt{bottle}$, $\texttt{mug}$, $\texttt{cup}$, and so on. This lets us generate training instances of different categories from a single training sample.
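
The replacement step can be sketched as follows. The groups below are hand-written for illustration, using the objects mentioned in this post. How functionally similar objects are actually identified is described in the paper, and the names `FUNCTIONAL_GROUPS`, `similar_objects`, and `augment` are hypothetical.

```python
# Hypothetical groups of functionally similar objects (illustrative only).
FUNCTIONAL_GROUPS = [
    {"glass", "bottle", "mug", "cup"},
    {"donut", "muffin", "cake"},
]


def similar_objects(obj):
    """Return the other objects that share a functional group with `obj`."""
    for group in FUNCTIONAL_GROUPS:
        if obj in group:
            return sorted(group - {obj})
    return []


def augment(triplet):
    """Generate extra triplets by swapping in functionally similar objects."""
    human, predicate, obj = triplet
    return [(human, predicate, other) for other in similar_objects(obj)]
```

For example, `augment(("human", "drink_with", "glass"))` yields the same predicate paired with `bottle`, `cup`, and `mug`.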


This generalization approach is particularly useful for detecting rare and non-annotated classes. The following images show some detections generated by our model in the zero-shot HOI detection setting. We did not use any annotated data for the HOI triplets shown in the images.

Some detections for zero-shot categories, i.e., categories for which no annotated data was used during training. Note that some interactions involving these objects were available during training. But the particular interaction triplets shown in the image were not.

Our model can even detect interactions involving objects for which no annotated triplets were available during training. This is because our generalization module can generalize from functionally similar object classes to these.

Detections for zero-shot categories in the unseen object setting.

For more details and an in-depth analysis see our paper: Detecting Human-Object Interactions via Functional Generalization.


References

[1] Girshick, Ross. "Fast r-cnn." In Proceedings of the IEEE international conference on computer vision, pp. 1440-1448. 2015.
[2] He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. "Mask r-cnn." In Proceedings of the IEEE international conference on computer vision, pp. 2961-2969. 2017.
[3] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "Ssd: Single shot multibox detector." In European conference on computer vision, pp. 21-37. Springer, Cham, 2016.
[4] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You only look once: Unified, real-time object detection." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788. 2016.
[5] Chao, Yu-Wei, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. "Learning to detect human-object interactions." In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 381-389. IEEE, 2018.





Monday, April 16, 2018

Zero-Shot Object Detection

Authors: Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, Ajay Divakaran

In this post I give a brief description of our recent work Zero-Shot Object Detection. This work was mostly done when I was an intern at SRI International in summer 2017.

Introduction

This paper introduces the problem of zero-shot object detection. First, let's parse what that means. The name combines two terms:

Zero-Shot learning is the framework where training data for some classes is not available but the model is still expected to recognise these classes. The classes for which training data is available are called seen classes, and the classes for which no training data is available are called unseen classes.

Object Detection is the problem of recognising and localising objects in images. The output of an object detection model is usually a bounding box which covers an object, and the class of the object. Note that this is different from image recognition which involves recognising the class of the image only.

Zero-shot object detection (ZSD) is an important problem because the visual world is composed of thousands of object categories. Anything you see can be given a visual label. However, training a fully-supervised model for detecting so many categories requires infeasible amounts of resources (training data, annotation costs etc.). So, there is a need to develop methods which do not require these huge resources or require resources which are readily available.

Zero-shot learning is the general category of problems where data for some classes is not available during training. There has been some work on zero-shot image recognition, where localising the objects is not important. To be able to recognise previously unseen categories, these methods used external semantic information either in the form of word-vectors, or attributes. This semantic information serves as the link between seen classes and unseen classes. The approach in this paper also uses semantic information to bridge seen and unseen classes.

Using semantic information
Fig. 1: We highlight the task of zero-shot object detection where object classes “arm”, “hand”, and “shirt” are observed (seen) during training, while classes “skirt”, and “shoulder” are not seen. These unseen classes are localized by our approach that leverages semantic relationships, obtained via word embeddings, between seen and unseen classes along with the proposed zero shot detection framework. The example has been generated by our model on images from VisualGenome dataset.
Fully-supervised object detection approaches such as R-CNN, Faster R-CNN, YOLO (You Only Look Once), and SSD (Single-Shot Detector) have a fixed "background" class which encompasses everything other than the object classes in the training set. However, selecting the background class in the zero-shot setting is not trivial. This is because, unlike the fully-supervised case, there is no such thing as a true background in the zero-shot case. Bounding boxes which do not include any objects from the train categories might contain either background stuff (grass/sky/sea etc.) or objects from the test set. There is no way to say which of these boxes belong to the background and which belong to test classes. Moreover, in the true zero-shot setting (called open vocabulary), you want to be able to detect everything from objects to stuff (grass/sky etc.). So, selecting a background class is difficult. This paper discusses these issues in greater depth and presents some solutions.

This paper makes the following contributions:
1. It introduces the problem of ZSD and presents a baseline method that follows existing work on zero-shot image classification and fully supervised object detection.
2. The authors discuss some challenges associated with incorporating information from background regions and propose two methods for training background-aware detectors.
3. It examines the problem with sparse sampling of classes during training and proposes a solution which densely samples training classes using additional data.

In addition, the paper provides extensive experiments and discussions.


Approach

The baseline approach for ZSD adapts prior work on zero-shot classification for detection. 

Baseline ZSD

Let $\mathcal{C} = \mathcal{S} \cup \mathcal{U} \cup \mathcal{O}$ be the set of all classes, where $\mathcal{S}$ is the set of seen (train) classes, $\mathcal{U}$ is the set of unseen (test) classes, and $\mathcal{O}$ is the set of all classes that are neither seen nor unseen. Given a bounding box $b_i \in \mathbb{N}^4$, the cropped object is passed through a CNN to obtain a feature vector $\phi (b_i)$. To use semantic information from the word-vectors, this feature vector is projected into the semantic embedding space as $\psi_i = W_p \phi (b_i)$, where $W_p$ is the projection matrix. The projection is trained so that $\psi_i$ lies close to the word-vector of the class of the bounding box. Concretely, suppose the class of the given bounding box is $y_i$ and the word-vector of $y_i$ is $w_i$. Training aims to make $\psi_i$ and $w_i$ very similar. If the similarity between a projected feature vector $\psi_i$ and a word-vector $w_j$ is given by the cosine similarity $S_{ij}$, then the model (the CNN and the projection matrix) is trained to minimise the following max-margin loss:

$$\mathcal{L}(b_i, y_i, \theta) = \sum_{j \in \mathcal{S}, j \neq i} \max(0, m - S_{ii} + S_{ij})$$

where $\theta$ denotes the parameters of the deep CNN and the projection matrix, and $m$ is the margin. At inference, the predicted class of a bounding box $b_i$ is given by:

$$\hat{y}_i = \underset{j \in \mathcal{U}}{\arg\max} ~S_{ij}$$
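
The loss and the inference rule above can be sketched in NumPy for a single box. This is an illustration rather than the training code: the CNN feature `phi` is taken as given, and the default margin of 1.0 and the function names are assumptions.

```python
import numpy as np


def cosine_similarity(psi, word_vectors):
    """Cosine similarity between one projected feature and each word-vector row."""
    psi = psi / np.linalg.norm(psi)
    w = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return w @ psi


def max_margin_loss(phi, W_p, seen_vectors, true_index, margin=1.0):
    """Hinge loss pushing the true class above every other seen class by `margin`."""
    psi = W_p @ phi                        # project into the semantic space
    sims = cosine_similarity(psi, seen_vectors)
    hinge = np.maximum(0.0, margin - sims[true_index] + sims)
    hinge[true_index] = 0.0                # the sum skips the true class
    return float(hinge.sum())


def predict_unseen(phi, W_p, unseen_vectors):
    """Index of the unseen class whose word-vector is most similar to the projection."""
    psi = W_p @ phi
    return int(np.argmax(cosine_similarity(psi, unseen_vectors)))
```

Note that training sums over the seen classes $\mathcal{S}$, while prediction searches only over the unseen classes $\mathcal{U}$, which is why the two functions take different word-vector matrices.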

Note that the baseline approach does not include any background boxes in training. To train a more robust model that can better eliminate background boxes, as fully-supervised object detection methods do, two background-aware approaches are presented next.

Statically Assigned Background (SB) Based ZSD

In this approach, following previous work on object detection, all the background training boxes are assigned to a single static background class, $y_b$, with a word-vector, $w_b$. While training, this background class is treated just as any other class in $\mathcal{S}$.

Note that there are clear problems with this approach. Some of the background boxes might belong to unseen classes. Also, background boxes might be extremely varied, and trying to assign such varied bounding boxes to a single class is extremely difficult.

To overcome some of these issues, another background-aware model is proposed.

Latent Assignment Based (LAB) ZSD

In this approach, the background boxes are spread over the embedding space by using an Expectation Maximization (EM)-like algorithm. Multiple latent classes are assigned to the background objects. At a high level, this encodes the knowledge that a background box does not belong to the set of seen classes ($\mathcal{S}$), and could possibly belong to a number of different classes from a large vocabulary set, referred to as the background set $\mathcal{O}$.

To accomplish this, a baseline model is trained first. This model is used to classify a subset of background boxes into the open vocabulary set $\mathcal{O}$. These background boxes are added to the training set. The model is trained for an epoch. This model is again used to classify another set of background boxes into $\mathcal{O}$. These are added to the training set and the model is trained again for 1 epoch. This process is repeated several times.
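
The alternating procedure can be sketched as the loop below. It is only a schematic: the model is reduced to the projection matrix `W_p`, training examples are stored as hypothetical (feature, word-vector) pairs, and `train_one_epoch` is a caller-supplied placeholder for the real training step. The number of rounds and the subset size are arbitrary.

```python
import numpy as np


def assign_latent_classes(features, W_p, open_vocab_vectors):
    """Label each background box with its most similar class in the open vocabulary."""
    psi = features @ W_p.T
    psi = psi / np.linalg.norm(psi, axis=1, keepdims=True)
    w = open_vocab_vectors / np.linalg.norm(open_vocab_vectors, axis=1, keepdims=True)
    return np.argmax(psi @ w.T, axis=1)


def latent_assignment_training(W_p, train_set, background_features,
                               open_vocab_vectors, train_one_epoch,
                               rounds=3, subset_size=2, seed=0):
    """Alternate between labelling background boxes and one epoch of training."""
    rng = np.random.default_rng(seed)
    train_set = list(train_set)
    for _ in range(rounds):
        # Pick a random subset of background boxes for this round.
        idx = rng.choice(len(background_features), size=subset_size, replace=False)
        labels = assign_latent_classes(background_features[idx], W_p, open_vocab_vectors)
        # Add them to the training set with the word-vectors of their latent classes.
        train_set.extend(zip(background_features[idx], open_vocab_vectors[labels]))
        # Placeholder for one epoch of training on the enlarged set.
        W_p = train_one_epoch(W_p, train_set)
    return W_p, train_set
```

Because the model changes after every epoch, a box sampled in a later round may receive a different latent class than it would have earlier, which is what gives the procedure its EM-like flavour.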

This approach is related to open-vocabulary learning, and to latent variables based classification models.

Densely Sampled Embedding Space (DSES)

Using small datasets for training such ZSD models poses the problem that the embedding space is sparsely sampled. This is problematic particularly for recognizing unseen classes which, by definition, lie in parts of the embedding space that do not have training examples. To alleviate this issue, the paper proposes to augment the training procedure with additional data from external sources that contain boxes belonging to classes other than unseen classes, $y_i \in \mathcal{C} - \mathcal{U}$. This means the space of object classes is densely sampled during training to improve the alignment of the embedding space.



This concludes the post. I have tried to give a brief introduction to the problem and some of the approaches we propose. For more details about the problem and the results, see the paper Zero-Shot Object Detection.

Sunday, December 24, 2017

Notes from ICCV 2017

Top Acceptance Rates
Video and Language - 53.8%
Autonomous Driving - 50%
Large-scale Optimization - 45%

Total - 29%


Favourite Papers on Video and Recognition

Video




A Read-Write Memory Network for Movie Story Understanding
 - Question and answering task for large-scale, multimodal movie story understanding

Temporal Tessellation: A Unified Approach for Video Analysis
 - General approach to video understanding inspired by semantic transfer techniques
 - A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video.

Unsupervised Action Discovery and Localization in Videos
 - Training data - unlabeled data without bounding box annotations
 - The proposed approach (a) discovers action class labels and (b) spatio-temporally localizes actions in videos

Dense-Captioning Events in Videos
 - Introduce the task of dense-captioning events, which involves both detecting and describing events in a video.
 - Identify all events in a single pass of the video
 - Introduce a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes
 - Introduce a new captioning module that uses contextual information from past and future events to jointly describe all events.
 - New dataset - ActivityNet Captions - 849 video hours with 100k total descriptions

Learning long-term dependencies for action recognition with a biologically-inspired deep network
 - Biological neural systems are typically composed of both feedforward and feedback connections
 - shuttleNet - consists of several processors, each of which is a GRU while associated with multiple groups of hidden states
 - All processors inside shuttleNet are loop connected to mimic the brain's feedforward and feedback connections, in which they are shared across multiple pathways in the loop connection.

Compressive Quantization for Fast Object Instance Search in Videos
 - Object instance search in videos, where efficient point-to-set (image-to-video) matching is essential
 - Jointly optimizing vector quantization method to compress M object proposals extracted from each video into only k binary codes, where k << M
 - Similarity between the query object and the whole video can be determined by the Hamming distance between the query's binary code and the video's best-matched binary code
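
A rough sketch of the matching step described in the last bullet, assuming binary codes are stored as 0/1 NumPy arrays (the function names are mine, and learning the codes themselves is not shown):

```python
import numpy as np


def hamming_distances(query_code, video_codes):
    """Hamming distance between one binary code and each row of `video_codes`."""
    return np.count_nonzero(video_codes != query_code, axis=1)


def video_distance(query_code, video_codes):
    """Distance from the query to a video: that of the best-matched code."""
    return int(hamming_distances(query_code, video_codes).min())
```

With only k codes per video instead of M proposals, each query-to-video comparison costs k Hamming distances.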

Complex Event Detection by Identifying Reliable Shots From Untrimmed Videos
 - Formulate as a MIL problem by taking each video as a bag and the video shots in each video as instances
 - New MIL method, which simultaneously learns a linear SVM classifier and infers a binary indicator for each instance in order to select reliable training instances from each positive or negative bag
 - In the objective function balance the weighted training errors and an l1-l2 mixed-norm regularization term which adaptively selects reliable shots as diverse as possible

Spatio-Temporal Person Retrieval via Natural Language Queries
 - Person retrieval from multiple videos
 - Output a tube which encloses the person described by the query
 - New dataset
 - Design a model that combines methods for spatio-temporal human detection and multimodal retrieval

Joint Discovery of Object States and Manipulation Actions
 - Automatically discover the states of objects and the associated manipulation actions
 - Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions

Pixel-Level Matching for Video Object Segmentation using Convolutional Neural Networks
 - The network aims to distinguish the target area from the background on the basis of the pixel-level similarity between two object units
 - The proposed network represents a target object using features from different depth layers in order to take advantage of both the spatial details and the category-level semantic information

Joint Detection and Recounting of Abnormal Events by Learning Deep Generic Knowledge
 - Recounting of abnormal events - explaining why they are judged to be abnormal
 - Integrate a generic CNN model and environment-dependent anomaly detectors
 - Learn a CNN with multiple visual tasks to exploit semantic information that is useful for detecting and recounting abnormal events
 - Appropriately plugging the model into anomaly detectors

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals
 - TURN jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression
 - Fast computation is enabled by unit feature reuse: a long untrimmed video is decomposed into video units, which are reused as basic building blocks of temporal proposals

Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
 - Zero-shot localization and classification of human actions in video
 - Spatial-aware object embedding
 - Build embedding on top of freely available actor and object detectors
 - Exploit the object positions and sizes in the spatial-aware embedding to demonstrate a new spatio-temporal action retrieval scenario with composite queries

Temporal Dynamic Graph LSTMs for Action-Driven Video Object Detection
 - Weakly supervised object detection from videos
 - Use action descriptions as supervision
 - But, objects of interest that are not involved in human actions are often absent in global action descriptions
 - Propose a novel temporal dynamic graph LSTM (TD-Graph LSTM), which enables global temporal reasoning by constructing a dynamic graph that is based on temporal correlations of object proposals and spans the entire video
 - The missing label issue for each individual frame can thus be significantly alleviated by transferring knowledge across correlated object proposals in the whole video


Recognition




Open Set Domain Adaptation
 - Domain adaptation in open sets - only a few categories of interest are shared between source and target data
 - The proposed method fits in both closed and open set scenarios
 - The approach learns a mapping from the source to the target domain by jointly solving an assignment problem that labels those target instances that potentially belong to the categories of interest present in the source dataset

FoveaNet: Perspective-aware Urban Scene Parsing
 - Estimate the perspective geometry of a scene image through a convolutional network which integrates supportive evidence from contextual objects within the image
 - FoveaNet "undoes" the camera perspective projection - analyzing regions in the space of the actual scene, and thus provides much more reliable parsing results
 - Introduce a new dense CRFs model that takes the perspective geometry as a prior potential

Generative Modeling of Audible Shapes for Object Perception
 - Present a novel, open-source pipeline that generates audio-visual data, purely from 3D shapes and their physical properties
 - Synthetic audio-visual dataset - Sound-20K for object perception tasks
 - Auditory and visual information play complementary roles in object perception, and the representation learned on synthetic audio-visual data can transfer to real-world scenarios

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation
 - Transfer human supervision between the previously separate tasks

 - Establishing semantic correspondences between images depicting different instances of the same object or scene category
 - CNN architecture for learning a geometrically plausible model for semantic correspondence
 - Uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function

 - The real-world noisy labels exhibit multi-modal characteristics as the true labels, rather than behaving like independent random outliers
 - Propose a unified distillation framework to use “side” information, including a small clean dataset and label relations in knowledge graph, to “hedge the risk” of learning from noisy labels.
 - Propose a suite of new benchmark datasets

 - Presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues
 - Model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions
 - Special attention given to relationships between people and clothing or body parts mentions, as they are useful for distinguishing individuals. 
 - Automatically learn weights for combining these cues and at test time, perform joint inference over all phrases in a caption

 - Leverage the strong correlations between the predicate and the <subj, obj> pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects.
 - Use knowledge of linguistic statistics to regularize visual model learning
 - Obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a <subj, obj> pair
 - Distill this knowledge into the deep model to achieve better generalization

 - Introduce an end-to-end multi-task objective that jointly learns object-action relationships
 - Proposed architecture can be used for zero-shot learning of actions

 - Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer
 - Both the program generator and the execution engine are implemented by neural networks, and are trained using a combination of backpropagation and REINFORCE

Friday, December 9, 2016

A short summary of the paper - Generative adversarial nets

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.

This paper [1] proposes a method for estimating generative models using an adversarial process. The authors train two networks simultaneously: a generative model, $G$, and a discriminative model, $D$. The generative model learns the data distribution, and the discriminative model estimates the probability that a sample came from the data rather than from $G$. The two networks are trained using a two-player minimax game. The discriminative network, $D$, is trained to maximise the probability of correctly classifying samples from both the training data and the data generated by $G$. Simultaneously, $G$ is trained to maximise the probability of $D$ making a mistake, i.e., so that the data generated by $G$ is indistinguishable from the training data. The authors train these two networks iteratively, alternating between $k$ steps of optimising $D$ and one step of optimising $G$.
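
For reference, the two-player game is written in [1] as the following value function. Here $p_{\text{data}}$ is the data distribution and $p_z$ is the prior over the noise input $z$ of $G$. These symbols come from [1] and are not used elsewhere in this summary.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]$$

The first term rewards $D$ for assigning high probability to real samples. The second rewards $D$ for assigning low probability to generated samples, and it is the term that $G$ tries to reduce.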

The authors also present theoretical guarantees on the convergence of the algorithm to the optimal value (in the sense of the global minimum of their objective function). However, the guarantees are not applicable to the case presented in the paper because they rely on assumptions which are infeasible to implement using neural networks. Nevertheless, the paper argues that since deep neural networks perform very well in several domains, they are reasonable models to use here.

In this paper, the authors use $k = 1$ to minimize the training cost. I think that it would be interesting to try higher values of $k$. This would bring the framework closer to one of the conditions of their guarantees; the condition that $D$ is allowed to reach its optimum given $G$. The authors don't mention whether they tried this. This could lead to better convergence even though it will be more expensive to train.

A disadvantage of this approach is that $D$ should be well synchronised with $G$, i.e., $G$ cannot be trained too much without updating $D$. Nevertheless, adversarial nets offer several advantages over traditional generative models. This paper is an important step towards unsupervised learning and has already inspired some work in that direction [2].

References:

[1] Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In Advances in Neural Information Processing Systems, pp. 2672-2680. 2014.
[2] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

Wednesday, December 7, 2016

A short summary of the paper - On the Number of Linear Regions of Deep Neural Networks

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.

Abstract

This paper presents theoretical results about the advantages of deep networks over shallow networks. The authors calculate bounds on the number of linear regions that deep networks with piecewise linear activation functions (e.g. ReLU and Maxout) can use to approximate functions. The number of linear regions used for representing a function can be thought of as a measure of the complexity of the representation model. The paper shows that deep networks produce exponentially more linear regions than shallow networks. 

Recently, [4] showed that shallow networks require exponentially more sum-product hidden units than deep networks to represent certain functions. For several years, people used smooth activation functions as non-linearities in neural networks. But [2] showed that Rectified Linear Units (ReLU) are much faster to train. In 2013, [3] introduced a new form of piecewise linear activation called Maxout. This paper [1] aims to present a theoretical analysis of the advantages of deep networks with such activations over shallow networks. Such an analysis was also done in [5], which showed that, while approximating a function, deep networks are able to produce exponentially more linear regions than shallow networks with the same number of hidden units. This paper presents a tighter lower bound than [5] on the maximal number of linear regions of functions computed by neural networks. This lends deep networks an advantage over shallow networks in terms of representation power. A higher number of linear pieces means an ability to represent more complex functions.

This paper also describes how intermediate layers of deep neural networks "map several pieces of their inputs into the same output". The authors present the hypothesis that deep networks re-use and compose features from lower layers to higher layers exponentially often with the increase in the number of layers. This gives them the ability to compute highly complex functions.

Theoretical bounds on the number of linear regions produced by shallow networks were presented in [5]. They show that the maximal number of linear regions of functions computed by shallow rectifier networks with $n_0$ inputs and $n_1$ hidden units is $\sum_{j=0}^{n_0}\binom{n_1}{j}$. They also obtain a bound of $\Omega ((\frac{n}{n_0})^{(L-1)}n^{n_0})$ on the number of linear regions of a function that can be computed by a rectifier neural network with $n_{0}$ inputs and $L$ hidden layers of width $n \geq n_0$.

The main result presented in this paper is the following (Corollary 6 in [1]):
A rectifier neural network with $n_0$ input units and $L$ hidden layers of width $n \geq n_0$ can compute functions that have $\Omega ((\frac{n}{n_0})^{(L-1)n_0}n^{n_0})$ linear regions.
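
To get a feel for how differently these expressions grow, the short script below evaluates them for given sizes. The deep-network functions compute only the term inside $\Omega(\cdot)$ and ignore constants, so they illustrate growth rates rather than exact region counts. The function names are mine.

```python
from math import comb


def shallow_max_regions(n0, n1):
    """Maximal number of linear regions of a shallow rectifier network."""
    return sum(comb(n1, j) for j in range(n0 + 1))


def deep_growth_earlier(n0, n, L):
    """Term inside the Omega bound from [5], constants ignored."""
    return (n / n0) ** (L - 1) * n ** n0


def deep_growth_this_paper(n0, n, L):
    """Term inside the Omega bound from [1], constants ignored."""
    return (n / n0) ** ((L - 1) * n0) * n ** n0
```

The only difference between the two deep-network expressions is the extra factor of $n_0$ in the exponent, so the gap between them widens with both the depth $L$ and the input dimension $n_0$.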

The main contribution of this paper is a tighter bound on the complexity of the functions that deep networks can represent. The authors theoretically show that deep networks are exponentially more efficient than shallow networks. This is a step forward in developing a theoretical understanding of deep networks. However, an important question that needs to be answered is whether we actually need an exponentially higher capacity for the tasks that we are currently interested in. Recently, [6] showed that shallow networks with a similar number of parameters as deep networks mimic the accuracies obtained by deep networks on several tasks (TIMIT and CIFAR-10). Although they used deep networks to train their shallow models, this result shows that shallow networks have the capacity to perform as well as deep networks at least on some tasks. We might just have to figure out ways to train them efficiently to ensure that their capacity is fully exploited.

References:

[1] Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. "On the number of linear regions of deep neural networks." In Advances in neural information processing systems, pp. 2924-2932. 2014.
[2] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep Sparse Rectifier Neural Networks." In Aistats, vol. 15, no. 106, p. 275. 2011.
[3] Goodfellow, Ian J., David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. "Maxout networks." ICML (3) 28 (2013): 1319-1327.
[4] Delalleau, Olivier, and Yoshua Bengio. "Shallow vs. deep sum-product networks." In Advances in Neural Information Processing Systems, pp. 666-674. 2011.
[5] Pascanu, Razvan, Guido Montufar, and Yoshua Bengio. "On the number of response regions of deep feed forward networks with piece-wise linear activations." arXiv preprint arXiv:1312.6098 (2013).
[6] Ba, Jimmy, and Rich Caruana. "Do deep nets really need to be deep?." In Advances in neural information processing systems, pp. 2654-2662. 2014.