Tuesday, April 30, 2019

Detecting Human-Object Interactions via Functional Generalization

The problem of object detection has received significant attention, and rapid progress has been made in the area. Many recent object detection systems achieve excellent performance [1,2,3,4].

Object detection

However, a deeper understanding of the scene also involves finding the interactions between objects. In human-centric images, the most important interactions are the ones between humans and objects. For example, in the image shown below, in addition to detecting the humans and objects, knowing their interactions provides a better understanding of the scene.

Human-object interaction detection

In this post, I will briefly discuss our recent work on detecting human-object interactions (HOIs). An interaction between a human and an object is usually represented as the triplet $\texttt{<human, predicate, object>}$. Detecting an HOI involves localizing the human and the object and correctly predicting the predicate, or the type of interaction, between them. In this work, we let a well-performing object detector do the heavy lifting of detecting the entities involved. Such a detector gives bounding boxes, RoI-pooled features, and class labels for each object/human in the image. We, instead, focus on correctly predicting the predicate.

The lack of annotated training data is a major issue for HOI detection. The popular HICO-Det dataset [5] covers 80 object categories and 117 types of interactions. This means that there are 9,360 possible HOI classes. (Note that this is the maximum; in practice, the number will be lower because not every predicate can be applied to every object.) Collecting annotated data for such a large number of HOI categories is time-consuming and might be prohibitively expensive. The HICO-Det dataset provides annotations for only 600 HOI triplet categories. Our work tries to deal with this issue.

Our approach is based on the idea of functional similarity between objects. Humans appear similar while interacting with functionally similar objects. For example, in the second row in the image below, all three persons might be drinking from a can, a glass, or a cup. Similarly, in the last row, the three people might be eating either a donut, a muffin, or a slice of cake. Any of these objects could be involved in the interaction. We call such groups of objects functionally similar.

Functional similarity between objects

The core of the idea is that annotated data for an object can be generalized to functionally similar objects. To achieve this, we introduce the functional generalization module. It is a simple multi-layer perceptron (MLP) with two fully-connected layers. It takes as input the human and object bounding boxes and the corresponding classes from the object detector. It also uses the human RoI-pooled visual features as a representation of the appearance of the human involved in an HOI. The final output is the probability of each predicate.
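To make the shape of the module concrete, here is a rough NumPy sketch (not the actual implementation: the input dimensions, the use of 300-d word vectors to represent the class labels, and the random weights are all assumptions for illustration; only the two-layer structure and the 117 predicate outputs come from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed dimensions: 300-d word vectors for the human and object classes,
# 8-d box geometry (two 4-d boxes), 2048-d human RoI appearance feature.
D_IN = 300 + 300 + 8 + 2048
D_HID = 512
N_PRED = 117  # number of predicate types in HICO-Det

# Two fully-connected layers, randomly initialised purely for illustration.
W1, b1 = rng.normal(scale=0.01, size=(D_HID, D_IN)), np.zeros(D_HID)
W2, b2 = rng.normal(scale=0.01, size=(N_PRED, D_HID)), np.zeros(N_PRED)

def generalization_module(human_wv, object_wv, boxes, human_feat):
    """Return a probability for each of the 117 predicates for one pair."""
    x = np.concatenate([human_wv, object_wv, boxes, human_feat])
    h = relu(W1 @ x + b1)
    return softmax(W2 @ h + b2)

probs = generalization_module(
    rng.normal(size=300), rng.normal(size=300),
    rng.normal(size=8), rng.normal(size=2048))
```

In a real system the weights would of course be learned, and the detector would supply the boxes and the appearance feature.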

Network architecture

Given an annotated HOI, we can replace the object with functionally similar objects and generate more data. For example, consider the example $\texttt{<human, drink_with, glass>}$. Here, the object ($\texttt{glass}$) can be replaced by $\texttt{bottle}$, $\texttt{mug}$, $\texttt{cup}$, and so on. This helps us generate training instances of different categories from a given training sample.
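The augmentation step itself is a few lines of code. In the sketch below the similarity groups are hard-coded, made-up examples; how the groups are actually constructed is described in the paper:

```python
# Hypothetical functional-similarity groups, for illustration only.
SIMILAR = {
    "glass": ["bottle", "mug", "cup"],
    "donut": ["muffin", "cake"],
}

def generalize(triplet):
    """Expand one annotated <human, predicate, object> triplet into extra
    training triplets by swapping in functionally similar objects."""
    human, predicate, obj = triplet
    return [(human, predicate, o) for o in SIMILAR.get(obj, [])]

extra = generalize(("human", "drink_with", "glass"))
# extra == [("human", "drink_with", "bottle"),
#           ("human", "drink_with", "mug"),
#           ("human", "drink_with", "cup")]
```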

This generalization approach is particularly useful for detecting rare and non-annotated classes. The following images show some detections generated by our model in the zero-shot HOI detection setting. We did not use any annotated data for the HOI triplets shown in the images.

Some detections for zero-shot categories, i.e., categories for which no annotated data was used during training. Note that some interactions involving these objects were available during training. But the particular interaction triplets shown in the image were not.

Our model can even detect interactions involving objects for which no annotated triplets were available during training. This is because our generalization module can generalize from functionally similar object classes to these.

Detections for zero-shot categories in the unseen object setting.

For more details and an in-depth analysis, see our paper: Detecting Human-Object Interactions via Functional Generalization.


[1]  Girshick, Ross. "Fast r-cnn." In Proceedings of the IEEE international conference on computer vision, pp. 1440-1448. 2015.
[2] He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. "Mask r-cnn." In Proceedings of the IEEE international conference on computer vision, pp. 2961-2969. 2017.
[3] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "Ssd: Single shot multibox detector." In European conference on computer vision, pp. 21-37. Springer, Cham, 2016.
[4] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You only look once: Unified, real-time object detection." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788. 2016.
[5] Chao, Yu-Wei, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. "Learning to detect human-object interactions." In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 381-389. IEEE, 2018.

Monday, April 16, 2018

Zero-Shot Object Detection

Authors: Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, Ajay Divakaran

In this post I give a brief description of our recent work Zero-Shot Object Detection. This work was mostly done when I was an intern at SRI International in summer 2017.


This paper introduces the problem of zero-shot object detection. First, let's parse what that means. The name combines two terms:

Zero-shot learning is the framework where training data for some classes is not available, but the model is still expected to recognise these classes. The classes for which training data is available are called seen classes, and the classes not available during training are called unseen classes.

Object Detection is the problem of recognising and localising objects in images. The output of an object detection model is usually a bounding box which covers an object, and the class of the object. Note that this is different from image recognition which involves recognising the class of the image only.

Zero-shot object detection (ZSD) is an important problem because the visual world is composed of thousands of object categories. Anything you see can be given a visual label. However, training a fully-supervised model for detecting so many categories requires infeasible amounts of resources (training data, annotation costs etc.). So, there is a need to develop methods which do not require these huge resources or require resources which are readily available.

Zero-shot learning is the general category of problems where data for some classes is not available during training. There has been some work on zero-shot image recognition, where localising the objects is not important. To be able to recognise previously unseen categories, these methods use external semantic information, either in the form of word-vectors or attributes. This semantic information serves as the link between seen and unseen classes. The approach in this paper also uses semantic information to bridge seen and unseen classes.

Using semantic information
Fig. 1: We highlight the task of zero-shot object detection, where the object classes “arm”, “hand”, and “shirt” are observed (seen) during training, while the classes “skirt” and “shoulder” are not. These unseen classes are localized by our approach, which leverages semantic relationships between seen and unseen classes, obtained via word embeddings, along with the proposed zero-shot detection framework. The example was generated by our model on images from the Visual Genome dataset.
Fully-supervised object detection approaches like R-CNN, Faster R-CNN, YOLO (You Only Look Once), and SSD (Single-Shot Detector) have a fixed "background" class which encompasses everything other than the object classes in the training set. However, selecting the background class in the zero-shot setting is not trivial, because, unlike the fully-supervised case, there is no such thing as a true background in the zero-shot case. Bounding boxes which do not include any objects from training categories might either contain background stuff (grass/sky/sea, etc.) or objects from the test set. There is no way to tell which of these boxes belong to the background and which belong to test classes. Moreover, in the true zero-shot setting (called open vocabulary), you want to be able to detect everything from objects to stuff (grass/sky, etc.). So, selecting the background is difficult. This paper discusses these issues in greater depth and presents some solutions.

This paper makes the following contributions:
1.  It introduces the problem of ZSD and presents a baseline method that follows existing work on zero-shot image classification and fully-supervised object detection.
2. The authors discuss some challenges associated with incorporating information from background regions and propose two methods for training background-aware detectors.
3. It examines the problem of sparse sampling of classes during training and proposes a solution which densely samples training classes using additional data.

In addition, the paper provides extensive experiments and discussions.


The baseline approach for ZSD adapts prior work on zero-shot classification for detection. 

Baseline ZSD

Let $\mathcal{C} = \mathcal{S} \cup \mathcal{U} \cup \mathcal{O}$ be the set of all classes, where $\mathcal{S}$ is the set of seen (train) classes, $\mathcal{U}$ is the set of unseen (test) classes, and $\mathcal{O}$ is the set of all classes that are neither seen nor unseen. Given a bounding box $b_i \in \mathbb{N}^4$, the cropped object is passed through a CNN to obtain a feature vector $\phi(b_i)$. To use semantic information from the word-vectors, this feature vector is projected into the semantic embedding space as $\psi_i = W_p \phi(b_i)$, where $W_p$ is the projection matrix. The projection is trained so that $\psi_i$ lies close to the word-vector of the class of the bounding box. Concretely, suppose the class of the given bounding box is $y_i$, and the word-vector of $y_i$ is $w_i$; training aims to make $\psi_i$ and $w_i$ very similar. If the similarity between a feature vector $\psi_i$ and a word-vector $w_j$ is given by the cosine similarity $S_{ij}$, then the model (the CNN and the projection matrix) is trained to minimise the following max-margin loss:

$$\mathcal{L}(b_i, y_i, \theta) = \sum_{j \in \mathcal{S}, j \neq i} \max(0, m - S_{ii} + S_{ij})$$

where, $\theta$ are the parameters of the deep CNN and the projection matrix, and $m$ is the margin. At inference, the predicted class of a bounding box ($b_i$) is given by:

$$\hat{y}_i = \underset{j \in \mathcal{U}}{\operatorname{argmax}} ~ S_{ij}$$
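A minimal sketch of this loss and inference rule, using toy 3-dimensional "word vectors" and made-up class names purely for illustration (real word vectors would be 300-d or similar, and $\psi_i$ would come from the learned projection):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin_loss(psi, seen_word_vecs, true_class, m=0.2):
    """Max-margin loss for one projected box feature psi; seen_word_vecs
    maps each seen class name to its word vector."""
    s_true = cosine(psi, seen_word_vecs[true_class])
    return sum(max(0.0, m - s_true + cosine(psi, w))
               for c, w in seen_word_vecs.items() if c != true_class)

def predict(psi, unseen_word_vecs):
    """Inference: pick the unseen class whose word vector is most similar."""
    return max(unseen_word_vecs, key=lambda c: cosine(psi, unseen_word_vecs[c]))

# Toy vectors: seen classes for training, unseen classes for inference.
seen = {"cat": np.array([1.0, 0.0, 0.0]), "car": np.array([0.0, 1.0, 0.0])}
unseen = {"dog": np.array([0.9, 0.1, 0.0]), "bus": np.array([0.1, 0.9, 0.0])}

psi = np.array([1.0, 0.05, 0.0])          # a projected feature near "cat"
loss = max_margin_loss(psi, seen, "cat")  # zero: margin is satisfied
pred = predict(psi, unseen)               # "dog", the nearest unseen class
```

The loss pushes $\psi_i$ towards its own class's word vector and away from the other seen classes; at test time the same similarity is simply computed against the unseen classes instead.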

Note that the baseline approach does not include any background boxes in training. To train a more robust model which, like fully-supervised object detection methods, can better eliminate background boxes, two background-aware approaches are presented next.

Statically Assigned Background (SB) Based ZSD

In this approach, following previous work on object detection, all background training boxes are assigned to a single static background class, $y_b$, with a word-vector $w_b$. During training, this background class is treated just like any other class in $\mathcal{S}$.

There is one clear problem with this approach: some of the background boxes might actually belong to unseen classes. Moreover, background boxes are extremely varied, and mapping such varied bounding boxes to a single class is very difficult.

To overcome some of these issues, another background-aware model is proposed.

Latent Assignment Based (LAB) ZSD

In this approach the background boxes are spread over the embedding space by using an Expectation Maximization (EM)-like algorithm. Multiple latent classes are assigned to the background objects. At a higher level, this encodes the knowledge that a background box does not belong to the set of seen classes ($\mathcal{S}$), and could possibly belong to a number of different classes from a large vocabulary set, referred to as background set $\mathcal{O}$.

To accomplish this, a baseline model is trained first. This model is used to classify a subset of background boxes into the open vocabulary set $\mathcal{O}$. These background boxes are added to the training set. The model is trained for an epoch. This model is again used to classify another set of background boxes into $\mathcal{O}$. These are added to the training set and the model is trained again for 1 epoch. This process is repeated several times.
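The iterative procedure can be sketched as follows. The model interface here (`classify` and `fit_one_epoch`) is hypothetical, standing in for the actual detector, and the toy model at the bottom exists only so the loop can be run end to end:

```python
def train_lab(model, seen_data, background_boxes, vocab_O,
              n_rounds=3, chunk=1000):
    """EM-like LAB training loop (sketch)."""
    data = list(seen_data)
    for r in range(n_rounds):
        # Pseudo-label the next chunk of background boxes with classes
        # from the large open vocabulary O using the current model.
        subset = background_boxes[r * chunk:(r + 1) * chunk]
        data += [(box, model.classify(box, vocab_O)) for box in subset]
        # Train for one epoch on the enlarged training set.
        model.fit_one_epoch(data)
    return model

class _ToyModel:
    """Minimal stand-in for a real detector."""
    def __init__(self):
        self.epochs = 0
    def classify(self, box, vocab):
        return vocab[0]          # trivially pick the first vocabulary word
    def fit_one_epoch(self, data):
        self.epochs += 1

m = train_lab(_ToyModel(), [("box0", "person")],
              [f"bg{i}" for i in range(6)], ["tree", "lamp"],
              n_rounds=3, chunk=2)
```

Each round plays the role of an E-step (assigning latent labels to background boxes) followed by an M-step (one epoch of training on the enlarged set).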

This approach is related to open-vocabulary learning, and to latent variables based classification models.

Densely Sampled Embedding Space (DSES)

Using small datasets for training such ZSD models poses the problem that the embedding space is sparsely sampled. This is particularly problematic for recognizing unseen classes which, by definition, lie in parts of the embedding space that do not have training examples. To alleviate this issue, the paper proposes to augment the training procedure with additional data from external sources containing boxes whose classes are not unseen, i.e., $y_i \in \mathcal{C} \setminus \mathcal{U}$. This means the space of object classes is densely sampled during training, improving the alignment of the embedding space.

This concludes the post. I have tried to give a brief introduction to the problem and some approaches proposed by us. For more details about the problem and results, see the paper Zero-Shot Object Detection.

Sunday, December 24, 2017

Notes from ICCV 2017

Top Acceptance Rates
Video and Language - 53.8%
Autonomous Driving - 50%
Large-scale Optimization - 45%

Total - 29%

Favourite Papers on Video and Recognition


A Read-Write Memory Network for Movie Story Understanding
 - Question and answering task for large-scale, multimodal movie story understanding

Temporal Tessellation: A Unified Approach for Video Analysis
 - General approach to video understanding inspired by semantic transfer techniques
 - A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video.

Unsupervised Action Discovery and Localization in Videos
 - Training data - unlabeled data without bounding box annotations
 - The proposed approach (a) discovers action class labels and (b) spatio-temporally localizes actions in videos

Dense-Captioning Events in Videos
 - Introduce the task of dense-captioning events, which involves both detecting and describing events in a video.
 - Identify all events in a single pass of the video
 - Introduce a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes
 - Introduce a new captioning module that uses contextual information from past and future events to jointly describe all events.
 - New dataset - ActivityNet Captions - 849 video hours with 100k total descriptions

Learning long-term dependencies for action recognition with a biologically-inspired deep network
 - Biological neural systems are typically composed of both feedforward and feedback connections
 - shuttleNet - consists of several processors, each of which is a GRU while associated with multiple groups of hidden states
 - All processors inside shuttleNet are loop connected to mimic the brain's feedforward and feedback connections, in which they are shared across multiple pathways in the loop connection.

Compressive Quantization for Fast Object Instance Search in Videos
 - Object instance search in videos, where efficient point-to-set (image-to-video) matching is essential
 - A jointly optimized vector quantization method compresses the M object proposals extracted from each video into only k binary codes, where k << M
 - Similarity between the query object and the whole video can be determined by the Hamming distance between the query's binary code and the video's best-matched binary code

Complex Event Detection by Identifying Reliable Shots From Untrimmed Videos
 - Formulate as a MIL problem by taking each video as a bag and the video shots in each video as instances
 - New MIL method, which simultaneously learns a linear SVM classifier and infers a binary indicator for each instance in order to select reliable training instances from each positive or negative bag
 - In the objective function balance the weighted training errors and an l1-l2 mixed-norm regularization term which adaptively selects reliable shots as diverse as possible

Spatio-Temporal Person Retrieval via Natural Language Queries
 - Person retrieval from multiple videos
 - Output a tube which encloses the person described by the query
 - New dataset
 - Design a model that combines methods for spatio-temporal human detection and multimodal retrieval

Joint Discovery of Object States and Manipulation Actions
 - Automatically discover the states of objects and the associated manipulation actions
 - Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions

Pixel-Level Matching for Video Object Segmentation using Convolutional Neural Networks
 - The network aims to distinguish the target area from the background on the basis of the pixel-level similarity between two object units
 - The proposed network represents a target object using features from different depth layers in order to take advantage of both the spatial details and the category-level semantic information

Joint Detection and Recounting of Abnormal Events by Learning Deep Generic Knowledge
 - Recounting of abnormal events - explaining why they are judged to be abnormal
 - Integrate a generic CNN model and environment-dependent anomaly detectors
 - Learn a CNN with multiple visual tasks to exploit semantic information that is useful for detecting and recounting abnormal events
 - Appropriately plugging the model into anomaly detectors

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals
 - TURN jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression
 - Fast computation is enabled by unit feature reuse: a long untrimmed video is decomposed into video units, which are reused as basic building blocks of temporal proposals

Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
 - Zero-shot localization and classification of human actions in video
 - Spatial-aware object embedding
 - Build embedding on top of freely available actor and object detectors
 - Exploit the object positions and sizes in the spatial-aware embedding to demonstrate a new spatio-temporal action retrieval scenario with composite queries

Temporal Dynamic Graph LSTMs for Action-Driven Video Object Detection
 - Weakly supervised object detection from videos
 - Use action descriptions as supervision
 - But, objects of interest that are not involved in human actions are often absent in global action descriptions
 - Propose a novel temporal dynamic graph LSTM (TD-Graph LSTM). TD-Graph LSTM enables global temporal reasoning by constructing a dynamic graph that is based on temporal correlations of object proposals and spans the entire video
 - The missing label issue for each individual frame can thus be significantly alleviated by transferring knowledge across correlated object proposals in the whole video


Open Set Domain Adaptation
 - Domain adaptation in open sets - only a few categories of interest are shared between source and target data
 - The proposed method fits in both closed and open set scenarios
 - The approach learns a mapping from the source to the target domain by jointly solving an assignment problem that labels those target instances that potentially belong to the categories of interest present in the source dataset

FoveaNet: Perspective-aware Urban Scene Parsing
 - Estimate the perspective geometry of a scene image through a convolutional network which integrates supportive evidence from contextual objects within the image
 - FoveaNet "undoes" the camera perspective projection - analyzing regions in the space of the actual scene, and thus provides much more reliable parsing results
 - Introduce a new dense CRFs model that takes the perspective geometry as a prior potential

Generative Modeling of Audible Shapes for Object Perception
 - Present a novel, open-source pipeline that generates audio-visual data, purely from 3D shapes and their physical properties
 - Synthetic audio-visual dataset - Sound-20K for object perception tasks
 - Auditory and visual information play complementary roles in object perception, and the representation learned on synthetic audio-visual data can transfer to real-world scenarios

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation
 - Transfer human supervision between the previously separate tasks

 - Establishing semantic correspondences between images depicting different instances of the same object or scene category
 - CNN architecture for learning a geometrically plausible model for semantic correspondence
 - Uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function

 - The real-world noisy labels exhibit multi-modal characteristics as the true labels, rather than behaving like independent random outliers
 - Propose a unified distillation framework to use “side” information, including a small clean dataset and label relations in knowledge graph, to “hedge the risk” of learning from noisy labels.
 - Propose a suite of new benchmark datasets

 - Presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues
 - Model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions
 - Special attention given to relationships between people and clothing or body parts mentions, as they are useful for distinguishing individuals. 
 - Automatically learn weights for combining these cues and at test time, perform joint inference over all phrases in a caption

 -  Leverage the strong correlations between the predicate and the <subj, obj> pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects.
 - Use knowledge of linguistic statistics to regularize visual model learning
 - Obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a <subj, obj> pair
 - Distill this knowledge into the deep model to achieve better generalization

 - Introduce an end-to-end multi-task objective that jointly learns object-action relationships
 - Proposed architecture can be used for zero-shot learning of actions

 - Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer
 - Both the program generator and the execution engine are implemented by neural networks, and are trained using a combination of backpropagation and REINFORCE

Wednesday, May 3, 2017

Superintelligence and Singularity

This is another essay I wrote for a class at Maryland.


Understanding and emulating human intelligence has been the target of artificial intelligence researchers for a long time now. However, human-level artificial intelligence is not the final destination. Most researchers seem to think that in the next few decades we will start developing technologies that improve upon human intelligence. This could happen either by increasing human intelligence or by creating artificial intelligence which surpasses human intelligence. This will lead to a positive feedback loop where improving intelligence leads to more technology which improves intelligence further. The mathematician I. J. Good called this process an “intelligence explosion” [5]. This means that even a small improvement in intelligence can lead to immense changes within a short period. This event is called a technological singularity (or, here, the Singularity). The Singularity is an event where runaway intelligence growth far surpasses any human comprehension or control. Ray Kurzweil defines the Singularity as a future period during which the pace of technological change will be so rapid, and its impacts so deep, that human life will be irreversibly transformed [6].

An agent that possesses intelligence far surpassing that of the brightest and the most gifted humans [1] is called a superintelligence. Many philosophers and AI researchers believe that once we achieve superintelligence, the singularity is not far behind [4; 7]. And they believe that we are not far from achieving superintelligence. This raises questions about what such a superintelligence might do. Some people believe that this raises major existential risks for humans [4]. Others think that this will be extremely useful for humans [6]. However, everyone agrees that the Singularity is an event which will change/end the way we live. Vernor Vinge says that the change will be comparable to the rise of human life on Earth [7]. Eliezer Yudkowsky believes that the next few decades could determine the future of intelligent life. He says that superintelligence is the single most important issue in the world right now [2]. I.J. Good wrote - “The first ultraintelligent machine is the last invention that man need ever make” [5].

In this essay, I will present some paths that might lead to superintelligence, and hence, the Singularity. I will also discuss the ways in which such an agent might affect human lives and some steps to be taken to avoid the “major existential risks”.


There are several ways through which superintelligence could be achieved. It is extremely difficult to predict exactly which one will ultimately lead to superintelligence. However, most researchers believe that some combination of the following is likely to be the reason [4; 7; 8; 6; 2].

Artificial superintelligence
In this scenario, humans will create an artificial intelligence matching human intelligence. But, since an AI operates at much higher speeds than humans, it will be able to rewrite its own source code and create higher intelligence within a very short time leading to an intelligence explosion.

Biomedical improvements
Humans will increase their intelligence by enhancing the functioning of our biological brains. This could be achieved, for example, through drugs, selective breeding, or manipulation of genes. Such cognitive enhancements will accelerate science and technology. This will enable humans to increase their intelligence further. Higher cognitive capabilities will also enable humans to understand their own brains better and thus build a superintelligent AI.

Brain-to-computer interfaces
We will be able to build technology that can directly interface with human brains. This means that we will achieve intelligence amplification through brain-machine interface. There will be no difference between man and machine. They will become a single entity.

Networks and organisations that link humans with one another will become sufficiently efficient to be considered a superhuman being. This is an example of a collective superintelligence. Such a network will be efficient in the sense that the barriers to communication are reduced or removed. All of humanity will become one superintelligent being.


Regardless of how science achieves superintelligence, its impact on intelligent life will be immense. This will be an event similar to the origin of human life on Earth [7]. What will a superintelligent being do? This is an important question. It is also unanswerable before a superintelligence actually emerges. Unlike human intelligence, the space of all possible superintelligences is vast [2]. Yudkowsky says that the impact of the intelligence explosion depends on exactly what kind of minds go through the tipping point [2]. Vinge argues that what the superintelligence will do is absolutely unpredictable [7]: you would have to be as intelligent as the superintelligence to understand its motivations and actions. On the other hand, Kurzweil believes that technological developments typically follow smooth exponential curves, and thus we can predict the arrival of new technology and its impacts [6; 3]. (He makes several such predictions in his book, which I will discuss in a bit.)

Given all of this, there are two main camps of thought about the future: the pessimists and the optimists. The first camp believes that the development of a superintelligence poses a major existential crisis [4]. Bostrom argues that an intelligence explosion will not give us time to adapt. Once someone finds one of the several keys to creating a superintelligence, we will have anywhere from a few hours to a few weeks till it achieves complete world dominance. This is not enough to form strategies for dealing with such a dramatic change. He believes that the default outcome of this event is doom. The first such system will quickly achieve a decisive strategic advantage and become a singleton, eliminating all competing superintelligent systems. Even if programmed with a goal to serve humanity, such an agent might have a convergent instrumental reason to eliminate threats to itself. It might consider the same humans it is supposed to serve as hindrances in achieving its goals. The pessimist camp says that there are several malignant failure modes for a superintelligent system. The agent might find some way of satisfying its final goals which violates the intentions of the programmers who defined the goals. Or the agent might transform large parts of the universe into infrastructure needed to satisfy its goals. This would prevent humanity from realising its “full axiological potential” [4]. Bostrom also argues that controlling such an agent is almost impossible.

On the other hand, the optimists believe that development of a superintelligence will be beneficial for humanity. Ray Kurzweil says - “The Singularity will allow us to transcend the limitations of our biological bodies and brains. We will be able to live long (as long as we want)...fully understand human thinking and will vastly extend and expand its reach”. He believes that the Singularity will be achieved through brain-machine interface. He envisions a world that is still human but that transcends our biological roots. In his world, there will be no distinction between brain and machine or physical and virtual reality. Kurzweil says that the intelligence will still represent the human civilization. Others in the optimist camp believe that the superintelligent agents will be benevolent gods. Such agents can develop cures for currently incurable diseases, can crack the aging problem, and can find ways to eliminate all human suffering.

The impact of the Singularity is a very contentious issue. However, everyone agrees that it will be immense and the development of a superintelligence will be a world changing event. Such an event also raises moral and ethical issues. Should the superintelligent agent be given moral status? If so, how much? Should the agent be considered on par with humans? Or should it be given a higher moral status? These are important questions and have significant implications.

I believe that the development of superintelligence represents the next level in the evolution of intelligent beings. I think that if a truly superintelligent being is created, then it has every right to attain world dominance, just as we hold it now. Such an agent might decide to eliminate humans, or we might become that agent. But this should not stop us from trying to understand intelligence and build intelligent systems. However, we have to be absolutely sure that such an agent is superintelligent, i.e., better than humans in all respects. Unless we are sure of that, we have to be extremely careful.

[1] https://en.wikipedia.org/wiki/Superintelligence
[2] http://yudkowsky.net/singularity/intro/
[3] http://yudkowsky.net/singularity/schools/
[4] Nick Bostrom. Superintelligence: Paths, dangers, strategies. OUP Oxford, 2014.
[5] Irving John Good. Speculations concerning the first ultraintelligent machine. Advances in computers, 6:31–88, 1966.
[6] Ray Kurzweil. The singularity is near: When humans transcend biology. Penguin, 2005.
[7] Vernor Vinge. The coming technological singularity: How to survive in the post-human era. In Proceedings of a Symposium Vision-21: Interdisciplinary Science & Engineering in the Era of CyberSpace, held at NASA Lewis Research Center (NASA Conference Publication CP-10129), 1993.
[8] Vernor Vinge. Signs of the singularity. IEEE Spectrum, 45(6), 2008.

Thursday, April 13, 2017

False Memories

This is an essay I wrote for a class at Maryland.


Each of us remembers an event or events which none of our friends and relatives remember. You might remember getting lost in a mall on a family trip, or witnessing an accident, or, as I did recently, taking a group photograph at a friend’s wedding. However, your friends and family remember the day of the event completely differently, and they all agree on what happened. You think your friends just have very poor memories and must have forgotten the event. But chances are, you are the one who doesn’t remember what happened. The event you so clearly remember might never have happened, or might have happened very differently. What is going on here? Are you losing your mind, or are your friends playing a prank?

False memory is a well-studied psychological phenomenon in which a person recalls something that either did not occur or occurred differently. When you remember taking a photograph at your friend’s wedding and no such photograph exists, you have somehow created a false memory. In this essay, I will discuss some studies which show how easy it is to acquire false memories. Studying false memories can shed light on how the human brain stores and retrieves memories.

When false memories begin influencing the orientation of a person’s life, the condition is called false memory syndrome (FMS). Though it is not recognised as a psychiatric illness, FMS can affect the “identity and relationships” of a person [1]. In some cases, the whole identity of a person can change because of a false memory of a traumatic experience. Understanding this phenomenon will help us understand ideas about identity and consciousness, and neurological studies of patients suffering from FMS can help unlock the secrets of how memories are created and stored.

The concepts of false memory and false memory syndrome are closely related to the phenomenon of confabulation, which is the creation of false memories without the intention to deceive. There are profound legal issues related to confabulation and false memories. How do you find out whether a person intends to deceive? How much do you trust eyewitness testimony? Which of the thousands of claims of repressed memories of childhood sexual abuse do you believe? These are important questions for the judiciary and for psychologists trying to understand human behaviour. Another related effect is the source-monitoring error, which happens when you incorrectly attribute the source of a memory or piece of information. You might attribute a fact that you know to a book when you actually saw it in a video.

In the next few sections, I will explore some of these phenomena and present some experiments which might make you question every memory you have.


Scientists have discovered several ways of creating false memories in people. Photographs, speech, and text have all been used to create false memories. I will describe some very simple examples of memory distortion and false memory implantation. First, I discuss a very influential study by Loftus and Palmer which shows how language can create false memories.

Recalling incorrect information due to the language of the question
In [7], Loftus and Palmer showed subjects videos of cars hitting each other and then asked them to estimate the speed of the cars. They found that using different words to describe the accident led to different estimates of the speed. For example, the question “About how fast were the cars going when they smashed into each other?” [7] led subjects to estimate higher speeds than versions using the verbs bumped, collided, contacted, or hit. They observed similar trends for the question “Did you see any broken glass?” [7]. This showed that human memories are extremely susceptible to suggestion and can be influenced by changing just a single word.

Similar studies, changing an article (“Did you see a stop sign?” vs. “Did you see the stop sign?”) or an adjective (“How tall was the basketball player?” vs. “How short was the basketball player?”) in the question, led to differing accounts of events [2]. This is because using one particular word rather than another causes subjects to form certain presuppositions which colour their judgment about the events in question. This raises questions about the reliability of recalled memories. Dr. Loftus has written extensively about the unreliability of memories recalled through prolonged searches for them [6]. She says that the rise in cases of child abuse involving repressed memories is alarming, and the possibility of these recalled memories actually being false memories should not be ignored. In some cases, the psychiatrists themselves might be responsible for creating these false memories in their subjects, through techniques like age regression, hypnosis, and guided visualisation.

Another study involving the use of language for creating false memories dealt with remembering lists of words.

False memories through lists
The authors of [8] show that even college students who are “professional memorisers” can falsely remember words not present in a list they were asked to remember. Subjects were given lists of words related to a concept (the nonpresented word), without the concept ever appearing in the list. For example, a subject might have been given the list bed, alarm, rise, dream, and so on. All these words are usually associated with sleep, but the word sleep is never mentioned in the list. The recall rate of the nonpresented word was very high. This led the authors to conclude that all memory is constructive in nature. This contradicted the theory of reproductive and reconstructive memories proposed by Bartlett and Burt [4], which was the prevalent belief at the time. That theory held that list-learning paradigms involve rote reproduction, which causes few errors, whereas rich material like stories encourages constructive processes which form associations and connections between different parts of the material, so that retrieval of these memories leads to more errors. By showing incorrect recall of words in lists, the authors of [8] showed that the distinction between reproductive and reconstructive memories was ill-founded.

Obviously, language is not the only source of false memory creation. The next section describes a study which shows that visual information can also lead to false memories.

Photographs with news articles
Photographs accompanying a news article can help cement its content [10]. In their experiments, the authors of [10] showed newspaper headlines to subjects. Some of these headlines were accompanied by photos which were tangentially related to the headline, and some of the headlines were false, that is, the events they described had never actually happened. After reading the headlines and seeing the photographs, where present, the subjects were asked whether they remembered the events described in the headlines. The authors found that photos mattered: for both true and false headlines, people remembered more of the events described by headlines accompanied by photographs. In remembering the events described by the false headlines, people had created false memories of those events, and they created more false memories for the events which had photos associated with them. The authors explained this using Rubin’s basic-systems approach to memory [9]. This theory says that memory is the result of multiple systems and subsystems (visual, auditory, language, etc.) which interact with and reinforce each other. Providing stimulus to multiple subsystems reinforces each of them, which helps create stronger memories.

Given all these ways of creating false memories, it is clear that studying the phenomenon can help answer several questions about how the human brain encodes, stores, and retrieves information.

However, false memories are not just a personal phenomenon. Similar false memories can be shared by many people, or even by an entire community.

Collective False Memory
Very recently, in a lecture, someone mentioned that Jimmy Carter held a nuclear engineering degree, and a lot of people in the audience agreed. However, on checking, I found out that he did not actually hold a nuclear engineering degree. This is an example of a collective false memory: an incorrect memory shared by multiple people. The phenomenon is also called the ‘Mandela effect’, after the many people around the world who incorrectly remember that Nelson Mandela died in the 1980s. Social reinforcement of false memories is held to be one of the leading causes of collective false memory. Suggestibility of people under similar circumstances can also lead to the creation of collective false memories.

The study of false memories can give important clues as to how human memory is encoded, stored, manipulated, and retrieved. Studying retroactive interference and the misinformation effect [3] can help us understand the encoding process for memories. Retroactive interference is the process by which information presented later interferes with information already stored in the brain, causing the earlier information or memories to be modified or completely erased. This effect can be clearly seen at play in several of the studies which create false memories (e.g., false memory creation through language). Neurological studies conducted during false memory experiments can reveal the areas of the brain affected by the incorrect information.

False memories are also related to imagination. In [5], the authors demonstrated “imagination inflation” - the phenomenon that simply imagining a childhood event increases subjects’ confidence that the event actually happened. Studying this further might help us understand how we imagine, what the process of forming pictures in the “mind’s eye” is, and how imagination is related to memory.

Studying false memory, like any other peculiar human behaviour, can provide important information about the human brain and the mind.

[1] https://en.wikipedia.org/wiki/False_memory_syndrome
[2] https://en.wikipedia.org/wiki/False_memory
[3] https://en.wikipedia.org/wiki/Misinformation_effect
[4] Frederic Charles Bartlett and Cyril Burt. Remembering: A study in experimental and social psychology. British Journal of Educational Psychology, 3(2):187–192, 1933.
[5] Maryanne Garry, Charles G Manning, Elizabeth F Loftus, and Steven J Sherman. Imagination inflation: Imagining a childhood event inflates confidence that it occurred. Psychonomic Bulletin & Review, 3(2):208–214, 1996.
[6] Elizabeth Loftus. Memory distortion and false memory creation. Bulletin of the American Academy of Psychiatry and Law, 24(3):281–295, 1996.
[7] Elizabeth F Loftus and John C Palmer. Reconstruction of automobile destruction: An example of the interaction between language and memory. Journal of Verbal Learning and Verbal Behavior, 13(5):585–589, 1974.
[8] Henry L Roediger and Kathleen B McDermott. Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(4):803, 1995.
[9] David C Rubin. The basic-systems model of episodic memory. Perspectives on Psychological Science, 1(4):277–311, 2006.
[10] Deryn Strange, Maryanne Garry, Daniel M Bernstein, and D Stephen Lindsay. Photographs cause false memories for the news. Acta Psychologica, 136(1):90–94, 2011.

Friday, December 9, 2016

A short summary of the paper - Generative adversarial nets

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.

This paper [1] proposes a method for estimating generative models via an adversarial process. The authors train two networks simultaneously: a generative model, $G$, and a discriminative model, $D$. The generative model learns the data distribution, and the discriminative model estimates the probability that a sample came from the data rather than from $G$. The two networks are trained using a two-player minimax game. The discriminative network, $D$, is trained to maximise the probability of correctly classifying samples from both the training data and the data generated by $G$. Simultaneously, $G$ is trained to maximise the probability of $D$ making a mistake, i.e., so that the data generated by $G$ is indistinguishable from the training data. The authors train the two networks iteratively, alternating between $k$ steps of optimising $D$ and one step of optimising $G$.
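
To make the alternating scheme concrete, here is a minimal sketch on a toy 1-D problem, assuming a linear generator and a logistic discriminator with hand-derived gradients. The target distribution, parameter names, learning rate, and value of $k$ are all illustrative choices of mine, not values from the paper.

```python
# Toy GAN: G(z) = a*z + b tries to match samples from N(3, 1);
# D(x) = sigmoid(w*x + c) is a logistic classifier. We alternate k ascent
# steps on D's objective with one ascent step on G's objective.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

a, b = 1.0, 0.0               # generator parameters: G(z) = a*z + b
w, c = 0.1, 0.0               # discriminator parameters: D(x) = sigmoid(w*x + c)
lr, batch, k = 0.05, 128, 1   # k discriminator steps per generator step

for step in range(3000):
    # --- k steps of gradient ascent on E[log D(x)] + E[log(1 - D(G(z)))]
    for _ in range(k):
        x_real = rng.normal(3.0, 1.0, batch)
        x_fake = a * rng.normal(0.0, 1.0, batch) + b
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        # d/ds log sigmoid(s) = 1 - sigmoid(s);  d/ds log(1 - sigmoid(s)) = -sigmoid(s)
        grad_w = np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake)
        grad_c = np.mean(1 - d_real) - np.mean(d_fake)
        w += lr * grad_w
        c += lr * grad_c

    # --- one step of gradient ascent on E[log D(G(z))]
    # (the non-saturating variant the paper recommends in practice)
    z = rng.normal(0.0, 1.0, batch)
    x_fake = a * z + b
    d_fake = sigmoid(w * x_fake + c)
    grad_a = np.mean((1 - d_fake) * w * z)
    grad_b = np.mean((1 - d_fake) * w)
    a += lr * grad_a
    b += lr * grad_b

fake_mean = np.mean(a * rng.normal(0.0, 1.0, 10000) + b)
print(f"generated mean = {fake_mean:.2f} (target 3.0)")
```

After training, the generated samples should cluster around the target mean. Setting `k` above 1 gives the discriminator more steps to approach its optimum for the current generator, at extra training cost.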

The authors also present theoretical guarantees on the convergence of the algorithm to the optimal value (in the sense of the global minimum of their objective function). However, the guarantees do not apply to the case presented in the paper, because they rest on assumptions that are infeasible to implement using neural networks. Nevertheless, the paper argues that since deep neural networks perform very well in several domains, they are reasonable models to use here.

In this paper, the authors use $k = 1$ to minimize the training cost. I think that it would be interesting to try higher values of $k$. This would bring the framework closer to one of the conditions of their guarantees; the condition that $D$ is allowed to reach its optimum given $G$. The authors don't mention whether they tried this. This could lead to better convergence even though it will be more expensive to train.

A disadvantage of this approach is that $D$ must be well synchronised with $G$, i.e., $G$ cannot be trained too much without updating $D$. Still, adversarial nets offer several advantages over traditional generative models. This paper is an important step towards unsupervised learning and has already inspired some work in that direction [2].


[1] Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In Advances in Neural Information Processing Systems, pp. 2672-2680. 2014.
[2] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

Wednesday, December 7, 2016

A short summary of the paper - On the Number of Linear Regions of Deep Neural Networks

This is a short summary of [1] which I wrote for a lecture of the Deep Learning course I am taking this semester.


This paper presents theoretical results about the advantages of deep networks over shallow networks. The authors calculate bounds on the number of linear regions that deep networks with piecewise linear activation functions (e.g. ReLU and Maxout) can use to approximate functions. The number of linear regions used for representing a function can be thought of as a measure of the complexity of the representation model. The paper shows that deep networks produce exponentially more linear regions than shallow networks. 

Recently, [4] showed that shallow networks require exponentially more sum-product hidden units than deep networks to represent certain functions. For several years, people used smooth activation functions as the non-linearities in neural networks, but [2] showed that Rectified Linear Units (ReLU) are much faster to train. In 2013, [3] introduced a new form of piecewise linear activation called Maxout. This paper [1] aims to present a theoretical analysis of the advantages of deep networks with such activations over shallow networks. Such an analysis was also done in [5], which showed that, while approximating a function, deep networks are able to produce exponentially more linear regions than shallow networks with the same number of hidden units. This paper presents a tighter lower bound than [5] on the maximal number of linear regions of functions computed by neural networks. This lends deep networks an advantage over shallow networks in terms of representational power: a higher number of linear pieces means an ability to represent more complex functions.

This paper also describes how intermediate layers of deep neural networks "map several pieces of their inputs into the same output". The authors present the hypothesis that deep networks re-use and compose features from lower layers to higher layers exponentially often with the increase in the number of layers. This gives them the ability to compute highly complex functions.

Theoretical bounds on the number of linear regions produced by shallow networks were presented in [5]. They show that the maximal number of linear regions of functions computed by shallow rectifier networks with $n_0$ inputs and $n_1$ hidden units is $\sum_{j=0}^{n_0}\binom{n_1}{j}$. They also obtain a bound of $\Omega ((\frac{n}{n_0})^{(L-1)}n^{n_0})$ on the number of linear regions of a function that can be computed by a rectifier neural network with $n_{0}$ inputs and $L$ hidden layers of width $n \geq n_0$.

The main result presented in this paper is the following (Corollary 6 in [1]):
A rectifier neural network with $n_0$ input units and $L$ hidden layers of width $n \geq n_0$ can compute functions that have $\Omega ((\frac{n}{n_0})^{(L-1)n_0}n^{n_0})$ linear regions.
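
To get a feel for the gap between the shallow and deep bounds, here is a small numerical comparison for one illustrative configuration; the specific values of $n_0$, $n_1$, $n$, and $L$ are my own choices (not from the paper), and the deep expression drops the constant hidden in the $\Omega$.

```python
# Compare the shallow maximal-region count with the deep lower bound for
# 20 hidden units total: one shallow layer of 20 units vs. 4 layers of width 5.
from math import comb

n0 = 2        # input dimension
n1 = 20       # hidden units in the shallow network
n, L = 5, 4   # width and depth of the deep network (same 20 hidden units)

# Shallow rectifier network: maximal regions = sum_{j=0}^{n0} C(n1, j)
shallow_bound = sum(comb(n1, j) for j in range(n0 + 1))

# Deep rectifier network (Corollary 6, up to the Omega constant):
# (n / n0)^((L-1) * n0) * n^n0
deep_bound = (n / n0) ** ((L - 1) * n0) * n ** n0

print(shallow_bound)  # 1 + 20 + 190 = 211
print(deep_bound)     # 2.5^6 * 25 = 6103.515625
```

Even at this small scale, rearranging the same 20 hidden units into depth raises the region count by more than an order of magnitude, and the gap widens exponentially with $L$.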

The main contribution of this paper is a tighter bound on the complexity of the functions that deep networks can represent. The authors theoretically show that deep networks are exponentially more efficient than shallow networks. This is a step forward in developing a theoretical understanding of deep networks. However, an important question that needs to be answered is whether we actually need an exponentially higher capacity for the tasks that we are currently interested in. Recently, [6] showed that shallow networks with a similar number of parameters as deep networks mimic the accuracies obtained by deep networks on several tasks (TIMIT and CIFAR-10). Although they used deep networks to train their shallow models, this result shows that shallow networks have the capacity to perform as well as deep networks at least on some tasks. We might just have to figure out ways to train them efficiently to ensure that their capacity is fully exploited.


[1] Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. "On the number of linear regions of deep neural networks." In Advances in neural information processing systems, pp. 2924-2932. 2014.
[2] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep Sparse Rectifier Neural Networks." In AISTATS, vol. 15, no. 106, p. 275. 2011.
[3] Goodfellow, Ian J., David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. "Maxout networks." ICML (3) 28 (2013): 1319-1327.
[4] Delalleau, Olivier, and Yoshua Bengio. "Shallow vs. deep sum-product networks." In Advances in Neural Information Processing Systems, pp. 666-674. 2011.
[5] Pascanu, Razvan, Guido Montufar, and Yoshua Bengio. "On the number of response regions of deep feed forward networks with piece-wise linear activations." arXiv preprint arXiv:1312.6098 (2013).
[6] Ba, Jimmy, and Rich Caruana. "Do deep nets really need to be deep?." In Advances in neural information processing systems, pp. 2654-2662. 2014.