## Monday, April 16, 2018

### Zero-Shot Object Detection

Authors: Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, Ajay Divakaran

In this post I give a brief description of our recent work Zero-Shot Object Detection. This work was mostly done when I was an intern at SRI International in summer 2017.

## Introduction

This paper introduces the problem of zero-shot object detection. First, let's parse what that means. The name combines two terms:

Zero-shot learning is the framework in which training data for some classes is not available, but the model is still expected to recognise these classes. Classes for which training data is available are called seen classes, and classes for which no training data is available are called unseen classes.

Object detection is the problem of recognising and localising objects in images. The output of an object detection model is usually a bounding box which covers an object, together with the class of that object. Note that this is different from image recognition, which involves recognising only the class of the image.

Zero-shot object detection (ZSD) is an important problem because the visual world is composed of thousands of object categories. Anything you see can be given a visual label. However, training a fully-supervised model for detecting so many categories requires infeasible amounts of resources (training data, annotation costs etc.). So, there is a need to develop methods which do not require these huge resources or require resources which are readily available.

Zero-shot learning is the general category of problems where data for some classes is not available during training. There has been some work on zero-shot image recognition, where localising the objects is not important. To be able to recognise previously unseen categories, these methods used external semantic information either in the form of word-vectors, or attributes. This semantic information serves as the link between seen classes and unseen classes. The approach in this paper also uses semantic information to bridge seen and unseen classes.

Fig. 1: We highlight the task of zero-shot object detection where object classes “arm”, “hand”, and “shirt” are observed (seen) during training, while classes “skirt” and “shoulder” are not seen. These unseen classes are localized by our approach, which leverages semantic relationships, obtained via word embeddings, between seen and unseen classes along with the proposed zero-shot detection framework. The example has been generated by our model on images from the VisualGenome dataset.

Fully-supervised object detection approaches like R-CNN, Faster R-CNN, YOLO (You Only Look Once), and SSD (Single-Shot Detector) have a fixed "background" class which encompasses everything other than the object classes in the training set. However, selecting the background class in the zero-shot setting is not trivial because, unlike the fully-supervised case, there is no such thing as a true background. Bounding boxes which do not include any objects from the training categories might contain either background stuff (grass, sky, sea, etc.) or objects from the test set, and there is no way to say which of these boxes belong to the background and which belong to test classes. Moreover, in the true zero-shot setting (called open vocabulary), you want to be able to detect everything from objects to stuff (grass, sky, etc.). So, selecting a background class is difficult. This paper discusses these issues in greater depth and presents some solutions.

This paper makes the following contributions:
1. It introduces the problem of ZSD and presents a baseline method that follows existing work on zero-shot image classification and fully supervised object detection.
2. It discusses some challenges associated with incorporating information from background regions and proposes two methods for training background-aware detectors.
3. It examines the problem of sparse sampling of classes during training and proposes a solution which densely samples training classes using additional data.

In addition, the paper provides extensive experiments and discussions.

## Approach

The baseline approach for ZSD adapts prior work on zero-shot classification for detection.

### Baseline ZSD

Let $\mathcal{C} = \mathcal{S} \cup \mathcal{U} \cup \mathcal{O}$ be the set of all classes, where $\mathcal{S}$ is the set of seen (train) classes, $\mathcal{U}$ is the set of unseen (test) classes, and $\mathcal{O}$ is the set of all classes that are neither seen nor unseen. Given a bounding box $b_i \in \mathbb{N}^4$, the cropped object is passed through a CNN to obtain a feature vector $\phi (b_i)$. To use semantic information from the word-vectors, this feature vector is projected into the semantic embedding space as $\psi_i = W_p \phi (b_i)$, where $W_p$ is the projection matrix. The projection is trained so that $\psi_i$ lies close to the word-vector of the class of the bounding box: if the class of the box is $y_i$ and the word-vector of $y_i$ is $w_i$, then $\psi_i$ and $w_i$ should be very similar. Let $S_{ij}$ denote the cosine similarity between the projected feature $\psi_i$ and a word-vector $w_j$, so that $S_{ii}$ is the similarity with the word-vector of the true class. The model (CNN and projection matrix) is then trained to minimise the following max-margin loss:

$$\mathcal{L}(b_i, y_i, \theta) = \sum_{j \in \mathcal{S}, j \neq i} \max(0, m - S_{ii} + S_{ij})$$

where $\theta$ denotes the parameters of the deep CNN and the projection matrix, and $m$ is the margin. At inference, the predicted class of a bounding box $b_i$ is given by:

$$\hat{y}_i = \underset{j \in \mathcal{U}}{\arg\max} ~S_{ij}$$
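
The loss and the inference rule above can be sketched in a few lines of NumPy. This is only an illustration for a single box, not the implementation used in the paper: the function names, the default margin of 1.0, and the use of an already-extracted feature vector `phi` in place of the CNN output are all assumptions made for the example.

```python
import numpy as np

def cosine_similarities(psi, word_vectors):
    """Cosine similarity between one projected feature and each row of word_vectors."""
    psi_unit = psi / np.linalg.norm(psi)
    w_unit = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return w_unit @ psi_unit

def max_margin_loss(phi, W_p, seen_vectors, true_idx, margin=1.0):
    """Max-margin loss for one box, summed over seen classes other than the true one.

    The margin default is an arbitrary value chosen for this sketch.
    """
    psi = W_p @ phi                                   # project into the semantic space
    sims = cosine_similarities(psi, seen_vectors)     # S_ij for every seen class j
    hinge = np.maximum(0.0, margin - sims[true_idx] + sims)
    hinge[true_idx] = 0.0                             # the sum excludes the true class
    return float(hinge.sum())

def predict_unseen(phi, W_p, unseen_vectors):
    """Index of the unseen class whose word-vector is most similar to the projection."""
    psi = W_p @ phi
    return int(np.argmax(cosine_similarities(psi, unseen_vectors)))
```

In practice the loss would be averaged over a batch of boxes and minimised with a gradient-based optimiser, so that both the CNN and $W_p$ are updated.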

Note that the baseline approach does not include any background boxes in training. To train a more robust model which, like fully-supervised object detection methods, can better eliminate background boxes, two background-aware approaches are presented next.

### Statically Assigned Background (SB) Based ZSD

In this approach, following previous work on object detection, all background training boxes are assigned to a single static background class $y_b$ with word-vector $w_b$. During training, this background class is treated just like any other class in $\mathcal{S}$.
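
As a rough sketch of what this amounts to in code, the background word-vector can be appended as one extra row to the matrix of seen-class word-vectors, and every background box relabelled with the index of that row. The helper names are hypothetical, and since this post does not say how $w_b$ is obtained, the example simply takes it as an input.

```python
import numpy as np

def add_background_class(seen_vectors, background_vector):
    """Append w_b as an extra row; return the extended matrix and the index of y_b."""
    extended = np.vstack([seen_vectors, background_vector])
    return extended, extended.shape[0] - 1

def label_boxes(box_labels, is_background, background_idx):
    """Give every background box the background label; keep the other labels as they are."""
    labels = np.asarray(box_labels).copy()
    labels[np.asarray(is_background, dtype=bool)] = background_idx
    return labels
```

After this step, training proceeds exactly as in the baseline, with the background class taking part in the max-margin loss like any seen class.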

This approach has some clear problems. Some of the background boxes might belong to unseen classes. Background boxes can also be extremely varied, and mapping such varied boxes to a single class is very difficult.

To overcome some of these issues, another background-aware model is proposed.

### Latent Assignment Based (LAB) ZSD

In this approach, the background boxes are spread over the embedding space using an Expectation Maximization (EM)-like algorithm, and multiple latent classes are assigned to the background objects. At a high level, this encodes the knowledge that a background box does not belong to the set of seen classes ($\mathcal{S}$) and could possibly belong to any of a number of different classes from a large vocabulary set, referred to as the background set $\mathcal{O}$.

To accomplish this, a baseline model is trained first. This model is used to classify a subset of background boxes into the open vocabulary set $\mathcal{O}$. These background boxes are added to the training set, and the model is trained for one epoch. The updated model is then used to classify another set of background boxes into $\mathcal{O}$; these are also added to the training set, and the model is trained for one more epoch. This process is repeated several times.
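
The alternating procedure can be sketched as a short loop. This is a simplified illustration under several assumptions that do not come from the paper: the CNN is replaced by fixed feature vectors, training targets are stored directly as word-vectors, background subsets are drawn uniformly at random (so a box may be drawn in more than one round), and `train_one_epoch` is a hypothetical callback that stands in for one epoch of max-margin training and returns the updated projection matrix.

```python
import numpy as np

def assign_latent_classes(features, W_p, vocab_vectors):
    """Assign each background box to its most similar class in the vocabulary set O."""
    psi = features @ W_p.T
    psi = psi / np.linalg.norm(psi, axis=1, keepdims=True)
    w = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    return np.argmax(psi @ w.T, axis=1)

def latent_assignment_training(W_p, feats, targets, background_feats, vocab_vectors,
                               train_one_epoch, num_rounds=3, subset_size=2, seed=0):
    """Alternate between labelling background boxes and training for one epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(num_rounds):
        # Label a random subset of background boxes with the current model.
        pick = rng.choice(len(background_feats), size=subset_size, replace=False)
        latent = assign_latent_classes(background_feats[pick], W_p, vocab_vectors)
        # Add the newly labelled boxes to the training set.
        feats = np.vstack([feats, background_feats[pick]])
        targets = np.vstack([targets, vocab_vectors[latent]])
        # Train for one epoch on seen boxes plus latent-labelled background boxes.
        W_p = train_one_epoch(W_p, feats, targets)
    return W_p, feats, targets
```

The labelling step plays the role of the E-step and the one-epoch update plays the role of the M-step, which is why the procedure is described as EM-like.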

This approach is related to open-vocabulary learning, and to latent variables based classification models.

### Densely Sampled Embedding Space (DSES)

Using small datasets for training such ZSD models poses the problem that the embedding space is sparsely sampled. This is problematic particularly for recognizing unseen classes which, by definition, lie in parts of the embedding space that do not have training examples. To alleviate this issue, the paper proposes to augment the training procedure with additional data from external sources that contain boxes belonging to classes other than unseen classes, $y_i \in \mathcal{C} - \mathcal{U}$. This means the space of object classes is densely sampled during training to improve the alignment of the embedding space.
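
The key constraint is that the additional data must not contain any unseen class, otherwise the setting is no longer zero-shot. A minimal sketch of that filtering step, assuming external annotations arrive as `(box, label)` pairs (a format chosen here only for illustration), could look like this:

```python
def select_external_boxes(external_boxes, unseen_classes):
    """Keep external (box, label) pairs whose label is not an unseen class."""
    unseen = set(unseen_classes)
    return [(box, label) for box, label in external_boxes if label not in unseen]
```

The boxes that survive the filter are added to the training set, so that more regions of the embedding space have training examples.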

This concludes the post. I have tried to give a brief introduction to the problem and to some of the approaches we proposed. For more details about the problem and the results, see the paper Zero-Shot Object Detection.