Object detection has received significant attention, and rapid progress has been made in the area. Many recent object detection systems achieve excellent performance [1,2,3,4].
Object detection
However, a deeper understanding of the scene also involves finding the interactions between objects. In human-centric images, the most important interactions are the ones between humans and objects. For example, in the image shown below, in addition to detecting the humans and objects, knowing their interactions provides a better understanding of the scene.
Human-object interaction detection
In this post, I will briefly discuss our recent work on detecting human-object interactions (HOIs). An interaction between a human and an object is usually represented as the triplet $\texttt{<human, predicate, object>}$. Detecting an HOI involves localizing the human and the object and correctly predicting the predicate, i.e., the type of interaction between them. In this work, we let a well-performing object detector do the heavy lifting of detecting the entities involved. Such a detector gives bounding boxes, RoI-pooled features, and class labels for each object/human in the image. We instead focus on correctly predicting the predicate.
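To make this division of labor concrete, here is a minimal sketch of what the detector hands over and what remains to be predicted. The names and feature dimensions below are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DetectedEntity:
    """One detection handed to us by the off-the-shelf object detector."""
    box: np.ndarray       # (4,) bounding box: x1, y1, x2, y2
    label: str            # predicted class, e.g. "person" or "cup"
    score: float          # detector confidence
    roi_feat: np.ndarray  # RoI-pooled appearance feature, e.g. shape (2048,)


@dataclass
class HOICandidate:
    """A <human, predicate, object> hypothesis; our model fills in predicate_probs."""
    human: DetectedEntity
    obj: DetectedEntity
    predicate_probs: np.ndarray = None  # (num_predicates,) once scored
```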
The lack of annotated training data is a major issue for HOI detection. The popular HICO-Det dataset [5] covers 80 object classes and 117 predicate (interaction) types, which means there are $80 \times 117 = 9{,}360$ possible HOI classes. (This is an upper bound; in practice the number is lower because not every predicate applies to every object.) Collecting annotated data for such a large number of HOI categories is time-consuming and might be prohibitively expensive; HICO-Det provides annotations for only 600 HOI triplet categories. Our work addresses this issue.
Our approach is based on the idea of functional similarity between objects. Humans appear similar while interacting with functionally similar objects. For example, in the second row in the image below, all three persons might be drinking from a can, a glass, or a cup. Similarly, in the last row, the three people might be eating either a donut, a muffin, or a slice of cake. Any of these objects could be involved in the interaction. We call such groups of objects functionally similar.
Functional similarity between objects
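One convenient way to think about such groups is as a mapping from each object class to its functionally similar classes. The groups below are illustrative examples based on the figure, not the actual grouping used by our model:

```python
# Hypothetical functional-similarity groups, for illustration only.
FUNCTIONAL_GROUPS = {
    "drinkable_container": ["cup", "glass", "mug", "bottle", "can"],
    "handheld_food": ["donut", "muffin", "cake"],
}

# Reverse index: object class -> other classes in the same functional group.
SIMILAR_OBJECTS = {
    obj: [other for other in group if other != obj]
    for group in FUNCTIONAL_GROUPS.values()
    for obj in group
}

print(SIMILAR_OBJECTS["glass"])  # ['cup', 'mug', 'bottle', 'can']
```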
The core idea is that annotated data for an object can be generalized to functionally similar objects. To achieve this, we introduce a functional generalization module: a simple multi-layer perceptron (MLP) with two fully-connected layers. It takes as input the human and object bounding boxes and the corresponding class labels from the object detector, together with the human RoI-pooled visual features as a representation of the appearance of the human involved in the HOI. The final output is the probability of each predicate.
Network architecture
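A minimal PyTorch sketch of such a module is shown below. It assumes the class labels are encoded as word embeddings and the two boxes as a short geometric feature vector; the input encoding, layer sizes, and output activation here are assumptions for illustration, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn


class FunctionalGeneralizationModule(nn.Module):
    """Two fully-connected layers mapping detector outputs to predicate probabilities."""

    def __init__(self, human_feat_dim=2048, class_emb_dim=300,
                 box_pair_dim=8, hidden_dim=512, num_predicates=117):
        super().__init__()
        in_dim = human_feat_dim + 2 * class_emb_dim + box_pair_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_predicates),
        )

    def forward(self, human_feat, human_cls_emb, obj_cls_emb, box_pair):
        # Concatenate human appearance, class embeddings, and box geometry.
        x = torch.cat([human_feat, human_cls_emb, obj_cls_emb, box_pair], dim=-1)
        # Independent per-predicate probabilities (predicates are not exclusive).
        return torch.sigmoid(self.mlp(x))
```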
Given an annotated HOI, we can replace the object with functionally similar objects to generate more data. For example, consider $\texttt{<human, drink_with, glass>}$: the object ($\texttt{glass}$) can be replaced by $\texttt{bottle}$, $\texttt{mug}$, $\texttt{cup}$, and so on. This lets us generate training instances of categories that differ from the given training sample.
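A rough sketch of this augmentation, reusing the kind of hypothetical similarity mapping shown earlier (the exact sampling procedure in the paper differs):

```python
# A mapping like SIMILAR_OBJECTS from the earlier sketch (trimmed here so the
# example is self-contained).
SIMILAR_OBJECTS = {"glass": ["cup", "mug", "bottle", "can"]}


def augment_hoi(human_box, predicate, obj_label, obj_box,
                similar_objects=SIMILAR_OBJECTS):
    """Generate extra triplets by swapping in functionally similar object labels."""
    samples = [(human_box, predicate, obj_label, obj_box)]
    for similar_label in similar_objects.get(obj_label, []):
        # Boxes and predicate stay the same; only the object class is relabeled.
        samples.append((human_box, predicate, similar_label, obj_box))
    return samples


# <human, drink_with, glass> also yields drink_with triplets for cup, mug, bottle, can.
augmented = augment_hoi(human_box=(12, 40, 118, 310), predicate="drink_with",
                        obj_label="glass", obj_box=(60, 90, 84, 140))
print(len(augmented))  # 5
```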
This generalization approach is particularly useful for detecting rare and non-annotated classes. The following images show some detections generated by our model in the zero-shot HOI detection setting. We did not use any annotated data for the HOI triplets shown in the images.
Our model can even detect interactions involving objects for which no annotated triplets were available during training, because the generalization module transfers knowledge from functionally similar object classes to these unseen ones.
Detections for zero-shot categories in the unseen-object setting.
For more details and an in-depth analysis, see our paper: Detecting Human-Object Interactions via Functional Generalization.
References
[1] Girshick, Ross. "Fast R-CNN." In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448. 2015.
[2] He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. "Mask R-CNN." In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969. 2017.
[3] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "SSD: Single shot multibox detector." In European Conference on Computer Vision, pp. 21-37. Springer, Cham, 2016.
[4] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You only look once: Unified, real-time object detection." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788. 2016.
[5] Chao, Yu-Wei, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. "Learning to detect human-object interactions." In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 381-389. IEEE, 2018.