Sunday, December 24, 2017

Notes from ICCV 2017

Top Acceptance Rates
Video and Language - 53.8%
Autonomous Driving - 50%
Large-scale Optimization - 45%

Total - 29%


Favourite Papers on Video and Recognition

Video

A Read-Write Memory Network for Movie Story Understanding
 - Question-answering task for large-scale, multimodal movie story understanding

Temporal Tessellation: A Unified Approach for Video Analysis
 - General approach to video understanding inspired by semantic transfer techniques
 - A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video.
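A minimal nearest-neighbour sketch of that clip-matching step, assuming test and reference clips are already embedded in a shared space (the cosine-similarity choice and all names are illustrative; the paper additionally enforces temporal coherence across neighbouring clips):

```python
import numpy as np

def transfer_semantics(test_clips, ref_clips, ref_semantics):
    """test_clips: (T, d) clip embeddings of the test video,
    ref_clips: (R, d) clip embeddings of the reference videos,
    ref_semantics: list of R known semantics (e.g. captions), one per reference clip."""
    # L2-normalise so the dot product below is cosine similarity
    t = test_clips / np.linalg.norm(test_clips, axis=1, keepdims=True)
    r = ref_clips / np.linalg.norm(ref_clips, axis=1, keepdims=True)
    sim = t @ r.T                       # (T, R) clip-to-clip similarities
    best = sim.argmax(axis=1)           # best-matching reference clip per test clip
    return [ref_semantics[i] for i in best]
```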

Unsupervised Action Discovery and Localization in Videos
 - Training data: unlabeled videos without bounding box annotations
 - The proposed approach (a) discovers action class labels and (b) spatio-temporally localizes actions in videos

Dense-Captioning Events in Videos
 - Introduce the task of dense-captioning events, which involves both detecting and describing events in a video.
 - Identify all events in a single pass of the video
 - Introduce a variant of an existing proposal module designed to capture both short events and long events that span minutes
 - Introduce a new captioning module that uses contextual information from past and future events to jointly describe all events.
 - New dataset - ActivityNet Captions - 849 video hours with 100k total descriptions

Learning long-term dependencies for action recognition with a biologically-inspired deep network
 - Biological neural systems are typically composed of both feedforward and feedback connections
 - shuttleNet consists of several processors, each of which is a GRU associated with multiple groups of hidden states
 - All processors inside shuttleNet are connected in a loop to mimic the brain's feedforward and feedback connections, and the processors are shared across the multiple pathways formed by the loop

Compressive Quantization for Fast Object Instance Search in Videos
 - Object instance search in videos, where efficient point-to-set (image-to-video) matching is essential
 - A jointly optimized vector quantization method compresses the M object proposals extracted from each video into only k binary codes, where k << M
 - Similarity between the query object and the whole video can be determined by the Hamming distance between the query's binary code and the video's best-matched binary code
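A small sketch of that point-to-set matching step, assuming the k binary codes per video and the query code have already been produced by the (not shown) compressive quantization; all names here are illustrative:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary codes given as 0/1 uint8 arrays."""
    return int(np.count_nonzero(a != b))

def video_distance(query_code, video_codes):
    """Point-to-set distance: the query code is compared against the k binary
    codes summarising one video, and the best-matched code determines the score."""
    return min(hamming(query_code, c) for c in video_codes)

def rank_videos(query_code, database):
    """database: list of (video_id, list_of_k_codes); returns ids sorted by distance."""
    scored = sorted((video_distance(query_code, codes), vid) for vid, codes in database)
    return [vid for _, vid in scored]
```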

Complex Event Detection by Identifying Reliable Shots From Untrimmed Videos
 - Formulate event detection as a multiple-instance learning (MIL) problem by taking each video as a bag and the shots in each video as instances
 - New MIL method, which simultaneously learns a linear SVM classifier and infers a binary indicator for each instance in order to select reliable training instances from each positive or negative bag
 - The objective function balances the weighted training errors against an l1-l2 mixed-norm regularization term that adaptively selects reliable shots that are as diverse as possible (simplified sketch below)
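The paper's mixed-norm objective is not reproduced here; the snippet below is only a much simplified, MI-SVM-style alternation that conveys the idea of jointly fitting a linear SVM and selecting reliable shots from positive bags (the sklearn usage and all names are my own illustrative choices):

```python
import numpy as np
from sklearn.svm import LinearSVC

def mil_train(bags, bag_labels, n_iters=5):
    """bags: list of (n_i, d) arrays of shot features; bag_labels: +1/-1 per video.
    Alternates between fitting a linear SVM and re-selecting the highest-scoring
    shot of each positive bag as that bag's representative instance."""
    # initialise each positive bag with its mean shot; negatives contribute all shots
    reps = [b.mean(axis=0) for b in bags]
    for _ in range(n_iters):
        X = np.vstack([reps[i] if bag_labels[i] > 0 else bags[i]
                       for i in range(len(bags))])
        y = np.concatenate([[bag_labels[i]] if bag_labels[i] > 0
                            else [bag_labels[i]] * len(bags[i])
                            for i in range(len(bags))])
        clf = LinearSVC(C=1.0).fit(X, y)
        # re-select the most confident shot in each positive bag
        for i, b in enumerate(bags):
            if bag_labels[i] > 0:
                reps[i] = b[np.argmax(clf.decision_function(b))]
    return clf
```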

Spatio-Temporal Person Retrieval via Natural Language Queries
 - Person retrieval from multiple videos
 - Output a tube which encloses the person described by the query
 - New dataset
 - Design a model that combines methods for spatio-temporal human detection and multimodal retrieval

Joint Discovery of Object States and Manipulation Actions
 - Automatically discover the states of objects and the associated manipulation actions
 - Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions

Pixel-Level Matching for Video Object Segmentation using Convolutional Neural Networks
 - The network aims to distinguish the target area from the background on the basis of the pixel-level similarity between two object units
 - The proposed network represents a target object using features from different depth layers in order to take advantage of both the spatial details and the category-level semantic information
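A rough, non-learned stand-in for that pixel-level similarity idea; the paper computes this matching inside a trained CNN, so treat the function below purely as an illustration of the scoring step:

```python
import numpy as np

def pixel_similarity_map(target_feats, frame_feats):
    """target_feats: (P, d) features of the pixels inside the target object,
    frame_feats: (H, W, d) per-pixel features of the current frame.
    Each frame pixel is scored by its best cosine similarity to any target pixel."""
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    f = frame_feats / np.linalg.norm(frame_feats, axis=2, keepdims=True)
    sim = f.reshape(-1, f.shape[-1]) @ t.T              # (H*W, P) similarities
    return sim.max(axis=1).reshape(frame_feats.shape[:2])  # (H, W) similarity map

# To mimic the multi-depth representation, the per-pixel features could simply
# concatenate a shallow layer (spatial detail) with a deep layer (category-level
# semantics), e.g.:
#   feats = np.concatenate([shallow_layer_feats, deep_layer_feats], axis=-1)
```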

Joint Detection and Recounting of Abnormal Events by Learning Deep Generic Knowledge
 - Recounting of abnormal events - explaining why they are judged to be abnormal
 - Integrate a generic CNN model and environment-dependent anomaly detectors
 - Learn a CNN with multiple visual tasks to exploit semantic information that is useful for detecting and recounting abnormal events
 - The learned model is then appropriately plugged into the environment-dependent anomaly detectors

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals
 - TURN jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression
 - Fast computation is enabled by unit feature reuse: a long untrimmed video is decomposed into video units, which are reused as basic building blocks of temporal proposals
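Roughly, the unit-feature-reuse and coordinate-regression ideas look like this (the context size, unit length, mean pooling, and all names are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

def clip_feature(unit_feats, start_u, end_u, ctx=4):
    """unit_feats: (N, d) per-unit features computed once for the whole video.
    A candidate clip [start_u, end_u) is described by mean-pooling its internal
    units plus context units on either side; no frame-level features are
    recomputed per proposal (unit feature reuse)."""
    inside = unit_feats[start_u:end_u].mean(axis=0)
    left = (unit_feats[max(0, start_u - ctx):start_u].mean(axis=0)
            if start_u > 0 else np.zeros_like(inside))
    right = (unit_feats[end_u:end_u + ctx].mean(axis=0)
             if end_u < len(unit_feats) else np.zeros_like(inside))
    return np.concatenate([left, inside, right])

def refine_boundaries(start_u, end_u, offsets, unit_len=16):
    """offsets = (d_start, d_end) predicted by the regression head, in units;
    temporal coordinate regression shifts the clip boundaries accordingly
    and converts them to frame indices."""
    d_start, d_end = offsets
    return (start_u + d_start) * unit_len, (end_u + d_end) * unit_len
```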

Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
 - Zero-shot localization and classification of human actions in video
 - Spatial-aware object embedding
 - Build embedding on top of freely available actor and object detectors
 - Exploit the object positions and sizes in the spatial-aware embedding to demonstrate a new spatio-temporal action retrieval scenario with composite queries

Temporal Dynamic Graph LSTMs for Action-Driven Video Object Detection
 - Weakly supervised object detection from videos
 - Use action descriptions as supervision
 - But objects of interest that are not involved in human actions are often absent from global action descriptions
 - Propose a temporal dynamic graph LSTM (TD-Graph LSTM), which enables global temporal reasoning by constructing a dynamic graph, based on temporal correlations of object proposals, that spans the entire video
 - The missing label issue for each individual frame can thus be significantly alleviated by transferring knowledge across correlated object proposals in the whole video


Recognition

Open Set Domain Adaptation
 - Domain adaptation in open sets - only a few categories of interest are shared between source and target data
 - The proposed method fits in both closed and open set scenarios
 - The approach learns a mapping from the source to the target domain by jointly solving an assignment problem that labels those target instances that potentially belong to the categories of interest present in the source dataset
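The actual method alternates between solving a joint assignment problem and re-estimating the source-to-target mapping; the snippet below only caricatures the labelling step with a nearest-class-centre heuristic plus an "unknown" rejection, with all names and the threshold being my own assumptions:

```python
import numpy as np

def label_target_instances(target_feats, source_centers, unknown_cost=1.0):
    """target_feats: (N, d) features of (mapped) target instances;
    source_centers: dict {class_name: (d,) mean feature of that source class}.
    Each target instance is assigned to the nearest shared source class, or to
    'unknown' when no class is closer than the outlier cost."""
    names = list(source_centers)
    centers = np.stack([source_centers[c] for c in names])                    # (C, d)
    dists = np.linalg.norm(target_feats[:, None, :] - centers[None], axis=2)  # (N, C)
    labels = []
    for row in dists:
        c = int(row.argmin())
        labels.append(names[c] if row[c] < unknown_cost else "unknown")
    return labels
```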

FoveaNet: Perspective-aware Urban Scene Parsing
 - Estimate the perspective geometry of a scene image through a convolutional network which integrates supportive evidence from contextual objects within the image
 - FoveaNet "undoes" the camera perspective projection - analyzing regions in the space of the actual scene, and thus provides much more reliable parsing results
 - Introduce a new dense CRFs model that takes the perspective geometry as a prior potential

Generative Modeling of Audible Shapes for Object Perception
 - Present a novel, open-source pipeline that generates audio-visual data purely from 3D shapes and their physical properties
 - Synthetic audio-visual dataset - Sound-20K for object perception tasks
 - Auditory and visual information play complementary roles in object perception, and the representation learned on synthetic audio-visual data can transfer to real-world scenarios

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation
 - Transfer human supervision between the previously separate tasks

 - Establishing semantic correspondences between images depicting different instances of the same object or scene category
 - CNN architecture for learning a geometrically plausible model for semantic correspondence
 - Uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function

 - Real-world noisy labels exhibit multimodal characteristics, like the true labels, rather than behaving like independent random outliers
 - Propose a unified distillation framework to use “side” information, including a small clean dataset and label relations in knowledge graph, to “hedge the risk” of learning from noisy labels.
 - Propose a suite of new benchmark datasets

 - Presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues
 - Model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions
 - Special attention is given to relationships between people and mentions of clothing or body parts, as they are useful for distinguishing individuals
 - Automatically learn weights for combining these cues and at test time, perform joint inference over all phrases in a caption

 - Leverage the strong correlations between the predicate and the <subj, obj> pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects
 - Use knowledge of linguistic statistics to regularize visual model learning
 - Obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a <subj, obj> pair
 - Distill this knowledge into the deep model to achieve better generalization
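The "internal knowledge" statistics are straightforward to compute; here is a minimal counting sketch of P(predicate | subj, obj) (the distillation into the deep model is not shown, and the function name is illustrative):

```python
from collections import Counter, defaultdict

def predicate_distribution(triplets):
    """triplets: iterable of (subj, predicate, obj) tuples, mined from training
    annotations or external text. Returns P(predicate | subj, obj) as a dict
    keyed by (subj, obj), with a {predicate: probability} dict per pair."""
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred] += 1
    return {pair: {p: n / sum(preds.values()) for p, n in preds.items()}
            for pair, preds in counts.items()}

# predicate_distribution([("person", "rides", "horse"), ("person", "feeds", "horse")])
# -> {("person", "horse"): {"rides": 0.5, "feeds": 0.5}}
```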

 - Introduce an end-to-end multi-task objective that jointly learns object-action relationships
 - Proposed architecture can be used for zero-shot learning of actions

 - Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer
 - Both the program generator and the execution engine are implemented by neural networks, and are trained using a combination of backpropagation and REINFORCE
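Because sampling a discrete program is non-differentiable, the generator gets its learning signal from REINFORCE while the execution engine is trained with ordinary backpropagation. A minimal sketch of the REINFORCE term (tensor shapes, the constant baseline, and all names are assumptions, not the paper's exact training recipe):

```python
import torch

def reinforce_loss(program_log_probs, rewards, baseline=0.0):
    """program_log_probs: (B,) summed log-probabilities of the sampled programs
    under the program generator; rewards: (B,) e.g. 1.0 when the execution
    engine answers correctly, else 0.0. Minimising this term follows the
    REINFORCE estimator: grad = -(reward - baseline) * d log p(program) / d theta."""
    advantage = (rewards - baseline).detach()   # no gradient through the reward
    return -(advantage * program_log_probs).mean()
```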