Robust Learning Architectures for Perceiving Object Semantics and Geometry

Chi LI, Johns Hopkins University

Parsing object semantics and geometry in a scene is a core task in visual understanding. This includes localizing an object, classifying its identity, estimating its orientation, and parsing its 3D shape structure. With the emergence of deep convolutional architectures in recent years, substantial progress has been made on large-scale vision problems such as image classification. However, some fundamental challenges remain. First, creating object representations that are robust to changes in viewpoint while capturing local visual details continues to be a problem. Second, deep Convolutional Neural Networks (CNNs) are purely driven by data and predominantly pose the scene interpretation problem as an end-to-end black-box mapping. However, decades of work on perceptual organization in both human and machine vision suggest that there are often intermediate representations that are intrinsic to an inference task, and which provide essential structure to improve generalization.

In this dissertation, we present two methodologies to surmount the two issues above. We first introduce a multi-domain pooling framework which groups local visual signals within generic feature spaces that are invariant to 3D object transformation, thereby reducing the sensitivity of the output features to spatial deformations. Next, we explore an approach for injecting prior domain structure into neural network training, which leads a CNN to recover a sequence of intermediate milestones towards the final goal. We implement this deep supervision framework with a novel CNN architecture which is trained on synthetic images only and achieves state-of-the-art performance in 2D/3D keypoint localization on real-image benchmarks. Finally, the proposed deep supervision scheme also motivates an approach for accurately inferring the six Degree-of-Freedom (6-DoF) pose of a large number of object classes from single or multiple views.
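The deep supervision idea above can be illustrated with a minimal sketch: attach an auxiliary loss to an intermediate layer (a "milestone" such as a 2D keypoint representation) in addition to the final task loss, and train on their weighted sum. This is a toy NumPy illustration of the loss structure only, not the dissertation's actual architecture; the layer sizes, targets, and weight `lam` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(pred, target):
    """Mean squared error between a prediction and its target."""
    return float(np.mean((pred - target) ** 2))

# Toy two-stage network: input -> intermediate "milestone" -> final output.
W1 = rng.normal(size=(8, 4))   # hypothetical first-stage weights
W2 = rng.normal(size=(4, 2))   # hypothetical second-stage weights

x = rng.normal(size=(8,))
hidden = np.tanh(x @ W1)       # intermediate representation to be supervised
output = hidden @ W2           # final prediction

# Deep supervision: the intermediate layer gets its own target (e.g. an
# intermediate keypoint representation), alongside the final task target.
hidden_target = rng.normal(size=(4,))  # hypothetical milestone labels
final_target = rng.normal(size=(2,))   # hypothetical task labels

aux_loss = mse(hidden, hidden_target)
main_loss = mse(output, final_target)

lam = 0.5  # weight on the auxiliary (intermediate) supervision term
total_loss = main_loss + lam * aux_loss
```

Gradient descent on `total_loss` then pushes the hidden layer toward the prescribed intermediate milestone rather than leaving it as an unconstrained black-box representation.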

Speaker Biography

Chi Li is a Ph.D. candidate primarily advised by Dr. Greg Hager. He received his B.E. from the Cognitive Science Department at Xiamen University in 2012, where he became interested in computer vision. His research mainly focuses on visual understanding of object properties, from semantic class to 3D pose and structure. In particular, he is interested in leveraging scene geometry to enhance deep learning techniques for 2D/3D/multi-view perception. During his Ph.D., he also gained industrial experience through three research internships with Apple, NEC Laboratories America, and Microsoft Research.

Professional plans after Hopkins: I am going to join Apple and continue my research on 2D/3D visual perception after graduation.