Understanding objects and their behavior from images and videos is a difficult inverse problem. It requires learning a metric in image space that reflects object relations in the real world. This metric learning problem calls for large volumes of training data. While images and videos are easily available, labels are not, thus motivating self-supervised metric and representation learning. Furthermore, I will present a widely applicable strategy based on deep reinforcement learning to improve the surrogate tasks underlying self-supervision. Thereafter, the talk will cover the learning of disentangled representations that explicitly separate different object characteristics. Our approach is based on an analysis-by-synthesis paradigm and can generate novel object instances with flexible changes to individual characteristics such as their appearance and pose. It addresses diverse applications in human and animal behavior analysis, a topic on which we collaborate intensively with neuroscientists. Time permitting, I will discuss the disentangling of representations from a wider perspective, including novel strategies for image stylization and for regularizing the latent space of generator networks.
Organizers: Joel Janai
In the past few years, with the advent of Deep Convolutional Neural Networks (DCNNs) and the availability of visual data, it has been shown that excellent results can be produced in very challenging tasks such as visual object recognition, detection, and tracking. Nevertheless, in certain tasks such as fine-grained object recognition (e.g., face recognition), it is very difficult to collect the amount of data that is needed. In this talk, I will show how, using DCNNs, we can generate highly realistic faces and heads and use them for training algorithms such as face and facial expression recognition. Next, I will reverse the problem and demonstrate how a very powerful trained face recognition network can be used to perform very accurate 3D shape and texture reconstruction of faces from a single image. Finally, I will demonstrate how to create very lightweight networks for representing 3D face texture and shape structure by capitalising upon intrinsic mesh convolutions.
Organizers: Dimitris Tzionas
In this talk, I will present my understanding of 3D face reconstruction, modelling and applications from a deep learning perspective. In the first part of my talk, I will discuss the relationship between representations (point clouds, meshes, etc.) and network layers (CNN, GCN, etc.) in the face reconstruction task, then present my ECCV work PRN, which proposed a new representation that helps achieve state-of-the-art performance on face reconstruction and dense alignment tasks. I will also introduce my open source project face3d, which provides examples for generating different 3D face representations. In the second part of the talk, I will discuss some publications on integrating 3D techniques into deep networks, then introduce my upcoming work, which implements this. In the third part, I will present how related tasks can promote each other in deep learning, including face recognition for the face reconstruction task and face reconstruction for the face anti-spoofing task. Finally, building on these three parts, I will present my plans for 3D face modelling and applications.
Organizers: Timo Bolkart
Much existing work in reinforcement learning involves environments that are either intentionally neutral, lacking a role for cooperation and competition, or intentionally simple, in that agents need imagine nothing more than that they are playing versions of themselves. Richer game theoretic notions become important as these constraints are relaxed. For humans, this encompasses issues that concern utility, such as envy and guilt, and that concern inference, such as recursive modeling of other players. I will discuss studies treating a paradigmatic game of trust as an interactive partially-observable Markov decision process, and will illustrate the solution concepts with evidence from interactions between various groups of subjects, including those diagnosed with borderline and anti-social personality disorders.
Object grasping and manipulation is a crucial part of daily human activities. The study of these actions represents a central component in the development of systems that attempt to understand human activities and robots that are able to act in human environments.
Three essential parts of this problem are tackled in this talk: the perception of the human hand in interaction with objects, the modeling of human grasping actions and the refinement of the execution of a robotic grasp. The estimation of the human hand pose is carried out with a markerless visual system that performs in real time under object occlusions. Low-dimensional models of various grasping actions are created by exploiting the correlations between different hand joints in a non-linear manner with Gaussian Process Latent Variable Models (GPLVM). Finally, robot grasping actions are perfected by exploiting the appearance of the robot during action execution.
Enabling computers to understand human behavior has the potential to revolutionize many areas that benefit society such as clinical diagnosis, human computer interaction, and social robotics. A critical element in the design of any behavioral sensing system is to find a good representation of the data for encoding, segmenting, classifying and predicting subtle human behavior.
In this talk I will propose several extensions of Component Analysis (CA) techniques (e.g., kernel principal component analysis, support vector machines, spectral clustering) that are able to learn spatio-temporal representations or components useful in many human sensing tasks. In the first part of the talk I will give an overview of several ongoing projects in the CMU Human Sensing Laboratory, including our current work on depression assessment from video, as well as hot-flash detection from wearable sensors.
In the second part of the talk I will show how several extensions of the CA methods outperform state-of-the-art algorithms in problems such as temporal alignment of human motion, temporal segmentation/clustering of human activities, joint segmentation and classification of human behavior, facial expression analysis, and facial feature detection in images. The talk will be adaptive, and I will discuss the topics of major interest to the audience.
Semiconductor light emitters, based on single crystal epitaxial inorganic semiconductor heterostructures, are ubiquitous. In spite of their extraordinary versatility and technological maturity, covering the full visible spectrum seamlessly with a single material system for red, green, and blue (RGB) nonetheless remains an elusive challenge.
Semiconductor nanocrystals, or quantum dots (QDs), synthesized by solution-based methods of colloidal chemistry represent a strongly contrasting basis for active optical materials. While possessing an ability to absorb and efficiently luminesce across the RGB by simple nanocrystal particle size control within a single material system, these preparations have yet to make a significant impact as viable light emitting devices, mainly due to the difficulties in casting such materials from their natural habitat, that is “from the chemist’s bottle” to a useful solid thin film form for device use. In this presentation we show how II-VI compound nanocrystals can be transitioned to solid templates with targeted spatial control and placeme
Invasive access by microprobe arrays inserted safely into the brain is now enabling us to “listen” to local neural circuits at levels of spatial and temporal detail that, in addition to enriching fundamental brain science, have led to the possibility of a new generation of neurotechnologies to overcome disabilities due to a range of neurological injuries in which pathways from the brain to the rest of the central and peripheral nervous systems have been injured or severed. In this presentation we discuss the biomedical engineering challenges and opportunities with these incipient technologies, with emphasis on implantable wireless neural interfaces for communicating with the brain.
A second topic, related to the possibility of sending direct inputs of information back to the brain by implanted devices, is also explored, focusing on recently discovered means to render selected neural cell types and microcircuits light-sensitive, following local microbiologically induced conditioning.
The scope of this work is hand-object interaction. As a starting point, we observe hands manipulating objects and derive information based on computer vision methods. After considering hands and objects in isolation, we focus on the inherent interdependencies. One application of the gained knowledge is the synthesis of interactive hand motion for animated sequences.
Vision is a crucial sense for computational systems to interact with their environments as biological systems do. A major task is interpreting images of complex scenes, by recognizing and localizing objects, persons and actions. This involves learning a large number of visual models, ideally autonomously.
In this talk I will present two ways of reducing the amount of human supervision required by this learning process. The first way is labeling images only by the object class they contain. Learning from cluttered images is very challenging in this weakly supervised setting. In the traditional paradigm, each class is learned starting from scratch. In our work instead, knowledge generic over classes is first learned during a meta-training stage from images of diverse classes with given object locations, and is then used to support learning any new class without location annotation. Generic knowledge helps because during meta-training the system can learn about localizing objects in general. As demonstrated experimentally, this approach enables learning from more challenging images than possible before, such as those of PASCAL VOC 2007, which contain extensive clutter and large scale and appearance variations between object instances.
The second way is the analysis of news items consisting of images and text captions. We associate names and action verbs in the captions to the face and body pose of the persons in the images. We introduce a joint probabilistic model for simultaneously recovering image-caption correspondences and learning appearance models for the face and pose classes occurring in the corpus. As demonstrated experimentally, this joint 'face and pose' model solves the correspondence problem better than earlier models covering only the face.
I will conclude with an outlook on the idea of visual culture, where new visual concepts are learned incrementally on top of all visual knowledge acquired so far. Besides generic knowledge, visual culture also includes knowledge specific to a class, knowledge of scene structures and other forms of visual knowledge. Potentially, this approach could considerably extend current visual recognition capabilities and produce an integrated body of visual knowledge.
I will survey our work on tracking and measurement, waypoints on the path to activity recognition and understanding, in sports video, highlighting some of our recent work on rectification and player tracking, not just in hockey but more recently in basketball, where we have addressed player identification both in a fully supervised and semi-supervised manner.
Methods for visual recognition have made dramatic strides in recent years on various online benchmarks, but performance in the real world still often falters. Classic gradient-histogram models make overly simplistic assumptions regarding image appearance statistics, both locally and globally. Recent progress suggests that new learning-based representations can improve recognition by devices that are embedded in a physical world.
I'll review new methods for domain adaptation which capture the visual domain shift between environments, and improve recognition of objects in specific places when trained from generic online sources. I'll discuss methods for cross-modal semi-supervised learning, which can leverage additional unlabeled modalities in a test environment.
Finally, as time permits, I'll present recent results on learning hierarchical local image representations based on recursive probabilistic topic models, on learning strong object color models from sets of uncalibrated views using a new multi-view color constancy paradigm, and/or on monocular estimation of grasp affordances.
In the first part of the talk, I will describe methods that learn a single family of detectors for object classes that exhibit large within-class variation. One common solution is to use a divide-and-conquer strategy, where the space of possible within-class variations is partitioned, and different detectors are trained for different partitions.
However, these discrete partitions tend to be arbitrary in continuous spaces, and the classifiers have limited power when there are too few training samples in each subclass. To address this shortcoming, explicit feature sharing has been proposed, but it also makes training more expensive. We show that foreground-background classification (detection) and within-class classification of the foreground class (pose estimation) can be jointly solved in a multiplicative form of two kernel functions. One kernel measures similarity for foreground-background classification. The other kernel accounts for latent factors that control within-class variation and implicitly enables feature sharing among foreground training samples. The multiplicative kernel formulation enables feature sharing implicitly; the solution for the optimal sharing is a byproduct of SVM learning.
The resulting detector family is tuned to specific variations in the foreground. The effectiveness of this framework is demonstrated in experiments that involve detection, tracking, and pose estimation of human hands, faces, and vehicles in video.
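The multiplicative form of two kernels described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speaker's implementation: both kernels are taken to be Gaussian RBF kernels, the latent within-class factor is a one-dimensional pose value, and the names `rbf_kernel`, `multiplicative_kernel`, `gamma_x` and `gamma_theta` are hypothetical. The elementwise product of two positive semidefinite Gram matrices is again positive semidefinite (Schur product theorem), so the result could in principle be passed to any SVM solver that accepts a precomputed kernel.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian RBF Gram matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def multiplicative_kernel(x_a, theta_a, x_b, theta_b, gamma_x=0.5, gamma_theta=2.0):
    """Product of an appearance kernel and a latent-factor kernel.

    x_*     : image features (assumed: one row per sample)
    theta_* : latent within-class factors such as pose (assumed: one row per sample)
    """
    k_x = rbf_kernel(x_a, x_b, gamma_x)                  # foreground/background similarity
    k_theta = rbf_kernel(theta_a, theta_b, gamma_theta)  # similarity of within-class factors
    return k_x * k_theta                                 # elementwise (Schur) product


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))       # hypothetical appearance features
poses = rng.uniform(-1, 1, size=(6, 1))  # hypothetical pose parameter per sample
gram = multiplicative_kernel(features, poses, features, poses)
```

In this sketch, two samples only count as similar when both their appearance and their latent factor are close, which is one way to read the claim that feature sharing among foreground samples arises implicitly rather than through explicit partitions.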