We propose a geometric approach to articulated tracking, where the human pose representation is expressed on the Riemannian manifold of joint positions. This is in contrast to conventional methods where the problem is phrased in terms of intrinsic parameters of the human pose. Our model is based on a physically natural metric that also has strong links to neurological models of human motion planning. Some benefits of the model are that it allows for easy modeling of interaction with the environment, for data-driven optimization schemes, and for well-posed low-pass filtering properties.
To apply the Riemannian model in practice, we derive simulation schemes for Brownian motion on manifolds as well as computationally efficient approximation schemes. The resulting algorithms seem to outperform gold standards both in terms of accuracy and running times.
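As a rough illustration of simulating Brownian motion on a manifold, the sketch below runs an approximate random walk on the unit sphere: each step draws Gaussian noise, projects it onto the tangent space at the current point, and maps the result back onto the sphere by renormalizing. The sphere, the projection-based step, and the names `brownian_step` and `simulate` are assumptions chosen for illustration, not the schemes derived in the talk.

```python
import math
import random

def brownian_step(x, step_std, rng):
    """One approximate Brownian step on the unit sphere (hypothetical helper)."""
    noise = [rng.gauss(0.0, step_std) for _ in x]
    # Remove the radial component so the step lies in the tangent space at x.
    radial = sum(n * xi for n, xi in zip(noise, x))
    tangent = [n - radial * xi for n, xi in zip(noise, x)]
    # Move in the ambient space, then project back onto the sphere.
    y = [xi + ti for xi, ti in zip(x, tangent)]
    norm = math.sqrt(sum(yi * yi for yi in y))
    return [yi / norm for yi in y]

def simulate(x0, n_steps, step_std, seed=0):
    """Return a path of n_steps + 1 points starting at x0."""
    rng = random.Random(seed)
    path = [list(x0)]
    for _ in range(n_steps):
        path.append(brownian_step(path[-1], step_std, rng))
    return path

path = simulate([0.0, 0.0, 1.0], n_steps=100, step_std=0.05)
```

Small values of `step_std` keep the projection error low; a pose manifold with fixed bone lengths could be treated, under the same assumption, as a product of such spheres.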
Organizers: Michel Besserve
A pure refinement procedure for non-rigid registration can be highly effective for establishing dense correspondences between pairs of scanned data, even for significant deformations. I will explain how to design robust non-rigid algorithms and why it is important to couple the optimization of correspondence positions, warping field, and overlapping regions. I will show several applications where it has been successfully applied ranging from film/game production to radiation oncology. One particular interest of mine is facial animation. I will present a fully integrated system for real-time facial performance capture and expression transfer and give a live demo of our latest technology, faceshift. At the end of the talk I
Organizers: Gerard Pons-Moll
Many machine vision/image processing algorithms are designed to be real-time and fully automatic. These attributes are essential, e.g., for stereo robotics vision applications. Visual Effects Studios, however, possess giant server farms and command armies of artists to perform intelligent initialization or provide guidance to algorithms. On the other hand, motion pictures have very high accuracy requirements, and the ability to influence an algorithm manually is often more important than other factors generally considered crucial in Academia. In this talk I will highlight some scenarios where Academia and the Visual Effects industry disagree.
In the era of perpetually increasing computational capabilities, multi-camera acquisition systems are being increasingly used to capture parameterization-free articulated 3D shapes. These systems allow marker-less shape acquisition and are useful for a wide range of applications in the entertainment, sports, and surveillance industries, as well as in interactive and augmented reality systems. The availability of vast amounts of 3D shape data has increased interest in 3D shape analysis methods. Segmentation and matching are two important shape analysis tasks. 3D shape segmentation is a subjective task that involves dividing a given shape into constituent parts by assigning each part a unique segment label.
In the case of 3D shape matching, a dense vertex-to-vertex correspondence between two shapes is desired. However, 3D shape analysis is particularly difficult in the case of articulated shapes due to complex kinematic poses. These poses induce self-occlusions and shadow effects which cause topological changes such as merging and splitting. In this work we propose robust segmentation and matching methods for articulated 3D shapes represented as mesh-graphs using graph spectral methods.
This talk is divided into two parts. Part one of the talk will focus on 3D shape segmentation, attempted both in an unsupervised and semi-supervised setting by analysing the properties of discrete Laplacian eigenspaces of mesh-graphs. In the second part, 3D shape matching is analysed in a multi-scale heat-diffusion framework derived from the Laplacian eigenspace. We believe that this framework is well suited to handle large topological changes and we substantiate our belief by showing promising results on various publicly available real mesh datasets.
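To make the link between the Laplacian eigenspace and multi-scale heat diffusion concrete, here is a minimal sketch that builds the heat kernel of a small graph from the eigen-decomposition of its combinatorial Laplacian. The toy path graph, the unweighted Laplacian, and the function name `heat_kernel` are assumptions for illustration; the talk's mesh-graphs and weighting may differ.

```python
import numpy as np

def heat_kernel(adjacency, t):
    """Heat kernel H(t) = U exp(-t * Lambda) U^T of the graph Laplacian L = D - A."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)  # L is symmetric, so eigh applies
    # Scale each eigenvector by exp(-t * eigenvalue), then recombine.
    return (eigvecs * np.exp(-t * eigvals)) @ eigvecs.T

# Toy example: a path graph with 4 vertices (hypothetical stand-in for a mesh-graph).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
H_small = heat_kernel(A, 0.1)   # small scale: heat stays near its source vertex
H_large = heat_kernel(A, 10.0)  # large scale: heat spreads over the whole graph
```

The scale parameter `t` controls locality: small values emphasize fine, local structure, while large values retain only the low-frequency eigenvectors.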
Organizers: Sebastian Trimpe
Capturing human motion or objects by vision technology has been intensively studied. Although humans interact very often with other persons or objects, most of the previous work has focused on capturing a single object or the motion of a single person. In this talk, I will highlight four projects that deal with human-human or human-object interactions. The first project addresses the problem of capturing skeleton and non-articulated cloth motion of two interacting characters. The second project aims to model spatial hand-object relations during object manipulation. In the third project, an affordance detector is learned from human-object interactions. The fourth project investigates how human motion can be exploited for object discovery from depth video streams.
In this talk I will present recent work on two different topics from low- and high-level computer vision: intrinsic image recovery and efficient object detection. By intrinsic image decomposition we refer to the challenging task of decoupling material properties from lighting properties given a single image. We propose a probabilistic model that incorporates previous attempts exploiting edge information and combines it with a novel prior on material reflectances in the image. This results in a random field model with global, latent variables and pixel-accurate output reflectance values. I will present experiments on a recently proposed ground-truth database.
The proposed model is found to outperform previous models. I will then discuss some possible future developments in this field. In the second part of the talk I will present an efficient object detection scheme that breaks the computational complexity of commonly used detection algorithms, e.g., sliding windows. We pose the detection problem naturally as a structured prediction problem for which we decompose the inference procedure into an adaptive best-first search.
This results in test-time inference that scales sub-linearly in the size of the search space, and detection usually requires fewer than 100 classifier evaluations. This paves the way for using strong (but costly) classifiers such as non-linear SVMs. The algorithmic properties are demonstrated using the VOC'07 dataset. This work is part of the Visipedia project, in collaboration with Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder and Pietro Perona.
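The idea of best-first search over a large detection space can be illustrated with a generic branch-and-bound over 1D intervals, in the spirit of subwindow search. This is a swapped-in textbook technique, not the speaker's algorithm: the per-position `weights`, the bound, and the name `best_interval` are all assumptions made for the sketch.

```python
import heapq

def best_interval(weights):
    """Best-first branch-and-bound for the max-sum inclusive interval [a, b]."""
    n = len(weights)
    pos, neg = [0.0], [0.0]  # prefix sums of positive and negative parts
    for w in weights:
        pos.append(pos[-1] + max(w, 0.0))
        neg.append(neg[-1] + min(w, 0.0))

    def bound(a_lo, a_hi, b_lo, b_hi):
        # Upper bound for every interval with a in [a_lo, a_hi], b in [b_lo, b_hi]:
        # positive mass of the largest interval plus negative mass of the smallest.
        ub = pos[b_hi + 1] - pos[a_lo]
        if a_hi <= b_lo:
            ub += neg[b_lo + 1] - neg[a_hi]
        return ub

    start = (0, n - 1, 0, n - 1)
    heap = [(-bound(*start), start)]
    expanded = 0  # number of candidate sets popped from the queue
    while heap:
        neg_ub, (a_lo, a_hi, b_lo, b_hi) = heapq.heappop(heap)
        expanded += 1
        if a_lo == a_hi and b_lo == b_hi:
            # A single interval: its bound is its exact score, so it is optimal.
            return (a_lo, b_lo), -neg_ub, expanded
        if a_hi - a_lo >= b_hi - b_lo:  # split the wider of the two ranges
            mid = (a_lo + a_hi) // 2
            children = [(a_lo, mid, b_lo, b_hi), (mid + 1, a_hi, b_lo, b_hi)]
        else:
            mid = (b_lo + b_hi) // 2
            children = [(a_lo, a_hi, b_lo, mid), (a_lo, a_hi, mid + 1, b_hi)]
        for c in children:
            if c[0] <= c[3]:  # keep only sets containing a valid interval (a <= b)
                heapq.heappush(heap, (-bound(*c), c))
    return None

interval, score, expanded = best_interval([-1.0, 2.0, 3.0, -5.0, 4.0])
```

Because the most promising set of candidates is always expanded first, large parts of the search space are never scored individually, which is the property that makes costly classifiers affordable in such schemes.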
3D shape correspondence methods seek pairs of semantically equivalent surface points on two given shapes. We present three automatic algorithms that address three different aspects of this problem: 1) coarse, 2) dense, and 3) partial correspondence. In 1), after sampling evenly-spaced base vertices on shapes, we formulate the problem of shape correspondence as combinatorial optimization over the domain of all possible mappings of bases, which then reduces within a probabilistic framework to a log-likelihood maximization problem that we solve via the EM (Expectation Maximization) algorithm.
Due to computational limitations, we change this algorithm to a coarse-to-fine one (2) to achieve dense correspondence between all vertices. Our scale-invariant isometric distortion measure makes partial matching (3) possible as well.
We present an interactive, hybrid human-computer method for object classification. The method applies to classes of problems that are difficult for most people, but are recognizable by people with the appropriate expertise (e.g., animal species or airplane model recognition). The classification method can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively.
The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. Incorporating user input drives up recognition accuracy to levels that are good enough for practical applications; at the same time, computer vision reduces the amount of human interaction required. The resulting hybrid system is able to handle difficult, large multi-class problems with tightly-related categories.
We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate the accuracy and computational properties of different computer vision algorithms and the effects of noisy user responses on a dataset of 200 bird species and on the Animals With Attributes dataset.
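A minimal sketch of one round of the visual 20 questions game follows, assuming binary attributes, a class prior supplied by a vision classifier, and a user who answers incorrectly with a fixed probability `noise`. The question is chosen by expected information gain. The numbers, the noise model, and the function names are hypothetical and are not taken from the Visipedia system.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def update(prior, column, answer, noise):
    """Posterior over classes after a yes/no answer that is wrong with probability noise."""
    post = [p * ((1 - noise) if has_attr == answer else noise)
            for p, has_attr in zip(prior, column)]
    z = sum(post)
    return [x / z for x in post]

def best_question(prior, attributes, noise):
    """Index of the question with the largest expected entropy reduction."""
    best, best_gain = None, -1.0
    for q, column in enumerate(attributes):
        expected = 0.0
        for answer in (True, False):
            p_answer = sum(p * ((1 - noise) if a == answer else noise)
                           for p, a in zip(prior, column))
            if p_answer > 0:
                expected += p_answer * entropy(update(prior, column, answer, noise))
        gain = entropy(prior) - expected
        if gain > best_gain:
            best, best_gain = q, gain
    return best

# Hypothetical class prior from a vision classifier, and three binary attributes
# (one row per question, one entry per class).
prior = [0.4, 0.3, 0.2, 0.1]
attributes = [
    [True, True, True, False],
    [True, False, True, False],
    [True, True, False, False],
]
q = best_question(prior, attributes, noise=0.1)
posterior = update(prior, attributes[q], True, noise=0.1)
```

Repeating the select-ask-update loop until one class dominates the posterior gives the interactive game; a stronger vision prior means fewer questions are needed.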
Our results demonstrate the effectiveness and practicality of the hybrid human-computer classification paradigm. This work is part of the Visipedia project, in collaboration with Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder and Pietro Perona.
Object grasping and manipulation is a crucial part of daily human activities. The study of these actions represents a central component in the development of systems that attempt to understand human activities and robots that are able to act in human environments.
Three essential parts of this problem are tackled in this talk: the perception of the human hand in interaction with objects, the modeling of human grasping actions, and the refinement of the execution of a robotic grasp. The estimation of the human hand pose is carried out with a markerless visual system that performs in real time under object occlusions. Low dimensional models of various grasping actions are created by exploiting the correlations between different hand joints in a non-linear manner with Gaussian Process Latent Variable Models (GPLVM). Finally, robot grasping actions are perfected by exploiting the appearance of the robot during action execution.
Enabling computers to understand human behavior has the potential to revolutionize many areas that benefit society, such as clinical diagnosis, human-computer interaction, and social robotics. A critical element in the design of any behavioral sensing system is to find a good representation of the data for encoding, segmenting, classifying and predicting subtle human behavior.
In this talk I will propose several extensions of Component Analysis (CA) techniques (e.g., kernel principal component analysis, support vector machines, spectral clustering) that are able to learn spatio-temporal representations or components useful in many human sensing tasks. In the first part of the talk I will give an overview of several ongoing projects in the CMU Human Sensing Laboratory, including our current work on depression assessment from video, as well as hot-flash detection from wearable sensors.
In the second part of the talk I will show how several extensions of the CA methods outperform state-of-the-art algorithms in problems such as temporal alignment of human motion, temporal segmentation/clustering of human activities, joint segmentation and classification of human behavior, facial expression analysis, and facial feature detection in images. The talk will be adaptive, and I will discuss the topics of major interest to the audience.