The ability to predict how an environment changes based on forces applied to it is fundamental for a robot to achieve specific goals. Traditionally in robotics, this problem is addressed through the use of pre-specified models or physics simulators, taking advantage of prior knowledge of the problem structure. While these models are general and have broad applicability, they depend on accurate estimation of model parameters such as object shape, mass, and friction. On the other hand, learning-based methods such as Predictive State Representations or more recent deep learning approaches have looked at learning these models directly from raw perceptual information in a model-free manner. These methods operate on raw data without any intermediate parameter estimation, but lack the structure and generality of model-based techniques. In this talk, I will present some work that tries to bridge the gap between these two paradigms by proposing a specific class of deep visual dynamics models (SE3-Nets) that explicitly encode strong physical and 3D geometric priors (specifically, rigid body dynamics) in their structure. As opposed to traditional deep models that reason about dynamics and motion at the pixel level, we show that the physical priors implicit in our network architectures enable them to reason about dynamics at the object level: our network learns to identify objects in the scene and to predict rigid body rotation and translation per object. I will present results on applying our deep architectures to two specific problems: 1) Modeling scene dynamics, where the task is to predict future depth observations given the current observation and an applied action, and 2) Real-time visuomotor control of a Baxter manipulator based only on raw depth data.
We show that: 1) Our proposed architectures significantly outperform baseline deep models on dynamics modeling and 2) Our architectures perform comparably to or better than baseline models for visuomotor control while operating at camera rates (30Hz) and relying on far less information.
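The object-level prediction described above can be illustrated with a small NumPy sketch. It assumes each 3D point carries a soft assignment to one of K objects and that each object moves by its own rotation and translation; the function name `blend_rigid_motions`, the array shapes, and the mask-weighted blending are illustrative assumptions, not the SE3-Nets implementation.

```python
import numpy as np

def blend_rigid_motions(points, masks, rotations, translations):
    """Move each point by a mask-weighted mix of per-object rigid motions.

    points:       (N, 3) 3D points, e.g. back-projected from a depth image
    masks:        (N, K) per-point object weights, each row summing to 1
    rotations:    (K, 3, 3) one rotation matrix per object
    translations: (K, 3) one translation vector per object
    """
    # Apply every object's rigid motion to every point: shape (N, K, 3).
    moved = np.einsum("kij,nj->nki", rotations, points) + translations[None, :, :]
    # Blend the K candidate positions of each point with its mask weights.
    return np.einsum("nk,nki->ni", masks, moved)

# Hypothetical toy scene: object 0 stays still, object 1 rotates 90 degrees
# about the z-axis and moves up by one unit.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
rotations = np.stack([np.eye(3), rot_z_90])
translations = np.array([[0.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0]])
points = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
masks = np.array([[1.0, 0.0],   # first point belongs to object 0
                  [0.0, 1.0]])  # second point belongs to object 1
predicted = blend_rigid_motions(points, masks, rotations, translations)
```

With one-hot masks, each point simply follows the rigid motion of its own object; soft masks would interpolate between the candidate positions.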
Organizers: Franzi Meier
Machine learning has become a popular application domain for modern optimization techniques, pushing its algorithmic frontier. The need for large-scale optimization algorithms that can handle millions of dimensions or data points, typical of the big data era, has brought a resurgence of interest in first-order algorithms, making us revisit the venerable stochastic gradient method [Robbins-Monro 1951] as well as the Frank-Wolfe algorithm [Frank-Wolfe 1956]. In this talk, I will review recent improvements on these algorithms which can exploit the structure of modern machine learning approaches. I will explain why the Frank-Wolfe algorithm has become so popular lately, and present a surprising tweak on the stochastic gradient method which yields a fast linear convergence rate. Motivating applications will include weakly supervised video analysis and structured prediction problems.
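For readers unfamiliar with the Frank-Wolfe algorithm, here is a minimal sketch of its classical form, assuming the feasible set is the probability simplex and using the standard step size 2/(k+2). The function name `frank_wolfe_simplex` and the toy objective are illustrative assumptions; the recent variants reviewed in the talk are not reproduced here.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=2000):
    """Classical Frank-Wolfe over the probability simplex.

    grad: function returning the gradient of the objective at x
    x0:   feasible starting point (non-negative entries summing to 1)
    """
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        g = grad(x)
        # Linear minimization oracle: over the simplex, the minimizer of
        # <g, s> is the vertex at the smallest gradient coordinate.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        # A convex combination keeps the iterate feasible without projection.
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * s
    return x

# Toy problem: minimize 0.5 * ||x - target||^2, whose minimizer is target
# itself because target already lies in the simplex.
target = np.array([0.2, 0.5, 0.3])
x_fw = frank_wolfe_simplex(lambda x: x - target, np.array([1.0, 0.0, 0.0]))
```

Each iteration only needs a linear minimization over the feasible set rather than a projection, which is one commonly cited reason the method suits structured problems.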
Organizers: Philipp Hennig
The scope of this work is hand-object interaction. As a starting point, we observe hands manipulating objects and derive information based on computer vision methods. After considering hands and objects in isolation, we focus on the inherent interdependencies. One application of the gained knowledge is the synthesis of interactive hand motion for animated sequences.
Vision is a crucial sense for computational systems to interact with their environments as biological systems do. A major task is interpreting images of complex scenes, by recognizing and localizing objects, persons and actions. This involves learning a large number of visual models, ideally autonomously.
In this talk I will present two ways of reducing the amount of human supervision required by this learning process. The first way is labeling images only by the object class they contain. Learning from cluttered images is very challenging in this weakly supervised setting. In the traditional paradigm, each class is learned from scratch. In our work, instead, knowledge generic over classes is first learned during a meta-training stage from images of diverse classes with given object locations, and is then used to support learning any new class without location annotation. Generic knowledge helps because during meta-training the system can learn about localizing objects in general. As demonstrated experimentally, this approach enables learning from more challenging images than was possible before, such as those of PASCAL VOC 2007, which contain extensive clutter and large scale and appearance variations between object instances.
The second way is the analysis of news items consisting of images and text captions. We associate names and action verbs in the captions to the face and body pose of the persons in the images. We introduce a joint probabilistic model for simultaneously recovering image-caption correspondences and learning appearance models for the face and pose classes occurring in the corpus. As demonstrated experimentally, this joint "face and pose" model solves the correspondence problem better than earlier models covering only the face.
I will conclude with an outlook on the idea of visual culture, where new visual concepts are learned incrementally on top of all visual knowledge acquired so far. Besides generic knowledge, visual culture also includes knowledge specific to a class, knowledge of scene structures and other forms of visual knowledge. Potentially, this approach could considerably extend current visual recognition capabilities and produce an integrated body of visual knowledge.
I will survey our work on tracking and measurement in sports video, waypoints on the path to activity recognition and understanding. I will highlight some of our recent work on rectification and player tracking, not just in hockey but more recently in basketball, where we have addressed player identification in both fully supervised and semi-supervised settings.
Methods for visual recognition have made dramatic strides in recent years on various online benchmarks, but performance in the real world still often falters. Classic gradient-histogram models make overly simplistic assumptions regarding image appearance statistics, both locally and globally. Recent progress suggests that new learning-based representations can improve recognition by devices that are embedded in a physical world.
I'll review new methods for domain adaptation which capture the visual domain shift between environments, and improve recognition of objects in specific places when trained from generic online sources. I'll discuss methods for cross-modal semi-supervised learning, which can leverage additional unlabeled modalities in a test environment.
Finally, as time permits, I'll present recent results on learning hierarchical local image representations based on recursive probabilistic topic models, on learning strong object color models from sets of uncalibrated views using a new multi-view color constancy paradigm, and/or on monocular estimation of grasp affordances.
In the first part of the talk, I will describe methods that learn a single family of detectors for object classes that exhibit large within-class variation. One common solution is to use a divide-and-conquer strategy, where the space of possible within-class variations is partitioned, and different detectors are trained for different partitions.
However, these discrete partitions tend to be arbitrary in continuous spaces, and the classifiers have limited power when there are too few training samples in each subclass. To address this shortcoming, explicit feature sharing has been proposed, but it also makes training more expensive. We show that foreground-background classification (detection) and within-class classification of the foreground class (pose estimation) can be solved jointly using a product of two kernel functions. One kernel measures similarity for foreground-background classification. The other kernel accounts for latent factors that control within-class variation. Because the formulation is multiplicative, features are shared implicitly among foreground training samples, and the optimal sharing is a byproduct of SVM learning.
The resulting detector family is tuned to specific variations in the foreground. The effectiveness of this framework is demonstrated in experiments that involve detection, tracking, and pose estimation of human hands, faces, and vehicles in video.
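The multiplicative kernel described above can be sketched as the elementwise product of two Gram matrices, one over appearance features and one over the latent within-class parameter. The choice of RBF kernels, the bandwidths, and the names `rbf_kernel` and `multiplicative_kernel` are assumptions made for illustration; the abstract does not specify the kernels actually used.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian RBF Gram matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def multiplicative_kernel(x_a, theta_a, x_b, theta_b,
                          gamma_x=0.5, gamma_theta=2.0):
    """Product of an appearance kernel and a within-class (e.g. pose) kernel."""
    k_appearance = rbf_kernel(x_a, x_b, gamma_x)
    k_within = rbf_kernel(theta_a, theta_b, gamma_theta)
    # The elementwise product of two positive semi-definite Gram matrices
    # is again positive semi-definite (Schur product theorem).
    return k_appearance * k_within

# Hypothetical data: 6 samples with 4-D appearance features and a 1-D
# latent parameter such as a viewing angle.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
angles = rng.uniform(0.0, 1.0, size=(6, 1))
gram = multiplicative_kernel(features, angles, features, angles)
```

A Gram matrix of this kind could be handed to any SVM solver that accepts a precomputed kernel, so training samples with nearby latent parameters influence each other's detectors without an explicit sharing step.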
Beginning with a seminal paper of Diaconis (1988), so-called "probabilistic numerics" has aimed to compute probabilistic solutions to deterministic problems arising in numerical analysis by casting them as statistical inference problems. For example, numerical integration of a deterministic function can be seen as the integration of an unknown/random function, with evaluations of the integrand at the integration nodes providing partial information about the integrand. Advantages offered by this viewpoint include: access to the Bayesian representation of prior and posterior uncertainties; better propagation of uncertainty through hierarchical systems than simple worst-case error bounds; and appropriate accounting for numerical truncation and round-off error in inverse problems, so that the replicability of deterministic simulations is not mistaken for accuracy, a confusion that would yield an inappropriately concentrated Bayesian posterior. This talk will describe recent work on probabilistic numerical solvers for ordinary and partial differential equations, including their theoretical construction, convergence rates, and applications to forward and inverse problems. Joint work with Andrew Stuart (Warwick).
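The integration example above can be made concrete with a small Bayesian quadrature sketch. It assumes a Brownian motion prior on the integrand over [0, 1], that is, covariance k(x, y) = min(x, y) with f(0) = 0, chosen only because its kernel integrals have closed forms; the function name `bayesian_quadrature_brownian` is hypothetical, and this is not the differential equation solver discussed in the talk.

```python
import numpy as np

def bayesian_quadrature_brownian(nodes, values):
    """Posterior mean and variance of the integral of f over [0, 1].

    Assumes a zero-mean Gaussian process prior on f with covariance
    k(x, y) = min(x, y), so the prior fixes f(0) = 0.
    nodes:  distinct evaluation points in (0, 1]
    values: evaluations of f at those points
    """
    nodes = np.asarray(nodes, dtype=float)
    values = np.asarray(values, dtype=float)
    gram = np.minimum.outer(nodes, nodes)      # k(x_i, x_j)
    # Closed form of the integral of min(x, y) over y in [0, 1].
    kernel_mean = nodes - 0.5 * nodes ** 2
    weights = np.linalg.solve(gram, kernel_mean)
    mean = weights @ values
    # The prior variance of the integral is the double integral of
    # min(x, y) over the unit square, which equals 1/3.
    variance = 1.0 / 3.0 - weights @ kernel_mean
    return mean, variance

# Integrate f(x) = x, whose true integral over [0, 1] is 0.5.
nodes = np.linspace(0.1, 1.0, 10)
bq_mean, bq_variance = bayesian_quadrature_brownian(nodes, nodes)
```

The returned variance quantifies the remaining uncertainty about the integral given the finite set of evaluations, which is the kind of posterior uncertainty the abstract refers to.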
Organizers: Philipp Hennig