Writing and maintaining programs for robots poses some interesting challenges. These programs are hard to generalize, as their targets are more than computing platforms. It can be deceptive to see them as input-to-output mappings, as interesting environments result in unpredictable inputs, and mixing reactive and deliberative behavior makes intended outputs hard to define. Given the wide and fragmented landscape of components, from hardware to software, and the parties involved in providing and using them, integration is also a non-trivial aspect. The talk will illustrate the ongoing work at Fraunhofer IPA to tackle these challenges, how Open Source is its common trait, and how this translates into the industrial field thanks to the ROS-Industrial initiative.
Organizers: Vincent Berenz
Performance metrics are a key component of machine learning systems, and are ideally constructed to reflect real-world tradeoffs. In contrast, much of the literature simply focuses on algorithms for maximizing accuracy. With the increasing integration of machine learning into real systems, it is clear that accuracy is an insufficient measure of performance for many problems of interest. Unfortunately, unlike accuracy, many real-world performance metrics are non-decomposable, i.e., they cannot be computed as a sum of losses for each instance. Thus, known algorithms and the associated analyses are not trivially extended, and direct approaches require expensive combinatorial optimization. I will outline recent results characterizing population optimal classifiers for large families of binary and multilabel classification metrics, including such nonlinear metrics as F-measure and Jaccard measure. Perhaps surprisingly, the prediction which maximizes the utility for a range of such metrics takes a simple form. This results in simple and scalable procedures for optimizing complex metrics in practice. I will also outline how the same analysis gives optimal procedures for selecting point estimates from complex posterior distributions for structured objects such as graphs. Joint work with Nagarajan Natarajan, Bowei Yan, Kai Zhong, Pradeep Ravikumar and Inderjit Dhillon.
Organizers: Mijung Park
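To make the point about non-decomposable metrics concrete, here is a minimal sketch in Python. The abstract above does not state what the "simple form" of the optimal prediction is; the sketch assumes it is a threshold on the estimated class probability, which is a common plug-in approach in this line of work. The function names and the toy data are hypothetical and chosen only for illustration.

```python
def f1_score(y_true, y_pred):
    """F-measure from confusion counts; it is not a sum of per-instance losses."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, y_true):
    """Sweep candidate thresholds (assumed plug-in rule) and keep the best F1."""
    best_thr, best_f1 = None, -1.0
    for thr in sorted(set(probs)):
        y_pred = [1 if p >= thr else 0 for p in probs]
        score = f1_score(y_true, y_pred)
        if score > best_f1:
            best_thr, best_f1 = thr, score
    return best_thr, best_f1


# Toy data (hypothetical): estimated P(y = 1 | x) and the true labels.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

thr, f1 = best_threshold(probs, labels)
f1_at_half = f1_score(labels, [1 if p >= 0.5 else 0 for p in probs])
print(thr, f1, f1_at_half)
```

On this toy data the F1-maximizing threshold differs from the accuracy-style cutoff of 0.5, which is the practical consequence the abstract hints at: the same probability estimates can be reused, and only the decision rule changes with the metric.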
Matching between two sets arises in various areas of computer vision, such as feature point matching for 3D reconstruction, person re-identification for surveillance, or data association for multi-target tracking. Most previous work has focused either on designing suitable features and matching cost functions, or on developing faster and more accurate solvers for quadratic or higher-order problems. In the first part of my talk, I will present a strategy for improving state-of-the-art solutions by efficiently computing the marginals of the joint matching probability. The second part of my talk will revolve around our recent work on online multi-target tracking using recurrent neural networks (RNNs). I will mention some fundamental challenges we encountered and present our current solution.
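As a rough illustration of what "marginals of the joint matching probability" means, the following Python sketch computes them by brute force for a tiny problem. It is not the efficient strategy from the talk. It assumes, purely for illustration, that the joint probability of a one-to-one matching is proportional to exp(-total cost); the function name and the cost matrix are hypothetical.

```python
import itertools
import math


def matching_marginals(cost):
    """Return P(i matched to j) under an assumed Gibbs distribution over permutations.

    Brute-force enumeration: feasible only for very small n (n! matchings).
    """
    n = len(cost)
    marginals = [[0.0] * n for _ in range(n)]
    total = 0.0
    for perm in itertools.permutations(range(n)):
        # Unnormalized probability of this complete one-to-one matching.
        weight = math.exp(-sum(cost[i][perm[i]] for i in range(n)))
        total += weight
        for i, j in enumerate(perm):
            marginals[i][j] += weight
    return [[m / total for m in row] for row in marginals]


# Hypothetical 3x3 matching costs (low cost on the diagonal).
costs = [
    [0.1, 2.0, 3.0],
    [2.0, 0.2, 2.5],
    [3.0, 2.5, 0.3],
]
marg = matching_marginals(costs)
for row in marg:
    print([round(p, 3) for p in row])
```

Each row and each column of the result sums to one, so every entry can be read as the probability that a particular pair is matched, taking all competing matchings into account rather than only the single best assignment.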
The accurate reconstruction of facial shape is important for applications such as telepresence and gaming. It can be solved efficiently with the help of statistical shape models that constrain the shape of the reconstruction. In this talk, several methods to statistically analyze static and dynamic 3D face data are discussed. When statistically analyzing faces, various challenges arise from noisy, corrupt, or incomplete data. To overcome the limitations imposed by the poor data quality, we leverage redundancy in the data for shape processing. This is done by processing entire motion sequences in the case of dynamic data, and by jointly processing large databases in a groupwise fashion in the case of static data. First, a fully automatic approach to robustly register and statistically analyze facial motion sequences using a multilinear face model as statistical prior is proposed. Further, a statistical face model is discussed, which consists of many localized, decorrelated multilinear models. The localized and multi-scale nature of this model allows for recovery of fine-scale details while retaining robustness to severe noise and occlusions. Finally, the learning of statistical face models is formulated as a groupwise optimization framework that aims to learn a multilinear model while jointly optimizing the correspondence, or correcting the data.
In many control applications, the goal is to operate a dynamical system in an optimal way with respect to a certain performance criterion. In a combustion engine, for example, the goal could be to control the engine such that the emissions are minimized. Due to the complexity of an engine, the desired operating point is unknown or may even change over time, so that it cannot be determined a priori. Extremum seeking control is a learning-control methodology for solving this kind of control problem. It is a model-free method that optimizes the steady-state behavior of a dynamical system. Since it can be implemented with very limited resources, it has found several applications in industry. In this talk we give an introduction to extremum seeking theory based on a recently developed framework which relies on tools from geometric control. Furthermore, we discuss how this framework can be utilized to solve distributed optimization and coordination problems in multi-agent systems.
Organizers: Sebastian Trimpe
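The model-free idea behind extremum seeking, as described in the abstract above, can be sketched with the classic perturbation-based scheme. This is not the geometric-control framework the talk refers to, only a textbook-style illustration. The cost function, gains, and filter constants below are hypothetical choices; the controller only ever sees measured cost values, never the location of the optimum.

```python
import math


def cost(theta):
    """Steady-state cost, unknown to the controller; minimum at theta = 2 (hypothetical)."""
    return (theta - 2.0) ** 2 + 1.0


def extremum_seek(theta0=0.0, a=0.2, omega=50.0, k=5.0, omega_l=5.0,
                  dt=0.001, t_end=20.0):
    """Perturbation-based extremum seeking for minimization, integrated with Euler steps."""
    theta_hat = theta0          # current estimate of the optimal input
    y_lp = cost(theta0)         # low-pass estimate of the slowly varying cost level
    for i in range(int(t_end / dt)):
        t = i * dt
        dither = math.sin(omega * t)
        y = cost(theta_hat + a * dither)      # measure cost at the perturbed input
        y_hp = y - y_lp                       # remove the slow part (high-pass)
        theta_hat -= dt * k * y_hp * dither   # demodulate and integrate: descent on average
        y_lp += dt * omega_l * (y - y_lp)     # update the low-pass filter
    return theta_hat


theta_final = extremum_seek()
print(theta_final)
```

Multiplying the filtered measurement by the dither signal yields, on average, a quantity proportional to the local gradient of the cost, so the integrator drifts toward the minimum without any model of the plant.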
I am studying the question of how robots can autonomously develop skills. Considering children, it seems natural that they have their own agenda. They explore their environment in a playful way, without the need for somebody to tell them what to do next. With robots the situation is different. There are many methods to let robots learn to do something, but it is always about learning to do a specific task from a supervision signal. Unfortunately, these methods do not scale well to systems with many degrees of freedom, unless good prestructuring is available. The hypothesis is that if robots first learn to use their bodies and interact with the environment in a playful way, they can acquire many small skills with which they can later solve complicated tasks much more quickly. In the talk I will present my steps in this direction. Starting from some general information-theoretic considerations, we provide robots with their own drive to do something and explore their behavioral capabilities. Technically, this is achieved by considering the sensorimotor loop as a dynamical system whose parameters are adapted online according to a gradient ascent on an approximated information quantity. I will show examples of simulated and real robots behaving in a self-determined way and present future directions of my research.
Organizers: Jane Walters
In the last decade, there has been a major shift in the perception, use and predicted applications of robots. In contrast to their early industrial counterparts, robots are envisioned to operate in increasingly complex and uncertain environments, alongside humans, and over long periods of time. In my talk, I will argue that machine learning is indispensable in order for this new generation of robots to achieve high performance. Based on various examples (and videos) ranging from aerial-vehicle dancing to ground-vehicle racing, I will demonstrate the effect of robot learning, and highlight how our learning algorithms intertwine model-based control with machine learning. In particular, I will focus on our latest work that provides guarantees during learning (for example, safety and robustness guarantees) by combining traditional control methods (nonlinear, robust and model predictive control) with Gaussian process regression.
Organizers: Sebastian Trimpe
In this talk we present some recent results on human action recognition in videos. We first show how to use human pose for action recognition. To this end we propose a new pose-based convolutional neural network descriptor for action recognition, which aggregates motion and appearance information along tracks of human body parts. Next, we present an approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame level and then tracks high-scoring proposals in the video. Our tracker relies simultaneously on instance-level and class-level detectors. Actions are localized in time with a sliding window approach at the track level. Finally, we show how to extend this method to weakly supervised learning of actions, which allows scaling to large amounts of data without manual annotation.
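The sliding-window step for temporal localization mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' method: it assumes each track already comes with one action score per frame, and it ranks windows by their mean score. The function name, the window lengths, and the scores are hypothetical.

```python
def localize_action(scores, window_lengths=(3, 5, 7)):
    """Return (start, end, mean_score) of the best window; end is exclusive."""
    best = None
    for length in window_lengths:
        for start in range(0, len(scores) - length + 1):
            mean = sum(scores[start:start + length]) / length
            if best is None or mean > best[2]:
                best = (start, start + length, mean)
    return best


# Hypothetical per-frame action scores along one track.
track_scores = [0.1, 0.1, 0.2, 0.9, 0.8, 0.9, 0.2, 0.1, 0.1, 0.1]
start, end, mean_score = localize_action(track_scores)
print(start, end, round(mean_score, 3))
```

Ranking by mean score favors short, confident windows; a real system would likely combine several window lengths with a different scoring rule and non-maximum suppression, which this sketch leaves out.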
Typical human actions such as hand-shaking and drinking last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of single frames or short video clips and fail to model actions at their full temporal scale. In this work we learn video representations using neural networks with long-term temporal convolutions. We demonstrate that CNN models with increased temporal extents improve the accuracy of action recognition despite reduced spatial resolution. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition, UCF101 and HMDB51.
Proper handling of occlusions is a big challenge for model-based reconstruction; for example, in multi-view motion capture a major difficulty is the handling of occluding body parts. We propose a smooth volumetric scene representation, which implicitly converts occlusion into a smooth and differentiable phenomenon (ICCV2015). Our ray tracing image formation model helps to express the objective in a single closed-form expression. This is in contrast to existing surface (mesh) representations, where occlusion is a local effect, causes non-differentiability, and is difficult to optimize. We demonstrate improvements for multi-view scene reconstruction, rigid object tracking, and motion capture. Moreover, I will show an application of motion tracking to the interactive control of virtual characters (SigAsia2015).
The core focus of my research is on robot perception. Within this broad categorization, I am mainly interested in understanding how teams of robots and sensors can cooperate and/or collaborate to improve the perception of themselves (self-localization) as well as their surroundings (target tracking, mapping, etc.). In this talk I will describe the inter-dependencies of such perception modules and present state-of-the-art methods to perform unified cooperative state estimation. The trade-off between accuracy of estimation and computational speed will be highlighted through a new optimization-based method for unified-state estimation. Furthermore, I will also describe how perception-based multirobot formation control can be achieved. Towards the end, I will present some recent results on cooperative vision-based target tracking and a few comments on our ongoing work regarding cooperative aerial mapping with human-in-the-loop.
Modeling and reconstruction of shape and motion are problems of fundamental importance in computer vision. Inverse Problem theory constitutes a powerful mathematical framework for dealing with ill-posed problems such as those typically arising in shape and motion modeling. In this talk, I will present methods inspired by Inverse Problem theory for dealing with four different shape and motion modeling problems. In particular, in the context of shape modeling, I will present a method for component-wise modeling of articulated objects and its application in computing 3D models of animals. Additionally, I will discuss the problem of modeling specular surfaces via the properties of their material, and I will also present a model for confidence-driven depth image fusion based on total variation regularization. Regarding motion, I will discuss a method for the recognition of human actions from motion capture data based on nonparametric Bayesian models.