In my talk I will present my work regarding 3D mapping using lidar scanners. I will give an overview of the SLAM problem and its main challenges: robustness, accuracy and processing speed. Regarding robustness and accuracy, we investigate a better point cloud representation based on resampling and surface reconstruction. Moreover, we demonstrate how it can be incorporated in an ICP-based scan matching technique. Finally, we elaborate on globally consistent mapping using loop closures. Regarding processing speed, we propose the integration of our scan matching in a multi-resolution scheme and a GPU-accelerated implementation using our programming language Quasar.
Organizers: Simon Donne
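To make the ICP-based scan matching mentioned in the abstract above concrete, here is a minimal sketch of textbook point-to-point ICP, reduced to 2D and brute-force nearest neighbours for brevity. It is not the speaker's method: the resampling, surface reconstruction, multi-resolution scheme and Quasar GPU implementation are not modelled, and all function names are hypothetical.

```python
import math

def best_rigid_transform_2d(src, dst):
    """Closed-form least-squares rotation/translation mapping src[i] onto dst[i]."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    s_dot = 0.0
    s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy  # centred source point
        bx, by = qx - dx, qy - dy  # centred target point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty

def apply_transform(points, theta, tx, ty):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(src, dst, num_iters=20):
    """Alternate nearest-neighbour matching and closed-form alignment."""
    current = list(src)
    for _ in range(num_iters):
        matched = [
            min(dst, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
            for p in current
        ]
        theta, tx, ty = best_rigid_transform_2d(current, matched)
        current = apply_transform(current, theta, tx, ty)
    return current
```

The brute-force matching step is quadratic in the number of points; it is exactly the part that a real system would replace with a spatial index or a parallel implementation.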
The ability to predict how an environment changes based on forces applied to it is fundamental for a robot to achieve specific goals. Traditionally in robotics, this problem is addressed through the use of pre-specified models or physics simulators, taking advantage of prior knowledge of the problem structure. While these models are general and have broad applicability, they depend on accurate estimation of model parameters such as object shape, mass, friction, etc. On the other hand, learning-based methods such as Predictive State Representations or more recent deep learning approaches have looked at learning these models directly from raw perceptual information in a model-free manner. These methods operate on raw data without any intermediate parameter estimation, but lack the structure and generality of model-based techniques. In this talk, I will present some work that tries to bridge the gap between these two paradigms by proposing a specific class of deep visual dynamics models (SE3-Nets) that explicitly encode strong physical and 3D geometric priors (specifically, rigid body dynamics) in their structure. As opposed to traditional deep models that reason about dynamics/motion at the pixel level, we show that the physical priors implicit in our network architectures enable them to reason about dynamics at the object level: our network learns to identify objects in the scene and to predict rigid body rotation and translation per object. I will present results on applying our deep architectures to two specific problems: 1) Modeling scene dynamics where the task is to predict future depth observations given the current observation and an applied action and 2) Real-time visuomotor control of a Baxter manipulator based only on raw depth data.
We show that: 1) Our proposed architectures significantly outperform baseline deep models on dynamics modelling and 2) Our architectures perform comparably to or better than baseline models for visuomotor control while operating at camera rates (30Hz) and relying on far less information.
Organizers: Franzi Meier
Machine learning has become a popular application domain for modern optimization techniques, pushing its algorithmic frontier. The need for large-scale optimization algorithms which can handle millions of dimensions or data points, typical for the big data era, has brought a resurgence of interest in first-order algorithms, making us revisit the venerable stochastic gradient method [Robbins-Monro 1951] as well as the Frank-Wolfe algorithm [Frank-Wolfe 1956]. In this talk, I will review recent improvements on these algorithms which can exploit the structure of modern machine learning approaches. I will explain why the Frank-Wolfe algorithm has become so popular lately and present a surprising tweak on the stochastic gradient method which yields a fast linear convergence rate. Motivating applications will include weakly supervised video analysis and structured prediction problems.
Organizers: Philipp Hennig
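As a reminder of what the classical Frank-Wolfe algorithm named in the abstract above looks like, here is a minimal sketch on a toy problem: minimizing 0.5 * ||x - b||^2 over the probability simplex with the standard 2 / (k + 2) step size. The toy objective and the function name are assumptions made for illustration; none of the recent improvements discussed in the talk are included.

```python
def frank_wolfe_simplex(b, num_iters=200):
    """Minimize f(x) = 0.5 * ||x - b||^2 over the probability simplex."""
    n = len(b)
    x = [1.0 / n] * n  # feasible starting point: the uniform distribution
    for k in range(num_iters):
        grad = [xi - bi for xi, bi in zip(x, b)]
        # Linear minimization oracle: over the simplex, the minimizer of
        # <grad, s> is the vertex at the smallest gradient coordinate.
        i_min = min(range(n), key=lambda i: grad[i])
        gamma = 2.0 / (k + 2.0)  # standard step size, no line search
        # Convex combination of x and the chosen vertex keeps x feasible.
        x = [(1.0 - gamma) * xi for xi in x]
        x[i_min] += gamma
    return x
```

Each iteration only needs a linear minimization over the feasible set rather than a projection, which is cheap for structured sets such as the simplex. The iterates stay feasible by construction.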
Humanoid locomotion on horizontal floors was solved by closing the feedback loop on the Zero-tilting Moment Point (ZMP), a measurable dynamic point that needs to stay inside the foot contact area to prevent the robot from falling (contact stability criterion). However, this criterion does not apply to general multi-contact settings, the "new frontier" in humanoid locomotion. In this talk, we will see how the ideas of ZMP and support area can be generalized and applied to multi-contact locomotion. First, we will show how support areas can be calculated in any virtual plane, allowing one to apply classical schemes even when contacts are not coplanar. Yet, these schemes constrain the center-of-mass (COM) to planar motions. We overcome this limitation by extending the calculation of the contact-stability criterion from a support area to a support cone of 3D COM accelerations. We use this new criterion to implement a multi-contact walking pattern generator based on predictive control of COM accelerations, which we will demonstrate in real-time simulations during the presentation.
Organizers: Ludovic Righetti
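The classical flat-floor criterion that the abstract above starts from can be sketched in a few lines. The sketch assumes a point-mass model (the rate of change of angular momentum is neglected), a horizontal floor at height zero, and a convex support polygon given in counter-clockwise order. Function names and numerical values are hypothetical, and the multi-contact generalization presented in the talk is not covered.

```python
G = 9.81  # gravitational acceleration in m/s^2

def zmp_point_mass(com, com_acc, g=G):
    """ZMP on the floor plane z = 0 for a point mass at `com` with acceleration `com_acc`."""
    cx, cy, cz = com
    ax, ay, az = com_acc
    denom = az + g
    if denom <= 0.0:
        raise ValueError("no positive vertical contact force: ZMP undefined")
    return (cx - cz * ax / denom, cy - cz * ay / denom)

def inside_convex_polygon(p, vertices):
    """True if point p lies inside or on a convex polygon with counter-clockwise vertices."""
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # A negative cross product means p is to the right of this edge, i.e. outside.
        cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
        if cross < 0.0:
            return False
    return True

# Hypothetical rectangular foot of 20 cm by 10 cm, centred at the origin.
foot = [(-0.1, -0.05), (0.1, -0.05), (0.1, 0.05), (-0.1, 0.05)]
standing_still = zmp_point_mass((0.0, 0.0, 0.8), (0.0, 0.0, 0.0))
accelerating = zmp_point_mass((0.0, 0.0, 0.8), (2.0, 0.0, 0.0))
```

With these values, the ZMP of the stationary robot lies at the foot centre, while a forward COM acceleration of 2 m/s^2 pushes it behind the heel, so the criterion is violated.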
Understanding people in images and videos is a problem studied intensively in computer vision. While continuous progress has been made, occlusions, cluttered backgrounds, complex poses and a large variety of appearance remain challenging, especially for crowded scenes. In this talk, I will explore the algorithms and tools that enable computers to interpret people's position, motion and articulated poses in challenging real-world images and videos. More specifically, I will discuss an optimization problem whose feasible solutions define a decomposition of a given graph. I will highlight the applications of this problem in computer vision, which range from multi-person tracking [1,2,3] to motion segmentation [4]. I will also cover an extended optimization problem whose feasible solutions define a decomposition of a given graph and a labeling of its nodes, with an application to multi-person pose estimation [5].
References:
[1] Subgraph Decomposition for Multi-Object Tracking; S. Tang, B. Andres, M. Andriluka and B. Schiele; CVPR 2015
[2] Multi-Person Tracking by Multicut and Deep Matching; S. Tang, B. Andres, M. Andriluka and B. Schiele; arXiv 2016
[3] Multi-Person Tracking by Lifted Multicut and Person Re-identification; S. Tang, B. Andres, M. Andriluka and B. Schiele
[4] A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects; M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox and B. Schiele; arXiv 2016
[5] DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation; L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele; CVPR 2016
Organizers: Naureen Mahmood
Coronary artery disease (CAD) is the single leading cause of death worldwide, and Cardiac Computed Tomography Angiography (CCTA) is a non-invasive test to rule out CAD using the anatomical characterization of the coronary lesions. Recent studies suggest that coronary lesions’ hemodynamic significance can be assessed by Fractional Flow Reserve (FFR), which is usually measured invasively in the CathLab but can also be simulated from a patient-specific biophysical model based on CCTA data. We learn a parametric lumped model (LM) enabling fast computational fluid dynamics simulations of blood flow in elongated vessel networks to alleviate the computational burden of 3D finite element (FE) simulations. We adapt the coefficients balancing the local nonlinear hydraulic effects from a training set of precomputed FE simulations. Our LM yields accurate pressure predictions, suggesting that costly FE simulations can be replaced by our fast LM, paving the way to using a personalised interactive biophysical model with real-time feedback in clinical practice.
Hand motion capture with an RGB-D sensor has recently gained a lot of research attention; however, even the most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a priori knowledge of the object's shape and skeleton. In the case of unknown object shape, there are existing 3D reconstruction methods that capitalize on distinctive geometric or texture features. These methods, though, fail for textureless and highly symmetric objects like household articles, mechanical parts or toys. We show that extracting 3D hand motion for in-hand scanning effectively facilitates the reconstruction of such objects, and we fuse the rich additional information of hands into a 3D reconstruction pipeline. Finally, although shape reconstruction is enough for rigid objects, there is a lack of tools that build rigged models of articulated objects that deform realistically. We propose a method that creates a fully rigged model consisting of a watertight mesh, embedded skeleton and skinning weights by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.
Organizers: Javier Romero
Matching between two sets arises in various areas in computer vision, such as feature point matching for 3D reconstruction, person re-identification for surveillance or data association for multi-target tracking. Most previous work focused either on designing suitable features and matching cost functions, or on developing faster and more accurate solvers for quadratic or higher-order problems. In the first part of my talk, I will present a strategy for improving state-of-the-art solutions by efficiently computing the marginals of the joint matching probability. The second part of my talk will revolve around our recent work on online multi-target tracking using recurrent neural networks (RNNs). I will mention some fundamental challenges we encountered and present our current solution.
The accurate reconstruction of facial shape is important for applications such as telepresence and gaming. It can be solved efficiently with the help of statistical shape models that constrain the shape of the reconstruction. In this talk, several methods to statistically analyze static and dynamic 3D face data are discussed. When statistically analyzing faces, various challenges arise from noisy, corrupt, or incomplete data. To overcome the limitations imposed by the poor data quality, we leverage redundancy in the data for shape processing. This is done by processing entire motion sequences in the case of dynamic data, and by jointly processing large databases in a groupwise fashion in the case of static data. First, a fully automatic approach to robustly register and statistically analyze facial motion sequences using a multilinear face model as statistical prior is proposed. Further, a statistical face model is discussed, which consists of many localized, decorrelated multilinear models. The localized and multi-scale nature of this model allows for recovery of fine-scale details while retaining robustness to severe noise and occlusions. Finally, the learning of statistical face models is formulated as a groupwise optimization framework that aims to learn a multilinear model while jointly optimizing the correspondence, or correcting the data.
In many control applications, the goal is to operate a dynamical system in an optimal way with respect to a certain performance criterion. In a combustion engine, for example, the goal could be to control the engine such that the emissions are minimized. Due to the complexity of an engine, the desired operating point is unknown or may even change over time, so that it cannot be determined a priori. Extremum seeking control is a learning-control methodology for solving this kind of control problem. It is a model-free method that optimizes the steady-state behavior of a dynamical system. Since it can be implemented with very limited resources, it has found several applications in industry. In this talk we give an introduction to extremum seeking theory based on a recently developed framework which relies on tools from geometric control. Furthermore, we discuss how this framework can be utilized to solve distributed optimization and coordination problems in multi-agent systems.
Organizers: Sebastian Trimpe
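To illustrate the model-free idea behind extremum seeking described in the abstract above, here is a minimal sketch of the classical perturbation-based scheme applied to a static cost map: a sinusoidal dither probes the cost, a high-pass filter removes its slowly varying part, and demodulation yields a gradient estimate. This is the textbook scheme, not the geometric-control framework of the talk. The function name, the quadratic test cost and all gains are assumptions chosen so that the dither is much faster than the adaptation.

```python
import math

def extremum_seeking(cost, theta0, a=0.2, omega=50.0, k=5.0, h=5.0,
                     dt=0.001, t_end=20.0):
    """Drive theta towards a local minimizer of `cost` using only cost measurements."""
    theta_hat = theta0
    y_lp = cost(theta0)  # low-pass filter state, initialised at the first measurement
    steps = int(round(t_end / dt))
    for i in range(steps):
        s = math.sin(omega * i * dt)    # dither signal
        y = cost(theta_hat + a * s)     # measured cost at the perturbed input
        y_lp += dt * h * (y - y_lp)     # first-order low-pass filter (Euler step)
        y_hp = y - y_lp                 # high-pass part: slowly varying offset removed
        # Demodulation: y_hp * s is, on average, proportional to the local gradient.
        theta_hat -= dt * k * y_hp * s
    return theta_hat

# Hypothetical cost with its minimum at theta = 2; the algorithm never sees this formula,
# it only evaluates the cost at the inputs it applies.
estimate = extremum_seeking(lambda th: (th - 2.0) ** 2 + 1.0, theta0=0.0)
```

The update needs only one multiplication, one filter state and one integrator per step, which is consistent with the remark that such schemes run on very limited hardware.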
I am studying the question of how robots can autonomously develop skills. Considering children, it seems natural that they have their own agenda. They explore their environment in a playful way, without the need for somebody to tell them what to do next. With robots the situation is different. There are many methods to let robots learn to do something, but it is always about learning to do a specific task from a supervision signal. Unfortunately, these methods do not scale well to systems with many degrees of freedom unless a good prestructuring is available. The hypothesis is that if robots first learn to use their bodies and interact with the environment in a playful way, they can acquire many small skills with which they can later solve complicated tasks much more quickly. In the talk I will present my steps in this direction. Starting from some general information-theoretic considerations, we provide robots with their own drive to do something and explore their behavioral capabilities. Technically, this is achieved by considering the sensorimotor loop as a dynamical system whose parameters are adapted online according to a gradient ascent on an approximated information quantity. I will show examples of simulated and real robots behaving in a self-determined way and present future directions of my research.
Organizers: Jane Walters
In the last decade, there has been a major shift in the perception, use and predicted applications of robots. In contrast to their early industrial counterparts, robots are envisioned to operate in increasingly complex and uncertain environments, alongside humans, and over long periods of time. In my talk, I will argue that machine learning is indispensable in order for this new generation of robots to achieve high performance. Based on various examples (and videos) ranging from aerial-vehicle dancing to ground-vehicle racing, I will demonstrate the effect of robot learning, and highlight how our learning algorithms intertwine model-based control with machine learning. In particular, I will focus on our latest work that provides guarantees during learning (for example, safety and robustness guarantees) by combining traditional control methods (nonlinear, robust and model predictive control) with Gaussian process regression.
Organizers: Sebastian Trimpe
In this talk we present some recent results on human action recognition in videos. We first show how to use human pose for action recognition. To this end we propose a new pose-based convolutional neural network descriptor for action recognition, which aggregates motion and appearance information along tracks of human body parts. Next, we present an approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame level and then tracks high-scoring proposals in the video. Our tracker relies simultaneously on instance-level and class-level detectors. Actions are localized in time with a sliding window approach at the track level. Finally, we show how to extend this method to weakly supervised learning of actions, which allows scaling to large amounts of data without manual annotation.