In my talk I will present my work regarding 3D mapping using lidar scanners. I will give an overview of the SLAM problem and its main challenges: robustness, accuracy and processing speed. Regarding robustness and accuracy, we investigate a better point cloud representation based on resampling and surface reconstruction. Moreover, we demonstrate how it can be incorporated in an ICP-based scan matching technique. Finally, we elaborate on globally consistent mapping using loop closures. Regarding processing speed, we propose the integration of our scan matching in a multi-resolution scheme and a GPU-accelerated implementation using our programming language Quasar.
Organizers: Simon Donne
The ability to predict how an environment changes based on forces applied to it is fundamental for a robot to achieve specific goals. Traditionally in robotics, this problem is addressed through the use of pre-specified models or physics simulators, taking advantage of prior knowledge of the problem structure. While these models are general and have broad applicability, they depend on accurate estimation of model parameters such as object shape, mass, friction etc. On the other hand, learning based methods such as Predictive State Representations or more recent deep learning approaches have looked at learning these models directly from raw perceptual information in a model-free manner. These methods operate on raw data without any intermediate parameter estimation, but lack the structure and generality of model-based techniques. In this talk, I will present some work that tries to bridge the gap between these two paradigms by proposing a specific class of deep visual dynamics models (SE3-Nets) that explicitly encode strong physical and 3D geometric priors (specifically, rigid body dynamics) in their structure. As opposed to traditional deep models that reason about dynamics/motion a pixel level, we show that the physical priors implicit in our network architectures enable them to reason about dynamics at the object level - our network learns to identify objects in the scene and to predict rigid body rotation and translation per object. I will present results on applying our deep architectures to two specific problems: 1) Modeling scene dynamics where the task is to predict future depth observations given the current observation and an applied action and 2) Real-time visuomotor control of a Baxter manipulator based only on raw depth data. We show that: 1) Our proposed architectures significantly outperform baseline deep models on dynamics modelling and 2) Our architectures perform comparably or better than baseline models for visuomotor control while operating at camera rates (30Hz) and relying on far less information.
Organizers: Franzi Meier
Machine learning has become a popular application domain for modern optimization techniques, pushing its algorithmic frontier. The need for large scale optimization algorithms which can handle millions of dimensions or data points, typical for the big data era, have brought a resurgence of interest for first order algorithms, making us revisit the venerable stochastic gradient method [Robbins-Monro 1951] as well as the Frank-Wolfe algorithm [Frank-Wolfe 1956]. In this talk, I will review recent improvements on these algorithms which can exploit the structure of modern machine learning approaches. I will explain why the Frank-Wolfe algorithm has become so popular lately; and present a surprising tweak on the stochastic gradient method which yields a fast linear convergence rate. Motivating applications will include weakly supervised video analysis and structured prediction problems.
Organizers: Philipp Hennig
This talk addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a “visual memory” in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given video frames as input, our approach first assigns each pixel an object or background label obtained with an encoder-decoder network that takes as input optical flow and is trained on synthetic data. Next, a “visual memory” specific to the video is acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allows to propagate spatial information over time. We evaluate our method extensively on two benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. This is joint work with K. Alahari and P. Tokmakov.
Organizers: Osman Ulusoy
Many of the existing Robotics & Automation (R&A) technologies are at a sufficient level of maturity and are widely accepted by the academic (and to a lesser extent by the industrial) community after having undergone the scientific rigor and peer reviews that accompany such works. I believe that most of the past and current research and development efforts in robotics and automation have been squarely aimed at increasing the Standard of Living (SoL) in developed economies where housing, running water, transportation, schools, access to healthcare, to name a few, are taken for granted. Humanitarian R&A, on the other hand, can be taken to mean technologies that can make a fundamental difference in people’s lives by alleviating their suffering in times of need, such as during natural or man-made disasters or in pockets of the population where the most basic needs of humanity are not met, thus improving their Quality of Life (QoL) and not just SoL. My current work focuses on the applied use of robotics and automation technologies for the benefit of under-served and under-developed communities by working closely with them to develop solutions that showcase the effectiveness of R&A solutions in domains that strike a chord with the beneficiaries. This is made possible by bringing together researchers, practitioners from industry, academia, local governments, and various entities such as the IEEE Robotics Automation Society’s Special Interest Group on Humanitarian Technology (RAS-SIGHT), NGOs, and NPOs across the globe. I will share some of my efforts and thoughts on challenges that need to be taken into consideration including sustainability of developed solutions. I will also outline my recent efforts in the technology and public policy domains with emphasis on socio-economic, cultural, privacy, and security issues in developing and developed economies.
Organizers: Ludovic Righetti
I'll present my master thesis "Biquadratic Forms and Semi-Definite Relaxations". It is about biquadratic optimization programs (which are NP-hard generally) and examines a condition under which there exists an algorithm that finds a solution to every instance of the problem in polynomial time. I'll present a counterexample for which this is not possible generally and face the question of what happens if further knowledge about the variables over which we optimise is applied.
Organizers: Fatma Güney
A large part of image analysis is about breaking things into pieces. Decompositions of a graph are a mathematical abstraction of the possible outcomes. This talk is about optimization problems whose feasible solutions define decompositions of a graph. One example is the correlation clustering problem whose feasible solutions relate one-to-one to the decompositions of a graph, and whose objective function puts a cost or reward on neighboring nodes ending up in distinct components. This talk shows applications of this problem and proposed generalizations to diverse image analysis tasks. It sketches algorithms for finding feasible solutions for large instances in practice, solutions that are often superior in the metrics of application-specific benchmarks. It also sketches algorithms for finding lower bounds and points to new findings and open problems of polyhedral geometry in this context.
Organizers: Christoph Lassner
Colloquium on haptics: Two guests of the department "Haptic Intelligence" (Dept. Kuchenbecker), will each give a short talk this Friday (May 5) in Tübingen. The talks will be broadcasted to Stuttgart, room 2 P4.
Estimating human pose, shape, and motion from images and video are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.
Organizers: Dimitris Tzionas
Human-centric robotic applications often require the robots to learn new skills by interacting with the end-users. From a machine learning perspective, the challenge is to acquire skills from only few interactions, with strong generalization demands. It requires: 1) the development of intuitive active learning interfaces to acquire meaningful demonstrations; 2) the development of models that can exploit the structure and geometry of the acquired data in an efficient way; 3) the development of adaptive control techniques that can exploit the learned task variations and coordination patterns. The developed models often need to serve several purposes (recognition, prediction, online synthesis), and be compatible with different learning strategies (imitation, emulation, exploration). For the reproduction of skills, these models need to be enriched with force and impedance information to enable human-robot collaboration and to generate safe and natural movements. I will present an approach combining model predictive control and statistical learning of movement primitives in multiple coordinate systems. The proposed approach will be illustrated in various applications, with robots either close to us (robot for dressing assistance), part of us (prosthetic hand with EMG and tactile sensing), or far from us (teleoperation of bimanual robot in deep water).
Organizers: Ludovic Righetti
This talk will survey recent work to achieve multi-contact locomotion control of humanoid and legged robots. I will start by presenting some results on robust optimization-based control. We exploited robust optimization techniques, either stochastic or worst-case, to improve the robustness of Task-Space Inverse Dynamics (TSID), a well-known control framework for legged robots. We modeled uncertainties in the joint torques, and we immunized the constraints of the system to any of the realizations of these uncertainties. We also applied the same methodology to ensure the balance of the robot despite bounded errors in the its inertial parameters. Extensive simulations in a realistic environment show that the proposed robust controllers greatly outperform the classic one. Then I will present preliminary results on a new capturability criterion for legged robots in multi-contact. "N-step capturability" is the ability of a system to come to a stop by taking N or fewer steps. Simplified models to compute N-step capturability already exist and are widely used, but they are limited to locomotion on flat terrains. We propose a new efficient algorithm to compute 0-step capturability for a robot in arbitrary contact scenarios. Finally, I will present our recent efforts to transfer the above-mentioned techniques to the real humanoid robot HRP-2, on which we recently implemented joint torque control.
Organizers: Ludovic Righetti
The retina in the eye performs complex computations, to transmit only behaviourally relevant information about our visual environment to the brain. These computations are implemented by numerous different cell types that form complex circuits. New experimental and computational methods make it possible to study the cellular diversity of the retina in detail – the goal of obtaining a complete list of all the cell types in the retina and, thus, its “building blocks”, is within reach. I will review our recent contributions in this area, showing how analyzing multimodal datasets from electron microscopy and functional imaging can yield insights into the cellular organization of retinal circuits.
Organizers: Philipp Hennig
From gait, dance to martial art, human movements provide rich, complex yet coherent spatiotemporal patterns reflecting characteristics of a group or an individual. We develop computer algorithms to automatically learn such quality discriminative features from multimodal data. In this talk, I present a trilogy on learning from human movements: (1) Gait analysis from video data: based on frieze patterns (7 frieze groups), a video sequence of silhouettes is mapped into a pair of spatiotemporal patterns that are near-periodic along the time axis. A group theoretical analysis of periodic patterns allows us to determine the dynamic time warping and affine scaling that aligns two gait sequences from similar viewpoints for human identification. (2) Dance analysis and synthesis (mocap, music, ratings from Mechanical Turks): we explore the complex relationship between perceived dance quality/dancer's gender and dance movements respectively. As a feasibility study, we construct a computational framework for an analysis-synthesis-feedback loop using a novel multimedia dance-texture representation for joint angular displacement, velocity and acceleration. Furthermore, we integrate crowd sourcing, music and motion-capture data, and machine learning-based methods for dance segmentation, analysis and synthesis of new dancers. A quantitative validation of this framework on a motion-capture dataset of 172 dancers evaluated by more than 400 independent on-line raters demonstrates significant correlation between human perception and the algorithmically intended dance quality or gender of the synthesized dancers. (3) Tai Chi performance evaluation (mocap + video): I shall also discuss the feasibility of utilizing spatiotemporal synchronization and, ultimately, machine learning to evaluate Tai Chi routines performed by different subjects in our current project of “Tai Chi + Advanced Technology for Smart Health”.