Recognition of pain in horses and other animals is important, because pain is a manifestation of disease and decreases animal welfare. Pain diagnostics for humans typically includes self-evaluation and location of the pain with the help of standardized forms, and labeling of the pain by a clinical expert using pain scales. However, animals cannot verbalize their pain as humans can, and the use of standardized pain scales is challenged by the fact that animals such as horses and cattle, being prey animals, display subtle and less obvious pain behavior: it is simply beneficial for a prey animal to appear healthy, in order to lower the interest from predators. We work together with veterinarians to develop methods for automatic video-based recognition of pain in horses. These methods are typically trained with video examples of behavioral traits labeled with pain level and pain characteristics. This automated, user-independent system for recognition of pain behavior in horses will be the first of its kind in the world. A successful system might change the concept for how we monitor and care for our animals.
A dominant trend in manufacturing is the move toward small production volumes and high product variability. It is thus anticipated that future manufacturing automation systems will be characterized by a high degree of autonomy, and must be able to learn new behaviors without explicit programming. Robot Learning, and more generally Autonomous Manufacturing, is an exciting research field at the intersection of Machine Learning and Automation. The combination of "traditional" control techniques with data-driven algorithms holds the promise of allowing robots to learn new behaviors through experience. This talk introduces selected Siemens research projects in the area of Autonomous Manufacturing.
Animals and humans are excellent at conceiving solutions to physical and geometric problems, for instance in using tools, coming up with creative constructions, or eventually inventing novel mechanisms and machines. Cognitive scientists coined the term intuitive physics in this context. It is a shame we do not yet have good computational models of such capabilities. A main stream of current robotics research focuses on training robots for narrow manipulation skills, often using massive data from physical simulators. Complementary to that, we should also try to understand how the basic principles underlying physics can directly be used to enable general-purpose physical reasoning in robots, rather than sampling data from physical simulations. In this talk I will discuss an approach called Logic-Geometric Programming, which builds a bridge between control theory, AI planning and robot manipulation. It demonstrates strong performance on sequential manipulation problems, but also raises a number of highly interesting fundamental problems, including its probabilistic formulation, reactive execution and learning.
Robotic systems that use magnetically actuated ferromagnetic bodies, or even whole miniature robots, have recently become a fast-advancing technological field, especially at the nano- and microscale. Mesoscale and, above all, multiscale magnetically guided robotic systems are an advanced field of study in which it is difficult to account for differing forces, precision and energy demands. The major goal of our talk is to discuss the challenges of magnetically guided mesoscale and multiscale actuation, followed by the results of our research on magnetic positioning systems and magnetic soft-robotic grippers.
Organizers: Metin Sitti
Human pose stability analysis is the key to understanding locomotion and control of body equilibrium, with numerous applications in the fields of Kinesiology, Medicine and Robotics. We propose and validate a novel approach to learn dynamics from kinematics of a human body to aid stability analysis. More specifically, we propose an end-to-end deep learning architecture to regress foot pressure from a human pose derived from video. We have collected and utilized a set of long (5+ min) choreographed Taiji (Tai Chi) sequences of multiple subjects with synchronized motion capture, foot pressure and video data. The derived human pose data and corresponding foot pressure maps are used jointly in training a convolutional neural network with residual architecture, named “PressNet”. Cross-validation results show promising performance of PressNet, significantly outperforming the baseline method under reasonable sensor noise ranges.
Organizers: Nadine Rueegg
Understanding objects and their behavior from images and videos is a difficult inverse problem. It requires learning a metric in image space that reflects object relations in the real world. This metric learning problem calls for large volumes of training data. While images and videos are easily available, labels are not, thus motivating self-supervised metric and representation learning. Furthermore, I will present a widely applicable strategy based on deep reinforcement learning to improve the surrogate tasks underlying self-supervision. Thereafter, the talk will cover the learning of disentangled representations that explicitly separate different object characteristics. Our approach is based on an analysis-by-synthesis paradigm and can generate novel object instances with flexible changes to individual characteristics such as their appearance and pose. It addresses diverse applications in human and animal behavior analysis, a topic on which we collaborate intensively with neuroscientists. Time permitting, I will discuss the disentangling of representations from a wider perspective, including novel strategies for image stylization and new strategies for regularization of the latent space of generator networks.
Organizers: Joel Janai
Over the past few years, with the advent of Deep Convolutional Neural Networks (DCNNs) and the availability of visual data, it has been shown that it is possible to produce excellent results in very challenging tasks, such as visual object recognition, detection and tracking. Nevertheless, in certain tasks such as fine-grained object recognition (e.g., face recognition) it is very difficult to collect the amount of data that is needed. In this talk, I will show how, using DCNNs, we can generate highly realistic faces and heads and use them for training algorithms such as face and facial expression recognition. Next, I will reverse the problem and demonstrate how a very powerful trained face recognition network can be used to perform very accurate 3D shape and texture reconstruction of faces from a single image. Finally, I will demonstrate how to create very lightweight networks for representing 3D face texture and shape structure by capitalising upon intrinsic mesh convolutions.
Organizers: Dimitris Tzionas
Much existing work in reinforcement learning involves environments that are either intentionally neutral, lacking a role for cooperation and competition, or intentionally simple, in which agents need imagine nothing more than that they are playing versions of themselves. Richer game-theoretic notions become important as these constraints are relaxed. For humans, this encompasses issues that concern utility, such as envy and guilt, and that concern inference, such as recursive modeling of other players. I will discuss studies treating a paradigmatic game of trust as an interactive partially-observable Markov decision process, and will illustrate the solution concepts with evidence from interactions between various groups of subjects, including those diagnosed with borderline and anti-social personality disorders.
The scenario approach is a broad methodology to deal with decision-making in an uncertain environment. By resorting to observations, or by sampling uncertainty from a given model, one obtains an optimization problem (the scenario problem), whose solution bears precise probabilistic guarantees in relation to new, unseen, situations. The scenario approach opens up new avenues to address data-based problems in learning, identification, finance, and other fields.
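As a rough illustration of the idea described above, the following sketch solves a toy one-dimensional scenario problem: choose the smallest threshold that covers every sampled scenario, then compare its empirical violation rate on fresh samples with a probabilistic bound. The uniform uncertainty model, the function names, and the specific bound (a standard result from the scenario-optimization literature for convex problems with d decision variables) are assumptions made for this example; the abstract itself does not state them.

```python
import math
import random


def scenario_solution(samples):
    """Smallest threshold x with x >= delta for every sampled scenario delta."""
    return max(samples)


def violation_bound(n, d, eps):
    """Bound on P(violation probability > eps) for a convex scenario problem
    with d decision variables and n i.i.d. scenarios (assumed standard result):
    sum_{i=0}^{d-1} C(n, i) * eps**i * (1 - eps)**(n - i)."""
    return sum(math.comb(n, i) * eps**i * (1 - eps) ** (n - i) for i in range(d))


def empirical_violation(x, sampler, trials, rng):
    """Fraction of fresh, unseen scenarios that violate the constraint x >= delta."""
    return sum(sampler(rng) > x for _ in range(trials)) / trials


def uniform_sampler(rng):
    # Hypothetical uncertainty model: delta is uniform on [0, 1].
    return rng.random()


rng = random.Random(0)
scenarios = [uniform_sampler(rng) for _ in range(200)]
x_star = scenario_solution(scenarios)
print("scenario solution:", x_star)
print("bound on P(violation > 0.05):", violation_bound(200, 1, 0.05))
print("empirical violation:", empirical_violation(x_star, uniform_sampler, 10_000, rng))
```

With one decision variable the bound reduces to (1 - eps)^n, so it shrinks geometrically as more scenarios are observed.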
Organizers: Sebastian Trimpe
Driven by the increasing demand for photorealistic computer-generated images, graphics is currently undergoing a substantial transformation to physics-based approaches which accurately reproduce the interaction of light and matter. Progress on both sides of this transformation -- physical models and simulation techniques -- has been steady but mostly independent of one another. When combined, the resulting methods are in many cases impracticably slow and require unrealistic workarounds to process even simple everyday scenes. My research lies at the interface of these two research fields; my goal is to break down the barriers between simulation techniques and the underlying physical models, and to use the resulting insights to develop realistic methods that remain efficient over a wide range of inputs.
I will cover three areas of recent work: the first involves volumetric modeling approaches to create realistic images of woven and knitted cloth. Next, I will discuss reflectance models for glitter/sparkle effects and arbitrarily layered materials that are specially designed to allow for efficient simulations. In the last part of the talk, I will give an overview of Manifold Exploration, a Markov Chain Monte Carlo technique that is able to reason about the geometric structure of light paths in high dimensional configuration spaces defined by the underlying physical models, and which uses this information to compute images more efficiently.
I will present selected research projects of the Photogrammetry and Remote Sensing Group at ETH, including (i) 3D scene flow estimation for stereo video captured from a car; (ii) extraction of road networks from aerial images; and (iii) 3D reconstruction from large, unstructured (e.g. crowd-sourced) image collections.
The growing scale of image and video datasets in vision makes labeling and annotation of such datasets, for training of recognition models, difficult and time-consuming. Further, richer models often require richer labelings of the data, which are typically even more difficult to obtain. In this talk I will focus on two models that make use of different forms of supervision for two different vision tasks.
In the first part of this talk I will focus on object detection. The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on object-level labels (e.g., “car”). We postulate that having a richer set of labelings (at different levels of granularity) for an object can significantly alleviate these problems. Such labelings include finer-grained sub-categories, consistent in appearance and view, and higher-order composites (contextual groupings of objects consistent in their spatial layout and appearance). However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is infeasible. To this end, we propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically with only traditional object-level category labels as input.
In the second part of the talk I will focus on a framework for large-scale image set and video summarization. Starting from the intuition that the characteristics of the two media types are different but complementary, we develop a fast and easily parallelizable approach for creating not only video summaries but also novel structural summaries of events in the form of storyline graphs. The storyline graphs can illustrate various events or activities associated with the topic in the form of a branching directed network. The video summarization is achieved by diversity ranking on the similarity graphs between images and video frames, thereby treating consumer images as essentially a form of weak supervision. The reconstruction of storyline graphs, on the other hand, is formulated as inference of sparse time-varying directed graphs from a set of photo streams with the assistance of consumer videos.
Time permitting I will also talk about a few other recent project highlights.
Abstract: I will present a general framework for modelling and recovering 3D shape and pose using subdivision surfaces. To demonstrate this framework's generality, I will show how to recover both a personalized rigged hand model from a sequence of depth images and a blend shape model of dolphin pose from a collection of 2D dolphin images. The core requirement is the formulation of a generative model in which the control vertices of a smooth subdivision surface are parameterized (e.g. with joint angles or blend weights) by a differentiable deformation function. The energy function that falls out of measuring the deviation between the surface and the observed data is also differentiable and can be minimized through standard, albeit tricky, gradient-based non-linear optimization from a reasonable initial guess. The latter can often be obtained using machine learning methods when manual intervention is undesirable. Satisfyingly, the "tricks" involved in the former are elegant and widen the applicability of these methods.
In order to avoid an expensive manual labeling process or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small or medium-sized objects is largely unexplored. We observe that videos with activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in these videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.
Facebook serves close to a billion people every day, who are only able to consume a small subset of the information available to them. In this talk I will give some examples of how machine learning is used to personalize people’s Facebook experience. I will also present some data science experiments with fairly counter-intuitive results.
In this talk I will discuss two related problems in 3D reconstruction: (i) recovering the 3D shape of a temporally varying non-rigid 3D surface given a single video sequence and (ii) reconstructing different instances of the same object class category given a large collection of images from that category. In both cases we extract dense 3D shape information by analysing shape variation -- in one case of the same object instance over time and in the other across different instances of objects that belong to the same class.
First I will discuss the problem of dense capture of 3D non-rigid surfaces from a monocular video sequence. We take a purely model-free approach where no strong assumptions are made about the object we are looking at or the way it deforms. We apply low rank and spatial smoothness priors to obtain dense non-rigid models using a variational approach.
Second I will describe our recent approach to populating the Pascal VOC dataset with dense, per-object 3D reconstructions, bootstrapped from class labels, ground truth figure-ground segmentations and a small set of keypoint annotations. Our proposed algorithm first estimates camera viewpoint using rigid structure-from-motion, then reconstructs object shapes by optimizing over visual hull proposals guided by loose within-class shape similarity assumptions.
Stochastic differential equations (SDEs) arise naturally as descriptions of continuous time dynamical systems. My talk addresses the problem of inferring the dynamical state and parameters of such systems from observations taken at discrete times. I will discuss the application of approximate inference methods such as the variational method and expectation propagation and show how higher dimensional systems can be treated by a mean field approximation. In the second part of my talk I will discuss the nonparametric estimation of the drift (i.e. the deterministic part of the ‘force’ which governs the dynamics) as a function of the state using Gaussian process approaches.
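To make the setting above concrete, here is a minimal sketch of inferring a drift parameter from observations taken at discrete times. It simulates an Ornstein-Uhlenbeck process with the Euler-Maruyama scheme and fits a linear drift by least squares on the observed increments. This is a deliberately simple parametric stand-in chosen for illustration, not the variational, expectation propagation, or Gaussian process methods discussed in the talk; the model, function names, and parameter values are all assumptions.

```python
import math
import random


def simulate_ou(theta, sigma, x0, dt, n_steps, rng):
    """Euler-Maruyama path of dX = -theta * X dt + sigma dW (assumed toy model)."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        # Brownian increment over a step of length dt has standard deviation sqrt(dt).
        noise = rng.gauss(0.0, 1.0) * math.sqrt(dt)
        xs.append(x - theta * x * dt + sigma * noise)
    return xs


def estimate_theta(xs, dt):
    """Least-squares fit of the linear drift f(x) = -theta * x to the increments."""
    num = sum(x * (x_next - x) for x, x_next in zip(xs, xs[1:]))
    den = sum(x * x for x in xs[:-1]) * dt
    return -num / den


rng = random.Random(0)
path = simulate_ou(theta=1.0, sigma=0.5, x0=0.0, dt=0.01, n_steps=50_000, rng=rng)
print("estimated theta:", estimate_theta(path, dt=0.01))
```

The estimate is noisy for short observation windows, which is one reason a probabilistic treatment of the drift, as in the talk, is attractive.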
Even though many challenges remain unsolved, in recent years computer graphics algorithms to render photo-realistic imagery have seen tremendous progress. An important prerequisite for high-quality renderings is the availability of good models of the scenes to be rendered, namely models of shape, motion and appearance. Unfortunately, the technology to create such models has not kept pace with the technology to render the imagery. In fact, we observe a content creation bottleneck, as it often takes man-months of tedious manual work by animation artists to craft models of moving virtual scenes.
To overcome this limitation, the research community has been developing techniques to capture models of dynamic scenes from real-world examples, for instance methods that rely on footage recorded with cameras or other sensors. Examples include performance capture methods that measure detailed dynamic surface models, for example of actors or an actor's face, from multi-view video and without markers in the scene. Even though such 4D capture methods have made big strides, they are still at an early stage of their development. Their application is limited to scenes of moderate complexity in controlled environments, reconstructed detail is limited, and captured content cannot be easily modified, to name only a few restrictions.
In this talk, I will elaborate on some ideas on how to go beyond this limited scope of 4D reconstruction, and show some results from our recent work. For instance, I will show how we can capture more complex scenes with many objects or subjects in close interaction, as well as very challenging scenes of a smaller scale, such as hand motion. The talk will also show how we can capitalize on more sophisticated light transport models and inverse rendering to enable high-quality reconstruction in much more uncontrolled scenes, eventually also outdoors, and with very few cameras. I will also demonstrate how to represent captured scenes such that they can be conveniently modified. If time allows, the talk will cover some of our recent ideas on how to perform advanced edits of videos (e.g. removing or modifying dynamic objects in scenes) by exploiting reconstructed 4D models, as well as robustly found inter- and intra-frame correspondences.
Organizers: Gerard Pons-Moll