Dynamic events such as family gatherings, concerts or sports events are often photographed by a group of people. The set of still images obtained this way is rich in dynamic content. We consider the question of whether such a set of still images, rather than traditional video sequences, can be used for analyzing the dynamic content of the scene. This talk will describe several instances of this problem, their solutions and directions for future studies. In particular, we will present a method that extends epipolar geometry to predict the location of a moving feature in CrowdCam images. The method assumes that the temporal order of the set of images, namely photo-sequencing, is given. We will briefly describe our method to compute photo-sequencing using geometric considerations and rank aggregation. We will also present a method for identifying the moving regions in a scene, which is a basic component in dynamic scene analysis. Finally, we will consider a new vision of developing collaborative CrowdCam, and a first step toward this goal.
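As background for the epipolar-geometry extension mentioned above, the standard static-scene epipolar constraint can be sketched as follows. This is only an illustration of the classical constraint, not the method presented in the talk. The function names, the camera intrinsics and the relative pose are hypothetical values chosen for the example. A static 3D point seen in two images lies on its epipolar line in the second image, whereas a point that moved between the two exposures generally does not, which is why the static constraint alone cannot predict the location of a moving feature.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix F with x2^T F x1 = 0, assuming X2 = R @ X1 + t."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def project(K, X):
    """Pinhole projection of a 3D point given in camera coordinates to pixels."""
    x = K @ X
    return x[:2] / x[2]

def epipolar_line(F, x1):
    """Epipolar line in image 2 for pixel x1 in image 1, scaled so that
    point-line distances come out in pixels."""
    l = F @ np.array([x1[0], x1[1], 1.0])
    return l / np.linalg.norm(l[:2])

def point_line_distance(l, x2):
    """Distance in pixels from pixel x2 to the normalized line l."""
    return abs(l @ np.array([x2[0], x2[1], 1.0]))

# Hypothetical setup: identical intrinsics, small rotation about the y axis.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([1.0, 0.0, 0.2])
F = fundamental_from_pose(K, K, R, t)

X_static = np.array([0.5, -0.2, 5.0])            # point in camera-1 coordinates
X_moved = X_static + np.array([0.0, 0.5, 0.0])   # same feature after it moved

line = epipolar_line(F, project(K, X_static))
d_static = point_line_distance(line, project(K, R @ X_static + t))
d_moved = point_line_distance(line, project(K, R @ X_moved + t))
```

In this setup `d_static` is zero up to floating-point error, while `d_moved` is many pixels, so the residual to the epipolar line is one simple cue that a feature has moved between shots.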
Organizers: Jonas Wulff
Deep Learning is one of the most successful machine learning approaches to artificial intelligence. In this talk I discuss the geometry of neural networks as a way to study the success of Deep Learning at a mathematical level and to develop a theoretical basis for making further advances, especially in situations with limited amounts of data and challenging problems in reinforcement learning. I present a few recent results on the representational power of neural networks and then demonstrate how to align this with structures from perception-action problems in order to obtain more efficient learning systems.
Organizers: Jane Walters
Human observers can classify photographs of real-world scenes after only a very brief exposure to the image (Potter & Levy, 1969; Thorpe, Fize, Marlot, et al., 1996; VanRullen & Thorpe, 2001). Line drawings of natural scenes have been shown to capture essential structural information required for successful scene categorization (Walther et al., 2011). Here, we investigate how the spatial relationships between lines and line segments in the line drawings affect scene classification. In one experiment, we tested the effect of removing either the junctions or the middle segments between junctions. Surprisingly, participants performed better when shown the middle segments (47.5%) than when shown the junctions (42.2%). It appeared as if the images with middle segments tended to maintain the most parallel/locally symmetric portions of the contours. In order to test this hypothesis, in a second experiment, we either removed the most symmetric half of the contour pixels or the least symmetric half of the contour pixels using a novel method of measuring the local symmetry of each contour pixel in the image. Participants were much better at categorizing images containing the most symmetric contour pixels (49.7%) than the least symmetric (38.2%). Thus, results from both experiments demonstrate that local contour symmetry is a crucial organizing principle in complex real-world scenes. Joint work with John Wilder (UofT CS, Psych), Morteza Rezanejad (McGill CS), Kaleem Siddiqi (McGill CS), Allan Jepson (UofT CS), and Dirk Bernhardt-Walther (UofT Psych), to be presented at VSS 2017.
Organizers: Ahmed Osman
In this talk we present some recent results on human action recognition in videos. We first show how to use human pose for action recognition. To this end we propose a new pose-based convolutional neural network descriptor for action recognition, which aggregates motion and appearance information along tracks of human body parts. Next, we present an approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame level and then tracks high-scoring proposals in the video. Our tracker relies simultaneously on instance-level and class-level detectors. Actions are localized in time with a sliding-window approach at the track level. Finally, we show how to extend this method to weakly supervised learning of actions, which allows scaling to large amounts of data without manual annotation.
Typical human actions such as hand-shaking and drinking last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of single frames or short video clips and fail to model actions at their full temporal scale. In this work we learn video representations using neural networks with long-term temporal convolutions. We demonstrate that CNN models with increased temporal extents improve the accuracy of action recognition despite reduced spatial resolution. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition, UCF101 and HMDB51.
Proper handling of occlusions is a big challenge for model-based reconstruction. In multi-view motion capture, for example, a major difficulty is the handling of occluding body parts. We propose a smooth volumetric scene representation, which implicitly converts occlusion into a smooth and differentiable phenomenon (ICCV2015). Our ray tracing image formation model helps to express the objective in a single closed-form expression. This is in contrast to existing surface (mesh) representations, where occlusion is a local effect, causes non-differentiability, and is difficult to optimize. We demonstrate improvements for multi-view scene reconstruction, rigid object tracking, and motion capture. Moreover, I will show an application of motion tracking to the interactive control of virtual characters (SigAsia2015).
The core focus of my research is on robot perception. Within this broad categorization, I am mainly interested in understanding how teams of robots and sensors can cooperate and/or collaborate to improve the perception of themselves (self-localization) as well as their surroundings (target tracking, mapping, etc.). In this talk I will describe the inter-dependencies of such perception modules and present state-of-the-art methods to perform unified cooperative state estimation. The trade-off between accuracy of estimation and computational speed will be highlighted through a new optimization-based method for unified-state estimation. Furthermore, I will also describe how perception-based multirobot formation control can be achieved. Towards the end, I will present some recent results on cooperative vision-based target tracking and a few comments on our ongoing work regarding cooperative aerial mapping with human-in-the-loop.
Modeling and reconstruction of shape and motion are problems of fundamental importance in computer vision. Inverse Problem theory constitutes a powerful mathematical framework for dealing with ill-posed problems such as those typically arising in shape and motion modeling. In this talk, I will present methods inspired by Inverse Problem theory for dealing with four different shape and motion modeling problems. In particular, in the context of shape modeling, I will present a method for component-wise modeling of articulated objects and its application in computing 3D models of animals. Additionally, I will discuss the problem of modeling specular surfaces via the properties of their material, and I will also present a model for confidence-driven depth image fusion based on total variation regularization. Regarding motion, I will discuss a method for the recognition of human actions from motion capture data based on Nonparametric Bayesian models.
Computer vision on flying robots, or UAVs, brings its own challenges, especially if conducted in real time. On-board processing is limited by tight weight and size constraints for the electronics, while off-board processing is challenged by signal delays and connection quality, especially considering the data rates required for high-frame-rate, high-resolution video. Unlike ground-based vehicles, precision odometry is unavailable. Positional information is provided by GPS, which can have signal losses and limited precision, especially near terrain. Exact orientation can be even more problematic due to magnetic interference and vibration affecting sensors. In this talk I'd like to present and discuss some examples of practical problems encountered when trying to get robotics airborne, as well as possible solutions.
Organizers: Alina Allakhverdieva
The development of increasingly intelligent and autonomous technologies will inevitably lead to these systems having to face morally problematic situations. This is particularly true of artificial systems that are used in geriatric care environments. It will, therefore, be necessary in the long run to develop machines which have the capacity for a certain amount of autonomous moral decision-making. The goal of this talk is to provide the theoretical foundations for artificial morality, i.e., for implementing moral capacities in artificial systems in general and a roadmap for developing an assistive system in geriatric care which is capable of moral learning.
Inverse problems are ubiquitous in image processing and applied science in general. Such problems describe the challenge of computing the parameters that characterize a system from the outcomes. While this might seem easy at first for simple systems, many inverse problems share a property that makes them much more intricate: they are ill-posed. This means that either the problem does not have a unique solution or this solution does not depend continuously on the outcomes of the system. Bayesian statistics provides a framework that allows such problems to be treated in a systematic way. The missing piece of information is encoded as a prior distribution on the space of possible solutions. In this talk, we will study probabilistic image models as priors for statistical inversion. In particular, we will give a probabilistic interpretation of the classical TV-prior and discuss how this interpretation can be used as a starting point for more complex models. We will see that many important auxiliary quantities such as edges and regions can be incorporated into the model in the form of latent variables. This leads to the conjecture that many image processing tasks, such as denoising and segmentation, should not be considered separately, but instead be treated together.
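One common way to write down the probabilistic reading of the TV-prior is sketched below. The setup is an assumption made for illustration and is not taken from the abstract: a linear forward operator $A$, observed data $f$, additive Gaussian noise with variance $\sigma^2$, and a prior weight $\lambda > 0$. Under these assumptions the maximum a posteriori estimate coincides with the classical TV-regularized variational problem.

```latex
\begin{align}
p(u) &\propto \exp\bigl(-\lambda\,\mathrm{TV}(u)\bigr),
  \qquad \mathrm{TV}(u) = \int_\Omega |\nabla u|\,dx, \\
p(f \mid u) &\propto \exp\Bigl(-\tfrac{1}{2\sigma^2}\,\|Au - f\|_2^2\Bigr), \\
\hat{u}_{\mathrm{MAP}} &= \arg\max_u \; p(u \mid f)
  = \arg\min_u \; \tfrac{1}{2\sigma^2}\,\|Au - f\|_2^2 + \lambda\,\mathrm{TV}(u).
\end{align}
```

With $A$ equal to the identity, the last line is the familiar TV denoising functional, so the regularizer can be read as the negative log of the prior.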
The detection and characterization of planets orbiting stars other than the Sun, i.e., so-called extrasolar planets, is one of the fastest growing and most vibrant research fields in modern astrophysics. In the last 25 years, more than 5400 extrasolar planets and planet candidates were revealed, but the vast majority of these objects were detected with indirect techniques, where the existence of the planet is inferred from periodic changes in the light coming from the central star. No photons from the planets themselves are detected. In this talk, however, I will focus on the direct detection of extrasolar planets. On the one hand, I will describe the main challenges that have to be overcome in order to image planets around other stars. In addition to using the world’s largest telescopes and optimized cameras, it was realized in the last few years that significant sensitivity gains can be achieved by applying advanced image processing techniques. On the other hand, I will demonstrate what can be learned if one is successful in “taking a picture” of an extrasolar planet. After all, there must be good scientific reasons and a strong motivation why the direct detection of extrasolar planets is one of the key science drivers for current and future projects on major ground- and space-based telescopes.
Organizers: Diana Rebmann
Helga Griffiths is a multi-sense artist working at the intersection of science and art. She has been working for over 20 years on the integration of various sensory stimuli into her “multi-sense” installations. Typical of her work is the creation of a sensory experience that transcends conventional boundaries of perception.
Organizers: Emma-Jayne Holderness