Dynamic events such as family gatherings, concerts or sports events are often photographed by a group of people. The set of still images obtained this way is rich in dynamic content. We consider the question of whether such a set of still images, rather than a traditional video sequence, can be used for analyzing the dynamic content of the scene. This talk will describe several instances of this problem, their solutions and directions for future studies. In particular, we will present a method to extend epipolar geometry to predict the location of a moving feature in CrowdCam images. The method assumes that the temporal order of the set of images, namely photo-sequencing, is given. We will briefly describe our method to compute photo-sequencing using geometric considerations and rank aggregation. We will also present a method for identifying the moving regions in a scene, which is a basic component in dynamic scene analysis. Finally, we will consider a new vision of developing collaborative CrowdCam, and a first step toward this goal.
Organizers: Jonas Wulff
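As background for the CrowdCam abstract above, here is a minimal sketch of the standard two-view epipolar constraint that the talk proposes to extend. It is not the speakers' method. It assumes two hypothetical normalized pinhole cameras that differ only by a translation t, in which case the fundamental matrix reduces to the cross-product matrix of t. All function names and numbers are illustrative.

```python
# Sketch of the standard epipolar constraint x2^T F x1 = 0 (assumed setup:
# identity intrinsics and rotation, camera 2 coordinates are X + t).

def cross_matrix(t):
    """Return the 3x3 matrix [t]_x such that [t]_x v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def project(point):
    """Project a 3D point to homogeneous normalized image coordinates."""
    return [point[0] / point[2], point[1] / point[2], 1.0]

def epipolar_residual(F, x1, x2):
    """Evaluate x2^T F x1; zero means x2 lies on the epipolar line of x1."""
    line = mat_vec(F, x1)  # epipolar line in image 2
    return sum(a * b for a, b in zip(line, x2))

t = [1.0, 0.0, 0.0]                 # hypothetical baseline between the cameras
F = cross_matrix(t)

X = [0.5, -0.2, 4.0]                # scene point at the time of image 1
x1 = project(X)

# Static point: same 3D position when image 2 is taken.
x2_static = project([X[i] + t[i] for i in range(3)])
static_residual = epipolar_residual(F, x1, x2_static)

# Moving point: it has shifted vertically before image 2 is taken.
X_moved = [0.5, 0.6, 4.0]
x2_moving = project([X_moved[i] + t[i] for i in range(3)])
moving_residual = epipolar_residual(F, x1, x2_moving)
```

In this setup a static point gives a residual of zero, while a point that moved between the two exposures generally does not, which is why the static constraint alone cannot locate moving features.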
Deep Learning is one of the most successful machine learning approaches to artificial intelligence. In this talk I discuss the geometry of neural networks as a way to study the success of Deep Learning at a mathematical level and to develop a theoretical basis for making further advances, especially in situations with limited amounts of data and challenging problems in reinforcement learning. I present a few recent results on the representational power of neural networks and then demonstrate how to align this with structures from perception-action problems in order to obtain more efficient learning systems.
Organizers: Jane Walters
Human observers can classify photographs of real-world scenes after only a very brief exposure to the image (Potter & Levy, 1969; Thorpe, Fize, Marlot, et al., 1996; VanRullen & Thorpe, 2001). Line drawings of natural scenes have been shown to capture essential structural information required for successful scene categorization (Walther et al., 2011). Here, we investigate how the spatial relationships between lines and line segments in the line drawings affect scene classification. In one experiment, we tested the effect of removing either the junctions or the middle segments between junctions. Surprisingly, participants performed better when shown the middle segments (47.5%) than when shown the junctions (42.2%). It appeared as if the images with middle segments tended to maintain the most parallel/locally symmetric portions of the contours. In order to test this hypothesis, in a second experiment, we either removed the most symmetric half of the contour pixels or the least symmetric half of the contour pixels using a novel method of measuring the local symmetry of each contour pixel in the image. Participants were much better at categorizing images containing the most symmetric contour pixels (49.7%) than the least symmetric (38.2%). Thus, results from both experiments demonstrate that local contour symmetry is a crucial organizing principle in complex real-world scenes. Joint work with John Wilder (UofT CS, Psych), Morteza Rezanejad (McGill CS), Kaleem Siddiqi (McGill CS), Allan Jepson (UofT CS), and Dirk Bernhardt-Walther (UofT Psych), to be presented at VSS 2017.
Organizers: Ahmed Osman
Organizers: Jane Walters
In this talk, I will start by describing the pervasiveness of image and video content, and how such content is growing with the ubiquity of cameras. I will use this to motivate the need for better tools for analysis and enhancement of video content. I will start with some of our earlier work on temporal modeling of video, then lead up to some of our current work and describe two main projects. (1) Our approach for a video stabilizer, currently implemented and running on YouTube, and its extensions. (2) A robust and scalable method for video segmentation. I will describe, in some detail, our video stabilization method, which generates stabilized videos and is in wide use. Our method allows for video stabilization beyond the conventional filtering that only suppresses high-frequency jitter. This method also supports removal of the rolling shutter distortions common in modern CMOS cameras, which capture the frame one scan-line at a time, resulting in non-rigid image distortions such as shear and wobble. Our method does not rely on a priori knowledge and works on video from any camera or on legacy footage. I will showcase examples of this approach and also discuss how this method is launched and running on YouTube, with millions of users. Then I will describe an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. This hierarchical approach generates high-quality segmentations, and we demonstrate the use of this segmentation as users interact with the video, enabling efficient annotation of objects within the video. I will also show some recent work on how this segmentation and annotation can be used to do dynamic scene understanding. I will then follow up with some recent work on image and video analysis in the mobile domain. I will also make some observations about the ubiquity of imaging and video in general and the need for better tools for video analysis.
Organizers: Naejin Kong
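For orientation, the "conventional filtering that only suppresses high-frequency jitter" mentioned in the stabilization abstract above can be sketched as a low-pass filter applied to an estimated camera path. This is the baseline the talk contrasts itself with, not the method running on YouTube. The one-dimensional path, the window radius and all function names are assumptions made for illustration.

```python
# Baseline sketch: smooth a 1D camera path with a moving average and derive
# the per-frame correction that moves each frame onto the smoothed path.

def moving_average(path, radius):
    """Low-pass filter: average each sample with neighbors within `radius`."""
    smoothed = []
    for i in range(len(path)):
        lo = max(0, i - radius)
        hi = min(len(path), i + radius + 1)
        window = path[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

def stabilizing_offsets(path, radius):
    """Per-frame shift that maps the original path onto the smoothed one."""
    smooth = moving_average(path, radius)
    return [s - p for s, p in zip(smooth, path)]

def roughness(path):
    """Sum of squared second differences, a simple measure of jitter."""
    return sum((path[i + 1] - 2 * path[i] + path[i - 1]) ** 2
               for i in range(1, len(path) - 1))

# Hypothetical camera x-position per frame: steady pan plus alternating jitter.
jittery_path = [i + (0.5 if i % 2 == 0 else -0.5) for i in range(10)]
smooth_path = moving_average(jittery_path, radius=1)
offsets = stabilizing_offsets(jittery_path, radius=1)
```

A filter of this kind reduces frame-to-frame jitter but does nothing about rolling shutter distortion, which varies within a single frame.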
Optics with long focal lengths have been extensively used for shooting 2D cinema and television, either to virtually get closer to the scene or to produce an aesthetic effect through the deformation of the perspective. However, in 3D cinema or television, the use of long focal lengths either creates a "cardboard effect" or causes visual divergence. To overcome this problem, state-of-the-art methods use disparity mapping techniques, which are a generalization of view interpolation, and generate new stereoscopic pairs from the two image sequences. We propose to use more than two cameras to address the remaining issues in disparity mapping methods. In the first part of the talk, we briefly review the causes of visual fatigue and visual discomfort when viewing a stereoscopic film. We model the depth perception from stereopsis of a 3D scene shot with two cameras, and projected in a movie theater or on a 3DTV. We mathematically characterize this 3D distortion, and derive the mathematical constraints associated with the causes of visual fatigue and discomfort. We illustrate these 3D distortions with new interactive software, "The Virtual Projection Room". In order to generate the desired stereoscopic images, we propose to use image-based rendering. These techniques usually proceed in two stages. First, the input images are warped into the target view, and then the warped images are blended together. The warps are usually computed with the help of a geometric proxy (either implicit or explicit). Image blending has been extensively addressed in the literature and a few heuristics have proven to achieve very good performance. Yet the combination of the heuristics is not straightforward, and requires manual adjustment of many parameters. We present a new Bayesian approach to the problem of novel view synthesis, based on a generative model taking into account the uncertainty of the image warps in the image formation model.
The Bayesian formalism allows us to deduce the energy of the generative model and to compute the desired images as the Maximum a Posteriori estimate. The method outperforms state-of-the-art image-based rendering techniques on challenging datasets. Moreover, the energy equations provide a formalization of the heuristics widely used in image-based rendering techniques. In addition, the proposed generative model addresses the problem of super-resolution, making it possible to render images at a higher resolution than the initial ones. In the last part of the presentation, we apply the new rendering technique to the case of the stereoscopic zoom.
The visual effects and entertainment industries are now a fundamental part of the computer graphics and vision landscapes, and have an impact across society in general. Key issues in this area include the creation of realistic characters, creating assets for production, and improving workflow. Advances in computer graphics, vision and rendering have underpinned much of the success of these industries, built on top of academic advances. However, there are still many unsolved problems. In this talk I will outline some of the challenges we have faced in crossing over academic research into the visual effects industry. In particular, I will attempt to distinguish between academic challenges and industrial demands we have experienced - and how this has impacted projects. This draws on experience in several themes involving leading visual effects and entertainment companies. Our work has been in several diverse areas, including on-set capture, digital doubles, real-time animation and motion capture retargeting. I will describe how many of these problems led us to step back and focus on first solving more fundamental computer vision research problems - particularly in the areas of optical flow, non-rigid tracking and shadow removal - and how these opened up other opportunities. Some of these projects are supported through our Centre for Digital Entertainment (CDE) - which has 60 PhD-level students embedded across the creative industries in the UK. Others are more specific to partners at The Imaginarium and Double Negative Visual Effects. Attempting to draw these experiences together, we are now starting a new Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA), with leading partners across entertainment, elite sport and rehabilitation.
Organizers: Silvia Zuffi
Current object class detection methods typically target 2D bounding box localization, encouraged by benchmark data sets, such as Pascal VOC. While this seems suitable for the detection of individual objects, higher-level applications, such as autonomous driving and 3D scene understanding, would benefit from more detailed and richer object hypotheses. In this talk I will present our recent work on building more detailed object class detectors, bridging the gap between higher-level tasks and state-of-the-art object detectors. I will present a 3D object class detection method that can reliably estimate the 3D position, orientation and 3D shape of objects from a single image. Based on state-of-the-art CNN features, the method is a carefully designed 3D detection pipeline where each step is tuned for better performance, resulting in a registered CAD model for every object in the image. In the second part of the talk, I will focus on our work on what is holding back convolutional neural nets for detection. We analyze the R-CNN object detection pipeline in combination with state-of-the-art network architectures (AlexNet, GoogLeNet and VGG16). Focusing on two central questions, what did the convnets learn and what can they learn, we illustrate that the three network architectures suffer from the same weaknesses, and that these downsides cannot be alleviated by simply introducing more data. Therefore we conclude that architectural changes are needed. Furthermore, we show that additional, synthetically generated training data, sampled from the modes of the data distribution, can further increase the overall detection performance, while still suffering from the same weaknesses. Last, we hint at the complementary nature of the features of the three network architectures considered in this work.
Most computer vision systems cannot take advantage of the abundance of Internet videos as training data. This is because current methods typically learn under strong supervision and require expensive manual annotations (e.g. videos need to be temporally trimmed to cover the duration of a specific action, object bounding boxes, etc.). In this talk, I will present two techniques that can lead to learning the behavior and the structure of articulated object classes (e.g. animals) from videos, with as little human supervision as possible. First, we discover the characteristic motion patterns of an object class from videos of objects performing natural, unscripted behaviors, such as tigers in the wild. Our method generates temporal intervals that are automatically trimmed to one instance of the discovered behavior, and clusters them by type (e.g. running, turning head, drinking water). Second, we automatically recover thousands of spatiotemporal correspondences within the discovered clusters of behavior, which allow mapping pixels of an instance in one video to those of a different instance in a different video. Both techniques rely on a novel motion descriptor modeling the relative displacement of pairs of trajectories, which is more suitable for articulated objects than state-of-the-art descriptors using single trajectories. We provide extensive quantitative evaluation on our new dataset of tiger videos, which contains more than 100k fully annotated frames.
Organizers: Laura Sevilla
Coherent light enables optical measurements of exquisite sensitivity, advancing technologies for improved sensing and autonomous systems. From my previous work on gravitational physics, I will present a brief overview of technologies for laser-interferometric gravitational-wave observatories, particularly within the scope of the European mission LISA Pathfinder, to be launched at the end of 2015. In addition, I will talk about my current work in novel and highly compact optomechanical systems and photonic crystals, optical micro-resonators with sensitivities below femtometer levels, as well as fiber-based non-destructive inspection techniques. To conclude, I will share my ideas on how to expand my research in optomechanical systems, optical technologies, and real-time data and image processing towards applications in robotics and intelligent systems.
Organizers: Senya Polikovsky
In machine learning, the standard explanation of Ockham's razor is to minimize predictive risk. But prediction is interpreted passively---one may not rely on predictions to change the probability distribution used for training. That limitation may be overcome by studying alternatively manipulated systems in randomized experimental trials, but experiments on multivariate systems or on human subjects are often infeasible or immoral. Happily, the past three decades have witnessed the development of a range of statistical techniques for discovering causal relations from non-experimental data. One characteristic of such methods is a strong Ockham bias toward simpler causal theories---i.e., theories with fewer causal connections among the variables of interest. Our question is what Ockham's razor has to do with finding true (rather than merely plausible) causal theories from non-experimental data. The traditional story of minimizing predictive risk does not apply, because uniform consistency is often infeasible in non-experimental causal discovery: without strong and implausible assumptions, the probability of erroneous causal orientation may be arbitrarily high at any sample size. The standard justification for causal discovery methods is point-wise consistency, or convergence in probability to the true causes. But Ockham's razor is not necessary for point-wise convergence: a Bayesian with a strong prior bias toward a complex model would also be point-wise consistent. Either way, the crucial Ockham bias remains disconnected from learning performance. A method reverses its opinion in probability when it probably says A at some sample size and probably says B incompatible with A at a higher sample size. A method cycles in probability when it probably says A, then probably says B incompatible with A, and then probably says A again. Uniform consistency allows for no reversals or cycles in probability. Point-wise consistency allows for arbitrarily many. 
Lying plausibly between those two extremes is straightest possible convergence to the truth, which allows for only as many cycles and reversals in probability as are necessary to solve the learning problem at hand. We show that Ockham's razor is necessary for cycle-minimal convergence and that patience, or waiting for nature to choose among simplest theories, is necessary for reversal-minimal convergence. The idea yields very tight constraints on inductive statistical methods, both classical and Bayesian, with causal discovery methods as an important special case. It also provides a valid interpretation of significance and power when tests are used to fish inductively for models. The talk is self-contained for a general scientific audience. Novel concepts are illustrated amply with figures and simulations.
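The notions of reversal and cycle defined in the abstract above can be illustrated with a deliberately simplified, deterministic sketch. It assumes that a method's output is recorded as one label per increasing sample size and that distinct labels stand for mutually incompatible conclusions. The probabilistic qualifier ("in probability") is dropped, so this is a counting aid and not the formal definition used in the talk. The function name is hypothetical.

```python
# Count reversals (switching to a different conclusion) and cycles (returning
# to a conclusion that was previously abandoned) in a sequence of outputs.

def count_reversals_and_cycles(outputs):
    """Return (reversals, cycles) for a sequence of conclusion labels."""
    reversals = 0
    cycles = 0
    abandoned = set()   # conclusions the method has held and then dropped
    previous = None
    for current in outputs:
        if previous is not None and current != previous:
            reversals += 1
            abandoned.add(previous)
            if current in abandoned:
                cycles += 1
        previous = current
    return reversals, cycles

steady = count_reversals_and_cycles(["A", "A", "A"])
one_reversal = count_reversals_and_cycles(["A", "A", "B"])
one_cycle = count_reversals_and_cycles(["A", "B", "A"])
```

Under this simplification, a method that never changes its conclusion has zero reversals and zero cycles, while the sequence A, B, A counts as two reversals and one cycle.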
The propagation of waves in inhomogeneous media is a vast subject, spanning many different research communities. The ability of waves to interfere leads to the celebrated phenomenon of Anderson localization. Constructive interference increases the probability of return and therefore it can reduce or even cancel the propagation in a disordered medium. Anderson localization was first predicted for electrons in 'dirty' condensed matter systems; very soon, however, it was generalized to all kinds of waves and has since been studied with light, microwaves, ultrasound, and ultracold atoms. Here I will give a brief introduction to the basic ideas of Anderson physics and mention some applications. In fact, I will argue that disorder can be used as a resource rather than being a nuisance. I will discuss ultracold atoms as a good candidate for studying Anderson localization and wave propagation in disorder in general, and present related experiments.
Organizers: Senya Polikovsky
The external world is represented in the brain as spatiotemporal patterns of electrical activity. Sensory signals, such as light, sound, and touch, are transduced at the periphery and subsequently transformed by various stages of neural circuitry, resulting in increasingly abstract representations through the sensory pathways of the brain. It is these representations that ultimately give rise to sensory perception. Deciphering the messages conveyed in the representations is often referred to as “reading the neural code”. True understanding of the neural code requires knowledge of not only the representation of the external world at one particular stage of the neural pathway, but ultimately how sensory information is communicated from the periphery to successive downstream brain structures. Our laboratory has focused on various challenges posed by this problem, some of which I will discuss. In contrast, prosthetic devices designed to augment or replace sensory function rely on the principle of artificially activating neural circuits to induce a desired perception, which we might refer to as “writing the neural code”. This poses significant challenges not only in biomaterials and interfaces, but also in knowing precisely what to tell the brain to do. Our laboratory has begun some preliminary work in this direction that I will discuss. Taken together, an understanding of these complexities and others is critical for understanding how information about the outside world is acquired and communicated to downstream brain structures, for relating spatiotemporal patterns of neural activity to sensory perception, and for the development of engineered devices for replacing or augmenting sensory function lost to trauma or disease.
Organizers: Jonas Wulff