One of the most challenging tasks in Computer Vision is to endow computers with the ability to discover the underlying relationships between the objects in a scene. The large amount of available labeled data, together with the fast progress in deep learning, has significantly advanced many Computer Vision tasks, such as object segmentation, optical flow estimation and action recognition. However, a truly intelligent system would ideally also be able to infer the high-level semantics underlying human actions, such as motivation, intent and emotion. Since all human actions involve some uncertainty, I would like to develop new methodologies, or enhance existing ones, that incorporate such uncertainties. So far, I have worked on the 3D reconstruction task, developing a model that incorporates uncertainties in the image formation process.
Doctor of Philosophy (Ph.D.) (April 2017 - now)
Max Planck Institute for Intelligent Systems and ETH Zurich as part of the Max Planck ETH Center for Learning Systems
Diploma in Electrical and Computer Engineering (September 2009 - December 2015)
Department of Electrical and Computer Engineering of Aristotle University of Thessaloniki, in Greece
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2018 (inproceedings)
In this paper, we consider the problem of reconstructing a dense 3D model using images captured from different views. Recent methods based on convolutional neural networks (CNN) allow learning the entire task from data. However, they do not incorporate the physics of image formation such as perspective geometry and occlusion. Instead, classical approaches based on Markov Random Fields (MRF) with ray-potentials explicitly model these physical processes, but they cannot cope with large surface appearance variations across different viewpoints. In this paper, we propose RayNet, which combines the strengths of both frameworks. RayNet integrates a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion. We train RayNet end-to-end using empirical risk minimization. We thoroughly evaluate our approach on challenging real-world datasets and demonstrate its benefits over a piece-wise trained baseline, hand-crafted models as well as other learning-based approaches.
In 25th European Signal Processing Conference (EUSIPCO), IEEE, August 2017 (inproceedings)
This paper introduces a family of local feature aggregation functions and a novel method to estimate their parameters, such that they generate optimal representations for classification (or any task that can be expressed as a cost function minimization problem). To achieve that, we compose the local feature aggregation function with the classifier cost function and we backpropagate the gradient of this cost function in order to update the local feature aggregation function parameters. Experiments on synthetic datasets indicate that our method discovers parameters that model the class-relevant information in addition to the local feature space. Further experiments on a variety of motion and visual descriptors, both on image and video datasets, show that our method outperforms other state-of-the-art local feature aggregation functions, such as Bag of Words, Fisher Vectors and VLAD, by a large margin.
In Proceedings of the 2016 ACM on Multimedia Conference, pages 332-336, ACM, October 2016 (inproceedings)
This paper introduces fsLDA, a fast variational inference method for supervised LDA, which overcomes the computational limitations of the original supervised LDA and enables its application in large-scale video datasets. In addition to its scalability, our method also overcomes the drawbacks of standard, unsupervised LDA for video, including its focus on dominant but often irrelevant video information (e.g. background, camera motion). As a result, experiments in the UCF11 and UCF101 datasets show that our method consistently outperforms unsupervised LDA in every metric. Furthermore, analysis shows that class-relevant topics of fsLDA lead to sparse video representations and encapsulate high-level information corresponding to parts of video events, which we denote "micro-events".