Upcoming Events

Symposium: Machine learning for computer vision

22.02.2018 (Thursday) - 23.02.2018 (Friday), 09:00 - 14:00
APB-1004, Nöthnitzer Str. 46, 01187 Dresden


9 am: Prof. Björn Andres (University of Tübingen + MPI for Informatics)

Talk: "Clustering for Computer Vision and Biomedical Image Analysis"


A central task in the field of machine learning for computer vision is to break things into pieces. Decompositions of a graph are a mathematical abstraction of the possible outcomes. This talk is about a generalization of the correlation clustering problem whose feasible solutions relate one-to-one to the decompositions of a graph, and whose objective function assigns a cost or reward to pairs of nodes being in distinct components. Toward methods, it sketches combinatorial algorithms for finding feasible solutions of large instances in practice, polyhedral algorithms for finding lower bounds, and generalizations of well-known properties of multicut polytopes. Toward applications, it shows results obtained by these algorithms for diverse tasks in the fields of computer vision and biomedical image analysis, including image segmentation, multiple object tracking and human body pose estimation.

11 am: Prof. Angela Yao (University of Bonn)

Talk: "Looking at people: non-supervised learning for poses and actions"


My research is about the representations, models, and methods for automatically analyzing imagery of people. In particular, I am interested in non-supervised methods, and I will discuss two recent works on 3D hand pose estimation and complex activity understanding.

We model the statistical relationships of 3D hand poses and corresponding depth images using two deep generative models with a shared latent space. By design, our architecture allows for learning from unlabeled image data in a semi-supervised manner. Assuming a one-to-one mapping between a pose and a depth map, any given point in the shared latent space can be projected into both a hand pose and a corresponding depth map. Regressing the hand pose can then be done by learning a discriminator to estimate the posterior of the latent pose given some depth map. To improve generalization and to better exploit unlabeled depth maps, we jointly train a generator and a discriminator. At each iteration, the generator is updated with the back-propagated gradient from the discriminator to synthesize realistic depth maps of the articulated hand, while the discriminator benefits from an augmented training set of synthesized and unlabeled samples. The proposed discriminator network is highly efficient and runs at 90 FPS on the CPU with state-of-the-art accuracies on three publicly available benchmarks.

We present a new method for unsupervised segmentation of complex activities from video into multiple steps or sub-activities. We propose an iterative discriminative-generative approach which alternates between discriminatively learning a mapping from the video's visual features to sub-activity labels, which captures the appearance of sub-activities, and generatively modelling the temporal structure of sub-activities using a Generalized Mallows Model. In addition, we introduce a background model to account for frames unrelated to the actual activities. Our approach is validated on the challenging Breakfast Actions and Inria Instructional Videos datasets and outperforms both unsupervised and weakly-supervised state-of-the-art methods.

2 pm: Prof. Matthias Nießner (TU Munich)

Talk: "3D Reconstruction and Understanding of the Real World"


In this talk, I will cover our latest research on 3D reconstruction and semantic scene understanding. To this end, we use modern machine learning techniques, in particular deep learning algorithms, in combination with traditional computer vision approaches. Specifically, I will talk about real-time 3D reconstruction using RGB-D sensors, which enable us to capture high-fidelity geometric representations of the real world. In a new line of research, we use these representations as input to 3D Neural Networks that infer semantic class labels and object classes directly from the volumetric input. In order to train these data-driven learning methods, we introduce several annotated datasets, such as ScanNet and Matterport3D, that are directly annotated in 3D and allow tailored volumetric CNNs to achieve remarkable accuracy. In addition to these discriminative tasks, we put a strong emphasis on generative models. For instance, we aim to predict missing geometry in occluded regions, and obtain completed 3D reconstructions with the goal of eventual use in production applications. We believe that this research has significant potential for application in content creation scenarios (e.g., for Virtual and Augmented Reality) as well as in the field of Robotics where autonomous entities need to obtain an understanding of the surrounding environment.



9 am: Prof. Bastian Leibe (RWTH Aachen) "t.b.a."

11 am: Dr. Mario Fritz (MPI for Informatics, Saarbrücken)

Talk: "Towards Scalable and Holistic Learning and Inference"


With the advance of new sensor technology and abundant data resources, machines can get a detailed “picture” of the real world in a way that was never possible before. The previously wide gap between these raw data sources and the semantic understanding of humans is starting to close. Driven by big data, increased compute power and advances in machine learning, we see a new generation of systems emerging that achieve new levels of performance on a range of competences such as visual scene understanding and natural language comprehension as well as robotic control.

In this talk, I will first elaborate on how we evaluate and progress such computational intelligence with a modern approach to the Turing Test, where the task is to answer natural language questions on images. Second, I will outline my recent work on deep learning that spans application domains from computer vision, graphics, and robotics to the natural sciences. Lastly, I will discuss the implications for privacy and safety that these new learning techniques have when deployed in future intelligent systems. (http://scalable.mpi-inf.mpg.de)

2 pm: Dr. Torsten Sattler (ETH Zurich)

Talk: "The Semantics of Visual Localization and Mapping"


3D scene perception is a key ability for robots, as well as for any type of intelligent system designed to operate in the real world. Among 3D scene perception algorithms, methods for 3D mapping reconstruct 3D models of scenes from camera images. Visual localization techniques in turn use these maps to determine the position and orientation of one or more cameras in the world. Visual localization and mapping are thus fundamental problems that need to be solved reliably and robustly in order to enable autonomous agents such as self-driving cars or drones. At the same time, localization and mapping algorithms are key technologies for Mixed and Augmented Reality applications.

Over the past years and decades, tremendous progress has been made in the area of 3D Computer Vision, including impressive results for localization and mapping. Still, localization and mapping techniques can be rather brittle in challenging scenarios that are highly relevant for practical applications. This talk gives an overview of these challenges and explains how a higher-level understanding of the environment can help to solve some of them. In particular, I will present algorithms for localization and 3D reconstruction that rely on semantic information. This higher level of abstraction allows them to succeed under challenging conditions that could not be handled by previous work relying on purely photometric or geometric cues. I will then outline how these techniques can be extended to tackle a certain family of open problems. Finally, I will conclude the talk with a set of examples showing that algorithms for 3D scene perception will need to become even “smarter” in order to allow complex scene interactions for robots and other types of intelligent systems.


Each talk will take about one hour, with a coffee or lunch break of about an hour between consecutive talks.
