Recent Achievements

Paper about M3 accepted at ASPLOS 2016


M3 (microkernel-based system for heterogeneous manycores) is an operating system, which is designed to support arbitrary cores as first-class citizens. That is, M3 provides spatial isolation and access to operating-system services on each core. This is achieved by abstracting the heterogeneity of the cores via a common hardware component per core, called DTU. Both isolation and operating-system services are provided at the network-on-chip level, so that they are available to all cores, independent of their internals. Now, the M3 paper [1] has been accepted at ASPLOS 16, the top conference on co-designing hardware and operating systems. The paper describes the co-design of M3 and the DTU as a new point in the design space for operating systems, which has advantages when integrating heterogeneous cores, and demonstrates that the performance does not suffer. In fact, M3 even outperforms Linux by a factor of five in some application-level benchmarks.

M3 is currently running on the Tomahawk2 chip, the simulator for the next generation of Tomahawk, Linux (by using it as a virtual machine) and gem5. Furthermore, it is available as open source.

The figure above shows an example platform with many heterogeneous cores (colored boxes), each having a local memory (light blue boxes) and a DTU. The M3 kernel is running on a dedicated core and remotely controls the other cores by configuring their DTUs accordingly. With this, M3 assigned a number of cores to each of the two applications in this example. Furthermore, M3 hosts a filesystem service on one core, which hands out memory capabilities to requesting applications, giving direct access to the data of the file in the global memory.

[1] Nils Asmussen, Marcus Völp, Benedikt Nöthen, Hermann Härtig, and Gerhard Fettweis. M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores. To appear in the proceedings of the Twenty-first International Conference on Architectural Support for Programming Languages and Operating Systems, April 2016.

Performance analysis on the Tomahawk2


The Orchestration Path employs the Tomahawk2 MPSoC as a demonstrator platform for the hardware/software stack. Cfaed members ported several applications (data bases, computational fluid dynamics, linear algebra) to the T2 and developed M3, an operating system for heterogeneous manycores. To enable graphical runtime performance analysis of these applications and the operating system, the first event tracing infrastructure for the T2 has been developed at the Center for Information Services and High Performance Computing (ZIH). Various types of events like memory transfers, messages, system calls, and user-defined source code regions are recorded for applications based on taskC and M3 and for internals of the M3 operating system as well. The Vampir tool, developed at ZIH and well known in the high performance computing community, is used to visualize the performance data. The performance measurement and analysis on Tomahawk2 allows to check performance assumptions, enables targeted tuning, and aids in design space exploration and performance modeling.

A first analysis framework to handle dynamic application behavior in dataflow applications


At the Chair for Compiler Construction cfaed members affiliated with the orchestration path have devised a series of approaches to analyse and efficiently cope with dynamic software application behaviors. The problem addressed is that current approaches for mapping and scheduling applications described as Kahn Process Networks (KPN) and Dynamic Data Flow (DDF) rely on assumptions on the program behavior specific to an execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time. This happens when a different data set induces a significantly different behavior.

The approach to analyse and handle this leverages inherent mathematical structures of the dataflow models of computation and the hardware architectures. On the side of the dataflow models, it relies on the monoid structure of histories and traces. This structure helps formalize the behavior of multiple executions of a given dynamic application. It uses metrics to define a formal framework for comparing the executions. On the side of the the hardware, the approach takes advantage of symmetries in the architecture to reduce the search space for the mapping problem. (Download paper.)


A Programming Environment for Particle-based Numerical Simulations


Domain-specific languages (DSLs) are of utmost importance in scientific high- performance computing to reduce development costs, raise the level of abstraction and, thus, make scientific programmer’s life easier. The parallel particle-mesh environment (PPME) is a DSL and a programming environment for numerical simulations based on the particle method. The main goals of PPME are 1) to provide a high-level, easy to learn language for developing particle-based simulations and 2) to leverage the domain- specific knowledge encoded in PPME programs for optimizing the generated code. To achieve these goals, PPME relies on a projectional editing approach rather than a conventional text editor. This, on the one hand, enables users to describe their problems using conventional mathematical notation and, on the other hand, it provides a typed graph-based data structure that paves the ground for automatic analyses and optimizations. PPME follows a generative approach: it generates parallel Fortran code that links with the parallel particle-mesh library (PPM), which has been developed by the MOSAIC group. PPM provides efficient implementations of the particle and mesh abstractions, discrete numerics, as well as an abstraction layer on the underlying high-performance computing hardware.

Screenshot of a Gray-Scott Reaction-Diffusion System implemented in PPME

Screenshot of a Gray-Scott Reaction-Diffusion System implemented in PPME

Recently, a first version has been released, in collaboration with the Chair of Scientific Computing for Systems Biology and the MOSAIC group, both headed by Ivo Sbalzarini (Biological Systems Path).


Outstanding Paper Award at HPCS 2014


With their paper „Scalable High-Quality 1D Partitioning” Matthias Lieber and Wolfgang Nagel received the outstanding paper award at the 2014 International Conference on High Performance Computing and Simulation (HPCS 2014).

Best Therory Paper Award at ETAPS 2014

Theroy and Toolsupport for Computing Conditional Probabilities in Markovian Models




















EATCS best theory paper award: Christel Baier, Joachim Klein, Sascha Klüppelholz and Steffen Märcker.Computing Conditional Probabilities in Markovian Models Efficiently. Proceedings of the 20th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'14) Volume 8413 of Lecture Notes in Computer Science, pages 515-530, Springer, 2014. (extended version)

Tomahawk 2 back in the lab


The Tomahawk 2 chip that had its tape out in march came back to our labs. The chip is a heterogeneous multi processor system on chip (MPSoC). A dynamic task scheduling unit, the CoreManager sits at its center, controlling the computation load of numerous processing elements (PEs). A special extension to the programming language C, namely taskC has been developed to provide an easy to use interface for the programmer that abstracts the distributed parallel computation from the available hardware. In taskC the programmer only defines tasks that can potentially are run in parallel to each other. This means that the programmer does not have to know about the underlying hardware. The CoreManager kows about the chips topology and can place generated tasks to available PEs at runtime even if the topology (e.g. the number of PEs) changes during execution. Furthermore, it can even change the topology on its own by changing PE frequency or turning of PEs completely to save power.

For Orchestration path the Tomahawk architecture can also be viewed from a different angle. The tiled architecture consists of a number of tiles each hosting one processor and a network on chip connecting every tile providing a scalable uniform connection throughout the whole chip. With the set of tiles being inhomogeneous this platform makes a good playground for trying out how to deal with heterogeneity in system integration whilst new technologies from other paths are not ready.