Recent Achievements

Paper on Load Balancing accepted at Euro-Par

Published on Wed, 27 Apr 2016 in ORCHESTRATION (RECENT ACHIEVEMENTS)

With more and more parallel architectures being used, the problem to optimally distributing work and thereby balancing load gains importance. To tackle this challenge, graph partitioning algorithms have been successfully applied in various application domains. However, there is a mismatch between solutions found by classical graph partitioning and the behavior of many real hardware systems. Classical graph partitioning assumes that individual vertex weights add up to partition weights (by us referred to as linear graph partitioning). In the context of parallel or heterogeneous systems, the assumption implies that performance scales linearly with the number of tasks. In reality however, performance does usually not scale linearly with the amount of work due to contention on hardware, operating system, or application resources. We address this mismatch with penalized graph partitioning, a special case of non-linear graph-partitioning, in this paper [1]. The result is a novel load balancing algorithm that shares the advantages of classical graph partitioning while at the same time considering the non-linear performance of real-systems.

To demonstrate the potential of our penalized graph partitioning in presence of non-linear resources, we perform a synthetic partitioning experiment. To run the experiment, we generate a workload graph that contains 1,000 heterogeneous tasks with weights following a Zipf distribution. Each task in the workload graph is communicating with 0 to 10 other tasks (again Zipf distributed). To model a system, we use an exponential penalty function and assume that the underlying resources can execute 16 parallel tasks before the penalty grows with the square of the cardinality due to contention (Figure 1a). The workload in this experiment is partitioned into 32 balanced partitions using a standard graph partitioning library. Afterward, to estimate the actual load for each node, the penalty function is applied to each partition based on the partition cardinality (Figure 1b). The resulting partition weights are compared to a second partitioning of the graph that was generated by our novel penalized graph partitioning algorithm (Figure 1c). The unmodified partitioning algorithm (Figure b), which is unaware of the contention, tries to balance the load. The resulting relative weights show that the node with the highest partition weight receives 3.1 times the load of the node with the lowest partition weight. In contrast, our penalized partitioning algorithm leads to partition weights, and hence node utilizations, that are balanced within a tolerance of 3%.

[1] Tim Kiefer, Dirk Habich, Wolfgang Lehner: Penalized Graph Partitioning for Static and Dynamic Load Balancing. To appears in the proceedings of the 22nd International European Conference on Parallel and Distributed Computing (Euro-Par) in Grenoble, France, August 22-26.

Heterogeneous parallelization in computational fluid dynamics

Published on Thu, 11 Feb 2016 in ORCHESTRATION (RECENT ACHIEVEMENTS)

The Chair of Fluid Mechanics has developed a two-layer parallelization approach capable of dealing with clusters of heterogeneous compute nodes, accommodating multicores as well as GPUs. Hardware-specific adaptations employ OpenACC or OpenMP and are confined to the lower, sub-CPU level, while the upper layer is virtually device independent. Using the MPI message passing interface it bridges arbitrarily heterogeneous nodes and fits seamlessly into the computational fluid dynamics framework (paper).

Nodal points in a cartesian spectral element with boundary points (blue) and inner points (yellow)

Performance of different solver variants in a test case with 512 elements and substantial grid stretching using different iteration schemes

This work is accompanied by the development of new algorithms which attain superior computational complexity combined with adjustable granularity, which yields better mapping and higher performance of solvers on different accelerator types (paper).

Paper about M3 accepted at ASPLOS 2016

Published on Fri, 05 Feb 2016 in ORCHESTRATION (RECENT ACHIEVEMENTS)

M3 (microkernel-based system for heterogeneous manycores) is an operating system, which is designed to support arbitrary cores as first-class citizens. That is, M3 provides spatial isolation and access to operating-system services on each core. This is achieved by abstracting the heterogeneity of the cores via a common hardware component per core, called DTU. Both isolation and operating-system services are provided at the network-on-chip level, so that they are available to all cores, independent of their internals. Now, the M3 paper [1] has been accepted at ASPLOS 16, the top conference on co-designing hardware and operating systems. The paper describes the co-design of M3 and the DTU as a new point in the design space for operating systems, which has advantages when integrating heterogeneous cores, and demonstrates that the performance does not suffer. In fact, M3 even outperforms Linux by a factor of five in some application-level benchmarks.

M3 is currently running on the Tomahawk2 chip, the simulator for the next generation of Tomahawk, Linux (by using it as a virtual machine) and gem5. Furthermore, it is available as open source.

The figure above shows an example platform with many heterogeneous cores (colored boxes), each having a local memory (light blue boxes) and a DTU. The M3 kernel is running on a dedicated core and remotely controls the other cores by configuring their DTUs accordingly. With this, M3 assigned a number of cores to each of the two applications in this example. Furthermore, M3 hosts a filesystem service on one core, which hands out memory capabilities to requesting applications, giving direct access to the data of the file in the global memory.

[1] Nils Asmussen, Marcus Völp, Benedikt Nöthen, Hermann Härtig, and Gerhard Fettweis. M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores. To appear in the proceedings of the Twenty-first International Conference on Architectural Support for Programming Languages and Operating Systems, April 2016.

Performance analysis on the Tomahawk2

Published on Wed, 20 Jan 2016 in ORCHESTRATION (RECENT ACHIEVEMENTS)

The Orchestration Path employs the Tomahawk2 MPSoC as a demonstrator platform for the hardware/software stack. Cfaed members ported several applications (data bases, computational fluid dynamics, linear algebra) to the T2 and developed M3, an operating system for heterogeneous manycores. To enable graphical runtime performance analysis of these applications and the operating system, the first event tracing infrastructure for the T2 has been developed at the Center for Information Services and High Performance Computing (ZIH). Various types of events like memory transfers, messages, system calls, and user-defined source code regions are recorded for applications based on taskC and M3 and for internals of the M3 operating system as well. The Vampir tool, developed at ZIH and well known in the high performance computing community, is used to visualize the performance data. The performance measurement and analysis on Tomahawk2 allows to check performance assumptions, enables targeted tuning, and aids in design space exploration and performance modeling.

A first analysis framework to handle dynamic application behavior in dataflow applications

Published on Wed, 23 Dec 2015 in ORCHESTRATION (RECENT ACHIEVEMENTS)

At the Chair for Compiler Construction cfaed members affiliated with the orchestration path have devised a series of approaches to analyse and efficiently cope with dynamic software application behaviors. The problem addressed is that current approaches for mapping and scheduling applications described as Kahn Process Networks (KPN) and Dynamic Data Flow (DDF) rely on assumptions on the program behavior specific to an execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time. This happens when a different data set induces a significantly different behavior.

The approach to analyse and handle this leverages inherent mathematical structures of the dataflow models of computation and the hardware architectures. On the side of the dataflow models, it relies on the monoid structure of histories and traces. This structure helps formalize the behavior of multiple executions of a given dynamic application. It uses metrics to define a formal framework for comparing the executions. On the side of the the hardware, the approach takes advantage of symmetries in the architecture to reduce the search space for the mapping problem. (Download paper.)

A Programming Environment for Particle-based Numerical Simulations

Published on Sun, 20 Dec 2015 in ORCHESTRATION (RECENT ACHIEVEMENTS)

Domain-specific languages (DSLs) are of utmost importance in scientific high- performance computing to reduce development costs, raise the level of abstraction and, thus, make scientific programmer’s life easier. The parallel particle-mesh environment (PPME) is a DSL and a programming environment for numerical simulations based on the particle method. The main goals of PPME are 1) to provide a high-level, easy to learn language for developing particle-based simulations and 2) to leverage the domain- specific knowledge encoded in PPME programs for optimizing the generated code. To achieve these goals, PPME relies on a projectional editing approach rather than a conventional text editor. This, on the one hand, enables users to describe their problems using conventional mathematical notation and, on the other hand, it provides a typed graph-based data structure that paves the ground for automatic analyses and optimizations. PPME follows a generative approach: it generates parallel Fortran code that links with the parallel particle-mesh library (PPM), which has been developed by the MOSAIC group. PPM provides efficient implementations of the particle and mesh abstractions, discrete numerics, as well as an abstraction layer on the underlying high-performance computing hardware.

Screenshot of a Gray-Scott Reaction-Diffusion System implemented in PPME

Recently, a first version has been released, in collaboration with the Chair of Scientific Computing for Systems Biology and the MOSAIC group, both headed by Ivo Sbalzarini (Biological Systems Path).

PPME wiki and repository
A first paper at the DSLDI workshop@ECOOP
Website of the PPM Fortran library

Outstanding Paper Award at HPCS 2014

Published on Mon, 28 Jul 2014 in ORCHESTRATION (RECENT ACHIEVEMENTS)

With their paper „Scalable High-Quality 1D Partitioning” Matthias Lieber and Wolfgang Nagel received the outstanding paper award at the 2014 International Conference on High Performance Computing and Simulation (HPCS 2014).

Best Therory Paper Award at ETAPS 2014

Theroy and Toolsupport for Computing Conditional Probabilities in Markovian Models

Published on Tue, 15 Apr 2014 in ORCHESTRATION (RECENT ACHIEVEMENTS)

bestpaper

EATCS best theory paper award: Christel Baier, Joachim Klein, Sascha Klüppelholz and Steffen Märcker.Computing Conditional Probabilities in Markovian Models Efficiently. Proceedings of the 20th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'14) Volume 8413 of Lecture Notes in Computer Science, pages 515-530, Springer, 2014. (extended version)

Tomahawk 2 back in the lab

Published on Thu, 13 Sep 2012 in ORCHESTRATION (RECENT ACHIEVEMENTS)

The Tomahawk 2 chip that had its tape out in march came back to our labs. The chip is a heterogeneous multi processor system on chip (MPSoC). A dynamic task scheduling unit, the CoreManager sits at its center, controlling the computation load of numerous processing elements (PEs). A special extension to the programming language C, namely taskC has been developed to provide an easy to use interface for the programmer that abstracts the distributed parallel computation from the available hardware. In taskC the programmer only defines tasks that can potentially are run in parallel to each other. This means that the programmer does not have to know about the underlying hardware. The CoreManager kows about the chips topology and can place generated tasks to available PEs at runtime even if the topology (e.g. the number of PEs) changes during execution. Furthermore, it can even change the topology on its own by changing PE frequency or turning of PEs completely to save power.

For Orchestration path the Tomahawk architecture can also be viewed from a different angle. The tiled architecture consists of a number of tiles each hosting one processor and a network on chip connecting every tile providing a scalable uniform connection throughout the whole chip. With the set of tiles being inhomogeneous this platform makes a good playground for trying out how to deal with heterogeneity in system integration whilst new technologies from other paths are not ready.