Recent Achievements

Orchestration paper accepted at VLDB

Published in ORCHESTRATION (RECENT ACHIEVEMENTS)

"Adaptive Work Placement for Query Processing on Heterogeneous Computing Resources"

The hardware landscape is currently changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with its own characteristics. This trend is a great opportunity for database systems to increase overall performance, provided the heterogeneous resources can be utilized efficiently. The main challenge in achieving this is to place the right work on the right computing unit. Current approaches that tackle this placement for query processing in database systems assume that the data cardinalities of intermediate results can be estimated correctly. However, this assumption does not hold for complex analytical queries. To overcome this problem, we developed an adaptive placement approach that is independent of cardinality estimation for intermediate results. Our approach takes a physical query execution plan as input and divides the plan into disjoint execution islands at compile time. The execution islands are determined such that the cardinalities of intermediate results within each island are known or can be precisely calculated. Placement optimization and execution are then performed separately per island at query runtime, with the execution islands processed successively following their data dependencies. With this novel adaptive approach, we can use heterogeneous computing resources more efficiently for query processing.

The corresponding paper [1], which describes and evaluates our overall approach, has been accepted as a full paper at the 43rd International Conference on Very Large Data Bases (VLDB, http://www.vldb.org/2017/). VLDB is the premier annual international forum for data management and database researchers.
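The island construction described above can be sketched as follows. This is an illustrative sketch only, assuming a linear operator pipeline and a simple flag per operator; the plan representation and function name are not the authors' actual API.

```python
# Hypothetical sketch of compile-time island construction: an island is
# closed after every operator whose output cardinality cannot be derived
# precisely from its inputs, because beyond that point placement decisions
# would rest on guessed cardinalities. (Illustrative, not the paper's code.)

def split_into_islands(plan):
    """Split a list of operators into disjoint execution islands."""
    islands, current = [], []
    for op in plan:
        current.append(op["name"])
        if not op["output_cardinality_known"]:
            islands.append(current)   # close the island at the estimation boundary
            current = []
    if current:
        islands.append(current)
    return islands

plan = [
    {"name": "scan",   "output_cardinality_known": True},
    {"name": "filter", "output_cardinality_known": False},  # selectivity unknown
    {"name": "join",   "output_cardinality_known": False},  # join size unknown
    {"name": "sort",   "output_cardinality_known": True},
]
print(split_into_islands(plan))  # [['scan', 'filter'], ['join'], ['sort']]
```

At runtime, each island would then be placed and executed in turn, once the actual input cardinality of that island is known.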

[1] Tomas Karnagel, Dirk Habich, Wolfgang Lehner: Adaptive Work Placement for Query Processing on Heterogeneous Computing Resources. Accepted at the 43rd International Conference on Very Large Data Bases (VLDB, August 28 – September 1, Munich, Germany), 2017.

Christian Menard receives Hermann-Willkomm prize


Christian Menard received the Hermann-Willkomm prize for his Diploma Thesis entitled “Mapping KPN-based Applications to the NoC-based Tomahawk Architecture”. The prize is awarded to the best thesis in the area of Information Systems Engineering (Informationssystemtechnik). Christian’s work was supervised by Andrés Goens, a researcher of the Chair for Compiler Construction. The techniques described in the thesis are part of a compiler developed in the context of the Orchestration Path of the Excellence Cluster cfaed. The compiler targets the Tomahawk architecture developed at the Vodafone Chair, headed by Prof. Gerhard Fettweis.

Orchestration position paper accepted at PMES


The position paper of the Orchestration path [1] has been accepted at the 1st International Workshop on Post-Moore's Era Supercomputing (PMES'16) and will be presented in Salt Lake City, USA on Monday, Nov 14.

[1] Marcus Völp, Sascha Klüppelholz, Jeronimo Castrillon, Hermann Härtig, Nils Asmussen, Uwe Assmann, Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andrés Goens, Sebastian Haas, Dirk Habich, Mattis Hasler, Immo Huismann, Tomas Karnagel, Sven Karol, Wolfgang Lehner, Linda Leuschner, Matthias Lieber, Siqi Ling, Steffen Märcker, Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Penaloza, Michael Raitza, Jörg Stiller, Annett Ungethüm, and Axel Voigt: The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware. Proceedings of the 1st Workshop on Post-Moore's Era Supercomputing (PMES), 2016. Accepted for publication.

Best Demonstration Award @ SIGMOD 2016


From June 26 to July 01, the annual ACM SIGMOD Conference took place in San Francisco, USA. At this international conference, we presented our demo on the topic “Energy Elasticity on Heterogeneous Hardware using Adaptive Resource Reconfiguration LIVE” [1]. We are happy to announce that this work received the Best Demonstration Award at SIGMOD 2016. In this demo, we showed our novel energy-control loop, which addresses software-controlled hardware reconfiguration at runtime for data management systems running on a single heterogeneous server system. The work was developed in cooperation between the Orchestration and HAEC paths.

Abstract: Energy awareness of database systems has emerged as a critical research topic, since energy consumption is becoming a major limiter for their scalability. Recent energy-related hardware developments trend towards offering more and more configuration opportunities for the software to control its own energy consumption. Existing research so far has mainly focused on leveraging this configuration spectrum to find the most energy-efficient configuration for specific operators or entire queries. In this demo, we introduce the concept of energy elasticity and propose the energy-control loop as an implementation of this concept. Energy elasticity refers to the ability of software to behave in an energy-proportional and energy-efficient way at the same time while maintaining a certain quality of service. Thus, our system does not draw the least energy possible but the least energy necessary to still perform reasonably. We demonstrate our overall approach using a rich interactive GUI to give attendees the opportunity to learn more about our concept.
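The idea of drawing "the least energy necessary" can be illustrated with a minimal feedback loop. This sketch is a strong simplification under assumed numbers (a linear per-core throughput model and core activation as the only knob); the demo's actual controller is more sophisticated.

```python
# Illustrative energy-control loop: instead of minimizing energy outright,
# the controller keeps just enough resources active to meet the current
# demand (the quality-of-service target). All numbers are assumptions.

def control_step(active_cores, throughput_per_core, demand, max_cores):
    """Return the new number of active cores after one loop iteration."""
    capacity = active_cores * throughput_per_core
    if capacity < demand and active_cores < max_cores:
        return active_cores + 1          # QoS at risk: activate a core
    if (active_cores - 1) * throughput_per_core >= demand and active_cores > 1:
        return active_cores - 1          # slack available: save energy
    return active_cores                  # least energy necessary, QoS held

cores = 8
for demand in [100, 100, 40, 40, 40, 40]:   # queries/s the workload requires
    cores = control_step(cores, 25, demand, max_cores=16)
print(cores)  # settles at 2 cores: 2 * 25 = 50 >= 40, but 1 * 25 = 25 < 40
```

The loop converges to the smallest configuration that still meets demand, which is the energy-elastic behavior described in the abstract: energy-proportional, yet within the quality-of-service bound.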

[1] Annett Ungethüm, Thomas Kissinger, Willi-Wolfram Mentzel, Dirk Habich, Wolfgang Lehner: Energy Elasticity on Heterogeneous Hardware using Adaptive Resource Reconfiguration LIVE. SIGMOD Conference 2016: 2173-2176

 

Limitations of Intra-Operator Parallelism using Heterogeneous Computing Resources


In recent years, hardware changes have shaped database system architecture, moving from sequential execution to parallel multi-core execution and from disk-centric systems to in-memory systems. At the moment, the hardware is changing again, from homogeneous CPU systems towards heterogeneous systems with many different computing units (CUs). Heterogeneous systems combine different CUs, such as CPUs, GPUs, or Xeon Phis, with different architectures, memory hierarchies, and interconnects, leading to different execution behaviors. The current challenge for the database community is therefore to find ways to utilize these systems efficiently. One opportunity is to execute a single query operator on all available CUs, which is usually called intra-operator parallelism. In homogeneous multi-core systems, this can easily be achieved by uniformly partitioning the data across all cores, and there such intra-operator parallelism is beneficial. In our current research work, we have investigated the same approach for heterogeneous systems. The corresponding paper [1] has been accepted for oral presentation and full publication at the 20th East-European Conference on Advances in Databases and Information Systems (http://adbis2016.vsb.cz).

Figure 1: Operator Execution on Two Computing Units.

 

Figure 1 shows our approach using two different computing units. For heterogeneous systems, we have to find the ideal data partitioning according to the different execution performance of the given CUs. The partitioned data is then used to execute the operator on each CU, which computes a partial result. Finally, the partial results of all CUs have to be merged. In our paper [1], we analyze this approach for two operators, selection and sorting, on two different heterogeneous systems. We present performance insights as well as limitations of intra-operator parallelism in heterogeneous environments. As a result, we show that the actual potential for improvement is small, while the limitations and overheads can be significant, sometimes leading to even worse performance than single-CU execution. Our findings can therefore help system developers decide whether intra-operator parallelism can be applied effectively or should be avoided.
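The partitioning step from Figure 1 can be sketched as splitting the input in proportion to each CU's throughput, so that all CUs finish their partial results at roughly the same time. The throughput numbers below are illustrative assumptions, not measurements from the paper, and the sketch ignores the transfer and merge overheads that the paper identifies as the limiting factor.

```python
# Hypothetical performance-proportional partitioning across computing units.
# A CU with three times the throughput receives three times the rows.

def partition_sizes(total_rows, throughputs):
    """Split total_rows across CUs proportionally to their throughput."""
    total = sum(throughputs)
    sizes = [total_rows * t // total for t in throughputs]
    sizes[0] += total_rows - sum(sizes)   # give the rounding remainder to CU 0
    return sizes

# Assumed: CPU processes 2M rows/s, GPU 6M rows/s.
sizes = partition_sizes(8_000_000, [2, 6])
print(sizes)                      # [2000000, 6000000]

# Estimated per-CU execution time: both CUs finish after ~1 s.
times = [s / (t * 1_000_000) for s, t in zip(sizes, [2, 6])]
print(times)                      # [1.0, 1.0]
```

Even with a perfectly balanced split like this, the paper's finding is that partitioning, transfer, and merge overheads can outweigh the gain, sometimes making single-CU execution faster.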

 

[1] Tomas Karnagel, Dirk Habich, Wolfgang Lehner: Limitations of Intra-Operator Parallelism using Heterogeneous Computing Resources. Accepted at the 20th East-European Conference on Advances in Databases and Information Systems (ADBIS, August 28-31, Prague, Czech Republic), 2016.

Paper on Load Balancing accepted at Euro-Par


With more and more parallel architectures in use, the problem of optimally distributing work, and thereby balancing load, gains importance. To tackle this challenge, graph partitioning algorithms have been successfully applied in various application domains. However, there is a mismatch between the solutions found by classical graph partitioning and the behavior of many real hardware systems. Classical graph partitioning assumes that individual vertex weights add up to partition weights (which we refer to as linear graph partitioning). In the context of parallel or heterogeneous systems, this assumption implies that performance scales linearly with the number of tasks. In reality, however, performance usually does not scale linearly with the amount of work, due to contention on hardware, operating system, or application resources. In this paper [1], we address this mismatch with penalized graph partitioning, a special case of non-linear graph partitioning. The result is a novel load balancing algorithm that shares the advantages of classical graph partitioning while at the same time considering the non-linear performance of real systems.

 

To demonstrate the potential of penalized graph partitioning in the presence of non-linear resources, we perform a synthetic partitioning experiment. We generate a workload graph that contains 1,000 heterogeneous tasks with weights following a Zipf distribution; each task communicates with 0 to 10 other tasks (again Zipf-distributed). To model a system, we use an exponential penalty function and assume that the underlying resources can execute 16 parallel tasks before the penalty grows with the square of the cardinality due to contention (Figure 1a). The workload is partitioned into 32 balanced partitions using a standard graph partitioning library. Afterwards, to estimate the actual load on each node, the penalty function is applied to each partition based on its cardinality (Figure 1b). The resulting partition weights are compared to a second partitioning of the graph generated by our novel penalized graph partitioning algorithm (Figure 1c). The unmodified partitioning algorithm (Figure 1b), which is unaware of the contention, tries to balance the load; the resulting relative weights show that the node with the highest partition weight receives 3.1 times the load of the node with the lowest partition weight. In contrast, our penalized partitioning algorithm leads to partition weights, and hence node utilizations, that are balanced within a tolerance of 3%.
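The effect of a penalty function on partition load can be sketched as follows. The concrete function below (linear up to a capacity of 16 tasks, quadratic growth beyond it) is an illustrative assumption in the spirit of the experiment, not the paper's exact model.

```python
# A sketch of how a penalty function turns partition cardinalities into
# actual node load: up to 16 tasks run without contention, beyond that the
# extra load grows quadratically. (Illustrative, not the paper's model.)

def penalized_load(cardinality, capacity=16):
    """Effective load of a partition once contention sets in."""
    if cardinality <= capacity:
        return cardinality
    return capacity + (cardinality - capacity) ** 2

# Classical (linear) partitioning balances cardinalities, unaware of contention:
balanced = [20, 20, 20]
print([penalized_load(c) for c in balanced])   # [32, 32, 32]: every node pays

# A penalty-aware partitioner would rather keep nodes at or below capacity:
aware = [16, 16, 16, 12]
print([penalized_load(c) for c in aware])      # [16, 16, 16, 12]: no contention
```

Both splits place the same 60 tasks, but the contention-unaware split doubles the effective load on every node, which is the kind of imbalance the penalized partitioning algorithm avoids.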

[1] Tim Kiefer, Dirk Habich, Wolfgang Lehner: Penalized Graph Partitioning for Static and Dynamic Load Balancing. To appear in the proceedings of the 22nd International European Conference on Parallel and Distributed Computing (Euro-Par), Grenoble, France, August 22-26.

 

Heterogeneous parallelization in computational fluid dynamics


The Chair of Fluid Mechanics has developed a two-layer parallelization approach capable of dealing with clusters of heterogeneous compute nodes, accommodating multicores as well as GPUs. Hardware-specific adaptations employ OpenACC or OpenMP and are confined to the lower, sub-CPU level, while the upper layer is virtually device independent. Using MPI, the upper layer bridges arbitrarily heterogeneous nodes and fits seamlessly into the computational fluid dynamics framework (paper).

Figure: Nodal points in a Cartesian spectral element with boundary points (blue) and inner points (yellow)
Figure: Performance of different solver variants in a test case with 512 elements and substantial grid stretching, using different iteration schemes

This work is accompanied by the development of new algorithms that attain superior computational complexity combined with adjustable granularity, yielding better mapping and higher performance of solvers on different accelerator types (paper).

Paper about M3 accepted at ASPLOS 2016


M3 (microkernel-based system for heterogeneous manycores) is an operating system designed to support arbitrary cores as first-class citizens. That is, M3 provides spatial isolation and access to operating-system services on each core. This is achieved by abstracting the heterogeneity of the cores via a common per-core hardware component, called the DTU. Both isolation and operating-system services are provided at the network-on-chip level, so that they are available to all cores, independent of their internals. The M3 paper [1] has now been accepted at ASPLOS 2016, the top conference on co-designing hardware and operating systems. The paper describes the co-design of M3 and the DTU as a new point in the design space for operating systems, which has advantages when integrating heterogeneous cores, and demonstrates that performance does not suffer. In fact, M3 even outperforms Linux by a factor of five in some application-level benchmarks.

M3 currently runs on the Tomahawk2 chip, on the simulator for the next generation of Tomahawk, on Linux (as a virtual machine), and on gem5. Furthermore, it is available as open source.

The figure above shows an example platform with many heterogeneous cores (colored boxes), each having a local memory (light blue boxes) and a DTU. The M3 kernel runs on a dedicated core and remotely controls the other cores by configuring their DTUs accordingly. In this example, M3 has assigned a number of cores to each of the two applications. Furthermore, M3 hosts a file system service on one core, which hands out memory capabilities to requesting applications, giving them direct access to the file's data in the global memory.

[1] Nils Asmussen, Marcus Völp, Benedikt Nöthen, Hermann Härtig, and Gerhard Fettweis. M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores. To appear in the proceedings of the Twenty-first International Conference on Architectural Support for Programming Languages and Operating Systems, April 2016.

Performance analysis on the Tomahawk2


The Orchestration Path employs the Tomahawk2 MPSoC as a demonstrator platform for the hardware/software stack. Cfaed members ported several applications (databases, computational fluid dynamics, linear algebra) to the T2 and developed M3, an operating system for heterogeneous manycores. To enable graphical runtime performance analysis of these applications and the operating system, the first event tracing infrastructure for the T2 has been developed at the Center for Information Services and High Performance Computing (ZIH). Various types of events, such as memory transfers, messages, system calls, and user-defined source code regions, are recorded for applications based on taskC and M3, as well as for internals of the M3 operating system. The Vampir tool, developed at ZIH and well known in the high-performance computing community, is used to visualize the performance data. Performance measurement and analysis on the Tomahawk2 makes it possible to check performance assumptions, enables targeted tuning, and aids design space exploration and performance modeling.

A first analysis framework to handle dynamic application behavior in dataflow applications


At the Chair for Compiler Construction, cfaed members affiliated with the Orchestration path have devised a series of approaches to analyse and efficiently cope with dynamic application behavior. The problem addressed is that current approaches for mapping and scheduling applications described as Kahn Process Networks (KPN) or Dynamic Data Flow (DDF) rely on assumptions about program behavior that are specific to one execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time when a different data set induces a significantly different behavior.

The approach to analysing and handling this leverages inherent mathematical structures of the dataflow models of computation and of the hardware architectures. On the side of the dataflow models, it relies on the monoid structure of histories and traces. This structure helps formalize the behavior of multiple executions of a given dynamic application, and metrics on it define a formal framework for comparing executions. On the side of the hardware, the approach takes advantage of symmetries in the architecture to reduce the search space for the mapping problem. (Download paper.)
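The monoid view of traces mentioned above can be made concrete with a minimal sketch: traces are finite sequences of events, concatenation is the associative monoid operation, and the empty trace is the identity. The distance used below (how early two traces diverge) is only an illustrative metric, not the one defined in the paper.

```python
# Minimal sketch of traces as a monoid, with a toy metric for comparing
# two executions of the same dataflow application on different inputs.

def concat(t1, t2):
    return t1 + t2                      # tuple concatenation: associative

EMPTY = ()                              # identity element of the monoid

def trace_distance(t1, t2):
    """Compare two executions by how early their event traces diverge."""
    common = 0
    for a, b in zip(t1, t2):
        if a != b:
            break
        common += 1
    return max(len(t1), len(t2)) - common

run_a = ("read", "fire", "write", "fire")
run_b = ("read", "fire", "fire", "write")   # another input, different order

assert concat(EMPTY, run_a) == concat(run_a, EMPTY) == run_a  # identity laws
print(trace_distance(run_a, run_b))     # 2: traces share a 2-event prefix
```

A metric like this makes "a significantly different behavior" quantifiable: mappings tuned for one execution can be judged against the distance to the trace actually observed at run-time.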