Path G - Resilience

Introduction


Today, reliability issues already lead to diminishing performance returns when transitioning to smaller CMOS gate lengths. Soon the costs of traditional resilience mechanisms will cancel most of the benefits gained from transitioning to a new technology. The goal of the Resilience Path is to keep the costs of resilience as low as possible by focusing on flexible, application-specific, adaptive resiliency mechanisms. Reliable information processing with unreliable and adjustable components will be researched, taking into account the projected heterogeneity of future systems and the fault characteristics of new materials-inspired technologies.

Overall Goal + Justification

It can be assumed that most post-CMOS technologies, such as the ones investigated in cfaed, will exhibit high error rates. In particular, not only will the rate of single event upsets (e.g., bit flips) increase, but so will accelerated aging (e.g., transistor performance degradation) and transistor variability (e.g., in threshold voltage). This will result in an increasing rate of transient and permanent errors. To mask these errors, we need to pay a cost in terms of energy, speed, and transistor count. We informally refer to this as the resilience cost. If state-of-the-art approaches are extrapolated to future resilience needs, the cost of resilience will eventually prevent the use of new technology generations, because the benefits of a new technology must exceed the increase in resilience cost.

The overall goal of the Resilience Path is to reduce the resilience cost. Depending on the context, different emphasis must be given to the costs of energy, speed, and transistor count. For example, the required balance between speed and energy differs widely between a mobile device and a high-performance server. A system can be viewed as a layered system consisting of hardware and software sub-layers. A variety of ideas have been proposed to improve the resilience on each sub-layer, so the components that populate these sub-layers come with their own resilience mechanisms. The Resilience Path is driven by the hypothesis that a sufficient cost reduction can be achieved by combining the best ideas that exist on the different sub-layers.

To achieve a substantial cost reduction, we need not only novel mechanisms but also an intelligent way to orchestrate them. Our general approach to reducing the cost is to dynamically adapt the degree of resilience to the current needs of the applications. Consider, e.g., a banking and a gaming application executed within a browser. The banking application needs to be optimized for integrity, and the gaming application for speed. To allow for such optimizations, we need to explicitly state the resilience requirements of an application. In the simplest case, an application will select its current resilience requirements from a set of pre-specified resilience classes. For more fine-grained control, we will investigate the use of resilience contracts: these contracts can be used to express dynamic resilience requirements negotiated and orchestrated between all sub-layers.
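
The idea of selecting from pre-specified resilience classes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names, the per-layer settings, and the mapping between them are hypothetical and are not taken from the Path's actual design.

```python
from dataclasses import dataclass
from enum import Enum


class ResilienceClass(Enum):
    """Hypothetical pre-specified resilience classes."""
    BEST_EFFORT = 1  # favor speed, accept occasional errors
    DETECT = 2       # detect errors, recover on demand
    MASK = 3         # mask errors transparently


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical mechanism settings across hardware and software layers."""
    hardware_ecc: bool        # error-correcting codes enabled in hardware
    redundant_executions: int  # how often the software layer runs a task
    checkpointing: bool       # whether state is checkpointed for rollback


# Assumed mapping from resilience class to mechanisms; a real system
# would negotiate this between the sub-layers.
CONFIGS = {
    ResilienceClass.BEST_EFFORT: LayerConfig(False, 1, False),
    ResilienceClass.DETECT: LayerConfig(True, 2, False),
    ResilienceClass.MASK: LayerConfig(True, 3, True),
}


def configure(requested: ResilienceClass) -> LayerConfig:
    """Return the mechanism settings for the class an application selects."""
    return CONFIGS[requested]


# The two browser applications from the text pick different classes.
gaming = configure(ResilienceClass.BEST_EFFORT)
banking = configure(ResilienceClass.MASK)
```

In this sketch the gaming application pays for no resilience mechanism at all, while the banking application enables all of them. A resilience contract would replace the fixed table with requirements that can change at runtime.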

Research Approach

path-g graphic

The overall goal of this Path is to reduce the cost of resilience. Our approach is based on the observation that the cost of resilience depends not only on the error rate and error types of the underlying technology but also on the resilience requirements and the inherent resilience of applications, both of which may change at runtime. Hence, our aim is to provide dynamic control of application resilience. In this way, we pay only the cost of the degree of resilience that is currently needed. We will perform a dynamic cross-layer reconfiguration to tune the resilience mechanisms that are implemented on the various layers of a computer system. Dynamic resilience control will facilitate adaptation not only to changing application requirements but also to fluctuating error rates caused by, for example, environment changes or aging effects.
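
A toy sketch of such dynamic control follows. It assumes a deliberately simple fault model that is not from the source: each redundant execution fails independently with the observed error rate `p`, and an error goes unmasked only if all `n` executions fail, so the residual error rate is `p ** n`. The function name and parameters are hypothetical.

```python
def choose_redundancy(observed_error_rate: float,
                      tolerated_error_rate: float,
                      max_replicas: int = 5) -> int:
    """Pick the smallest replica count that meets the application's target.

    Assumed model: independent failures, residual error rate = p ** n.
    If even max_replicas cannot meet the target, max_replicas is returned.
    """
    if not 0.0 <= observed_error_rate < 1.0:
        raise ValueError("observed_error_rate must be in [0, 1)")
    if tolerated_error_rate <= 0.0:
        raise ValueError("tolerated_error_rate must be positive")
    for n in range(1, max_replicas + 1):
        if observed_error_rate ** n <= tolerated_error_rate:
            return n
    return max_replicas


# Same hardware error rate, different application requirements.
strict_app = choose_redundancy(0.01, 1e-5)
relaxed_app = choose_redundancy(0.01, 0.05)

# Same requirement, but the error rate has risen (e.g., through aging).
aged_hardware = choose_redundancy(0.2, 1e-3)
```

Re-running the function whenever the measured error rate or the application's requirement changes gives the adaptive behavior described above: the cost, here the number of replicas, follows the degree of resilience that is currently needed.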

Our vision is to use the best resilience mechanisms on each sub-layer and combine them into one resilient computing stack as depicted in the scheme above. We also need to orchestrate these layers within one computer system and, potentially, across multiple machines within distributed systems. This Path’s Research Modules are divided into horizontal “layers” (RM L1-L4) and vertical “orchestration” modules (RM O1-O3). This Path integrates the expertise of the two new Strategic Professorships, Processor Design and Compiler Construction, and of the new ZMDI endowed professorship, Circuits for Energy Efficiency. A Research Group Leader (RGL) position, Orchestration of Resilience Mechanisms, will be created.

Path Activities

Published on in RESILIENCE PATH

Our work on "SGXBounds: Memory Safety for Shielded Execution" has received the best paper award at EuroSys'17, a top conference in computer systems. SGXBounds proposes an efficient technique to achieve memory safety for shielded execution. Memory safety is the most critical property for ensuring software reliability and security. Surprisingly, SGXBounds outperforms both the state-of-the-art software approach, AddressSanitizer from Google, and the Intel MPX hardware ISA extensions for memory safety. Furthermore, SGXBounds not only detects memory safety violations but also tolerates them to ensure high availability for software systems. SGXBounds' design is based on the simple idea of using tagged pointers and a compact memory layout in the context of secure enclaves.

The lead author for the project is Dmitrii Kuvaiskii from the Resilience Path at cfaed, who is jointly advised by Christof Fetzer and Pramod Bhatotia.

Published on in RESILIENCE PATH

Our work on "SGXBounds: Memory Safety for Shielded Execution" has been accepted at EuroSys 2017, a top conference in computer systems. The work proposes an efficient technique to achieve memory safety for shielded execution. Memory safety is the most critical component for ensuring software reliability against faults and security against vulnerabilities. Surprisingly, SGXBounds outperforms both the state-of-the-art software approach, AddressSanitizer from Google, and the Intel MPX hardware ISA extensions for memory safety. Furthermore, SGXBounds not only detects memory safety violations but also tolerates them to ensure high availability for software systems. SGXBounds' design is based on the simple idea of using tagged pointers in the context of secure enclaves.

Published on in RESILIENCE PATH

Congratulations to Dr. Marco Zimmerling, who has been announced as the recipient of the 2015 ACM SIGBED Paul Caspi Memorial Dissertation Award! The committee honors Dr. Zimmerling for his thesis "End-to-end Predictability and Efficiency in Low-power Wireless Networks", which he completed at ETH Zurich. Dr. Zimmerling has been leading cfaed's Networked Embedded Systems Group since November 2015.

The ACM SIGBED Paul Caspi Memorial Dissertation Award is given by the Special Interest Group on Embedded Systems (SIGBED) of the Association for Computing Machinery (ACM). ACM is the world's largest educational and scientific computing society. The award was established in 2013 in memory of Dr. Paul Caspi (1944-2012). It recognizes outstanding doctoral dissertations that significantly advance the state of the art in the science of embedded systems. The winner was selected by a committee chaired by Prof. Wang Yi. Seven nominations for the award were received from Germany, Sweden, Switzerland, and the USA.

Published on in RESILIENCE PATH

After the success at INFOCOM’16, the Resilience Path of cfaed celebrates yet another top-tier publication: a paper on the integration of incremental and approximate computations has been accepted to WWW 2016, a leading conference in the area of "Big Data" analytics. This paper is especially important for our efforts to strengthen the 5G Lab and the HAEC initiatives at cfaed in support of the development of the Tactile Internet. In particular, the proposed data analytics system, called IncApprox, combines the incremental and approximate computing paradigms to enable low-latency, energy-efficient stream processing.

Published on in RESILIENCE PATH

Another great success for cfaed: Three of our scientists had their paper on efficient and anonymous communication in Darknets accepted at INFOCOM, which, according to Microsoft Academic Research, is the top conference in the entire field of computer science. The paper titled “Anonymous Addresses for Efficient and Resilient Routing in F2F Overlays” was written by Stefanie Roos, Martin Beck, and Thorsten Strufe.

Published on in RESILIENCE PATH

On January 28th, 2015, Prof. Jeronimo Castrillon and Prof. Thorsten Strufe gave their inaugural lectures on “Compilers for Multi and Many Processor Systems” and “Privacy vs. Surveillance and Censorship in Online Services”, respectively. The lectures were well attended, with around 130 colleagues, students, and friends coming mostly from the Computer Science and Electrical Engineering faculties. At this event, both professors introduced themselves and their research areas in a collegial, friendly atmosphere. The presentations were followed by a get-together full of interesting follow-up discussions.

Publications

cfaed Publications

Automatically tolerating arbitrary faults in non-malicious settings

Reference

Diogo Behrens, Stefan Weigert, Christof Fetzer, "Automatically tolerating arbitrary faults in non-malicious settings", In: Dependable Computing (LADC), 2013 Sixth Latin-American Symposium on, pp. 114–123, IEEE, 2013. doi:10.1109/LADC.2013.26

Bibtex

@inproceedings{behrens2013automatically,
title={Automatically tolerating arbitrary faults in non-malicious settings},
author={Behrens, Diogo and Weigert, Stefan and Fetzer, Christof},
booktitle={Dependable Computing (LADC), 2013 Sixth Latin-American Symposium on},
pages={114--123},
year={2013},
organization={IEEE},
doi={10.1109/LADC.2013.26}
}


Related Paths

Resilience Path

Permalink

https://cfaed.tu-dresden.de/resilience?pubId=85

