- Chair of Compiler Construction
- Chair of Emerging Electronic Technologies
- Chair of Knowledge-Based Systems
- Chair of Molecular Functional Materials
- Chair of Network Dynamics
- Chair of Organic Devices
- Chair of Processor Design
Prof. Dr.-Ing. Jeronimo Castrillon |
||
Phone Fax Visitor's Address |
jeronimo.castrillon@tu-dresden.de +49 (0)351 463 42716 +49 (0)351 463 39995 Chair for Compiler Construction |
Jerónimo Castrillón received the Electronics Engineering degree with honors from the Pontificia Bolivariana University in Colombia in 2004, the master degree from the ALaRI Institute, University of Lugano, in Switzerland in 2006 and the Ph.D. degree (Dr.-Ing.) on Electric Engineering and Information Technology with honors from the RWTH Aachen University in Germany in 2013. From early 2009 to April 2013 Dr. Castrillón was the chief engineer of the chair for Software for Systems on Silicon at the RWTH Aachen University, where he was enrolled as research staff since late 2006. From April 2013 to April 2014 Dr. Castrillón was senior scientific staff in the same institution.
In June 2014, Dr. Castrillón joined the department of computer science of the TU Dresden as professor for compiler construction in the context of the German excellence cluster “Center for Advancing Electronics Dresden” (cfaed). His research interests lie on methodologies, languages, tools and algorithms for programming complex computing systems. He is also affiliated to the Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig (ScaDS.AI), the 6G-life Hub, and the Barkhausen Institut.
Prof. Castrillón has several international publications and has served as program chair and technical program committee in international conferences and workshops (e.g., LCTES, CASES, DAC, DATE, CODES-ISSS, CASES, CGO, Computing Frontiers, FPL, ICCS and MCSoC) as well as a reviewer for ACM and IEEE journals among others. Prof. Castrillón is the recipient of numerous awards, including the Swiss Excellence Government Scholarship in 2005 and the Intel Doctoral Award in 2012. In 2014 he co-founded Silexica GmbH, a company that provides programming tools for embedded multicore architectures, now with AMD/Xilinx.
2025
- Asif Ali Khan, Hamid Farzaneh, Karl F. A. Friebel, Clément Fournier, Lorenzo Chelini, Jeronimo Castrillon, "CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms" (to appear), Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25), Association for Computing Machinery, Mar 2025. [Bibtex & Downloads]
CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms
Reference
Asif Ali Khan, Hamid Farzaneh, Karl F. A. Friebel, Clément Fournier, Lorenzo Chelini, Jeronimo Castrillon, "CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms" (to appear), Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25), Association for Computing Machinery, Mar 2025.
Bibtex
@InProceedings{khan_asplos25,
author = {Khan, Asif Ali and Farzaneh, Hamid and Friebel, Karl F. A. and Fournier, Clément and Chelini, Lorenzo and Castrillon, Jeronimo},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25)},
title = {CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms},
location = {Rotterdam, The Netherlands},
publisher = {Association for Computing Machinery},
series = {ASPLOS '25},
month = mar,
year = {2025},
}Downloads
No Downloads available for this publication
Permalink
- Alexander Brauckmann, Anderson Faustino da Silva, Jeronimo Castrillon, Hugh Leather, "DFA-Net: A Compiler-Specific Neural Architecture for Robust Generalization in Data Flow Analyses" (to appear), Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025), Association for Computing Machinery, New York, NY, USA, Mar 2025. [Bibtex & Downloads]
DFA-Net: A Compiler-Specific Neural Architecture for Robust Generalization in Data Flow Analyses
Reference
Alexander Brauckmann, Anderson Faustino da Silva, Jeronimo Castrillon, Hugh Leather, "DFA-Net: A Compiler-Specific Neural Architecture for Robust Generalization in Data Flow Analyses" (to appear), Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025), Association for Computing Machinery, New York, NY, USA, Mar 2025.
Bibtex
@InProceedings{brauckmann_cc25,
author = {Alexander Brauckmann and Anderson Faustino da Silva and Jeronimo Castrillon and Hugh Leather},
title = {DFA-Net: A Compiler-Specific Neural Architecture for Robust Generalization in Data Flow Analyses},
booktitle = {Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025)},
year = {2025},
series = {CC 2025},
publisher = {Association for Computing Machinery},
location = {Las Vegas, NV, USA},
month = mar,
address = {New York, NY, USA},
numpages = {11},
}Downloads
No Downloads available for this publication
Permalink
- Anderson Faustino da Silva, Jeronimo Castrillon, Fernando Magno Quintão Pereira, "A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers" (to appear), Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025), Association for Computing Machinery, New York, NY, USA, Mar 2025. [Bibtex & Downloads]
A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers
Reference
Anderson Faustino da Silva, Jeronimo Castrillon, Fernando Magno Quintão Pereira, "A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers" (to appear), Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025), Association for Computing Machinery, New York, NY, USA, Mar 2025.
Bibtex
@InProceedings{faustino_cc25,
author = {Anderson Faustino da Silva and Jeronimo Castrillon and Fernando Magno Quint\~{a}o Pereira},
booktitle = {Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025)},
title = {A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers},
location = {Las Vegas, NV, USA},
publisher = {Association for Computing Machinery},
series = {CC 2025},
address = {New York, NY, USA},
month = mar,
numpages = {11},
year = {2025},
}Downloads
No Downloads available for this publication
Permalink
2024
- Shaokai Lin, Tassilo Tanneberger, Jiahong Bi, Guangyu Feng, Ruomu Xu, Julian Robledo, Robert Khasanov, Jeronimo Castrillon, "Navigating Time and Energy Trade-offs in Reactive Heterogeneous Systems", In IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2024), IEEE, Oct 2024. [doi] [Bibtex & Downloads]
Navigating Time and Energy Trade-offs in Reactive Heterogeneous Systems
Reference
Shaokai Lin, Tassilo Tanneberger, Jiahong Bi, Guangyu Feng, Ruomu Xu, Julian Robledo, Robert Khasanov, Jeronimo Castrillon, "Navigating Time and Energy Trade-offs in Reactive Heterogeneous Systems", In IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2024), IEEE, Oct 2024. [doi]
Bibtex
@Article{lin_ieeeesl-tcrs24,
author = {Shaokai Lin and Tassilo Tanneberger and Jiahong Bi and Guangyu Feng and Ruomu Xu and Julian Robledo and Robert Khasanov and Jeronimo Castrillon},
title = {Navigating Time and Energy Trade-offs in Reactive Heterogeneous Systems},
journal = {IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2024)},
month = oct,
numpages = {4},
publisher = {IEEE},
year = {2024},
doi = {10.1109/LES.2024.3469278},
url = {https://ieeexplore.ieee.org/document/10702523},
}Downloads
2410_Lin_TCRS [PDF]
Permalink
- João Paulo C. de Lima, Benjamin F. Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones, "Count2Multiply: Reliable In-memory High-Radix Counting", Arxiv, pp. 1-14, Sep 2024. [Bibtex & Downloads]
Count2Multiply: Reliable In-memory High-Radix Counting
Reference
João Paulo C. de Lima, Benjamin F. Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones, "Count2Multiply: Reliable In-memory High-Radix Counting", Arxiv, pp. 1-14, Sep 2024.
Abstract
Big data processing has exposed the limits of compute-centric hardware acceleration due to the memory-to-processor bandwidth bottleneck. Consequently, there has been a shift towards memory-centric architectures, leveraging substantial compute parallelism by processing using the memory elements directly. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. For emerging data-intensive applications, such as advanced machine learning techniques and bioinformatics, where matrix multiplication is a key primitive, memristor crossbars suffer from limited write endurance and expensive write operations. In contrast, while DRAM-based solutions have successfully demonstrated multiplication using additions, they remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. In addition, Count2Multiply is designed with fault tolerance in mind and leverages traditional scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories due to their shifting properties, which are natural for Count2Multiply, and their high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x.
Bibtex
@Misc{delima_count2multiply,
author = {Jo{\~a}o Paulo C. de Lima and Benjamin F. Morris III and Asif Ali Khan and Jeronimo Castrillon and Alex K. Jones},
title = {Count2Multiply: Reliable In-memory High-Radix Counting},
pages = {1-14},
publisher = {Arxiv},
month=sep,
year={2024},
eprint={2409.10136},
archivePrefix={arXiv},
primaryClass={cs.AR},
url={https://arxiv.org/abs/2409.10136},
abstract = {Big data processing has exposed the limits of compute-centric hardware acceleration due to the memory-to-processor bandwidth bottleneck. Consequently, there has been a shift towards memory-centric architectures, leveraging substantial compute parallelism by processing using the memory elements directly. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. For emerging data-intensive applications, such as advanced machine learning techniques and bioinformatics, where matrix multiplication is a key primitive, memristor crossbars suffer from limited write endurance and expensive write operations. In contrast, while DRAM-based solutions have successfully demonstrated multiplication using additions, they remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. In addition, Count2Multiply is designed with fault tolerance in mind and leverages traditional scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories due to their shifting properties, which are natural for Count2Multiply, and their high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x.},
}Downloads
No Downloads available for this publication
Permalink
- Jiahong Bi, Guilherme Korol, Jeronimo Castrillon, "Leveraging the MLIR infrastructure for the computing continuum", In Proceeding: CPS Workshop 2024, Sep 2024. [Bibtex & Downloads]
Leveraging the MLIR infrastructure for the computing continuum
Reference
Jiahong Bi, Guilherme Korol, Jeronimo Castrillon, "Leveraging the MLIR infrastructure for the computing continuum", In Proceeding: CPS Workshop 2024, Sep 2024.
Abstract
With an ever-increasing number of connected devices (e.g., IoT), cloud computing faces efficiency challenges due to complex infrastructure, high communication costs, and privacy. Fog and edge computing enable computing closer to data sources, offering alternatives to the limitations of relying exclusively on the cloud. When combined with high-performance cloud platforms, fog, and edge devices form a computing continuum. However, the continuum challenges designers who need to compile and deploy on distributed and heterogeneous devices and optimize for a diverse set of non-functional requirements. To ease the usage and ensure the full potential of the continuum, a Design and Programming Environment (DPE) that is interoperable, reusable, portable, and cross-layer is needed. In this context, the Multi-Level Intermediate Representation (MLIR) becomes vital since it provides an extensible and reusable compiler infrastructure. The project development of a continuum-oriented DPE leveraging the MLIR infrastructure is discussed in this paper as a work in progress.
Bibtex
@InProceedings{bi_cps24,
author = {Jiahong Bi and Guilherme Korol and Jeronimo Castrillon},
booktitle = {CPS Workshop 2024},
title = {Leveraging the MLIR infrastructure for the computing continuum},
location = {Alghero, Italy},
abstract = {With an ever-increasing number of connected devices (e.g., IoT), cloud computing faces efficiency challenges due to complex infrastructure, high communication costs, and privacy. Fog and edge computing enable computing closer to data sources, offering alternatives to the limitations of relying exclusively on the cloud. When combined with high-performance cloud platforms, fog, and edge devices form a computing continuum. However, the continuum challenges designers who need to compile and deploy on distributed and heterogeneous devices and optimize for a diverse set of non-functional requirements. To ease the usage and ensure the full potential of the continuum, a Design and Programming Environment (DPE) that is interoperable, reusable, portable, and cross-layer is needed. In this context, the Multi-Level Intermediate Representation (MLIR) becomes vital since it provides an extensible and reusable compiler infrastructure. The project development of a continuum-oriented DPE leveraging the MLIR infrastructure is discussed in this paper as a work in progress.},
month = sep,
numpages = {8},
year = {2024},
}Downloads
2409_BI_CPSW [PDF]
Permalink
- Julian Robledo, Christian Menard, Erling Jellum, Edward A. Lee, Jeronimo Castrillon, "Timing enclaves for performance in Lingua Franca", In Proceeding: 2024 Forum for Specification and Design Languages (FDL), pp. 1-9, Sep 2024. [doi] [Bibtex & Downloads]
Timing enclaves for performance in Lingua Franca
Reference
Julian Robledo, Christian Menard, Erling Jellum, Edward A. Lee, Jeronimo Castrillon, "Timing enclaves for performance in Lingua Franca", In Proceeding: 2024 Forum for Specification and Design Languages (FDL), pp. 1-9, Sep 2024. [doi]
Bibtex
@InProceedings{robledo_fdl24,
author = {Julian Robledo and Christian Menard and Erling Jellum and Edward A. Lee and Jeronimo Castrillon},
booktitle = {2024 Forum for Specification and Design Languages (FDL)},
title = {Timing enclaves for performance in Lingua Franca},
location = {Stockholm, Sweden},
month = sep,
year = {2024},
doi = {10.1109/FDL63219.2024.10673834},
pages = {1-9},
url = {https://ieeexplore.ieee.org/document/10673834},
}Downloads
2409_Robledo_FDL [PDF]
Permalink
- Jeronimo Castrillon, "High-level programming abstractions and compilation for near and in-memory computing", In Minisymposium on Applications and Benefits of UPMEM commercial Massively Parallel Processing-In-Memory Platform (ABUMPIMP 2024) @ Euro-Par 2024 (invited talk), Aug 2024. [Bibtex & Downloads]
High-level programming abstractions and compilation for near and in-memory computing
Reference
Jeronimo Castrillon, "High-level programming abstractions and compilation for near and in-memory computing", In Minisymposium on Applications and Benefits of UPMEM commercial Massively Parallel Processing-In-Memory Platform (ABUMPIMP 2024) @ Euro-Par 2024 (invited talk), Aug 2024.
Bibtex
@Misc{castrillon_abumpimp2024,
author = {Castrillon, Jeronimo},
date = {2024-08},
title = {High-level programming abstractions and compilation for near and in-memory computing},
howpublished = {Minisymposium on Applications and Benefits of UPMEM commercial Massively Parallel Processing-In-Memory Platform (ABUMPIMP 2024) @ Euro-Par 2024 (invited talk)},
location = {Madrid, Spain},
month = aug,
year = {2024},
}Downloads
240826_Castrillon_ABUMPIMP [PDF]
Permalink
- Shaokai Lin, Erling Jellum, Mirco Theile, Tassilo Tanneberger, Binqi Sun, Chadlia Jerad, Ruomu Xu, Guangyu Feng, Christian Menard, Marten Lohstroh, Jeronimo Castrillon, Sanjit Seshia, Edward Lee, "PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency", Jun 2024. [Bibtex & Downloads]
PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency
Reference
Shaokai Lin, Erling Jellum, Mirco Theile, Tassilo Tanneberger, Binqi Sun, Chadlia Jerad, Ruomu Xu, Guangyu Feng, Christian Menard, Marten Lohstroh, Jeronimo Castrillon, Sanjit Seshia, Edward Lee, "PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency", Jun 2024.
Bibtex
@Misc{lin_pretvm24,
author = {Shaokai Lin and Erling Jellum and Mirco Theile and Tassilo Tanneberger and Binqi Sun and Chadlia Jerad and Ruomu Xu and Guangyu Feng and Christian Menard and Marten Lohstroh and Jeronimo Castrillon and Sanjit Seshia and Edward Lee},
title = {PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency},
eprint = {2406.06253},
url = {https://arxiv.org/abs/2406.06253},
archiveprefix = {arXiv},
month = jun,
primaryclass = {eess.SY},
year = {2024},
}Downloads
No Downloads available for this publication
Permalink
- Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors", Jun 2024. [Bibtex & Downloads]
E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors
Reference
Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors", Jun 2024.
Bibtex
@Misc{khasanov_emapper24,
author = {Till Smejkal and Robert Khasanov and Jeronimo Castrillon and Hermann Härtig},
title = {{E-Mapper}: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors},
eprint = {2406.18980},
url = {https://arxiv.org/abs/2406.18980},
archiveprefix = {arXiv},
primaryclass = {cs.OS},
year = {2024},
month = jun
}Downloads
No Downloads available for this publication
Permalink
- Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61th ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024. [doi] [Bibtex & Downloads]
SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs
Reference
Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61th ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024. [doi]
Abstract
Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10\texttimes and 4.6\texttimes, respectively.
Bibtex
@InProceedings{farzaneh_dac24,
author = {Hamid Farzaneh and Jo{\~a}o Paulo Cardoso De Lima and Ali Nezhadi Khelejani and Asif Ali Khan and Mahta Mayahinia and Mehdi Tahoori and Jeronimo Castrillon},
booktitle = {Proceedings of the 61th ACM/IEEE Design Automation Conference (DAC'24)},
title = {{SHERLOCK}: Scheduling Efficient and Reliable Bulk Bitwise Operations in {NVMs}},
location = {San Francisco, California},
series = {DAC '24},
month = jun,
year = {2024},
isbn = {9798400706011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3649329.3658485},
doi = {10.1145/3649329.3658485},
abstract = {Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10\texttimes{} and 4.6\texttimes{}, respectively.},
articleno = {293},
numpages = {6},
}Downloads
2406_Farzaneh_DAC [PDF]
Permalink
- Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024. [doi] [Bibtex & Downloads]
C4CAM: A Compiler for CAM-based In-memory Accelerators
Reference
Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024. [doi]
Abstract
Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.
Bibtex
@InProceedings{farzaneh_asplos24,
author = {Hamid Farzaneh and João Paulo Cardoso de Lima and Mengyuan Li and Asif Ali Khan and Xiaobo Sharon Hu and Jeronimo Castrillon},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3},
title = {C4CAM: A Compiler for CAM-based In-memory Accelerators},
doi = {10.1145/3620666.3651386},
isbn = {9798400703867},
location = {La Jolla, CA, USA},
pages = {164--177},
publisher = {Association for Computing Machinery},
series = {ASPLOS '24},
url = {https://arxiv.org/abs/2309.06418},
abstract = {Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.},
address = {New York, NY, USA},
month = may,
numpages = {14},
year = {2024},
}Downloads
2405_Farzaneh_ASPLOS [PDF]
Permalink
- Stephanie Soldavini, Felix Suchert, Serena Curzel, Michele Fiorito, Karl Friedrich Alexander Friebel, Fabrizio Ferrandi, Radim Cmar, Jeronimo Castrillon, Christian Pilato, "Etna: MLIR-Based System-Level Design and Optimization for Transparent Application Execution on CPU-FPGA Nodes", Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract), 1pp, May 2024. [Bibtex & Downloads]
Etna: MLIR-Based System-Level Design and Optimization for Transparent Application Execution on CPU-FPGA Nodes
Reference
Stephanie Soldavini, Felix Suchert, Serena Curzel, Michele Fiorito, Karl Friedrich Alexander Friebel, Fabrizio Ferrandi, Radim Cmar, Jeronimo Castrillon, Christian Pilato, "Etna: MLIR-Based System-Level Design and Optimization for Transparent Application Execution on CPU-FPGA Nodes", Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract), 1pp, May 2024.
Bibtex
@InProceedings{suchert_fccm24,
author = {Stephanie Soldavini and Felix Suchert and Serena Curzel and Michele Fiorito and Karl Friedrich Alexander Friebel and Fabrizio Ferrandi and Radim Cmar and Jeronimo Castrillon and Christian Pilato},
booktitle = {Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract)},
title = {Etna: {MLIR}-Based System-Level Design and Optimization for Transparent Application Execution on {CPU}-{FPGA} Nodes},
location = {Orlando, CA USA},
pages = {1pp},
series = {FCCM’24},
month = may,
year = {2024},
}Downloads
2405_Suchert_FCCM [PDF]
Permalink
- Francesca Palumbo, Maria Katiuscia Zedda, Tiziana Fanni, Alessandra Bagnato, Luca Castello, Jeronimo Castrillon, Roberto Del Ponte, Yansha Deng, Bart Driessen, Mauro Fadda, Tristan Halna du Fretay, Julio de Oliveira Filho, Veena Rao, Francesco Regazzoni, Alfonso Rodríguez, Melanie Schranz, Giulia Sedda, "MYRTUS: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems", Proceedings of the 21st ACM International Conference on Computing Frontiers (CF'24), Association for Computing Machinery (ACM), pp. 101–106, New York, NY, USA, May 2024. [doi] [Bibtex & Downloads]
MYRTUS: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems
Reference
Francesca Palumbo, Maria Katiuscia Zedda, Tiziana Fanni, Alessandra Bagnato, Luca Castello, Jeronimo Castrillon, Roberto Del Ponte, Yansha Deng, Bart Driessen, Mauro Fadda, Tristan Halna du Fretay, Julio de Oliveira Filho, Veena Rao, Francesco Regazzoni, Alfonso Rodríguez, Melanie Schranz, Giulia Sedda, "MYRTUS: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems", Proceedings of the 21st ACM International Conference on Computing Frontiers (CF'24), Association for Computing Machinery (ACM), pp. 101–106, New York, NY, USA, May 2024. [doi]
Abstract
The MYRTUS Horizon Europe Project embraces the principles of the EUCloudEdgeIOT Initiative and integrates edge, fog and cloud computing platforms, leveraging a cognitive engine based on swarm intelligence and federated learning to orchestrate collaborative distributed and decentralised components. Components are augmented with interface contracts covering both functional and non-functional properties.
Bibtex
@InProceedings{palumbo_cf24,
author = {Francesca Palumbo and Maria Katiuscia Zedda and Tiziana Fanni and Alessandra Bagnato and Luca Castello and Jeronimo Castrillon and Roberto Del Ponte and Yansha Deng and Bart Driessen and Mauro Fadda and Tristan Halna du Fretay and Julio de Oliveira Filho and Veena Rao and Francesco Regazzoni and Alfonso Rodríguez and Melanie Schranz and Giulia Sedda},
booktitle = {Proceedings of the 21st ACM International Conference on Computing Frontiers (CF'24)},
title = {{MYRTUS}: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems},
location = {Ischia, Italy},
pages = {101–106},
numpages = {6},
publisher = {Association for Computing Machinery (ACM)},
series = {CF '24 Companion},
abstract = {The MYRTUS Horizon Europe Project embraces the principles of the EUCloudEdgeIOT Initiative and integrates edge, fog and cloud computing platforms, leveraging a cognitive engine based on swarm intelligence and federated learning to orchestrate collaborative distributed and decentralised components. Components are augmented with interface contracts covering both functional and non-functional properties.},
address = {New York, NY, USA},
month = may,
year = {2024},
doi = {10.1145/3637543.3654618},
isbn = {979-8-4007-0492-5/24/05},
}Downloads
2405_Palumbo_CF [PDF]
Permalink
- Jeronimo Castrillon, "Automatic optimization for heterogeneous in-memory computing", In Focus Session, Design, Automation and Test in Europe Conference (DATE) (invited talk), Mar 2024. [Bibtex & Downloads]
Automatic optimization for heterogeneous in-memory computing
Reference
Jeronimo Castrillon, "Automatic optimization for heterogeneous in-memory computing", In Focus Session, Design, Automation and Test in Europe Conference (DATE) (invited talk), Mar 2024.
Abstract
Fuelled by exciting advances in materials and devices, in-memory computing architectures now represent a promising avenue to advance computing systems. Plenty of manual designs have already demonstrated orders of magnitude improvement in compute efficiency compared to classical Von Neumann machines in different application domains. In this talk we discuss automation flows for programming and exploring the parameter space of in-memory architectures. We report on current efforts on building an extensible framework around the MLIR compiler infrastructure to abstract from individual technologies to foster re-use. Concretely, we present optimising flows for in-memory accelerators based on cross-bars, on content addressable memories and bulk-wise logic operations. We believe this kind of automation to be key to more quickly navigate the heterogeneous landscape of in-memory accelerators and to bring the benefits of emerging architectures to a boarder range of applications.
Bibtex
@Misc{castrillon_date2024,
author = {Castrillon, Jeronimo},
date = {2024-03},
title = {Automatic optimization for heterogeneous in-memory computing},
howpublished = {Focus Session, Design, Automation and Test in Europe Conference (DATE) (invited talk)},
location = {Valencia, Spain},
abstract = {Fuelled by exciting advances in materials and devices, in-memory computing architectures now represent a promising avenue to advance computing systems. Plenty of manual designs have already demonstrated orders of magnitude improvement in compute efficiency compared to classical Von Neumann machines in different application domains. In this talk we discuss automation flows for programming and exploring the parameter space of in-memory architectures. We report on current efforts on building an extensible framework around the MLIR compiler infrastructure to abstract from individual technologies to foster re-use. Concretely, we present optimising flows for in-memory accelerators based on cross-bars, on content addressable memories and bulk-wise logic operations. We believe this kind of automation to be key to more quickly navigate the heterogeneous landscape of in-memory accelerators and to bring the benefits of emerging architectures to a boarder range of applications.},
month = mar,
year = {2024},
}Downloads
No Downloads available for this publication
Permalink
- João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon, "Full-Stack Optimization for CAM-Only DNN Inference", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Mar 2024. [Bibtex & Downloads]
Full-Stack Optimization for CAM-Only DNN Inference
Reference
João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon, "Full-Stack Optimization for CAM-Only DNN Inference", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Mar 2024.
Abstract
The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von-Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy
Bibtex
@InProceedings{delima_date24,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Luigi Carro and Jeronimo Castrillon},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Full-Stack Optimization for CAM-Only DNN Inference},
location = {Valencia, Spain},
pages = {1-6},
publisher = {IEEE},
series = {DATE'24},
url = {https://ieeexplore.ieee.org/document/10546805},
abstract = {The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von-Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy},
month = mar,
year = {2024},
}Downloads
2403_deLima_DATE [PDF]
Permalink
- Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024. [Bibtex & Downloads]
Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers
Reference
Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024.
Bibtex
@InProceedings{niemier_date24,
author = {Michael Niemier and Zephan Enciso and Mohammad Mehdi Sharifi and X. Sharon Hu and Ian O'Connor and Alexander Graening and Ravit Sharma and Puneet Gupta and Jeronimo Castrillon and João Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Nashrah Afroze and Asif Islam Khan and Julien Ryckaert},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546772},
pages = {1--10},
publisher = {IEEE},
series = {DATE'24},
month = mar,
year = {2024},
}Downloads
2403_Niemier_DATE [PDF]
Permalink
- Christian Pilato, Subhadeep Banik, Jakub Beránek, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Radim Cmar, Serena Curzel, Fabrizio Ferrandi, Karl F. A. Friebel, Antonella Galizia, Matteo Grasso, Paulo Silva, Jan Martinovic, Gianluca Palermo, Michele Paolino, Andrea Parodi, Antonio Parodi, Fabio Pintus, Raphael Polig, David Poulet, Francesco Regazzoni, Burkhard Ringlein, Roberto Rocco, Katerina Slaninova, Tom Slooff, Stephanie Soldavini, Felix Suchert, Mattia Tibaldi, Beat Weiss, Christoph Hagleitner, "A System Development Kit for Big Data Applications on FPGA-based Clusters: The EVEREST Approach", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), 6pp, Mar 2024. [Bibtex & Downloads]
A System Development Kit for Big Data Applications on FPGA-based Clusters: The EVEREST Approach
Reference
Christian Pilato, Subhadeep Banik, Jakub Beránek, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Radim Cmar, Serena Curzel, Fabrizio Ferrandi, Karl F. A. Friebel, Antonella Galizia, Matteo Grasso, Paulo Silva, Jan Martinovic, Gianluca Palermo, Michele Paolino, Andrea Parodi, Antonio Parodi, Fabio Pintus, Raphael Polig, David Poulet, Francesco Regazzoni, Burkhard Ringlein, Roberto Rocco, Katerina Slaninova, Tom Slooff, Stephanie Soldavini, Felix Suchert, Mattia Tibaldi, Beat Weiss, Christoph Hagleitner, "A System Development Kit for Big Data Applications on FPGA-based Clusters: The EVEREST Approach", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), 6pp, Mar 2024.
Bibtex
@InProceedings{pilato_date24,
author = {Christian Pilato and Subhadeep Banik and Jakub Beránek and Fabien Brocheton and Jeronimo Castrillon and Riccardo Cevasco and Radim Cmar and Serena Curzel and Fabrizio Ferrandi and Karl F. A. Friebel and Antonella Galizia and Matteo Grasso and Paulo Silva and Jan Martinovic and Gianluca Palermo and Michele Paolino and Andrea Parodi and Antonio Parodi and Fabio Pintus and Raphael Polig and David Poulet and Francesco Regazzoni and Burkhard Ringlein and Roberto Rocco and Katerina Slaninova and Tom Slooff and Stephanie Soldavini and Felix Suchert and Mattia Tibaldi and Beat Weiss and Christoph Hagleitner},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {A System Development Kit for Big Data Applications on {FPGA}-based Clusters: The {EVEREST} Approach},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546518},
pages = {6pp},
series = {DATE'24},
month = mar,
year = {2024},
}Downloads
2403_Pilato_DATE [PDF]
Permalink
- Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024. [Bibtex & Downloads]
The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview
Reference
Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024.
Bibtex
@Report{khan_cimlandscape_2024,
author = {Asif Ali Khan and João Paulo C. De Lima and Hamid Farzaneh and Jeronimo Castrillon},
title = {The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview},
eprint = {2401.14428},
url = {https://arxiv.org/abs/2401.14428},
archiveprefix = {arXiv},
month = jan,
primaryclass = {cs.AR},
year = {2024},
}Downloads
No Downloads available for this publication
Permalink
- Asif Ali Khan, Fazal Hameed, Taha Shahroodi, Alex K. Jones, Jeronimo Castrillon, "Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, 4pp, Jan 2024. [doi] [Bibtex & Downloads]
Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory
Reference
Asif Ali Khan, Fazal Hameed, Taha Shahroodi, Alex K. Jones, Jeronimo Castrillon, "Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, 4pp, Jan 2024. [doi]
Bibtex
@Article{khan_ieeecal24,
author = {Asif Ali Khan and Fazal Hameed and Taha Shahroodi and Alex K. Jones and Jeronimo Castrillon},
title = {Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory},
pages = {4pp},
journal = {IEEE Computer Architecture Letters},
month = jan,
publisher = {IEEE},
year = {2024},
doi = {10.1109/LCA.2024.3350701},
url = {https://ieeexplore.ieee.org/document/10409506},
}Downloads
2401_Khan_IEEECAL [PDF]
Permalink
- Robert Khasanov, Marc Dietrich, Jeronimo Castrillon, "Flexible Spatio-Temporal Energy-Efficient Runtime Management", In Proceeding: 29th Asia and South Pacific Design Automation Conference (ASP-DAC’24), pp. 777–784, Jan 2024. [doi] [Bibtex & Downloads]
Flexible Spatio-Temporal Energy-Efficient Runtime Management
Reference
Robert Khasanov, Marc Dietrich, Jeronimo Castrillon, "Flexible Spatio-Temporal Energy-Efficient Runtime Management", In Proceeding: 29th Asia and South Pacific Design Automation Conference (ASP-DAC’24), pp. 777–784, Jan 2024. [doi]
Bibtex
@InProceedings{khasanov_aspdac24,
author = {Robert Khasanov and Marc Dietrich and Jeronimo Castrillon},
booktitle = {29th Asia and South Pacific Design Automation Conference (ASP-DAC’24)},
title = {Flexible Spatio-Temporal Energy-Efficient Runtime Management},
location = {Incheon, South Korea},
organization = {IEEE},
pages = {777--784},
month = jan,
year = {2024},
url = {https://ieeexplore.ieee.org/document/10473885},
doi = {10.1109/ASP-DAC58780.2024.10473885},
}Downloads
2401_Khasanov_ASPDAC [PDF]
Permalink
- Caio Vieira, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Hyperdimensional Computing Quantization with Thermometer Codes", In Proceeding: 6th Workshop on Accelerated Machine Learning (AccML), co-located with 19th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 7pp, Jan 2024. [Bibtex & Downloads]
Hyperdimensional Computing Quantization with Thermometer Codes
Reference
Caio Vieira, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Hyperdimensional Computing Quantization with Thermometer Codes", In Proceeding: 6th Workshop on Accelerated Machine Learning (AccML), co-located with 19th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 7pp, Jan 2024.
Bibtex
@InProceedings{vieira_accml24,
author = {Caio Vieira and Jeronimo Castrillon and Antonio Carlos Schneider Beck},
booktitle = {6th Workshop on Accelerated Machine Learning (AccML), co-located with 19th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Hyperdimensional Computing Quantization with Thermometer Codes},
location = {Munich, Germany},
pages = {7pp},
month = jan,
year = {2024},
url = {https://accml.dcs.gla.ac.uk/papers/2024/6th_AccML_paper_22.pdf},
}Downloads
2401_Vieira_AccML [PDF]
Permalink
2023
- Mirko Günther, Lars Schütze, Kilian Becher, Thorsten Strufe, Jeronimo Castrillon, "HElium: A Language and Compiler for Fully Homomorphic Encryption with Support for Proxy Re-Encryption", Dec 2023. [Bibtex & Downloads]
HElium: A Language and Compiler for Fully Homomorphic Encryption with Support for Proxy Re-Encryption
Reference
Mirko Günther, Lars Schütze, Kilian Becher, Thorsten Strufe, Jeronimo Castrillon, "HElium: A Language and Compiler for Fully Homomorphic Encryption with Support for Proxy Re-Encryption", Dec 2023.
Bibtex
@online{schuetze_arxiv23,
author = {Mirko G{\"u}nther and Lars Sch{\"u}tze and Kilian Becher and Thorsten Strufe and Jeronimo Castrillon},
title = {{HElium}: A Language and Compiler for Fully Homomorphic Encryption with Support for Proxy Re-Encryption},
eprint = {2312.14250},
url = {http://arxiv.org/abs/2312.14250},
journal = {arXiv preprint arXiv:2312.14250},
month = dec,
year = {2023},
}Downloads
2312_Schuetze_arXiv23 [PDF]
Permalink
- Jeronimo Castrillon, "Domain-specific programming methodologies for domain-specific and emerging computing systems", In 8th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), co-located with the International Conference on High Performance Computing, Networking, Storage and Analysis (SC23) (invited talk), Nov 2023. [Bibtex & Downloads]
Domain-specific programming methodologies for domain-specific and emerging computing systems
Reference
Jeronimo Castrillon, "Domain-specific programming methodologies for domain-specific and emerging computing systems", In 8th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), co-located with the International Conference on High Performance Computing, Networking, Storage and Analysis (SC23) (invited talk), Nov 2023.
Abstract
Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non Von-Neumann computer architectures. Innovation in programming abstractions and compilers are thus badly needed to cope with the current golden age of computer architecture. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning, targeting modern reconfigurable hardware, for emerging memory technologies and for emerging in-memory computing.
Bibtex
@Misc{castrillon_espm2023,
author = {Castrillon, Jeronimo},
date = {2023-11},
title = {Domain-specific programming methodologies for domain-specific and emerging computing systems},
howpublished = {8th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), co-located with the International Conference on High Performance Computing, Networking, Storage and Analysis (SC23) (invited talk)},
location = {Denver, CA, USA},
abstract = {Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non Von-Neumann computer architectures. Innovation in programming abstractions and compilers are thus badly needed to cope with the current golden age of computer architecture. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning, targeting modern reconfigurable hardware, for emerging memory technologies and for emerging in-memory computing.},
month = nov,
year = {2023},
}Downloads
No Downloads available for this publication
Permalink
- Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "Management of Energy-Aware Processes in Heterogeneous Architectures", In Poster at the 29th ACM Symposium on Operating Systems Principles (SOSP’23 Posters), Oct 2023. [Bibtex & Downloads]
Management of Energy-Aware Processes in Heterogeneous Architectures
Reference
Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "Management of Energy-Aware Processes in Heterogeneous Architectures", In Poster at the 29th ACM Symposium on Operating Systems Principles (SOSP’23 Posters), Oct 2023.
Bibtex
@Misc{smejkal_poster-sosp23,
author = {Till Smejkal and Robert Khasanov and Jeronimo Castrillon and Hermann H{\"a}rtig},
title = {Management of Energy-Aware Processes in Heterogeneous Architectures},
howpublished = {Poster at the 29th ACM Symposium on Operating Systems Principles (SOSP’23 Posters)},
location = {Koblenz, Germany},
month = oct,
year = {2023},
}Downloads
No Downloads available for this publication
Permalink
- Marcus Rossel, Shaokai Lin, Marten Lohstroh, Jeronimo Castrillon, Andrés Goens, "Provable Determinism for Software in Cyber-Physical Systems", Proceedings of the 15th International Conference on Verified Software: Theories, Tools, and Experiments, 12pp, Oct 2023. [Bibtex & Downloads]
Provable Determinism for Software in Cyber-Physical Systems
Reference
Marcus Rossel, Shaokai Lin, Marten Lohstroh, Jeronimo Castrillon, Andrés Goens, "Provable Determinism for Software in Cyber-Physical Systems", Proceedings of the 15th International Conference on Verified Software: Theories, Tools, and Experiments, 12pp, Oct 2023.
Bibtex
@InProceedings{rossel_vstte23,
author = {Marcus Rossel and Shaokai Lin and Marten Lohstroh and Jeronimo Castrillon and Andr{\'e}s Goens},
booktitle = {Proceedings of the 15th International Conference on Verified Software: Theories, Tools, and Experiments},
title = {Provable Determinism for Software in Cyber-Physical Systems},
doi = {10.1007/978-3-031-66064-1_6},
isbn = {978-3-031-66064-1},
organization = {Springer},
pages = {85--107},
publisher = {Springer Nature Switzerland},
url = {https://link.springer.com/chapter/10.1007/978-3-031-66064-1_6},
abstract = {In Cyber-Physical Systems (CPS), concurrently executing software components interact with each other and the physical environment to deliver functionality that is often safety-critical and time-sensitive. Verifying the correctness of the joint behavior of concurrent software components, however, is challenging. It is helpful to eliminate nondeterminism in the software, at the level of the programming model, and provide first-class programming constructs for expressing timed behavior. The Lingua Franca (LF) coordination language achieves this through the use of the Reactor model as its underlying model of computation. In this paper, we present the first formal operational semantics for the Reactor model, and prove its key properties of progress and determinism. The Reactor model and its associated proofs are fully mechanized in the Lean theorem prover. As an operational model, our semantics are close to the intuition for implementation and a helpful reference. The computational objects of the Reactor model are formalized in a modular fashion, which provides insights into the different structural properties of the model, and their effect on execution behavior.},
address = {Cham},
month = oct,
year = {2023},
}Downloads
2310_Rossel_VSSTE [PDF]
Permalink
- Marten Lohstroh, Soroush Bateni, Christian Menard, Alexander Schulz-Rosengarten, Jeronimo Castrillon, Edward A. Lee, "Deterministic Coordination Across Multiple Timelines", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 23, no. 5, New York, NY, USA, Oct 2023. [doi] [Bibtex & Downloads]
Deterministic Coordination Across Multiple Timelines
Reference
Marten Lohstroh, Soroush Bateni, Christian Menard, Alexander Schulz-Rosengarten, Jeronimo Castrillon, Edward A. Lee, "Deterministic Coordination Across Multiple Timelines", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 23, no. 5, New York, NY, USA, Oct 2023. [doi]
Abstract
We discuss a novel approach for constructing deterministic reactive systems that revolves around a temporal model that incorporates a multiplicity of timelines. This model is central to Lingua Franca (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called reactors, which are objects that react to and emit discrete events. Our temporal model differs from existing models like the logical execution time (LET) paradigm and synchronous languages in that it reflects that there are always at least two distinct timelines involved in a reactive system; a logical one and a physical one—and possibly multiple of each kind. This paper explains how the relationship between events across timelines facilitates reasoning about consistency and availability across components in Cyber-Physical Systems (CPS).
Bibtex
@Article{lohstroh_tecs23,
author = {Marten Lohstroh and Soroush Bateni and Christian Menard and Alexander Schulz-Rosengarten and Jeronimo Castrillon and Edward A. Lee},
title = {Deterministic Coordination Across Multiple Timelines},
doi = {10.1145/3615357},
issn = {1539-9087},
number = {5},
url = {https://doi.org/10.1145/3615357},
volume = {23},
abstract = {We discuss a novel approach for constructing deterministic reactive systems that revolves around a temporal model that incorporates a multiplicity of timelines. This model is central to Lingua Franca (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called reactors, which are objects that react to and emit discrete events. Our temporal model differs from existing models like the logical execution time (LET) paradigm and synchronous languages in that it reflects that there are always at least two distinct timelines involved in a reactive system; a logical one and a physical one—and possibly multiple of each kind. This paper explains how the relationship between events across timelines facilitates reasoning about consistency and availability across components in Cyber-Physical Systems (CPS).},
address = {New York, NY, USA},
articleno = {77},
issue_date = {September 2024},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
month = oct,
numpages = {29},
publisher = {Association for Computing Machinery},
year = {2023},
}Downloads
No Downloads available for this publication
Permalink
- Jeronimo Castrillon, "Next-generation compilers for emerging systems", In Workshop on Compilers, Deployment, and Tooling for Edge AI (CODAI’23), co-located with the ESWeek (keynote), Sep 2023. [Bibtex & Downloads]
Next-generation compilers for emerging systems
Reference
Jeronimo Castrillon, "Next-generation compilers for emerging systems", In Workshop on Compilers, Deployment, and Tooling for Edge AI (CODAI’23), co-located with the ESWeek (keynote), Sep 2023.
Bibtex
@Misc{castrillon_codai2023,
author = {Castrillon, Jeronimo},
date = {2023-09},
title = {Next-generation compilers for emerging systems},
howpublished = {Workshop on Compilers, Deployment, and Tooling for Edge AI (CODAI’23), co-located with the ESWeek (keynote)},
location = {Hamburg, Germany},
month = sep,
url = {http://codai-workshop.com},
year = {2023},
}Downloads
230921_CODAI-castrillon-compressed [PDF]
Permalink
- Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi] [Bibtex & Downloads]
Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications
Reference
Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi]
Abstract
This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in- memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.
Bibtex
@InProceedings{henkel_cases23,
author = {J\"{o}rg Henkel and Lokesh Siddhu and Lars Bauer and J\"{u}rgen Teich and Stefan Wildermann and Mehdi Tahoori and Mahta Mayahinia and Jeronimo Castrillon and Asif Ali Khan and Hamid Farzaneh and Jo\~{a}o Paulo C. de Lima and Jian-Jia Chen and Christian Hakert and Kuan-Hsun Chen and Chia-Lin Yang and Hsiang-Yun Cheng},
booktitle = {Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
title = {Special Session -- Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications},
location = {Hamburg, Germany},
abstract = {This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in- memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.},
pages = {11--20},
url = {https://ieeexplore.ieee.org/abstract/document/10316216},
doi = {10.1145/3607889.3609088},
isbn = {9798400702907},
series = {CASES '23 Companion},
issn = {2643-1726},
month = sep,
numpages = {10},
year = {2023},
}Downloads
2309_Henkel_CASES [PDF]
Permalink
- Christian Menard, Marten Lohstroh, Soroush Bateni, Mathhew Chorlian, Arthur Deng, Peter Donovan, Clément Fournier, Shaokai Lin, Felix Suchert, Tassilo Tanneberger, Hokeun Kim, Jeronimo Castrillon, Edward A. Lee, "High-Performance Deterministic Concurrency using Lingua Franca", In ACM Transactions on Architecture and Code Optimization (TACO), Association for Computing Machinery, New York, NY, USA, Aug 2023. [doi] [Bibtex & Downloads]
High-Performance Deterministic Concurrency using Lingua Franca
Reference
Christian Menard, Marten Lohstroh, Soroush Bateni, Mathhew Chorlian, Arthur Deng, Peter Donovan, Clément Fournier, Shaokai Lin, Felix Suchert, Tassilo Tanneberger, Hokeun Kim, Jeronimo Castrillon, Edward A. Lee, "High-Performance Deterministic Concurrency using Lingua Franca", In ACM Transactions on Architecture and Code Optimization (TACO), Association for Computing Machinery, New York, NY, USA, Aug 2023. [doi]
Abstract
Actor frameworks and similar reactive programming techniques are widely used for building concurrent systems. They promise to be efficient and scale well to a large number of cores or nodes in a distributed system. However, they also expose programmers to nondeterminism, which often makes implementations hard to understand, debug, and test. The recently proposed reactor model is a promising alternative that enables deterministic concurrency. In this paper, we present an efficient, parallel implementation of reactors and demonstrate that the determinacy of reactors does not imply a loss in performance. To show this, we evaluate Lingua Franca (LF), a reactor-oriented coordination language. LF equips mainstream programming languages with a deterministic concurrency model that automatically takes advantage of opportunities to exploit parallelism. Our implementation of the Savina benchmark suite demonstrates that, in terms of execution time, the runtime performance of LF programs even exceeds popular and highly optimized actor frameworks. We compare against Akka and CAF, which LF outperforms by 1.86x and 1.42x, respectively.
Bibtex
@Article{menard_taco23,
author = {Menard, Christian and Lohstroh, Marten and Bateni, Soroush and Chorlian, Mathhew and Deng, Arthur and Donovan, Peter and Fournier, Clément and Lin, Shaokai and Suchert, Felix and Tanneberger, Tassilo and Kim, Hokeun and Castrillon, Jeronimo and Lee, Edward A.},
title = {High-Performance Deterministic Concurrency using Lingua Franca},
doi = {10.1145/3617687},
issn = {1544-3566},
number = {4},
pages = {1--29},
url = {https://doi.org/10.1145/3617687},
volume = {20},
abstract = {Actor frameworks and similar reactive programming techniques are widely used for building concurrent systems. They promise to be efficient and scale well to a large number of cores or nodes in a distributed system. However, they also expose programmers to nondeterminism, which often makes implementations hard to understand, debug, and test. The recently proposed reactor model is a promising alternative that enables deterministic concurrency. In this paper, we present an efficient, parallel implementation of reactors and demonstrate that the determinacy of reactors does not imply a loss in performance. To show this, we evaluate Lingua Franca (LF), a reactor-oriented coordination language. LF equips mainstream programming languages with a deterministic concurrency model that automatically takes advantage of opportunities to exploit parallelism. Our implementation of the Savina benchmark suite demonstrates that, in terms of execution time, the runtime performance of LF programs even exceeds popular and highly optimized actor frameworks. We compare against Akka and CAF, which LF outperforms by 1.86x and 1.42x, respectively.},
address = {New York, NY, USA},
articleno = {48},
copyright = {Creative Commons Attribution 4.0 International},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
month = aug,
numpages = {29},
publisher = {Association for Computing Machinery},
year = {2023},
}Downloads
2309_Menard_TACO [PDF]
Permalink
- Lars Schütze, Jeronimo Castrillon, "Towards Virtual Machine Support for Contextual Role-Oriented Programming Languages", Proceedings of the 15th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'23), Association for Computing Machinery, pp. 1–8, New York, NY, USA, Jul 2023. [doi] [Bibtex & Downloads]
Towards Virtual Machine Support for Contextual Role-Oriented Programming Languages
Reference
Lars Schütze, Jeronimo Castrillon, "Towards Virtual Machine Support for Contextual Role-Oriented Programming Languages", Proceedings of the 15th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'23), Association for Computing Machinery, pp. 1–8, New York, NY, USA, Jul 2023. [doi]
Abstract
Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. Most approaches focus on optimizing language implementations neglecting the fact that the generated code is a verbose description of contextual roles in an object-oriented paradigm, which incurs an overhead. This paper takes a novel approach to reduce the semantic gap. We propose ObjectTeams/Truffle, to the best of our knowledge, the first virtual machine that optimizes the dispatch of contextual roles. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a speedup of up to 2.49\texttimes over the reference implementation ObjectTeams/Java and 1.2\texttimes over an optimized version ObjectTeams/Java using Dispatch Plans.
Bibtex
@InProceedings{schuetze_cop23,
author = {Sch\"{u}tze, Lars and Castrillon, Jeronimo},
booktitle = {Proceedings of the 15th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'23)},
title = {Towards Virtual Machine Support for Contextual Role-Oriented Programming Languages},
doi = {10.1145/3605154.3605851},
isbn = {9798400702440},
location = {Seattle, USA},
pages = {1–8},
publisher = {Association for Computing Machinery},
series = {COP '23},
url = {https://doi.org/10.1145/3605154.3605851},
abstract = {Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. Most approaches focus on optimizing language implementations neglecting the fact that the generated code is a verbose description of contextual roles in an object-oriented paradigm, which incurs an overhead. This paper takes a novel approach to reduce the semantic gap. We propose ObjectTeams/Truffle, to the best of our knowledge, the first virtual machine that optimizes the dispatch of contextual roles. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a speedup of up to 2.49\texttimes{} over the reference implementation ObjectTeams/Java and 1.2\texttimes{} over an optimized version ObjectTeams/Java using Dispatch Plans.},
address = {New York, NY, USA},
month = jul,
numpages = {8},
year = {2023},
}Downloads
2307_Schuetze_COP [PDF]
Permalink
- João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceeding: 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023. [Bibtex & Downloads]
Efficient Associative Processing with RTM-TCAMs
Reference
João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceeding: 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023.
Bibtex
@InProceedings{lima_imacaw23,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Jeronimo Castrillon},
booktitle = {1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23)},
title = {Efficient Associative Processing with RTM-TCAMs},
location = {San Francisco, CA, USA},
pages = {2pp},
month = jul,
year = {2023},
}Downloads
2307_deLima_iMACAW [PDF]
Permalink
- Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs", In Proceeding: 37th European Conference on Object-Oriented Programming (ECOOP 2023) (Ali, Karim and Salvaneschi, Guido), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 263, pp. 33:1–33:39, Dagstuhl, Germany, Jul 2023. (ECOOP 2023 Distinguished Artifact Award) [doi] [Bibtex & Downloads]
ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs
Reference
Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs", In Proceeding: 37th European Conference on Object-Oriented Programming (ECOOP 2023) (Ali, Karim and Salvaneschi, Guido), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 263, pp. 33:1–33:39, Dagstuhl, Germany, Jul 2023. (ECOOP 2023 Distinguished Artifact Award) [doi]
Abstract
SAT/SMT-solvers and model checkers automate formal verification of sequential programs. Formal reasoning about scalable concurrent programs is still manual and requires expert knowledge. But scalability is a fundamental requirement of current and future programs.
Sequential imperative programs compose statements, function/method calls and control flow constructs. Concurrent programming models provide constructs for concurrent composition. Concurrency abstractions such as threads and synchronization primitives such as locks compose the individual parts of a concurrent program that are meant to execute in parallel. We propose to rather compose the individual parts again using sequential composition and compile this sequential composition into a concurrent one. The developer can use existing tools to formally verify the sequential program while the translated concurrent program provides the dearly requested scalability.
Following this insight, we present ConDRust, a new programming model and compiler for Rust programs. The ConDRust compiler translates sequential composition into a concurrent composition based on threads and message-passing channels. During compilation, the compiler preserves the semantics of the sequential program along with much desired properties such as determinism.
Our evaluation shows that our ConDRust compiler generates concurrent deterministic code that can outperform even non-deterministic programs by up to a factor of three for irregular algorithms that are particularly hard to parallelize.Bibtex
@InProceedings{suchert_ecoop23,
author = {Felix Suchert and Lisza Zeidler and Jeronimo Castrillon and Sebastian Ertel},
booktitle = {37th European Conference on Object-Oriented Programming (ECOOP 2023)},
title = {{ConDRust}: Scalable Deterministic Concurrency from Verifiable Rust Programs},
doi = {10.4230/LIPIcs.ECOOP.2023.33},
editor = {Ali, Karim and Salvaneschi, Guido},
isbn = {978-3-95977-281-5},
location = {Seattle, Washington, USA},
pages = {33:1--33:39},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
url = {https://drops.dagstuhl.de/opus/volltexte/2023/18226},
volume = {263},
abstract = {SAT/SMT-solvers and model checkers automate formal verification of sequential programs. Formal reasoning about scalable concurrent programs is still manual and requires expert knowledge. But scalability is a fundamental requirement of current and future programs.
Sequential imperative programs compose statements, function/method calls and control flow constructs. Concurrent programming models provide constructs for concurrent composition. Concurrency abstractions such as threads and synchronization primitives such as locks compose the individual parts of a concurrent program that are meant to execute in parallel. We propose to rather compose the individual parts again using sequential composition and compile this sequential composition into a concurrent one. The developer can use existing tools to formally verify the sequential program while the translated concurrent program provides the dearly requested scalability.
Following this insight, we present ConDRust, a new programming model and compiler for Rust programs. The ConDRust compiler translates sequential composition into a concurrent composition based on threads and message-passing channels. During compilation, the compiler preserves the semantics of the sequential program along with much desired properties such as determinism.
Our evaluation shows that our ConDRust compiler generates concurrent deterministic code that can outperform even non-deterministic programs by up to a factor of three for irregular algorithms that are particularly hard to parallelize.},
address = {Dagstuhl, Germany},
issn = {1868-8969},
month = jul,
urn = {urn:nbn:de:0030-drops-182263},
year = {2023},
}Downloads
2307_Suchert_ECOOP [PDF]
Permalink
- Karl F. A. Friebel, Jiahong Bi, Jeronimo Castrillon, "BASE2: An IR for Binary Numeral Types", In Proceeding: 13th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2023), Association for Computing Machinery, pp. 19–26, New York, NY, USA, Jun 2023. [doi] [Bibtex & Downloads]
BASE2: An IR for Binary Numeral Types
Reference
Karl F. A. Friebel, Jiahong Bi, Jeronimo Castrillon, "BASE2: An IR for Binary Numeral Types", In Proceeding: 13th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2023), Association for Computing Machinery, pp. 19–26, New York, NY, USA, Jun 2023. [doi]
Abstract
Custom data types and arbitrary-precision arithmetic are often key for efficient hardware designs on Field Programmable Gate Array (FPGA) platforms. Current end-to-end flows incorporating quantization are not only domain-specific, but also tightly integrated and not repurposable. Abstractions for arbitrary-precision arithmetic are generally vendor-specific, and results are hardly portable across platforms. In this work, we present a new Intermediate Representation (IR), base2, to address the programmability issues of custom data types in reconfigurable hardware. We contextualize our proposal in the greater LLVM (llvm) ecosystem, where we show how existing abstractions can be simplified and unified. We implement base2 in Multi-Level Intermediate Representation (MLIR), which allows it to be used in a variety of existing and future target-agnostic front-ends. We demonstrate the power of our model by applying it to sample kernels and evaluating the accuracy of the result. For these samples, we achieve interoperability with an existing end-to-end High-Level Synthesis (HLS) flow.
Bibtex
@InProceedings{friebel_heart23,
author = {Karl F. A. Friebel and Jiahong Bi and Jeronimo Castrillon},
booktitle = {13th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2023)},
title = {{BASE2}: An {IR} for Binary Numeral Types},
doi = {10.1145/3597031.3597048},
isbn = {9798400700439},
location = {Kusatsu, Japan},
pages = {19--26},
publisher = {Association for Computing Machinery},
series = {HEART2023},
url = {https://doi.org/10.1145/3597031.3597048},
abstract = {Custom data types and arbitrary-precision arithmetic are often key for efficient hardware designs on Field Programmable Gate Array (FPGA) platforms. Current end-to-end flows incorporating quantization are not only domain-specific, but also tightly integrated and not repurposable. Abstractions for arbitrary-precision arithmetic are generally vendor-specific, and results are hardly portable across platforms. In this work, we present a new Intermediate Representation (IR), base2, to address the programmability issues of custom data types in reconfigurable hardware. We contextualize our proposal in the greater LLVM (llvm) ecosystem, where we show how existing abstractions can be simplified and unified. We implement base2 in Multi-Level Intermediate Representation (MLIR), which allows it to be used in a variety of existing and future target-agnostic front-ends. We demonstrate the power of our model by applying it to sample kernels and evaluating the accuracy of the result. For these samples, we achieve interoperability with an existing end-to-end High-Level Synthesis (HLS) flow.},
address = {New York, NY, USA},
month = jun,
numpages = {8},
year = {2023},
}Downloads
2306_Friebel_HEART [PDF]
Permalink
- Guilherme Korol, Michael Guilherme Jordan, Mateus Beck Rutzig, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Design Space Exploration for CNN Offloading to FPGAs at the Edge", In Proceeding: 2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), IEEE Computer Society, Los Alamitos, CA, USA, Jun 2023. [doi] [Bibtex & Downloads]
Design Space Exploration for CNN Offloading to FPGAs at the Edge
Reference
Guilherme Korol, Michael Guilherme Jordan, Mateus Beck Rutzig, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Design Space Exploration for CNN Offloading to FPGAs at the Edge", In Proceeding: 2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), IEEE Computer Society, Los Alamitos, CA, USA, Jun 2023. [doi]
Abstract
AI-based IoT applications relying on heavy-load deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing the data to get offloaded to so-called edge servers with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms and diverse conditions make even powerful devices, such as those equipped with FPGAs, insufficient to cope with the current demands. In this case, optimizations in the algorithms, like pruning and early-exit, are mandatory to reduce the CNNs computational burden and speed up inference processing. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2x) and process inferences at higher user quality of experience (by up to 12.5%).
Bibtex
@InProceedings{korol_isvlsi23,
author = {Guilherme Korol and Michael Guilherme Jordan and Mateus Beck Rutzig and Jeronimo Castrillon and Antonio Carlos Schneider Beck},
booktitle = {2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)},
title = {Design Space Exploration for {CNN} Offloading to {FPGAs} at the Edge},
doi = {10.1109/ISVLSI59464.2023.10238644},
location = {Foz do Igua{\c{c}}u, Brazil},
organization = {IEEE},
publisher = {IEEE Computer Society},
url = {https://ieeexplore.ieee.org/abstract/document/10238644},
address = {Los Alamitos, CA, USA},
month = jun,
year = {2023},
location = {Foz do Igua{\c{c}}u, Brazil},
abstract = {AI-based IoT applications relying on heavy-load deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing the data to get offloaded to so-called edge servers with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms and diverse conditions make even powerful devices, such as those equipped with FPGAs, insufficient to cope with the current demands. In this case, optimizations in the algorithms, like pruning and early-exit, are mandatory to reduce the CNNs computational burden and speed up inference processing. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2x) and process inferences at higher user quality of experience (by up to 12.5%).},
}Downloads
2306_Korol_ISVLSI [PDF]
Permalink
- Jeronimo Castrillon, "Intermediate abstractions and optimizing compilers for adaptable HPC", In 4th Workshop on LLVM Compiler and Tools for HPC (LLVM-CTH’23), co-located with the 38th ISC High Performance Conference (invited talk), May 2023. [Bibtex & Downloads]
Intermediate abstractions and optimizing compilers for adaptable HPC
Reference
Jeronimo Castrillon, "Intermediate abstractions and optimizing compilers for adaptable HPC", In 4th Workshop on LLVM Compiler and Tools for HPC (LLVM-CTH’23), co-located with the 38th ISC High Performance Conference (invited talk), May 2023.
Bibtex
@Misc{castrillon_isc_llcvm-cth2023,
author = {Castrillon, Jeronimo},
date = {2023-05},
title = {Intermediate abstractions and optimizing compilers for adaptable HPC},
howpublished = {4th Workshop on LLVM Compiler and Tools for HPC (LLVM-CTH’23), co-located with the 38th ISC High Performance Conference (invited talk)},
location = {Hamburg, Germany},
url = {https://hps.vi4io.org/events/2023/llvm},
month = may,
year = {2023},
}Downloads
No Downloads available for this publication
Permalink
- Guilherme Korol, Michael Guilherme Jordan, Mateus Beck Rutzig, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Pruning and Early-Exit Co-Optimization for CNN Acceleration on FPGAs", Proceedings of the 2023 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Apr 2023. [doi] [Bibtex & Downloads]
Pruning and Early-Exit Co-Optimization for CNN Acceleration on FPGAs
Reference
Guilherme Korol, Michael Guilherme Jordan, Mateus Beck Rutzig, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Pruning and Early-Exit Co-Optimization for CNN Acceleration on FPGAs", Proceedings of the 2023 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Apr 2023. [doi]
Abstract
The challenge of processing heavy-load ML tasks, particularly CNN-based ones at resource-constrained IoT devices, has encouraged the use of edge servers. The edge offers performance levels higher than the end devices and better latency and security levels than the Cloud. On top of that, the rising complexity of ML applications, the ever-increasing number of connected devices, and the current demands for energy efficiency require optimizing such CNN models. Pruning and early-exit are notable optimizations that have been successfully used to alleviate the computational cost of inference. However, these optimizations have not yet been exploited simultaneously: while pruning is usually applied at design time, which involves retraining the CNN before deployment, early-exit is inherently dynamic. In this work, we propose AdaPEx, a framework that exploits the intrinsic reconfigurable FPGA capabilities so both can be cooperatively employed. AdaPEx first explores the trade-off between pruning and early-exit at design-time, creating a design space never exploited in the state-of-the-art. Then, AdaPEx applies FPGA reconfiguration as a means to enable the combined use of pruning and early-exit dynamically. At runtime, this allows matching the inference processing to the current edge conditions and a user-configurable accuracy threshold. In a smart IoT application, AdaPEx processes up to 1.32x more inferences and improves EDP by up to 2.55x over the state-of-the-art FPGA-based FINN accelerator.
Bibtex
@InProceedings{korol_date23,
author = {Guilherme Korol and Michael Guilherme Jordan and Mateus Beck Rutzig and Jeronimo Castrillon and Antonio Carlos Schneider Beck},
booktitle = {Proceedings of the 2023 Design, Automation and Test in Europe Conference (DATE)},
title = {Pruning and Early-Exit Co-Optimization for CNN Acceleration on FPGAs},
location = {Antwerp, Belgium},
pages = {6pp},
publisher = {IEEE},
series = {DATE'23},
month = apr,
year = {2023},
doi = {10.23919/DATE56975.2023.10137244},
pages = {1-6},
url = {https://ieeexplore.ieee.org/document/10137244},
abstract = {The challenge of processing heavy-load ML tasks, particularly CNN-based ones at resource-constrained IoT devices, has encouraged the use of edge servers. The edge offers performance levels higher than the end devices and better latency and security levels than the Cloud. On top of that, the rising complexity of ML applications, the ever-increasing number of connected devices, and the current demands for energy efficiency require optimizing such CNN models. Pruning and early-exit are notable optimizations that have been successfully used to alleviate the computational cost of inference. However, these optimizations have not yet been exploited simultaneously: while pruning is usually applied at design time, which involves retraining the CNN before deployment, early-exit is inherently dynamic. In this work, we propose AdaPEx, a framework that exploits the intrinsic reconfigurable FPGA capabilities so both can be cooperatively employed. AdaPEx first explores the trade-off between pruning and early-exit at design-time, creating a design space never exploited in the state-of-the-art. Then, AdaPEx applies FPGA reconfiguration as a means to enable the combined use of pruning and early-exit dynamically. At runtime, this allows matching the inference processing to the current edge conditions and a user-configurable accuracy threshold. In a smart IoT application, AdaPEx processes up to 1.32x more inferences and improves EDP by up to 2.55x over the state-of-the-art FPGA-based FINN accelerator.},
}Downloads
2304_Korol_DATE [PDF]
Permalink
- Carlos Escuin, Asif Ali Khan, Pablo Ibáñez-Marín, Teresa Monreal, Jeronimo Castrillon, Víctor Viñals-Yúfera, "Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs", In Proceeding: the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA'23), IEEE Computer Society, pp. 179–192, Los Alamitos, CA, USA, Mar 2023. [doi] [Bibtex & Downloads]
Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs
Reference
Carlos Escuin, Asif Ali Khan, Pablo Ibáñez-Marín, Teresa Monreal, Jeronimo Castrillon, Víctor Viñals-Yúfera, "Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs", In Proceeding: the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA'23), IEEE Computer Society, pp. 179–192, Los Alamitos, CA, USA, Mar 2023. [doi]
Abstract
Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs are proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks.This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime.Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17x compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9% while achieving a comparative lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1% and 1.9% performance, the NVM lifetime can be further increased by 28% and 44%, respectively.
Bibtex
@InProceedings{escuin_hpca23,
author = {Carlos Escuin and Asif Ali Khan and Pablo Ibáñez-Marín and Teresa Monreal and Jeronimo Castrillon and Víctor Viñals-Yúfera},
booktitle = {the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA'23)},
title = {Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs},
organization = {IEEE},
pages = {179--192},
abstract = {Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs are proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks.This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime.Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17x compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9\% while achieving a comparative lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1\% and 1.9\% performance, the NVM lifetime can be further increased by 28\% and 44\%, respectively.},
doi = {10.1109/HPCA56546.2023.10070968},
url = {https://doi.ieeecomputersociety.org/10.1109/HPCA56546.2023.10070968},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = mar,
year = {2023},
}Downloads
2302_Escuin_HPCA [PDF]
Related Paths
Permalink
- Asif Ali Khan, Sebastien Ollivier, Fazal Hameed, Jeronimo Castrillon, Alex K. Jones, "DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories", In IEEE Transactions on Computers, IEEE, Mar 2023. [doi] [Bibtex & Downloads]
DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories
Reference
Asif Ali Khan, Sebastien Ollivier, Fazal Hameed, Jeronimo Castrillon, Alex K. Jones, "DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories", In IEEE Transactions on Computers, IEEE, Mar 2023. [doi]
Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6%, and 70.8% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2%.
Bibtex
@Article{khan_toc23,
author = {Asif Ali Khan and Sebastien Ollivier and Fazal Hameed and Jeronimo Castrillon and Alex K. Jones},
date = {2023-03},
journal = {IEEE Transactions on Computers},
doi = {10.1109/TC.2023.3257509},
title = {DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories},
abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6\%, and 70.8\% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2\%.},
month = mar,
numpages = {15},
publisher = {IEEE},
year = {2023},
}Downloads
No Downloads available for this publication
Permalink
- Carlos Escuín Blasco, Asif Ali Khan, Pablo Enrique Ibáñez Marín, Teresa Monreal Arnal, Denis Navarro, José M Llaberia Griñó, Jeronimo Castrillon, Victor Viñals Yúfera, "Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache", In Proceeding: 14th Annual Non-Volatile Memories Workshop: University of California, San Diego, Mar 2023. [Bibtex & Downloads]
Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache
Reference
Carlos Escuín Blasco, Asif Ali Khan, Pablo Enrique Ibáñez Marín, Teresa Monreal Arnal, Denis Navarro, José M Llaberia Griñó, Jeronimo Castrillon, Victor Viñals Yúfera, "Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache", In Proceeding: 14th Annual Non-Volatile Memories Workshop: University of California, San Diego, Mar 2023.
Bibtex
@InProceedings{escuin_nvmw23,
author = {Escuín Blasco, Carlos and Ali Khan, Asif and Ibáñez Marín, Pablo Enrique and Monreal Arnal, Teresa and Navarro, Denis and Llaberia Griñó, José M and Castrillon, Jeronimo and Viñals Yúfera, Victor},
booktitle = {14th Annual Non-Volatile Memories Workshop: University of California, San Diego},
title = {Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache},
organization = {University of California, Los Angeles (UCLA)},
url = {https://upcommons.upc.edu/bitstream/handle/2117/395690/nvmw2023-paper8-final_version_your_extended_abstract.pdf?sequence=1},
month = mar,
year = {2023},
}Downloads
2303_Escuin_NVMW [PDF]
Permalink
- Stephanie Soldavini, Karl F. A. Friebel, Mattia Tibaldi, Gerald Hempel, Jeronimo Castrillon, Christian Pilato, "Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics", In ACM Transactions on Reconfigurable Technology and Systems (TRETS), Association for Computing Machinery, vol. 16, no. 2, New York, NY, USA, Mar 2023. [doi] [Bibtex & Downloads]
Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics
Reference
Stephanie Soldavini, Karl F. A. Friebel, Mattia Tibaldi, Gerald Hempel, Jeronimo Castrillon, Christian Pilato, "Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics", In ACM Transactions on Reconfigurable Technology and Systems (TRETS), Association for Computing Machinery, vol. 16, no. 2, New York, NY, USA, Mar 2023. [doi]
Abstract
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25 \texttimes more energy efficient than expert-crafted Intel CPU implementations.
Bibtex
@Article{friebel_trets23,
author = {Stephanie Soldavini and Karl F. A. Friebel and Mattia Tibaldi and Gerald Hempel and Jeronimo Castrillon and Christian Pilato},
title = {Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics},
doi = {10.1145/3563553},
issn = {1936-7406},
number = {2},
url = {https://doi.org/10.1145/3563553},
volume = {16},
abstract = {Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25 \texttimes{} more energy efficient than expert-crafted Intel CPU implementations.},
address = {New York, NY, USA},
articleno = {21},
journal = {ACM Transactions on Reconfigurable Technology and Systems (TRETS)},
month = mar,
numpages = {34},
publisher = {Association for Computing Machinery},
year = {2023},
}Downloads
No Downloads available for this publication
Permalink
- Jeronimo Castrillon, "Programming abstractions and optimizing compilers for energy-efficient computing", In 1st Workshop on NetZero Carbon Computing (NetZero'23), Co-located with the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29) (invited talk), Feb 2023. [Bibtex & Downloads]
Programming abstractions and optimizing compilers for energy-efficient computing
Reference
Jeronimo Castrillon, "Programming abstractions and optimizing compilers for energy-efficient computing", In 1st Workshop on NetZero Carbon Computing (NetZero'23), Co-located with the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29) (invited talk), Feb 2023.
Abstract
The demise of scaling laws in micro-electronics has led to an era of innovation in software and hardware architectures aimed at improving the energy efficiency of computing systems. Albeit still highly relevant, software optimizations for mainstream systems, which make the bulk of today's computing systems, provide ever-decreasing returns in the range of single-digit percentages. This is why lots of attention has rightfully turn to domain-specific architectures and emerging technologies which promise improvements of one to several orders of magnitude. Software development for these novel systems is still characterized by low-level expert coding and brittle toolchains, preventing hardware innovations from reaching a broader impact. In this talk, we discuss ongoing efforts on providing high-level programming abstractions and optimizing compilers to automatically target emerging computing systems. We do this by looking at three ongoing projects, namely, (i) a collaborative HW-SW effort to reduce the energy footprint of baseband processing in upcoming cellular networks, (ii) and end-to-end compilation for energy-efficient HPC simulations on state-of-the-art reconfigurable systems, and (iii) an extensible compilation framework to optimize for novel in-memory and near-memory computing systems. Finally, we are interested in discussing how these kinds of tools can be embedded in the larger picture of full life-cycle management.
Bibtex
@Misc{castrillon_netzero2023,
author = {Castrillon, Jeronimo},
date = {2023-02},
title = {Programming abstractions and optimizing compilers for energy-efficient computing},
howpublished = {1st Workshop on NetZero Carbon Computing (NetZero'23), Co-located with the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29) (invited talk)},
location = {Montreal, Canada},
url = {https://netzero.sysnet.ucsd.edu},
abstract = {The demise of scaling laws in micro-electronics has led to an era of innovation in software and hardware architectures aimed at improving the energy efficiency of computing systems. Albeit still highly relevant, software optimizations for mainstream systems, which make the bulk of today's computing systems, provide ever-decreasing returns in the range of single-digit percentages. This is why lots of attention has rightfully turn to domain-specific architectures and emerging technologies which promise improvements of one to several orders of magnitude. Software development for these novel systems is still characterized by low-level expert coding and brittle toolchains, preventing hardware innovations from reaching a broader impact. In this talk, we discuss ongoing efforts on providing high-level programming abstractions and optimizing compilers to automatically target emerging computing systems. We do this by looking at three ongoing projects, namely, (i) a collaborative HW-SW effort to reduce the energy footprint of baseband processing in upcoming cellular networks, (ii) and end-to-end compilation for energy-efficient HPC simulations on state-of-the-art reconfigurable systems, and (iii) an extensible compilation framework to optimize for novel in-memory and near-memory computing systems. Finally, we are interested in discussing how these kinds of tools can be embedded in the larger picture of full life-cycle management.},
month = feb,
year = {2023},
}Downloads
2302_castrillon_NetZero [PDF]
Permalink
- Jeronimo Castrillon, Karol Desnos, Andrés Goens, Christian Menard, "Dataflow Models of Computation for Programming Heterogeneous Multicores", Springer Nature Singapore, pp. 1–40, Singapore, Jan 2023. [doi] [Bibtex & Downloads]
Dataflow Models of Computation for Programming Heterogeneous Multicores
Reference
Jeronimo Castrillon, Karol Desnos, Andrés Goens, Christian Menard, "Dataflow Models of Computation for Programming Heterogeneous Multicores", Springer Nature Singapore, pp. 1–40, Singapore, Jan 2023. [doi]
Abstract
The hardware complexity of modern integrated circuits keeps increasing at a steady pace. Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) integrate general-purpose processing elements, domain-specific processors, dedicated hardware accelerators, reconfigurable logic, as well as complex memory hierarchies and interconnect. While offering unprecedented computational power and energy efficiency, MPSoCs are notoriously difficult to program. This chapter presents Models of Computation (MoCs) as an appealing alternative to traditional programming methodologies to harness the full capacities of modern MPSoCs. By raising the level of abstraction, MoCs make it possible to specify complex systems with little knowledge of the target architecture. The properties of MoCs make it possible for tools to automatically generate efficient implementations for heterogeneous MPSoCs, relieving developers from time-consuming manual exploration. This chapter focuses on a specific MoC family called dataflow MoCs. Dataflow MoCs represent systems as graphs of computational entities and communication channels. This graph-based system specification enables intuitive description of parallelism and supports many analysis and optimization techniques for deriving safe and highly efficient implementations on MPSoCs.
Bibtex
@InBook{castrillon_hca22,
author = {Jeronimo Castrillon and Karol Desnos and Andrés Goens and Christian Menard},
booktitle = {Handbook of Computer Architecture},
date = {2023-01},
pages = {1--40},
title = {Dataflow Models of Computation for Programming Heterogeneous Multicores},
doi = {10.1007/978-981-15-6401-7_45-2},
isbn = {978-981-15-6401-7},
url = {https://doi.org/10.1007/978-981-15-6401-7_45-2},
editor = {Anupam Chattopadhyay et al.},
publisher = {Springer Nature Singapore},
address = {Singapore},
abstract = {The hardware complexity of modern integrated circuits keeps increasing at a steady pace. Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) integrate general-purpose processing elements, domain-specific processors, dedicated hardware accelerators, reconfigurable logic, as well as complex memory hierarchies and interconnect. While offering unprecedented computational power and energy efficiency, MPSoCs are notoriously difficult to program. This chapter presents Models of Computation (MoCs) as an appealing alternative to traditional programming methodologies to harness the full capacities of modern MPSoCs. By raising the level of abstraction, MoCs make it possible to specify complex systems with little knowledge of the target architecture. The properties of MoCs make it possible for tools to automatically generate efficient implementations for heterogeneous MPSoCs, relieving developers from time-consuming manual exploration. This chapter focuses on a specific MoC family called dataflow MoCs. Dataflow MoCs represent systems as graphs of computational entities and communication channels. This graph-based system specification enables intuitive description of parallelism and supports many analysis and optimization techniques for deriving safe and highly efficient implementations on MPSoCs.},
month = jan,
year = {2023},
}Downloads
2301_Castrillon_dataflow-programmig_preview-www [PDF]
Permalink
- Karl F. A. Friebel, Asif Ali Khan, Lorenzo Chelini, Jeronimo Castrillon, "Modelling linear algebra kernels as polyhedral volume operations", In Proceeding: 13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2023. [Bibtex & Downloads]
Modelling linear algebra kernels as polyhedral volume operations
Reference
Karl F. A. Friebel, Asif Ali Khan, Lorenzo Chelini, Jeronimo Castrillon, "Modelling linear algebra kernels as polyhedral volume operations", In Proceeding: 13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2023.
Bibtex
@InProceedings{friebel_impact23,
author = {Karl F. A. Friebel and Asif Ali Khan and Lorenzo Chelini and Jeronimo Castrillon},
booktitle = {13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Modelling linear algebra kernels as polyhedral volume operations},
location = {Toulouse, France},
url = {https://impact-workshop.org/papers/paper10.pdf},
month = jan,
year = {2023},
}Downloads
2301_Friebel_IMPACT [PDF]
Permalink
- Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs (Artifact)", In Dagstuhl Artifacts Series (Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 9, no. 2, pp. 16:1–16:3, Dagstuhl, Germany, 2023. [doi] [Bibtex & Downloads]
ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs (Artifact)
Reference
Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs (Artifact)", In Dagstuhl Artifacts Series (Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 9, no. 2, pp. 16:1–16:3, Dagstuhl, Germany, 2023. [doi]
Bibtex
@Article{suchert_et_al:DARTS.9.2.16,
author = {Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian},
title = ,
pages = {16:1--16:3},
journal = {Dagstuhl Artifacts Series},
ISSN = {2509-8195},
year = {2023},
volume = {9},
number = {2},
editor = {Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2023/18256},
URN = {urn:nbn:de:0030-drops-182562},
doi = {10.4230/DARTS.9.2.16},
}Downloads
No Downloads available for this publication
Permalink
2022
- Fazal Hameed, Jeronimo Castrillon, "BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), vol. 41, pp. 5288–5298, Dec 2022. [doi] [Bibtex & Downloads]
BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture
Reference
Fazal Hameed, Jeronimo Castrillon, "BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), vol. 41, pp. 5288–5298, Dec 2022. [doi]
Abstract
Racetrack memory (RTM) is a promising nonvolatile memory that provides multi-bit storage cells achieving a higher area and leakage energy efficiency compared to contemporary volatile and non-volatile memories. These features make RTM a potential candidate to be used as a Last-Level-Cache (LLC). One drawback of the multi-bit RTM cell is the serialized access to the stored data, resulting in a shift penalty to access a particular bit within the cell. This overhead is particularly critical for LLC tags, for which prior RTM designs place tags either in SRAM or in single-bit RTM cells. While this avoids shifting, these designs require large number of leaky cells incurring high energy consumption. To address this problem, this paper proposes an energy efficient RTM design called BlendCache that efficiently stores the tags in the leakage optimized multi-bit RTM cells. To reduce the RTM shift penalty of these cells, BlendCache exploits the spatial locality of programs by maximizing accesses to nearby locations in RTM. Employing 32-bit RTM cells for a single-core, BlendCache reduces the energy consumption by 20.8% and area by 15.2% compared to the state-of-the-art while its impact on performance is negligible. For a 4-core system, the energy improvement translates to 35.9% with 3% performance degradation.
Bibtex
@Article{hameed_tcad22,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {{BlendCache}: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture},
doi = {10.1109/TCAD.2022.3161198},
issn = {0278-0070},
issue = {12},
pages = {5288--5298},
url = {https://ieeexplore.ieee.org/document/9739802},
volume = {41},
abstract = {Racetrack memory (RTM) is a promising nonvolatile memory that provides multi-bit storage cells achieving a higher area and leakage energy efficiency compared to contemporary volatile and non-volatile memories. These features make RTM a potential candidate to be used as a Last-Level-Cache (LLC). One drawback of the multi-bit RTM cell is the serialized access to the stored data, resulting in a shift penalty to access a particular bit within the cell. This overhead is particularly critical for LLC tags, for which prior RTM designs place tags either in SRAM or in single-bit RTM cells. While this avoids shifting, these designs require large number of leaky cells incurring high energy consumption. To address this problem, this paper proposes an energy efficient RTM design called BlendCache that efficiently stores the tags in the leakage optimized multi-bit RTM cells. To reduce the RTM shift penalty of these cells, BlendCache exploits the spatial locality of programs by maximizing accesses to nearby locations in RTM. Employing 32-bit RTM cells for a single-core, BlendCache reduces the energy consumption by 20.8\% and area by 15.2\% compared to the state-of-the-art while its impact on performance is negligible. For a 4-core system, the energy improvement translates to 35.9\% with 3\% performance degradation.},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)},
month = dec,
project = {tracesymm,cfaed},
year = {2022},
}Downloads
2204_Hameed_TCAD [PDF]
Related Paths
Permalink
- Julian Robledo, Jeronimo Castrillon, "Parameterizable mobile workloads for adaptable base station optimizations", Proceedings of the IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-22), pp. 381-386, Dec 2022. [doi] [Bibtex & Downloads]
Parameterizable mobile workloads for adaptable base station optimizations
Reference
Julian Robledo, Jeronimo Castrillon, "Parameterizable mobile workloads for adaptable base station optimizations", Proceedings of the IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-22), pp. 381-386, Dec 2022. [doi]
Abstract
Recent works on 5G baseband processing systems address the optimization of applications with different requirements of quality of service (QoS). The volume and heterogeneity of applications that have to be processed on a base station are growing and 5G introduces new use cases that push system designers towards more flexible and adaptable approaches. To investigate future network challenges of mobile communications, a good methodology for the generation of realistic workloads, that allows target optimizations of different traffic scenarios, is required. In this paper, we study the variation of real traffic data on multiple base stations and identify the main sources for the high variation of the 5G workloads. We propose a methodology for parameterizable workload generation for users with different QoS requirements that enables optimization techniques in baseband processing systems. We demonstrate the feasibility of our approach based on a virtual base station using a heterogeneous hardware model and various state-of-the-art mapping policies.
Bibtex
@InProceedings{robledo_mcsoc22,
author = {Julian Robledo and Jeronimo Castrillon},
booktitle = {Proceedings of the IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-22)},
title = {Parameterizable mobile workloads for adaptable base station optimizations},
doi = {10.1109/MCSoC57363.2022.00067},
url = {https://ieeexplore.ieee.org/document/10008473},
location = {Penang, Malaysia},
pages = {381--386},
abstract = {Recent works on 5G baseband processing systems address the optimization of applications with different requirements of quality of service (QoS). The volume and heterogeneity of applications that have to be processed on a base station are growing and 5G introduces new use cases that push system designers towards more flexible and adaptable approaches. To investigate future network challenges of mobile communications, a good methodology for the generation of realistic workloads, that allows target optimizations of different traffic scenarios, is required. In this paper, we study the variation of real traffic data on multiple base stations and identify the main sources for the high variation of the 5G workloads. We propose a methodology for parameterizable workload generation for users with different QoS requirements that enables optimization techniques in baseband processing systems. We demonstrate the feasibility of our approach based on a virtual base station using a heterogeneous hardware model and various state-of-the-art mapping policies.},
month = dec,
year = {2022},
}Downloads
2212_Robledo_MCSOC [PDF]
Permalink
- Nesrine Khouzami, Jeronimo Castrillon, "Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations", Supercomputing Conference (SC 2022) - Women in HPC Workshop (WHPC), Nov 2022. [Bibtex & Downloads]
Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations
Reference
Nesrine Khouzami, Jeronimo Castrillon, "Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations", Supercomputing Conference (SC 2022) - Women in HPC Workshop (WHPC), Nov 2022.
Abstract
We present OpenPME (Open Particle-Mesh Environment), a Problem Solving Environment (PSE) which provides a Domain Specific Language (DSL) built atop a domain model general enough to write numerical simulations in scientific computing using particle-mesh abstractions. This helps to close the productivity gap in HPC applications and effectively lowers the programming barrier to enable the smooth implementation of scalable simulations. We also introduce a model-based autotuning approach of discretization methods for OpenPME compiler. We evaluate the autotuner in two diffusion test cases and the results show that we consistently find configurations that outperform those found by state-of-the-art general-purpose autotuners.
Bibtex
@InProceedings{khouzami_sc_whpc22,
author = {Nesrine Khouzami and Jeronimo Castrillon},
title = {Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations},
location = {Dallas, Texas},
publisher = {Supercomputing Conference (SC 2022) - Women in HPC Workshop (WHPC)},
abstract = {We present OpenPME (Open Particle-Mesh Environment), a Problem Solving Environment (PSE) which provides a Domain Specific Language (DSL) built atop a domain model general enough to write numerical simulations in scientific computing using particle-mesh abstractions. This helps to close the productivity gap in HPC applications and effectively lowers the programming barrier to enable the smooth implementation of scalable simulations. We also introduce a model-based autotuning approach of discretization methods for OpenPME compiler. We evaluate the autotuner in two diffusion test cases and the results show that we consistently find configurations that outperform those found by state-of-the-art general-purpose autotuners.},
month = nov,
numpages = {3},
year = {2022},
}Downloads
2211_Khouzami_WHPC-SC [PDF]
Permalink
- Felix Suchert, Jeronimo Castrillon, "STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks", In Proceeding: Proceeeding of the BenchCouncil Transactions on Benchmarks, Standards and Evaluations (Bench22) (Gainaru, Ana and Zhang, Ce and Luo, Chunjie), Springer International Publishing, pp. 160–175, Cham, Nov 2022. [doi] [Bibtex & Downloads]
STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks
Reference
Felix Suchert, Jeronimo Castrillon, "STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks", In Proceeding: Proceeeding of the BenchCouncil Transactions on Benchmarks, Standards and Evaluations (Bench22) (Gainaru, Ana and Zhang, Ce and Luo, Chunjie), Springer International Publishing, pp. 160–175, Cham, Nov 2022. [doi]
Abstract
Software Transactional Memory has been used as a synchronization mechanism that is easier to use and compose than locking ones. The mechanisms continued relevance in research and application design motivates considerations regarding safer implementations than existing C libraries. In this paper, we study the impact of the Rust programming language on STM performance and code quality. To facilitate the comparison, we manually translated the STAMP benchmark suite to Rust and also generated a version using a state-of-the-art C-to-Rust transpiler. We find that, while idiomatic implementations using safe Rust are generally slower than both C and transpiled code, they guarantee memory safety and improve code quality.
Bibtex
@InProceedings{suchert_bench22,
author = {Felix Suchert and Jeronimo Castrillon},
booktitle = {Proceeeding of the BenchCouncil Transactions on Benchmarks, Standards and Evaluations (Bench22)},
title = {STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks},
doi = {10.1007/978-3-031-31180-2_10},
editor = {Gainaru, Ana and Zhang, Ce and Luo, Chunjie},
isbn = {978-3-031-31180-2},
pages = {160--175},
publisher = {Springer International Publishing},
url = {https://link.springer.com/chapter/10.1007/978-3-031-31180-2_10},
abstract = {Software Transactional Memory has been used as a synchronization mechanism that is easier to use and compose than locking ones. The mechanisms continued relevance in research and application design motivates considerations regarding safer implementations than existing C libraries. In this paper, we study the impact of the Rust programming language on STM performance and code quality. To facilitate the comparison, we manually translated the STAMP benchmark suite to Rust and also generated a version using a state-of-the-art C-to-Rust transpiler. We find that, while idiomatic implementations using safe Rust are generally slower than both C and transpiled code, they guarantee memory safety and improve code quality.},
address = {Cham},
month = nov,
year = {2022},
}Downloads
2211_Suchert_BENCH [PDF]
Permalink
- Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "Shisha: Online scheduling of CNN pipelines on heterogeneous architectures", Proceedings of the 14th International Conference on Parallel Processing and Applied Mathematics, pp. 249–262, Nov 2022. [Bibtex & Downloads]
Shisha: Online scheduling of CNN pipelines on heterogeneous architectures
Reference
Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "Shisha: Online scheduling of CNN pipelines on heterogeneous architectures", Proceedings of the 14th International Conference on Parallel Processing and Applied Mathematics, pp. 249–262, Nov 2022.
Abstract
Many modern multicore processors integrate asymmetric core clusters. With the trend towards Multi-Chip-Modules (MCMs) and interposer-based packaging technologies, platforms will feature heterogeneity at the level of cores, memory subsystem and the interconnect. Due to their potential high memory throughput and energy efficient core modules, these platforms are prominent targets for emerging machine learning applications, such as Convolutional Neural Networks (CNNs). To exploit and adapt to the diversity of modern heterogeneous chips, CNNs need to be quickly optimized in terms of scheduling and workload distribution among computing resources. To address this we propose Shisha, an online approach to generate and schedule parallel CNN pipelines on heterogeneous MCM-based architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by 35x in Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.
Bibtex
@InProceedings{soomro_ppam22,
author = {Pirah Noor Soomro and Mustafa Abduljabbar and Jeronimo Castrillon and Miquel Pericás},
booktitle = {Proceedings of the 14th International Conference on Parallel Processing and Applied Mathematics},
date = {2022-11},
title = {Shisha: Online scheduling of CNN pipelines on heterogeneous architectures},
location = {Gdansk, Poland},
abstract = {Many modern multicore processors integrate asymmetric core clusters. With the trend towards Multi-Chip-Modules (MCMs) and interposer-based packaging technologies, platforms will feature heterogeneity at the level of cores, memory subsystem and the interconnect. Due to their potential high memory throughput and energy efficient core modules, these platforms are prominent targets for emerging machine learning applications, such as Convolutional Neural Networks (CNNs). To exploit and adapt to the diversity of modern heterogeneous chips, CNNs need to be quickly optimized in terms of scheduling and workload distribution among computing resources. To address this we propose Shisha, an online approach to generate and schedule parallel CNN pipelines on heterogeneous MCM-based architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by ~35x in Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.},
month = nov,
year = {2022},
pages = {249--262},
url = {https://www.springerprofessional.de/shisha-online-scheduling-of-cnn-pipelines-on-heterogeneous-archi/25292292},
}Downloads
2211_Soomro_PPAM [PDF]
Permalink
- Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, Jeronimo Castrillon, "DNA Pre-alignment Filter using Processing Near Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, pp. 1–4, Jul 2022. [doi] [Bibtex & Downloads]
DNA Pre-alignment Filter using Processing Near Racetrack Memory
Reference
Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, Jeronimo Castrillon, "DNA Pre-alignment Filter using Processing Near Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, pp. 1–4, Jul 2022. [doi]
Abstract
Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high energy consumption background and refresh energy as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM)–an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68% and 52%, respectively, compared to the state of the art proposed DRAM-based architecture.
Bibtex
@Article{hameed_ieeecal22,
author = {Fazal Hameed and Asif Ali Khan and Sebastien Ollivier and Alex K. Jones and Jeronimo Castrillon},
date = {2022-08},
journal = {IEEE Computer Architecture Letters},
title = {DNA Pre-alignment Filter using Processing Near Racetrack Memory},
abstract = {Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high energy consumption background and refresh energy as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM)--an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68\% and 52\%, respectively, compared to the state of the art proposed DRAM-based architecture.},
month = jul,
numpages = {4},
publisher = {IEEE},
year = {2022},
doi = {10.1109/LCA.2022.3194263},
pages = {1--4},
url = {https://ieeexplore.ieee.org/document/9841612},
}Downloads
No Downloads available for this publication
Permalink
- Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Che, "ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees", In IEEE Transactions on Computers, pp. 1-14, Jul 2022. [doi] [Bibtex & Downloads]
ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees
Reference
Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Che, "ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees", In IEEE Transactions on Computers, pp. 1-14, Jul 2022. [doi]
Bibtex
@Article{hakert_toc22,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Che},
title = {{ROLLED}: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees},
journal = {IEEE Transactions on Computers},
month = jul,
year = {2022},
doi = {10.1109/TC.2022.3197094},
pages = {1--14},
url = {https://ieeexplore.ieee.org/document/9851943},
}Downloads
No Downloads available for this publication
Permalink
- Jeronimo Castrillon, "Domain-specific programming methodologies for domain-specific and emerging computing systems", In 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2022)(keynote), Jun 2022. [Bibtex & Downloads]
Domain-specific programming methodologies for domain-specific and emerging computing systems
Reference
Jeronimo Castrillon, "Domain-specific programming methodologies for domain-specific and emerging computing systems", In 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2022)(keynote), Jun 2022.
Abstract
Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non Von-Neumann computer architectures. The golden age of computer architecture must be thus accompanied by a golden age of research in compilers and programming languages. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. Concretely, we will discuss abstractions for tensor expressions as vehicle to optimize for modern reconfigurable hardware, for emerging memory technologies and for emerging in-memory computing. The talk closes with an outlook on other emerging architectures and challenges for high-level compilation.
Bibtex
@Misc{castrillon_lctes2022,
author = {Castrillon, Jeronimo},
date = {2022-06},
title = {Domain-specific programming methodologies for domain-specific and emerging computing systems},
howpublished = {23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2022)(keynote)},
location = {San Diego, CA, USA},
abstract = {Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non Von-Neumann computer architectures. The golden age of computer architecture must be thus accompanied by a golden age of research in compilers and programming languages. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. Concretely, we will discuss abstractions for tensor expressions as vehicle to optimize for modern reconfigurable hardware, for emerging memory technologies and for emerging in-memory computing. The talk closes with an outlook on other emerging architectures and challenges for high-level compilation.},
month = jun,
year = {2022},
}Downloads
220614_castrillon_LCTES-keynote_compressed [PDF]
Permalink
- Carlos Escuin, Asif Ali Khan, Pablo Ibañez, Teresa Monreal, Victor Viñals, Jeronimo Castrillon, "HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches", In Proceeding: DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 53–58, New York, NY, USA, Jun 2022. [doi] [Bibtex & Downloads]
HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches
Reference
Carlos Escuin, Asif Ali Khan, Pablo Ibañez, Teresa Monreal, Victor Viñals, Jeronimo Castrillon, "HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches", In Proceeding: DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 53–58, New York, NY, USA, Jun 2022. [doi]
Abstract
Recent years have seen a rising trend in the exploration of non- volatile memory (NVM) technologies in the memory subsystem. Particularly in the cache hierarchy, hybrid last-level cache (LLC) solutions are proposed to meet the wide-ranging performance and energy requirements of modern days applications. These emerging hybrid solutions need simulation and detailed exploration to fully understand their capabilities before exploiting them. Existing simulation tools are either too slow or incapable of prototyping such systems and optimizing for NVM devices. To this end, we propose HyCSim, a trace-driven simulation infrastructure that enables rapid comparison of various hybrid LLC configurations for different optimization objectives. Notably, HyCSim makes it possible to quickly estimate the impact of various hybrid LLC insertion and replacement policies, disabling of a cache region at byte or cache frame granularity for different fault maps. In addition, HyCSim allows to evaluate the impact of various compression schemes on the overall performance (hit and miss rate) and the number of writes to the LLC. Our evaluation on ten multi-program workloads from the SPEC 2006 benchmarks suite shows that HyCSim accelerates the simulation time by 24x, compared to the cycle-accurate Gem5 simulator, with high-fidelity.
Bibtex
@InProceedings{escuin_rapido22,
author = {Carlos Escuin and Asif Ali Khan and Pablo Iba{\~n}ez and Teresa Monreal and Victor Vi{\~n}als and Jeronimo Castrillon},
booktitle = {DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
date = {2022-06},
title = {HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches},
doi = {10.1145/3522784.3522801},
isbn = {9781450395663},
location = {Budapest, Hungary},
pages = {53–58},
publisher = {Association for Computing Machinery},
series = {DroneSE and RAPIDO '22},
url = {https://doi.org/10.1145/3522784.3522801},
abstract = {Recent years have seen a rising trend in the exploration of non- volatile memory (NVM) technologies in the memory subsystem. Particularly in the cache hierarchy, hybrid last-level cache (LLC) solutions are proposed to meet the wide-ranging performance and energy requirements of modern days applications. These emerging hybrid solutions need simulation and detailed exploration to fully understand their capabilities before exploiting them. Existing simulation tools are either too slow or incapable of prototyping such systems and optimizing for NVM devices. To this end, we propose HyCSim, a trace-driven simulation infrastructure that enables rapid comparison of various hybrid LLC configurations for different optimization objectives. Notably, HyCSim makes it possible to quickly estimate the impact of various hybrid LLC insertion and replacement policies, disabling of a cache region at byte or cache frame granularity for different fault maps. In addition, HyCSim allows to evaluate the impact of various compression schemes on the overall performance (hit and miss rate) and the number of writes to the LLC. Our evaluation on ten multi-program workloads from the SPEC 2006 benchmarks suite shows that HyCSim accelerates the simulation time by 24x, compared to the cycle-accurate Gem5 simulator, with high-fidelity.},
address = {New York, NY, USA},
month = jun,
numpages = {6},
year = {2022},
}Downloads
No Downloads available for this publication
Permalink
- Lars Schütze, Cornelius Kummer, Jeronimo Castrillon, "Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language", Proceedings of the 14th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'22), Association for Computing Machinery, pp. 27–34, New York, NY, USA, Jun 2022. [doi] [Bibtex & Downloads]
Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language
Reference
Lars Schütze, Cornelius Kummer, Jeronimo Castrillon, "Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language", Proceedings of the 14th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'22), Association for Computing Machinery, pp. 27–34, New York, NY, USA, Jun 2022. [doi]
Abstract
Adaptive programming models are increasingly important as context-dependent software conquers more domains. One such a model is role-oriented programming where behavioral changes are implemented by objects playing and renouncing roles. As with other adaptive models, the overhead introduced by source code adaptations is a major showstopper for role-oriented programs. This is in part because the optimizations of object-oriented virtual machines (VMs) do not provide the same performance gains when applied to role-oriented programs. Recently, dispatch plans have been shown to enable optimizations beyond those in VMs, thereby improving the performance of role programs with low variability. This paper introduces guarded dispatch plans, an extension of dispatch plans with a context-aware guarding mechanism that allows reuse in high-variability scenarios. Fine-grained guards use run-time feedback to partially reuse dispatch plans across call sites when contexts are changing. We present an algorithm to construct and compose guarded dispatch plans and provide a reference implementation of the approach. We show that our approach is able to gracefully degrade into a default dispatch approach when variability increases. The implementation is evaluated with synthetic benchmarks capturing different characteristics. Compared to the state-of-the-art implementation in ObjectTeams we achieved a mean speedup of 3.3 \texttimes in static cases, 3.0 \texttimes at low variability and the same performance in highly dynamic cases.
Bibtex
@InProceedings{schuetze_cop22,
author = {Sch\"{u}tze, Lars and Kummer, Cornelius and Castrillon, Jeronimo},
booktitle = {Proceedings of the 14th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'22)},
title = {Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language},
doi = {10.1145/3570353.3570357},
isbn = {9781450399869},
pages = {27--34},
publisher = {Association for Computing Machinery},
series = {COP '22},
url = {https://doi.org/10.1145/3570353.3570357},
abstract = {Adaptive programming models are increasingly important as context-dependent software conquers more domains. One such a model is role-oriented programming where behavioral changes are implemented by objects playing and renouncing roles. As with other adaptive models, the overhead introduced by source code adaptations is a major showstopper for role-oriented programs. This is in part because the optimizations of object-oriented virtual machines (VMs) do not provide the same performance gains when applied to role-oriented programs. Recently, dispatch plans have been shown to enable optimizations beyond those in VMs, thereby improving the performance of role programs with low variability. This paper introduces guarded dispatch plans, an extension of dispatch plans with a context-aware guarding mechanism that allows reuse in high-variability scenarios. Fine-grained guards use run-time feedback to partially reuse dispatch plans across call sites when contexts are changing. We present an algorithm to construct and compose guarded dispatch plans and provide a reference implementation of the approach. We show that our approach is able to gracefully degrade into a default dispatch approach when variability increases. The implementation is evaluated with synthetic benchmarks capturing different characteristics. Compared to the state-of-the-art implementation in ObjectTeams we achieved a mean speedup of 3.3 \texttimes{} in static cases, 3.0 \texttimes{} at low variability and the same performance in highly dynamic cases.},
address = {New York, NY, USA},
month = jun,
numpages = {8},
year = {2022},
}Downloads
2206_schuetze_COP [PDF]
Permalink
- Jeronimo Castrillon, "Language and compiler research for heterogeneous emerging computing systems", In SPCL_Bcast(COMM_WORLD) seminar series, SPCL ETH Zurich (invited talk), May 2022. [Bibtex & Downloads]
Language and compiler research for heterogeneous emerging computing systems
Reference
Jeronimo Castrillon, "Language and compiler research for heterogeneous emerging computing systems", In SPCL_Bcast(COMM_WORLD) seminar series, SPCL ETH Zurich (invited talk), May 2022.
Abstract
Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging, non Von-Neumann computer architectures. The so-called golden age of computer architecture thus must be accompanied by a, hopefully, golden age of research in compilers and programming languages. This talk discusses research along two fronts, namely, (1) on domain specific languages (DSLs) to hide complexity from non-expert programmers while passing richer information to compilers, and (2) on understanding the fundamental changes in emerging computing paradigms and their consequences for compilers. Concretely, we will talk about DSLs for physics simulations, compute-in-memory with emerging technologies, and current efforts in unifying intermediate representations with the MLIR compiler framework.
Bibtex
@Misc{castrillon_spcl2022,
author = {Castrillon, Jeronimo},
year = {2022},
month = may,
title = {Language and compiler research for heterogeneous emerging computing systems},
howpublished = {SPCL\_Bcast(COMM\_WORLD) seminar series, SPCL ETH Zurich (invited talk)},
location = {Virtual},
url = {https://www.youtube.com/watch?v=-NoRpUBlNrU},
abstract = {Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging, non Von-Neumann computer architectures. The so-called golden age of computer architecture thus must be accompanied by a, hopefully, golden age of research in compilers and programming languages. This talk discusses research along two fronts, namely, (1) on domain specific languages (DSLs) to hide complexity from non-expert programmers while passing richer information to compilers, and (2) on understanding the fundamental changes in emerging computing paradigms and their consequences for compilers. Concretely, we will talk about DSLs for physics simulations, compute-in-memory with emerging technologies, and current efforts in unifying intermediate representations with the MLIR compiler framework.},
}Downloads
No Downloads available for this publication
Permalink
- Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Burkhard Ringlein, Michele Fiorito, Fabrizio Ferrandi, Donatella Sciuto, Christian Pilato, Stephanie Soldavini, "EVEREST: Intermediate report of the compilation framework", Technical report, EVEREST consortium, Apr 2022. [Bibtex & Downloads]
EVEREST: Intermediate report of the compilation framework
Reference
Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Burkhard Ringlein, Michele Fiorito, Fabrizio Ferrandi, Donatella Sciuto, Christian Pilato, Stephanie Soldavini, "EVEREST: Intermediate report of the compilation framework", Technical report, EVEREST consortium, Apr 2022.
Bibtex
@Report{castrillon_everestD4.2_2022,
author = {Jeronimo Castrillon and Felix Wittwer and Karl Friebel and Burkhard Ringlein and Michele Fiorito and Fabrizio Ferrandi and Donatella Sciuto and Christian Pilato and Stephanie Soldavini},
institution = {EVEREST consortium},
title = {{EVEREST}: Intermediate report of the compilation framework},
type = {techreport},
url = {https://drive.switch.ch/index.php/s/tqCXugYLqGQNChd},
month = apr,
year = {2022},
}Downloads
2204_Castrillon-Everest-D4 [2]
Permalink
- Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, Alex K. Jones, "Brain-inspired Cognition in Next Generation Racetrack Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 21, no. 6, pp. 79:1–79:28, New York, NY, USA, Mar 2022. [doi] [Bibtex & Downloads]
Brain-inspired Cognition in Next Generation Racetrack Memories
Reference
Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, Alex K. Jones, "Brain-inspired Cognition in Next Generation Racetrack Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 21, no. 6, pp. 79:1–79:28, New York, NY, USA, Mar 2022. [doi]
Abstract
Hyperdimensional computing (HDC) is an emerging computational framework inspired by the brain that operates on vectors with thousands of dimensions to emulate cognition. Unlike conventional computational frameworks that operate on numbers, HDC, like the brain, uses high dimensional random vectors and is capable of one-shot learning. HDC is based on a well-defined set of arithmetic operations and is highly error-resilient. The core operations of HDC manipulate HD vectors in bulk bit-wise fashion, offering many opportunities to leverage parallelism. Unfortunately, on conventional von Neumann architectures, the continuous movement of HD vectors among the processor and the memory can make the cognition task prohibitively slow and energy-intensive. Hardware accelerators only marginally improve related metrics. In contrast, even partial implementations of an HDC framework inside memory can provide considerable performance/energy gains as demonstrated in prior work using memristors. This paper presents an architecture based on racetrack memory (RTM) to conduct and accelerate the entire HDC framework within memory. The proposed solution requires minimal additional CMOS circuitry by leveraging a read operation across multiple domains in RTMs called transverse read (TR) to realize exclusive-or (XOR) and addition operations. To minimize the CMOS circuitry overhead, an RTM nanowire-based counting mechanism is proposed. Using language recognition as the example workload, the proposed RTM HDC system reduces the energy consumption by 8.6x compared to the state-of-the-art in-memory implementation. Compared to dedicated hardware design realized with an FPGA, RTM-based HDC processing demonstrates 7.8x and 5.3x improvements in the overall runtime and energy consumption, respectively.
Bibtex
@Article{khan_tecs22,
author = {Asif Ali Khan and Sebastien Ollivier and Stephen Longofono and Gerald Hempel and Jeronimo Castrillon and Alex K. Jones},
title = {Brain-inspired Cognition in Next Generation Racetrack Memories},
abstract = {Hyperdimensional computing (HDC) is an emerging computational framework inspired by the brain that operates on vectors with thousands of dimensions to emulate cognition. Unlike conventional computational frameworks that operate on numbers, HDC, like the brain, uses high dimensional random vectors and is capable of one-shot learning. HDC is based on a well-defined set of arithmetic operations and is highly error-resilient. The core operations of HDC manipulate HD vectors in bulk bit-wise fashion, offering many opportunities to leverage parallelism. Unfortunately, on conventional von Neumann architectures, the continuous movement of HD vectors among the processor and the memory can make the cognition task prohibitively slow and energy-intensive. Hardware accelerators only marginally improve related metrics. In contrast, even partial implementations of an HDC framework inside memory can provide considerable performance/energy gains as demonstrated in prior work using memristors. This paper presents an architecture based on racetrack memory (RTM) to conduct and accelerate the entire HDC framework within memory. The proposed solution requires minimal additional CMOS circuitry by leveraging a read operation across multiple domains in RTMs called transverse read (TR) to realize exclusive-or (XOR) and addition operations. To minimize the CMOS circuitry overhead, an RTM nanowire-based counting mechanism is proposed. Using language recognition as the example workload, the proposed RTM HDC system reduces the energy consumption by 8.6x compared to the state-of-the-art in-memory implementation. Compared to dedicated hardware design realized with an FPGA, RTM-based HDC processing demonstrates 7.8x and 5.3x improvements in the overall runtime and energy consumption, respectively.},
address = {New York, NY, USA},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
month = mar,
numpages = {28},
publisher = {Association for Computing Machinery},
year = {2022},
doi = {10.1145/3524071},
issn = {1539-9087},
url = {https://doi.org/10.1145/3524071},
volume = {21},
number = {6},
articleno = {79},
pages = {79:1--79:28},
}Downloads
2203_Khan_TECS [PDF]
Permalink
2021
- Nesrine Khouzami, Friedrich Michel, Pietro Incardona, Jeronimo Castrillon, Ivo F. Sbalzarini, "Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations", In Journal of Computational Science, vol. 57, pp. 1–11, Dec 2021. [doi] [Bibtex & Downloads]
Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations
Reference
Nesrine Khouzami, Friedrich Michel, Pietro Incardona, Jeronimo Castrillon, Ivo F. Sbalzarini, "Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations", In Journal of Computational Science, vol. 57, pp. 1–11, Dec 2021. [doi]
Abstract
We present an autotuning approach for compile-time optimization of numerical discretization methods in simulations of partial differential equations. Our approach is based on data-driven regression of performance models for numerical methods. We use these models at compile time to automatically determine the parameters (e.g., resolution, time step size, etc.) of numerical simulations of continuum spatio-temporal models in order to optimize the tradeoff between simulation accuracy and runtime. The resulting autotuner is developed for the compiler of a Domain-Specific Language (DSL) for numerical simulations. The abstractions in the DSL enable the compiler to automatically determine the performance models and know which discretization parameters to tune. We demonstrate that this high-level approach can explore a large space of possible simulations, with simulation runtimes spanning multiple orders of magnitude. We evaluate our approach in two test cases: the linear diffusion equation and the nonlinear Gray-Scott reaction–diffusion equation. The results show that our model-based autotuner consistently finds configurations that outperform those found by state-of-the-art general-purpose autotuners. Specifically, our autotuner yields simulations that are on average 4.2x faster than those found by the best generic exploration algorithms, while using 16x less tuning time. Compared to manual tuning by a group of researchers with varying levels of expertise, the autotuner was slower than the best users by not more than a factor of 2, whereas it was able to significantly outperform half of them.
Bibtex
@Article{khouzami_jocs21,
author = {Nesrine Khouzami and Friedrich Michel and Pietro Incardona and Jeronimo Castrillon and Ivo F. Sbalzarini},
date = {2021-12},
title = {Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations},
doi = {10.1016/j.jocs.2021.101489},
issn = {1877-7503},
pages = {1--11},
url = {https://www.sciencedirect.com/science/article/pii/S1877750321001563},
volume = {57},
abstract = {We present an autotuning approach for compile-time optimization of numerical discretization methods in simulations of partial differential equations. Our approach is based on data-driven regression of performance models for numerical methods. We use these models at compile time to automatically determine the parameters (e.g., resolution, time step size, etc.) of numerical simulations of continuum spatio-temporal models in order to optimize the tradeoff between simulation accuracy and runtime. The resulting autotuner is developed for the compiler of a Domain-Specific Language (DSL) for numerical simulations. The abstractions in the DSL enable the compiler to automatically determine the performance models and know which discretization parameters to tune. We demonstrate that this high-level approach can explore a large space of possible simulations, with simulation runtimes spanning multiple orders of magnitude. We evaluate our approach in two test cases: the linear diffusion equation and the nonlinear Gray-Scott reaction–diffusion equation. The results show that our model-based autotuner consistently finds configurations that outperform those found by state-of-the-art general-purpose autotuners. Specifically, our autotuner yields simulations that are on average 4.2x faster than those found by the best generic exploration algorithms, while using 16x less tuning time. Compared to manual tuning by a group of researchers with varying levels of expertise, the autotuner was slower than the best users by not more than a factor of 2, whereas it was able to significantly outperform half of them.},
journal = {Journal of Computational Science},
month = dec,
numpages = {15},
project = {openpme},
year = {2021},
}Downloads
2111_Khouzami_JOCS [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Models of computation for energy-efficient time-aware distributed embedded systems", In IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2021) (keynote), Dec 2021. [Bibtex & Downloads]
Models of computation for energy-efficient time-aware distributed embedded systems
Reference
Jeronimo Castrillon, "Models of computation for energy-efficient time-aware distributed embedded systems", In IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2021) (keynote), Dec 2021.
Abstract
Programming heterogeneous and distributed manycores under timing constraints in cyber-physical systems is an extremely hard task. Managing reactive behaviour to outside stimuli, adapting to variable workloads and handling parallelism while ensuring correct and time-predictable execution are examples of key challenges in these kinds of systems. This talk discusses the role of formal models of computation to help architect programming methodologies, making it easier to manage the complexity and provide guarantees than with ad-hoc programming models. We will discuss dataflow-based programming for joint optimization of multiple adaptable applications while respecting real-time constraints. We will then introduce the reactor model which adds time semantics to dataflow to support time-determinism when needed and to account for reactive behavior. The benefits and overheads of this model-based approach to programming distributed embedded systems will be analysed with use cases from the automotive and 5G-communication domains.
Bibtex
@Misc{castrillon_mcsoc2021,
author = {Castrillon, Jeronimo},
title = {Models of computation for energy-efficient time-aware distributed embedded systems},
howpublished = {IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2021) (keynote)},
location = {Singapore},
url = {https://mcsoc-forum.org/m2021/wp-content/uploads/2021/12/211222_castrillon_MCSOC_compressed.pdf},
abstract = {Programming heterogeneous and distributed manycores under timing constraints in cyber-physical systems is an extremely hard task. Managing reactive behaviour to outside stimuli, adapting to variable workloads and handling parallelism while ensuring correct and time-predictable execution are examples of key challenges in these kinds of systems. This talk discusses the role of formal models of computation to help architect programming methodologies, making it easier to manage the complexity and provide guarantees than with ad-hoc programming models. We will discuss dataflow-based programming for joint optimization of multiple adaptable applications while respecting real-time constraints. We will then introduce the reactor model which adds time semantics to dataflow to support time-determinism when needed and to account for reactive behavior. The benefits and overheads of this model-based approach to programming distributed embedded systems will be analysed with use cases from the automotive and 5G-communication domains.},
month = dec,
year = {2021},
}Downloads
No Downloads available for this publication
Permalink
- Joonas Iisakki Multanen, Kari Hepola, Asif Ali Khan, Jeronimo Castrillon, Pekka Jääskeläinen, "Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory", In IEEE Transactions on Computers, pp. 1-1, Oct 2021. [doi] [Bibtex & Downloads]
Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory
Reference
Joonas Iisakki Multanen, Kari Hepola, Asif Ali Khan, Jeronimo Castrillon, Pekka Jääskeläinen, "Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory", In IEEE Transactions on Computers, pp. 1-1, Oct 2021. [doi]
Bibtex
@Article{multanen_toc21,
author = {Joonas Iisakki Multanen and Kari Hepola and Asif Ali Khan and Jeronimo Castrillon and Pekka J{\"a}{\"a}skel{\"a}inen},
title = {Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory},
doi = {10.1109/TC.2021.3117439},
number = {9},
pages = {2010--2021},
url = {https://ieeexplore.ieee.org/document/9557799},
volume = {71},
abstract = {As performance and energy-efficiency improvements from technology scaling are slowing down, new technologies are being researched in hopes of disrupting results. Domain wall memory (DWM) is an emerging non-volatile technology that promises extreme data density, fast access times and low power consumption. However, DWM access time depends on the memory location distance from access ports, requiring expensive shifting. This causes overheads on performance and energy consumption. In this article, we implement our previously proposed shift-reducing instruction memory placement (SHRIMP) on a RISC-V core in RTL, provide the first thorough evaluation of the control logic required for DWM and SHRIMP and evaluate the effects on system energy and energy-efficiency. SHRIMP reduces the number of shifts by 36\% on average compared to a linear placement in CHStone and Coremark benchmark suites when evaluated on the RISC-V processor system. The reduced shift amount leads to an average reduction of 14\% in cycle counts compared to the linear placement. When compared to an SRAM-based system, although increasing memory usage by 26\%, DWM with SHRIMP allows a 73\% reduction in memory energy and 42\% relative energy delay product. We estimate overall energy reductions of 14\%, 15\% and 19\% in three example embedded systems.},
journal = {IEEE Transactions on Computers},
month = oct,
year = {2021},
}Downloads
2110_Multanen_TOC [PDF]
Related Paths
Permalink
- Karl F. A. Friebel, Stephanie Soldavini, Gerald Hempel, Christian Pilato, Jeronimo Castrillon, "From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics", Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER) — FPGA for HPC Workshop, pp. 759–766, Sep 2021. [doi] [Bibtex & Downloads]
From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics
Reference
Karl F. A. Friebel, Stephanie Soldavini, Gerald Hempel, Christian Pilato, Jeronimo Castrillon, "From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics", Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER) — FPGA for HPC Workshop, pp. 759–766, Sep 2021. [doi]
Bibtex
@InProceedings{friebel_fpga4hpc21,
author = {Karl F. A. Friebel and Stephanie Soldavini and Gerald Hempel and Christian Pilato and Jeronimo Castrillon},
booktitle = {Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER) --- FPGA for HPC Workshop},
title = {From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics},
doi = {10.1109/Cluster48925.2021.00112},
location = {Portland (virtual), OR, USA},
pages = {759--766},
url = {https://ieeexplore.ieee.org/document/9556064},
month = sep,
numpages = {8},
year = {2021},
}Downloads
2109_Friebel_fpga4hpc [PDF]
Permalink
- Robert Khasanov, Julian Robledo, Christian Menard, Andr'es Goens, Jeronimo Castrillon, "Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks", In ACM Transactions on Embedded Computing Systems (TECS). Special issue of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), Association for Computing Machinery, vol. 20, no. 5s, New York, NY, USA, Sep 2021. [doi] [Bibtex & Downloads]
Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks
Reference
Robert Khasanov, Julian Robledo, Christian Menard, Andr'es Goens, Jeronimo Castrillon, "Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks", In ACM Transactions on Embedded Computing Systems (TECS). Special issue of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), Association for Computing Machinery, vol. 20, no. 5s, New York, NY, USA, Sep 2021. [doi]
Abstract
Advancing telecommunication standards continuously push for larger bandwidths, lower latencies, and faster data rates. The receiver baseband unit not only has to deal with a huge number of users expecting connectivity but also with a high workload heterogeneity. As a consequence of the required flexibility, baseband processing has seen a trend towards software implementations in cloud Radio Access Networks (cRANs). The flexibility gained from software implementation comes at the price of impoverished energy efficiency. This paper addresses the trade-off between flexibility and efficiency by proposing a domain-specific hybrid mapping algorithm. Hybrid mapping is an established approach from the model-based design of embedded systems that allows us to retain flexibility while targeting heterogeneous hardware. Depending on the current workload, the runtime system selects the most energy-efficient mapping configuration without violating timing constraints. We leverage the structure of baseband processing, and refine the scheduling methodology, to enable efficient mapping of 100s of tasks at the millisecond granularity, improving upon state-of-the-art hybrid approaches. We validate our approach on an Odroid XU4 and virtual platforms with application-specific accelerators on an open-source prototype. On different LTE workloads, our hybrid approach shows significant improvements both at design time and at runtime. At design-time, mappings of similar quality to those obtained by state-of-the-art methods are generated around four orders of magnitude faster. At runtime, multi-application schedules are computed 37.7% faster than the state-of-the-art without compromising on the quality.
Bibtex
@Article{khasanov_cases21,
author = {Robert Khasanov and Julian Robledo and Christian Menard and Andrés Goens and Jeronimo Castrillon},
title = {Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks},
doi = {10.1145/3476991},
issn = {1539-9087},
number = {5s},
url = {https://doi.org/10.1145/3476991},
volume = {20},
abstract = {Advancing telecommunication standards continuously push for larger bandwidths, lower latencies, and faster data rates. The receiver baseband unit not only has to deal with a huge number of users expecting connectivity but also with a high workload heterogeneity. As a consequence of the required flexibility, baseband processing has seen a trend towards software implementations in cloud Radio Access Networks (cRANs). The flexibility gained from software implementation comes at the price of impoverished energy efficiency. This paper addresses the trade-off between flexibility and efficiency by proposing a domain-specific hybrid mapping algorithm. Hybrid mapping is an established approach from the model-based design of embedded systems that allows us to retain flexibility while targeting heterogeneous hardware. Depending on the current workload, the runtime system selects the most energy-efficient mapping configuration without violating timing constraints. We leverage the structure of baseband processing, and refine the scheduling methodology, to enable efficient mapping of 100s of tasks at the millisecond granularity, improving upon state-of-the-art hybrid approaches. We validate our approach on an Odroid XU4 and virtual platforms with application-specific accelerators on an open-source prototype. On different LTE workloads, our hybrid approach shows significant improvements both at design time and at runtime. At design-time, mappings of similar quality to those obtained by state-of-the-art methods are generated around four orders of magnitude faster. At runtime, multi-application schedules are computed 37.7% faster than the state-of-the-art without compromising on the quality.},
address = {New York, NY, USA},
articleno = {60},
issue_date = {October 2021},
journal = {ACM Transactions on Embedded Computing Systems (TECS). Special issue of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
location = {Virtual conference},
month = sep,
numpages = {26},
publisher = {Association for Computing Machinery},
year = {2021},
}Downloads
2110_Khasanov_CASES [PDF]
Related Paths
Permalink
- Alexander Brauckmann, Andr'es Goens, Jeronimo Castrillon, "PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning", Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 17-29, Sep 2021. [doi] [Bibtex & Downloads]
PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning
Reference
Alexander Brauckmann, Andr'es Goens, Jeronimo Castrillon, "PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning", Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 17-29, Sep 2021. [doi]
Abstract
The polyhedral model allows a structured way of defining semantics-preserving transformations to improve the performance of a large class of loops. Finding profitable points in this space is a hard problem which is usually approached by heuristics that generalize from domain-expert knowledge. Existing search space formulations in state-of-the-art heuristics depend on the shape of particular loops, making it hard to leverage generic and more powerful optimization techniques from the machine learning domain. In this paper, we propose a shape-agnostic formulation for the space of legal transformations in the polyhedral model as a Markov Decision Process (MDP). Instead of using transformations, the formulation is based on an abstract space of possible schedules. In this formulation, states model partial schedules, which are constructed by actions that are reusable across different loops. With a simple heuristic to traverse the space, we demonstrate that our formulation is powerful enough to match and outperform state-of-the-art heuristics. On the Polybench benchmark suite, we found the search space to contain transformations that lead to a speedup of 3.39x over LLVM O3, which is 1.34x better than the best transformations found in the search space of isl, and 1.83x better than the speedup achieved by the default heuristics of isl. Our generic MDP formulation enables future work to use reinforcement learning to learn optimization heuristics over a wide range of loops. This also contributes to the emerging field of machine learning in compilers, as it exposes a novel problem formulation that can push the limits of existing methods.
Bibtex
@InProceedings{brauckmann_pact21,
author = {Brauckmann, Alexander and Goens, Andrés and Castrillon, Jeronimo},
booktitle = {Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
title = {PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning},
month = sep,
doi = {10.1109/PACT52795.2021.00009},
pages = {17-29},
url = {https://ieeexplore.ieee.org/document/9563041},
year = {2021},
abstract = {The polyhedral model allows a structured way of defining semantics-preserving transformations to improve the performance of a large class of loops. Finding profitable points in this space is a hard problem which is usually approached by heuristics that generalize from domain-expert knowledge. Existing search space formulations in state-of-the-art heuristics depend on the shape of particular loops, making it hard to leverage generic and more powerful optimization techniques from the machine learning domain. In this paper, we propose a shape-agnostic formulation for the space of legal transformations in the polyhedral model as a Markov Decision Process (MDP). Instead of using transformations, the formulation is based on an abstract space of possible schedules. In this formulation, states model partial schedules, which are constructed by actions that are reusable across different loops. With a simple heuristic to traverse the space, we demonstrate that our formulation is powerful enough to match and outperform state-of-the-art heuristics. On the Polybench benchmark suite, we found the search space to contain transformations that lead to a speedup of 3.39x over LLVM O3, which is 1.34x better than the best transformations found in the search space of isl, and 1.83x better than the speedup achieved by the default heuristics of isl. Our generic MDP formulation enables future work to use reinforcement learning to learn optimization heuristics over a wide range of loops. This also contributes to the emerging field of machine learning in compilers, as it exposes a novel problem formulation that can push the limits of existing methods.},
}Downloads
2109_Brauckmann_PACT [PDF]
Permalink
- Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1674-1686, Aug 2021. [doi] [Bibtex & Downloads]
OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory
Reference
Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1674-1686, Aug 2021. [doi]
Bibtex
@Article{khan_tcad21,
author = {Adam Siemieniuk and Lorenzo Chelini and Asif Ali Khan and Jeronimo Castrillon and Andi Drebes and Henk Corporaal and Tobias Grosser and Martin Kong},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)},
title = {{OCC}: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory},
month = aug,
volume={41},
number={6},
pages={1674--1686},
numpages = {12 pp},
doi = {10.1109/TCAD.2021.3101464},
issn = {1937-4151},
url = {https://ieeexplore.ieee.org/document/9502921},
publisher = {IEEE Press},
year = {2021},
abstract = {Memristive devices promise an alternative approach toward non-Von Neumann architectures, where specific computational tasks are performed within the memory devices. In the machine learning (ML) domain, crossbar arrays of resistive devices have shown great promise for ML inference, as they allow for hardware acceleration of matrix multiplications. But, to enable widespread adoption of these novel architectures, it is critical to have an automatic compilation flow as opposed to relying on a manual mapping of specific kernels on the crossbar arrays. We demonstrate the programmability of memristor-based accelerators using the new compiler design principle of multilevel rewriting, where a hierarchy of abstractions lowers programs level-by-level and perform code transformations at the most suitable abstraction. In particular, we develop a prototype compiler, which progressively lowers a mathematical notation for tensor operations arising in ML workloads, to fixed-function memristor-based hardware blocks.},
}Downloads
2107_Khan_TCAD [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Domain specific languages to tame heterogeneous and emerging computing systems", In ACM SIGHPC conference Platform for Advanced Scientific Computing PASC'21 (keynote), Jul 2021. [Bibtex & Downloads]
Domain specific languages to tame heterogeneous and emerging computing systems
Reference
Jeronimo Castrillon, "Domain specific languages to tame heterogeneous and emerging computing systems", In ACM SIGHPC conference Platform for Advanced Scientific Computing PASC'21 (keynote), Jul 2021.
Abstract
Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging computer architectures. This complexity will make it harder to democratize high-performance computing, which already today highly relies on expert programmers to write efficient parallel code. This talk discusses domain specific languages (DSLs) as a promising avenue to tame heterogeneity for non-expert programmers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning. The talk closes with insights on how compilers can leverage the high-level semantics of DSLs to optimize for emerging memory technologies.
Bibtex
@Misc{castrillon_pasc21,
author = {Castrillon, Jeronimo},
title = {Domain specific languages to tame heterogeneous and emerging computing systems},
howpublished = {ACM SIGHPC conference Platform for Advanced Scientific Computing PASC'21 (keynote)},
location = {Geneva (virtual), Switzerland},
abstract = {Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging computer architectures. This complexity will make it harder to democratize high-performance computing, which already today highly relies on expert programmers to write efficient parallel code. This talk discusses domain specific languages (DSLs) as a promising avenue to tame heterogeneity for non-expert programmers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning. The talk closes with insights on how compilers can leverage the high-level semantics of DSLs to optimize for emerging memory technologies.},
month = jul,
year = {2021},
}Downloads
210709_castrillon_PASC-sent [PDF]
Related Paths
Permalink
- Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, pp. 1111–1116, Jul 2021. [doi] [Bibtex & Downloads]
BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory
Reference
Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, pp. 1111–1116, Jul 2021. [doi]
Bibtex
@InProceedings{khan_dac21,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
booktitle = {Proceedings of the 58th Annual Design Automation Conference (DAC'21)},
title = {{BLO}wing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory},
doi = {10.1109/DAC18074.2021.9586167},
location = {San Francisco, California},
pages = {1111--1116},
series = {DAC '21},
url = {https://ieeexplore.ieee.org/document/9586167},
publisher = {ACM},
month = jul,
year = {2021},
}Downloads
2112_Hakert_DAC [PDF]
Permalink
- Andr'es Goens, Jeronimo Castrillon, "Embeddings of Task Mappings to Multicore Systems", Proceedings of the 21st IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, pp. 161–176, Berlin, Heidelberg, Jul 2021. [doi] [Bibtex & Downloads]
Embeddings of Task Mappings to Multicore Systems
Reference
Andr'es Goens, Jeronimo Castrillon, "Embeddings of Task Mappings to Multicore Systems", Proceedings of the 21st IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, pp. 161–176, Berlin, Heidelberg, Jul 2021. [doi]
Abstract
The problem of finding good mappings is central to designing and executing applications efficiently in embedded systems. In heterogeneous multicores, which are ubiquitous today, this problem yields an intractably large design space of possible mappings. Most methods explore this space using heuristics, many of which implicitly use geometric notions in mappings. In this paper we explore the geometry of the mapping problem explicitly, for finding embeddings of the mapping space that capture its structure. This allows us to formulate new mapping strategies by leveraging the geometry of the mapping space, as well as improving existing heuristics that do so implicitly. We evaluate our approach on a novel mapping heuristic based on gradient descent, as well as multiple existing meta-heuristics. For complex architectures, our methods improved the results of established exploration meta-heuristics by about an order of magnitude in average.
Bibtex
@InProceedings{goens_samos21,
author = {Andrés Goens and Jeronimo Castrillon},
booktitle = {Proceedings of the 21st IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS)},
date = {2021-07},
title = {Embeddings of Task Mappings to Multicore Systems},
doi = {10.1007/978-3-031-04580-6_11},
isbn = {978-3-031-04579-0},
location = {Samos, Greece},
organization = {IEEE},
pages = {161--176},
publisher = {Springer-Verlag},
url = {https://doi.org/10.1007/978-3-031-04580-6_11},
abstract = {The problem of finding good mappings is central to designing and executing applications efficiently in embedded systems. In heterogeneous multicores, which are ubiquitous today, this problem yields an intractably large design space of possible mappings. Most methods explore this space using heuristics, many of which implicitly use geometric notions in mappings. In this paper we explore the geometry of the mapping problem explicitly, for finding embeddings of the mapping space that capture its structure. This allows us to formulate new mapping strategies by leveraging the geometry of the mapping space, as well as improving existing heuristics that do so implicitly. We evaluate our approach on a novel mapping heuristic based on gradient descent, as well as multiple existing meta-heuristics. For complex architectures, our methods improved the results of established exploration meta-heuristics by about an order of magnitude in average.},
address = {Berlin, Heidelberg},
month = jul,
numpages = {16},
year = {2021},
}Downloads
2107_Goens_SAMOS [PDF]
Related Paths
Permalink
- Andr'es Goens, Timo Nicolai, Jeronimo Castrillon, "mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1592-1605, Jul 2021. [doi] [Bibtex & Downloads]
mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies
Reference
Andr'es Goens, Timo Nicolai, Jeronimo Castrillon, "mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1592-1605, Jul 2021. [doi]
Bibtex
@Article{goens_tcad21,
author = {Andrés Goens and Timo Nicolai and Jeronimo Castrillon},
date = {2021-07},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)},
title = {mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies},
month = jul,
numpages = {14 pp},
volume={41},
number={6},
pages={1592-1605},
doi = {10.1109/TCAD.2021.3102512},
issn = {1937-4151},
url = {https://ieeexplore.ieee.org/document/9506807},
publisher = {IEEE Press},
year = {2021},
}Downloads
2107_Goens_TCAD [PDF]
Permalink
- Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Burkhard Ringlein, Stephanie Soldavini, Christian Pilato, Mattia Tibaldi, Fabrizio Ferrandi, Stanislav Böhm, Francesco Regazzoni, Kartik Nayak, "EVEREST: Definition of the compilation framework", Technical report, EVEREST consortium, Jul 2021. [Bibtex & Downloads]
EVEREST: Definition of the compilation framework
Reference
Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Burkhard Ringlein, Stephanie Soldavini, Christian Pilato, Mattia Tibaldi, Fabrizio Ferrandi, Stanislav Böhm, Francesco Regazzoni, Kartik Nayak, "EVEREST: Definition of the compilation framework", Technical report, EVEREST consortium, Jul 2021.
Bibtex
@Report{castrillon_everestD4.1_2021,
author = {Jeronimo Castrillon and Felix Wittwer and Karl Friebel and Gerald Hempel and Burkhard Ringlein and Stephanie Soldavini and Christian Pilato and Mattia Tibaldi and Fabrizio Ferrandi and Stanislav B{\"o}hm and Francesco Regazzoni and Kartik Nayak},
institution = {EVEREST consortium},
title = {{EVEREST}: Definition of the compilation framework},
type = {techreport},
url = {https://drive.switch.ch/index.php/s/3lloP4p1ukGUdJx},
month = jul,
year = {2021},
}Downloads
2107_Castrillon-Everest-D4 [1]
Permalink
- Nesrine Khouzami, Lars Schütze, Pietro Incardona, Landfried Kraaz, Tina Subic, Jeronimo Castrillon, Ivo F. Sbalzarini, "The OpenPME Problem Solving Environment for Numerical Simulations", In Proceeding: International Conference on Computational Science (ICCS'21) (Paszynski, Maciej and Kranzlmüller, Dieter and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.), Springer International Publishing, pp. 614–627, Cham, Jun 2021. [doi] [Bibtex & Downloads]
The OpenPME Problem Solving Environment for Numerical Simulations
Reference
Nesrine Khouzami, Lars Schütze, Pietro Incardona, Landfried Kraaz, Tina Subic, Jeronimo Castrillon, Ivo F. Sbalzarini, "The OpenPME Problem Solving Environment for Numerical Simulations", In Proceeding: International Conference on Computational Science (ICCS'21) (Paszynski, Maciej and Kranzlmüller, Dieter and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.), Springer International Publishing, pp. 614–627, Cham, Jun 2021. [doi]
Abstract
We introduce OpenPME, the Open Particle-Mesh Environment, a problem solving environment that provides a Domain Specific Language (DSL) for numerical simulations in scientific computing. It is built atop a domain metamodel that is general enough to cover the main types of numerical simulations: simulations using particles, meshes, and hybrid combinations of particles and meshes. Using model-to-model transformations, OpenPME generates code against the state-of-the-art C++ parallel computing library OpenFPM. This effectively lowers the programming barrier and enables users to implement scalable simulation codes for high-performance computing (HPC) systems using high-level abstractions. Plenty of recent research has shown that higher-level abstractions and problem solving environments are well suited to alleviate low-level implementation overhead. We demonstrate this for OpenPME and its compiler on three different test cases—particle-based, mesh-based, and hybrid particle-mesh—showing up to 7-fold reduction in the number of lines of code compared to a direct OpenFPM implementation in C++.
Bibtex
@InProceedings{khouzami_iccs21,
author = {Nesrine Khouzami and Lars Sch{\"u}tze and Pietro Incardona and Landfried Kraaz and Tina Subic and Jeronimo Castrillon and Ivo F. Sbalzarini},
booktitle = {International Conference on Computational Science (ICCS'21)},
title = {The OpenPME Problem Solving Environment for Numerical Simulations},
doi = {10.1007/978-3-030-77961-0_49},
editor = {Paszynski, Maciej and Kranzlm{\"u}ller, Dieter and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.},
isbn = {978-3-030-77961-0},
location = {Krakow (virtual), Poland},
organization = {Springer},
pages = {614--627},
publisher = {Springer International Publishing},
url = {https://link.springer.com/chapter/10.1007%2F978-3-030-77961-0_49},
abstract = {We introduce OpenPME, the Open Particle-Mesh Environment, a problem solving environment that provides a Domain Specific Language (DSL) for numerical simulations in scientific computing. It is built atop a domain metamodel that is general enough to cover the main types of numerical simulations: simulations using particles, meshes, and hybrid combinations of particles and meshes. Using model-to-model transformations, OpenPME generates code against the state-of-the-art C++ parallel computing library OpenFPM. This effectively lowers the programming barrier and enables users to implement scalable simulation codes for high-performance computing (HPC) systems using high-level abstractions. Plenty of recent research has shown that higher-level abstractions and problem solving environments are well suited to alleviate low-level implementation overhead. We demonstrate this for OpenPME and its compiler on three different test cases---particle-based, mesh-based, and hybrid particle-mesh---showing up to 7-fold reduction in the number of lines of code compared to a direct OpenFPM implementation in C++.},
address = {Cham},
month = jun,
numpages = {14},
year = {2021},
}Downloads
2106_Khouzami_ICCS [PDF]
Related Paths
Permalink
- Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), 12 pp., Jun 2021. [doi] [Bibtex & Downloads]
ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering
Reference
Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), 12 pp., Jun 2021. [doi]
Abstract
Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments, however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligner themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferring between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44% and 21-49%, respectively.
Bibtex
@Article{hameed_tetc21,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
journal = {IEEE Transactions on Emerging Topics in Computing (IEEE TETC)},
title = {{ALPHA}: A Novel Algorithm-Hardware Co-design for Accelerating {DNA} Seed Location Filtering},
pages = {12 pp.},
abstract = {Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments, however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligner themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferring between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54\%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44\% and 21-49\%, respectively.},
month = jun,
year = {2021},
doi = {10.1109/TETC.2021.3093840},
issn = {2168-6750},
}Downloads
2107_hameed_TETC [PDF]
Permalink
- Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "An online guided tuning approach to run CNN pipelines on edge devices", Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21), Association for Computing Machinery (ACM), pp. 45–53, New York, NY, USA, May 2021. [doi] [Bibtex & Downloads]
An online guided tuning approach to run CNN pipelines on edge devices
Reference
Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "An online guided tuning approach to run CNN pipelines on edge devices", Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21), Association for Computing Machinery (ACM), pp. 45–53, New York, NY, USA, May 2021. [doi]
Abstract
Modern edge and mobile devices are equipped with powerful computing resources. These are often organized as heterogeneous multi-cores, featuring performance-asymmetric core clusters. This raises the question on how to effectively execute the inference pass of convolutional neural networks (CNN) on such devices. Existing CNN implementations on edge devices leverage offline profiling data to determine a better schedule for CNN applications. This approach requires a time consuming phase of generating a performance profile for each type of representative kernel on various core configurations available on the device, coupled with a search space exploration. We propose an online tuning technique which utilizes compile time hints and online profiling data to generate high throughput CNN pipelines. We explore core heterogeneity and compatible core-layer configurations through an online guided search. Unlike exhaustive search, we adopt an evolutionary approach with a guided starting point in order to find the solution. We show that by pruning and navigating through the complex search space using compile time hints, 79% of the tested configurations turn out to be near-optimal candidates for a throughput maximizing pipeline on NVIDIA Jetson TX2 platform.
Bibtex
@InProceedings{soomro_cf21,
author = {Pirah Noor Soomro and Mustafa Abduljabbar and Jeronimo Castrillon and Miquel Peric{\'a}s},
booktitle = {Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21)},
title = {An online guided tuning approach to run CNN pipelines on edge devices},
doi = {10.1145/3457388.3458662},
isbn = {9781450384049},
location = {Virtual Event, Italy},
pages = {45–53},
publisher = {Association for Computing Machinery (ACM)},
series = {CF '21},
url = {https://doi.org/10.1145/3457388.3458662},
abstract = {Modern edge and mobile devices are equipped with powerful computing resources. These are often organized as heterogeneous multi-cores, featuring performance-asymmetric core clusters. This raises the question on how to effectively execute the inference pass of convolutional neural networks (CNN) on such devices. Existing CNN implementations on edge devices leverage offline profiling data to determine a better schedule for CNN applications. This approach requires a time consuming phase of generating a performance profile for each type of representative kernel on various core configurations available on the device, coupled with a search space exploration. We propose an online tuning technique which utilizes compile time hints and online profiling data to generate high throughput CNN pipelines. We explore core heterogeneity and compatible core-layer configurations through an online guided search. Unlike exhaustive search, we adopt an evolutionary approach with a guided starting point in order to find the solution. We show that by pruning and navigating through the complex search space using compile time hints, 79\% of the tested configurations turn out to be near-optimal candidates for a throughput maximizing pipeline on NVIDIA Jetson TX2 platform.},
address = {New York, NY, USA},
month = may,
numpages = {9},
year = {2021},
}Downloads
2105_Soomro-CF [PDF]
Permalink
- Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Jan Martinovic, Stanislav Böhm, Martin Surkovsky, Michele Paolino, Fabrizio Ferrandi, Serena Curzel, Michele Fiorito, Christian Pilato, Stephanie Soldavini, Gianluca Palermo, Dionysios Diamantopoulos, "EVEREST: Definition of Language Requirements", Technical report, EVEREST consortium, Apr 2021. [Bibtex & Downloads]
EVEREST: Definition of Language Requirements
Reference
Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Jan Martinovic, Stanislav Böhm, Martin Surkovsky, Michele Paolino, Fabrizio Ferrandi, Serena Curzel, Michele Fiorito, Christian Pilato, Stephanie Soldavini, Gianluca Palermo, Dionysios Diamantopoulos, "EVEREST: Definition of Language Requirements", Technical report, EVEREST consortium, Apr 2021.
Bibtex
@Report{castrillon_everestD2.2_2021,
author = {Jeronimo Castrillon and Felix Wittwer and Karl Friebel and Gerald Hempel and Jan Martinovic and Stanislav B{\"o}hm and Martin Surkovsky and Michele Paolino and Fabrizio Ferrandi and Serena Curzel and Michele Fiorito and Christian Pilato and Stephanie Soldavini and Gianluca Palermo and Dionysios Diamantopoulos},
institution = {EVEREST consortium},
title = {{EVEREST}: Definition of Language Requirements},
type = {techreport},
url = {https://drive.switch.ch/index.php/s/ddn1yGnHavgzXpB},
month = apr,
year = {2021},
}Downloads
2104_Castrillon-Everest-D2 [2]
Permalink
- Weihua Sheng, Jeronimo Castrillon, Rainer Leupers, "Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms", Chapter in Multi-Processor System-on-Chip 2 (Liliana Andrade and Frédéric Rousseau), ISTE Ltd London and Wiley Hoboke, pp. 203–235, Mar 2021. [Bibtex & Downloads]
Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms
Reference
Weihua Sheng, Jeronimo Castrillon, Rainer Leupers, "Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms", Chapter in Multi-Processor System-on-Chip 2 (Liliana Andrade and Frédéric Rousseau), ISTE Ltd London and Wiley Hoboke, pp. 203–235, Mar 2021.
Abstract
This chapter addresses the challenges associated with compilation and optimization techniques for heterogeneous multi-core computing systems in the embedded industry. Wireless terminals and modems are typical examples of such systems, which demand high performance and energy efficiency at the same time. To fully exploit the computing power of those systems, the existing compiler technology for single processor systems does not suit the need and scale for multi-core architectures anymore. The authors have applied a systematic approach to tackle the problems of application modeling, source-to-source compilation, flexible compiler infrastructure construction and software distribution for multi-core architectures from a practical perspective. Several real-world multi-core platforms as well as system-level virtual platforms have been successfully used to demonstrate the achievable speed-ups and versatility of the compilation and optimization techniques developed in this work.
Bibtex
@InCollection{sheng_mpsoc21,
author = {Weihua Sheng and Jeronimo Castrillon and Rainer Leupers},
booktitle = {Multi-Processor System-on-Chip 2},
year = {2021},
title = {Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms},
chapter = {10},
editor = {Liliana Andrade and Fr{\'e}d{\'e}ric Rousseau},
isbn = {978-17-894-5022-4},
pages = {203--235},
publisher = {ISTE Ltd London and Wiley Hoboke},
abstract = {This chapter addresses the challenges associated with compilation and optimization techniques for heterogeneous multi-core computing systems in the embedded industry. Wireless terminals and modems are typical examples of such systems, which demand high performance and energy efficiency at the same time. To fully exploit the computing power of those systems, the existing compiler technology for single processor systems does not suit the need and scale for multi-core architectures anymore. The authors have applied a systematic approach to tackle the problems of application modeling, source-to-source compilation, flexible compiler infrastructure construction and software distribution for multi-core architectures from a practical perspective. Several real-world multi-core platforms as well as system-level virtual platforms have been successfully used to demonstrate the achievable speed-ups and versatility of the compilation and optimization techniques developed in this work.},
month = mar,
url = {http://www.iste.co.uk/book.php?id=1739}
}Downloads
2103_Sheng_MPSoC20-preprint [PDF]
Permalink
- Christian Pilato, Stanislav Bohm, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Vojtech Cima, Radim Cmar, Dionysios Diamantopoulos, Fabrizio Ferrandi, Jan Martinovic, Gianluca Palermo, Michele Paolino, Antonio Parodi, Lorenzo Pittaluga, Daniel Raho, Francesco Regazzoni, Katerina Slaninova, Christoph Hagleitner, "EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms", Proceedings of the 2021 Design, Automation and Test in Europe Conference (DATE), pp. 1320–1325, Feb 2021. [doi] [Bibtex & Downloads]
EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms
Reference
Christian Pilato, Stanislav Bohm, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Vojtech Cima, Radim Cmar, Dionysios Diamantopoulos, Fabrizio Ferrandi, Jan Martinovic, Gianluca Palermo, Michele Paolino, Antonio Parodi, Lorenzo Pittaluga, Daniel Raho, Francesco Regazzoni, Katerina Slaninova, Christoph Hagleitner, "EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms", Proceedings of the 2021 Design, Automation and Test in Europe Conference (DATE), pp. 1320–1325, Feb 2021. [doi]
Bibtex
@InProceedings{pilato_date21,
author = {Christian Pilato and Stanislav Bohm and Fabien Brocheton and Jeronimo Castrillon and Riccardo Cevasco and Vojtech Cima and Radim Cmar and Dionysios Diamantopoulos and Fabrizio Ferrandi and Jan Martinovic and Gianluca Palermo and Michele Paolino and Antonio Parodi and Lorenzo Pittaluga and Daniel Raho and Francesco Regazzoni and Katerina Slaninova and Christoph Hagleitner},
booktitle = {Proceedings of the 2021 Design, Automation and Test in Europe Conference (DATE)},
title = {{EVEREST}: A design environment for extreme-scale big data analytics on heterogeneous platforms},
location = {Virtual Conference},
series = {DATE'21},
month = feb,
year = {2021},
pages = {1320--1325},
url = {https://ieeexplore.ieee.org/document/9473940},
doi = {10.23919/DATE51398.2021.9473940},
}Downloads
2102_Pilato_DATE [PDF]
Permalink
- Hasna Bouraoui, Chadlia Jerad, Jeronimo Castrillon, "Towards Adaptive multi-Alternative Process Network", Proceedings of the 12th Workshop and 10th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'21), co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (Bispo, João and Cherubin, Stefano and Flich, Jos'e), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 88, pp. 1:1–1:11, Dagstuhl, Germany, Jan 2021. [doi] [Bibtex & Downloads]
Towards Adaptive multi-Alternative Process Network
Reference
Hasna Bouraoui, Chadlia Jerad, Jeronimo Castrillon, "Towards Adaptive multi-Alternative Process Network", Proceedings of the 12th Workshop and 10th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'21), co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (Bispo, João and Cherubin, Stefano and Flich, Jos'e), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 88, pp. 1:1–1:11, Dagstuhl, Germany, Jan 2021. [doi]
Bibtex
@InProceedings{bouraoui_parma21,
author = {Hasna Bouraoui and Chadlia Jerad and Jeronimo Castrillon},
booktitle = {Proceedings of the 12th Workshop and 10th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'21), co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Towards Adaptive multi-Alternative Process Network},
doi = {10.4230/OASIcs.PARMA-DITAM.2021.1},
editor = {Bispo, Jo\~{a}o and Cherubin, Stefano and Flich, Jos\'{e}},
isbn = {978-3-95977-181-8},
location = {Budapest, Hungary},
pages = {1:1--1:11},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
series = {Open Access Series in Informatics (OASIcs)},
url = {https://drops.dagstuhl.de/opus/volltexte/2021/13637/pdf/OASIcs-PARMA-DITAM-2021-1.pdf},
volume = {88},
address = {Dagstuhl, Germany},
issn = {2190-6807},
month = jan,
numpages = {10},
urn = {urn:nbn:de:0030-drops-136378},
year = {2021},
}Downloads
2101_Bouraoui_PARMA [PDF]
Permalink
- Christian Menard, Andr'es Goens, Gerald Hempel, Robert Khasanov, Julian Robledo, Felix Teweleitt, Jeronimo Castrillon, "Mocasin—Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores", Proceedings of the 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 66–73, New York, NY, USA, Jan 2021. (Video Presentation) [doi] [Bibtex & Downloads]
Mocasin—Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores
Reference
Christian Menard, Andr'es Goens, Gerald Hempel, Robert Khasanov, Julian Robledo, Felix Teweleitt, Jeronimo Castrillon, "Mocasin—Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores", Proceedings of the 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 66–73, New York, NY, USA, Jan 2021. (Video Presentation) [doi]
Abstract
We present Mocasin, an open-source rapid prototyping framework for researching, implementing and validating new algorithms and solutions in the field of mapping software to heterogeneous multi-cores. In contrast to the many existing tools that often specialize for a particular use-case, Mocasin is an open, flexible and generic research environment that abstracts over the approaches taken by other tools. Mocasin is designed to support a wide range of models of computation and input formats, implements manifold mapping strategies and provides an adjustable high-level simulator for performance estimation. This infrastructure serves as a flexible vehicle for exploring new approaches and as a blueprint for building customized tools. We highlight the key design aspects of Mocasin that enable its flexibility and illustrate its capabilities in a case-study showing how Mocasin can be used for building a customized tool for researching runtime mapping strategies in an LTE uplink receiver.
Bibtex
@InProceedings{menard_rapido21,
author = {Christian Menard and Andrés Goens and Gerald Hempel and Robert Khasanov and Julian Robledo and Felix Teweleitt and Jeronimo Castrillon},
title = {Mocasin---Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores},
booktitle = {Proceedings of the 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
year = {2021},
address = {New York, NY, USA},
month = jan,
publisher = {ACM},
doi = {10.1145/3444950.3447285},
isbn = {9781450389525},
location = {Budapest, Hungary},
pages = {66–73},
publisher = {Association for Computing Machinery},
series = {DroneSE and RAPIDO '21},
url = {https://doi.org/10.1145/3444950.3447285},
abstract = {We present Mocasin, an open-source rapid prototyping framework for researching, implementing and validating new algorithms and solutions in the field of mapping software to heterogeneous multi-cores. In contrast to the many existing tools that often specialize for a particular use-case, Mocasin is an open, flexible and generic research environment that abstracts over the approaches taken by other tools. Mocasin is designed to support a wide range of models of computation and input formats, implements manifold mapping strategies and provides an adjustable high-level simulator for performance estimation. This infrastructure serves as a flexible vehicle for exploring new approaches and as a blueprint for building customized tools. We highlight the key design aspects of Mocasin that enable its flexibility and illustrate its capabilities in a case-study showing how Mocasin can be used for building a customized tool for researching runtime mapping strategies in an LTE uplink receiver.},
numpages = {8},
}Downloads
2101_Menard_RAPIDO [PDF]
Permalink
2020
- Lars Schütze, Jeronimo Castrillon, "Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages", Proceedings of the17th International Conference on Managed Programming Languages & Runtimes (MPLR'20), Association for Computing Machinery, pp. 52–62, New York, NY, USA, Nov 2020. [doi] [Bibtex & Downloads]
Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages
Reference
Lars Schütze, Jeronimo Castrillon, "Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages", Proceedings of the17th International Conference on Managed Programming Languages & Runtimes (MPLR'20), Association for Computing Machinery, pp. 52–62, New York, NY, USA, Nov 2020. [doi]
Abstract
Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. The cost of adaptivity is however a high runtime overhead stemming from executing compositions of behavior-modifying code. It has been shown that the overhead can be reduced by optimizing dispatch plans at runtime for static cases, but no method exists to reduce the overhead in cases with high variability. This paper presents a novel approach to implement polymorphic role dispatch, taking advantage of dependent types and using run-time information to effectively guard abstractions and enable reuse. The concept of polymorphic inline caches is extended to role invocations. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a geometric mean speedup of 4.0$\times$ (3.8$\times$ up to 4.5$\times$) in the static case, and close to no overhead in the dynamic case over the current implementation of contextual roles in Object Teams.
Bibtex
@InProceedings{schuetze_mplr20,
author = {Lars Sch{\"u}tze and Jeronimo Castrillon},
booktitle = {Proceedings of the17th International Conference on Managed Programming Languages \& Runtimes (MPLR'20)},
title = {Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages},
location = {Virtual, UK},
pages = {52--62},
numpages = {11},
series = {MPLR'20},
abstract = {Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. The cost of adaptivity is however a high runtime overhead stemming from executing compositions of behavior-modifying code. It has been shown that the overhead can be reduced by optimizing dispatch plans at runtime for static cases, but no method exists to reduce the overhead in cases with high variability. This paper presents a novel approach to implement polymorphic role dispatch, taking advantage of dependent types and using run-time information to effectively guard abstractions and enable reuse. The concept of polymorphic inline caches is extended to role invocations. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a geometric mean speedup of 4.0$\times$ (3.8$\times$ up to 4.5$\times$) in the static case, and close to no overhead in the dynamic case over the current implementation of contextual roles in Object Teams.},
year = {2020},
month = nov,
numpages = {9},
isbn = {9781450388535},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3426182.3426186},
doi = {10.1145/3426182.3426186},
}Downloads
2010_Schuetze_MPLR [PDF]
Permalink
- Robert Wittig, Andrés Goens, Christian Menard, Emil Matus, Gerhard P. Fettweis, Jeronimo Castrillon, "Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach", Proceedings of the 27th International Conference on Telecommunications (ICT), pp. 1-5, Oct 2020. [doi] [Bibtex & Downloads]
Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach
Reference
Robert Wittig, Andrés Goens, Christian Menard, Emil Matus, Gerhard P. Fettweis, Jeronimo Castrillon, "Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach", Proceedings of the 27th International Conference on Telecommunications (ICT), pp. 1-5, Oct 2020. [doi]
Abstract
In the era of 5G and beyond, adaptive workloads and the need for energy efficiency drive are becoming increasingly vital. Changes in parameters of the physical layer algorithm can cascade throughout the algorithm, requiring additional changes to keep a correct functionality within the timing bounds. These factors drive the process of designing systems for mobile communication towards reconfigurability. In this paper we analyze the trade-offs involved in changing algorithmic parameters and show how reconfigurable systems can be used to produce energy-efficient systems. We argue that we ought to resort to formal models to tame this reconfigurability and examine where existing formal models fall short.
Bibtex
@InProceedings{goens_ict20,
author = {Robert Wittig and Andr{\'e}s Goens and Christian Menard and Emil Matus and Gerhard P. Fettweis and Jeronimo Castrillon},
booktitle = {Proceedings of the 27th International Conference on Telecommunications (ICT)},
title = {Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach},
location = {Virtual. Bali, Indonesia},
month = oct,
abstract = {In the era of 5G and beyond, adaptive workloads and the need for energy efficiency drive are becoming increasingly vital. Changes in parameters of the physical layer algorithm can cascade throughout the algorithm, requiring additional changes to keep a correct functionality within the timing bounds. These factors drive the process of designing systems for mobile communication towards reconfigurability. In this paper we analyze the trade-offs involved in changing algorithmic parameters and show how reconfigurable systems can be used to produce energy-efficient systems. We argue that we ought to resort to formal models to tame this reconfigurability and examine where existing formal models fall short.},
year = {2020},
pages={1-5},
doi={10.1109/ICT49546.2020.9239539},
url = {https://ieeexplore.ieee.org/document/9239539},
}Downloads
2010_Wittig_ICT [PDF]
Permalink
- Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20), IEEE Press, vol. 39, no. 11, pp. 3968–3980, Oct 2020. [doi] [Bibtex & Downloads]
Polyhedral Compilation for Racetrack Memories
Reference
Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20), IEEE Press, vol. 39, no. 11, pp. 3968–3980, Oct 2020. [doi]
Abstract
Traditional memory hierarchy designs, primarily based on SRAM and DRAM, become increasingly unsuitable to meet the performance, energy, bandwidth and area requirements of modern embedded and high-performance computer systems. Racetrack Memory (RTM), an emerging non-volatile memory technology, promises to meet these conflicting demands by offering simultaneously high speed, higher density, and non-volatility. RTM provides these efficiency gains by not providing immediate access to all storage locations, but by instead storing data sequentially in the equivalent to nanoscale tapes called tracks. Before any data can be accessed, explicit shift operations must be issued that cost energy and increase access latency. The result is a fundamental change in memory performance behavior: the address distance between subsequent memory accesses now has a linear effect on memory performance. While there are first techniques to optimize programs for linear-latency memories such as RTM, existing automatic solutions treat only scalar memory accesses. This work presents the first automatic compilation framework that optimizes static loop programs over arrays for linear-latency memories. We extend the polyhedral compilation framework Polly to generate code that maximizes accesses to the same or consecutive locations, thereby minimizing the number of shifts. Our experimental results show that the optimized code incurs up to 85% fewer shifts (average 41%), improving both performance and energy consumption by an average of 17.9% and 39.8%, respectively. Our results show that automatic techniques make it possible to effectively program linear-latency memory architectures such as RTM.
Bibtex
@Article{khan_cases20,
author = {Asif Ali Khan and Hauke Mewes and Tobias Grosser and Torsten Hoefler and Jeronimo Castrillon},
title = {Polyhedral Compilation for Racetrack Memories},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20)},
year = {2020},
series = {CASES ’20},
month = oct,
doi = {10.1109/TCAD.2020.3012266},
url = {https://ieeexplore.ieee.org/document/9216560},
volume={39},
number={11},
pages={3968--3980},
issn = {1937-4151},
issn = {1937-4151},
abstract = {Traditional memory hierarchy designs, primarily based on SRAM and DRAM, become increasingly unsuitable to meet the performance, energy, bandwidth and area requirements of modern embedded and high-performance computer systems. Racetrack Memory (RTM), an emerging non-volatile memory technology, promises to meet these conflicting demands by offering simultaneously high speed, higher density, and non-volatility. RTM provides these efficiency gains by not providing immediate access to all storage locations, but by instead storing data sequentially in the equivalent to nanoscale tapes called tracks. Before any data can be accessed, explicit shift operations must be issued that cost energy and increase access latency. The result is a fundamental change in memory performance behavior: the address distance between subsequent memory accesses now has a linear effect on memory performance. While there are first techniques to optimize programs for linear-latency memories such as RTM, existing automatic solutions treat only scalar memory accesses. This work presents the first automatic compilation framework that optimizes static loop programs over arrays for linear-latency memories. We extend the polyhedral compilation framework Polly to generate code that maximizes accesses to the same or consecutive locations, thereby minimizing the number of shifts. Our experimental results show that the optimized code incurs up to 85\% fewer shifts (average 41\%), improving both performance and energy consumption by an average of 17.9\% and 39.8\%, respectively. Our results show that automatic techniques make it possible to effectively program linear-latency memory architectures such as RTM.},
booktitle = {Proceedings of the 2020 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
location = {Virtual conference},
numpages = {12},
publisher = {IEEE Press},
}Downloads
2009_Khan_CASES [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "The role of domain-specific languages for cyber-physical systems", In Seminar series: Design and Programming Cyber-Physical Systems and IoT Applications (invited talk), Oct 2020. [Bibtex & Downloads]
The role of domain-specific languages for cyber-physical systems
Reference
Jeronimo Castrillon, "The role of domain-specific languages for cyber-physical systems", In Seminar series: Design and Programming Cyber-Physical Systems and IoT Applications (invited talk), Oct 2020.
Abstract
Embedded and cyber-physical systems (CPS) are heterogeneous interconnected computing systems with an ever increasing complexity. As CPSs become more widespread, the developer community widens, exposing the complexity to mainstream programmers. In this talk, I will talk about domain-specific languages (DSLs) as a promising avenue to handle complexity without compromising on efficiency. The talk will provide background on programming languages and go over sample DSLs from different communities. An in-depth example will serve to grasp the power that lies in DSLs for efficiency, correctness and ease to target complex emerging systems.
Bibtex
@Misc{castrillon_ensi20,
author = {Castrillon, Jeronimo},
title = {The role of domain-specific languages for cyber-physical systems},
year = {2020},
howpublished = {Seminar series: Design and Programming Cyber-Physical Systems and IoT Applications (invited talk)},
location = {Tunis, Tunisia (Virtual)},
month = oct,
abstract = {Embedded and cyber-physical systems (CPS) are heterogeneous interconnected computing systems with an ever increasing complexity. As CPSs become more widespread, the developer community widens, exposing the complexity to mainstream programmers. In this talk, I will talk about domain-specific languages (DSLs) as a promising avenue to handle complexity without compromising on efficiency. The talk will provide background on programming languages and go over sample DSLs from different communities. An in-depth example will serve to grasp the power that lies in DSLs for efficiency, correctness and ease to target complex emerging systems. },
url = {https://sites.google.com/ensi-uma.tn/seminar-series-on-cps-n-iot/home}
}Downloads
201006_castrill_ENSI-compressed [PDF]
Permalink
- Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, vol. 70, no. 11, pp. 1914-1927, Oct 2020. [doi] [Bibtex & Downloads]
Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling
Reference
Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, vol. 70, no. 11, pp. 1914-1927, Oct 2020. [doi]
Abstract
In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block- based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7% compared to a state-of-the-art direct-mapped LLC design and by 7.2% compared to an existing associative LLC design.
Bibtex
@Article{hameed_tc20,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling},
journal = {IEEE Transactions on Computers},
year = {2020},
month = oct,
abstract = {In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block- based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7\% compared to a state-of-the-art direct-mapped LLC design and by 7.2\% compared to an existing associative LLC design.},
doi = {10.1109/TC.2020.3029615},
url = {https://ieeexplore.ieee.org/document/9220805},
issn = {0018-9340},
numpages = {14},
volume={70},
number={11},
pages={1914-1927},
}Downloads
2010_Hameed_TC [PDF]
Related Paths
Permalink
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi] [Bibtex & Downloads]
Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories
Reference
Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi]
Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32% and 73% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80%.
Bibtex
@Article{khan_tecs20,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
year = {2020},
month = sep,
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {19},
number = {6},
issn = {1539-9087},
url = {https://doi.org/10.1145/3396235},
doi = {10.1145/3396235},
articleno = {44},
numpages = {26},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32\% and 73\% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80\%.},
}Downloads
2009_Khan_TECS [PDF]
Related Paths
Permalink
- Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon, "ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers", In Proceeding: 2020 Forum for Specification and Design Languages (FDL), pp. 1-4, Sep 2020. [doi] [Bibtex & Downloads]
ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers
Reference
Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon, "ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers", In Proceeding: 2020 Forum for Specification and Design Languages (FDL), pp. 1-4, Sep 2020. [doi]
Abstract
Deep Learning methods have not only shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation. Therefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code.
Bibtex
@InProceedings{brauckmann_fdl20,
author = {Alexander Brauckmann and Andr\'{e}s Goens and Jeronimo Castrillon},
title = {ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers},
booktitle = {2020 Forum for Specification and Design Languages (FDL)},
year = {2020},
location = {Kiel, Germany},
month = sep,
pages={1-4},
doi={10.1109/FDL50818.2020.9232946},
url = {https://ieeexplore.ieee.org/document/9232946},
abstract = {Deep Learning methods have not only shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation. Therefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code.},
}Downloads
2009_Brauckmann_FDL [PDF]
Permalink
- Marten Lohstroh, Christian Menard, Alexander Schulz-Rosengarten, Matthew Weber, Jeronimo Castrillon, Edward A. Lee, "A Language for Deterministic Coordination Across Multiple Timelines", In Proceeding: 2020 Forum for Specification and Design Languages (FDL), pp. 1-8, Sep 2020. (Best paper award candidate) [doi] [Bibtex & Downloads]
A Language for Deterministic Coordination Across Multiple Timelines
Reference
Marten Lohstroh, Christian Menard, Alexander Schulz-Rosengarten, Matthew Weber, Jeronimo Castrillon, Edward A. Lee, "A Language for Deterministic Coordination Across Multiple Timelines", In Proceeding: 2020 Forum for Specification and Design Languages (FDL), pp. 1-8, Sep 2020. (Best paper award candidate) [doi]
Abstract
We discuss a novel approach for constructing deterministic reactive systems that evolves around a temporal model which incorporates a multiplicity of timelines. This model is central to LINGUA FRANCA (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called Reactors, which are objects that react to and emit discrete events. What sets LF apart from other languages that treat time as a first-class citizen is that it confronts the issue that in any reactive system there are at least two distinct timelines involved; a logical one and a physical one-and possibly multiple of each kind. LF provides a mechanism for relating events across timelines, and guarantees deterministic program behavior under quantifiable assumptions.
Bibtex
@InProceedings{lohstroh_fdl20,
author = {Marten Lohstroh and Christian Menard and Alexander Schulz-Rosengarten and Matthew Weber and Jeronimo Castrillon and Edward A. Lee},
title = {A Language for Deterministic Coordination Across Multiple Timelines},
booktitle = {2020 Forum for Specification and Design Languages (FDL)},
year = {2020},
location = {Kiel, Germany},
month = sep,
abstract = {We discuss a novel approach for constructing deterministic reactive systems that evolves around a temporal model which incorporates a multiplicity of timelines. This model is central to LINGUA FRANCA (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called Reactors, which are objects that react to and emit discrete events. What sets LF apart from other languages that treat time as a first-class citizen is that it confronts the issue that in any reactive system there are at least two distinct timelines involved; a logical one and a physical one-and possibly multiple of each kind. LF provides a mechanism for relating events across timelines, and guarantees deterministic program behavior under quantifiable assumptions.},
pages={1-8},
doi={10.1109/FDL50818.2020.9232939},
url = {https://ieeexplore.ieee.org/document/9232939},
}Downloads
2009_Lohstroh_FDL [PDF]
Permalink
- Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi, Dibakar Gope, Thomas Grass, Anthony Gutierrez, Bagus Hanindhito, Andreas Hansson, Swapnil Haria, Austin Harris, Timothy Hayes, Adrian Herrera, Matthew Horsnell, Syed Ali Raza Jafri, Radhika Jagtap, Hanhwi Jang, Reiley Jeyapaul, Timothy M. Jones, Matthias Jung, Subash Kannoth, Hamidreza Khaleghzadeh, Yuetsu Kodama, Tushar Krishna, Tommaso Marinelli, Christian Menard, Andrea Mondelli, Miquel Moreto, Tiago Mück, Omar Naji, Krishnendra Nathella, Hoa Nguyen, Nikos Nikoleris, Lena E. Olson, Marc Orr, Binh Pham, Pablo Prieto, Trivikram Reddy, Alec Roelke, Mahyar Samani, Andreas Sandberg, Javier Setoain, Boris Shingarov, Matthew D. Sinclair, Tuan Ta, Rahul Thakur, Giacomo Travaglini, Michael Upton, Nilay Vaish, Ilias Vougioukas, William Wang, Zhengrong Wang, Norbert Wehn, Christian Weis, David A. Wood, Hongil Yoon, Éder F. Zulian, "The gem5 Simulator: Version 20.0+", In arXiv preprint arXiv:2007.03152, Jul 2020. [Bibtex & Downloads]
The gem5 Simulator: Version 20.0+
Reference
Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi, Dibakar Gope, Thomas Grass, Anthony Gutierrez, Bagus Hanindhito, Andreas Hansson, Swapnil Haria, Austin Harris, Timothy Hayes, Adrian Herrera, Matthew Horsnell, Syed Ali Raza Jafri, Radhika Jagtap, Hanhwi Jang, Reiley Jeyapaul, Timothy M. Jones, Matthias Jung, Subash Kannoth, Hamidreza Khaleghzadeh, Yuetsu Kodama, Tushar Krishna, Tommaso Marinelli, Christian Menard, Andrea Mondelli, Miquel Moreto, Tiago Mück, Omar Naji, Krishnendra Nathella, Hoa Nguyen, Nikos Nikoleris, Lena E. Olson, Marc Orr, Binh Pham, Pablo Prieto, Trivikram Reddy, Alec Roelke, Mahyar Samani, Andreas Sandberg, Javier Setoain, Boris Shingarov, Matthew D. Sinclair, Tuan Ta, Rahul Thakur, Giacomo Travaglini, Michael Upton, Nilay Vaish, Ilias Vougioukas, William Wang, Zhengrong Wang, Norbert Wehn, Christian Weis, David A. Wood, Hongil Yoon, Éder F. Zulian, "The gem5 Simulator: Version 20.0+", In arXiv preprint arXiv:2007.03152, Jul 2020.
Abstract
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give and overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.
Bibtex
@article{lowe-power_gem5_2020,
author={Jason Lowe-Power and Abdul Mutaal Ahmad and Ayaz Akram and Mohammad Alian and Rico Amslinger and Matteo Andreozzi and Adri{\`a} Armejach and Nils Asmussen and Brad Beckmann and Srikant Bharadwaj and Gabe Black and Gedare Bloom and Bobby R. Bruce and Daniel Rodrigues Carvalho and Jeronimo Castrillon and Lizhong Chen and Nicolas Derumigny and Stephan Diestelhorst and Wendy Elsasser and Carlos Escuin and Marjan Fariborz and Amin Farmahini-Farahani and Pouya Fotouhi and Ryan Gambord and Jayneel Gandhi and Dibakar Gope and Thomas Grass and Anthony Gutierrez and Bagus Hanindhito and Andreas Hansson and Swapnil Haria and Austin Harris and Timothy Hayes and Adrian Herrera and Matthew Horsnell and Syed Ali Raza Jafri and Radhika Jagtap and Hanhwi Jang and Reiley Jeyapaul and Timothy M. Jones and Matthias Jung and Subash Kannoth and Hamidreza Khaleghzadeh and Yuetsu Kodama and Tushar Krishna and Tommaso Marinelli and Christian Menard and Andrea Mondelli and Miquel Moreto and Tiago M{\"u}ck and Omar Naji and Krishnendra Nathella and Hoa Nguyen and Nikos Nikoleris and Lena E. Olson and Marc Orr and Binh Pham and Pablo Prieto and Trivikram Reddy and Alec Roelke and Mahyar Samani and Andreas Sandberg and Javier Setoain and Boris Shingarov and Matthew D. Sinclair and Tuan Ta and Rahul Thakur and Giacomo Travaglini and Michael Upton and Nilay Vaish and Ilias Vougioukas and William Wang and Zhengrong Wang and Norbert Wehn and Christian Weis and David A. Wood and Hongil Yoon and {\'E}der F. Zulian},
title = {The gem5 Simulator: Version 20.0+},
journal = {arXiv preprint arXiv:2007.03152},
url = {https://arxiv.org/abs/2007.03152},
year = {2020},
month = jul,
abstract = {The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give and overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.},
}Downloads
2007_Lowe-Power-Gem5 [PDF]
Permalink
- Christian Menard, Andrés Goens, Marten Lohstroh, Jeronimo Castrillon, "Achieving Determinism in Adaptive AUTOSAR", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 822–827, Mar 2020. (Best paper award candidate A-Track, Video Presentation) [doi] [Bibtex & Downloads]
Achieving Determinism in Adaptive AUTOSAR
Reference
Christian Menard, Andrés Goens, Marten Lohstroh, Jeronimo Castrillon, "Achieving Determinism in Adaptive AUTOSAR", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 822–827, Mar 2020. (Best paper award candidate A-Track, Video Presentation) [doi]
Abstract
AUTOSAR AP is an emerging industry standard that tackles the challenges of modern automotive software design, but does not provide adequate mechanisms to enforce deterministic execution. This poses profound challenges to testing and maintenance of the application software, which is particularly problematic for safety-critical applications. In this paper, we analyze the problem of nondeterminism in AP and propose a framework for the design of deterministic automotive software that transparently integrates with the AP communication mechanisms. We illustrate our approach in a case study based on the brake assistant demonstrator application that is provided by the AUTOSAR consortium. We show that the original implementation is nondeterministic and discuss a deterministic solution based on our framework.
Bibtex
@InProceedings{menard_date20,
author = {Christian Menard and Andr{\'e}s Goens and Marten Lohstroh and Jeronimo Castrillon},
title = {Achieving Determinism in Adaptive AUTOSAR},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
month = mar,
publisher = {IEEE},
location = {Grenoble, France},
abstract = {AUTOSAR AP is an emerging industry standard that tackles the challenges of modern automotive software design, but does not provide adequate mechanisms to enforce deterministic execution. This poses profound challenges to testing and maintenance of the application software, which is particularly problematic for safety-critical applications. In this paper, we analyze the problem of nondeterminism in AP and propose a framework for the design of deterministic automotive software that transparently integrates with the AP communication mechanisms. We illustrate our approach in a case study based on the brake assistant demonstrator application that is provided by the AUTOSAR consortium. We show that the original implementation is nondeterministic and discuss a deterministic solution based on our framework.},
isbn = {978-3-9819263-4-7},
pages = {822--827},
doi = {10.23919/DATE48585.2020.9116430},
url = {https://ieeexplore.ieee.org/abstract/document/9116430},
}Downloads
2003_Menard_DATE [PDF]
Related Paths
Permalink
- Robert Khasanov, Jeronimo Castrillon, "Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 909–914, Mar 2020. (Best paper award candidate E-Track, Video Presentation) [doi] [Bibtex & Downloads]
Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping
Reference
Robert Khasanov, Jeronimo Castrillon, "Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 909–914, Mar 2020. (Best paper award candidate E-Track, Video Presentation) [doi]
Abstract
Modern embedded computing platforms consist of a high amount of heterogeneous resources, which allows executing multiple applications on a single device. The number of running application on the system varies with time and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varied degree of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches on the overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.
Bibtex
@InProceedings{khasanov_date20,
author = {Robert Khasanov and Jeronimo Castrillon},
title = {Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
month = mar,
publisher = {IEEE},
location = {Grenoble, France},
isbn = {978-3-9819263-4-7},
pages = {909--914},
doi = {10.23919/DATE48585.2020.9116381},
url = {https://ieeexplore.ieee.org/document/9116381},
abstract = {Modern embedded computing platforms consist of a high amount of heterogeneous resources, which allows executing multiple applications on a single device. The number of running application on the system varies with time and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varied degree of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches on the overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.},
}Downloads
2003_Khasanov_DATE [PDF]
Related Paths
Permalink
- Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi] [Bibtex & Downloads]
Generalized Data Placement Strategies for Racetrack Memories
Reference
Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi]
Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.
Bibtex
@InProceedings{khan_date20,
author = {Asif Ali Khan and Andr{\'e}s Goens and Fazal Hameed and Jeronimo Castrillon},
title = {Generalized Data Placement Strategies for Racetrack Memories},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
publisher = {IEEE},
location = {Grenoble, France},
month = mar,
isbn = {978-3-9819263-4-7},
pages = {1502--1507},
doi = {10.23919/DATE48585.2020.9116245},
url = {https://ieeexplore.ieee.org/document/9116245},
abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.},
}Downloads
2003_Khan_DATE [PDF]
Related Paths
Permalink
- Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi] [Bibtex & Downloads]
Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade
Reference
Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi]
Bibtex
@Article{khan_pieee20,
author = {Robin Bl{\"a}sing and Asif Ali Khan and Panagiotis Ch. Filippou and Chirag Garg and Fazal Hameed and Jeronimo Castrillon and Stuart S. P. Parkin},
title = {Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade},
journal = {Proceedings of the IEEE},
year = {2020},
month = mar,
volume={108},
number={8},
pages={1303-1321},
doi = {10.1109/JPROC.2020.2975719},
url = {https://ieeexplore.ieee.org/document/9045991},
}Downloads
2003_Khan_JPROC [PDF]
Related Paths
Permalink
- Marten Lohstroh, Íñigo Íncer Romero, Andrés Goens, Patricia Derler, Jeronimo Castrillon, Edward A. Lee, Alberto Sangiovanni-Vincentelli, "Reactors: A Deterministic Model for Composable Reactive Systems", Cyber Physical Systems. Model-Based Design – Proceedings of the 9th Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) and the Workshop on Embedded and Cyber-Physical Systems Education (WESE 2019) (Chamberlain, Roger and Edin Grimheden, Martin and Taha, Walid), Springer International Publishing, pp. 59–85, Cham, Feb 2020. [doi] [Bibtex & Downloads]
Reactors: A Deterministic Model for Composable Reactive Systems
Reference
Marten Lohstroh, Íñigo Íncer Romero, Andrés Goens, Patricia Derler, Jeronimo Castrillon, Edward A. Lee, Alberto Sangiovanni-Vincentelli, "Reactors: A Deterministic Model for Composable Reactive Systems", Cyber Physical Systems. Model-Based Design – Proceedings of the 9th Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) and the Workshop on Embedded and Cyber-Physical Systems Education (WESE 2019) (Chamberlain, Roger and Edin Grimheden, Martin and Taha, Walid), Springer International Publishing, pp. 59–85, Cham, Feb 2020. [doi]
Abstract
This paper describes a component-based concurrent model of computation for reactive systems. The components in this model, featuring ports and hierarchy, are called reactors. The model leverages a semantic notion of time, an event scheduler, and a synchronous-reactive style of communication to achieve determinism. Reactors enable a programming model that ensures determinism, unless explicitly abandoned by the programmer. We show how the coordination of reactors can safely and transparently exploit parallelism, both in shared-memory and distributed systems.
Bibtex
@InProceedings{Lohstroh_cyphy19,
author = {Marten Lohstroh and {\'I}{\~n}igo {\'I}ncer Romero and Andr\'{e}s Goens and Patricia Derler and Jeronimo Castrillon and Edward A. Lee and Alberto Sangiovanni-Vincentelli},
title = {Reactors: A Deterministic Model for Composable Reactive Systems},
editor= {Chamberlain, Roger and Edin Grimheden, Martin and Taha, Walid},
booktitle={Cyber Physical Systems. Model-Based Design -- Proceedings of the 9th Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) and the Workshop on Embedded and Cyber-Physical Systems Education (WESE 2019)},
year = {2020},
location = {New York City, NY, USA},
month = feb,
publisher={Springer International Publishing},
address={Cham},
pages={59--85},
abstract={This paper describes a component-based concurrent model of computation for reactive systems. The components in this model, featuring ports and hierarchy, are called reactors. The model leverages a semantic notion of time, an event scheduler, and a synchronous-reactive style of communication to achieve determinism. Reactors enable a programming model that ensures determinism, unless explicitly abandoned by the programmer. We show how the coordination of reactors can safely and transparently exploit parallelism, both in shared-memory and distributed systems.},
isbn={978-3-030-41131-2},
url = {https://link.springer.com/chapter/10.1007/978-3-030-41131-2_4},
doi = {10.1007/978-3-030-41131-2_4},
numpages = {27pp},
}Downloads
1910_Lohstroh_CyPhy [PDF]
Related Paths
Permalink
- Alexander Brauckmann, Andrés Goens, Sebastian Ertel, Jeronimo Castrillon, "Compiler-Based Graph Representations for Deep Learning Models of Code", Proceedings of the 29th ACM SIGPLAN International Conference on Compiler Construction (CC 2020), Association for Computing Machinery, pp. 201–211, New York, NY, USA, Feb 2020. [doi] [Bibtex & Downloads]
Compiler-Based Graph Representations for Deep Learning Models of Code
Reference
Alexander Brauckmann, Andrés Goens, Sebastian Ertel, Jeronimo Castrillon, "Compiler-Based Graph Representations for Deep Learning Models of Code", Proceedings of the 29th ACM SIGPLAN International Conference on Compiler Construction (CC 2020), Association for Computing Machinery, pp. 201–211, New York, NY, USA, Feb 2020. [doi]
Bibtex
@InProceedings{brauckmann_cc20,
author = {Alexander Brauckmann and Andr\'{e}s Goens and Sebastian Ertel and Jeronimo Castrillon},
title = {Compiler-Based Graph Representations for Deep Learning Models of Code},
booktitle = {Proceedings of the 29th ACM SIGPLAN International Conference on Compiler Construction (CC 2020)},
year = {2020},
isbn = {9781450371209},
url = {https://doi.org/10.1145/3377555.3377894},
doi = {10.1145/3377555.3377894},
series = {CC 2020},
pages = {201–211},
numpages = {11},
publisher = {Association for Computing Machinery},
location = {San Diego, CA, USA},
month = feb,
address = {New York, NY, USA},
keywords = {conf},
}Downloads
2002_Brauckmann_CC [PDF]
Permalink
2019
- Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi] [Bibtex & Downloads]
ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0
Reference
Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi]
Abstract
Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%.
Bibtex
@Article{khan_taco19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart S. P. Parkin and Jeronimo Castrillon},
title = {ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
issue_date = {December 2019},
volume = {16},
number = {4},
month = dec,
year = {2019},
issn = {1544-3566},
pages = {56:1--56:23},
articleno = {56},
numpages = {23},
url = {http://doi.acm.org/10.1145/3372489},
doi = {10.1145/3372489},
acmid = {3372489},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5\%, outperforming the state of the art by up to 16.1\%.},
}Downloads
1912_Khan_TACO [PDF]
Related Paths
Permalink
- Fazal Hameed, Jeronimo Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 27, no. 10, pp. 2375-2386, Oct 2019. [doi] [Bibtex & Downloads]
A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement
Reference
Fazal Hameed, Jeronimo Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 27, no. 10, pp. 2375-2386, Oct 2019. [doi]
Abstract
High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78% and improves the average performance by 13% compared to a recently proposed STT- RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24% and 17.1% respectively.
Bibtex
@Article{hameed_tvlsi19,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {A Novel Hybrid {DRAM}/{STT-RAM} {L}ast-{L}evel-{C}ache Architecture for Performance, Energy and Endurance Enhancement},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2019},
month = oct,
abstract = {High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78\% and improves the average performance by 13\% compared to a recently proposed STT- RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24\% and 17.1\% respectively.},
volume = {27},
number = {10},
pages = {2375-2386},
numpages = {12pp},
doi={10.1109/TVLSI.2019.2918385},
url = {https://ieeexplore.ieee.org/document/8734763},
}Downloads
1905_Hameed_TVLSI [PDF]
Related Paths
Permalink
- Lars Schütze, Jeronimo Castrillon, "Efficient Late Binding of Dynamic Function Compositions", Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering, ACM, pp. 141–151, New York, NY, USA, Oct 2019. [doi] [Bibtex & Downloads]
Efficient Late Binding of Dynamic Function Compositions
Reference
Lars Schütze, Jeronimo Castrillon, "Efficient Late Binding of Dynamic Function Compositions", Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering, ACM, pp. 141–151, New York, NY, USA, Oct 2019. [doi]
Bibtex
@InProceedings{schuetze_sle19,
author = {Lars Sch{\"u}tze and Jeronimo Castrillon},
title = {Efficient Late Binding of Dynamic Function Compositions},
booktitle = {Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering},
year = {2019},
series = {SLE 2019},
address = {New York, NY, USA},
month = oct,
publisher = {ACM},
keywords = {conf},
location = {Athens, Greece},
isbn = {978-1-4503-6981-7},
pages = {141--151},
numpages = {11},
url = {http://doi.acm.org/10.1145/3357766.3359543},
doi = {10.1145/3357766.3359543},
acmid = {3359543},
}Downloads
1910_Schuetze_SLE [PDF]
Permalink
- Tobias Reiher, Alexander Senier, Jeronimo Castrillon, Thorsten Strufe, "RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers", In Proceeding: International Conference on Formal Aspects of Component Software (Arbab, Farhad and Jongmans, Sung-Shik), Springer International Publishing, pp. 170–190, Cham, Oct 2019. [doi] [Bibtex & Downloads]
RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers
Reference
Tobias Reiher, Alexander Senier, Jeronimo Castrillon, Thorsten Strufe, "RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers", In Proceeding: International Conference on Formal Aspects of Component Software (Arbab, Farhad and Jongmans, Sung-Shik), Springer International Publishing, pp. 170–190, Cham, Oct 2019. [doi]
Abstract
Various vulnerabilities have been found in message parsers of protocol implementations in the past. Even highly sensitive software components like TLS libraries are affected regularly. Resulting issues range from denial-of-service attacks to the extraction of sensitive information. The complexity of protocols and imprecise specifications in natural language are the core reasons for subtle bugs in implementations, which are hard to find. The lack of precise specifications impedes formal verification.
Bibtex
@InProceedings{reiher_facs19,
author = {Tobias Reiher and Alexander Senier and Jeronimo Castrillon and Thorsten Strufe},
title = {RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers},
booktitle = {International Conference on Formal Aspects of Component Software},
year = {2019},
editor = {Arbab, Farhad and Jongmans, Sung-Shik},
organization = {Springer},
publisher = {Springer International Publishing},
location = {Amsterdam, The Netherlands},
address = {Cham},
month = oct,
pages = {170--190},
numpages = {21},
abstract = {Various vulnerabilities have been found in message parsers of protocol implementations in the past. Even highly sensitive software components like TLS libraries are affected regularly. Resulting issues range from denial-of-service attacks to the extraction of sensitive information. The complexity of protocols and imprecise specifications in natural language are the core reasons for subtle bugs in implementations, which are hard to find. The lack of precise specifications impedes formal verification.},
isbn={978-3-030-40914-2},
doi = {10.1007/978-3-030-40914-2_9},
url = {https://link.springer.com/chapter/10.1007/978-3-030-40914-2_9},
}Downloads
1910_Reiher_FACS [PDF]
Permalink
- Jeronimo Castrillon, "Embedded manycore programming: From auto-parallelization to domain specific languages", In IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2019) (keynote), Oct 2019. [Bibtex & Downloads]
Embedded manycore programming: From auto-parallelization to domain specific languages
Reference
Jeronimo Castrillon, "Embedded manycore programming: From auto-parallelization to domain specific languages", In IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2019) (keynote), Oct 2019.
Abstract
Programming manycores remains a daunting task, especially in the presence of the heterogeneity and application constraints typical in the embedded domain. This talk reviews efforts to cope with this complexity from the last 10+ years of research. It starts with the challenges faced by auto-parallelizing compilers, discussing how far they have made it since the start of the multi-core era. The talk also reviews explicit parallel programming and associated programming methodologies, with focus on recent advances that aim at increasing the adaptivity and robustness of dataflow applications. The talk then advocates for even higher-level programming abstractions in the form of domain specific languages, particularly important to deal with the increased complexity brought by emerging computing paradigms.
Bibtex
@Misc{castrillon_mcsoc2019,
author = {Castrillon, Jeronimo},
title = {Embedded manycore programming: From auto-parallelization to domain specific languages},
howpublished = {IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2019) (keynote)},
month = oct,
year = {2019},
abstract = {Programming manycores remains a daunting task, especially in the presence of the heterogeneity and application constraints typical in the embedded domain. This talk reviews efforts to cope with this complexity from the last 10+ years of research. It starts with the challenges faced by auto-parallelizing compilers, discussing how far they have made it since the start of the multi-core era. The talk also reviews explicit parallel programming and associated programming methodologies, with focus on recent advances that aim at increasing the adaptivity and robustness of dataflow applications. The talk then advocates for even higher-level programming abstractions in the form of domain specific languages, particularly important to deal with the increased complexity brought by emerging computing paradigms.},
url = {http://mcsoc-forum.org/m2019/wp-content/uploads/2019/10/191002_castrillon_mcsoc-opt.pdf},
keywords = {invitedtalk},
location = {Singapore},
}Downloads
191002_castrillon_mcsoc-opt [PDF]
Permalink
- Jeronimo Castrillon, "Dataflow and higher level abstractions for parallel programming", In CPS Summer School 2019: Designing Cyber-Physical Systems - From concepts to implementation (keynote), Sep 2019. [Bibtex & Downloads]
Dataflow and higher level abstractions for parallel programming
Reference
Jeronimo Castrillon, "Dataflow and higher level abstractions for parallel programming", In CPS Summer School 2019: Designing Cyber-Physical Systems - From concepts to implementation (keynote), Sep 2019.
Abstract
Computing systems continue to increase in complexity, today including multiple cores, complex memory hierarchies and domain-specific accelerators, and soon with components built with emerging hardware technologies. This complexity calls for advances in a variety of domains, like programming and modeling languages, models of hardware, system simulators, design exploration methodologies and hardware architectures. From the standpoint of programming languages and compilers, this lecture discusses the challenges in mainstream sequential programming to motivate higher-level abstractions. It then provides an introduction to dataflow programming methodologies as a promising solution for embedded applications. We will review the fundamentals of dataflow models of computation, basic programming methodologies and look at current research to account for the adaptivity that new applications require, especially in the context of cyber physical systems. The lecture closes with an outlook on higher level programming abstractions and challenges posed by emerging computing architectures.
Bibtex
@Misc{castrillon_cpss19,
author = {Castrillon, Jeronimo},
title = {Dataflow and higher level abstractions for parallel programming},
howpublished = {CPS Summer School 2019: {Designing Cyber-Physical Systems - From concepts to implementation (keynote)}},
month = sep,
year = {2019},
abstract = {Computing systems continue to increase in complexity, today including multiple cores, complex memory hierarchies and domain-specific accelerators, and soon with components built with emerging hardware technologies. This complexity calls for advances in a variety of domains, like programming and modeling languages, models of hardware, system simulators, design exploration methodologies and hardware architectures. From the standpoint of programming languages and compilers, this lecture discusses the challenges in mainstream sequential programming to motivate higher-level abstractions. It then provides an introduction to dataflow programming methodologies as a promising solution for embedded applications. We will review the fundamentals of dataflow models of computation, basic programming methodologies and look at current research to account for the adaptivity that new applications require, especially in the context of cyber physical systems. The lecture closes with an outlook on higher level programming abstractions and challenges posed by emerging computing architectures.},
keywords = {invitedtalk},
location = {Alghero, Sardinia, Italy},
project = {cfaed, haec},
url = {http://www.cpsschool.eu/dataflow-and-higher-level-abstractions-for-parallel-programming/}
}Downloads
No Downloads available for this publication
Related Paths
Permalink
- Sebastian Ertel, Justus Adam, Norman A. Rink, Andrés Goens, Jeronimo Castrillon, "STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism", Proceedings of the 12th ACM SIGPLAN International Symposium on Haskell, ACM, pp. 146–161, New York, NY, USA, Aug 2019. [doi] [Bibtex & Downloads]
STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism
Reference
Sebastian Ertel, Justus Adam, Norman A. Rink, Andrés Goens, Jeronimo Castrillon, "STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism", Proceedings of the 12th ACM SIGPLAN International Symposium on Haskell, ACM, pp. 146–161, New York, NY, USA, Aug 2019. [doi]
Abstract
Dataflow execution models are used to build highly scalable parallel systems. A programming model that targets parallel dataflow execution must answer the following question: How can parallelism between two dependent nodes in a dataflow graph be exploited? This is difficult when the dataflow language or programming model is implemented by a monad, as is common in the functional community, since expressing dependence between nodes by a monadic bind suggests sequential execution.
Even in monadic constructs that explicitly separate state from computation, problems arise due to the need to reason about opaquely defined state. Specifically, when abstractions of the chosen programming model do not enable adequate reasoning about state, it is difficult to detect parallelism between composed stateful computations.
In this paper, we propose a programming model that enables the composition of stateful computations and still exposes opportunities for parallelization. We also introduce smap, a higher-order function that can exploit parallelism in stateful computations. We present an implementation of our programming model and smap in Haskell and show that basic concepts from functional reactive programming can be built on top of our programming model with little effort. We compare these implementations to a state-of-the-art approach using monad-par and LVars to expose parallelism explicitly and reach the same level of performance, showing that our programming model successfully extracts parallelism that is present in an algorithm. Further evaluation shows that smap is expressive enough to implement parallel reductions and our programming model resolves short-comings of the stream-based programming model for current state-of-the-art big data processing systems.Bibtex
@InProceedings{ertel_haskell19,
author = {Ertel, Sebastian and Adam, Justus and Rink, Norman A. and Goens, Andr{\'e}s and Castrillon, Jeronimo},
title = {{STCLang}: State Thread Composition as a Foundation for Monadic Dataflow Parallelism},
booktitle = {Proceedings of the 12th ACM SIGPLAN International Symposium on Haskell},
year = {2019},
series = {Haskell 2019},
pages = {146--161},
address = {New York, NY, USA},
month = aug,
publisher = {ACM},
abstract = {Dataflow execution models are used to build highly scalable parallel systems. A programming model that targets parallel dataflow execution must answer the following question: How can parallelism between two dependent nodes in a dataflow graph be exploited? This is difficult when the dataflow language or programming model is implemented by a monad, as is common in the functional community, since expressing dependence between nodes by a monadic bind suggests sequential execution.
Even in monadic constructs that explicitly separate state from computation, problems arise due to the need to reason about opaquely defined state. Specifically, when abstractions of the chosen programming model do not enable adequate reasoning about state, it is difficult to detect parallelism between composed stateful computations.
In this paper, we propose a programming model that enables the composition of stateful computations and still exposes opportunities for parallelization. We also introduce smap, a higher-order function that can exploit parallelism in stateful computations. We present an implementation of our programming model and smap in Haskell and show that basic concepts from functional reactive programming can be built on top of our programming model with little effort. We compare these implementations to a state-of-the-art approach using monad-par and LVars to expose parallelism explicitly and reach the same level of performance, showing that our programming model successfully extracts parallelism that is present in an algorithm. Further evaluation shows that smap is expressive enough to implement parallel reductions and our programming model resolves short-comings of the stream-based programming model for current state-of-the-art big data processing systems.},
acmid = {3342600},
doi = {10.1145/3331545.3342600},
isbn = {978-1-4503-6813-1},
keywords = {conf},
location = {Berlin, Germany},
numpages = {16},
url = {http://doi.acm.org/10.1145/3331545.3342600}
}Downloads
1908_Ertel_Haskell [PDF]
Related Paths
Permalink
- Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6pp, New York, NY, USA, Jul 2019. [doi] [Bibtex & Downloads]
SHRIMP: Efficient Instruction Delivery with Domain Wall Memory
Reference
Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6pp, New York, NY, USA, Jul 2019. [doi]
Bibtex
@InProceedings{multanen_islped19,
author = {Joonas Multanen and Asif Ali Khan and Pekka J{\"a}{\"a}skel{\"a}inen and Fazal Hameed and Jeronimo Castrillon},
title = {{SHRIMP}: Efficient Instruction Delivery with Domain Wall Memory},
booktitle = {Proceedings of the International Symposium on Low Power Electronics and Design},
year = {2019},
month = jul,
series = {ISLPED '19},
location = {Lausanne, Switzerland},
pages = {6pp},
numpages = {6},
publisher = {ACM},
address = {New York, NY, USA},
doi={10.1109/ISLPED.2019.8824954},
url = {https://ieeexplore.ieee.org/document/8824954},
}Downloads
1907_Multanen_ISLPED [PDF]
Related Paths
Permalink
- Andrés Goens, Christian Menard, Jeronimo Castrillon, "On Compact Mappings for Multicore Systems", Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS) (D. Pnevmatikatos and M. Pelcat and M. Jung), Springer, Cham, vol. 11733, pp. 325–335, Jul 2019. [doi] [Bibtex & Downloads]
On Compact Mappings for Multicore Systems
Reference
Andrés Goens, Christian Menard, Jeronimo Castrillon, "On Compact Mappings for Multicore Systems", Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS) (D. Pnevmatikatos and M. Pelcat and M. Jung), Springer, Cham, vol. 11733, pp. 325–335, Jul 2019. [doi]
Bibtex
@InProceedings{goens_samos19,
author = {Andr{\'e}s Goens and Christian Menard and Jeronimo Castrillon},
title = {On Compact Mappings for Multicore Systems},
booktitle = {Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS)},
year = {2019},
editor = {D. Pnevmatikatos and M. Pelcat and M. Jung},
volume = {11733},
pages = {325--335},
month = jul,
organization = {IEEE},
publisher = {Springer, Cham},
doi = {10.1007/978-3-030-27562-4_23},
isbn = {978-3-030-27561-7},
location = {Pythagorion, Greece},
numpages = {11},
url = {https://link.springer.com/chapter/10.1007/978-3-030-27562-4_23}
}Downloads
1907_Goens_SAMOS [PDF]
Related Paths
Permalink
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi] [Bibtex & Downloads]
Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads
Reference
Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi]
Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.
Bibtex
@InProceedings{kahn_lctes19,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads},
booktitle = {Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES)},
series = {LCTES 2019},
pages = {5--18},
numpages = {12},
numpages = {14},
isbn = {978-1-4503-6724-0/19/06},
doi = {10.1145/3316482.3326351},
url = {http://doi.acm.org/10.1145/3316482.3326351},
acmid = {3326351},
year = {2019},
month = jun,
location = {Phoenix, AZ, USA},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.},
acmid = {3326351},
}Downloads
1906_Khan_LCTES [PDF]
Related Paths
Permalink
- Norman A. Rink, Jeronimo Castrillon, "TeIL: a type-safe imperative Tensor Intermediate Language", Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, pp. 57–68, New York, NY, USA, Jun 2019. [doi] [Bibtex & Downloads]
TeIL: a type-safe imperative Tensor Intermediate Language
Reference
Norman A. Rink, Jeronimo Castrillon, "TeIL: a type-safe imperative Tensor Intermediate Language", Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, pp. 57–68, New York, NY, USA, Jun 2019. [doi]
Abstract
Each of the popular tensor frameworks from the machine learning domain comes with its own language for expressing tensor kernels. Since these tensor languages lack precise specifications, it is impossible to understand and reason about tensor kernels that exhibit unexpected behaviour. In this paper, we give examples of such kernels.
The tensor languages are superficially similar to the well-known functional array languages, for which formal definitions often exist. However, the tensor languages are inherently imperative. In this paper we present TeIL, an imperative tensor intermediate language with precise formal semantics. For the popular tensor languages, TeIL can serve as a common ground on the basis of which precise reasoning about kernels becomes possible. Based on TeIL's formal semantics we develop a type-safety result in the Coq proof assistant.Bibtex
@InProceedings{rink_array19,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {{TeIL}: a type-safe imperative {Tensor Intermediate Language}},
booktitle = {Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY)},
year = {2019},
series = {ARRAY 2019},
pages = {57--68},
address = {New York, NY, USA},
month = jun,
publisher = {ACM},
doi = {10.1145/3315454.3329959},
url = {http://doi.acm.org/10.1145/3315454.3329959},
acmid = {3329959},
isbn = {978-1-4503-6717-2/19/06},
location = {Phoenix, AZ, USA},
numpages = {12},
abstract = {Each of the popular tensor frameworks from the machine learning domain comes with its own language for expressing tensor kernels. Since these tensor languages lack precise specifications, it is impossible to understand and reason about tensor kernels that exhibit unexpected behaviour. In this paper, we give examples of such kernels.
The tensor languages are superficially similar to the well-known functional array languages, for which formal definitions often exist. However, the tensor languages are inherently imperative. In this paper we present TeIL, an imperative tensor intermediate language with precise formal semantics. For the popular tensor languages, TeIL can serve as a common ground on the basis of which precise reasoning about kernels becomes possible. Based on TeIL's formal semantics we develop a type-safety result in the Coq proof assistant.},
}Downloads
1906_Rink_Array [PDF]
Related Paths
Permalink
- Andrés Goens, Alexander Brauckmann, Sebastian Ertel, Chris Cummins, Hugh Leather, Jeronimo Castrillon, "A Case Study on Machine Learning for Synthesizing Benchmarks", Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), ACM, pp. 38–46, New York, NY, USA, Jun 2019. [doi] [Bibtex & Downloads]
A Case Study on Machine Learning for Synthesizing Benchmarks
Reference
Andrés Goens, Alexander Brauckmann, Sebastian Ertel, Chris Cummins, Hugh Leather, Jeronimo Castrillon, "A Case Study on Machine Learning for Synthesizing Benchmarks", Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), ACM, pp. 38–46, New York, NY, USA, Jun 2019. [doi]
Abstract
Good benchmarks are hard to find because they require a substantial effort to keep them representative for the constantly changing challenges of a particular field. Synthetic benchmarks are a common approach to deal with this, and methods from machine learning are natural candidates for synthetic benchmark generation. In this paper we investigate the usefulness of machine learning in the prominent CLgen benchmark generator. We re-evaluate CLgen by comparing the benchmarks generated by the model with the raw data used to train it. This re-evaluation indicates that, for the use case considered, machine learning did not yield additional benefit over a simpler method using the raw data. We investigate the reasons for this and provide further insights into the challenges the problem could pose for potential future generators.
Bibtex
@InProceedings{goens_mapl19,
author = {Andr\'{e}s Goens and Alexander Brauckmann and Sebastian Ertel and Chris Cummins and Hugh Leather and Jeronimo Castrillon},
title = {A Case Study on Machine Learning for Synthesizing Benchmarks},
booktitle = {Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL)},
year = {2019},
series = {MAPL 2019},
doi = {10.1145/3315508.3329976},
url = {http://doi.acm.org/10.1145/3315508.3329976},
acmid = {3329976},
isbn = {978-1-4503-6719-6/19/06},
pages = {38--46},
address = {New York, NY, USA},
month = jun,
publisher = {ACM},
keywords = {conf},
location = {Phoenix, AZ, USA},
numpages = {9},
abstract = {Good benchmarks are hard to find because they require a substantial effort to keep them representative for the constantly changing challenges of a particular field. Synthetic benchmarks are a common approach to deal with this, and methods from machine learning are natural candidates for synthetic benchmark generation. In this paper we investigate the usefulness of machine learning in the prominent CLgen benchmark generator. We re-evaluate CLgen by comparing the benchmarks generated by the model with the raw data used to train it. This re-evaluation indicates that, for the use case considered, machine learning did not yield additional benefit over a simpler method using the raw data. We investigate the reasons for this and provide further insights into the challenges the problem could pose for potential future generators.},
}Downloads
1906_Goens_MAPL [PDF]
Permalink
- Jeronimo Castrillon, "SoC programming in the era of the Internet of Things, machine learning and emerging technologies", In Groupement De Recherche SOC2: System On Chip, Systèmes embarqués et Objets Connecté (keynote), Jun 2019. [Bibtex & Downloads]
SoC programming in the era of the Internet of Things, machine learning and emerging technologies
Reference
Jeronimo Castrillon, "SoC programming in the era of the Internet of Things, machine learning and emerging technologies", In Groupement De Recherche SOC2: System On Chip, Systèmes embarqués et Objets Connecté (keynote), Jun 2019.
Abstract
The design of a system on chip has traditionally been one of the most complex tasks in computing
systems. Designers have to deal with stringent application constraints under a low-power budget while
reducing non-recurring engineering costs. Modeling languages, costs models of hardware, system
simulators, design exploration methodologies and alike have made it possible to cope with this high
complexity. Today, three recent trends represent a non trivial complexity increase and thus a challenge
for SoC designers and programmers, namely, 1) additional system dynamics in the context of the
Internet of Things, 2) the ubiquity of machine learning workloads, and 3) the added complexity brought
by specialization and emerging technologies. This talk discusses how models and higher level
programming abstractions can be leveraged to cope with these trends. A dataflow programming
methodology is extended to account for dynamic execution scenarios at runtime. A tensor abstraction,
common in machine learning, is introduced that eases programming and design tasks. Finally, the
talk shows how the tensor abstraction is useful to efficiently map tensor computations to SoCs with
non-volatile racetrack scratch-pad memory.Bibtex
@Misc{castrillon_gdrsoc2019,
author = {Castrillon, Jeronimo},
title = {SoC programming in the era of the Internet of Things, machine learning and emerging technologies},
howpublished = {Groupement De Recherche SOC2: System On Chip, Syst{\`e}mes embarqu{\'e}s et Objets Connect{\'e} (keynote)},
month = jun,
year = {2019},
abstract = {The design of a system on chip has traditionally been one of the most complex tasks in computing
systems. Designers have to deal with stringent application constraints under a low-power budget while
reducing non-recurring engineering costs. Modeling languages, costs models of hardware, system
simulators, design exploration methodologies and alike have made it possible to cope with this high
complexity. Today, three recent trends represent a non trivial complexity increase and thus a challenge
for SoC designers and programmers, namely, 1) additional system dynamics in the context of the
Internet of Things, 2) the ubiquity of machine learning workloads, and 3) the added complexity brought
by specialization and emerging technologies. This talk discusses how models and higher level
programming abstractions can be leveraged to cope with these trends. A dataflow programming
methodology is extended to account for dynamic execution scenarios at runtime. A tensor abstraction,
common in machine learning, is introduced that eases programming and design tasks. Finally, the
talk shows how the tensor abstraction is useful to efficiently map tensor computations to SoCs with
non-volatile racetrack scratch-pad memory.},
location = {Montpellier, France},
url = {http://www.gdr-soc.cnrs.fr/programme-colloque-2019/}
}Downloads
190620_castrillon_gdrsoc2-lowres [PDF]
Permalink
- Sebastian Ertel, Justus Adam, Norman A. Rink, Andrés Goens, Jeronimo Castrillon, "Category-Theoretic Foundations of ``STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism''", In CoRR, vol. abs/1906.12098, Jun 2019. [Bibtex & Downloads]
Category-Theoretic Foundations of ``STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism''
Reference
Sebastian Ertel, Justus Adam, Norman A. Rink, Andrés Goens, Jeronimo Castrillon, "Category-Theoretic Foundations of ``STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism''", In CoRR, vol. abs/1906.12098, Jun 2019.
Bibtex
@Article{ertel_haskellsup19,
author = {Sebastian Ertel and Justus Adam and Norman A. Rink and Andr{\'{e}}s Goens and Jeronimo Castrillon},
title = {Category-Theoretic Foundations of ``STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism''},
journal = {CoRR},
year = {2019},
volume = {abs/1906.12098},
month = jun,
archiveprefix = {arXiv},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1906-12098},
eprint = {1906.12098},
url = {http://arxiv.org/abs/1906.12098}
}Downloads
1906_Ertel_Haskellsupp [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Programming abstractions: When domain-specific goes mainstream", In 28th Workshop of the Gesellschaft für Informatik, interest group Parallele Algorithmen, Rechenstrukturen und Systemsoftware (PARS'19) (invited talk), Mar 2019. [Bibtex & Downloads]
Programming abstractions: When domain-specific goes mainstream
Reference
Jeronimo Castrillon, "Programming abstractions: When domain-specific goes mainstream", In 28th Workshop of the Gesellschaft für Informatik, interest group Parallele Algorithmen, Rechenstrukturen und Systemsoftware (PARS'19) (invited talk), Mar 2019.
Abstract
We have seen several inflection points in computing in this century: from single to multi-core,
from homogeneous to heterogeneous, and in the near future to fundamentally new computing
paradigms with emerging technologies. With domain-specific hardware becoming mainstream,
and programming reaching out to professions outside computer science, higher-level
programming abstractions, domain-specific languages (DSLs) and tools are badly needed. This
talk provides examples of such programming abstractions as a basis for discussion. It first
discusses how dataflow programming models from the embedded domain can be leveraged in
more general purpose setups. It then presents DSLs for particle-based simulations and for tensor
expressions. The latter is one example of multiple tensor DSLs available today, spawned
by the recent machine learning boom. The talk closes with examples of emerging technologies
and a brief discussion about how they may impact our current assumptions.Bibtex
@Misc{castrillon_pars2019,
author = {Castrillon, Jeronimo},
title = {Programming abstractions: When domain-specific goes mainstream},
howpublished = {28th Workshop of the Gesellschaft f{\"u}r Informatik, interest group Parallele Algorithmen, Rechenstrukturen und Systemsoftware (PARS'19) (invited talk)},
month = mar,
year = {2019},
abstract = {We have seen several inflection points in computing in this century: from single to multi-core,
from homogeneous to heterogeneous, and in the near future to fundamentally new computing
paradigms with emerging technologies. With domain-specific hardware becoming mainstream,
and programming reaching out to professions outside computer science, higher-level
programming abstractions, domain-specific languages (DSLs) and tools are badly needed. This
talk provides examples of such programming abstractions as a basis for discussion. It first
discusses how dataflow programming models from the embedded domain can be leveraged in
more general purpose setups. It then presents DSLs for particle-based simulations and for tensor
expressions. The latter is one example of multiple tensor DSLs available today, spawned
by the recent machine learning boom. The talk closes with examples of emerging technologies
and a brief discussion about how they may impact our current assumptions.},
location = {Berlin, Germany},
url = {https://fg-pars.gi.de/veranstaltung/pars-workshop-2019/}
}Downloads
190321_castrill_PARS-sent [PDF]
Permalink
- Gerhard Fettweis, Meik Dörpinghaus, Jeronimo Castrillon, Akash Kumar, Christel Baier, Karlheinz Bock, Frank Ellinger, Andreas Fery, Frank H. P. Fitzek, Hermann Härtig, Kambiz Jamshidi, Thomas Kissinger, Wolfgang Lehner, Michael Mertig, Wolfgang E. Nagel, Giang T. Nguyen, Dirk Plettemeier, Michael Schröter, Thorsten Strufe, "Architecture and Advanced Electronics Pathways Toward Highly Adaptive Energy-Efficient Computing", In Proceedings of the IEEE, vol. 107, no. 1, pp. 204–231, Jan 2019. [doi] [Bibtex & Downloads]
Architecture and Advanced Electronics Pathways Toward Highly Adaptive Energy-Efficient Computing
Reference
Gerhard Fettweis, Meik Dörpinghaus, Jeronimo Castrillon, Akash Kumar, Christel Baier, Karlheinz Bock, Frank Ellinger, Andreas Fery, Frank H. P. Fitzek, Hermann Härtig, Kambiz Jamshidi, Thomas Kissinger, Wolfgang Lehner, Michael Mertig, Wolfgang E. Nagel, Giang T. Nguyen, Dirk Plettemeier, Michael Schröter, Thorsten Strufe, "Architecture and Advanced Electronics Pathways Toward Highly Adaptive Energy-Efficient Computing", In Proceedings of the IEEE, vol. 107, no. 1, pp. 204–231, Jan 2019. [doi]
Bibtex
@Article{fettweis_ieeeproc19,
author = {Gerhard Fettweis and Meik D{\"o}rpinghaus and Jeronimo Castrillon and Akash Kumar and Christel Baier and Karlheinz Bock and Frank Ellinger and Andreas Fery and Frank H. P. Fitzek and Hermann H{\"a}rtig and Kambiz Jamshidi and Thomas Kissinger and Wolfgang Lehner and Michael Mertig and Wolfgang E. Nagel and Giang T. Nguyen and Dirk Plettemeier and Michael Schr{\"o}ter and Thorsten Strufe},
title = {Architecture and Advanced Electronics Pathways Toward Highly Adaptive Energy-Efficient Computing},
journal = {Proceedings of the IEEE},
year = {2019},
volume = {107},
number = {1},
pages = {204--231},
month = jan,
doi = {10.1109/JPROC.2018.2874895},
issn = {0018-9219},
url = {https://ieeexplore.ieee.org/document/8565890}
}Downloads
1812_Fettweis_IEEEProc [PDF]
Related Paths
Permalink
- Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019. [doi] [Bibtex & Downloads]
RTSim: A Cycle-accurate Simulator for Racetrack Memories
Reference
Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019. [doi]
Bibtex
@Article{khan_ieeecal19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart Parkin and Jeronimo Castrillon},
title = {{RTS}im: A Cycle-accurate Simulator for Racetrack Memories},
journal = {IEEE Computer Architecture Letters},
year = {2019},
volume = {18},
number = {1},
pages = {43--46},
month = jan,
doi = {10.1109/LCA.2019.2899306},
issn = {1556-6056},
publisher = {IEEE},
url = {https://ieeexplore.ieee.org/document/8642352}
}Downloads
1902_khan_IEEECAL [PDF]
Related Paths
Permalink
- Hasna Bouraoui, Jeronimo Castrillon, Chadlia Jerad, "Comparing Dataflow and OpenMP Programming for Speaker Recognition Applications", Proceedings of the 10th Workshop and 8th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'19), co-located with 14th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 4:1–4:6, New York, NY, USA, Jan 2019. [doi] [Bibtex & Downloads]
Comparing Dataflow and OpenMP Programming for Speaker Recognition Applications
Reference
Hasna Bouraoui, Jeronimo Castrillon, Chadlia Jerad, "Comparing Dataflow and OpenMP Programming for Speaker Recognition Applications", Proceedings of the 10th Workshop and 8th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'19), co-located with 14th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 4:1–4:6, New York, NY, USA, Jan 2019. [doi]
Bibtex
@InProceedings{bouraoui_parma19,
author = {Hasna Bouraoui and Jeronimo Castrillon and Chadlia Jerad},
title = {Comparing Dataflow and OpenMP Programming for Speaker Recognition Applications},
booktitle = {Proceedings of the 10th Workshop and 8th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'19), co-located with 14th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
year = {2019},
series = {PARMA-DITAM 2019},
pages = {4:1--4:6},
articleno = {4},
numpages = {6},
address = {New York, NY, USA},
month = jan,
publisher = {ACM},
isbn = {978-1-4503-6321-1},
url = {http://doi.acm.org/10.1145/3310411.3310417},
doi = {10.1145/3310411.3310417},
acmid = {3310417},
location = {Valencia, Spain},
numpages = {6}
}Downloads
1901_Bouraoui_PARMA [PDF]
Related Paths
Permalink
2018
- Adilla Susungi, Norman A. Rink, Albert Cohen, Jeronimo Castrillon, Claude Tadonki, "Meta-programming for Cross-Domain Tensor Optimizations", Proceedings of 17th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'18), ACM, pp. 79–92, New York, NY, USA, Nov 2018. [doi] [Bibtex & Downloads]
Meta-programming for Cross-Domain Tensor Optimizations
Reference
Adilla Susungi, Norman A. Rink, Albert Cohen, Jeronimo Castrillon, Claude Tadonki, "Meta-programming for Cross-Domain Tensor Optimizations", Proceedings of 17th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'18), ACM, pp. 79–92, New York, NY, USA, Nov 2018. [doi]
Bibtex
@InProceedings{rink_gpce18,
author = {Adilla Susungi and Norman A. Rink and Albert Cohen and Jeronimo Castrillon and Claude Tadonki},
title = {Meta-programming for Cross-Domain Tensor Optimizations},
booktitle = {Proceedings of 17th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'18)},
year = {2018},
series = {GPCE 2018},
pages = {79--92},
numpages = {14},
address = {New York, NY, USA},
month = nov,
publisher = {ACM},
keywords = {conf},
location = {Boston, MA, USA},
isbn = {978-1-4503-6045-6},
url = {http://doi.acm.org/10.1145/3278122.3278131},
doi = {10.1145/3278122.3278131},
acmid = {3278131},
}Downloads
1811_Rink_GPCE [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Parallel programming methodologies for manycores", In NeXtream Solution Seminar & Silexica Technology Workshop (invited talk), Oct 2018. [Bibtex & Downloads]
Parallel programming methodologies for manycores
Reference
Jeronimo Castrillon, "Parallel programming methodologies for manycores", In NeXtream Solution Seminar & Silexica Technology Workshop (invited talk), Oct 2018.
Bibtex
@Misc{castrillon_neXtream2018,
author = {Castrillon, Jeronimo},
title = {Parallel programming methodologies for manycores},
howpublished = {NeXtream Solution Seminar \& Silexica Technology Workshop (invited talk)},
month = oct,
year = {2018},
keywords = {invitedtalk},
location = {Tokyo, Japan},
url = {https://nextream.bz/nss/2018/?page_id=29}
}Downloads
181017_castrill_slx-tokyo_sent-2 [PDF]
Permalink
- Rainer Leupers, Miguel A. Aguilar, Jeronimo Castrillon, Weihua Sheng, "Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems", Chapter in Handbook of Signal Processing Systems (3rd Edition) (Bhattacharyya, Shuvra S. and Deprettere, Ed F. and Leupers, Rainer and Takala, Jarmo), Springer New York, pp. 1021–1062, Sep 2018. [doi] [Bibtex & Downloads]
Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems
Reference
Rainer Leupers, Miguel A. Aguilar, Jeronimo Castrillon, Weihua Sheng, "Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems", Chapter in Handbook of Signal Processing Systems (3rd Edition) (Bhattacharyya, Shuvra S. and Deprettere, Ed F. and Leupers, Rainer and Takala, Jarmo), Springer New York, pp. 1021–1062, Sep 2018. [doi]
Abstract
The increasing demands of modern embedded systems, such as high-performance and energy-efficiency, have motivated the use of heterogeneous multi-core platforms enabled by Multiprocessor System-on-Chips (MPSoCs). To fully exploit the power of these platforms, new tools are needed to address the increasing software complexity to achieve a high productivity. An MPSoC compiler is a tool-chain to tackle the problems of application modeling, platform description, software parallelization, software distribution and code generation for an efficient usage of the target platform. This chapter discusses various aspects of compilers for heterogeneous embedded multi-core systems, using the well-established single-core C compiler technology as a baseline for comparison. After a brief introduction to the MPSoC compiler technology, the important ingredients of the compilation process are explained in detail. Finally, a number of case studies from academia and industry are presented to illustrate the concepts discussed in this chapter.
Bibtex
@InCollection{leupers18_spschapter,
author = {Leupers, Rainer and Aguilar, Miguel A. and Castrillon, Jeronimo and Sheng, Weihua},
title = {Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems},
booktitle = {Handbook of Signal Processing Systems (3rd Edition)},
publisher = {Springer New York},
year = {2018},
month = sep,
editor = {Bhattacharyya, Shuvra S. and Deprettere, Ed F. and Leupers, Rainer and Takala, Jarmo},
pages = {1021--1062},
abstract = {The increasing demands of modern embedded systems, such as high-performance and energy-efficiency, have motivated the use of heterogeneous multi-core platforms enabled by Multiprocessor System-on-Chips (MPSoCs). To fully exploit the power of these platforms, new tools are needed to address the increasing software complexity to achieve a high productivity. An MPSoC compiler is a tool-chain to tackle the problems of application modeling, platform description, software parallelization, software distribution and code generation for an efficient usage of the target platform. This chapter discusses various aspects of compilers for heterogeneous embedded multi-core systems, using the well-established single-core C compiler technology as a baseline for comparison. After a brief introduction to the MPSoC compiler technology, the important ingredients of the compilation process are explained in detail. Finally, a number of case studies from academia and industry are presented to illustrate the concepts discussed in this chapter.},
doi = {10.1007/978-3-319-91734-4_28},
isbn = {978-3-319-91733-7},
url = {https://link.springer.com/chapter/10.1007/978-3-319-91734-4_28},
}Downloads
1809_Leupers_SPSBookChapter [PDF]
Permalink
- Andrés Goens, Christian Menard, Jeronimo Castrillon, "On the Representation of Mappings to Multicores", Proceedings of the IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-18), pp. 184–191, Vietnam National University, Hanoi, Vietnam, Sep 2018. [doi] [Bibtex & Downloads]
On the Representation of Mappings to Multicores
Reference
Andrés Goens, Christian Menard, Jeronimo Castrillon, "On the Representation of Mappings to Multicores", Proceedings of the IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-18), pp. 184–191, Vietnam National University, Hanoi, Vietnam, Sep 2018. [doi]
Abstract
Application requirements for embedded systems are growing rapidly, as is the complexity of systems designed to execute them. A common abstraction used to tame this growing complexity is that of a mapping, which assigns parts of an application to different hardware resources. Modern flows need to explore an intractably large design space of mappings, and be able to quickly find near-optimal mappings for different objectives, sometimes at runtime. With systems featuring thousands of cores in the near horizon, we need methods to make this exploration step truly scalable. In this paper we argue that the mathematical representation of a mapping is central to achieve this. We present different representations and how these could be applied to different contexts and objectives, like complex design- space exploration meta-heuristics or efficient runtime systems.
Bibtex
@InProceedings{goen_mcsoc18,
author = {Andr\'{e}s Goens and Christian Menard and Jeronimo Castrillon},
title = {On the Representation of Mappings to Multicores},
booktitle = {Proceedings of the IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-18)},
year = {2018},
address = {Vietnam National University, Hanoi, Vietnam},
month = sep,
pages = {184--191},
doi = {10.1109/MCSoC2018.2018.00039},
url = {https://ieeexplore.ieee.org/document/8540232},
isbn = {978-1-5386-6689-0/18/},
abstract = {Application requirements for embedded systems are growing rapidly, as is the complexity of systems designed to execute them. A common abstraction used to tame this growing complexity is that of a mapping, which assigns parts of an application to different hardware resources. Modern flows need to explore an intractably large design space of mappings, and be able to quickly find near-optimal mappings for different objectives, sometimes at runtime. With systems featuring thousands of cores in the near horizon, we need methods to make this exploration step truly scalable. In this paper we argue that the mathematical representation of a mapping is central to achieve this. We present different representations and how these could be applied to different contexts and objectives, like complex design- space exploration meta-heuristics or efficient runtime systems.},
}Downloads
1809_Goens_MCSoC [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, Matthias Lieber, Sascha Klüppelholz, Marcus Völp, Nils Asmussen, Uwe Assmann, Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andrés Goens, Sebastian Haas, Dirk Habich, Hermann Härtig, Mattis Hasler, Immo Huismann, Tomas Karnagel, Sven Karol, Akash Kumar, Wolfgang Lehner, Linda Leuschner, Siqi Ling, Steffen Märcker, Christian Menard, Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Peñaloza, Michael Raitza, Jörg Stiller, Annett Ungethüm, Axel Voigt, Sascha Wunderlich, "A Hardware/Software Stack for Heterogeneous Systems", In IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 243-259, Jul 2018. [doi] [Bibtex & Downloads]
A Hardware/Software Stack for Heterogeneous Systems
Reference
Jeronimo Castrillon, Matthias Lieber, Sascha Klüppelholz, Marcus Völp, Nils Asmussen, Uwe Assmann, Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andrés Goens, Sebastian Haas, Dirk Habich, Hermann Härtig, Mattis Hasler, Immo Huismann, Tomas Karnagel, Sven Karol, Akash Kumar, Wolfgang Lehner, Linda Leuschner, Siqi Ling, Steffen Märcker, Christian Menard, Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Peñaloza, Michael Raitza, Jörg Stiller, Annett Ungethüm, Axel Voigt, Sascha Wunderlich, "A Hardware/Software Stack for Heterogeneous Systems", In IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 243-259, Jul 2018. [doi]
Abstract
Plenty of novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of different new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making it ever so difficult to program them. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue why we think that these principles built into the stack and the interfaces among the layers will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.
Bibtex
@Article{castrillon_tmscs17,
author = {Jeronimo Castrillon and Matthias Lieber and Sascha Kl{\"u}ppelholz and Marcus V{\"o}lp and Nils Asmussen and Uwe Assmann and Franz Baader and Christel Baier and Gerhard Fettweis and Jochen Fr{\"o}hlich and Andr\'{e}s Goens and Sebastian Haas and Dirk Habich and Hermann H{\"a}rtig and Mattis Hasler and Immo Huismann and Tomas Karnagel and Sven Karol and Akash Kumar and Wolfgang Lehner and Linda Leuschner and Siqi Ling and Steffen M{\"a}rcker and Christian Menard and Johannes Mey and Wolfgang Nagel and Benedikt N{\"o}then and Rafael Pe{\~n}aloza and Michael Raitza and J{\"o}rg Stiller and Annett Ungeth{\"u}m and Axel Voigt and Sascha Wunderlich},
title = {A Hardware/Software Stack for Heterogeneous Systems},
journal = {IEEE Transactions on Multi-Scale Computing Systems},
year = {2018},
month = jul,
volume={4},
number={3},
pages={243-259},
abstract = {Plenty of novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of different new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making it ever so difficult to program them. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue why we think that these principles built into the stack and the interfaces among the layers will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.},
doi = {10.1109/TMSCS.2017.2771750},
issn = {2332-7766},
url = {http://ieeexplore.ieee.org/document/8103042/}
}Downloads
1711_Castrillon_TMSCS [PDF]
Related Paths
Permalink
- Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018. [doi] [Bibtex & Downloads]
Performance and Energy Efficient Design of STT-RAM Last-Level-Cache
Reference
Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018. [doi]
Abstract
Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75% and improve the system performance by 6.5%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82% and improves system performance by 6.8%.
Bibtex
@Article{hameed_tvlsi18,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Performance and Energy Efficient Design of STT-RAM Last-Level-Cache},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2018},
volume = {26},
number = {6},
pages = {1059--1072},
month = jun,
abstract = {Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75\% and improve the system performance by 6.5\%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82\% and improves system performance by 6.8\%.},
doi = {10.1109/TVLSI.2018.2804938},
file = {:/Users/jeronimocastrillon/Documents/Academic/mypapers/1803_Hameed_TVLSI.pdf:PDF},
issn = {1063-8210},
numpages = {14},
url = {http://ieeexplore.ieee.org/document/8307465/}
}Downloads
1803_Hameed_TVLSI [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Heterogeneous Post-CMOS Technologies Meet Software", In Post Moore Interconnects Workshop, ISC High Performance 2018 (invited talk), Jun 2018. [Bibtex & Downloads]
Heterogeneous Post-CMOS Technologies Meet Software
Reference
Jeronimo Castrillon, "Heterogeneous Post-CMOS Technologies Meet Software", In Post Moore Interconnects Workshop, ISC High Performance 2018 (invited talk), Jun 2018.
Bibtex
@Misc{castrillon2018ISC,
author = {Castrillon, Jeronimo},
title = {Heterogeneous Post-CMOS Technologies Meet Software},
howpublished = {Post Moore Interconnects Workshop, ISC High Performance 2018 (invited talk)},
month = jun,
year = {2018},
keywords = {invitedtalk},
location = {Frankfurt, Germany},
url = {https://beyondcmos.ornl.gov/2018/agenda.html}
}Downloads
180628_castrillon_isc-postmoore_send [PDF]
Permalink
- Jeronimo Castrillon, "Parallel programming: Current and future systems", In 50-year Celebration: Department of Electronics, Universidad de Antioquia, in the context of the IEEE Colombian Conference on Communications and Computing (COLCOM'18) (invited talk), May 2018. [Bibtex & Downloads]
Parallel programming: Current and future systems
Reference
Jeronimo Castrillon, "Parallel programming: Current and future systems", In 50-year Celebration: Department of Electronics, Universidad de Antioquia, in the context of the IEEE Colombian Conference on Communications and Computing (COLCOM'18) (invited talk), May 2018.
Bibtex
@Misc{castrillon2018UdeA,
author = {Castrillon, Jeronimo},
title = {Parallel programming: Current and future systems},
howpublished = {50-year Celebration: Department of Electronics, Universidad de Antioquia, in the context of the IEEE Colombian Conference on Communications and Computing (COLCOM'18) (invited talk)},
month = may,
year = {2018},
location = {Universidad de Antioquia, Medell{\'i}n, Colombia}
}Downloads
180516_castrillon_50years_EE_UdeA [PDF]
Permalink
- Sven Karol, Tobias Nett, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Domain-Specific Language and Editor for Parallel Particle Methods", In ACM Transactions on Mathematical Software (TOMS), ACM, vol. 44, no. 3, pp. 32, New York, NY, USA, Mar 2018. [doi] [Bibtex & Downloads]
A Domain-Specific Language and Editor for Parallel Particle Methods
Reference
Sven Karol, Tobias Nett, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Domain-Specific Language and Editor for Parallel Particle Methods", In ACM Transactions on Mathematical Software (TOMS), ACM, vol. 44, no. 3, pp. 32, New York, NY, USA, Mar 2018. [doi]
Abstract
Domain-specific languages (DSLs) are of increasing importance in scientific high-performance computing to reduce development costs, raise the level of abstraction and, thus, ease scientific programming. However, designing DSLs is not easy, as it requires knowledge of the application domain and experience in language engineering and compilers. Consequently, many DSLs follow a weak approach using macros or text generators, which lack many of the features that make a DSL comfortable for programmers. Some of these features –e.g., syntax highlighting, type inference, error reporting– are easily provided by language workbenches, which combine language engineering techniques and tools in a common ecosystem. In this paper, we present the Parallel Particle-Mesh Environment (PPME), a DSL and development environment for numerical simulations based on particle methods and hybrid particle-mesh methods. PPME uses the Meta Programming System (MPS), a projectional language workbench. PPME is the successor of the Parallel Particle-Mesh Language, a Fortran-based DSL that uses conventional implementation strategies. We analyze and compare both languages and demonstrate how the programmer’s experience is improved using static analyses and projectional editing, i.e., code-structure editing, constrained by syntax, as opposed to free-text editing. We present an explicit domain model for particle abstractions and the first formal type system for partircle methods.
Bibtex
@Article{karol_toms18,
author = {Karol, Sven and Nett, Tobias and Castrillon, Jeronimo and Sbalzarini, Ivo F.},
title = {A Domain-Specific Language and Editor for Parallel Particle Methods},
journal = {ACM Transactions on Mathematical Software (TOMS)},
issue_date = {March 2018},
volume = {44},
number = {3},
month = mar,
year = {2018},
issn = {0098-3500},
pages = {34:1--34:32},
articleno = {34},
numpages = {32},
url = {http://doi.acm.org/10.1145/3175659},
doi = {10.1145/3175659},
acmid = {3175659},
publisher = {ACM},
address = {New York, NY, USA},
pages = {32},
abstract = {
Domain-specific languages (DSLs) are of increasing importance in scientific high-performance computing to reduce development costs, raise the level of abstraction and, thus, ease scientific programming. However, designing DSLs is not easy, as it requires knowledge of the application domain and experience in language engineering and compilers. Consequently, many DSLs follow a weak approach using macros or text generators, which lack many of the features that make a DSL comfortable for programmers. Some of these features --e.g., syntax highlighting, type inference, error reporting-- are easily provided by language workbenches, which combine language engineering techniques and tools in a common ecosystem. In this paper, we present the Parallel Particle-Mesh Environment (PPME), a DSL and development environment for numerical simulations based on particle methods and hybrid particle-mesh methods. PPME uses the Meta Programming System (MPS), a projectional language workbench. PPME is the successor of the Parallel Particle-Mesh Language, a Fortran-based DSL that uses conventional implementation strategies. We analyze and compare both languages and demonstrate how the programmer’s experience is improved using static analyses and projectional editing, i.e., code-structure editing, constrained by syntax, as opposed to free-text editing. We present an explicit domain model for particle abstractions and the first formal type system for partircle methods.},
}Downloads
1709_Karol_TOMS-arxiv [PDF]
Related Paths
Biological Systems Path, Orchestration Path
Permalink
- Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018. [Bibtex & Downloads]
STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement
Reference
Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018.
Bibtex
@InProceedings{hameed_nvmw18,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement},
booktitle = {Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018)},
year = {2018},
month = mar,
location = {San Diego, CA, USA},
numpages = {2}
}Downloads
1803_Hameed_NVMW [PDF]
Related Paths
Permalink
- Sebastian Ertel, Andrés Goens, Justus Adam, Jeronimo Castrillon, "Compiling for Concise Code and Efficient I/O", Proceedings of the 27th International Conference on Compiler Construction (CC 2018), ACM, pp. 104–115, New York, NY, USA, Feb 2018. [doi] [Bibtex & Downloads]
Compiling for Concise Code and Efficient I/O
Reference
Sebastian Ertel, Andrés Goens, Justus Adam, Jeronimo Castrillon, "Compiling for Concise Code and Efficient I/O", Proceedings of the 27th International Conference on Compiler Construction (CC 2018), ACM, pp. 104–115, New York, NY, USA, Feb 2018. [doi]
Abstract
Large infrastructures of Internet companies, such as Facebook and Twitter, are composed of several layers of micro-services. While this modularity provides scalability to the system, the I/O associated with each service request strongly impacts its performance. In this context, writing concise programs which execute I/O efficiently is especially challenging. In this paper, we introduce Ÿauhau, a novel compile-time solution. Ÿauhau reduces the number of I/O calls through rewrites on a simple expression language. To execute I/O concurrently, it lowers the expression language to a dataflow representation. Our approach can be used alongside an existing programming language, permitting the use of legacy code. We describe an implementation in the JVM and use it to evaluate our approach. Experiments show that Ÿauhau can significantly improve I/O, both in terms of the number of I/O calls and concurrent execution. Ÿauhau outperforms state-of-the-art approaches with similar goals.
Bibtex
@InProceedings{ertel_cc18,
author = {Sebastian Ertel and Andr\'{e}s Goens and Justus Adam and Jeronimo Castrillon},
title = {Compiling for Concise Code and Efficient I/O},
booktitle = {Proceedings of the 27th International Conference on Compiler Construction (CC 2018)},
series = {CC 2018},
year = {2018},
month = feb,
location = {Vienna, Austria},
publisher = {ACM},
numpages = {12},
pages = {104--115},
doi = {10.1145/3178372.3179505},
url = {https://dl.acm.org/citation.cfm?id=3179505},
acmid = {3179505},
address = {New York, NY, USA},
abstract = {Large infrastructures of Internet companies, such as Facebook and Twitter, are composed of several layers of micro-services. While this modularity provides scalability to the system, the I/O associated with each service request strongly impacts its performance. In this context, writing concise programs which execute I/O efficiently is especially challenging. In this paper, we introduce Ÿauhau, a novel compile-time solution. Ÿauhau reduces the number of I/O calls through rewrites on a simple expression language. To execute I/O concurrently, it lowers the expression language to a dataflow representation. Our approach can be used alongside an existing programming language, permitting the use of legacy code. We describe an implementation in the JVM and use it to evaluate our approach. Experiments show that Ÿauhau can significantly improve I/O, both in terms of the number of I/O calls and concurrent execution. Ÿauhau outperforms state-of-the-art approaches with similar goals.},
}Downloads
cc-2018-slides [PDF]
1802_Ertel_CC [PDF]
Related Paths
Permalink
- Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Supporting Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), ACM, pp. 41–50, New York, NY, USA, Feb 2018. [doi] [Bibtex & Downloads]
Supporting Fine-grained Dataflow Parallelism in Big Data Systems
Reference
Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Supporting Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), ACM, pp. 41–50, New York, NY, USA, Feb 2018. [doi]
Abstract
Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in data parallel fashion. It has been recently reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into a higher throughput. This is particularly the case for applications that have inherent sequential, computationally intensive phases. In this paper, we analyze the data processing cores of state-of-the-art big data systems to find the cause for these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicit parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.
Bibtex
@InProceedings{ertel_pmam18,
author = {Sebastian Ertel and Justus Adam and Jeronimo Castrillon},
title = {Supporting Fine-grained Dataflow Parallelism in Big Data Systems},
booktitle = {Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM)},
year = {2018},
series = {PMAM'18},
address = {New York, NY, USA},
month = feb,
publisher = {ACM},
doi = {10.1145/3178442.3178447},
isbn = {978-1-4503-5645-9},
location = {Vienna, Austria},
pages = {41--50},
numpages = {10},
acmid = {3178447},
url = {http://doi.acm.org/10.1145/3178442.3178447},
abstract = {Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in data parallel fashion. It has been recently reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into a higher throughput. This is particularly the case for applications that have inherent sequential, computationally intensive phases. In this paper, we analyze the data processing cores of state-of-the-art big data systems to find the cause for these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicit parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.},
}Downloads
pmam-2018-slides [PDF]
1802_Ertel_PMAM [PDF]
Related Paths
Permalink
- Norman A. Rink, Immo Huismann, Adilla Susungi, Jeronimo Castrillon, Jörg Stiller, Jochen Fröhlich, Claude Tadonki, "CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics", Proceedings of the 3rd International Workshop on Real World Domain Specific Languages (RWDSL 2018), ACM, pp. 5:1–5:10, New York, NY, USA, Feb 2018. [doi] [Bibtex & Downloads]
CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics
Reference
Norman A. Rink, Immo Huismann, Adilla Susungi, Jeronimo Castrillon, Jörg Stiller, Jochen Fröhlich, Claude Tadonki, "CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics", Proceedings of the 3rd International Workshop on Real World Domain Specific Languages (RWDSL 2018), ACM, pp. 5:1–5:10, New York, NY, USA, Feb 2018. [doi]
Abstract
Numerical simulations continue to enable fast and enormous progress in science and engineering. Writing efficient numerical codes is a difficult challenge that encompasses a variety of tasks from designing the right algorithms to exploiting the full potential of a platform's architecture. Domain-specific languages (DSLs) can ease these tasks by offering the right abstractions for expressing numerical problems. With the aid of domain knowledge, efficient code can then be generated automatically from abstract expressions. In this work, we present the CFDlang DSL for expressing tensor operations that constitute the performance-critical code sections in a class of real numerical applications from fluid dynamics. We demonstrate that CFDlang can be used to generate code automatically that performs as well, if not better, than carefully hand-optimized code.
Bibtex
@InProceedings{rink_rwdsl18,
author = {Norman A. Rink and Immo Huismann and Adilla Susungi and Jeronimo Castrillon and J{\"o}rg Stiller and Jochen Fr{\"o}hlich and Claude Tadonki},
title = {CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics},
booktitle = {Proceedings of the 3rd International Workshop on Real World Domain Specific Languages (RWDSL 2018)},
year = {2018},
series = {RWDSL2018},
pages = {5:1--5:10},
address = {New York, NY, USA},
month = feb,
publisher = {ACM},
abstract = {Numerical simulations continue to enable fast and enormous progress in science and engineering. Writing efficient numerical codes is a difficult challenge that encompasses a variety of tasks from designing the right algorithms to exploiting the full potential of a platform's architecture. Domain-specific languages (DSLs) can ease these tasks by offering the right abstractions for expressing numerical problems. With the aid of domain knowledge, efficient code can then be generated automatically from abstract expressions. In this work, we present the CFDlang DSL for expressing tensor operations that constitute the performance-critical code sections in a class of real numerical applications from fluid dynamics. We demonstrate that CFDlang can be used to generate code automatically that performs as well, if not better, than carefully hand-optimized code.},
acmid = {3183900},
articleno = {5},
doi = {10.1145/3183895.3183900},
isbn = {978-1-4503-6355-6},
location = {Vienna, Austria},
numpages = {10},
url = {http://doi.acm.org/10.1145/3183895.3183900}
}Downloads
1802_Rink_RWDSL [PDF]
Related Paths
Permalink
- Hermann Härtig, Nils Asmussen, Jeronimo Castrillon, Adam Lackorzynski, Michael Roitzsch, Carsten Weinhold, Akash Kumar, "Extremely Heterogeneous Systems – Not Just For Niches", In Proceeding: Extreme Heterogeneity Workshop, Feb 2018. [Bibtex & Downloads]
Extremely Heterogeneous Systems – Not Just For Niches
Reference
Hermann Härtig, Nils Asmussen, Jeronimo Castrillon, Adam Lackorzynski, Michael Roitzsch, Carsten Weinhold, Akash Kumar, "Extremely Heterogeneous Systems – Not Just For Niches", In Proceeding: Extreme Heterogeneity Workshop, Feb 2018.
Bibtex
@InProceedings{haertig_ehw18,
author = {Hermann H{\"a}rtig and Nils Asmussen and Jeronimo Castrillon and Adam Lackorzynski and Michael Roitzsch and Carsten Weinhold and Akash Kumar},
title = {Extremely Heterogeneous Systems -- Not Just For Niches},
booktitle = {Extreme Heterogeneity Workshop},
year = {2018},
month = feb,
note = {(Workshop took place over remote conferencing)},
location = {Gaithersburg, MD, USA}
}Downloads
1802_Haertig_EHW [PDF]
Related Paths
Permalink
- Robert Khasanov, Andrés Goens, Jeronimo Castrillon, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap", Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'18), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 20–25, New York, NY, USA, Jan 2018. [doi] [Bibtex & Downloads]
Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap
Reference
Robert Khasanov, Andrés Goens, Jeronimo Castrillon, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap", Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'18), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 20–25, New York, NY, USA, Jan 2018. [doi]
Abstract
Modern embedded systems are rapidly increasing their complexity, both in terms of numbers of cores, as well as heterogeneity. To generate efficient code for these systems, it is common to leverage formal models of computation.
Among these, the dataflow model of Kahn Process Networks (KPN) is widespread because it is expressive but guarantees a deterministic execution. However, the KPN model is ill-suited to expose data-level parallelism, since this has to be made explicit in the process network. This is aggravated by the fact that its most common execution model, Kahn-MacQueen, poses restrictive conditions on the scheduling of data-parallel processes, leading to an inefficient execution. In this paper we present a novel extension to the KPN model and a relaxed execution strategy that addresses this problem, while keeping the deterministic KPN semantics. It improves run-time adaptivity in malleable way and provides implicit parallelism. We evaluate our approach on two architectures, improving the performance of a benchmark by up to 25.6% on an Intel chip with hyper-threading, and by up to 78.0% on a heterogeneous embedded ARM big.LITTLE architecture.Bibtex
@InProceedings{khasanov_parma18,
author = {Robert Khasanov and Andr\'{e}s Goens and Jeronimo Castrillon},
title = {Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap},
booktitle = {Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'18), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
series = {PARMA-DITAM '18},
isbn = {978-1-4503-6444-7},
pages = {20--25},
year = {2018},
month = jan,
numpages = {6},
url = {http://doi.acm.org/10.1145/3183767.3183790},
doi = {10.1145/3183767.3183790},
acmid = {3183790},
publisher = {ACM},
address = {New York, NY, USA},
location = {Manchester, United Kingdom},
abstract = {Modern embedded systems are rapidly increasing their complexity, both in terms of numbers of cores, as well as heterogeneity. To generate efficient code for these systems, it is common to leverage formal models of computation.
Among these, the dataflow model of Kahn Process Networks (KPN) is widespread because it is expressive but guarantees a deterministic execution. However, the KPN model is ill-suited to expose data-level parallelism, since this has to be made explicit in the process network. This is aggravated by the fact that its most common execution model, Kahn-MacQueen, poses restrictive conditions on the scheduling of data-parallel processes, leading to an inefficient execution. In this paper we present a novel extension to the KPN model and a relaxed execution strategy that addresses this problem, while keeping the deterministic KPN semantics. It improves run-time adaptivity in malleable way and provides implicit parallelism. We evaluate our approach on two architectures, improving the performance of a benchmark by up to 25.6% on an Intel chip with hyper-threading, and by up to 78.0% on a heterogeneous embedded ARM big.LITTLE architecture.},
}Downloads
1801_Khasanov_PARMA-DITAM [PDF]
Related Paths
Permalink
- Andrés Goens, Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers", Proceedings of the 11th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG'2018), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2018. [Bibtex & Downloads]
Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers
Reference
Andrés Goens, Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers", Proceedings of the 11th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG'2018), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2018.
Bibtex
@InProceedings{goens_multiprog18,
author = {Andr{\'e}s Goens and Sebastian Ertel and Justus Adam and Jeronimo Castrillon},
title = {Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers},
booktitle = {Proceedings of the 11th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG'2018), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
year = {2018},
url = {http://research.ac.upc.edu/multiprog/multiprog2018/papers/MULTIPROG-2018_Goens.pdf},
month = jan,
location = {Manchester, United Kingdom}
}Downloads
1801_Goens_MULTIRPOG [PDF]
Related Paths
Permalink
- Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi] [Bibtex & Downloads]
NVMain Extension for Multi-Level Cache Systems
Reference
Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi]
Bibtex
@InProceedings{khan_rapido18,
author = {Asif Ali Khan and Fazal Hameed and Jeronimo Castrillon},
title = {NVMain Extension for Multi-Level Cache Systems},
booktitle = {Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
series = {RAPIDO '18},
year = {2018},
month = jan,
pages = {7:1--7:6},
articleno = {7},
numpages = {6},
url = {http://doi.acm.org/10.1145/3180665.3180672},
doi = {10.1145/3180665.3180672},
acmid = {3180672},
publisher = {ACM},
address = {New York, NY, USA},
location = {Manchester, United Kingdom},
isbn = {978-1-4503-6417-1},
}Downloads
1801_Khan_RAPIDO [PDF]
Related Paths
Permalink
2017
- Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi] [Bibtex & Downloads]
Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache
Reference
Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi]
Bibtex
@InProceedings{hameed_memsys17,
author = {Fazal Hameed and Christian Menard and Jeronimo Castrillon},
title = {Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache},
booktitle = {Proceedings of the International Symposium on Memory Systems (MemSys'17)},
series = {MEMSYS '17},
year = {2017},
month = oct,
isbn = {978-1-4503-5335-9},
location = {Alexandria, Virginia},
pages = {141--151},
numpages = {11},
url = {http://doi.acm.org/10.1145/3132402.3132414},
doi = {10.1145/3132402.3132414},
acmid = {3132414},
publisher = {ACM},
address = {New York, NY, USA},
}Downloads
1710_Hameed_Memsys [PDF]
Related Paths
Permalink
- Miguel Angel Aguilar, Abhishek Aggarwal, Awaid Shaheen, Rainer Leupers, Gerd Ascheid, Jeronimo Castrillon, Liam Fitzpatrick, "Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress", Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), ACM, pp. 14:1–14:2, New York, NY, USA, Oct 2017. [doi] [Bibtex & Downloads]
Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress
Reference
Miguel Angel Aguilar, Abhishek Aggarwal, Awaid Shaheen, Rainer Leupers, Gerd Ascheid, Jeronimo Castrillon, Liam Fitzpatrick, "Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress", Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), ACM, pp. 14:1–14:2, New York, NY, USA, Oct 2017. [doi]
Bibtex
@inproceedings{aguilar_cases17,
author = {Aguilar, Miguel Angel and Aggarwal, Abhishek and Shaheen, Awaid and Leupers, Rainer and Ascheid, Gerd and Castrillon, Jeronimo and Fitzpatrick, Liam},
title = {Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress},
booktitle = {Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES)},
series = {CASES '17},
year = {2017},
month = oct,
isbn = {978-1-4503-5184-3},
location = {Seoul, Republic of Korea},
pages = {14:1--14:2},
articleno = {14},
numpages = {2},
url = {http://doi.acm.org/10.1145/3125501.3125521},
doi = {10.1145/3125501.3125521},
acmid = {3125521},
publisher = {ACM},
address = {New York, NY, USA},
}Downloads
1710_Aguilar_CASES [PDF]
Permalink
- Adilla Susungi, Norman A. Rink, Jeronimo Castrillon, Immo Huismann, Albert Cohen, Claude Tadonki, Jörg Stiller, Jochen Fröhlich, "Towards Compositional and Generative Tensor Optimizations", Proceedings of 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'17), ACM, pp. 169–175, New York, NY, USA, Oct 2017. [doi] [Bibtex & Downloads]
Towards Compositional and Generative Tensor Optimizations
Reference
Adilla Susungi, Norman A. Rink, Jeronimo Castrillon, Immo Huismann, Albert Cohen, Claude Tadonki, Jörg Stiller, Jochen Fröhlich, "Towards Compositional and Generative Tensor Optimizations", Proceedings of 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'17), ACM, pp. 169–175, New York, NY, USA, Oct 2017. [doi]
Bibtex
@InProceedings{rink_gpce17,
author = {Adilla Susungi and Norman A. Rink and Jeronimo Castrillon and Immo Huismann and Albert Cohen and Claude Tadonki and J{\"o}rg Stiller and Jochen Fr{\"o}hlich},
title = {Towards Compositional and Generative Tensor Optimizations},
booktitle = {Proceedings of 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'17)},
series = {GPCE 2017},
year = {2017},
pages = {169--175},
month = oct,
isbn = {978-1-4503-5524-7},
location = {Vancouver, BC, Canada},
pages = {169--175},
numpages = {7},
url = {http://doi.acm.org/10.1145/3136040.3136050},
doi = {10.1145/3136040.3136050},
acmid = {3136050},
publisher = {ACM},
address = {New York, NY, USA},
}Downloads
1710_Rink_GPCE [PDF]
Related Paths
Permalink
- Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017) (Lawrence Rauchwerger), Springer, Cham, pp. 281–282, Oct 2017. [doi] [Bibtex & Downloads]
POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems
Reference
Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017) (Lawrence Rauchwerger), Springer, Cham, pp. 281–282, Oct 2017. [doi]
Bibtex
@InProceedings{ertel_lcpc17,
author = {Sebastian Ertel and Justus Adam and Jeronimo Castrillon},
title = {POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems},
booktitle = {Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017)},
year = {2017},
editor = {Lawrence Rauchwerger},
publisher = {Springer, Cham},
location = {Texas A{\&}M University, College Station, Texas},
month = oct,
isbn = {978-3-030-35224-0},
pages = {281--282},
doi = {10.1007/978-3-030-35225-7},
url = {https://link.springer.com/book/10.1007%2F978-3-030-35225-7},
}Downloads
1710_Ertel_LCPC [PDF]
Related Paths
Permalink
- Sven Karol, Tobias Nett, Pietro Incardona, Nesrine Khouzami, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Language and Development Environment for Parallel Particle Methods", Proceedings of the 5th International Conference on Particle-based Methods. Fundamentals and Applications PARTICLES 2017 (P. Wriggers and M. Bischoff and E. Oñate and D.R.J. Owen and T. Zohdi), Sep 2017. [Bibtex & Downloads]
A Language and Development Environment for Parallel Particle Methods
Reference
Sven Karol, Tobias Nett, Pietro Incardona, Nesrine Khouzami, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Language and Development Environment for Parallel Particle Methods", Proceedings of the 5th International Conference on Particle-based Methods. Fundamentals and Applications PARTICLES 2017 (P. Wriggers and M. Bischoff and E. Oñate and D.R.J. Owen and T. Zohdi), Sep 2017.
Bibtex
@InProceedings{karol_particles17,
author = {Sven Karol and Tobias Nett and Pietro Incardona and Nesrine Khouzami and Jeronimo Castrillon and Ivo F. Sbalzarini},
title = {A Language and Development Environment for Parallel Particle Methods},
booktitle = {Proceedings of the 5th International Conference on Particle-based Methods. Fundamentals and Applications PARTICLES 2017},
year = {2017},
editor = {P. Wriggers and M. Bischoff and E. O{\~n}ate and D.R.J. Owen and T. Zohdi},
url = {https://www.semanticscholar.org/paper/A-Language-and-Development-Environment-for-Paralle-Karol-Nett/2b79bd3836aeb8e2fb2a2b5d9949f9efb1bdfab7?tab=abstract},
month = sep,
}Downloads
1709_Karol_particles [PDF]
Related Paths
Biological Systems Path, Orchestration Path
Permalink
- Jeronimo Castrillon, Tei-Wei Kuo, Heike E. Riel, Matthias Lieber, "Wildly Heterogeneous Post-CMOS Technologies Meet Software (Dagstuhl Seminar 17061)", In Dagstuhl Reports (Jerónimo Castrillón-Mazo and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, vol. 7, no. 2, pp. 1–22, Dagstuhl, Germany, Aug 2017. [doi] [Bibtex & Downloads]
Wildly Heterogeneous Post-CMOS Technologies Meet Software (Dagstuhl Seminar 17061)
Reference
Jeronimo Castrillon, Tei-Wei Kuo, Heike E. Riel, Matthias Lieber, "Wildly Heterogeneous Post-CMOS Technologies Meet Software (Dagstuhl Seminar 17061)", In Dagstuhl Reports (Jerónimo Castrillón-Mazo and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, vol. 7, no. 2, pp. 1–22, Dagstuhl, Germany, Aug 2017. [doi]
Bibtex
@Article{castrillnmazo_et_al:DR:2017:7349,
author = {Jeronimo Castrillon and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber},
title = ,
journal = {Dagstuhl Reports},
year = {2017},
volume = {7},
number = {2},
month = aug,
pages = {1--22},
address = {Dagstuhl, Germany},
annote = {Keywords: 3D integration, compilers, emerging post-CMOS circuit materials and technologies, hardware/software co-design, heterogeneous hardware, nanoelectronics},
doi = {10.4230/DagRep.7.2.1},
editor = {Jer{\'o}nimo Castrill{\'o}n-Mazo and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber},
issn = {2192-5283},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
url = {http://drops.dagstuhl.de/opus/volltexte/2017/7349},
urn = {urn:nbn:de:0030-drops-73499}
}Downloads
No Downloads available for this publication
Related Paths
Permalink
- Andrés Goens, Sergio Siccha, Jeronimo Castrillon, "Symmetry in Software Synthesis", In ACM Transactions on Architecture and Code Optimization (TACO),, ACM, vol. 14, no. 2, pp. 20:1–20:26, New York, NY, USA, Jul 2017. [doi] [Bibtex & Downloads]
Symmetry in Software Synthesis
Reference
Andrés Goens, Sergio Siccha, Jeronimo Castrillon, "Symmetry in Software Synthesis", In ACM Transactions on Architecture and Code Optimization (TACO),, ACM, vol. 14, no. 2, pp. 20:1–20:26, New York, NY, USA, Jul 2017. [doi]
Abstract
With the surge of multi- and manycores, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. While most such approaches leverage the inherent symmetry of architectures and applications, they do it in a problem-specific and intuitive way. However, intuitive approaches become impractical with growing hardware complexity, like Network-on-Chip interconnect or heterogeneous cores. In this paper, we present a formal framework that can determine the inherent symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis. Our approach is based on the mathematical theory of groups and a generalization called inverse semigroups. We evaluate our approach in two state-of-the-art mapping frameworks. Even for the platforms with a handful of cores of today and moderate-size benchmarks, our approach consistently yields reductions of the overall execution time of algorithms, accelerating them by a factor up to 10 in our experiments, or improving the quality of the results.
Bibtex
@article{goens_taco17symmetry,
author = {Goens, Andr{\'e}s and Siccha, Sergio and Castrillon, Jeronimo},
title = {Symmetry in Software Synthesis},
journal = {ACM Transactions on Architecture and Code Optimization (TACO),},
issue_date = {July 2017},
volume = {14},
number = {2},
month = jul,
year = {2017},
issn = {1544-3566},
pages = {20:1--20:26},
articleno = {20},
numpages = {26},
url = {http://doi.acm.org/10.1145/3095747},
doi = {10.1145/3095747},
acmid = {3095747},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Scalability, automation, clusters, design-space exploration, group theory, heterogeneous, inverse-semigroups, mapping, metaheuristics, network-on-chip, symmetry},
eprint = "arXiv:1704.06623",
abstract = {With the surge of multi- and manycores, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. While most such approaches leverage the inherent symmetry of architectures and applications, they do it in a problem-specific and intuitive way. However, intuitive approaches become impractical with growing hardware complexity, like Network-on-Chip interconnect or heterogeneous cores. In this paper, we present a formal framework that can determine the inherent symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis. Our approach is based on the mathematical theory of groups and a generalization called inverse semigroups. We evaluate our approach in two state-of-the-art mapping frameworks. Even for the platforms with a handful of cores of today and moderate-size benchmarks, our approach consistently yields reductions of the overall execution time of algorithms, accelerating them by a factor up to 10 in our experiments, or improving the quality of the results.}
}Downloads
1704_Goens_TACO-arxiv [PDF]
Related Paths
Permalink
- Christian Menard, Matthias Jung, Jeronimo Castrillon, Norbert Wehn, "System Simulation with gem5 and SystemC: The Keystone for Full Interoperability", Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS), pp. 62–69, Jul 2017. [doi] [Bibtex & Downloads]
System Simulation with gem5 and SystemC: The Keystone for Full Interoperability
Reference
Christian Menard, Matthias Jung, Jeronimo Castrillon, Norbert Wehn, "System Simulation with gem5 and SystemC: The Keystone for Full Interoperability", Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS), pp. 62–69, Jul 2017. [doi]
Abstract
SystemC TLM based virtual prototypes have become the main tool in industry and research for concurrent hardware and software development, as well as hardware design space exploration. However, there exists a lack of accurate, free, changeable and realistic SystemC models of modern CPUs. Therefore, many researchers use the cycle accurate open source system simulator gem5, which has been developed in parallel to the SystemC standard. In this paper we present a coupling of gem5 with SystemC that offers full interoperability between both simulation frameworks, and therefore enables a huge set of possibilities for system level design space exploration. Furthermore, we show that the coupling itself only induces a relatively small overhead to the total execution time of the simulation.
Bibtex
@InProceedings{menard_samos17,
author = {Christian Menard and Matthias Jung and Jeronimo Castrillon and Norbert Wehn},
title = {System Simulation with gem5 and SystemC: The Keystone for Full Interoperability},
booktitle = {Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS)},
year = {2017},
month = jul,
location = {Pythagorion, Greece},
pages = {62--69},
organization = {IEEE},
doi = {10.1109/SAMOS.2017.8344612},
url = {https://ieeexplore.ieee.org/document/8344612/},
isbn = {978-1-5386-3437-0},
abstract = {SystemC TLM based virtual prototypes have become the main tool in industry and research for concurrent hardware and software development, as well as hardware design space exploration. However, there exists a lack of accurate, free, changeable and realistic SystemC models of modern CPUs. Therefore, many researchers use the cycle accurate open source system simulator gem5, which has been developed in parallel to the SystemC standard. In this paper we present a coupling of gem5 with SystemC that offers full interoperability between both simulation frameworks, and therefore enables a huge set of possibilities for system level design space exploration. Furthermore, we show that the coupling itself only induces a relatively small overhead to the total execution time of the simulation.},
}Downloads
1707_Menard_SAMOS [PDF]
Related Paths
Permalink
- Andrés Goens, Robert Khasanov, Marcus Hähnel, Till Smejkal, Hermann Härtig, Jeronimo Castrillon, "TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES'17), ACM, pp. 11–20, New York, NY, USA, Jun 2017. [doi] [Bibtex & Downloads]
TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings
Reference
Andrés Goens, Robert Khasanov, Marcus Hähnel, Till Smejkal, Hermann Härtig, Jeronimo Castrillon, "TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES'17), ACM, pp. 11–20, New York, NY, USA, Jun 2017. [doi]
Bibtex
@InProceedings{goens_scopes17,
author = {Andr\'{e}s Goens and Robert Khasanov and Marcus H{\"a}hnel and Till Smejkal and Hermann H{\"a}rtig and Jeronimo Castrillon},
title = {TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings},
booktitle = {Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES'17)},
year = {2017},
month = jun,
series = {SCOPES '17},
isbn = {978-1-4503-5039-6},
location = {Sankt Goar, Germany},
pages = {11--20},
numpages = {10},
url = {http://doi.acm.org/10.1145/3078659.3078663},
doi = {10.1145/3078659.3078663},
acmid = {3078663},
publisher = {ACM},
address = {New York, NY, USA}
}Downloads
1706_Goens_SCOPES [PDF]
Related Paths
Permalink
- Gerald Hempel, Andrés Goens, Josefine Asmus, Jeronimo Castrillon, Ivo F. Sbalzarini, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17), ACM, pp. 21–30, New York, NY, USA, Jun 2017. [doi] [Bibtex & Downloads]
Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering
Reference
Gerald Hempel, Andrés Goens, Josefine Asmus, Jeronimo Castrillon, Ivo F. Sbalzarini, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17), ACM, pp. 21–30, New York, NY, USA, Jun 2017. [doi]
Bibtex
@InProceedings{hempel_scopes17,
author = {Gerald Hempel and Andr\'{e}s Goens and Josefine Asmus and Jeronimo Castrillon and Ivo F. Sbalzarini},
title = {Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering},
booktitle = {Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17)},
year = {2017},
series = {SCOPES '17},
pages = {21--30},
address = {New York, NY, USA},
month = jun,
publisher = {ACM},
acmid = {3078667},
doi = {10.1145/3078659.3078667},
isbn = {978-1-4503-5039-6},
location = {Sankt Goar, Germany},
numpages = {10},
url = {http://doi.acm.org/10.1145/3078659.3078667}
}Downloads
1706_Hempel_SCOPES [PDF]
Related Paths
Biological Systems Path, Orchestration Path
Permalink
- Johanna Sepúlveda, Vania Marangozova-Martin, Jeronimo Castrillon, "Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface", Elsevier, Jun 2017. [doi] [Bibtex & Downloads]
Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface
Reference
Johanna Sepúlveda, Vania Marangozova-Martin, Jeronimo Castrillon, "Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface", Elsevier, Jun 2017. [doi]
Bibtex
@Article{sepulveda_alchemy17_preface,
author = {Sep{\'u}lveda, Johanna and Marangozova-Martin, Vania and Castrillon, Jeronimo},
title = {Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface},
year = {2017},
month = jun,
doi = {10.1016/j.procs.2017.05.276},
file = {:/Users/jeronimocastrillon/Documents/Academic/mypapers/1706_sepulveda_alchemy.pdf:PDF},
url = {http://www.sciencedirect.com/science/article/pii/S1877050917309286},
publisher = {Elsevier}
}Downloads
No Downloads available for this publication
Permalink
- Norman A. Rink, Jeronimo Castrillon, "Extending a Compiler Backend for Complete Memory Error Detection", In Proceeding: Lecture Notes in Informatics: Automotive - Safety & Security 2017 (Peter Dencker and Herbert Klenk and Hubert Kelle and Erhard Plödereder), pp. 61–74, May 2017. (Best paper award) [Bibtex & Downloads]
Extending a Compiler Backend for Complete Memory Error Detection
Reference
Norman A. Rink, Jeronimo Castrillon, "Extending a Compiler Backend for Complete Memory Error Detection", In Proceeding: Lecture Notes in Informatics: Automotive - Safety & Security 2017 (Peter Dencker and Herbert Klenk and Hubert Kelle and Erhard Plödereder), pp. 61–74, May 2017. (Best paper award)
Abstract
Technological advances drive hardware to ever smaller feature sizes, causing devices to become more vulnerable to faults. Applications can be protected against errors resulting from faults by adding error detection and recovery measures in software. This is popularly achieved by applying automatic program transformations. However, transformations applied to intermediate program representations are fundamentally incapable of protecting against vulnerabilities that are introduced during compilation. In particular, the compiler backend may introduce additional memory accesses. This report presents an extended compiler backend that protects these accesses against faults in the memory system. It is demonstrated that this enables the detection of all single bit flips in memory. On a subset of SPEC CINT2006 the runtime overhead caused by the extended backend amounts to 1.50x for the 32-bit processor architecture i386, and 1.13x for the 64-bit architecture x86 64.
Bibtex
@InProceedings{rink_automotive17,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {Extending a Compiler Backend for Complete Memory Error Detection},
booktitle = {Lecture Notes in Informatics: Automotive - Safety \& Security 2017},
editor = {Peter Dencker and Herbert Klenk and Hubert Kelle and Erhard Pl{\"o}dereder},
year = {2017},
pages = {61--74},
month = may,
abstract = {Technological advances drive hardware to ever smaller feature sizes, causing devices to become more vulnerable to faults. Applications can be protected against errors resulting from faults by adding error detection and recovery measures in software. This is popularly achieved by applying automatic program transformations. However, transformations applied to intermediate program representations are fundamentally incapable of protecting against vulnerabilities that are introduced during compilation. In particular, the compiler backend may introduce additional memory accesses. This report presents an extended compiler backend that protects these accesses against faults in the memory system. It is demonstrated that this enables the detection of all single bit flips in memory. On a subset of SPEC CINT2006 the runtime overhead caused by the extended backend amounts to 1.50x for the 32-bit processor architecture i386, and 1.13x for the 64-bit architecture x86 64.},
file = {:/Users/jeronimocastrillon/Documents/Academic/mypapers/1705_rink_automotive.pdf:PDF},
isbn = {978-3-88579-663-3},
issn = {1617-5468},
url = {https://dl.gi.de/bitstream/handle/20.500.12116/147/paper04.pdf?sequence=1&isAllowed=y},
}Downloads
1705_rink_automotive [PDF]
Related Paths
Orchestration Path, Resilience Path
Permalink
- Markus Haehnel, Frehiwot Melak Arega, Waltenegus Dargie, Robert Khasanov, Jeronimo Castrillon, "Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures", Proceedings of the 7th International Workshop on Big Data in Cloud Performance (DCPerf'17), IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 432-437, May 2017. [doi] [Bibtex & Downloads]
Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures
Reference
Markus Haehnel, Frehiwot Melak Arega, Waltenegus Dargie, Robert Khasanov, Jeronimo Castrillon, "Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures", Proceedings of the 7th International Workshop on Big Data in Cloud Performance (DCPerf'17), IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 432-437, May 2017. [doi]
Bibtex
@InProceedings{khasanov_dcperf17,
author = {Markus Haehnel and Frehiwot Melak Arega and Waltenegus Dargie and Robert Khasanov and Jeronimo Castrillon},
title = {Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures},
booktitle = {Proceedings of the 7th International Workshop on Big Data in Cloud Performance (DCPerf'17), IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS)},
year = {2017},
month = may,
volume={},
number={},
pages={432-437},
doi={10.1109/INFCOMW.2017.8116415},
ISSN={},
url = {http://ieeexplore.ieee.org/document/8116415/},
location = {Atlanta, USA}
}Downloads
1705_Khasanov_DCPerf [PDF]
Related Paths
Permalink
- Norman A. Rink, Jeronimo Castrillon, "Trading Fault Tolerance for Performance in AN Encoding", Proceedings of the ACM International Conference on Computing Frontiers (CF'17), ACM, pp. 183–190, New York, NY, USA, May 2017. [doi] [Bibtex & Downloads]
Trading Fault Tolerance for Performance in AN Encoding
Reference
Norman A. Rink, Jeronimo Castrillon, "Trading Fault Tolerance for Performance in AN Encoding", Proceedings of the ACM International Conference on Computing Frontiers (CF'17), ACM, pp. 183–190, New York, NY, USA, May 2017. [doi]
Bibtex
@InProceedings{rink_cf17,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {Trading Fault Tolerance for Performance in {AN} Encoding},
booktitle = {Proceedings of the ACM International Conference on Computing Frontiers (CF'17)},
year = {2017},
isbn = {978-1-4503-4487-6},
location = {Siena, Italy},
pages = {183--190},
numpages = {8},
url = {http://doi.acm.org/10.1145/3075564.3075565},
doi = {10.1145/3075564.3075565},
acmid = {3075565},
publisher = {ACM},
address = {New York, NY, USA},
month = may,
}Downloads
1705_Rink_cf [PDF]
Related Paths
Orchestration Path, Resilience Path
Permalink
- Rainer Leupers, Miguel Angel Aguilar, Juan Fernando Eusse, Jeronimo Castrillon, Weihua Sheng, "MAPS: A Software Development Environment for Embedded Multicore Applications", Springer Netherlands, pp. 1–33, Dordrecht, Apr 2017. [doi] [Bibtex & Downloads]
MAPS: A Software Development Environment for Embedded Multicore Applications
Reference
Rainer Leupers, Miguel Angel Aguilar, Juan Fernando Eusse, Jeronimo Castrillon, Weihua Sheng, "MAPS: A Software Development Environment for Embedded Multicore Applications", Springer Netherlands, pp. 1–33, Dordrecht, Apr 2017. [doi]
Abstract
The use of heterogeneous Multi-Processor System-on-Chip (MPSoC) is a widely accepted solution to address the increasing demands on high performance and energy efficiency for modern embedded devices. To enable the full potential of these platforms, new tools are needed to tackle the programming complexity of MPSoCs, while allowing for high productivity. This chapter discusses the MPSoC Application Programming Studio (MAPS), a framework that provides facilities for expressing parallelism and tool flows for parallelization, mapping/scheduling, and code generation for heterogeneous MPSoCs. Two case studies of the use of MAPS in commercial environments are presented. This chapter closes by discussing early experiences of transferring the MAPS technology into Silexica GmbH, a start-up company that provides multi-core programming tools.
Bibtex
@InBook{leupers_hhcd17,
title = {MAPS: A Software Development Environment for Embedded Multicore Applications},
author = {Rainer Leupers and Miguel Angel Aguilar and Juan Fernando Eusse and Jeronimo Castrillon and Weihua Sheng},
editor = {Soonhoi Ha and J{\"u}rgen Teich},
publisher = {Springer Netherlands},
year = {2017},
address = {Dordrecht},
month = apr,
booktitle = {Handbook of Hardware/Software Codesign},
doi = {10.1007/978-94-017-7358-4_2-1},
isbn = {978-94-017-7358-4},
url = {http://dx.doi.org/10.1007/978-94-017-7358-4_2-1},
pages = {1--33},
abstract = {The use of heterogeneous Multi-Processor System-on-Chip (MPSoC) is a widely accepted solution to address the increasing demands on high performance and energy efficiency for modern embedded devices. To enable the full potential of these platforms, new tools are needed to tackle the programming complexity of MPSoCs, while allowing for high productivity. This chapter discusses the MPSoC Application Programming Studio (MAPS), a framework that provides facilities for expressing parallelism and tool flows for parallelization, mapping/scheduling, and code generation for heterogeneous MPSoCs. Two case studies of the use of MAPS in commercial environments are presented. This chapter closes by discussing early experiences of transferring the MAPS technology into Silexica GmbH, a start-up company that provides multi-core programming tools.},
}Downloads
No Downloads available for this publication
Permalink
- Lars Schütze, Jeronimo Castrillon, "Analyzing State-of-the-Art Role-based Programming Languages", Proceedings of the First International Conference on the Art, Science and Engineering of Programming (Programming'17), ACM, pp. 9:1–9:6, New York, NY, USA, Apr 2017. [doi] [Bibtex & Downloads]
Analyzing State-of-the-Art Role-based Programming Languages
Reference
Lars Schütze, Jeronimo Castrillon, "Analyzing State-of-the-Art Role-based Programming Languages", Proceedings of the First International Conference on the Art, Science and Engineering of Programming (Programming'17), ACM, pp. 9:1–9:6, New York, NY, USA, Apr 2017. [doi]
Bibtex
@InProceedings{schuetze_lassy17,
author = {Lars Sch{\"u}tze and Jeronimo Castrillon},
title = {Analyzing State-of-the-Art Role-based Programming Languages},
booktitle = {Proceedings of the First International Conference on the Art, Science and Engineering of Programming (Programming'17)},
series = {Programming '17},
year = {2017},
month = apr,
isbn = {978-1-4503-4836-2},
location = {Brussels, Belgium},
pages = {9:1--9:6},
articleno = {9},
numpages = {6},
url = {http://doi.acm.org/10.1145/3079368.3079386},
doi = {10.1145/3079368.3079386},
acmid = {3079386},
publisher = {ACM},
address = {New York, NY, USA},
}Downloads
1704_Schuetze_lassy [PDF]
Permalink
- Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi] [Bibtex & Downloads]
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Reference
Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi]
Abstract
State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag- Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2% and 11.4% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62%.
Bibtex
@InProceedings{hameed_date17,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization},
booktitle = {Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE)},
year = {2017},
series = {DATE '17},
pages = {362--367},
month = mar,
publisher = {EDA Consortium},
abstract = {State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag- Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70\% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2\% and 11.4\% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62\%.},
isbn = {978-3-9815370-8-6},
doi={10.23919/DATE.2017.7927017},
url = {http://ieeexplore.ieee.org/document/7927017/},
location = {Lausanne, Switzerland}
}Downloads
1703_Hameed_DATE [PDF]
Related Paths
Permalink
- Norman A. Rink, Jeronimo Castrillon, "flexMEDiC: flexible Memory Error Detection by Combined data encoding and duplication", Proceedings of the 2nd International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with DATE 2017, pp. 15–22, Mar 2017. [Bibtex & Downloads]
flexMEDiC: flexible Memory Error Detection by Combined data encoding and duplication
Reference
Norman A. Rink, Jeronimo Castrillon, "flexMEDiC: flexible Memory Error Detection by Combined data encoding and duplication", Proceedings of the 2nd International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with DATE 2017, pp. 15–22, Mar 2017.
Abstract
Errors in memory are known to be a major cause of system failures. Moreover, it has recently been found that single-error correcting, double-error detecting (SECDED) codes, which are widely used in ECC memory modules, are incapable of handling large fractions of errors that occur in practice. This calls for more powerful error detection measures. However, the higher the number of bit flips that can still be detected as an error, the larger the memory overhead. Cost considerations and the varying needs for reliability of different applications may not always warrant laying down extra hardware to accommodate overheads. Software-implemented error detection offers a flexible alternative. In this work we propose the software-implemented flexMEDiC scheme for detecting errors in the memory system, including main memory, on-chip caches, and load-store queues. It is shown that single and double bit flips are detected by flexMEDiC, and evidence is given that suggests that up to five bit flips within a single data word can still be detected as errors. The average runtime overhead incurred by flexMEDiC is 1.55x.
Bibtex
@InProceedings{rees:2017,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {{flexMEDiC}: flexible {M}emory {E}rror {D}etection by Combined data encoding and duplication},
booktitle = {Proceedings of the 2nd International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with DATE 2017},
year = {2017},
month = mar,
pages = {15--22},
abstract = {Errors in memory are known to be a major cause of system failures. Moreover, it has recently been found that single-error correcting, double-error detecting (SECDED) codes, which are widely used in ECC memory modules, are incapable of handling large fractions of errors that occur in practice. This calls for more powerful error detection measures. However, the higher the number of bit flips that can still be detected as an error, the larger the memory overhead. Cost considerations and the varying needs for reliability of different applications may not always warrant laying down extra hardware to accommodate overheads. Software-implemented error detection offers a flexible alternative. In this work we propose the software-implemented flexMEDiC scheme for detecting errors in the memory system, including main memory, on-chip caches, and load-store queues. It is shown that single and double bit flips are detected by flexMEDiC, and evidence is given that suggests that up to five bit flips within a single data word can still be detected as errors. The average runtime overhead incurred by flexMEDiC is 1.55x.},
}Downloads
1703_Rink_REES [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Programming for adaptive and energy-efficient computing", In International Conference on High Performance Compilation, Computing and Communications (HP3C-2017) (keynote), Mar 2017. [Bibtex & Downloads]
Programming for adaptive and energy-efficient computing
Reference
Jeronimo Castrillon, "Programming for adaptive and energy-efficient computing", In International Conference on High Performance Compilation, Computing and Communications (HP3C-2017) (keynote), Mar 2017.
Bibtex
@Misc{castrillon2017hp3c,
author = {Castrillon, Jeronimo},
title = {Programming for adaptive and energy-efficient computing},
howpublished = {International Conference on High Performance Compilation, Computing and Communications (HP3C-2017) (keynote)},
month = mar,
year = {2017},
location = {Kuala Lumpur, Malaysia}
}Downloads
170323_castrill_hp3c [PDF]
Permalink
- Andrés Goens, Jeronimo Castrillon, "Optimizing for Data-Parallelism in Kahn Process Networks", In Proceeding: ACM SRC at International Symposium on
Code Generationand Optimization (CGO), Feb 2017. [Bibtex & Downloads]
Optimizing for Data-Parallelism in Kahn Process Networks
Reference
Andrés Goens, Jeronimo Castrillon, "Optimizing for Data-Parallelism in Kahn Process Networks", In Proceeding: ACM SRC at International Symposium on Code Generationand Optimization (CGO), Feb 2017.
Bibtex
@inproceedings{goens17cgo,
author = {Andr\'{e}s Goens and Jeronimo Castrillon},
title = {Optimizing for Data-Parallelism in Kahn Process Networks},
year = {2017},
month = feb,
booktitle= {ACM SRC at International Symposium on
Code Generationand Optimization (CGO)},
location = {Austin, TX, USA},
}Downloads
1701_Goens_SRCCGO [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "On Mapping to Multi/Manycores", In 10th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk), Jan 2017. [Bibtex & Downloads]
On Mapping to Multi/Manycores
Reference
Jeronimo Castrillon, "On Mapping to Multi/Manycores", In 10th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk), Jan 2017.
Bibtex
@Misc{castrillon2017multiprog,
author = {Castrillon, Jeronimo},
title = {On Mapping to Multi/Manycores},
howpublished = {10th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk)},
month = jan,
year = {2017},
location = {Stockholm, Sweden}
}Downloads
No Downloads available for this publication
Related Paths
Permalink
- Jeronimo Castrillon, "Flexible and Scalable Dataflow Programming for Manycores", In Tutorial for heterogeneous multicore design automation: current and future, held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk), Jan 2017. [Bibtex & Downloads]
Flexible and Scalable Dataflow Programming for Manycores
Reference
Jeronimo Castrillon, "Flexible and Scalable Dataflow Programming for Manycores", In Tutorial for heterogeneous multicore design automation: current and future, held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk), Jan 2017.
Bibtex
@Misc{castrillon2017hipeactut,
author = {Castrillon, Jeronimo},
title = {Flexible and Scalable Dataflow Programming for Manycores},
howpublished = {Tutorial for heterogeneous multicore design automation: current and future, held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk)},
month = jan,
year = {2017},
location = {Stockholm, Sweden}
}Downloads
No Downloads available for this publication
Related Paths
Permalink
2016
- Norman A. Rink, Jeronimo Castrillon, "Comprehensive Backend Support for Local Memory Fault Tolerance", Technical report, Technische Universität Dresden, pp. 11, Dec 2016. [Bibtex & Downloads]
Comprehensive Backend Support for Local Memory Fault Tolerance
Reference
Norman A. Rink, Jeronimo Castrillon, "Comprehensive Backend Support for Local Memory Fault Tolerance", Technical report, Technische Universität Dresden, pp. 11, Dec 2016.
Bibtex
@TechReport{rink_techrep16,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {Comprehensive Backend Support for Local Memory Fault Tolerance},
institution = {Technische Universit{\"a}t Dresden},
year = {2016},
month = dec,
issn = {1430-211X},
pages = {11},
url = {https://cfaed.tu-dresden.de/files/user/nrink/tech-report-ro.pdf}
}Downloads
tech-report-ro [PDF]
Related Paths
Permalink
- Marcus Völp, Sascha Klüppelholz, Jeronimo Castrillon, Hermann Härtig, Nils Asmussen, Uwe Assmann, Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andres Goens, Sebastian Haas, Dirk Habich, Mattis Hasler, Immo Huismann, Tomas Karnagel, Sven Karol, Wolfgang Lehner, Linda Leuschner, Matthias Lieber, Siqi Ling, Steffen Märcker, Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Peñaloza, Michael Raitza, Jörg Stiller, Annett Ungethüm, Axel Voigt, "The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware", Proceedings of the 1st International Workshop on Post-Moore's Era Supercomputing (PMES), Co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, USA, Nov 2016. [Bibtex & Downloads]
The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware
Reference
Marcus Völp, Sascha Klüppelholz, Jeronimo Castrillon, Hermann Härtig, Nils Asmussen, Uwe Assmann, Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andres Goens, Sebastian Haas, Dirk Habich, Mattis Hasler, Immo Huismann, Tomas Karnagel, Sven Karol, Wolfgang Lehner, Linda Leuschner, Matthias Lieber, Siqi Ling, Steffen Märcker, Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Peñaloza, Michael Raitza, Jörg Stiller, Annett Ungethüm, Axel Voigt, "The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware", Proceedings of the 1st International Workshop on Post-Moore's Era Supercomputing (PMES), Co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, USA, Nov 2016.
Abstract
Future systems based on post-CMOS technologies,
will be wildly heterogeneous, with properties largely unknown today.,
This paper presents our design of a new hardware/software stack to address the,
challenge of preparing software development for such systems.,
It combines well-understood technologies from different areas, e.g., network-on-chips,
capability operating systems, flexible programming models and model checking.,
We describe our approach and provide details on key technologies.Bibtex
@InProceedings{voelp16_pmes,
author = {Marcus V{\"o}lp and Sascha Kl{\"u}ppelholz and Jeronimo Castrillon and Hermann H{\"a}rtig and Nils Asmussen and Uwe Assmann and Franz Baader and Christel Baier and Gerhard Fettweis and Jochen Fr{\"o}hlich and Andres Goens and Sebastian Haas and Dirk Habich and Mattis Hasler and Immo Huismann and Tomas Karnagel and Sven Karol and Wolfgang Lehner and Linda Leuschner and Matthias Lieber and Siqi Ling and Steffen M{\"a}rcker and Johannes Mey and Wolfgang Nagel and Benedikt N{\"o}then and Rafael Pe{\~n}aloza and Michael Raitza and J{\"o}rg Stiller and Annett Ungeth{\"u}m and Axel Voigt},
title = {The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware},
booktitle = {Proceedings of the 1st International Workshop on Post-Moore's Era Supercomputing (PMES), Co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16)},
year = {2016},
address = {Salt Lake City, USA},
month = nov,
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/1611_Voelp_PMES.pdf},
abstract = {Future systems based on post-CMOS technologies,
will be wildly heterogeneous, with properties largely unknown today.,
This paper presents our design of a new hardware/software stack to address the,
challenge of preparing software development for such systems.,
It combines well-understood technologies from different areas, e.g., network-on-chips,
capability operating systems, flexible programming models and model checking.,
We describe our approach and provide details on key technologies.},
}Downloads
1611_Voelp_PMES [PDF]
Related Paths
Permalink
- Christian Menard, Andrés Goens, Jeronimo Castrillon, "High-Level NoC Model for MPSoC Compilers", Proceedings of the IEEE Nordic Circuits and Systems Conference (NORCAS'16), pp. 1-6, Copenhagen, Denmark, Nov 2016. [doi] [Bibtex & Downloads]
High-Level NoC Model for MPSoC Compilers
Reference
Christian Menard, Andrés Goens, Jeronimo Castrillon, "High-Level NoC Model for MPSoC Compilers", Proceedings of the IEEE Nordic Circuits and Systems Conference (NORCAS'16), pp. 1-6, Copenhagen, Denmark, Nov 2016. [doi]
Bibtex
@InProceedings{menard_norcas16,
author = {Christian Menard and Andr\'{e}s Goens and Jeronimo Castrillon},
title = {High-Level NoC Model for MPSoC Compilers},
booktitle = {Proceedings of the IEEE Nordic Circuits and Systems Conference (NORCAS'16)},
year = {2016},
pages={1-6},
doi = {10.1109/NORCHIP.2016.7792876},
series = {NORCAS},
address = {Copenhagen, Denmark},
month = nov,
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/1611_Menard_NORCAS.pdf}
}Downloads
1611_Menard_NORCAS [PDF]
Related Paths
Permalink
- Andres Goens, Robert Khasanov, Jeronimo Castrillon, Simon Polstra, Andy Pimentel, "Why Comparing System-level MPSoC Mapping Approaches is Difficult: a Case Study", Proceedings of the IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-16), pp. 281-288, Ecole Centrale de Lyon, Lyon, France, Sep 2016. [doi] [Bibtex & Downloads]
Why Comparing System-level MPSoC Mapping Approaches is Difficult: a Case Study
Reference
Andres Goens, Robert Khasanov, Jeronimo Castrillon, Simon Polstra, Andy Pimentel, "Why Comparing System-level MPSoC Mapping Approaches is Difficult: a Case Study", Proceedings of the IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-16), pp. 281-288, Ecole Centrale de Lyon, Lyon, France, Sep 2016. [doi]
Abstract
Software abstractions are crucial to effectively program heterogeneous Multi-Processor Systems on Chip (MPSoCs). Prime examples of such abstractions are Kahn Process Networks (KPNs) and execution traces. When modeling computation as a KPN, one of the key challenges is to obtain a good mapping, i.e., an assignment of logical computation and communication to physical resources. In this paper we compare two system-level frameworks for solving the mapping problem: Sesame and MAPS. These frameworks, while superficially similar, embody different approaches. Sesame, motivated by modeling and design-space exploration, uses evolutionary algorithms for mapping. MAPS, being a compiler framework, uses simple and fast heuristics instead. In this work we highlight the value of common abstractions, such as KPNs and traces, as a vehicle to enable comparisons between large independent frameworks. These types of comparisons are fundamental for advancing research in the area. At the same time, we illustrate how the lack of formalized models at the hardware level are an obstacle to achieving fair comparisons. Additionally, using a set of applications from the embedded systems domain, we observe that genetic algorithms tend to outperform heuristics by a factor between 1x and 5x, with notable exceptions. This performance comes at the cost of a longer computation time, between 0 and 2 orders of magnitude in our experiments.
Bibtex
@InProceedings{goen_mcsoc16,
author= {Andres Goens and Robert Khasanov and Jeronimo Castrillon and Simon Polstra and Andy Pimentel},
title= {Why Comparing System-level {MPSoC} Mapping Approaches is Difficult: a Case Study},
booktitle= {Proceedings of the IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-16)},
year= {2016},
address= {Ecole Centrale de Lyon, Lyon, France},
month= sep,
pages = {281-288},
doi = {10.1109/MCSoC.2016.48},
abstract = {Software abstractions are crucial to effectively program heterogeneous Multi-Processor Systems on Chip (MPSoCs). Prime examples of such abstractions are Kahn Process Networks (KPNs) and execution traces. When modeling computation as a KPN, one of the key challenges is to obtain a good mapping, i.e., an assignment of logical computation and communication to physical resources. In this paper we compare two system-level frameworks for solving the mapping problem: Sesame and MAPS. These frameworks, while superficially similar, embody different approaches. Sesame, motivated by modeling and design-space exploration, uses evolutionary algorithms for mapping. MAPS, being a compiler framework, uses simple and fast heuristics instead. In this work we highlight the value of common abstractions, such as KPNs and traces, as a vehicle to enable comparisons between large independent frameworks. These types of comparisons are fundamental for advancing research in the area. At the same time, we illustrate how the lack of formalized models at the hardware level are an obstacle to achieving fair comparisons. Additionally, using a set of applications from the embedded systems domain, we observe that genetic algorithms tend to outperform heuristics by a factor between 1x and 5x, with notable exceptions. This performance comes at the cost of a longer computation time, between 0 and 2 orders of magnitude in our experiments.},
days= {21},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/1609_Goens_MCSoC.pdf}
}Downloads
1609_Goens_MCSoC [PDF]
Related Paths
Permalink
- Benjamin Schiller, Clemens Deusser, Jeronimo Castrillon, Thorsten Strufe, "Compile- and Run-time Approaches for the Selection of Efficient Data Structures for Dynamic Graph Analysis", In Journal of Applied Network Science, vol. 1, no. 9, pp. 1–22, Sep 2016. [doi] [Bibtex & Downloads]
Compile- and Run-time Approaches for the Selection of Efficient Data Structures for Dynamic Graph Analysis
Reference
Benjamin Schiller, Clemens Deusser, Jeronimo Castrillon, Thorsten Strufe, "Compile- and Run-time Approaches for the Selection of Efficient Data Structures for Dynamic Graph Analysis", In Journal of Applied Network Science, vol. 1, no. 9, pp. 1–22, Sep 2016. [doi]
Bibtex
@Article{schiller16_jans,
author = {Benjamin Schiller and Clemens Deusser and Jeronimo Castrillon and Thorsten Strufe},
title = {Compile- and Run-time Approaches for the Selection of Efficient Data Structures for Dynamic Graph Analysis},
journal = {Journal of Applied Network Science},
year = {2016},
volume = {1},
number = {9},
pages = {1--22},
month = sep,
doi = {10.1007/s41109-016-0011-2},
url= {http://dynamic-networks.org/publications/papers/papers/gds-dynamic.pdf}
}Downloads
1607_Schiller_JANS [PDF]
Related Paths
HAEC, Orchestration Path, Resilience Path
Permalink
- Jeronimo Castrillon, "Compiling for Deeply Embedded and Heterogeneous Signal Processing Systems", In IEEE 5G Dresden Summit (invited talk), Sep 2016. [Bibtex & Downloads]
Compiling for Deeply Embedded and Heterogeneous Signal Processing Systems
Reference
Jeronimo Castrillon, "Compiling for Deeply Embedded and Heterogeneous Signal Processing Systems", In IEEE 5G Dresden Summit (invited talk), Sep 2016.
Bibtex
@Misc{castrillon20165gsummit,
author = {Castrillon, Jeronimo},
title = {Compiling for Deeply Embedded and Heterogeneous Signal Processing Systems},
howpublished = {IEEE 5G Dresden Summit (invited talk)},
month = sep,
year = {2016},
location = {Dresden, Germany},
url= {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/160929_castrillon_5G-summit.pdf}
}Downloads
160929_castrillon_5G-summit [PDF]
Permalink
- Andrés Goens, Jeronimo Castrillon, Maximilian Odendahl, Rainer Leupers, "An Optimal Allocation of Memory Buffers for Complex Multicore Platforms", In Journal of Systems Architecture, Elsevier, vol. 66-67, pp. 69–83, May 2016. [doi] [Bibtex & Downloads]
An Optimal Allocation of Memory Buffers for Complex Multicore Platforms
Reference
Andrés Goens, Jeronimo Castrillon, Maximilian Odendahl, Rainer Leupers, "An Optimal Allocation of Memory Buffers for Complex Multicore Platforms", In Journal of Systems Architecture, Elsevier, vol. 66-67, pp. 69–83, May 2016. [doi]
Abstract
In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints.
In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows to perform an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear pro- gramming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.Bibtex
@Article{goens_jsa16,
Title={An Optimal Allocation of Memory Buffers for Complex Multicore Platforms},
Author={Goens, Andr\'{e}s and Castrillon, Jeronimo and Odendahl, Maximilian and Leupers, Rainer},
Journal={Journal of Systems Architecture},
volume={66-67},
pages={69--83},
doi={10.1016/j.sysarc.2016.05.002},
publisher={Elsevier},
Year={2016},
month=may,
abstract={In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints.
In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows to perform an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear pro- gramming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.}
}Downloads
No Downloads available for this publication
Related Paths
Permalink
- Jeronimo Castrillon, "Programming Heterogeneous Embedded Systems for IoT", In Workshop get-togethers toward a sustainable collaboration in IoT (invited talk), Apr 2016. ([link]) [Bibtex & Downloads]
Programming Heterogeneous Embedded Systems for IoT
Reference
Jeronimo Castrillon, "Programming Heterogeneous Embedded Systems for IoT", In Workshop get-togethers toward a sustainable collaboration in IoT (invited talk), Apr 2016. ([link])
Bibtex
@Misc{castrillon2016tunis,
author={Castrillon, Jeronimo},
title={Programming Heterogeneous Embedded Systems for IoT},
howpublished={Workshop get-togethers toward a sustainable collaboration in IoT (invited talk)},
month=apr,
year={2016},
location={Tunis, Tunisia},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/160418_castrillon_dataflow4IoT.pdf}
}Downloads
160418_castrillon_dataflow4IoT [PDF]
Related Paths
Permalink
- Sven Karol, Norman A. Rink, Bálint Gyapjas, Jeronimo Castrillon, "Fault Tolerance with Aspects: a Feasibility Study", Proceedings of the 15th International Conference on Modularity, ACM, pp. 66–69, New York, NY, USA, Mar 2016. [doi] [Bibtex & Downloads]
Fault Tolerance with Aspects: a Feasibility Study
Reference
Sven Karol, Norman A. Rink, Bálint Gyapjas, Jeronimo Castrillon, "Fault Tolerance with Aspects: a Feasibility Study", Proceedings of the 15th International Conference on Modularity, ACM, pp. 66–69, New York, NY, USA, Mar 2016. [doi]
Bibtex
@inproceedings{karol2016faulttolerance,
author={Karol, Sven and Rink, Norman A. and Gyapjas, B\'{a}lint and Castrillon, Jeronimo},
title={Fault Tolerance with Aspects: a Feasibility Study},
booktitle={Proceedings of the 15th International Conference on Modularity},
series={MODULARITY 2016},
year={2016},
pages={66--69},
address={New York, NY, USA},
month={mar},
publisher={ACM},
doi={10.1145/2889443.2889453},
isbn={978-1-4503-3995-7/16/03},
location={M{\'a}laga, Spain},
}Downloads
1603_Karol_Modularity_preprint [PDF]
Related Paths
Permalink
2015
- Andrés Goens, Jeronimo Castrillon, "Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs", In Proceeding: System Level Design from HW/SW to Memory for Embedded Systems. IESS 2015. IFIP Advances in Information and Communication Technology, vol 523 (Götz, Marcelo and Schirner, Gunar and Wehrmeister, Marco Aurélio and Al Faruque, Mohammad Abdullah and Rettberg, Achim), Springer International Publishing, pp. 116–127, Foz do Iguaçu, Brazil, Nov 2015. [doi] [Bibtex & Downloads]
Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs
Reference
Andrés Goens, Jeronimo Castrillon, "Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs", In Proceeding: System Level Design from HW/SW to Memory for Embedded Systems. IESS 2015. IFIP Advances in Information and Communication Technology, vol 523 (Götz, Marcelo and Schirner, Gunar and Wehrmeister, Marco Aurélio and Al Faruque, Mohammad Abdullah and Rettberg, Achim), Springer International Publishing, pp. 116–127, Foz do Iguaçu, Brazil, Nov 2015. [doi]
Abstract
Current approaches for mapping Kahn Process Networks (KPN) and Dynamic Data Flow (DDF) applications rely on assumptions on the program behavior specific to an execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time. This happens when a different data set induces a significantly different behavior. We address this problem by leveraging inherent mathematical structures of the dataflow models and the hardware architectures. On the side of the dataflow models, we rely on the monoid structure of histories and traces. This structure help us formalize the behavior of multiple executions of a given dynamic application. By defining metrics we have a formal framework for comparing the executions. On the side of the hardware, we take advantage of symmetries in the architecture to reduce the search space for the mapping problem. We evaluate our implementation on execution variations of a randomly-generated KPN application and on a low-variation JPEG encoder benchmark. Using the described methods we show that trace differences are not sufficient for characterizing performance losses. Additionally, using platform symmetries we manage to reduce the design space in the experiments by two orders of magnitude.
Bibtex
@InProceedings{goens_iess15,
author = {Goens, Andr\'{e}s and Castrillon, Jeronimo},
title = {Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs},
booktitle = {System Level Design from HW/SW to Memory for Embedded Systems. IESS 2015. IFIP Advances in Information and Communication Technology, vol 523},
year = {2015},
editor = {G{\"o}tz, Marcelo and Schirner, Gunar and Wehrmeister, Marco Aur{\'e}lio and Al Faruque, Mohammad Abdullah and Rettberg, Achim},
pages = {116--127},
address = {Foz do Igua{\c{c}}u, Brazil},
month = nov,
publisher = {Springer International Publishing},
doi = {10.1007/978-3-319-90023-0_10},
url = {https://link.springer.com/chapter/10.1007%2F978-3-319-90023-0_10},
isbn={978-3-319-90023-0},
abstract = {Current approaches for mapping Kahn Process Networks (KPN) and Dynamic Data Flow (DDF) applications rely on assumptions on the program behavior specific to an execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time. This happens when a different data set induces a significantly different behavior. We address this problem by leveraging inherent mathematical structures of the dataflow models and the hardware architectures. On the side of the dataflow models, we rely on the monoid structure of histories and traces. This structure help us formalize the behavior of multiple executions of a given dynamic application. By defining metrics we have a formal framework for comparing the executions. On the side of the hardware, we take advantage of symmetries in the architecture to reduce the search space for the mapping problem. We evaluate our implementation on execution variations of a randomly-generated KPN application and on a low-variation JPEG encoder benchmark. Using the described methods we show that trace differences are not sufficient for characterizing performance losses. Additionally, using platform symmetries we manage to reduce the design space in the experiments by two orders of magnitude.},
}Downloads
1511_Goens_IESS [PDF]
Related Paths
Permalink
- Benjamin Schiller, Jeronimo Castrillon, Thorsten Strufe, "Efficient data structures for dynamic graph analysis", Proceedings of the 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) (Lisa O'Conner), IEEE Computer Society, pp. 497–504, Bangkok, Thailand, Nov 2015. [doi] [Bibtex & Downloads]
Efficient data structures for dynamic graph analysis
Reference
Benjamin Schiller, Jeronimo Castrillon, Thorsten Strufe, "Efficient data structures for dynamic graph analysis", Proceedings of the 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) (Lisa O'Conner), IEEE Computer Society, pp. 497–504, Bangkok, Thailand, Nov 2015. [doi]
Bibtex
@InProceedings{schiller_sitis15,
Title={Efficient data structures for dynamic graph analysis},
Author={Schiller, Benjamin and Castrillon, Jeronimo and Strufe, Thorsten},
Booktitle={Proceedings of the 11th International Conference on Signal-Image Technology \& Internet-Based Systems (SITIS)},
Year={2015},
Address={Bangkok, Thailand},
Editor={Lisa O'Conner},
Month=nov,
Publisher={IEEE Computer Society},
Series={SITIS 2015},
pages={497--504},
doi={10.1109/SITIS.2015.94}
}Downloads
1511_Schiller_SITIS [PDF]
Related Paths
Orchestration Path, Resilience Path, HAEC
Permalink
- Norman A. Rink, Jeronimo Castrillon, "Improving Code Generation for Software-based Error Detection", Proceedings of the 1st International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with ESWEEK 2015, pp. 16–30, Amsterdam, The Netherlands, Oct 2015. ([link]) [Bibtex & Downloads]
Improving Code Generation for Software-based Error Detection
Reference
Norman A. Rink, Jeronimo Castrillon, "Improving Code Generation for Software-based Error Detection", Proceedings of the 1st International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with ESWEEK 2015, pp. 16–30, Amsterdam, The Netherlands, Oct 2015. ([link])
Bibtex
@InProceedings{rink_ress15,
Title={Improving Code Generation for Software-based Error Detection},
Author={Rink, Norman A. and Castrillon, Jeronimo},
Booktitle={Proceedings of the 1st International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with ESWEEK 2015},
Year={2015},
Series={REES 2015},
Address={Amsterdam, The Netherlands},
Month=oct,
Pages={16--30},
}Downloads
1510_Rink_REES [PDF]
Related Paths
Orchestration Path, Resilience Path
Permalink
- Jeronimo Castrillon, "Analysis and software synthesis of KPN applications", In Design of Robotics and Embedded systems, Analysis, and Modeling Seminar (DREAMS) (invited talk), Oct 2015. ([link]) [Bibtex & Downloads]
Analysis and software synthesis of KPN applications
Reference
Jeronimo Castrillon, "Analysis and software synthesis of KPN applications", In Design of Robotics and Embedded systems, Analysis, and Modeling Seminar (DREAMS) (invited talk), Oct 2015. ([link])
Abstract
Programming models based on dataflow or process
networks are a good match for streaming
applications, common in the signal processing,
multimedia and automotive domains. In such models,
parallelism is expressed explicitly which makes
them well-suited for programming parallel
machines. Since today's applications are no
longer static, expressive programming models are
needed, such as those based on Kahn Process
Networks (KPNs). In these models, tasks cannot be
handled as black boxes, but have to be analyzed,
profiled and traced to characterize their
behavior. This is especially important in the case
of heterogenous platforms with many processors of
multiple different types. This presentation
describes a tool flow to handle KPN applications
and gives insights into mapping algorithms for
heterogeneous platforms.Bibtex
@Misc{castrillon15_dreams,
Title={Analysis and software synthesis of KPN applications},
Author={Jeronimo Castrillon},
HowPublished={Design of Robotics and Embedded systems, Analysis, and Modeling Seminar (DREAMS) (invited talk)},
Month=oct,
Year={2015},
Day={22},
Location={Berkeley, CA, USA},
Abstract={Programming models based on dataflow or process
networks are a good match for streaming
applications, common in the signal processing,
multimedia and automotive domains. In such models,
parallelism is expressed explicitly which makes
them well-suited for programming parallel
machines. Since today's applications are no
longer static, expressive programming models are
needed, such as those based on Kahn Process
Networks (KPNs). In these models, tasks cannot be
handled as black boxes, but have to be analyzed,
profiled and traced to characterize their
behavior. This is especially important in the case
of heterogenous platforms with many processors of
multiple different types. This presentation
describes a tool flow to handle KPN applications
and gives insights into mapping algorithms for
heterogeneous platforms.},
url={https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/151022_castrillon_dreams.pdf},
}Downloads
151022_castrillon_dreams [PDF]
Related Paths
Orchestration Path, Resilience Path
Permalink
- Jeronimo Castrillon, "Dataflow programming for heterogeneous computing systems", In Tutorial Algorithmic Specification, Tools and Algorithms for Programming Heterogeneous Platforms. Co-located with the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT'15), Oct 2015. ([link]) [Bibtex & Downloads]
Dataflow programming for heterogeneous computing systems
Reference
Jeronimo Castrillon, "Dataflow programming for heterogeneous computing systems", In Tutorial Algorithmic Specification, Tools and Algorithms for Programming Heterogeneous Platforms. Co-located with the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT'15), Oct 2015. ([link])
Abstract
This tutorial talk starts by introducing new types of heterogeneous systems and their challenges for hardware/software programming stacks. These systems are currently being investigated in the context of the German cluster of excellence Cfaed – ''Center for Advancing Electronics Dresden''. We will then look at dataflow modeling concepts, with emphasis on the dynamic models that are needed to express today's changing workloads. Finally, the talk will introduce methods and algorithms for mapping sets of applications modeled in this way to heterogeneous systems.
Bibtex
@Misc{castrillon15_pacttut,
Title={Dataflow programming for heterogeneous computing systems},
Author={Jeronimo Castrillon},
HowPublished={Tutorial Algorithmic Specification, Tools and Algorithms for Programming Heterogeneous Platforms. Co-located with the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT'15)},
Month=oct,
Year={2015},
Abstract={This tutorial talk starts by introducing new types of heterogeneous systems and their challenges for hardware/software programming stacks. These systems are currently being investigated in the context of the German cluster of excellence Cfaed – ''Center for Advancing Electronics Dresden''. We will then look at dataflow modeling concepts, with emphasis on the dynamic models that are needed to express today's changing workloads. Finally, the talk will introduce methods and algorithms for mapping sets of applications modeled in this way to heterogeneous systems.},
Day={18},
}Downloads
151018_castrillon_dataflow_pacttut [PDF]
Related Paths
Permalink
- Markus Vogt, Gerald Hempel, Jeronimo Castrillon, Christian Hochberger, "GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs", Proceedings of the Second International Workshop on FPGAs for Software Programmers (FSP), Sep 2015. ([link]) [Bibtex & Downloads]
GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs
Reference
Markus Vogt, Gerald Hempel, Jeronimo Castrillon, Christian Hochberger, "GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs", Proceedings of the Second International Workshop on FPGAs for Software Programmers (FSP), Sep 2015. ([link])
Abstract
In recent years, architectures combining a reconfigurable fabric and a general purpose processor on a single chip became increasingly popular. Such hybrid architectures allow extending embedded software with application specific hardware accelerators to improve performance and/or energy efficiency. Aiding system designers and programmers at handling the complexity of the required process of hardware/software (HW/SW) partitioning is an important issue. Current methods are often restricted, either to bare-metal systems, to subsets of mainstream programming languages, or require special coding guidelines, e.g., via annotations. These restrictions still represent a high entry barrier for the wider community of programmers that new hybrid architectures are intended for. In this paper we revisit HW/SW partitioning and present a seamless programming flow for unrestricted, legacy C code. It consists of a retargetable GCC plugin that automatically identifies code sections for hardware acceleration and generates code accordingly. The proposed workflow was evaluated on the Xilinx Zynq platform using unmodified code from an embedded benchmark suite.
Bibtex
@InProceedings{vogt15,
Title={GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs},
Author={Vogt , Markus and Hempel, Gerald and Castrillon, Jeronimo and Hochberger, Christian},
Booktitle={Proceedings of the Second International Workshop on FPGAs for Software Programmers (FSP)},
Year={2015},
Month=sep,
Series={FSP 2015},
archivePrefix={arXiv},
arxivId={1509.00025},
eprint={1509.00025},
abstract={In recent years, architectures combining a reconfigurable fabric and a general purpose processor on a single chip became increasingly popular. Such hybrid architectures allow extending embedded software with application specific hardware accelerators to improve performance and/or energy efficiency. Aiding system designers and programmers at handling the complexity of the required process of hardware/software (HW/SW) partitioning is an important issue. Current methods are often restricted, either to bare-metal systems, to subsets of mainstream programming languages, or require special coding guidelines, e.g., via annotations. These restrictions still represent a high entry barrier for the wider community of programmers that new hybrid architectures are intended for. In this paper we revisit HW/SW partitioning and present a seamless programming flow for unrestricted, legacy C code. It consists of a retargetable GCC plugin that automatically identifies code sections for hardware acceleration and generates code accordingly. The proposed workflow was evaluated on the Xilinx Zynq platform using unmodified code from an embedded benchmark suite.},
}Downloads
1509_Vogt_FSP [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Orchestration: Turning material breakthroughs into application performance", In Dresden Microelectronics Academy, (invited talk), Sep 2015. [Bibtex & Downloads]
Orchestration: Turning material breakthroughs into application performance
Reference
Jeronimo Castrillon, "Orchestration: Turning material breakthroughs into application performance", In Dresden Microelectronics Academy, (invited talk), Sep 2015.
Bibtex
@Misc{castrillon2015dma,
Title={Orchestration: Turning material breakthroughs into application performance},
Author={Castrillon, Jeronimo},
HowPublished={Dresden Microelectronics Academy, (invited talk)},
Month=sep,
Year={2015},
Location={Dresden, Germany},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150918_castrillon_dma.pdf}
}Downloads
150918_castrillon_dma [PDF]
Related Paths
Permalink
- Norman A. Rink, Dmitrii Kuvaiskii, Jeronimo Castrillon, Christof Fetzer, "Compiling for Resilience: the Performance Gap", Chapter in Parallel Computing: On the Road to Exascale (ParCo 2015). Extended from Proceedings of the Mini-Symposium on Energy and Resilience in Parallel Programming (ERPP 2015) (Gerhard R. Joubert and Hugh Leather and Mark Parsons and Frans Peters and Mark Sawyer), IOS Press, vol. 27, pp. 721–730, Edinburgh, Scotland, Sep 2015. [doi] [Bibtex & Downloads]
Compiling for Resilience: the Performance Gap
Reference
Norman A. Rink, Dmitrii Kuvaiskii, Jeronimo Castrillon, Christof Fetzer, "Compiling for Resilience: the Performance Gap", Chapter in Parallel Computing: On the Road to Exascale (ParCo 2015). Extended from Proceedings of the Mini-Symposium on Energy and Resilience in Parallel Programming (ERPP 2015) (Gerhard R. Joubert and Hugh Leather and Mark Parsons and Frans Peters and Mark Sawyer), IOS Press, vol. 27, pp. 721–730, Edinburgh, Scotland, Sep 2015. [doi]
Abstract
In order to perform reliable computations on unreliable hardware, software-based protection mechanisms have been proposed. In this paper we present a compiler infrastructure for software-based code hardening based on encoding. We analyze the trade-off between performance and fault coverage. We look at different code generation strategies that improve the performance of hardened programs by up to 2x while incurring little fault coverage degradation.
Bibtex
@InCollection{rink_erpp2015,
author={Rink, Norman A. and Kuvaiskii, Dmitrii and Castrillon, Jeronimo and Fetzer, Christof},
title={Compiling for Resilience: the Performance Gap},
booktitle={Parallel Computing: On the Road to Exascale (ParCo 2015). Extended from Proceedings of the Mini-Symposium on Energy and Resilience in Parallel Programming (ERPP 2015)},
publisher={IOS Press},
year={2015},
editor={Gerhard R. Joubert and Hugh Leather and Mark Parsons and Frans Peters and Mark Sawyer},
volume={27},
series={ParCo 2015},
pages={721--730},
address={Edinburgh, Scotland},
month=sep,
abstract={In order to perform reliable computations on unreliable hardware, software-based protection mechanisms have been proposed. In this paper we present a compiler infrastructure for software-based code hardening based on encoding. We analyze the trade-off between performance and fault coverage. We look at different code generation strategies that improve the performance of hardened programs by up to 2x while incurring little fault coverage degradation.},
doi={10.3233/978-1-61499-621-7-721},
}Downloads
No Downloads available for this publication
Related Paths
Orchestration Path, Resilience Path
Permalink
- Gerald Hempel, Markus Vogt, Jeronimo Castrillon, Christian Hochberger, "Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs", Proceedings of 41st EUROMICRO Conference on Software Engineering and Advanced Applications - Work in Progress Session (Grosspietsch, Erwin and Klöckner, Konrad), SEA-Publications: SEA-SR-44, Funchal, Madeira (Portugal), August 2015. [Bibtex & Downloads]
Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs
Reference
Gerald Hempel, Markus Vogt, Jeronimo Castrillon, Christian Hochberger, "Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs", Proceedings of 41st EUROMICRO Conference on Software Engineering and Advanced Applications - Work in Progress Session (Grosspietsch, Erwin and Klöckner, Konrad), SEA-Publications: SEA-SR-44, Funchal, Madeira (Portugal), August 2015.
Bibtex
@InProceedings{hempeldsd15,
Title={Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs},
Author={Hempel, Gerald and Vogt, Markus and Castrillon, Jeronimo and Hochberger, Christian},
Booktitle={Proceedings of 41st EUROMICRO Conference on Software Engineering and Advanced Applications - Work in Progress Session},
Year={2015},
Address={Funchal, Madeira (Portugal)},
Editor={Grosspietsch, Erwin and Kl{\"o}ckner, Konrad},
Month={August},
Publisher={SEA-Publications: SEA-SR-44},
Series={DSD/SEAA 2015},
ISBN={978-3-902457-44-8}
}Downloads
1508_Hempel_DSD [PDF]
Related Paths
Permalink
- Sven Karol, Pietro Incardona, Yaser Afshar, Ivo Sbalzarini, Jeronimo Castrillon, "Towards a Next-Generation Parallel Particle-Mesh Language", Proceedings of the 3rd Workshop on Domain-Specific Language Design and Implementation (DSLDI), pp. 15–18, Jul 2015. ([link]) [Bibtex & Downloads]
Towards a Next-Generation Parallel Particle-Mesh Language
Reference
Sven Karol, Pietro Incardona, Yaser Afshar, Ivo Sbalzarini, Jeronimo Castrillon, "Towards a Next-Generation Parallel Particle-Mesh Language", Proceedings of the 3rd Workshop on Domain-Specific Language Design and Implementation (DSLDI), pp. 15–18, Jul 2015. ([link])
Bibtex
@InProceedings{karol15,
Title={Towards a Next-Generation Parallel Particle-Mesh Language},
Author={Karol, Sven and Incardona, Pietro and Afshar, Yaser and Sbalzarini, Ivo and Castrillon, Jeronimo},
Booktitle={Proceedings of the 3rd Workshop on Domain-Specific Language Design and Implementation (DSLDI)},
series={DSLDI'15},
Year={2015},
Month=jul,
pages={15--18},
}Downloads
1507_Karol_PPML [PDF]
Related Paths
Permalink
- Diana Göhringer, Michael Hübner, Jeronimo Castrillon, Cristina Silvano, "ViPES 2015-Preface", Proceedings of the 15th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 347–347, Jul 2015. [doi] [Bibtex & Downloads]
ViPES 2015-Preface
Reference
Diana Göhringer, Michael Hübner, Jeronimo Castrillon, Cristina Silvano, "ViPES 2015-Preface", Proceedings of the 15th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 347–347, Jul 2015. [doi]
Bibtex
@InProceedings{gohringer2015vipes,
author={G{\"o}hringer, Diana and H{\"u}bner, Michael and Castrillon, Jeronimo and Silvano, Cristina},
title={ViPES 2015-Preface},
booktitle={Proceedings of the 15th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)},
year={2015},
pages={347--347},
organization={IEEE},
month=jul,
doi={10.1109/SAMOS.2015.7363696},
}Downloads
No Downloads available for this publication
Permalink
- Jeronimo Castrillon, "Portable Libraries and Programming Environments", In HiPEAC Computing Systems Week, (invited talk), May 2015. [Bibtex & Downloads]
Portable Libraries and Programming Environments
Reference
Jeronimo Castrillon, "Portable Libraries and Programming Environments", In HiPEAC Computing Systems Week, (invited talk), May 2015.
Bibtex
@Misc{castrillon2015csw,
Title={Portable Libraries and Programming Environments},
Author={Castrillon, Jeronimo},
HowPublished={HiPEAC Computing Systems Week, (invited talk)},
Year={2015},
Month=may,
Location={Oslo, Noway},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150505_castrillon_csw.pdf}
}Downloads
150505_castrillon_csw [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, Lothar Thiele, Lars Schorr, Weihua Sheng, Ben Juurlink, Mauricio Alvarez-Mesa, Angela Pohl, Ralph Jessenberger, Victor Reyes, Rainer Leupers, "Multi/Many-core Programming: Where Are We Standing?", Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), EDA Consortium, pp. 1708–1717, San Jose, CA, USA, Mar 2015. ([link]) [Bibtex & Downloads]
Multi/Many-core Programming: Where Are We Standing?
Reference
Jeronimo Castrillon, Lothar Thiele, Lars Schorr, Weihua Sheng, Ben Juurlink, Mauricio Alvarez-Mesa, Angela Pohl, Ralph Jessenberger, Victor Reyes, Rainer Leupers, "Multi/Many-core Programming: Where Are We Standing?", Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), EDA Consortium, pp. 1708–1717, San Jose, CA, USA, Mar 2015. ([link])
Bibtex
@inproceedings{Castrillon:2015,
author={Castrillon, Jeronimo and Thiele, Lothar and Schorr, Lars and Sheng, Weihua and Juurlink, Ben and Alvarez-Mesa, Mauricio and Pohl, Angela and Jessenberger, Ralph and Reyes, Victor and Leupers, Rainer},
title={Multi/Many-core Programming: Where Are We Standing?},
booktitle={Proceedings of the 2015 Design, Automation \& Test in Europe Conference \& Exhibition (DATE)},
series={DATE '15},
year={2015},
Month=mar,
location={Grenoble, France},
pages={1708--1717},
numpages={10},
acmid={2757208},
publisher={EDA Consortium},
address={San Jose, CA, USA},
Bdsk-url-1={http://dl.acm.org/citation.cfm?id=2757012.2757208},
}Downloads
No Downloads available for this publication
Related Paths
Permalink
- Jeronimo Castrillon, "Tools and dataflow-based programming models for heterogeneous MPSoCs", In Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM'15) in conjunction with the HiPEAC Conference (invited talk), Jan 2015. [Bibtex & Downloads]
Tools and dataflow-based programming models for heterogeneous MPSoCs
Reference
Jeronimo Castrillon, "Tools and dataflow-based programming models for heterogeneous MPSoCs", In Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM'15) in conjunction with the HiPEAC Conference (invited talk), Jan 2015.
Bibtex
@Misc{castrillon2015pegpum,
Title={Tools and dataflow-based programming models for heterogeneous MPSoCs},
Author={Castrillon, Jeronimo},
HowPublished={Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM'15) in conjunction with the HiPEAC Conference (invited talk)},
Year={2015},
Month=jan,
Location={Amsterdam, The Netherlands},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150121_castrillon_pegpum.pdf}
}Downloads
150121_castrillon_pegpum [PDF]
Related Paths
Permalink
- Jeronimo Castrillon, "Simulation and Estimation for MPSoC Programming Tools", In Proceeding: Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'15), in conjunction with the HiPEAC Conference (keynote), Jan 2015. [Bibtex & Downloads]
Simulation and Estimation for MPSoC Programming Tools
Reference
Jeronimo Castrillon, "Simulation and Estimation for MPSoC Programming Tools", In Proceeding: Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'15), in conjunction with the HiPEAC Conference (keynote), Jan 2015.
Bibtex
@InProceedings{castrillon2015rapido,
Title={Simulation and Estimation for MPSoC Programming Tools},
Author={Castrillon, Jeronimo},
Booktitle={Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'15), in conjunction with the HiPEAC Conference (keynote)},
Year={2015},
Month=jan,
Location={Amsterdam, The Netherlands},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150121_castrillon_rapido.pdf}
}Downloads
150121_castrillon_rapido [PDF]
Related Paths
Permalink
2014
- Jeronimo Castrillon, "Compiler Flow for Processors and Systems", In Winter School on Design, Programming and Applications of Multi Processor System on Chip (invited talk), Nov 2014. [Bibtex & Downloads]
Compiler Flow for Processors and Systems
Reference
Jeronimo Castrillon, "Compiler Flow for Processors and Systems", In Winter School on Design, Programming and Applications of Multi Processor System on Chip (invited talk), Nov 2014.
Bibtex
@Misc{castrillon2015tunis,
Title={Compiler Flow for Processors and Systems},
Author={Castrillon, Jeronimo},
HowPublished={Winter School on Design, Programming and Applications of Multi Processor System on Chip (invited talk)},
Year={2014},
Month=nov,
Location={Tunis, Tunisia},
url={https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/141127_castrillon_compilers.pdf}
}Downloads
141127_castrillon_compilers [PDF]
Related Paths
Permalink
- Timothy Hansen, Florina M. Ciorba, Anthony A. Maciejewski, Howard Jay Siegel, Srishti Srivastava, Ioana Banicescu, "Heuristics for robust allocation of resources to parallel applications with uncertain execution times in heterogeneous systems with uncertain availability" , Proceedings of the World Congress on Engineering (WCE), Jul 2014. [Bibtex & Downloads]
Heuristics for robust allocation of resources to parallel applications with uncertain execution times in heterogeneous systems with uncertain availability
Reference
Timothy Hansen, Florina M. Ciorba, Anthony A. Maciejewski, Howard Jay Siegel, Srishti Srivastava, Ioana Banicescu, "Heuristics for robust allocation of resources to parallel applications with uncertain execution times in heterogeneous systems with uncertain availability" , Proceedings of the World Congress on Engineering (WCE), Jul 2014.
Bibtex
@InProceedings{ auto-key*bh,
author = {Timothy Hansen and Florina M. Ciorba and Anthony A. Maciejewski and Howard Jay Siegel and Srishti Srivastava and Ioana Banicescu},
title = {Heuristics for robust allocation of resources to parallel applications with uncertain execution times in heterogeneous systems with uncertain availability},
booktitle = {Proceedings of the World Congress on Engineering (WCE)},
month = jul,
year = {2014},
project = {A04}
}
Downloads
No Downloads available for this publication
Related Paths
Permalink
- Jeronimo Castrillon, Rainer Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap", Springer, pp. 258, 2014. ([link]) [Bibtex & Downloads]
Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap
Reference
Jeronimo Castrillon, Rainer Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap", Springer, pp. 258, 2014. ([link])
Bibtex
@Book{castrillon14_springer,
Title={Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap},
Author={Castrillon, Jeronimo and Leupers, Rainer},
Publisher={Springer},
Year={2014},
ISBN={978-3-319-00675-8},
Pages={258}
}Downloads
No Downloads available for this publication
Permalink
- Diandian Zhang, Jeronimo Castrillon, Stefan Schürmans, Gerd Ascheid, Rainer Leupers, and Bart Vanthournout, "System-Level Analysis of MPSoCs with a Hardware Scheduler", Hershey: IGI Global, pp. 335–367, 2014. [doi] [Bibtex & Downloads]
System-Level Analysis of MPSoCs with a Hardware Scheduler
Reference
Diandian Zhang, Jeronimo Castrillon, Stefan Schürmans, Gerd Ascheid, Rainer Leupers, and Bart Vanthournout, "System-Level Analysis of MPSoCs with a Hardware Scheduler", Hershey: IGI Global, pp. 335–367, 2014. [doi]
Bibtex
@InBook{zhang2014_inbook,
Title={System-Level Analysis of MPSoCs with a Hardware Scheduler},
Author={Diandian Zhang and Jeronimo Castrillon and Stefan Schürmans and Gerd Ascheid and Rainer Leupers and and Bart Vanthournout},
Chapter={Advancing Embedded Systems and Real-Time Communications with Emerging Technologies},
Editor={Seppo Virtanen},
Pages={335--367},
Publisher={Hershey: IGI Global},
doi={10.4018/978-1-4666-6034-2.ch014},
Year={2014},
}Downloads
No Downloads available for this publication
Permalink
Previous Years
- [Bibtex & Downloads]
Reference
Bibtex
@Misc{castrillon_date2024,
author = {Castrillon, Jeronimo},
date = {2024-03},
title = {Automatic optimization for heterogeneous in-memory computing},
howpublished = {Focus Session, Design, Automation and Test in Europe Conference (DATE) (invited talk)},
location = {Valencia, Spain},
abstract = {Fuelled by exciting advances in materials and devices, in-memory computing architectures now represent a promising avenue to advance computing systems. Plenty of manual designs have already demonstrated orders of magnitude improvement in compute efficiency compared to classical Von Neumann machines in different application domains. In this talk we discuss automation flows for programming and exploring the parameter space of in-memory architectures. We report on current efforts on building an extensible framework around the MLIR compiler infrastructure to abstract from individual technologies to foster re-use. Concretely, we present optimising flows for in-memory accelerators based on cross-bars, on content addressable memories and bulk-wise logic operations. We believe this kind of automation to be key to more quickly navigate the heterogeneous landscape of in-memory accelerators and to bring the benefits of emerging architectures to a boarder range of applications.},
month = mar,
year = {2024},
}Downloads
No Downloads available for this publication
Permalink
- [Bibtex & Downloads]
Reference
Bibtex
@InProceedings{suchert_fccm24,
author = {Stephanie Soldavini and Felix Suchert and Serena Curzel and Michele Fiorito and Karl Friedrich Alexander Friebel and Fabrizio Ferrandi and Radim Cmar and Jeronimo Castrillon and Christian Pilato},
booktitle = {Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract)},
title = {Etna: {MLIR}-Based System-Level Design and Optimization for Transparent Application Execution on {CPU}-{FPGA} Nodes},
location = {Orlando, CA USA},
pages = {1pp},
series = {FCCM’24},
month = may,
year = {2024},
}Downloads
No Downloads available for this publication
Permalink