Fazal Hameed

E-mail: fazal.hameed@tu-dresden.de

Phone: +49 (0)351 463 43729

Visitor's Address:
Georg-Schumann-Str. 7A
2nd floor, room 205
01187 Dresden
Germany

Curriculum Vitae

Fazal Hameed joined the compiler-construction group of Prof. Castrillon in March 2016. Before that, he worked as a postdoctoral researcher at the Chair of Dependable and Nano Computing (CDNC) at the Karlsruhe Institute of Technology (KIT), Germany, where he worked mainly in the architecture group with a focus on memories. In the compiler-construction group, he is currently developing a simulation framework to evaluate the performance, energy, and reliability of heterogeneous multi-core system architectures. To this end, a cross-layer framework is being developed that covers the entire abstraction stack, including the device, circuit, architecture, operating system, and upper software layers. The project includes the development of system-level architectures that combine heterogeneous logic and memory components, in particular in view of the increasing heterogeneity of future systems.

Fazal Hameed received his Ph.D. (Dr.-Ing.) in Computer Science from the Karlsruhe Institute of Technology (KIT), Germany, in 2015. He received a best paper nomination at CODES+ISSS'13 for his work on DRAM cache management in multi-core systems. Mr. Hameed has also served as an external reviewer for major conferences in embedded systems and computer architecture.

Publications

2024
Asif Ali Khan, Fazal Hameed, Taha Shahroodi, Alex K. Jones, Jeronimo Castrillon, "Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, 4pp, Jan 2024. [doi] [Bibtex & Downloads]

Bibtex

@Article{khan_ieeecal24,
author = {Asif Ali Khan and Fazal Hameed and Taha Shahroodi and Alex K. Jones and Jeronimo Castrillon},
title = {Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory},
pages = {4pp},
journal = {IEEE Computer Architecture Letters},
month = jan,
publisher = {IEEE},
year = {2024},
doi = {10.1109/LCA.2024.3350701},
url = {https://ieeexplore.ieee.org/document/10409506},
}

Downloads

2401_Khan_IEEECAL [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3714


2023
Asif Ali Khan, Sebastien Ollivier, Fazal Hameed, Jeronimo Castrillon, Alex K. Jones, "DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories", In IEEE Transactions on Computers, IEEE, Mar 2023. [doi] [Bibtex & Downloads]

Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6%, and 70.8% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2%.

Bibtex

@Article{khan_toc23,
author = {Asif Ali Khan and Sebastien Ollivier and Fazal Hameed and Jeronimo Castrillon and Alex K. Jones},
date = {2023-03},
journal = {IEEE Transactions on Computers},
doi = {10.1109/TC.2023.3257509},
title = {DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories},
abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6\%, and 70.8\% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2\%.},
month = mar,
numpages = {15},
publisher = {IEEE},
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3524


2022
Fazal Hameed, Jeronimo Castrillon, "BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), vol. 41, pp. 5288–5298, Dec 2022. [doi] [Bibtex & Downloads]

Abstract
Racetrack memory (RTM) is a promising nonvolatile memory that provides multi-bit storage cells achieving a higher area and leakage energy efficiency compared to contemporary volatile and non-volatile memories. These features make RTM a potential candidate to be used as a Last-Level-Cache (LLC). One drawback of the multi-bit RTM cell is the serialized access to the stored data, resulting in a shift penalty to access a particular bit within the cell. This overhead is particularly critical for LLC tags, for which prior RTM designs place tags either in SRAM or in single-bit RTM cells. While this avoids shifting, these designs require large number of leaky cells incurring high energy consumption. To address this problem, this paper proposes an energy efficient RTM design called BlendCache that efficiently stores the tags in the leakage optimized multi-bit RTM cells. To reduce the RTM shift penalty of these cells, BlendCache exploits the spatial locality of programs by maximizing accesses to nearby locations in RTM. Employing 32-bit RTM cells for a single-core, BlendCache reduces the energy consumption by 20.8% and area by 15.2% compared to the state-of-the-art while its impact on performance is negligible. For a 4-core system, the energy improvement translates to 35.9% with 3% performance degradation.

Bibtex

@Article{hameed_tcad22,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {{BlendCache}: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture},
doi = {10.1109/TCAD.2022.3161198},
issn = {0278-0070},
issue = {12},
pages = {5288--5298},
url = {https://ieeexplore.ieee.org/document/9739802},
volume = {41},
abstract = {Racetrack memory (RTM) is a promising nonvolatile memory that provides multi-bit storage cells achieving a higher area and leakage energy efficiency compared to contemporary volatile and non-volatile memories. These features make RTM a potential candidate to be used as a Last-Level-Cache (LLC). One drawback of the multi-bit RTM cell is the serialized access to the stored data, resulting in a shift penalty to access a particular bit within the cell. This overhead is particularly critical for LLC tags, for which prior RTM designs place tags either in SRAM or in single-bit RTM cells. While this avoids shifting, these designs require large number of leaky cells incurring high energy consumption. To address this problem, this paper proposes an energy efficient RTM design called BlendCache that efficiently stores the tags in the leakage optimized multi-bit RTM cells. To reduce the RTM shift penalty of these cells, BlendCache exploits the spatial locality of programs by maximizing accesses to nearby locations in RTM. Employing 32-bit RTM cells for a single-core, BlendCache reduces the energy consumption by 20.8\% and area by 15.2\% compared to the state-of-the-art while its impact on performance is negligible. For a 4-core system, the energy improvement translates to 35.9\% with 3\% performance degradation.},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)},
month = dec,
project = {tracesymm,cfaed},
year = {2022},
}

Downloads

2204_Hameed_TCAD [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3333

Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, Jeronimo Castrillon, "DNA Pre-alignment Filter using Processing Near Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, pp. 1–4, Jul 2022. [doi] [Bibtex & Downloads]

Abstract
Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high background and refresh energy consumption as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM), an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68% and 52%, respectively, compared to the state-of-the-art DRAM-based architecture.

Bibtex

@Article{hameed_ieeecal22,
author = {Fazal Hameed and Asif Ali Khan and Sebastien Ollivier and Alex K. Jones and Jeronimo Castrillon},
date = {2022-08},
journal = {IEEE Computer Architecture Letters},
title = {DNA Pre-alignment Filter using Processing Near Racetrack Memory},
abstract = {Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high background and refresh energy consumption as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM), an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68\% and 52\%, respectively, compared to the state-of-the-art DRAM-based architecture.},
month = jul,
numpages = {4},
publisher = {IEEE},
year = {2022},
doi = {10.1109/LCA.2022.3194263},
pages = {1--4},
url = {https://ieeexplore.ieee.org/document/9841612},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3361

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees", In IEEE Transactions on Computers, pp. 1–14, Jul 2022. [doi] [Bibtex & Downloads]

Bibtex

@Article{hakert_toc22,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
title = {{ROLLED}: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees},
journal = {IEEE Transactions on Computers},
month = jul,
year = {2022},
doi = {10.1109/TC.2022.3197094},
pages = {1--14},
url = {https://ieeexplore.ieee.org/document/9851943},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3365


2021
Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, pp. 1111–1116, Jul 2021. [doi] [Bibtex & Downloads]

Bibtex

@InProceedings{khan_dac21,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
booktitle = {Proceedings of the 58th Annual Design Automation Conference (DAC'21)},
title = {{BLO}wing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory},
doi = {10.1109/DAC18074.2021.9586167},
location = {San Francisco, California},
pages = {1111--1116},
series = {DAC '21},
url = {https://ieeexplore.ieee.org/document/9586167},
publisher = {ACM},
month = jul,
year = {2021},
}

Downloads

2112_Hakert_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2975

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), 12 pp., Jun 2021. [doi] [Bibtex & Downloads]

Abstract
Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments, however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligner themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferring between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44% and 21-49%, respectively.

Bibtex

@Article{hameed_tetc21,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
journal = {IEEE Transactions on Emerging Topics in Computing (IEEE TETC)},
title = {{ALPHA}: A Novel Algorithm-Hardware Co-design for Accelerating {DNA} Seed Location Filtering},
pages = {12 pp.},
abstract = {Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments, however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligner themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferring between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54\%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44\% and 21-49\%, respectively.},
month = jun,
year = {2021},
doi = {10.1109/TETC.2021.3093840},
issn = {2168-6750},
}

Downloads

2107_hameed_TETC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3116


2020
Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, vol. 70, no. 11, pp. 1914–1927, Oct 2020. [doi] [Bibtex & Downloads]

Abstract
In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block-based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7% compared to a state-of-the-art direct-mapped LLC design and by 7.2% compared to an existing associative LLC design.

Bibtex

@Article{hameed_tc20,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling},
journal = {IEEE Transactions on Computers},
year = {2020},
month = oct,
abstract = {In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block-based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7\% compared to a state-of-the-art direct-mapped LLC design and by 7.2\% compared to an existing associative LLC design.},
doi = {10.1109/TC.2020.3029615},
url = {https://ieeexplore.ieee.org/document/9220805},
issn = {0018-9340},
numpages = {14},
volume={70},
number={11},
pages={1914-1927},
}

Downloads

2010_Hameed_TC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2880

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi] [Bibtex & Downloads]

Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32% and 73% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80%.

Bibtex

@Article{khan_tecs20,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
year = {2020},
month = sep,
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {19},
number = {6},
issn = {1539-9087},
url = {https://doi.org/10.1145/3396235},
doi = {10.1145/3396235},
articleno = {44},
numpages = {26},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32\% and 73\% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80\%.},
}

Downloads

2009_Khan_TECS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2649

Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi] [Bibtex & Downloads]

Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.

Bibtex

@InProceedings{khan_date20,
author = {Asif Ali Khan and Andr{\'e}s Goens and Fazal Hameed and Jeronimo Castrillon},
title = {Generalized Data Placement Strategies for Racetrack Memories},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
publisher = {IEEE},
location = {Grenoble, France},
month = mar,
isbn = {978-3-9819263-4-7},
pages = {1502--1507},
doi = {10.23919/DATE48585.2020.9116245},
url = {https://ieeexplore.ieee.org/document/9116245},

abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46\% and 55\% respectively compared to the state-of-the-art.},
}

Downloads

2003_Khan_DATE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2547

Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi] [Bibtex & Downloads]

Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade

Reference

Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi]

Bibtex

@Article{khan_pieee20,
author = {Robin Bl{\"a}sing and Asif Ali Khan and Panagiotis Ch. Filippou and Chirag Garg and Fazal Hameed and Jeronimo Castrillon and Stuart S. P. Parkin},
title = {Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade},
journal = {Proceedings of the IEEE},
year = {2020},
month = mar,
volume={108},
number={8},
pages={1303-1321},
doi = {10.1109/JPROC.2020.2975719},
url = {https://ieeexplore.ieee.org/document/9045991},
}

Downloads

2003_Khan_JPROC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2599


2019
Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi] [Bibtex & Downloads]

ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0

Reference

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi]

Abstract
Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%.

Bibtex

@Article{khan_taco19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart S. P. Parkin and Jeronimo Castrillon},
title = {ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
issue_date = {December 2019},
volume = {16},
number = {4},
month = dec,
year = {2019},
issn = {1544-3566},
pages = {56:1--56:23},
articleno = {56},
numpages = {23},
url = {http://doi.acm.org/10.1145/3372489},
doi = {10.1145/3372489},
acmid = {3372489},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5\%, outperforming the state of the art by up to 16.1\%.},
}

Downloads

1912_Khan_TACO [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2289

Fazal Hameed, Jeronimo Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 27, no. 10, pp. 2375-2386, Oct 2019. [doi] [Bibtex & Downloads]

A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement

Reference

Fazal Hameed, Jeronimo Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 27, no. 10, pp. 2375-2386, Oct 2019. [doi]

Abstract
High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78% and improves the average performance by 13% compared to a recently proposed STT-RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24% and 17.1% respectively.

Bibtex

@Article{hameed_tvlsi19,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {A Novel Hybrid {DRAM}/{STT-RAM} {L}ast-{L}evel-{C}ache Architecture for Performance, Energy and Endurance Enhancement},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2019},
month = oct,
abstract = {High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78\% and improves the average performance by 13\% compared to a recently proposed STT-RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24\% and 17.1\% respectively.},
volume = {27},
number = {10},
pages = {2375-2386},
numpages = {12},
doi={10.1109/TVLSI.2019.2918385},
url = {https://ieeexplore.ieee.org/document/8734763},
}

Downloads

1905_Hameed_TVLSI [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2454

Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6pp, New York, NY, USA, Jul 2019. [doi] [Bibtex & Downloads]

SHRIMP: Efficient Instruction Delivery with Domain Wall Memory

Reference

Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6pp, New York, NY, USA, Jul 2019. [doi]

Bibtex

@InProceedings{multanen_islped19,
author = {Joonas Multanen and Asif Ali Khan and Pekka J{\"a}{\"a}skel{\"a}inen and Fazal Hameed and Jeronimo Castrillon},
title = {{SHRIMP}: Efficient Instruction Delivery with Domain Wall Memory},
booktitle = {Proceedings of the International Symposium on Low Power Electronics and Design},
year = {2019},
month = jul,
series = {ISLPED '19},
location = {Lausanne, Switzerland},
pages = {6pp},
numpages = {6},
publisher = {ACM},
address = {New York, NY, USA},
doi={10.1109/ISLPED.2019.8824954},
url = {https://ieeexplore.ieee.org/document/8824954},
}

Downloads

1907_Multanen_ISLPED [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2452

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi] [Bibtex & Downloads]

Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads

Reference

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi]

Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.

Bibtex

@InProceedings{kahn_lctes19,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads},
booktitle = {Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES)},
series = {LCTES 2019},
pages = {5--18},
numpages = {14},
isbn = {978-1-4503-6724-0},
doi = {10.1145/3316482.3326351},
url = {http://doi.acm.org/10.1145/3316482.3326351},
acmid = {3326351},
year = {2019},
month = jun,
location = {Phoenix, AZ, USA},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24\% and 74\% respectively compared to an iso-capacity SRAM.},
}

Downloads

1906_Khan_LCTES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2419

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019. [doi] [Bibtex & Downloads]

RTSim: A Cycle-accurate Simulator for Racetrack Memories

Reference

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019. [doi]

Bibtex

@Article{khan_ieeecal19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart Parkin and Jeronimo Castrillon},
title = {{RTS}im: A Cycle-accurate Simulator for Racetrack Memories},
journal = {IEEE Computer Architecture Letters},
year = {2019},
volume = {18},
number = {1},
pages = {43--46},
month = jan,
doi = {10.1109/LCA.2019.2899306},
issn = {1556-6056},
publisher = {IEEE},
url = {https://ieeexplore.ieee.org/document/8642352}
}

Downloads

1902_khan_IEEECAL [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2288


2018
Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018. [doi] [Bibtex & Downloads]

Performance and Energy Efficient Design of STT-RAM Last-Level-Cache

Reference

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018. [doi]

Abstract
Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75% and improve the system performance by 6.5%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82% and improves system performance by 6.8%.

Bibtex

@Article{hameed_tvlsi18,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Performance and Energy Efficient Design of STT-RAM Last-Level-Cache},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2018},
volume = {26},
number = {6},
pages = {1059--1072},
month = jun,
abstract = {Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75\% and improve the system performance by 6.5\%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82\% and improves system performance by 6.8\%.},
doi = {10.1109/TVLSI.2018.2804938},
issn = {1063-8210},
numpages = {14},
url = {http://ieeexplore.ieee.org/document/8307465/}
}

Downloads

1803_Hameed_TVLSI [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2099

Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018. [Bibtex & Downloads]

STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement

Reference

Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018.

Bibtex

@InProceedings{hameed_nvmw18,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement},
booktitle = {Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018)},
year = {2018},
month = mar,
location = {San Diego, CA, USA},
numpages = {2}
}

Downloads

1803_Hameed_NVMW [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1795

Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi] [Bibtex & Downloads]

NVMain Extension for Multi-Level Cache Systems

Reference

Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi]

Bibtex

@InProceedings{khan_rapido18,
author = {Asif Ali Khan and Fazal Hameed and Jeronimo Castrillon},
title = {NVMain Extension for Multi-Level Cache Systems},
booktitle = {Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
series = {RAPIDO '18},
year = {2018},
month = jan,
pages = {7:1--7:6},
articleno = {7},
numpages = {6},
url = {http://doi.acm.org/10.1145/3180665.3180672},
doi = {10.1145/3180665.3180672},
acmid = {3180672},
publisher = {ACM},
address = {New York, NY, USA},
location = {Manchester, United Kingdom},
isbn = {978-1-4503-6417-1},
}

Downloads

1801_Khan_RAPIDO [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2098


2017
Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi] [Bibtex & Downloads]

Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache

Reference

Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi]

Bibtex

@InProceedings{hameed_memsys17,
author = {Fazal Hameed and Christian Menard and Jeronimo Castrillon},
title = {Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache},
booktitle = {Proceedings of the International Symposium on Memory Systems (MemSys'17)},
series = {MEMSYS '17},
year = {2017},
month = oct,
isbn = {978-1-4503-5335-9},
location = {Alexandria, Virginia},
pages = {141--151},
numpages = {11},
url = {http://doi.acm.org/10.1145/3132402.3132414},
doi = {10.1145/3132402.3132414},
acmid = {3132414},
publisher = {ACM},
address = {New York, NY, USA},
}

Downloads

1710_Hameed_Memsys [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1476

Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi] [Bibtex & Downloads]

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Reference

Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi]

Abstract
State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag-Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2% and 11.4% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62%.

Bibtex

@InProceedings{hameed_date17,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization},
booktitle = {Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE)},
year = {2017},
series = {DATE '17},
pages = {362--367},
month = mar,
publisher = {EDA Consortium},
abstract = {State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag-Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70\% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2\% and 11.4\% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62\%.},
isbn = {978-3-9815370-8-6},
doi={10.23919/DATE.2017.7927017},
url = {http://ieeexplore.ieee.org/document/7927017/},
location = {Lausanne, Switzerland}
}

Downloads

1703_Hameed_DATE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1254


S. Mohanachandran Nair, R. Bishnoi, M. S. Golanbari, F. Oboril, F. Hameed, M. B. Tahoori, "VAET-STT: A Variation Aware STT-MRAM Analysis and Design Space Exploration Tool", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2017.

2016 and Before

For my publications in the previous years, please have a look at my Google Scholar profile.
