Fazal Hameed |
||
![]() |
Phone Fax Visitor's Address |
+49 (0)351 463 43729 +49 (0)351 463 39995 Georg-Schumann-Str. 7A |
Fazal Hameed joined the compiler-construction group of Prof. Castrillon in March 2016. Before, he worked as a Post-doctoral researcher at the Chair of Dependable and Nano Computing (CDNC) Karlsruhe Institute of Technology (KIT), Germany. There, he mainly worked in the architecture group with a focus on memories. In the Chair of Compiler group, he is currently working on the development of a simulation framework to evaluate the performance, energy, and reliability of heterogenous multi-core system architecture. For this purpose, development of cross-layer framework is in progress covering the entire abstraction stack including device, circuit, architectural, operating system, and upper software layers. The project includes development of system level architectures combining heterogenous logic and memory components, in particular with an increased heterogeneity within the future systems.
Fazal Hameed received his Ph.D. (Dr.-Ing.) in Computer Science from the Karlsruhe Institute of Technology (KIT) Germany in 2015. He received CODES+ISSS'13 best paper nomination for his work on DRAM cache management in multi-core systems. Mr. Hameed has also served as an external reviewer for major conferences in embedded systems and computer architecture.
2021
- Christian Hakert, Asif-Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory" (to appear), Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, Jul 2021. [Bibtex & Downloads]
BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory
Reference
Christian Hakert, Asif-Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory" (to appear), Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, Jul 2021.
Bibtex
@InProceedings{khan_dac21,
author = {Christian Hakert and Asif-Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
booktitle = {Proceedings of the 58th Annual Design Automation Conference (DAC'21)},
title = {{BLO}wing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory},
location = {San Francisco, California},
publisher = {ACM},
series = {DAC '21},
month = jul,
year = {2021},
}Downloads
No Downloads available for this publication
Permalink
2020
- Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, Oct 2020. [doi] [Bibtex & Downloads]
Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling
Reference
Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, Oct 2020. [doi]
Abstract
In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block- based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7% compared to a state-of-the-art direct-mapped LLC design and by 7.2% compared to an existing associative LLC design.
Bibtex
@Article{hameed_tc20,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling},
journal = {IEEE Transactions on Computers},
year = {2020},
month = oct,
abstract = {In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block- based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7\% compared to a state-of-the-art direct-mapped LLC design and by 7.2\% compared to an existing associative LLC design.},
doi = {10.1109/TC.2020.3029615},
url = {https://ieeexplore.ieee.org/document/9220805},
issn = {0018-9340},
numpages = {14},
}Downloads
2010_Hameed_TC [PDF]
Related Paths
Permalink
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi] [Bibtex & Downloads]
Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories
Reference
Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi]
Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32% and 73% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80%.
Bibtex
@Article{khan_tecs20,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
year = {2020},
month = sep,
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {19},
number = {6},
issn = {1539-9087},
url = {https://doi.org/10.1145/3396235},
doi = {10.1145/3396235},
articleno = {44},
numpages = {26},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32\% and 73\% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80\%.},
}Downloads
2009_Khan_TECS [PDF]
Related Paths
Permalink
- Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi] [Bibtex & Downloads]
Generalized Data Placement Strategies for Racetrack Memories
Reference
Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi]
Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.
Bibtex
@InProceedings{khan_date20,
author = {Asif Ali Khan and Andr{\'e}s Goens and Fazal Hameed and Jeronimo Castrillon},
title = {Generalized Data Placement Strategies for Racetrack Memories},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
publisher = {IEEE},
location = {Grenoble, France},
month = mar,
isbn = {978-3-9819263-4-7},
pages = {1502--1507},
doi = {10.23919/DATE48585.2020.9116245},
url = {https://ieeexplore.ieee.org/document/9116245},
abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.},
}Downloads
2003_Khan_DATE [PDF]
Related Paths
Permalink
- Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi] [Bibtex & Downloads]
Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade
Reference
Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi]
Bibtex
@Article{khan_pieee20,
author = {Robin Bl{\"a}sing and Asif Ali Khan and Panagiotis Ch. Filippou and Chirag Garg and Fazal Hameed and Jeronimo Castrillon and Stuart S. P. Parkin},
title = {Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade},
journal = {Proceedings of the IEEE},
year = {2020},
month = mar,
volume={108},
number={8},
pages={1303-1321},
doi = {10.1109/JPROC.2020.2975719},
url = {https://ieeexplore.ieee.org/document/9045991},
}Downloads
2003_Khan_JPROC [PDF]
Related Paths
Permalink
2019
- Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi] [Bibtex & Downloads]
ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0
Reference
Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi]
Abstract
Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%.
Bibtex
@Article{khan_taco19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart S. P. Parkin and Jeronimo Castrillon},
title = {ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
issue_date = {December 2019},
volume = {16},
number = {4},
month = dec,
year = {2019},
issn = {1544-3566},
pages = {56:1--56:23},
articleno = {56},
numpages = {23},
url = {http://doi.acm.org/10.1145/3372489},
doi = {10.1145/3372489},
acmid = {3372489},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5\%, outperforming the state of the art by up to 16.1\%.},
}Downloads
1912_Khan_TACO [PDF]
Related Paths
Permalink
- Fazal Hameed, Jeronimo Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 27, no. 10, pp. 2375-2386, Oct 2019. [doi] [Bibtex & Downloads]
A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement
Reference
Fazal Hameed, Jeronimo Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 27, no. 10, pp. 2375-2386, Oct 2019. [doi]
Abstract
High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78% and improves the average performance by 13% compared to a recently proposed STT- RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24% and 17.1% respectively.
Bibtex
@Article{hameed_tvlsi19,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {A Novel Hybrid {DRAM}/{STT-RAM} {L}ast-{L}evel-{C}ache Architecture for Performance, Energy and Endurance Enhancement},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2019},
month = oct,
abstract = {High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78\% and improves the average performance by 13\% compared to a recently proposed STT- RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24\% and 17.1\% respectively.},
volume = {27},
number = {10},
pages = {2375-2386},
numpages = {12pp},
doi={10.1109/TVLSI.2019.2918385},
url = {https://ieeexplore.ieee.org/document/8734763},
}Downloads
1905_Hameed_TVLSI [PDF]
Related Paths
Permalink
- Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, pp. 6pp, New York, NY, USA, Jul 2019. [doi] [Bibtex & Downloads]
SHRIMP: Efficient Instruction Delivery with Domain Wall Memory
Reference
Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, pp. 6pp, New York, NY, USA, Jul 2019. [doi]
Bibtex
@InProceedings{multanen_islped19,
author = {Joonas Multanen and Asif Ali Khan and Pekka J{\"a}{\"a}skel{\"a}inen and Fazal Hameed and Jeronimo Castrillon},
title = {{SHRIMP}: Efficient Instruction Delivery with Domain Wall Memory},
booktitle = {Proceedings of the International Symposium on Low Power Electronics and Design},
year = {2019},
month = jul,
series = {ISLPED '19},
location = {Lausanne, Switzerland},
pages = {6pp},
numpages = {6},
publisher = {ACM},
address = {New York, NY, USA},
doi={10.1109/ISLPED.2019.8824954},
url = {https://ieeexplore.ieee.org/document/8824954},
}Downloads
1907_Multanen_ISLPED [PDF]
Related Paths
Permalink
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi] [Bibtex & Downloads]
Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads
Reference
Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi]
Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.
Bibtex
@InProceedings{kahn_lctes19,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads},
booktitle = {Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES)},
series = {LCTES 2019},
pages = {5--18},
numpages = {12},
numpages = {14},
isbn = {978-1-4503-6724-0/19/06},
doi = {10.1145/3316482.3326351},
url = {http://doi.acm.org/10.1145/3316482.3326351},
acmid = {3326351},
year = {2019},
month = jun,
location = {Phoenix, AZ, USA},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.},
acmid = {3326351},
}Downloads
1906_Khan_LCTES [PDF]
Related Paths
Permalink
- Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019. [doi] [Bibtex & Downloads]
RTSim: A Cycle-accurate Simulator for Racetrack Memories
Reference
Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019. [doi]
Bibtex
@Article{khan_ieeecal19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart Parkin and Jeronimo Castrillon},
title = {{RTS}im: A Cycle-accurate Simulator for Racetrack Memories},
journal = {IEEE Computer Architecture Letters},
year = {2019},
volume = {18},
number = {1},
pages = {43--46},
month = jan,
doi = {10.1109/LCA.2019.2899306},
issn = {1556-6056},
publisher = {IEEE},
url = {https://ieeexplore.ieee.org/document/8642352}
}Downloads
1902_khan_IEEECAL [PDF]
Related Paths
Permalink
2018
- Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018. [doi] [Bibtex & Downloads]
Performance and Energy Efficient Design of STT-RAM Last-Level-Cache
Reference
Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018. [doi]
Abstract
Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75% and improve the system performance by 6.5%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82% and improves system performance by 6.8%.
Bibtex
@Article{hameed_tvlsi18,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Performance and Energy Efficient Design of STT-RAM Last-Level-Cache},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2018},
volume = {26},
number = {6},
pages = {1059--1072},
month = jun,
abstract = {Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75\% and improve the system performance by 6.5\%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82\% and improves system performance by 6.8\%.},
doi = {10.1109/TVLSI.2018.2804938},
file = {:/Users/jeronimocastrillon/Documents/Academic/mypapers/1803_Hameed_TVLSI.pdf:PDF},
issn = {1063-8210},
numpages = {14},
url = {http://ieeexplore.ieee.org/document/8307465/}
}Downloads
1803_Hameed_TVLSI [PDF]
Related Paths
Permalink
- Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018. [Bibtex & Downloads]
STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement
Reference
Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018.
Bibtex
@InProceedings{hameed_nvmw18,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement},
booktitle = {Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018)},
year = {2018},
month = mar,
location = {San Diego, CA, USA},
numpages = {2}
}Downloads
1803_Hameed_NVMW [PDF]
Related Paths
Permalink
- Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi] [Bibtex & Downloads]
NVMain Extension for Multi-Level Cache Systems
Reference
Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi]
Bibtex
@InProceedings{khan_rapido18,
author = {Asif Ali Khan and Fazal Hameed and Jeronimo Castrillon},
title = {NVMain Extension for Multi-Level Cache Systems},
booktitle = {Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
series = {RAPIDO '18},
year = {2018},
month = jan,
pages = {7:1--7:6},
articleno = {7},
numpages = {6},
url = {http://doi.acm.org/10.1145/3180665.3180672},
doi = {10.1145/3180665.3180672},
acmid = {3180672},
publisher = {ACM},
address = {New York, NY, USA},
location = {Manchester, United Kingdom},
isbn = {978-1-4503-6417-1},
}Downloads
1801_Khan_RAPIDO [PDF]
Related Paths
Permalink
2017
- Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi] [Bibtex & Downloads]
Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache
Reference
Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi]
Bibtex
@InProceedings{hameed_memsys17,
author = {Fazal Hameed and Christian Menard and Jeronimo Castrillon},
title = {Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache},
booktitle = {Proceedings of the International Symposium on Memory Systems (MemSys'17)},
series = {MEMSYS '17},
year = {2017},
month = oct,
isbn = {978-1-4503-5335-9},
location = {Alexandria, Virginia},
pages = {141--151},
numpages = {11},
url = {http://doi.acm.org/10.1145/3132402.3132414},
doi = {10.1145/3132402.3132414},
acmid = {3132414},
publisher = {ACM},
address = {New York, NY, USA},
}Downloads
1710_Hameed_Memsys [PDF]
Related Paths
Permalink
- Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi] [Bibtex & Downloads]
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Reference
Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi]
Abstract
State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag- Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2% and 11.4% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62%.
Bibtex
@InProceedings{hameed_date17,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization},
booktitle = {Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE)},
year = {2017},
series = {DATE '17},
pages = {362--367},
month = mar,
publisher = {EDA Consortium},
abstract = {State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag- Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70\% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2\% and 11.4\% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62\%.},
isbn = {978-3-9815370-8-6},
doi={10.23919/DATE.2017.7927017},
url = {http://ieeexplore.ieee.org/document/7927017/},
location = {Lausanne, Switzerland}
}Downloads
1703_Hameed_DATE [PDF]
Related Paths
Permalink
S. Mohanachandran Nair, R. Bishnoi, M. S. Golanbari, F. Oboril, F. Hameed, and M. B. Tahoori, "VAET-STT: A Variation Aware STT-MRAM Analysis and Design Space Exploration Tool", in IEEE Transcactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2017
2016 and Before
For my publications in the previous years, please have a look at my Google Scholar profile.