João Paulo Cardoso de Lima

Phone: +49 (0)351 463 42336
Visitor's Address: Helmholtzstrasse 18, 3rd floor, BAR III59, 01069 Dresden
Curriculum Vitae
I received my bachelor's degree in Computer Engineering from the Federal University of Santa Catarina (UFSC) in 2017, followed by a master's degree in 2019 and a Ph.D. in Computer Science from the Federal University of Rio Grande do Sul (UFRGS) in 2025. I joined the Chair of Compiler Construction to research and develop code optimizations for emerging AI systems as part of the ScaDS.AI Dresden/Leipzig center. My main research interests include processing-in-memory architectures, system design, hardware/software co-design, design automation tools, compilers, reliability evaluation, and fault-tolerance methods. On the application side, I am particularly interested in making machine learning algorithms more energy-efficient through memory-centric optimizations, whether compiler-driven, hand-tuned, or enabled by domain-specific tools. A complete list of my publications can be found on my Google Scholar profile.
My research focuses on advancing energy-efficient and high-performance computing through approaches such as computing-near-memory (CNM) and computing-in-memory (CIM), especially for machine learning (ML) and data analytics applications. I also work on optimizing ML models for energy efficiency, which is essential for both IoT devices and data centres, where energy use is a growing concern. I can help you with these topics for project work or a Bachelor's or Master's thesis, especially if you are interested in hardware/software co-design, energy-efficient ML, and emerging computing paradigms.
- System and Compiler Design for Emerging CNM/CIM Architectures
Our goal is to enable the portability of AI and Big Data applications across existing CNM/CIM systems and novel accelerator designs, prioritizing performance, accuracy, and energy efficiency. Given the substantial differences compared to conventional machines, new compiler abstractions and frameworks are crucial to fully exploit the potential of CIM by providing automatic device-aware and device-agnostic optimizations and facilitating widespread adoption. Visit the ScaDS-AI website for a more detailed description of this project.
- Model and Code Optimization Methods for Energy-efficient Machine Learning
Optimizing machine learning models is essential for improving performance and energy efficiency, especially given the resource constraints in IoT devices and the rising energy demands of data centres. Our research focuses on post-training analysis, conversion techniques, and code optimizations to reduce model size and computational complexity without compromising accuracy. Our efforts have focused on quantization, pruning, and bitslicing methods to boost alternative execution models and design approaches, aiming at faster and more energy-efficient inference tasks. You will find details of this project on the ScaDS-AI website.
- Cross-Layer Resilience for Reliable CIM
CIM architectures promise high energy efficiency and throughput but suffer from reliability degradation due to device non-idealities, such as process variations, stochastic switching, and bit errors, that severely impact the application's correctness or accuracy. To overcome these challenges, we investigate cross-layer co-design approaches that integrate system architecture, micro-architecture, non-conventional arithmetic methods, and algorithmic resilience. Either by leveraging fault-tolerant paradigms or by actively exploiting hardware defects, this line of research aims to build CIM systems that maintain accuracy and performance despite imperfections.
2025
- João Paulo Cardoso De Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan, "Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs" (to appear), Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC), IEEE, Oct 2025.
Bibtex
@InProceedings{delima_ccmcc25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Marc Dietrich and Jeronimo Castrillon and Asif Ali Khan},
booktitle = {Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC)},
title = {Efficient In-Memory Acceleration of Sparse Block Diagonal {LLM}s},
location = {Dresden, Germany},
publisher = {IEEE},
month = oct,
numpages = {6},
year = {2025},
}
- Xiaobo Sharon Hu, Ming-Yen Lee, Mengyuan Li, João Paulo Cardoso de Lima, Liu Liu, Zhenhua Zhu, Jeronimo Castrillon, Michael Niemier, Yu Wang, "Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies", In IEEE Design & Test, Special Issue on the 20 years of the IEEE CEDA, IEEE, Aug 2025.
Abstract
Data transfer between processors and memory remains a critical bottleneck in improving application performance on traditional computing hardware, particularly for data-intensive workloads such as machine learning, bioinformatics, and security applications. In-memory computing (IMC), a paradigm where a substantial portion of data processing occurs directly within memory, has emerged as a promising solution to mitigate this bottleneck. The advancement of emerging non-volatile memory (NVM) technologies has further accelerated the development of IMC hardware fabrics. However, harnessing the full potential of IMC requires a cross-layer design approach that spans memory technologies, circuits, architectures, and systems. Essential cross-layer tools, including modeling and simulation, data partitioning and mapping, and operation scheduling, play a pivotal role in designing efficient IMC-based hardware. This article reviews key advancements in simulation and design tools for IMC fabrics, with a focus on NVM-based crossbar arrays and content-addressable memories, while highlighting the necessity of cross-layer collaboration. Additionally, we discuss current challenges and emerging opportunities in the field.
Bibtex
@Article{hu_dnt25,
author = {Xiaobo Sharon Hu and Ming-Yen Lee and Mengyuan Li and João Paulo Cardoso de Lima and Liu Liu and Zhenhua Zhu and Jeronimo Castrillon and Michael Niemier and Yu Wang},
journal = {IEEE Design \& Test, Special Issue on the 20 years of the IEEE CEDA},
title = {Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies},
doi = {10.1109/MDAT.2025.3603495},
url = {https://ieeexplore.ieee.org/document/11142851},
month = aug,
numpages = {11},
publisher = {IEEE},
year = {2025},
abstract = {Data transfer between processors and memory remains a critical bottleneck in improving application performance on traditional computing hardware, particularly for data-intensive workloads such as machine learning, bioinformatics, and security applications. In-memory computing (IMC), a paradigm where a substantial portion of data processing occurs directly within memory, has emerged as a promising solution to mitigate this bottleneck. The advancement of emerging non-volatile memory (NVM) technologies has further accelerated the development of IMC hardware fabrics. However, harnessing the full potential of IMC requires a cross-layer design approach that spans memory technologies, circuits, architectures, and systems. Essential cross-layer tools --including modeling and simulation, data partitioning and mapping, and operation scheduling--play a pivotal role in designing efficient IMC-based hardware. This article reviews key advancements in simulation and design tools for IMC fabrics, with a focus on NVM-based crossbar arrays and content-addressable memories, while highlighting the necessity of cross-layer collaboration. Additionally, we discuss current challenges and emerging opportunities in the field.},
}
- Anderson Faustino da Silva, Hamid Farzaneh, Joao Paulo Cardoso De Lima, Asif Ali Khan, Jeronimo Castrillon, "LearnCNM2Predict: Transfer Learning-based Performance Model for CNM Systems", Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, Berlin, Heidelberg, Jul 2025.
Abstract
Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.
Bibtex
@InProceedings{dasilva_samos25,
author = {Anderson Faustino da Silva and Hamid Farzaneh and Joao Paulo Cardoso De Lima and Asif Ali Khan and Jeronimo Castrillon},
booktitle = {Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS)},
date = {2025-07},
title = {{LearnCNM2Predict}: Transfer Learning-based Performance Model for CNM Systems},
location = {Samos, Greece},
organization = {IEEE},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
month = jul,
numpages = {17},
year = {2025},
abstract = {Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.},
}
- João Paulo Cardoso De Lima, Mehran Shoushtari Moghadam, Sercan Aygun, Jeronimo Castrillon, M. Hassan Najafi, Asif Ali Khan, "All-in-memory Stochastic Computing using ReRAM", Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25), Association for Computing Machinery, New York, NY, USA, Jun 2025.
Bibtex
@InProceedings{delima_dac25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Mehran Shoushtari Moghadam and Sercan Aygun and Jeronimo Castrillon and M. Hassan Najafi and Asif Ali Khan},
booktitle = {Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25)},
title = {All-in-memory Stochastic Computing using {ReRAM}},
location = {San Francisco, California},
publisher = {Association for Computing Machinery},
series = {DAC '25},
address = {New York, NY, USA},
month = jun,
numpages = {6},
year = {2025},
}
Downloads
2506_deLima_DAC [PDF]
- Yun-Chih Chen, Tristan Seidl, Nils Hölscher, Christian Hakert, Minh Duy Truong, Jian-Jia Chen, João Paulo C. de Lima, Asif Ali Khan, Jeronimo Castrillon, Ali Nezhadi, Lokesh Siddhu, Hassan Nassar, Mahta Mayahinia, Mehdi Baradaran Tahoori, Jörg Henkel, Nils Wilbert, Stefan Wildermann, Jürgen Teich, "Modeling and Simulating Emerging Memory Technologies: A Tutorial", Feb 2025.
Bibtex
@Article{chen2025_sppsim,
author = {Yun-Chih Chen and Tristan Seidl and Nils Hölscher and Christian Hakert and Minh Duy Truong and Jian-Jia Chen and João Paulo C. de Lima and Asif Ali Khan and Jeronimo Castrillon and Ali Nezhadi and Lokesh Siddhu and Hassan Nassar and Mahta Mayahinia and Mehdi Baradaran Tahoori and Jörg Henkel and Nils Wilbert and Stefan Wildermann and Jürgen Teich},
title = {Modeling and Simulating Emerging Memory Technologies: A Tutorial},
eprint = {2502.10167},
url = {https://arxiv.org/abs/2502.10167},
archiveprefix = {arXiv},
primaryclass = {cs.AR},
year = {2025},
month = feb,
}
Downloads
2502_Chen_SPPSim [PDF]
2024
- João Paulo C. de Lima, Benjamin F. Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones, "Count2Multiply: Reliable In-memory High-Radix Counting", arXiv, pp. 1-14, Sep 2024.
Abstract
Big data processing has exposed the limits of compute-centric hardware acceleration due to the memory-to-processor bandwidth bottleneck. Consequently, there has been a shift towards memory-centric architectures, leveraging substantial compute parallelism by processing using the memory elements directly. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. For emerging data-intensive applications, such as advanced machine learning techniques and bioinformatics, where matrix multiplication is a key primitive, memristor crossbars suffer from limited write endurance and expensive write operations. In contrast, while DRAM-based solutions have successfully demonstrated multiplication using additions, they remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. In addition, Count2Multiply is designed with fault tolerance in mind and leverages traditional scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories due to their shifting properties, which are natural for Count2Multiply, and their high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x.
Bibtex
@Misc{delima_count2multiply,
author = {Jo{\~a}o Paulo C. de Lima and Benjamin F. Morris III and Asif Ali Khan and Jeronimo Castrillon and Alex K. Jones},
title = {Count2Multiply: Reliable In-memory High-Radix Counting},
pages = {1-14},
publisher = {arXiv},
month=sep,
year={2024},
eprint={2409.10136},
archivePrefix={arXiv},
primaryClass={cs.AR},
url={https://arxiv.org/abs/2409.10136},
abstract = {Big data processing has exposed the limits of compute-centric hardware acceleration due to the memory-to-processor bandwidth bottleneck. Consequently, there has been a shift towards memory-centric architectures, leveraging substantial compute parallelism by processing using the memory elements directly. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. For emerging data-intensive applications, such as advanced machine learning techniques and bioinformatics, where matrix multiplication is a key primitive, memristor crossbars suffer from limited write endurance and expensive write operations. In contrast, while DRAM-based solutions have successfully demonstrated multiplication using additions, they remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. In addition, Count2Multiply is designed with fault tolerance in mind and leverages traditional scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories due to their shifting properties, which are natural for Count2Multiply, and their high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x.},
}
- Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024.
Abstract
Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10× and 4.6×, respectively.
Bibtex
@InProceedings{farzaneh_dac24,
author = {Hamid Farzaneh and Jo{\~a}o Paulo Cardoso De Lima and Ali Nezhadi Khelejani and Asif Ali Khan and Mahta Mayahinia and Mehdi Tahoori and Jeronimo Castrillon},
booktitle = {Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24)},
title = {{SHERLOCK}: Scheduling Efficient and Reliable Bulk Bitwise Operations in {NVMs}},
location = {San Francisco, California},
series = {DAC '24},
month = jun,
year = {2024},
isbn = {9798400706011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3649329.3658485},
doi = {10.1145/3649329.3658485},
abstract = {Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10\texttimes{} and 4.6\texttimes{}, respectively.},
articleno = {293},
numpages = {6},
}
Downloads
2406_Farzaneh_DAC [PDF]
- Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024.
Abstract
Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning, among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level TorchScript code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.
Bibtex
@InProceedings{farzaneh_asplos24,
author = {Hamid Farzaneh and João Paulo Cardoso de Lima and Mengyuan Li and Asif Ali Khan and Xiaobo Sharon Hu and Jeronimo Castrillon},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3},
title = {C4CAM: A Compiler for CAM-based In-memory Accelerators},
doi = {10.1145/3620666.3651386},
isbn = {9798400703867},
location = {La Jolla, CA, USA},
pages = {164--177},
publisher = {Association for Computing Machinery},
series = {ASPLOS '24},
url = {https://arxiv.org/abs/2309.06418},
abstract = {Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.},
address = {New York, NY, USA},
month = may,
numpages = {14},
year = {2024},
}
Downloads
2405_Farzaneh_ASPLOS [PDF]
- João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon, "Full-Stack Optimization for CAM-Only DNN Inference", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Mar 2024.
Abstract
The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.
Bibtex
@InProceedings{delima_date24,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Luigi Carro and Jeronimo Castrillon},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Full-Stack Optimization for CAM-Only DNN Inference},
location = {Valencia, Spain},
pages = {1-6},
publisher = {IEEE},
series = {DATE'24},
url = {https://ieeexplore.ieee.org/document/10546805},
abstract = {The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von-Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy},
month = mar,
year = {2024},
}
Downloads
2403_deLima_DATE [PDF]
- Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024. [Bibtex & Downloads]
Bibtex
@InProceedings{niemier_date24,
author = {Michael Niemier and Zephan Enciso and Mohammad Mehdi Sharifi and X. Sharon Hu and Ian O'Connor and Alexander Graening and Ravit Sharma and Puneet Gupta and Jeronimo Castrillon and João Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Nashrah Afroze and Asif Islam Khan and Julien Ryckaert},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546772},
pages = {1--10},
publisher = {IEEE},
series = {DATE'24},
month = mar,
year = {2024},
}
Downloads
2403_Niemier_DATE [PDF]
- Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024. [Bibtex & Downloads]
Bibtex
@Report{khan_cimlandscape_2024,
author = {Asif Ali Khan and João Paulo C. De Lima and Hamid Farzaneh and Jeronimo Castrillon},
title = {The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview},
eprint = {2401.14428},
url = {https://arxiv.org/abs/2401.14428},
archiveprefix = {arXiv},
month = jan,
primaryclass = {cs.AR},
year = {2024},
}
2023
- Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi] [Bibtex & Downloads]
Abstract
This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.
Bibtex
@InProceedings{henkel_cases23,
author = {J\"{o}rg Henkel and Lokesh Siddhu and Lars Bauer and J\"{u}rgen Teich and Stefan Wildermann and Mehdi Tahoori and Mahta Mayahinia and Jeronimo Castrillon and Asif Ali Khan and Hamid Farzaneh and Jo\~{a}o Paulo C. de Lima and Jian-Jia Chen and Christian Hakert and Kuan-Hsun Chen and Chia-Lin Yang and Hsiang-Yun Cheng},
booktitle = {Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
title = {Special Session -- Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications},
location = {Hamburg, Germany},
abstract = {This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.},
pages = {11--20},
url = {https://ieeexplore.ieee.org/abstract/document/10316216},
doi = {10.1145/3607889.3609088},
isbn = {9798400702907},
series = {CASES '23 Companion},
issn = {2643-1726},
month = sep,
numpages = {10},
year = {2023},
}
Downloads
2309_Henkel_CASES [PDF]
- João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceedings: 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023. [Bibtex & Downloads]
Bibtex
@InProceedings{lima_imacaw23,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Jeronimo Castrillon},
booktitle = {1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23)},
title = {Efficient Associative Processing with RTM-TCAMs},
location = {San Francisco, CA, USA},
pages = {2pp},
month = jul,
year = {2023},
}
Downloads
2307_deLima_iMACAW [PDF]
2022
- Rafael Fão de Moura, João Paulo Cardoso de Lima, Luigi Carro, "Data and Computation Reuse in CNNs using Memristor TCAMs", In ACM Transactions on Reconfigurable Technology and Systems, Association for Computing Machinery (ACM), Jul 2022. [doi] [Bibtex & Downloads]
Bibtex
@article{de_Moura_2022,
doi = {10.1145/3549536},
url = {https://doi.org/10.1145%2F3549536},
year = 2022,
month = {jul},
publisher = {Association for Computing Machinery ({ACM})},
author = {Rafael Fao de Moura and Joao Paulo Cardoso de Lima and Luigi Carro},
title = {Data and Computation Reuse in {CNNs} using Memristor {TCAMs}},
journal = {{ACM} Transactions on Reconfigurable Technology and Systems}
}
- João Paulo Cardoso de Lima, Marcelo Brandalero, Michael Hübner, Luigi Carro, "STAP: An Architecture and Design Tool for Automata Processing on Memristor TCAMs", In ACM Journal on Emerging Technologies in Computing Systems, Association for Computing Machinery (ACM), vol. 18, no. 2, pp. 1–22, Apr 2022. [doi] [Bibtex & Downloads]
Bibtex
@article{de_Lima_2022,
doi = {10.1145/3450769},
url = {https://doi.org/10.1145%2F3450769},
year = 2022,
month = {apr},
publisher = {Association for Computing Machinery ({ACM})},
volume = {18},
number = {2},
pages = {1--22},
author = {Jo{\~{a}}o Paulo Cardoso de Lima and Marcelo Brandalero and Michael Hübner and Luigi Carro},
title = {{STAP}: An Architecture and Design Tool for Automata Processing on Memristor {TCAMs}},
journal = {{ACM} Journal on Emerging Technologies in Computing Systems}
}
- Joao Paulo C. de Lima, Luigi Carro, "Quantization-Aware In-situ Training for Reliable and Accurate Edge AI", In Proceedings: 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, Mar 2022. [doi] [Bibtex & Downloads]
Bibtex
@inproceedings{de_Lima_2022,
doi = {10.23919/date54114.2022.9774657},
url = {https://doi.org/10.23919%2Fdate54114.2022.9774657},
year = 2022,
month = {mar},
publisher = {IEEE},
author = {Joao Paulo C. de Lima and Luigi Carro},
title = {Quantization-Aware In-situ Training for Reliable and Accurate Edge {AI}},
booktitle = {2022 Design, Automation {\&} Test in Europe Conference {\&} Exhibition ({DATE})}
}
2021
- Paulo C. Santos, João P. C. de Lima, Rafael F. de Moura, Marco A. Z. Alves, Antonio C. S. Beck, Luigi Carro, "Enabling Near-Data Accelerators Adoption by Through Investigation of Datapath Solutions", In International Journal of Parallel Programming, Springer Science and Business Media LLC, vol. 49, no. 2, pp. 237–252, Jan 2021. [doi] [Bibtex & Downloads]
Bibtex
@article{Santos_2021,
doi = {10.1007/s10766-020-00674-y},
url = {https://doi.org/10.1007%2Fs10766-020-00674-y},
year = 2021,
month = {jan},
publisher = {Springer Science and Business Media {LLC}},
volume = {49},
number = {2},
pages = {237--252},
author = {Paulo C. Santos and Jo{\~{a}}o P. C. de Lima and Rafael F. de Moura and Marco A. Z. Alves and Antonio C. S. Beck and Luigi Carro},
title = {Enabling Near-Data Accelerators Adoption by Through Investigation of Datapath Solutions},
journal = {International Journal of Parallel Programming}
}
2020
- Joao Paulo Cardoso de Lima, Marcelo Brandalero, Luigi Carro, "Endurance-Aware RRAM-Based Reconfigurable Architecture using TCAM Arrays", In Proceedings: 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), IEEE, Aug 2020. [doi] [Bibtex & Downloads]
Bibtex
@inproceedings{Cardoso_de_Lima_2020,
doi = {10.1109/fpl50879.2020.00018},
url = {https://doi.org/10.1109%2Ffpl50879.2020.00018},
year = 2020,
month = {aug},
publisher = {IEEE},
author = {Joao Paulo Cardoso de Lima and Marcelo Brandalero and Luigi Carro},
title = {Endurance-Aware {RRAM}-Based Reconfigurable Architecture using {TCAM} Arrays},
booktitle = {2020 30th International Conference on Field-Programmable Logic and Applications ({FPL})}
}
2019
- Hameeza Ahmed, Paulo C. Santos, Joao P. C. Lima, Rafael F. Moura, Marco A. Z. Alves, Antonio C. S. Beck, Luigi Carro, "A Compiler for Automatic Selection of Suitable Processing-in-Memory Instructions", In Proceedings: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, Mar 2019. [doi] [Bibtex & Downloads]
Bibtex
@inproceedings{Ahmed_2019,
doi = {10.23919/date.2019.8714956},
url = {https://doi.org/10.23919%2Fdate.2019.8714956},
year = 2019,
month = {mar},
publisher = {IEEE},
author = {Hameeza Ahmed and Paulo C. Santos and Joao P. C. Lima and Rafael F. Moura and Marco A. Z. Alves and Antonio C. S. Beck and Luigi Carro},
title = {A Compiler for Automatic Selection of Suitable Processing-in-Memory Instructions},
booktitle = {2019 Design, Automation {\&} Test in Europe Conference {\&} Exhibition ({DATE})}
}