João Paulo Cardoso de Lima

E-mail: joao.lima@tu-dresden.de

Phone: +49 (0)351 463 42336

Visitor's Address:
Helmholtzstrasse 18, 3rd floor, BAR III59
01069 Dresden
Germany

Curriculum Vitae

I received my bachelor's degree in Computer Engineering from the Federal University of Santa Catarina (UFSC) in 2017, followed by a master's degree in 2019 and a Ph.D. in Computer Science from the Federal University of Rio Grande do Sul (UFRGS) in 2025. I joined the Chair for Compiler Construction to research and develop code optimizations for emerging AI systems as part of the ScaDS.AI Dresden/Leipzig center. My main research interests include Processing-in-Memory architectures, system design, hardware/software co-design, design automation tools, compilers, reliability evaluation, and fault tolerance methods. On the application side, I am particularly interested in making machine learning algorithms more energy-efficient through memory-centric optimizations, whether compiler-driven, hand-tuned, or enabled by domain-specific tools. A complete list of my publications can be found on my Google Scholar profile.

Student Thesis Topics

My research focuses on energy-efficient and high-performance computing through approaches such as computing-near-memory (CNM) and computing-in-memory (CIM), especially for machine learning (ML) and data analytics applications. I also work on optimizing ML models for energy efficiency, which is essential for both IoT devices and data centres, where energy use is a growing concern. I can help you with these topics for project work or a Bachelor's or Master's thesis, especially if you are interested in hardware-software co-design, energy-efficient ML, and emerging computing paradigms.

System and Compiler Design for Emerging CNM/CIM Architectures

Our goal is to enable the portability of AI and Big Data applications across existing CNM/CIM systems and novel accelerator designs, prioritizing performance, accuracy, and energy efficiency. Because these systems differ substantially from conventional machines, new compiler abstractions and frameworks are crucial to fully exploit the potential of CIM: they provide automatic device-aware and device-agnostic optimizations and facilitate widespread adoption. Visit the ScaDS.AI website for a more detailed description of this project.

Model and Code Optimization Methods for Energy-efficient Machine Learning

Optimizing machine learning models is essential for improving performance and energy efficiency, especially given the resource constraints in IoT devices and the rising energy demands of data centres. Our research focuses on post-training analysis, conversion techniques, and code optimizations to reduce model size and computational complexity without compromising accuracy. So far, we have concentrated on quantization, pruning, and bitslicing methods that support alternative execution models and design approaches, aiming at faster and more energy-efficient inference. You will find details of this project on the ScaDS.AI website.
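As a rough illustration of the post-training quantization mentioned above, the following minimal sketch maps floating-point weights to signed integers with a single scale factor. It assumes symmetric, per-tensor quantization of a plain Python list; the function names `quantize_symmetric` and `dequantize` are hypothetical and are not taken from the project's tooling.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale.

    Illustrative assumption: symmetric, per-tensor quantization.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer levels."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

With rounding to the nearest level, the reconstruction error per weight stays within half a quantization step (`scale / 2`), which is the kind of accuracy-versus-size trade-off such methods have to manage.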

Cross-Layer Resilience for Reliable CIM

CIM architectures promise high energy efficiency and throughput but suffer from reliability degradation due to device non-idealities, such as process variations, stochastic switching, and bit errors, that severely impact the application's correctness or accuracy. To overcome these challenges, we investigate cross-layer co-design approaches that integrate system architecture, micro-architecture, non-conventional arithmetic methods, and algorithmic resilience. Either by leveraging fault-tolerant paradigms or by actively exploiting hardware defects, this line of research aims to build CIM systems that maintain accuracy and performance despite imperfections.

Publications

2026
João Paulo C. de Lima, Benjamin F. Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones, "Count2Multiply: Reliable In-Memory High-Radix Counting" (to appear), Proceedings of the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA 2026), IEEE Computer Society, Los Alamitos, CA, USA, Feb 2026.

Bibtex

@InProceedings{delima_hpca26,
author = {João Paulo C. de Lima and Benjamin F. Morris III and Asif Ali Khan and Jeronimo Castrillon and Alex K. Jones},
booktitle = {Proceedings of the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA 2026)},
title = {Count2Multiply: Reliable In-Memory High-Radix Counting},
organization = {IEEE},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = feb,
year = {2026},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3861

2025
João Paulo Cardoso De Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan, "Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs" (to appear), Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC), IEEE, Oct 2025.

Bibtex

@InProceedings{delima_ccmcc25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Marc Dietrich and Jeronimo Castrillon and Asif Ali Khan},
booktitle = {Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC)},
title = {Efficient In-Memory Acceleration of Sparse Block Diagonal {LLM}s},
location = {Dresden, Germany},
publisher = {IEEE},
month = oct,
numpages = {6},
year = {2025},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3852

Xiaobo Sharon Hu, Ming-Yen Lee, Mengyuan Li, João Paulo Cardoso de Lima, Liu Liu, Zhenhua Zhu, Jeronimo Castrillon, Michael Niemier, Yu Wang, "Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies", In IEEE Design & Test, Special Issue on the 20 years of the IEEE CEDA, IEEE, vol. 42, no. 6, pp. 75–86, Aug 2025.

Abstract
Data transfer between processors and memory remains a critical bottleneck in improving application performance on traditional computing hardware, particularly for data-intensive workloads such as machine learning, bioinformatics, and security applications. In-memory computing (IMC), a paradigm where a substantial portion of data processing occurs directly within memory, has emerged as a promising solution to mitigate this bottleneck. The advancement of emerging non-volatile memory (NVM) technologies has further accelerated the development of IMC hardware fabrics. However, harnessing the full potential of IMC requires a cross-layer design approach that spans memory technologies, circuits, architectures, and systems. Essential cross-layer tools, including modeling and simulation, data partitioning and mapping, and operation scheduling, play a pivotal role in designing efficient IMC-based hardware. This article reviews key advancements in simulation and design tools for IMC fabrics, with a focus on NVM-based crossbar arrays and content-addressable memories, while highlighting the necessity of cross-layer collaboration. Additionally, we discuss current challenges and emerging opportunities in the field.

Bibtex

@Article{hu_dnt25,
author = {Xiaobo Sharon Hu and Ming-Yen Lee and Mengyuan Li and João Paulo Cardoso de Lima and Liu Liu and Zhenhua Zhu and Jeronimo Castrillon and Michael Niemier and Yu Wang},
journal = {IEEE Design \& Test, Special Issue on the 20 years of the IEEE CEDA},
volume={42},
number={6},
pages={75--86},
title = {Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies},
doi = {10.1109/MDAT.2025.3603495},
url = {https://ieeexplore.ieee.org/document/11142851},
month = aug,
numpages = {11},
publisher = {IEEE},
year = {2025},
abstract = {Data transfer between processors and memory remains a critical bottleneck in improving application performance on traditional computing hardware, particularly for data-intensive workloads such as machine learning, bioinformatics, and security applications. In-memory computing (IMC), a paradigm where a substantial portion of data processing occurs directly within memory, has emerged as a promising solution to mitigate this bottleneck. The advancement of emerging non-volatile memory (NVM) technologies has further accelerated the development of IMC hardware fabrics. However, harnessing the full potential of IMC requires a cross-layer design approach that spans memory technologies, circuits, architectures, and systems. Essential cross-layer tools --including modeling and simulation, data partitioning and mapping, and operation scheduling--play a pivotal role in designing efficient IMC-based hardware. This article reviews key advancements in simulation and design tools for IMC fabrics, with a focus on NVM-based crossbar arrays and content-addressable memories, while highlighting the necessity of cross-layer collaboration. Additionally, we discuss current challenges and emerging opportunities in the field.},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3853

Anderson Faustino da Silva, Hamid Farzaneh, Joao Paulo Cardoso De Lima, Asif Ali Khan, Jeronimo Castrillon, "LearnCNM2Predict: Transfer Learning-based Performance Model for CNM Systems", Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, Berlin, Heidelberg, Jul 2025. (Best paper award candidate)

Abstract
Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.

Bibtex

@InProceedings{dasilva_samos25,
author = {Anderson Faustino da Silva and Hamid Farzaneh and Joao Paulo Cardoso De Lima and Asif Ali Khan and Jeronimo Castrillon},
booktitle = {Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS)},
date = {2025-07},
title = {{LearnCNM2Predict}: Transfer Learning-based Performance Model for CNM Systems},

location = {Samos, Greece},
organization = {IEEE},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
month = jul,
numpages = {17},
year = {2025},
abstract = {Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.},
}

Downloads

2507_daSilva_SAMOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3839

João Paulo Cardoso De Lima, Mehran Shoushtari Moghadam, Sercan Aygun, Jeronimo Castrillon, M. Hassan Najafi, Asif Ali Khan, "All-in-memory Stochastic Computing using ReRAM", Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25), Association for Computing Machinery, pp. 50–56, New York, NY, USA, Jun 2025.

Abstract
As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC operations are inherently fault-tolerant, their accuracy largely depends on the length and quality of the stochastic bit-streams (SBS). These bit-streams are typically generated by CMOS-based stochastic bit-stream generators that consume over 80% of the SC system's power and area. Current SC solutions focus on optimizing the logic gates but often neglect the high cost of moving the bit-streams between memory and processor. This work leverages the physics of emerging ReRAM devices to implement the entire SC flow in place: (1) generating low-cost true random numbers and SBSs, (2) conducting SC operations, and (3) converting SBSs back to binary. Considering the low reliability of ReRAM cells, we demonstrate how SC's robustness to errors copes with ReRAM's variability. Our evaluation shows significant improvements in throughput (1.39X, 2.16X) and energy consumption (1.15X, 2.8X) over state-of-the-art (CMOS- and ReRAM-based) solutions, respectively, with an average image quality drop of 5% across multiple SBS lengths and image processing tasks.
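To make the idea in this abstract concrete, here is a minimal software sketch of stochastic computing: values in [0, 1] are encoded as random bit-streams, a bitwise AND approximates multiplication, and a three-input majority with a p = 0.5 stream approximates a scaled addition. It assumes a software pseudo-random generator and independent streams; it does not model the ReRAM-based generation or in-memory execution described in the paper, and all function names are hypothetical.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bit-stream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_bitstream(bits):
    """Convert a bit-stream back to a value: the fraction of 1s."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Bitwise AND of two independent streams approximates a * b."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_scaled_add(a_bits, b_bits, select_bits):
    """Majority with a p = 0.5 select stream approximates (a + b) / 2."""
    return [1 if a + b + s >= 2 else 0
            for a, b, s in zip(a_bits, b_bits, select_bits)]


rng = random.Random(0)          # fixed seed so the run is repeatable
n = 20000                       # longer streams give better accuracy
a = to_bitstream(0.5, n, rng)
b = to_bitstream(0.4, n, rng)
half = to_bitstream(0.5, n, rng)

product = from_bitstream(sc_multiply(a, b))             # close to 0.5 * 0.4
scaled_sum = from_bitstream(sc_scaled_add(a, b, half))  # close to (0.5 + 0.4) / 2
print(f"product ~ {product:.3f}, scaled sum ~ {scaled_sum:.3f}")
```

The results are estimates whose error shrinks as the stream length grows, which reflects the dependence of accuracy on bit-stream length and quality noted in the abstract.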

Bibtex

@InProceedings{delima_dac25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Mehran Shoushtari Moghadam and Sercan Aygun and Jeronimo Castrillon and M. Hassan Najafi and Asif Ali Khan},
booktitle = {Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25)},
title = {All-in-memory Stochastic Computing using {ReRAM}},
doi = {10.1109/DAC63849.2025.11132096},
isbn = {9798331503048},
location = {San Francisco, California},
pages = {50--56},
publisher = {Association for Computing Machinery},
series = {DAC '25},
url = {https://doi.org/10.1109/DAC63849.2025.11132096},
abstract = {As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC operations are inherently fault-tolerant, their accuracy largely depends on the length and quality of the stochastic bit-streams (SBS). These bit-streams are typically generated by CMOS-based stochastic bit-stream generators that consume over 80\% of the SC system's power and area. Current SC solutions focus on optimizing the logic gates but often neglect the high cost of moving the bit-streams between memory and processor. This work leverages the physics of emerging ReRAM devices to implement the entire SC flow in place: 1 generating low-cost true random numbers and SBSs, 2 conducting SC operations, and 3 converting SBSs back to binary. Considering the low reliability of ReRAM cells, we demonstrate how SC's robustness to errors copes with ReRAM's variability. Our evaluation shows significant improvements in throughput (1.39X, 2.16X) and energy consumption (1.15X, 2.8X) over state-of-the-art (CMOS- and ReRAM-based) solutions, respectively, with an average image quality drop of 5\% across multiple SBS lengths and image processing tasks.},
address = {New York, NY, USA},
articleno = {5},
month = jun,
numpages = {6},
year = {2025},
}

Downloads

2506_deLima_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3818

Yun-Chih Chen, Tristan Seidl, Nils Hölscher, Christian Hakert, Minh Duy Truong, Jian-Jia Chen, João Paulo C. de Lima, Asif Ali Khan, Jeronimo Castrillon, Ali Nezhadi, Lokesh Siddhu, Hassan Nassar, Mahta Mayahinia, Mehdi Baradaran Tahoori, Jörg Henkel, Nils Wilbert, Stefan Wildermann, Jürgen Teich, "Modeling and Simulating Emerging Memory Technologies: A Tutorial", Feb 2025.

Bibtex

@Article{chen2025_sppsim,
author = {Yun-Chih Chen and Tristan Seidl and Nils Hölscher and Christian Hakert and Minh Duy Truong and Jian-Jia Chen and João Paulo C. de Lima and Asif Ali Khan and Jeronimo Castrillon and Ali Nezhadi and Lokesh Siddhu and Hassan Nassar and Mahta Mayahinia and Mehdi Baradaran Tahoori and Jörg Henkel and Nils Wilbert and Stefan Wildermann and Jürgen Teich},
title = {Modeling and Simulating Emerging Memory Technologies: A Tutorial},
eprint = {2502.10167},
url = {https://arxiv.org/abs/2502.10167},
archiveprefix = {arXiv},
primaryclass = {cs.AR},
year = {2025},
month = feb,
}

Downloads

2502_Chen_SPPSim [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3815

2024
Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024.

Abstract
Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10× and 4.6×, respectively.

Bibtex

@InProceedings{farzaneh_dac24,
author = {Hamid Farzaneh and Jo{\~a}o Paulo Cardoso De Lima and Ali Nezhadi Khelejani and Asif Ali Khan and Mahta Mayahinia and Mehdi Tahoori and Jeronimo Castrillon},
booktitle = {Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24)},
title = {{SHERLOCK}: Scheduling Efficient and Reliable Bulk Bitwise Operations in {NVMs}},
location = {San Francisco, California},
series = {DAC '24},
month = jun,
year = {2024},
isbn = {9798400706011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3649329.3658485},
doi = {10.1145/3649329.3658485},
abstract = {Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10\texttimes{} and 4.6\texttimes{}, respectively.},
articleno = {293},
numpages = {6},
}

Downloads

2406_Farzaneh_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3726

Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024.

Abstract
Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning, among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level TorchScript code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.

Bibtex

@InProceedings{farzaneh_asplos24,
author = {Hamid Farzaneh and João Paulo Cardoso de Lima and Mengyuan Li and Asif Ali Khan and Xiaobo Sharon Hu and Jeronimo Castrillon},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3},
title = {C4CAM: A Compiler for CAM-based In-memory Accelerators},
doi = {10.1145/3620666.3651386},
isbn = {9798400703867},
location = {La Jolla, CA, USA},
pages = {164--177},
publisher = {Association for Computing Machinery},
series = {ASPLOS '24},
url = {https://arxiv.org/abs/2309.06418},
abstract = {Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.},
address = {New York, NY, USA},
month = may,
numpages = {14},
year = {2024},
}

Downloads

2405_Farzaneh_ASPLOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3738

João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon, "Full-Stack Optimization for CAM-Only DNN Inference", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–6, Mar 2024.

Abstract
The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.

Bibtex

@InProceedings{delima_date24,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Luigi Carro and Jeronimo Castrillon},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Full-Stack Optimization for CAM-Only DNN Inference},
location = {Valencia, Spain},
pages = {1--6},
publisher = {IEEE},
series = {DATE'24},
url = {https://ieeexplore.ieee.org/document/10546805},
abstract = {The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von-Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy},
month = mar,
year = {2024},
}

Downloads

2403_deLima_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3701

Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024. [Bibtex & Downloads]

Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers

Reference

Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024.

Bibtex

@InProceedings{niemier_date24,
author = {Michael Niemier and Zephan Enciso and Mohammad Mehdi Sharifi and X. Sharon Hu and Ian O'Connor and Alexander Graening and Ravit Sharma and Puneet Gupta and Jeronimo Castrillon and João Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Nashrah Afroze and Asif Islam Khan and Julien Ryckaert},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546772},
pages = {1--10},
publisher = {IEEE},
series = {DATE'24},
month = mar,
year = {2024},
}

Downloads

2403_Niemier_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3715

Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024. [Bibtex & Downloads]

The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview

Reference

Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024.

Bibtex

@Report{khan_cimlandscape_2024,
author = {Asif Ali Khan and João Paulo C. De Lima and Hamid Farzaneh and Jeronimo Castrillon},
title = {The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview},
eprint = {2401.14428},
url = {https://arxiv.org/abs/2401.14428},
archiveprefix = {arXiv},
month = jan,
primaryclass = {cs.AR},
year = {2024},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3716


2023
Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi] [Bibtex & Downloads]

Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications

Reference

Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi]

Abstract
This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.

Bibtex

@InProceedings{henkel_cases23,
author = {J\"{o}rg Henkel and Lokesh Siddhu and Lars Bauer and J\"{u}rgen Teich and Stefan Wildermann and Mehdi Tahoori and Mahta Mayahinia and Jeronimo Castrillon and Asif Ali Khan and Hamid Farzaneh and Jo\~{a}o Paulo C. de Lima and Jian-Jia Chen and Christian Hakert and Kuan-Hsun Chen and Chia-Lin Yang and Hsiang-Yun Cheng},
booktitle = {Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
title = {Special Session -- Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications},
location = {Hamburg, Germany},
abstract = {This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.},
pages = {11--20},
url = {https://ieeexplore.ieee.org/abstract/document/10316216},
doi = {10.1145/3607889.3609088},
isbn = {9798400702907},
series = {CASES '23 Companion},
issn = {2643-1726},
month = sep,
numpages = {10},
year = {2023},
}

Downloads

2309_Henkel_CASES [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3654

João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceeding: 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023. [Bibtex & Downloads]

Efficient Associative Processing with RTM-TCAMs

Reference

João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceeding: 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023.

Bibtex

@InProceedings{lima_imacaw23,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Jeronimo Castrillon},
booktitle = {1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23)},
title = {Efficient Associative Processing with RTM-TCAMs},
location = {San Francisco, CA, USA},
pages = {2pp},
month = jul,
year = {2023},
}

Downloads

2307_deLima_iMACAW [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3566


2022
Rafael Fão de Moura, João Paulo Cardoso de Lima, Luigi Carro, "Data and Computation Reuse in CNNs using Memristor TCAMs", In ACM Transactions on Reconfigurable Technology and Systems, Association for Computing Machinery (ACM), Jul 2022. [doi] [Bibtex & Downloads]

Data and Computation Reuse in CNNs using Memristor TCAMs

Reference

Rafael Fão de Moura, João Paulo Cardoso de Lima, Luigi Carro, "Data and Computation Reuse in CNNs using Memristor TCAMs", In ACM Transactions on Reconfigurable Technology and Systems, Association for Computing Machinery (ACM), Jul 2022. [doi]

Bibtex

@article{de_Moura_2022,
doi = {10.1145/3549536},
url = {https://doi.org/10.1145%2F3549536},
year = 2022,
month = {jul},
publisher = {Association for Computing Machinery ({ACM})},
author = {Rafael F{\~{a}}o de Moura and Jo{\~{a}}o Paulo Cardoso de Lima and Luigi Carro},
title = {Data and Computation Reuse in {CNNs} using Memristor {TCAMs}},
journal = {{ACM} Transactions on Reconfigurable Technology and Systems}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3378

João Paulo Cardoso de Lima, Marcelo Brandalero, Michael Hübner, Luigi Carro, "STAP: An Architecture and Design Tool for Automata Processing on Memristor TCAMs", In ACM Journal on Emerging Technologies in Computing Systems, Association for Computing Machinery (ACM), vol. 18, no. 2, pp. 1–22, Apr 2022. [doi] [Bibtex & Downloads]

STAP: An Architecture and Design Tool for Automata Processing on Memristor TCAMs

Reference

João Paulo Cardoso de Lima, Marcelo Brandalero, Michael Hübner, Luigi Carro, "STAP: An Architecture and Design Tool for Automata Processing on Memristor TCAMs", In ACM Journal on Emerging Technologies in Computing Systems, Association for Computing Machinery (ACM), vol. 18, no. 2, pp. 1–22, Apr 2022. [doi]

Bibtex

@article{de_Lima_2022,
doi = {10.1145/3450769},
url = {https://doi.org/10.1145%2F3450769},
year = 2022,
month = {apr},
publisher = {Association for Computing Machinery ({ACM})},
volume = {18},
number = {2},
pages = {1--22},
author = {Jo{\~{a}}o Paulo Cardoso de Lima and Marcelo Brandalero and Michael Hübner and Luigi Carro},
title = {{STAP}: An Architecture and Design Tool for Automata Processing on Memristor {TCAMs}},
journal = {{ACM} Journal on Emerging Technologies in Computing Systems}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3380

Joao Paulo C. de Lima, Luigi Carro, "Quantization-Aware In-situ Training for Reliable and Accurate Edge AI", In Proceeding: 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, Mar 2022. [doi] [Bibtex & Downloads]

Quantization-Aware In-situ Training for Reliable and Accurate Edge AI

Reference

Joao Paulo C. de Lima, Luigi Carro, "Quantization-Aware In-situ Training for Reliable and Accurate Edge AI", In Proceeding: 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, Mar 2022. [doi]

Bibtex

@inproceedings{de_Lima_2022,
doi = {10.23919/date54114.2022.9774657},
url = {https://doi.org/10.23919%2Fdate54114.2022.9774657},
year = 2022,
month = {mar},
publisher = {IEEE},
author = {Joao Paulo C. de Lima and Luigi Carro},
title = {Quantization-Aware In-situ Training for Reliable and Accurate Edge {AI}},
booktitle = {2022 Design, Automation {\&} Test in Europe Conference {\&} Exhibition ({DATE})}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3377


2021
Paulo C. Santos, João P. C. de Lima, Rafael F. de Moura, Marco A. Z. Alves, Antonio C. S. Beck, Luigi Carro, "Enabling Near-Data Accelerators Adoption by Through Investigation of Datapath Solutions", In International Journal of Parallel Programming, Springer Science and Business Media LLC, vol. 49, no. 2, pp. 237–252, Jan 2021. [doi] [Bibtex & Downloads]

Enabling Near-Data Accelerators Adoption by Through Investigation of Datapath Solutions

Reference

Paulo C. Santos, João P. C. de Lima, Rafael F. de Moura, Marco A. Z. Alves, Antonio C. S. Beck, Luigi Carro, "Enabling Near-Data Accelerators Adoption by Through Investigation of Datapath Solutions", In International Journal of Parallel Programming, Springer Science and Business Media LLC, vol. 49, no. 2, pp. 237–252, Jan 2021. [doi]

Bibtex

@article{Santos_2021,
doi = {10.1007/s10766-020-00674-y},
url = {https://doi.org/10.1007%2Fs10766-020-00674-y},
year = 2021,
month = {jan},
publisher = {Springer Science and Business Media {LLC}},
volume = {49},
number = {2},
pages = {237--252},
author = {Paulo C. Santos and Jo{\~{a}}o P. C. de Lima and Rafael F. de Moura and Marco A. Z. Alves and Antonio C. S. Beck and Luigi Carro},
title = {Enabling Near-Data Accelerators Adoption by Through Investigation of Datapath Solutions},
journal = {International Journal of Parallel Programming}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3381


2020
Joao Paulo Cardoso de Lima, Marcelo Brandalero, Luigi Carro, "Endurance-Aware RRAM-Based Reconfigurable Architecture using TCAM Arrays", In Proceeding: 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), IEEE, Aug 2020. [doi] [Bibtex & Downloads]

Endurance-Aware RRAM-Based Reconfigurable Architecture using TCAM Arrays

Reference

Joao Paulo Cardoso de Lima, Marcelo Brandalero, Luigi Carro, "Endurance-Aware RRAM-Based Reconfigurable Architecture using TCAM Arrays", In Proceeding: 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), IEEE, Aug 2020. [doi]

Bibtex

@inproceedings{Cardoso_de_Lima_2020,
doi = {10.1109/fpl50879.2020.00018},
url = {https://doi.org/10.1109%2Ffpl50879.2020.00018},
year = 2020,
month = {aug},
publisher = {IEEE},
author = {Joao Paulo Cardoso de Lima and Marcelo Brandalero and Luigi Carro},
title = {Endurance-Aware {RRAM}-Based Reconfigurable Architecture using {TCAM} Arrays},
booktitle = {2020 30th International Conference on Field-Programmable Logic and Applications ({FPL})}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3379


2019
Hameeza Ahmed, Paulo C. Santos, Joao P. C. Lima, Rafael F. Moura, Marco A. Z. Alves, Antonio C. S. Beck, Luigi Carro, "A Compiler for Automatic Selection of Suitable Processing-in-Memory Instructions", In Proceeding: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, Mar 2019. [doi] [Bibtex & Downloads]

A Compiler for Automatic Selection of Suitable Processing-in-Memory Instructions

Reference

Hameeza Ahmed, Paulo C. Santos, Joao P. C. Lima, Rafael F. Moura, Marco A. Z. Alves, Antonio C. S. Beck, Luigi Carro, "A Compiler for Automatic Selection of Suitable Processing-in-Memory Instructions", In Proceeding: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, Mar 2019. [doi]

Bibtex

@inproceedings{Ahmed_2019,
doi = {10.23919/date.2019.8714956},
url = {https://doi.org/10.23919%2Fdate.2019.8714956},
year = 2019,
month = {mar},
publisher = {IEEE},
author = {Hameeza Ahmed and Paulo C. Santos and Joao P. C. Lima and Rafael F. Moura and Marco A. Z. Alves and Antonio C. S. Beck and Luigi Carro},
title = {A Compiler for Automatic Selection of Suitable Processing-in-Memory Instructions},
booktitle = {2019 Design, Automation {\&} Test in Europe Conference {\&} Exhibition ({DATE})}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3382

