

ê

والمع

Ħ

Q

#### JUNE 23-27, 2024

MOSCONE WEST CENTER SAN FRANCISCO, CA, USA





F



က် ငျော်

#### SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs

â

Hamid Farzaneh<sup>1</sup>, <u>João Paulo C. de Lima<sup>12</sup></u>, Ali Nezhadi Khelejani<sup>3</sup>, Asif Ali Khan<sup>1</sup>, Mahta Mayahinia<sup>3</sup>, Mehdi Tahoori<sup>3</sup> and Jeronimo Castrillon<sup>12</sup>

<sup>1</sup> Technical University of Dresden, <sup>2</sup> ScaDS.AI, <sup>3</sup> Karlsruhe Institute of Technology

#### **Cost-driven accelerators design**





Computation is almost for free

Abu Sebastian et al, Nature Nanotechnology, 2020



#### Memristor-based computing-in-memory (CIM)

#### Analog dot-product engines



#### **Bulk-bitwise logic**



- CIM alleviates the data movement bottleneck
- Memristive crossbars enable:
  - Analog dot-product using DA/AD converters
  - Bitwise operations (OR, AND and NOT) using custom sense amplifiers



#### Memristor-based computing-in-memory (CIM)





### **Reliability challenges of memristive CIM**

- Decision failure probability
- HRS/LRS ratio
  - Larger HRS/LRS, lower decision failure
  - Technology-dependent parameter



- More inputs, smaller sense margin (
   decision failure)
- But also smaller standard deviation (
   decision failure)







## Sherlock at a glance

- Lack of flexibility
  - Row granularity

#### Performance-Reliability trade-off

- Multiple row activation
- Mapping strategies for performance, energy and reliability
  - Enable fine-grain control over operations
- Optimize data-flow graph mapping



Goals

Problems

- Database, image processing and encryption algorithms
- Improves latency (~10×), energy consumption (~4.6×), and
- reliability ( $\sim$ 1.5×) compared to the SotA









#### **Sherlock flow**

with rotation

O SYSTEMS



- Fine-grain selection of columns

#### **Problem statement with Bitweaving**

. . .

```
for (i = 0 \text{ to } K):
 t1 = NOT(C1[i]);
 T2 = AND(T1, data[i]);
  T3 = AND(T2, meq1);
 m_gt = OR(T3, mgt);
 T4 = NOT(data[i]);
 T5 = AND(T4, C2[i]);
 T6 = AND(T5, m_eq2);
 m_lt = OR(T6, mlt);
 T7 = XOR(data[i], C1[i]);
 T8 = NOT(T7);
 meq1 = AND(meq1, T8);
  T9 = XOR(data[i], C2[i]);
  T10 = NOT(T9);
 meq2 = AND(meq2, T10);
```



op address[arrayID][columns][rows] [cim-op]
write [0][4,8,12,16][932]
Read [0][1,5,9, 13][5]
Shift [0] R[3]
Read [0][4,8,12,16][933,934] [XOR,AND,OR,XOR]



Li and Patel, Bitweaving, SIGMOD'2013.



## **DFG Generation**

```
for (i = 0 \text{ to } K):
t1 = NOT(C1[i]);
  T2 = AND(T1, data[i]);
 T3 = AND(T2, meq1);
 m_gt = OR(T3, mgt);
  T4 = NOT(data[i]);
 T5 = AND(T4, C2[i]);
 T6 = AND(T5, m_eq2);
 m_lt= OR(T6, mlt);
  T7 = XOR(data[i], C1[i]);
 T8 = NOT(T7);
 meq1 = AND(meq1, T8);
  T9 = XOR(data[i], C2[i]);
 T10 = NOT(T9);
 meq2 = AND(meq2, T10);
```



Li and Patel, Bitweaving, SIGMOD'2013.



#### Map operands and operations





#### Pros of naive mapping

- Example
  - Step 1: Priority
  - Step 2: Map sequentially
- Pros
  - Simple
  - Extends SIMD parallelism







### Cons of naive mapping

- DFGs may not fit into a single array
- Additional memory copies across columns are required in larger DFGs
- Resource utilization is poor when vector size < column size</li>





## **Sherlock mapping**

• Goal: Find K Clusters of operations for K columns of the memory





- 1) Assign priority values to OP nodes
- 2) Create initial cluster C
- 3) Start from the highest priority
  - Add Q (highest priority) to C
  - Add more nodes to C or create a new cluster based on different metrics





- 5 cases were identified
- Case 1:
  - A node only has one predecessor
  - If C<sub>maxsize</sub> > Cluster<sub>size</sub>:
    - Node is added to the new cluster
  - Else
    - New cluster is created





- Case 2:
  - Clusters are similar
  - Dependency relation is similar
  - (Cluster<sub>size</sub> + 1) <  $C_{maxsize}$
  - Randomly assignment



#### • Case 3:

- Clusters are similar
- Dependency relation is different
- Put the node in the cluster with less difference in the priority: <u>Critical path</u>





- Case 4:
  - The clusters are similar
  - The dependency relation is different
  - Put the node in the cluster with greater dependency



- Case 5:
  - The clusters are different
  - The dependency relation is similar
  - Put the node in the smaller cluster
  - <u>Balance the load</u>





Generalizing for all cases



 $\alpha$  and  $\beta$  control the effect of cluster size and priority measure

ρ is the priority value difference between d and q



#### Mapping clusters to memory

- The size of a cluster <  $C_{\text{maxsize}}$
- Two clusters can be merged
  - To better utilize memory
  - if (Cluster  $A_{size}$  + Cluster  $B_{size}$ ) <  $C_{maxsize}$
  - Maximize dependency





#### **Optimizing with many-row activation**

- Include multiple row activation in the scheduling
- Merge the operations with multiple operands in the final clusters
- Requirements
  - Operations are of the same type
  - Operands are mapped to the same column





#### Latency and energy comparison

 Parameters to vary include technology (ReRAM, STT-MRAM), array size (512, 1024), and multi-row activation (2-operand only, >2 operands)





#### Impact of MRA on application reliability

• Reliability of Bitweaving output varying the allowed percentage of MRA (> 2 operands)



suitable for applications tolerating some result inaccuracy



#### Main takeaways

- Scouting logic to alleviate the memory wall issue
- Mapping larger applications on the target CIM is not straightforward
- More flexibility by a retargetable scheduler
- Latency (~10×), energy consumption (~4.6×), and reliability (~1.5×) improvement compared to the SotA
- Sherlock aims to maximize the performance of CIM-logic





JUNE 23-27, 2024 MOSCONE WEST CENTER SAN FRANCISCO, CA, USA

# Thank you for listening

ê

**Sugar** 

F

က်နှို

and the

hamid.farzaneh@tu-dresden.de

joao.lima@tu-dresden.de



ê

والمع

Ħ

Q

#### JUNE 23-27, 2024

MOSCONE WEST CENTER SAN FRANCISCO, CA, USA