# **CS 423**

#### **Current Directions**

### **Memory Issues in 3D Architectures**

- Benefits of a 3D chip over a 2D design
  - Reduction in global interconnect
    - Performance: reduced average interconnect length
    - Power: reduction in total wiring length
  - Higher packing density and smaller footprint
  - Support for realization of mixed-technology chips

[Figure: functional blocks ($B_i$) placed across two device layers, Layer 1 and Layer 2]

# **Heterogeneous Chip Multiprocessors**

- A chip multiprocessor
  - High-complexity cores
  - Low-complexity cores
- Better resource-to-application mapping
  - Speed of a large core
  - Efficiency of a small core



- Alpha Cores
- EV8 is 80X bigger
- Only a 2X to 3X performance improvement
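The area and performance figures above can be turned into a rough performance-per-area comparison. The sketch below is illustrative only: it assumes the unnamed baseline core is normalized to area 1 and performance 1 (the slide does not name the baseline), and the function name `perf_per_area` is hypothetical.

```python
def perf_per_area(relative_perf, relative_area):
    """Performance delivered per unit of chip area, relative to the baseline core."""
    return relative_perf / relative_area


# Figures quoted on the slide: EV8 is 80X bigger, with a 2X to 3X speedup.
EV8_RELATIVE_AREA = 80.0

for speedup in (2.0, 3.0):
    efficiency = perf_per_area(speedup, EV8_RELATIVE_AREA)
    # The normalized baseline core has perf_per_area == 1.0 by construction.
    print(f"speedup {speedup:.0f}X -> {efficiency:.4f} of baseline perf/area")
```

Under this normalization the large core delivers only a few percent of the baseline's performance per unit area, which is the motivation for pairing a few large cores with many small ones.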

# **NoC Architecture**

- M × N mesh architecture
- Node in the mesh
  - Processor
  - Memory module
  - Switch
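As a minimal sketch of the mesh described above, the code below models an M × N grid in which every node bundles a processor, a memory module, and a switch. The routing policy is an assumption: the slide does not specify one, so dimension-ordered (XY) routing is used here purely for illustration, and the names `Node`, `build_mesh`, `neighbors`, and `xy_route` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One mesh node: a processor, a memory module, and a switch."""
    row: int
    col: int
    processor: str = "cpu"
    memory: str = "mem"
    switch: str = "sw"


def build_mesh(m, n):
    """Create an M x N mesh keyed by (row, col)."""
    return {(r, c): Node(r, c) for r in range(m) for c in range(n)}


def neighbors(m, n, r, c):
    """Coordinates of nodes directly linked to (r, c); edge nodes have fewer links."""
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < m and 0 <= j < n]


def xy_route(src, dst):
    """Assumed XY routing: travel along the row first, then along the column."""
    r, c = src
    path = [src]
    while c != dst[1]:
        c += 1 if dst[1] > c else -1
        path.append((r, c))
    while r != dst[0]:
        r += 1 if dst[0] > r else -1
        path.append((r, c))
    return path


mesh = build_mesh(3, 4)
route = xy_route((0, 0), (2, 3))
print(len(mesh), "nodes;", len(route) - 1, "hops from (0, 0) to (2, 3)")
```

With XY routing the hop count equals the Manhattan distance between the two nodes, so communication cost grows with how far apart communicating tasks are placed on the mesh.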



# Background

#### Intel Many Integrated Core (Intel MIC) architecture

- Processing highly parallel workloads
- Standard programming models (OpenMP and Cilk with a few extensions)

#### **Knights Ferry**

Packaged as a co-processor on a PCIe card. With 32 cores running four threads apiece, it can process 128 threads at 1.2 GHz.



Photo source: ZDNet Bilkent University

# Background



#### **Problems with current parallelization techniques**

- Developed in the context of high-performance parallel machines
- Most of them parallelize one loop nest at a time
  - Cannot capture inter-nest relations well
- Their main goal is to minimize inter-processor communication
  - Not very suitable for chip multiprocessors
- What we need is data-reuse-oriented, whole-program parallelization
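The inter-nest issue can be illustrated with a toy model. The sketch below is not the method from the lecture; it assumes two hypothetical loop nests over the same array, where nest 1 touches `A[i]` and nest 2 touches `A[n-1-i]` at iteration `i`, and it measures how many elements are touched by the same core in both nests. All function names are made up for this example.

```python
def block_owner(i, n, cores):
    """Core that executes iteration i under a plain block distribution."""
    block = (n + cores - 1) // cores  # ceiling division
    return i // block


def reuse_fraction(n, cores, nest2_owner):
    """Fraction of array elements touched by the same core in both nests."""
    same = 0
    for i in range(n):
        elem = n - 1 - i                          # element nest 2 touches at iteration i
        core_nest1 = block_owner(elem, n, cores)  # nest 1 touched elem at iteration elem
        core_nest2 = nest2_owner(i)
        same += core_nest1 == core_nest2
    return same / n


n, cores = 16, 4

# Each nest parallelized on its own: nest 2 reuses the same block distribution.
independent = reuse_fraction(n, cores, lambda i: block_owner(i, n, cores))

# Whole-program view: nest 2 iterations go to the core that already holds the data.
reuse_aware = reuse_fraction(n, cores, lambda i: block_owner(n - 1 - i, n, cores))

print("independent:", independent, "reuse-aware:", reuse_aware)
```

In this toy case, parallelizing each nest in isolation leaves no element on the core that touched it before, while the reuse-aware assignment keeps every element local. On a chip multiprocessor that difference shows up as on-chip cache hits versus misses.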

## Example



# **Resource Allocation**

- Prior OS-based resource partitioning approaches
  - Advantage: transparent to applications and programmers
  - Drawback: application-oblivious and reactive
- Our goal: a *proactive* resource partitioning scheme

### Architecture and Resource Partition Schemes



(a) Abstract view of an MPSoC architecture with 8 processors

(b) Equal partitioning of resources across two applications

(c) Nonuniform partitioning of resources across two applications

(d) Nonuniform partitioning of resources across three applications
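The partitionings in panels (b) through (d) can be mimicked with a small allocation routine. This is a sketch under stated assumptions, not the scheme proposed in the lecture: it assumes each application comes with a hypothetical numeric demand weight and splits the 8 processors in proportion to those weights, using largest-remainder rounding. The function name `partition_processors` and the application labels are illustrative.

```python
def partition_processors(num_procs, demands):
    """Split num_procs among applications in proportion to their demand weights."""
    total = sum(demands.values())
    shares = {app: num_procs * d / total for app, d in demands.items()}
    alloc = {app: int(share) for app, share in shares.items()}  # round down first

    # Hand out processors lost to rounding, largest fractional remainder first.
    leftover = num_procs - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda a: shares[a] - alloc[a], reverse=True)
    for app in by_remainder[:leftover]:
        alloc[app] += 1
    return alloc


# (b) equal partitioning across two applications
print(partition_processors(8, {"A": 1, "B": 1}))
# (c) nonuniform partitioning across two applications
print(partition_processors(8, {"A": 3, "B": 1}))
# (d) nonuniform partitioning across three applications
print(partition_processors(8, {"A": 2, "B": 1, "C": 1}))
```

A proactive scheme would derive the demand weights ahead of time, for example from compiler analysis or profiling, rather than reacting to observed behavior at run time. How those weights are obtained is outside this sketch.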

### **Components of Our Approach**



### Scenario with no energy saving scheme



# **Energy Reduction Schemes**

- There are two primary groups
  - Voltage scaling techniques
  - Processor shutdown schemes
- They can be applied using hardware or an optimizing compiler
- They are applied independently
- They are applied in a disjoint manner

# **Our Approach**

#### When no errors occur

- Determined by the successful termination of the primary copy
- Terminate the replica
- Because the replica has run at a lower voltage/frequency up to this point, energy is saved compared with running the replica at the highest available voltage/frequency

#### When an error occurs in the primary copy

- Primary copy is aborted
- Replica is switched to the highest voltage/frequency level to minimize the time to complete the task
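The two cases above can be compared with a back-of-the-envelope energy model. The sketch assumes the common CMOS approximation that dynamic energy per cycle scales with the square of the supply voltage, and it uses normalized, made-up voltage and frequency levels. The function names and the parameter `c_eff` (effective switched capacitance) are illustrative, not taken from the lecture.

```python
def replica_energy_no_error(work_cycles, f_hi, v_lo, f_lo, c_eff=1.0):
    """Replica energy when the primary finishes cleanly and the replica is terminated."""
    t_primary = work_cycles / f_hi   # primary runs at the highest frequency
    cycles_done = f_lo * t_primary   # replica progress at the low setting
    return c_eff * v_lo ** 2 * cycles_done


def replica_after_error(work_cycles, t_err, v_hi, f_hi, v_lo, f_lo, c_eff=1.0):
    """Replica energy and finish time when the primary is aborted at time t_err."""
    done = min(work_cycles, f_lo * t_err)  # cycles completed at the low setting
    remaining = work_cycles - done         # cycles left, run at the high setting
    energy = c_eff * (v_lo ** 2 * done + v_hi ** 2 * remaining)
    finish_time = t_err + remaining / f_hi
    return energy, finish_time


# Normalized, hypothetical operating points.
WORK, V_HI, F_HI, V_LO, F_LO = 1000, 1.0, 1.0, 0.5, 0.5

full_speed_replica = 1.0 * V_HI ** 2 * WORK
print("replica at full speed:", full_speed_replica)
print("replica, error-free  :", replica_energy_no_error(WORK, F_HI, V_LO, F_LO))
print("replica, error at 400:", replica_after_error(WORK, 400, V_HI, F_HI, V_LO, F_LO))
```

In the error-free case the replica spends only a fraction of the energy a full-speed replica would. When an error occurs, the task finishes later than an error-free run, and the boost to the highest level bounds that delay.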

# **Our Approach**



**Error Free** 

**Transient Error** 

# **Memory Hierarchy Design (2/5)**

- Memory hierarchy management has been well studied
  - Caches
  - From the performance perspective
- Relatively less attention has been paid to
  - Software-managed memories
  - Optimizing energy behavior
- Software-managed hierarchies can be preferable to their hardware-managed counterparts
  - Able to design a customized memory hierarchy that suits the needs of the application
  - Data flow is managed by software
    - Energy-efficient: avoids the dynamic lookup performed in hardware

# **Memory Hierarchy Design (4/5)**



# **Program Representation**



- Our approach works on a control flow graph (CFG).
  - Nodes: Basic blocks
  - Edges: Control flow (conservative)
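A control flow graph of this kind can be sketched as a pair of adjacency maps. The class below is a minimal illustration, not the representation used in the lecture: block names are hypothetical, and "conservative" is taken to mean that an edge is added for every branch target, whether or not the branch can actually be taken at run time.

```python
from collections import defaultdict


class CFG:
    """Control flow graph: nodes are basic blocks, edges are possible transfers."""

    def __init__(self):
        self.blocks = set()
        self.succ = defaultdict(set)  # block -> successor blocks
        self.pred = defaultdict(set)  # block -> predecessor blocks

    def add_edge(self, src, dst):
        """Record that control may flow from src to dst."""
        self.blocks.update((src, dst))
        self.succ[src].add(dst)
        self.pred[dst].add(src)

    def reachable(self, entry):
        """Blocks reachable from entry, found by depth-first search."""
        seen, stack = set(), [entry]
        while stack:
            block = stack.pop()
            if block in seen:
                continue
            seen.add(block)
            stack.extend(self.succ[block])
        return seen


# Hypothetical program: an if/else followed by a loop.
cfg = CFG()
cfg.add_edge("entry", "then")       # branch taken
cfg.add_edge("entry", "else")       # branch not taken (kept conservatively)
cfg.add_edge("then", "join")
cfg.add_edge("else", "join")
cfg.add_edge("join", "loop_body")
cfg.add_edge("loop_body", "join")   # back edge
cfg.add_edge("join", "exit")

print(sorted(cfg.reachable("entry")))
print(sorted(cfg.pred["join"]))
```

Keeping both successor and predecessor maps makes forward and backward analyses over the basic blocks equally cheap.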