### **CS 423**

### **CMP Architecture - I**

## **Design choices**

- Processing elements
  - Number
  - Type
  - Homogeneous or heterogeneous
- Memory
  - Size
  - Hierarchy
  - Private vs shared
  - Software managed vs hardware managed
- Interconnection networks
  - Topology
  - Protocol



### **The Challenges**



#### Power = Capacitance x Voltage<sup>2</sup> x Frequency; since Frequency scales roughly with Voltage, Power ~ Voltage<sup>3</sup>
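The cubic relation can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes the clock frequency is scaled by the same factor as the supply voltage.

```c
/* Dynamic power in watts: P = C * V^2 * f
   (capacitance in farads, voltage in volts, frequency in hertz). */
double dynamic_power(double capacitance, double voltage, double frequency)
{
    return capacitance * voltage * voltage * frequency;
}

/* Power ratio when voltage AND frequency are both scaled by the factor s.
   Assumption: frequency tracks voltage linearly, which gives P ~ V^3. */
double scaled_power_ratio(double s)
{
    return s * s * s;
}
```

Under that assumption, `scaled_power_ratio(0.8)` is about 0.51: running 20% slower at 20% lower voltage roughly halves dynamic power, which is one reason several slower cores can be attractive compared with one fast core.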


#### History (List of Intel microprocessors)

- <u>The 4-bit processors</u> 4004, 4040
- <u>The 8-bit processors</u> 8008, 8080, 8085
- <u>The 16-bit processors: Origin of x86</u> 8086, 8088, 80186, 80188, 80286
- <u>The 32-bit processors: Non x86</u> iAPX 432, 80960, 80860, XScale
- <u>The 32-bit processors: The 80386 Range</u> 80386DX, 80386SX, 80376, 80386SL, 80386EX
- <u>The 32-bit processors: The 80486 Range</u> 80486DX, 80486SX, 80486DX2, 80486SL, 80486DX4
- <u>The 32-bit processors: The Pentium ("I")</u> Pentium, Pentium MMX
- <u>The 32-bit processors: P6/Pentium M</u>
   Pentium Pro, Pentium II, Celeron, Pentium III, PII and III Xeon
   Celeron(PIII), Pentium M, Celeron M, Intel Core, Dual Core Xeon LV
- <u>The 32-bit processors: NetBurst microarchitecture</u> Pentium 4, Xeon, Pentium 4 EE
- <u>The 64-bit processors: IA-64</u> Itanium, Itanium 2
- <u>The 64-bit processors: EM64T-NetBurst</u> Pentium D, Pentium Extreme Edition, Xeon
- <u>The 64-bit processors: EM64T-Core microarchitecture</u> Xeon, Intel Core 2

# **Pentium Processors**

#### Pentium I





### Pentium III



#### Pentium II





### Pentium IV



Intel has more than 15 multi-core related projects underway and plans to increase its software and solutions enabling product lines, tools, investment and programs to further spur design and validation

### Nehalem System Example: Apple Mac Pro



### **Building Blocks**



### **Example of modern core: Nehalem**



- On-chip cache resources:
  - For each core: L1: 32 KB instruction and 32 KB data cache, L2: 256 KB (1 MB in total across the 4 cores)
  - L3: 8 MB shared among all 4 cores
- Integrated, on-chip memory controller (DDR3)

Bilkent University

## Superscalar (SS) – Multicore (MP)



6-way superscalar (SS) microarchitecture (floorplan figure; die edge labelled 21 mm)

Multiprocessor (MP) microarchitecture (4 identical 2-way superscalar processors); the floorplan shows:

- Processors #1 to #4, each with its own 8K I-cache and 8K D-cache
- L2 communication crossbar and on-chip L2 cache (256KB)
- External interface
- Clocking & pads

# **Multicore Processors**

### Penryn



### Bloomfield



#### Gulftown

Figure: Gulftown die, with six cores in two groups of three around a shared L3 cache.



### Beckton



## **AMD Phenom II (Quad-Core)**





- AMD K10 (Barcelona)
- Code name "Deneb"
- 45nm process
- 4 cores, private 512KB L2
- Shared 6MB L3 (2MB in Phenom)
- Integrated Northbridge
  - Up to 4 DIMMs


- Sideband Stack Optimizer (SSO)
  - Parallelizes many POPs and PUSHes (which would otherwise depend on each other through the stack pointer)
    - Converts them into pure load/store instructions
  - No uops in FUs for stack pointer adjustment
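The idea behind the stack optimizer can be illustrated in software. The sketch below is not AMD's hardware design: `sso_rewrite`, `struct mem_op`, and `enum stack_op` are hypothetical names. It only shows how tracking a running stack-pointer offset turns each PUSH/POP into a load or store at a fixed offset from the initial stack pointer, so the memory operations no longer depend on one another.

```c
#include <stddef.h>

enum stack_op { OP_PUSH, OP_POP };

/* A PUSH/POP rewritten as a plain store/load at a fixed offset from the
   stack pointer value at the start of the sequence. */
struct mem_op {
    int is_store;   /* 1 = store (from PUSH), 0 = load (from POP) */
    long offset;    /* byte offset from the initial stack pointer */
};

/* Rewrites n stack operations into independent loads/stores.
   Returns the net stack-pointer adjustment in bytes. */
long sso_rewrite(const enum stack_op *ops, size_t n, long word_size,
                 struct mem_op *out)
{
    long delta = 0; /* running adjustment, tracked outside the ALUs */
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == OP_PUSH) {
            delta -= word_size;      /* PUSH: decrement first, then store */
            out[i].is_store = 1;
            out[i].offset = delta;
        } else {
            out[i].is_store = 0;     /* POP: load first, then increment */
            out[i].offset = delta;
            delta += word_size;
        }
    }
    return delta;
}
```

Each rewritten operation depends only on the initial stack pointer, so all of them can be issued in parallel, and the stack pointer itself needs a single adjustment by the returned delta.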

### **ARM11 MPCore**

- Up to 4 processors each with own L1 instruction and data cache
- Distributed Interrupt Controller (DIC)
  - Recall the APIC from Intel's core architecture
- Timer per CPU
- CPU interface
  - Interrupt acknowledgement, masking and completion acknowledgement
- CPU
  - Single ARM11 called MP11
- Vector floating-point unit (VFP)
  - FP co-processor
- L1 cache
- Snoop control unit
  - L1 cache coherency

### **Multicore Organization Alternatives**



# **Cell History**

- IBM, SCEI/Sony, Toshiba Alliance formed in 2000
- Design Center opened in March 2001
  - Based in Austin, Texas
- Single Cell BE operational Spring 2004
- 2-way SMP operational Summer 2004
- February 7, 2005: First technical disclosures
- October 6, 2005: Mercury Announces Cell Blade
- November 9, 2005: Open Source SDK & Simulator Published
- November 14, 2005: Mercury Announces Turismo Cell Offering
- February 8, 2006: IBM Announces Cell Blade






# **Cell Synergy**

- Cell is not a collection of different processors, but a synergistic whole

- Operation paradigms, data formats and semantics consistent
- Share address translation and memory protection model
- PPE for operating systems and program control
- SPE optimized for efficient data processing
  - SPEs share Cell system functions provided by Power Architecture
  - MFC implements interface to memory
    - Copy in/copy out to local storage
- PowerPC provides system functions
  - Virtualization
  - Address translation and protection
  - External exception handling
- EIB integrates system as data transport hub

# **Cell Chip**

### Highlights (3.2 GHz)

- 241M transistors
- 235 mm<sup>2</sup>
- 9 cores, 10 threads (Why?)
- >200 GFlops (SP)
- >20 GFlops (DP)
- Up to 25 GB/s memory B/W
- Up to 75 GB/s I/O B/W
- >300 GB/s EIB
- Top frequency >4GHz (observed in lab)
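Two of the numbers above can be reproduced with back-of-the-envelope arithmetic. The sketch below uses hypothetical helper names and assumes 8 SPEs, each with a 4-lane single-precision SIMD unit issuing one fused multiply-add (2 flops) per lane per cycle, plus a two-way multithreaded PPE.

```c
/* Peak GFLOPS = units * SIMD lanes * flops per lane per cycle * clock in GHz. */
double peak_gflops(int units, int simd_lanes, int flops_per_lane,
                   double clock_ghz)
{
    return units * simd_lanes * flops_per_lane * clock_ghz;
}

/* Hardware threads = threads on the PPE + one thread per SPE. */
int hardware_threads(int ppe_threads, int spes)
{
    return ppe_threads + spes;
}
```

With these assumptions, `peak_gflops(8, 4, 2, 3.2)` gives 204.8, consistent with ">200 GFlops (SP)", and `hardware_threads(2, 8)` gives 10, consistent with "9 cores, 10 threads".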



# **Cell Features**

- Heterogeneous multicore system architecture
  - Power Processor Element for control tasks
  - Synergistic Processor Elements for data-intensive processing
- Synergistic Processor Element (SPE) consists of
  - Synergistic Processor Unit (SPU)
  - Synergistic Memory Flow Control (MFC)
    - Data movement and synchronization
    - Interface to the high-performance Element Interconnect Bus



### **Eight Concurrent Transactions**



### **The First Generation Cell Blade**



### **Cell Temperature Graph**



- Power and heat are key constraints
  - Cell is ~80 watts at 4+ GHz
  - Cell has 10 temperature sensors
  - Prediction: PS3 will be more like 3 GHz

Source: IEEE ISSCC, 2005

# **Code Sample**

• PPE code:

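```
#include <stdio.h>
#include <libspe.h>

/* Handle to the SPE program image, linked into this PPE executable */
extern spe_program_handle_t hello_spu;

int main(void)
{
    speid_t speid;
    int status;

    /* Start one SPE thread running hello_spu (no argp/envp, mask -1, flags 0) */
    speid = spe_create_thread(0, &hello_spu, NULL, NULL, -1, 0);

    /* Wait on the SPE thread; its exit status is returned in status */
    spe_wait(speid, &status, 1);
    return 0;
}
```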

### • SPE code:

### "PPE-Centric" & "SPE-Centric" Models

- "PPE-Centric":
  - an offload model
  - main-line application code runs in the PPC core
  - individual functions are extracted and offloaded to SPEs
  - SPUs wait to be given work by the PPC core
- "SPE-Centric":
  - most of the application code is distributed among SPEs
  - PPC core runs little more than a resource manager for the SPEs (e.g., maintaining in main memory control blocks with work lists for the SPEs)
  - SPE fetches next work item (what function to execute, pointer to data, etc.) from main memory (or its own memory) when it completes current work item
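The SPE-centric work-list idea can be sketched in plain C. This is a single-threaded host-side simulation with hypothetical names (`struct work_item`, `struct control_block`, `fetch_next`, `spe_run`), not Cell SDK code. On real hardware the control block would live in main memory, items would be fetched by DMA through the MFC, and claiming an item would need an atomic update when several SPEs share the list.

```c
#include <stddef.h>

/* One work-list entry: which function to execute and a pointer to its data. */
struct work_item {
    int (*func)(const int *data, size_t len);
    const int *data;
    size_t len;
};

/* Control block maintained by the PPE (simulated here in ordinary memory). */
struct control_block {
    struct work_item *items;
    size_t count;
    size_t next;    /* index of the next unclaimed item */
};

/* Called by an SPE when it completes its current item.
   Returns NULL once the work list is exhausted. */
struct work_item *fetch_next(struct control_block *cb)
{
    if (cb->next >= cb->count)
        return NULL;
    return &cb->items[cb->next++];
}

/* Example work function: sums an integer array. */
int sum_ints(const int *data, size_t len)
{
    int total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];
    return total;
}

/* Main loop of one simulated SPE: pull and run items until none remain.
   Stores each result in order and returns the number of items processed. */
size_t spe_run(struct control_block *cb, int *results)
{
    size_t done = 0;
    struct work_item *w;
    while ((w = fetch_next(cb)) != NULL)
        results[done++] = w->func(w->data, w->len);
    return done;
}
```

In the PPE-centric model the direction is reversed: the PPE would call the offloaded function on an SPE and the SPE would sit idle until asked, rather than pulling work itself.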