

# **Computer Architecture**

## Lecture 16: TLP (2)

## 张献伟

<u>xianweiz.github.io</u>

DCS3013, 11/28/2022





## Quiz Questions



For remote attendees, please email your answers to zhangxw79@mail.sysu.edu.cn (deadline: 14:40).

- Q1: A cache has a capacity of 16KB, is 4-way set associative, and uses 32B blocks. Split the 32-bit address into tag, index, and offset.
  - #sets = 16KB / 32B / 4 = 128, so tag = [31-12], index = [11-5], offset = [4-0]
- Q2: For the above 16KB cache, how can you improve its performance? List 2 techniques.
  - Higher associativity; critical word first; way prediction; victim ...
- Q3: The DRAM interface is 64 bits wide, and each chip is 4 bits wide with a 4Gb capacity. What is the rank capacity?
  - (64b / 4b) \* 4Gb = 64Gb = 8GB
- Q4: Why does HBM offer much higher bandwidth than DDR/GDDR?
  - Much wider interface (stacking/closer to processor).
- Q5: List 3 advantages of NVM compared to DRAM.
  - Non-volatile, higher capacity, lower cost, ...
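The arithmetic behind Q1 and Q3 can be checked with a short sketch. The function names `split_address` and `rank_capacity_bytes` are made up for this illustration, and the code assumes all sizes are powers of two.

```python
def split_address(capacity_bytes, ways, block_bytes, addr_bits=32):
    """Return (#sets, tag bits, index bits, offset bits) for a set-associative cache."""
    sets = capacity_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, tag_bits, index_bits, offset_bits


def rank_capacity_bytes(interface_bits, chip_width_bits, chip_capacity_bits):
    """A rank needs enough chips to fill the interface width."""
    chips = interface_bits // chip_width_bits
    return chips * chip_capacity_bits // 8


# Q1: 16KB, 4-way, 32B blocks -> 128 sets, 20-bit tag, 7-bit index, 5-bit offset
print(split_address(16 * 1024, 4, 32))
# Q3: 64b interface, x4 chips of 4Gb each -> 8GB
print(rank_capacity_bytes(64, 4, 4 * 2**30) // 2**30, "GB")
```

The 20/7/5 split corresponds to the bit ranges [31-12], [11-5], and [4-0] in the answer to Q1.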

#### Shared Memory

- The term "shared memory" associated with both SMP and DSM refers to the fact that the address space is shared
  - Communication among threads occurs through the shared address space
  - Thus, a memory reference can be made by any processor to any memory location



#### Cache Coherence

- Processors may see different values of the same data
  - Each processor views memory through its own cache; without additional precautions, two processors could end up seeing two different values for the same location
- Cache coherence problem
  - Conflicts between global state (main memory) and local state (private cache)
  - At time 4, what if processor B reads X?

Processors A and B each have a private write-through (WT) cache:

| Time | Event                          | Cache contents for<br>processor A | Cache contents for<br>processor B | Memory contents for<br>location X |
|------|--------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|
| 0    |                                |                                   |                                   | 1                                 |
| 1    | Processor A reads X            | 1                                 |                                   | 1                                 |
| 2    | Processor B reads X            | 1                                 | 1                                 | 1                                 |
| 3    | Processor A stores<br>0 into X | 0                                 | 1                                 | 0                                 |
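The table can be replayed with a toy model. This is a minimal sketch that assumes two private write-through caches modeled as plain dictionaries and no coherence mechanism at all; the helper names are illustrative only.

```python
def run_without_coherence():
    """Replay times 0-3 from the table, then let B read X at time 4."""
    memory = {"X": 1}
    cache_a, cache_b = {}, {}

    def read(cache, addr):
        # On a miss, copy the value from memory; on a hit, use the cached copy.
        if addr not in cache:
            cache[addr] = memory[addr]
        return cache[addr]

    def write_through(cache, addr, value):
        # Update the local copy and memory, but nobody else's cache.
        cache[addr] = value
        memory[addr] = value

    read(cache_a, "X")              # time 1: A reads X
    read(cache_b, "X")              # time 2: B reads X
    write_through(cache_a, "X", 0)  # time 3: A stores 0 into X
    stale = read(cache_b, "X")      # time 4: B hits on its old copy
    return stale, memory["X"]


print(run_without_coherence())  # B's view of X, then memory's view of X
```

In this model B's read at time 4 hits in its own cache and returns the old value 1, even though memory already holds 0.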



## A memory system is coherent, if

- A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
  - (1) Preserves program order
- A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
  - (2) Defines the notion of what it means to have a coherent view of memory
- Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors
  - (3) Write serialization
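Write serialization (property 3) can be phrased as a check over observations. The sketch below assumes each processor reports the sequence of distinct values it saw in location X over time, and asks whether a single global write order is compatible with all of them. The function name and the input format are assumptions for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def writes_serialized(observations):
    """observations maps a processor name to the values of X it saw, in order."""
    predecessors = {}
    for seen in observations.values():
        for earlier, later in zip(seen, seen[1:]):
            if earlier == later:
                continue  # re-reading the same value adds no ordering constraint
            predecessors.setdefault(earlier, set())
            predecessors.setdefault(later, set()).add(earlier)
    try:
        # A valid topological order is one global order that all observers agree with.
        tuple(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


print(writes_serialized({"P1": [1, 2], "P2": [1, 2]}))  # same order everywhere
print(writes_serialized({"P1": [1, 2], "P2": [2, 1]}))  # P1 and P2 disagree
```

If two processors see the writes of 1 and 2 in opposite orders, no single order satisfies both, which is exactly what write serialization forbids.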






## Consistency Also Matters

- The three properties (1)(2)(3) are sufficient to ensure <u>coherence</u>
- However, when a written value will be seen is also important
  - If a write of X on one processor precedes a read of X on another processor by a very small time, it may be impossible to ensure that the read returns the value written, since the written data may not even have left the processor at that point
- Memory consistency: when a written value must be seen by a reader
  - A and B are both initially 0

| Thread 1     | Thread 2     |
|--------------|--------------|
| (1) A = 1    | (3) B = 1    |
| (2) print(B) | (4) print(A) |

- What can this program output?
  - 01: (1)(2)(3)(4) or (3)(4)(1)(2)
  - 11: (1)(3)(2)(4) or (1)(3)(4)(2)
  - 00?
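One way to explore the question is to enumerate every interleaving that keeps each thread's own program order, i.e. the executions a sequentially consistent machine would allow. This is a sketch under that assumption; the `PROGRAM` encoding and function names are illustrative.

```python
from itertools import permutations

# Thread 1: (1) A = 1, (2) print(B).  Thread 2: (3) B = 1, (4) print(A).
PROGRAM = {
    1: ("store", "A"),
    2: ("print", "B"),
    3: ("store", "B"),
    4: ("print", "A"),
}


def respects_program_order(order):
    # (1) must come before (2), and (3) must come before (4).
    return order.index(1) < order.index(2) and order.index(3) < order.index(4)


def run(order):
    memory = {"A": 0, "B": 0}
    printed = ""
    for step in order:
        op, var = PROGRAM[step]
        if op == "store":
            memory[var] = 1
        else:
            printed += str(memory[var])
    return printed


def sc_outputs():
    """Map each printed string to the interleavings that produce it."""
    results = {}
    for order in permutations([1, 2, 3, 4]):
        if respects_program_order(order):
            results.setdefault(run(order), []).append(order)
    return results


for output, orders in sorted(sc_outputs().items()):
    print(output, orders)
```

Under this model only 01 and 11 appear, so an output of 00 requires the hardware to make a write visible later than program order would suggest.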

## Coherence vs. Consistency

- Coherence
  - Defines what values can be returned by a read
  - All reads by any processor must return the most recently written value
  - Writes to the same location by any two processors are seen in the same order by all processors
- Consistency
  - Determines when a written value will be returned by a read
  - Consistency ensures that writes to different locations will be seen in an order that <u>makes sense</u>, given the source code
  - If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A





## Enforcing Coherence

- Coherent caches provide
  - Migration: movement of data
    - A data item can be moved to a local cache and used there in a transparent fashion
  - Replication: multiple copies of data
    - Make a copy of the data item in the local cache, so that shared data can be simultaneously read
- Whose responsibility? Software?
  - Can the programmer ensure coherence if caches are invisible to software?
  - What if the ISA provided a cache flush instruction?
    - FLUSH-LOCAL A: flushes/invalidates the cache block containing address A from a processor's local cache
    - FLUSH-GLOBAL A: flushes/invalidates the cache block containing address A from all other processors' caches
    - FLUSH-CACHE X: flushes/invalidates all blocks in cache X
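To see what software-managed coherence would look like, here is a toy model of the flush instructions. `Core`, `flush_local`, and `flush_global` are hypothetical names that mirror FLUSH-LOCAL and FLUSH-GLOBAL from the slide; they do not correspond to any real ISA, and the caches are assumed to be write-through.

```python
class Core:
    """A core with a private write-through cache and no hardware coherence."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def load(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value

    def flush_local(self, addr):
        # Hypothetical FLUSH-LOCAL: drop this core's copy of the block.
        self.cache.pop(addr, None)


def flush_global(requester, cores, addr):
    # Hypothetical FLUSH-GLOBAL: drop the block from every other core's cache.
    for core in cores:
        if core is not requester:
            core.flush_local(addr)


def demo():
    memory = {"X": 1}
    a, b = Core(memory), Core(memory)
    a.load("X")
    b.load("X")
    a.store("X", 0)
    before = b.load("X")           # stale copy still in B's cache
    flush_global(a, [a, b], "X")   # the programmer must remember this call
    after = b.load("X")            # B misses and refetches from memory
    return before, after


print(demo())
```

The correctness of `demo` depends entirely on the programmer inserting the flush at the right place.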



## Enforcing Coherence (cont.)

- Software solutions incur high overhead
  - And add programming burden
- Multiprocessors adopt a hardware solution to maintain coherent caches
  - Supporting migration and replication is critical to performance in accessing shared data
- For the earlier example,
  - Invalidate all other copies of X when A writes to it

| Time | Event                          | Cache contents for<br>processor A | Cache contents for<br>processor B | Memory contents for<br>location X |
|------|--------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|
| 0    |                                |                                   |                                   | 1                                 |
| 1    | Processor A reads X            | 1                                 |                                   | 1                                 |
| 2    | Processor B reads X            | 1                                 | 1                                 | 1                                 |
| 3    | Processor A stores<br>0 into X | 0                                 | invalidated (was 1)               | 0                                 |

#### How do you know which copies to invalidate?



#### Coherence Protocols

- Cache coherence protocols: the rules that maintain coherence across multiple processors
  - The key is to track the sharing state of every data block
- Two classes of protocols
  - Snooping
    - Each core tracks the sharing status of each block it caches
  - Directory based
    - The sharing status of each block is kept in one location



## Snooping Coherence Protocols

- Write invalidation protocol
  - Ensure that a processor has <u>exclusive access</u> to a data item before it writes that item
  - Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs
    - All other cached copies of the item are invalidated
- Write update/broadcast protocol
  - Update all the cached copies of a data item when that item is written
  - Must broadcast all writes to shared cache lines, and thus consumes considerably more bandwidth
- Write invalidation protocol is by far the most common
  - We'll focus on it
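The bandwidth difference between the two protocols can be illustrated by counting bus transactions on a small access trace. This is a deliberately simplified sketch for a single block: the trace format, the function name, and the counting rules (one transaction per miss, per invalidation, or per update broadcast) are assumptions, not a model of any real bus.

```python
def count_bus_transactions(trace, protocol):
    """trace is a list of (processor, op) pairs, with op "r" for read or "w" for write."""
    holders = set()    # processors that currently hold a valid copy
    exclusive = None   # processor with exclusive access (write invalidate only)
    count = 0
    for proc, op in trace:
        if op == "r":
            if proc not in holders:
                count += 1            # read miss goes on the bus
                holders.add(proc)
                exclusive = None      # block is shared again
        elif protocol == "invalidate":
            if exclusive != proc:
                count += 1            # one invalidation gains exclusive access
                holders = {proc}
                exclusive = proc
        else:  # write update
            if proc not in holders or len(holders) > 1:
                count += 1            # miss, or broadcast of the new value to sharers
            holders.add(proc)
    return count


trace = [("A", "r"), ("B", "r")] + [("A", "w")] * 4 + [("B", "r")]
print(count_bus_transactions(trace, "invalidate"))
print(count_bus_transactions(trace, "update"))
```

With write invalidate, A pays once for exclusive access and its later writes stay local, while write update broadcasts every write as long as B holds a copy.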



- Write invalidate
  - On write, invalidate all other copies
  - Use the bus itself to serialize writes
- Example
  - Invalidation protocol working on a snooping bus for a single block (X) with write-back caches

| Processor activity          | Bus activity          | Contents of processor<br>A's cache | Contents of processor<br>B's cache | Contents of memory<br>location X | Note |
|-----------------------------|-----------------------|------------------------------------|------------------------------------|----------------------------------|------|
|                             |                       |                                    |                                    | 0                                | Neither cache initially holds X and the value of X in memory is 0 |
| Processor A reads X         | Cache miss<br>for X   | 0                                  |                                    | 0                                | Processor A reads X, migrating from memory to the local cache |
| Processor B reads X         | Cache miss<br>for X   | 0                                  | 0                                  | 0                                | Processor B reads X, migrating from memory to the local cache |
| Processor A writes a 1 to X | Invalidation<br>for X | 1                                  |                                    | 0                                | Processor A writes X, invalidating the copy on B |
| Processor B reads X         | Cache miss<br>for X   | 1                                  | 1                                  | 1                                | Processor B reads X; A responds with the value, canceling the response from memory, and updates both B's cache and memory |
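The table can be reproduced with a small simulation. The sketch below assumes a two-state simplification (`"S"` for a shared clean copy, `"M"` for a modified copy) with write-back caches; the class names and the single-value blocks are illustrative and leave out most of a real protocol.

```python
class SnoopingBus:
    """Shared bus: every transaction is visible to all attached caches."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []
        self.log = []

    def read_miss(self, requester, addr):
        self.log.append(f"Cache miss for {addr}")
        for cache in self.caches:
            line = cache.lines.get(addr)
            if cache is not requester and line and line[0] == "M":
                # The owner supplies the value and memory is updated as well.
                self.memory[addr] = line[1]
                cache.lines[addr] = ("S", line[1])
        return self.memory[addr]

    def invalidate(self, requester, addr):
        self.log.append(f"Invalidation for {addr}")
        for cache in self.caches:
            if cache is not requester:
                line = cache.lines.pop(addr, None)
                if line and line[0] == "M":
                    self.memory[addr] = line[1]


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}  # addr -> (state, value)
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = ("S", self.bus.read_miss(self, addr))
        return self.lines[addr][1]

    def write(self, addr, value):
        line = self.lines.get(addr)
        if line is None or line[0] != "M":
            self.bus.invalidate(self, addr)  # gain exclusive access first
        self.lines[addr] = ("M", value)      # write-back: memory is not updated yet

    def peek(self, addr):
        line = self.lines.get(addr)
        return None if line is None else line[1]


def run_example():
    memory = {"X": 0}
    bus = SnoopingBus(memory)
    a, b = Cache(bus), Cache(bus)
    rows = []
    for action in (lambda: a.read("X"), lambda: b.read("X"),
                   lambda: a.write("X", 1), lambda: b.read("X")):
        action()
        rows.append((a.peek("X"), b.peek("X"), memory["X"]))
    return rows, bus.log


rows, log = run_example()
for row, activity in zip(rows, log):
    print(activity, row)
```

Each printed row lists A's cache, B's cache, and memory for X after the corresponding step, with `None` standing for "no copy".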

