

# Efficient and Scalable Core Multiplexing with M<sup>3</sup>v

Nils Asmussen, Sebastian Haas, Carsten Weinhold, Till Miemietz, Michael Roitzsch

Fachgruppentreffen (special interest group meeting), Erlangen, 19 September 2022









#### Key ideas:

- DTU as new hardware component
- Tiles are isolated by default
- OS on dedicated tile
- Fast-path communication







# Comparison of Core Multiplexing Approaches



#### M<sup>3</sup> (ASPLOS'16)


























- Suspend App1 until new message, schedule App2
- Resume App1 upon new message
- Multiplexing conflicts with fast-path communication










- Only the OS can provide access to tile-external resources
- Restoring DTU state provides access to all resources
- TileMux **must not** restore DTU state!


- M<sup>3\*</sup> provides better isolation than conventional architectures
- M<sup>3</sup>x and M<sup>3</sup>v trade some isolation for better resource utilization
- M<sup>3</sup>v trades some more isolation for better efficiency





































- If the current app waits for new messages, other apps should get the chance to run
- Applications fetch new messages directly from the vDTU
- If there is none *and* other apps are ready, the app calls TileMux to block
- Race condition between checking for new messages and blocking (similar to the lost-wakeup problem):
  - The vDTU tracks the number of new messages of the current app
  - The privileged interface offers a command to atomically switch to a new app
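
The two sub-points above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyVDTU`, `switch_app`, and `block_current` are hypothetical names invented for illustration, not the real vDTU or TileMux interface. The model assumes that the atomic switch command also reports the number of unread messages of the previous app, which is one plausible way to combine the message counter with the switch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyVDTU:
    """Toy model of a vDTU: one message queue per app, one current app."""
    cur_app: str
    queues: dict = field(default_factory=dict)  # app name -> deque of messages

    def deliver(self, app, msg):
        # Messages can arrive at any time, including for the current app.
        self.queues.setdefault(app, deque()).append(msg)

    def fetch(self):
        # Unprivileged path: the current app fetches directly, no TileMux involved.
        queue = self.queues.get(self.cur_app)
        return queue.popleft() if queue else None

    def switch_app(self, next_app):
        # Hypothetical privileged command: switch to next_app and report, in the
        # same step, how many unread messages the previous app still had.
        old_app = self.cur_app
        old_unread = len(self.queues.get(old_app, ()))
        self.cur_app = next_app
        return old_app, old_unread


def block_current(vdtu, next_app, ready, blocked):
    # TileMux side of a blocking request from the current app.
    old_app, old_unread = vdtu.switch_app(next_app)
    if old_unread > 0:
        ready.append(old_app)   # a message slipped in after the app's check
    else:
        blocked.add(old_app)    # nothing pending: wait until a message arrives


# The racy interleaving: app1 finds no message, then one arrives before it blocks.
vdtu = ToyVDTU(cur_app="app1")
ready, blocked = deque(), set()
first_check = vdtu.fetch()        # None: nothing there yet
vdtu.deliver("app1", "request")   # arrives inside the race window
block_current(vdtu, "app2", ready, blocked)
# app1 ends up in `ready`, not in `blocked`, so the wakeup is not lost.
```

Because the switch and the unread count come from a single step in this model, no message can arrive between "check" and "block" without being noticed.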







#### M<sup>3</sup>x

- Incoming messages cannot be stored in memory if the receiver is blocked
- The receive EP is not available while the receiver is blocked
- M<sup>3</sup>x resorts to a "slow-path" by forwarding messages over the OS tile

#### M<sup>3</sup>v

- The vDTU knows all EPs and can always store the message
- If the owner of the receive EP is blocked, the vDTU injects an interrupt
- TileMux marks the receiver as ready
- Best case: neither the OS tile nor TileMux is involved in the communication
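
The M<sup>3</sup>v receive path described above can be sketched as a toy model. All names here (`ToyTileMux`, `ToyTileVDTU`, `on_message_irq`) are hypothetical and chosen for illustration only. As a simplifying assumption, the model injects an interrupt whenever the owner of the receive EP is not the currently running app and lets TileMux decide whether that app was blocked.

```python
from collections import deque


class ToyTileMux:
    """Toy tile-local multiplexer: tracks blocked and ready apps."""

    def __init__(self, blocked_apps):
        self.blocked = set(blocked_apps)
        self.ready = deque()
        self.interrupts = 0

    def on_message_irq(self, app):
        # Interrupt injected by the vDTU: mark the receiver as ready.
        self.interrupts += 1
        if app in self.blocked:
            self.blocked.remove(app)
            self.ready.append(app)


class ToyTileVDTU:
    """Toy vDTU that knows the receive EPs of all apps on its tile."""

    def __init__(self, ep_owner, tilemux, cur_app):
        self.ep_owner = dict(ep_owner)  # receive EP id -> owning app
        self.buffers = {ep: deque() for ep in self.ep_owner}
        self.tilemux = tilemux
        self.cur_app = cur_app

    def receive(self, ep, msg):
        # The message is always stored, whether or not the owner is running.
        self.buffers[ep].append(msg)
        owner = self.ep_owner[ep]
        if owner != self.cur_app:
            self.tilemux.on_message_irq(owner)


mux = ToyTileMux(blocked_apps=["app2"])
tile = ToyTileVDTU({0: "app1", 1: "app2"}, mux, cur_app="app1")

tile.receive(0, "for the running app")  # best case: stored, no interrupt
tile.receive(1, "for the blocked app")  # stored, then TileMux is notified
```

In this run the first message involves neither TileMux nor the OS tile, while the second one causes exactly one interrupt and moves `app2` to the ready queue.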



- Setup: gem5 simulator, 3 GHz out-of-order x86-64 cores
- Every tile runs: SQLite/find benchmark and in-memory filesystem


















- Xilinx VCU118 FPGA
- RISC-V: in-order Rocket or out-of-order BOOM
- Rocket at 100 MHz, BOOM at 80 MHz
- 2x16 kB L1, 512 kB L2
- vDTU contains 128 EPs








| Component           | LUTs [k] | FFs [k] | BRAMs |
|---------------------|----------|---------|-------|
| BOOM                | 143.8    | 71.8    | 159   |
| Rocket              | 46.6     | 22.0    | 152   |
| NoC router          | 3.4      | 2.2     | 0     |
| vDTU                | 15.2     | 5.8     | 0.5   |
| Control Unit        | 10.3     | 3.3     | 0.5   |
| NoC CTRL            | 3.2      | 1.5     | 0     |
| CMD CTRL            | 7.1      | 2.8     | 0.5   |
| Unpriv. IF          | 6.2      | 2.5     | 0.5   |
| Priv. IF            | 0.9      | 0.3     | 0     |
| Register file       | 2.0      | 1.0     | 0     |
| Memory mapper + PMP | 0.6      | 0.2     | 0     |
| I/O FIFOs           | 2.3      | 0.3     | 0     |
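
To put these numbers into perspective, the following short calculation relates the vDTU to the two cores, using only values from the table. The helper name `relative_cost` is made up for this example, and the resulting percentages are derived figures rather than numbers stated on the slides.

```python
# Resource usage in thousands, copied from the table above.
luts_k = {"BOOM": 143.8, "Rocket": 46.6, "vDTU": 15.2}
ffs_k = {"BOOM": 71.8, "Rocket": 22.0, "vDTU": 5.8}


def relative_cost(usage, component, core):
    """Return the component's usage as a percentage of the core's usage."""
    return round(100.0 * usage[component] / usage[core], 1)


vdtu_vs_boom_luts = relative_cost(luts_k, "vDTU", "BOOM")      # 10.6
vdtu_vs_rocket_luts = relative_cost(luts_k, "vDTU", "Rocket")  # 32.6
vdtu_vs_boom_ffs = relative_cost(ffs_k, "vDTU", "BOOM")        # 8.1
vdtu_vs_rocket_ffs = relative_cost(ffs_k, "vDTU", "Rocket")    # 26.4
```

Measured in LUTs, the vDTU adds roughly a tenth of a BOOM core and roughly a third of a Rocket core.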




#### Performance Comparison with Linux



- LevelDB receives requests from remote machine and sends result back
- Requests generated with YCSB; different shares of read/insert/update/scan
- Single BOOM core runs: LevelDB, pager, filesystem, network stack






- M<sup>3</sup> explores a new system architecture with a new per-tile hardware component
- M<sup>3</sup>v shows how general-purpose cores can be multiplexed efficiently
- The FPGA implementation demonstrates modest additional hardware cost
- Performance is competitive with Linux on context-switch-heavy workloads
- The complete hardware/software stack is available as open source: https://github.com/Barkhausen-Institut/M3

# **Backup Slides**

### Microbenchmarks: IPC and Context Switches





### Microbenchmarks: File System





### Microbenchmarks: Networking

### Macrobenchmarks: YCSB

### Comparison with M<sup>3</sup>x: OS-tile utilization

### Hardware Implementation



