



# OS Challenges for Modern Memory Systems

Christian Dietrich, Daniel Lohmann

Technische Universität Hamburg, Leibniz Universität Hannover

Winterschool on Operating Systems 2023





#### 6 OS Challenges for Modern Memory Systems

- 6.1 Virtualizing Memory – A Short Recap
- 6.2 Hardware Developments
- 6.3 ParPerOS – Contention-Avoiding Design
- 6.4 Summary and Conclusion
- 6.5 References





- **Operating System** → multiplexing and isolation of hardware by means of hardware virtualization
- Virtual hardware is represented by the OS's **basic abstractions** (example: UNIX)












About This Lecture


06-MemoryChallenges 2023-04-04



[Figure: the UNIX basic abstractions (address space, thread, process, signal) as virtual hardware on top of the real hardware (RAM, I/O device, IRQ); the user program interacts with its environment via these abstractions.]





# 6.1 Virtualizing Memory – A Short Recap

### Recap: Physical and Virtual Address Space



- Physical address space $A_p$ $\rightarrow$ all hardware addresses
  - defined by the hardware manufacturer (OEM)
  - contains memory-mapped hardware devices and all physical memory (RAM, ROM, NVRAM) $\rightarrow$ main memory
- Virtual address space $A_v$ $\rightarrow$ all software addresses (for some process $p$)
  - defined by the operating system
  - contains memory-mapped files and all code and data of the program running in $p$ $\rightarrow$ (virtual) working memory of process $p$

The OS dynamically maps virtual (logical) addresses to physical addresses and (optionally) also to backing store: $p: A_v \mapsto A_p \vee BS$.

$\rightarrow$ isolation $\rightarrow$ multiplexing



- Virtual addresses generated by process *p* are translated
  - transparently via an OS-controlled process-specific mapping  $p: A_v \mapsto A_p$
  - mapping is done "on access" by the MMU (memory management unit)
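The translation step can be illustrated with a minimal sketch. It assumes a flat, single-level page table stored as a Python dictionary and 4 KiB pages; real MMUs walk multi-level tables in hardware, and the names `translate`, `PageFault`, and `mapping_p` are hypothetical.

```python
PAGE_SIZE = 4096  # 4 KiB pages (assumption for this sketch)


class PageFault(Exception):
    """Raised when a virtual page has no mapping; the OS would handle this."""


def translate(page_table: dict[int, int], vaddr: int) -> int:
    """Translate a virtual address to a physical one via a flat page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # virtual page number, page offset
    if vpn not in page_table:
        raise PageFault(hex(vaddr))
    return page_table[vpn] * PAGE_SIZE + offset


# Hypothetical mapping of process p: virtual page 2 -> physical frame 7
mapping_p = {2: 7}
paddr = translate(mapping_p, 2 * PAGE_SIZE + 0x10)
```

Because the mapping is per process, two processes can use the same virtual address and still reach different frames, which is what provides isolation.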








Fundamental principle of **demand paging**  $\rightsquigarrow$  the OS can do lots of things *lazily*.



- Fundamental principle of **demand paging**  $\rightsquigarrow$  provide RAM only *implicitly*.
  - Delay provision of page frames until actually needed.
  - Implicitly share page frames as long as possible.
  - Transfer page frames instead of data (zero copy).

### Textbook examples

- fork() and the mighty **copy-on-write** (COW)
- Unified buffer cache [11]: The 4 KiB page frame has become the de facto entity for everything!

$\hookrightarrow$ Save on the scarce and expensive physical memory and avoid/delay costly copy operations.
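The copy-on-write idea can be sketched as a toy model. This is not how a kernel implements it: the classes `Frame` and `AddressSpace` are hypothetical, page tables are plain dictionaries, and write protection is replaced by an explicit reference-count check.

```python
class Frame:
    """A physical page frame with a reference count."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1


class AddressSpace:
    """Toy address space: virtual page number -> (possibly shared) Frame."""

    def __init__(self):
        self.pages: dict[int, Frame] = {}

    def map_page(self, vpn: int, data: bytes = b"") -> None:
        self.pages[vpn] = Frame(data)

    def fork(self) -> "AddressSpace":
        """Share all frames with the child; only reference counts change."""
        child = AddressSpace()
        for vpn, frame in self.pages.items():
            frame.refs += 1
            child.pages[vpn] = frame
        return child

    def read(self, vpn: int) -> bytes:
        return bytes(self.pages[vpn].data)

    def write(self, vpn: int, data: bytes) -> None:
        """Copy the frame on the first write if it is still shared."""
        frame = self.pages[vpn]
        if frame.refs > 1:
            frame.refs -= 1
            frame = Frame(frame.data)  # private copy for the writer
            self.pages[vpn] = frame
        frame.data[:] = data


parent = AddressSpace()
parent.map_page(0, b"hello")
child = parent.fork()
shared_before = child.pages[0] is parent.pages[0]
child.write(0, b"world")
shared_after = child.pages[0] is parent.pages[0]
```

The copy is delayed until the child writes; until then both address spaces refer to the same frame, and the reference count is the bookkeeping that makes this safe.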



- Sharing demands extensive OS-internal bookkeeping
  - Object-specific virtual  $\longleftrightarrow$  physical mappings, lots of reference counting.
  - Nested COW relationships make things even more complicated.


#### Scalability of Frame Allocation in Linux

Example: Linux frame allocation

Linux 6.0 on Xeon(R) Gold 5320: 2  $\times$  26 physical cores @ 2.20 GHz, 256/512 GiB DRAM/NVRAM per node
















Normal pages: can be placed everywhere, but the management overhead might differ.

Huge pages: reduce table overhead and TLB pressure, but require alignment.

#### **Problem 2**: External Fragmentation (Again)

Example: Linux frame allocation

Linux 6.0 on Xeon(R) Gold 5320: 2  $\times$  26 physical cores @ 2.20 GHz, 256/512 GiB DRAM/NVRAM per node

- 55% of normal frames free (random distribution).
- 10% are freed and reallocated per iteration.

**Problem:** External fragmentation is back :-(
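Why randomly distributed free frames hurt huge-frame allocation can be shown with a small simulation. It is only a sketch of the initial state, not the benchmark itself: the memory size, the seed, and the helper `free_huge_frames` are assumptions made for illustration.

```python
import random

FRAMES_PER_HUGE = 512  # 2 MiB / 4 KiB


def free_huge_frames(free: list[bool]) -> int:
    """Count aligned 512-frame chunks in which every normal frame is free."""
    return sum(
        all(free[start:start + FRAMES_PER_HUGE])
        for start in range(0, len(free), FRAMES_PER_HUGE)
    )


rng = random.Random(42)                  # fixed seed, arbitrary choice
total = 256 * FRAMES_PER_HUGE            # 512 MiB of 4 KiB frames (assumed)
free = [rng.random() < 0.55 for _ in range(total)]

free_ratio = sum(free) / total           # roughly 55% of the frames are free
huge_available = free_huge_frames(free)  # aligned, completely free chunks
```

Although more than half of the memory is free, the chance that all 512 frames of one aligned chunk are free is about $0.55^{512}$, so in practice no huge frame can be allocated without compaction.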







# 6.2 Hardware Developments

#### Hardware Developments: Overview

- PDP-11 Model: Single CPU is Queen of the system
  - One virtual address space, secondary storage is separate
  - Direct memory access is an optimization for I/O
- Current Reality: More Memory / Users / Interconnects
  - Deep cache hierarchy of shared coherent CPU caches
  - Multiple cores with separate MMUs with non-coherent TLBs
  - Memory types with different latencies and properties
  - PCIe is the interconnect standard
  - SSDs provide fast random block access
  - Remote DMA provides access to the PCIe bus
  - Peripheral memory access via IOMMU but w/o coherency
  - Accelerators are more efficient than the CPU

#### Hardware Developments: Cache Hierarchy, NUMA, and CXL!

The memory hierarchy becomes deeper

Example: Intel Xeon Gold 5320 (Random Read Access)

- Caches: 1.6 ns (L1), 6 ns (L2), 21 ns (L3)
- RAM: Local: 95 ns, Remote: 155 ns
- Other: Optane: 170-305 ns [23], RDMA: 600 ns (@200G)

Compute Express Link (CXL) is a PCIe protocol

- Use PCIe as inter-machine interconnect
- CXL.mem: NUMA-like latencies for remote memory
- CXL.cache: devices with cache coherency

#### **Problem 3**: The Memory Hierarchy is a Network

Problem: "Your computer is already a distributed system" [3]



- **MCS-Lock**: A Fair, NUMA-oblivious Spinlock
  - Idea: waiter queue, local spinning
  - Standard lock for Linux (replaced test-and-test)
  - Everybody spins on its own cache line
  - Shared state: tail pointer
  - lock(): waiters enqueue themselves via a CAS operation
  - Wait: threads poll a local cache line
  - unlock(): next->spin=0, 1 cache-line transfer

- Traditional spinlocks are problematic on NUMA
  - Common wisdom: "Locks should be FIFO!"
  - FIFO ensures fairness and avoids starvation
  - But: the lock holder bounces between NUMA sockets

- **CNA-Lock**: Compact NUMA-aware Spinlock [7]
  - Idea: prefer waiters on the local NUMA node
  - The lock holder has a queue of non-local waiters
  - Becomes **unfair** in favor of performance
  - Enqueue works like in the MCS lock
  - unlock() moves remote waiters into a 2<sup>nd</sup> queue
  - The secondary queue is passed on
  - No local waiters: switch NUMA node

Both locks solve memory problems!

- "Thundering-Herd Problem"
  - TAS: Invalidate shared cache line  $\Rightarrow$  (n-1) misses
  - MCS: Unlock provokes exactly one cache miss
  - Principle: Shared memory is 1-to-N communication. Keep N small!
- NUMA-Aware Programming
  - MCS: Protected state bounces between sockets
  - CNA: Lock sticks to NUMA socket
  - Principle: Keep the control flow where the cached data is.


#### **Problem 4**: Designed for Scarcity, Not for Latency

|                                   | Sun 33 (1990)           | Xeon 5320 (2022)          | Factor     |
|-----------------------------------|-------------------------|---------------------------|------------|
| CPU                               | 1× @ 33 MHz, 10-11 MIPS | 2×28 @ 2-3 GHz, 100k MIPS | 1000       |
| TLB entries per thread            | 64                      | 132 L1 + 1500 L2          | 25         |
| L1D: Size                         | 256 B                   | 64 KiB                    | 256        |
| L1D: Latency <sup>1</sup>         | 180 ns                  | 1 ns                      | 180        |
| RAM: Size                         | ≤ 128 MiB               | ≤ 3 TiB                   | 25000      |
| RAM: Latency <sup>1</sup>         | 210 ns                  | 100 ns                    | 2          |
| RAM: Read (1 MiB) <sup>1</sup>    | 3200 µs                 | 3 µs                      | 1000       |
| RAM: Bandwidth                    | 200 MiB/s               | 120 GiB/s [20]            | 600        |
| Network (Read 2 KiB) <sup>1</sup> | 1448 µs                 | 16 ns                     | 90500      |
| Disk (Read 1 MiB) <sup>1</sup>    | 640 ms                  | 825 µs / 125 µs (SSD)     | 775 / 5000 |

<sup>1</sup>Typical from https://colin-scott.github.io/personal\_website/research/interactive\_latency.html

Problem: Memory has become abundant, but latencies and TLB are killers!



- The physical memory is 25k× larger!
  - 1 GiB  $\stackrel{\wedge}{=}$  512 huge frames  $\stackrel{\wedge}{=}$  262K frames
  - The Sun 33 (1990) had 32K frames
- Challenge: Meta-Data Overhead
  - struct page stores 64 B metadata per frame
    - 1 GiB  $\stackrel{\wedge}{=}$  16 MiB of meta-data
  - Linux spends 1.56 % of its DRAM for this!
- Multiple Frame Sizes
  - Huge frames extend the TLB reach.
  - Using huge frames saves on page tables.
- Challenge: Allocation Policy
  - When to allocate which granularity?
  - Huge frames are worse for Copy-on-Write
  - Support for existing software!










### Transparent Huge Pages

- Should the OS map a 4 KiB frame or a 2 MiB frame?
  - + 4 KiB: less memory, faster copy (CoW)
  - + 2 MiB: less TLB pressure, fewer faults
  - Break even: 70 4 KiB pages
- Transparent Huge Pages [15, 17]: Why not both?
  - Idea: Start with 4 KiB and upgrade to 2 MiB later on.
  - First fault: reserve 2 MiB but map only 4 KiB
  - Individual faults up to a threshold (e.g., <50 %)
  - Upgrade to 2 MiB mapping
- Linux: struct page array for 2 MiB mappings
  - 512 × struct page (64 B)  $\stackrel{\wedge}{=}$  32 KiB  $\stackrel{\wedge}{=}$  8 frames 😱
  - Idea: Map the first frame 7 more times
  - Save 28 KiB per 2 MiB mapping (1.36% of all DRAM!)

[Figure: a 2 MiB (aligned) huge page in virtual memory (VM) and its frames in physical memory (PM)]

Problem: This is a Memory-Scarce Design!
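The reserve-then-upgrade policy can be sketched as a toy model of a single 2 MiB region. The 50 % threshold is the example value from the text, not a Linux default, and the class `HugeRegion` is hypothetical; a real kernel additionally has to handle alignment, reclaim, and page-table updates.

```python
PAGES_PER_HUGE = 512      # 2 MiB / 4 KiB
PROMOTE_THRESHOLD = 0.5   # example value from the text, not a Linux default


class HugeRegion:
    """One 2 MiB-aligned virtual region backed by a reserved 2 MiB frame."""

    def __init__(self):
        self.reserved = False  # 2 MiB frame reserved on the first fault
        self.mapped = set()    # indices of individually mapped 4 KiB pages
        self.huge = False      # True once upgraded to one 2 MiB mapping

    def fault(self, page_index: int) -> None:
        """Handle a page fault on one 4 KiB page of this region."""
        if self.huge:
            return             # the whole region is already mapped
        self.reserved = True   # first fault reserves the 2 MiB frame
        self.mapped.add(page_index)
        if len(self.mapped) >= PROMOTE_THRESHOLD * PAGES_PER_HUGE:
            self.huge = True   # one 2 MiB mapping replaces the 4 KiB ones
            self.mapped = set(range(PAGES_PER_HUGE))


region = HugeRegion()
for i in range(255):
    region.fault(i)
below_threshold = region.huge  # still individual 4 KiB mappings
region.fault(255)
promoted = region.huge         # 256 of 512 pages touched: upgraded
```

Because the 2 MiB frame is reserved on the first fault, the later upgrade does not need to copy or compact anything; the cost is memory that stays reserved even if the threshold is never reached.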



### Problem 5: Software must fix Broken Hardware

- TLB: The Last Non-Coherent Cache
  - Each CPU caches the slow page-table walk
  - Huge impact (5 levels): 600 ns vs 1 ns
  - The OS must invalidate entries on remote cores (TLB shootdown)! Optimized variant [2]: 3400 to 4300 cycles
- Principle: Shootdown should be a rare event
  - Batching: Combine multiple independent shootdowns
  - Semantics: Avoid shootdowns by weakening guarantees
  - Both are problematic with existing software
  - Both are hard to implement correctly

Problem: Fixing Hardware in Software



### Hardware Development: Secondary Storage



| Medium                           | Capacity | Sequential Read | 4K IOPS   | € per TB |
|----------------------------------|----------|-----------------|-----------|----------|
| Seagate Savvio 15K.2 (HDD, 2009) | 146 GB   | 160 MB/s        | 204       | 2 000 €  |
| Seagate Exos 2x14 (HDD, 2021)    | 14 TB    | 524 MB/s        | 304       | 27 €     |
| Intel X25-E (SSD, 2009)          | 32 GB    | 250 MB/s        | 35 000    | 21 800 € |
| Samsung PM1735 (SSD, 2019)       | 12.8 TB  | 8000 MB/s       | 1 500 000 | 340 €    |

#### SSDs will replace HDDs

- SSDs are large and cheap (enough).
- Small penalty for random access (PM1735: 6 GiB/s)
- Multi-million IOPS if queues are deep enough

### Problem 6: Designed for Slow and Sequential I/O

Problem: Designed for Slow I/O

- "Nothing matters if you have to query the disk."
- A page fault provokes only one small disk read.

[Figure: I/O Performance on Single Intel® Xeon]







# 6.3 ParPerOS – Contention-Avoiding Design

### Linux: Virtual Memory Page Allocation

Linux 5.16, AMD EPYC 7713 processor (64 cores, 128 hardware threads), 512 GB RAM

- Benchmark: Alloc/Free 4 KiB Pages Randomly
  - Random I/O requires random VM operations
  - Allocate page via page fault
  - Free page via MADV\_DONTNEED or munmap()
- munmap(2)
  - Modifies the **global** memory-object list.
  - Memory objects are split and merged.
- madvise(2)
  - Modifies only the page tables.
  - One TLB shootdown per eviction!
- process\_madvise(2)
  - Vectorized madvise(2)
  - One TLB shootdown per 512 pages.

**Problem:** Designed for Slow I/O
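The effect of vectorizing the free operation can be estimated with a simple count. This sketch only models the number of TLB shootdowns as stated above (one per eviction versus one per 512 pages); it does not call the system calls, and the function names are hypothetical.

```python
import math

BATCH = 512  # pages covered by one shootdown in the vectorized case


def shootdowns_per_page(pages_freed: int) -> int:
    """One shootdown for every individually freed page."""
    return pages_freed


def shootdowns_batched(pages_freed: int, batch: int = BATCH) -> int:
    """One shootdown per batch of up to `batch` freed pages."""
    return math.ceil(pages_freed / batch)


freed = 1_000_000
per_page = shootdowns_per_page(freed)
batched = shootdowns_batched(freed)
```

With a cost in the range of thousands of cycles per shootdown on every affected core, reducing their number by a factor of up to 512 directly raises the achievable free rate.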




#### Explicit File-Mapped I/O

- No page-faults, no automatic write-back
- Vectorized surface operations (alloc/free)
- Lock-free page-table modifications

- Forbid slow-I/O paths
- Vectorized operations
- Use CPU atomics



#### Process-Local Frame Pool

- Avoids zeroing without leak
- Lock-free global bundle list
- CPU-local bundles (513 frames)

- Memory-abundant design!
- Limited global communication
- Cache-friendly data structures



#### Memory-Mapped IO Vector


- Pre-mapped parameter vector
- Page number (52 bits), length (12 bits)
- Avoids copy\_from\_user() checks

- Re-use loaded cache lines
- Dense special-purpose encoding
- Memory as communication interface
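A dense 64-bit encoding of a page number and a length could look like the following sketch. The bit layout (page number in the upper 52 bits, length in the lower 12 bits) is an assumption made for illustration; the text only states the field widths, and `encode` and `decode` are hypothetical names.

```python
PAGE_BITS = 52
LEN_BITS = 12
LEN_MASK = (1 << LEN_BITS) - 1


def encode(page: int, length: int) -> int:
    """Pack a page number and a length into one 64-bit word."""
    if not 0 <= page < (1 << PAGE_BITS):
        raise ValueError("page number does not fit into 52 bits")
    if not 0 <= length <= LEN_MASK:
        raise ValueError("length does not fit into 12 bits")
    return (page << LEN_BITS) | length


def decode(word: int) -> tuple[int, int]:
    """Unpack a 64-bit word into (page number, length)."""
    return word >> LEN_BITS, word & LEN_MASK


entry = encode(page=0x12345, length=8)
entries_per_cache_line = 64 // 8  # 8-byte entries in a 64-byte cache line
```

With 8-byte entries, one loaded cache line carries eight requests, which is what makes the vector cheap to scan on both sides of the interface.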



#### **Exported Page Tables**

- Read-only mapping
- In-core information
- Cache line is also used by MMU

- Expose hardware specifics
- Controlled isolation violations
- Re-use loaded cache lines
### ExMap: Virtual Memory Allocation Performance





- ExMap: 100M to 200M random allocations per second
- Linux: 5M random allocations per second (with fixes)

### Challenge: Contention in the Memory Subsystem

$\hookrightarrow$ Crucial to **avoid contention** from the very beginning!

| Guideline | References |
|-----------|------------|
| **Do not use locks. Use atomics.** <ul> <li>CAS, FAA, LL/SC, ...</li> <li>This has <i>a lot</i> of implications on data structures.</li> <li>And even more on NVM.</li> </ul> | [5, 10, 13, 14, 21] |
| **Respect your hardware. Especially the cache.** <ul> <li>Well known, but still ignored.</li> <li>Performance is dominated by the number <i>n</i> of cache lines accessed: <i>cla</i> = 1 ... <i>n</i></li> <li>And even more, if cache lines are shared!</li> </ul> | [3, 10, 13, 14, 21], [8, 12, 22] |
| **Avoid true and false sharing. Partition your resources.** <ul> <li>Cache thrashing is a major bottleneck.</li> <li>Global resource pools require locks.</li> </ul> | [4] |
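To make the *cla* metric and false sharing more tangible, here is a small sketch that counts the distinct cache lines touched by a set of byte addresses. The 64-byte line size and all function names are assumptions for illustration; the code models only the address arithmetic, not real cache behaviour.

```python
# Sketch: count cache lines accessed (cla) and detect line sharing.
CACHE_LINE = 64  # bytes; assumed line size


def cla(addresses, line_size=CACHE_LINE):
    """Number of distinct cache lines touched by the given byte addresses."""
    return len({addr // line_size for addr in addresses})


def share_line(addr_a, addr_b, line_size=CACHE_LINE):
    """True if both addresses fall into the same cache line."""
    return addr_a // line_size == addr_b // line_size


def per_cpu_offsets(n_cpus, item_size, padded):
    """Byte offsets of per-CPU items, either packed or padded to full lines."""
    if padded:
        # Round the item size up to a multiple of the cache-line size.
        stride = -(-item_size // CACHE_LINE) * CACHE_LINE
    else:
        stride = item_size
    return [cpu * stride for cpu in range(n_cpus)]
```

Four packed 8-byte per-CPU counters all land in one line, so every update by one CPU invalidates the line for the others (false sharing). Padding each counter to its own line avoids this at the price of more memory.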

### LLFree – A Fast and Optionally Persistent Frame Allocator






- Goal: Efficient management of physical memory.
  - De/allocation of normal (4 KiB) and huge (2 MiB) frames.
- Goal: Optional crash consistency on NVRAM.
  - Allocation state survives sudden power loss.





**Cache-friendly design:** 512 normal frames are managed within a single cache line.

- 4 KiB alloc: find entry with $c_L > 0$, decrement $c_L$ and $c_U$, find first 0-bit in entry, set it to 1 $\rightsquigarrow$ very fast, there is a 0-bit (*cla* = 3)
- 2 MiB alloc: find entry with 512 free frames, set $c_L = 0$ and $a = 1$, decrement $c_U$ $\rightsquigarrow$ ignore bit field (*cla* = 2)
- Children entry: 10 + 1 bit (2 B aligned); 32 entries fit into one cache line ($32 \cdot 2\,\text{B} = 64\,\text{B}$)
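The two allocation paths described above can be sketched as a purely sequential model. This is not the LLFree implementation: the real allocator performs these steps with atomic operations on individual cache lines, whereas the classes `Child` and `Tree` and the decision to decrement $c_U$ by 512 for a huge frame are assumptions made for illustration.

```python
# Sequential model of the two-level allocation (no atomics, illustration only).
FRAMES_PER_CHILD = 512
CHILDREN_PER_TREE = 32
FULL_MASK = (1 << FRAMES_PER_CHILD) - 1


class Child:
    def __init__(self):
        self.c_l = FRAMES_PER_CHILD  # free 4 KiB frames in this entry
        self.a = 0                   # 1: entry is allocated as a huge frame
        self.bits = 0                # 512-bit field, bit i = 1: frame i in use


class Tree:
    def __init__(self):
        self.c_u = FRAMES_PER_CHILD * CHILDREN_PER_TREE  # free frames in tree
        self.children = [Child() for _ in range(CHILDREN_PER_TREE)]


def alloc_4k(tree):
    """Find an entry with c_L > 0, decrement counters, set the first 0-bit."""
    for idx, child in enumerate(tree.children):
        if child.c_l > 0:
            child.c_l -= 1
            tree.c_u -= 1
            free = ~child.bits & FULL_MASK           # 1-bits mark free frames
            bit = (free & -free).bit_length() - 1    # index of lowest free bit
            child.bits |= 1 << bit
            return idx * FRAMES_PER_CHILD + bit
    return None


def alloc_2m(tree):
    """Find a completely free entry, mark it huge, ignore its bit field."""
    for idx, child in enumerate(tree.children):
        if child.a == 0 and child.c_l == FRAMES_PER_CHILD:
            child.c_l = 0
            child.a = 1
            tree.c_u -= FRAMES_PER_CHILD  # assumption: c_U counts 4 KiB frames
            return idx * FRAMES_PER_CHILD
    return None
```

Because $c_L > 0$ guarantees a 0-bit in the bit field, the 4 KiB path never has to scan a second bit field after reserving a frame in the counter.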



- Avoid false sharing: per-CPU partitioning into 64 MiB chunks (trees).
- Lower level: no contention on the cache line managing the children array entries (32 fit into one cache line).
- Upper level: contention on the cache line managing the trees array entries (32 fit into one cache line) $\rightsquigarrow$ split counter to maintain the free-frame count mostly locally ($c_P + c_U \le 512 \cdot 32 = 16384$).
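As a quick plausibility check of the numbers above (assuming 4 KiB normal frames, as stated for LLFree), the tree size follows directly from the number of children entries per cache line:

$$
32 \cdot 512 = 16384 \ \text{frames}, \qquad 16384 \cdot 4\,\text{KiB} = 65536\,\text{KiB} = 64\,\text{MiB}
$$

One tree therefore covers exactly the 32 children entries of one cache line, and the bound on the split counter equals the number of frames in a tree.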



Figure: Linux frame allocation. Setup: Linux 6.0 on Xeon(R) Gold 5320, $2 \times 26$ physical cores @ 2.20 GHz, 256/512 GiB DRAM/NVRAM per node.



### Visibility $\neq$ Persistency

Thread Consistency vs. Crash Consistency

Figure: path of a store (MOV) from the core via L1, L2, L3 and the WPQ to NVM, with the ADR and eADR domains marked.

- A change becomes **visible** to other cores when it reaches the L1 cache.
  - We can order multiple changes by memory barriers.
  - All our multi-core algorithms rely on this!
- A change becomes **persistent** on NVRAM when ... it depends:
- $\hookrightarrow$ eADR: Change has reached the L1 $\mapsto$ Visibility = Persistency
  - Great concept! ... that unfortunately did not make it to market
- $\hookrightarrow$ ADR: Changed cache line has *eventually* reached the WPQ
  - Ensuring durability requires expensive explicit flushes
  - Truly awful programming model, especially on multi-core!
- $\hookrightarrow$ CXL: We don't know yet, but most probably like ADR
- General: Assume a *persist granularity* of a single cache line [6, 19]

#### Problem: Fixing Hardware in Software














**Single cache-line rule:** Exactly one cache line (selected by the *a*-flag) is the *authoritative truth*.

- **1:** Entry is allocated as huge frame $\rightsquigarrow$ **child entry** defines the *truth*
- **0:** Entry is free/allocated as normal frames $\rightsquigarrow$ bits in **bit field** define the *truth*

$\hookrightarrow$ Works with the minimal *persist granularity* offered by any NVM implementation.





- **Upper Level** information is simply recreated at boot time.
- $\hookrightarrow$ Crash-consistent page frame allocation and deallocation for normal and huge frames!
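How the counters could be recreated at boot time from the authoritative cache lines can be sketched as follows. This is a model under assumptions, not the LLFree recovery code: the persisted state is represented as a list of hypothetical `(a, bits)` pairs, and it is assumed that the per-entry counter $c_L$ is recomputed together with the upper-level counter $c_U$.

```python
# Sketch: rebuild free-frame counters after a crash from (a, bits) pairs.
FRAMES_PER_CHILD = 512
FULL_MASK = (1 << FRAMES_PER_CHILD) - 1


def recover_child(a, bits):
    """Free 4 KiB frames of one entry, following the single cache-line rule."""
    if a == 1:
        # Huge frame: the child entry is the truth, the bit field is ignored.
        return 0
    # Normal frames: the bit field is the truth, count its 0-bits.
    used = bin(bits & FULL_MASK).count("1")
    return FRAMES_PER_CHILD - used


def recover_tree(children):
    """Return the recomputed per-entry counters and their sum (c_U)."""
    c_l = [recover_child(a, bits) for a, bits in children]
    return c_l, sum(c_l)
```

Since each decision reads exactly one authoritative cache line per entry, the result does not depend on which of the other cache lines happened to reach NVRAM before the crash.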





# 6.4 Summary and Conclusion

### Summary: OS Challenges for Modern Memory Systems

- In the end, everything has become a memory problem!
  - Thread-level parallelism $\rightsquigarrow$ memory placement.
  - I/O throughput $\rightsquigarrow$ memory allocation.
  - Contention $\rightsquigarrow$ memory interaction.
- Hardware advances (over 30 years) are **uneven** and will continue to be!
  - RAM: 25 000x larger, L1: 250x larger, TLB: 25x larger
  - RAM: 500–1 000x higher throughput, 2x lower latency
  - I/O: 5 000–90 000x higher throughput.
  - NVRAM: It's a thing, but SSDs are still 5–10x cheaper.
- OS memory management is still dominated by the "Mach view".
  - RAM is scarce. Share it.
  - Memory is an implicit resource. Demand paging for everything.
  - I/O is slow. Other overheads are negligible.

#### $\hookrightarrow$ Lots of things to do!

### Conclusion: Problems and Principles for Memory Management

#### Problems

- The Cost of Sharing
- External Fragmentation is back
- The Hierarchy is a Network
- Designed for Scarcity, not Latency
- Software must fix Broken Hardware
- Designed for Slow and Sequential I/O

#### Principles

- Explicit and Non-Shared Semantics
- Hardware-Specific Granularities
- Constructive Contention Avoidance
- Memory Scarcity is the Exception
- Mitigate Hardware Problems (for Now)
- Parallel and Asynchronous I/O




# 6.5 References



- [1] Mike Accetta, Robert Baron, David Golub et al. "MACH: A New Kernel Foundation for UNIX Development". In: Proceedings of the USENIX Summer Conference. USENIX Association, June 1986, pp. 93–113.
- [2] Nadav Amit, Amy Tai and Michael Wei. "Don't shoot down TLB shootdowns!" In: Proceedings of the Fifteenth European Conference on Computer Systems. 2020, pp. 1–14.
- [3] Andrew Baumann, Simon Peter, Adrian Schüpbach et al. "Your computer is already a distributed system. Why isn't your OS?" In: HotOS. 2009.
- [4] Silas Boyd-Wickizer, Haibo Chen, Rong Chen et al. "Corey: An Operating System for Many Cores". In: 8th Symposium on Operating System Design and Implementation (OSDI '08) (San Diego, CA, USA). Berkeley, CA, USA: USENIX Association, 2008, pp. 43–57.
- [5] Zhangyu Chen, Yu Hua, Bo Ding et al. "Lock-free Concurrent Level Hashing for Persistent Memory". In: 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, July 2020, pp. 799–812. ISBN: 978-1-939133-14-4. URL: https://www.usenix.org/conference/atc20/presentation/chen.
- [6] Jeremy Condit, Edmund B. Nightingale, Christopher Frost et al. "Better I/O Through Byte-addressable, Persistent Memory". In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). Big Sky, Montana, USA: ACM, 2009, pp. 133–146. ISBN: 978-1-60558-752-3. DOI: 10.1145/1629575.1629589.
- [7] Dave Dice and Alex Kogan. "Compact NUMA-aware locks". In: Proceedings of the Fourteenth EuroSys Conference 2019. 2019, pp. 1–15.





- [8] Dawson R. Engler, M. Frans Kaashoek and James O'Toole. "Exokernel: An Operating System Architecture for Application-Level Resource Management". In: Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP '95) (Copper Mountain, CO, USA). New York, NY, USA: ACM Press, Dec. 1995, pp. 251–266. ISBN: 0-89791-715-4. DOI: 10.1145/224057.224076.
- [9] John Fotheringham. "Dynamic Storage Allocation in the Atlas Computer, Including an Automatic Use of a Backing Store". In: Communications of the ACM 4.10 (Oct. 1961), pp. 435–436.
- [10] Michal Friedman, Maurice Herlihy, Virendra Marathe et al. "A persistent lock-free queue for non-volatile memory". In: ACM SIGPLAN Notices 53.1 (2018), pp. 28–40.
- [11] Robert A. Gingell, J. Moran and William Shannon. "Virtual Memory Architecture in SunOS". In: Proceedings of Summer '87 USENIX Conference. June 1987.
- [12] Hermann Härtig, Michael Hohmuth, Jochen Liedtke et al. "The Performance of μ-Kernel-Based Systems". In: Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP '97). New York, NY, USA: ACM Press, Oct. 1997. DOI: 10.1145/269005.266660.
- [13] Joseph Izraelevitz, Hammurabi Mendes and Michael L. Scott. "Linearizability of persistent memory objects under a full-system-crash failure model". In: International Symposium on Distributed Computing. Springer. 2016, pp. 313–327.
- [14] Kunal Korgaonkar, Joseph Izraelevitz, Jishen Zhao et al. "Vorpal: Vector clock ordering for large persistent memory systems". In: Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing. 2019, pp. 435–444.



- [15] Youngjin Kwon, Hangchen Yu, Simon Peter et al. "Coordinated and Efficient Huge Page Management with Ingens". In: 12th Symposium on Operating Systems Design and Implementation (OSDI '16). Savannah, GA, USA: USENIX Association, 2016, pp. 705–721. ISBN: 9781931971331.
- [16] Viktor Leis, Adnan Alhomssi, Tobias Ziegler et al. "Virtual-Memory Assisted Buffer Management". In: Proceedings of the ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD'23). Accepted at SIGMOD'23, to appear. Seattle, WA, USA: ACM, June 2023.
- [17] Juan Navarro, Sitaram Iyer and Alan Cox. "Practical, Transparent Operating System Support for Superpages". In: 5th Symposium on Operating Systems Design and Implementation (OSDI '02). Boston, MA: USENIX Association, Dec. 2002.
- [18] Elliot I. Organick. The Multics System: An Examination of its Structure. MIT Press, 1972. ISBN: 0-262-15012-3.
- [19] Steven Pelley, Peter M. Chen and Thomas F. Wenisch. "Memory Persistency". In: Proceeding of the 41st Annual International Symposium on Computer Architecture (ISCA '14). Minneapolis, Minnesota, USA: IEEE Press, 2014, pp. 265–276. ISBN: 9781479943944.
- [20] Markus Velten, Robert Schöne, Thomas Ilsche et al. "Memory Performance of AMD EPYC Rome and Intel Cascade Lake SP Server Processors". In: Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering. 2022, pp. 165–175.
- [21] Tianzheng Wang, Justin Levandoski and Per-Ake Larson. "Easy Lock-Free Indexing in Non-Volatile Memory". In: 2018 IEEE 34th International Conference on Data Engineering (ICDE). 2018, pp. 461–472. DOI: 10.1109/ICDE.2018.00049.






- [22] David Wentzlaff and Anant Agarwal. "Factored operating systems (fos): the case for a scalable operating system for multicores". In: ACM SIGOPS Operating Systems Review 43.2 (Apr. 2009), pp. 76–85. ISSN: 0163-5980. DOI: 10.1145/1531793.1531805.
- [23] Jian Yang, Juno Kim, Morteza Hoseinzadeh et al. "An Empirical Guide to the Behavior and Use of Scalable Persistent Memory". In: 18th USENIX Conference on File and Storage Technologies (FAST 20). Santa Clara, CA: USENIX Association, Feb. 2020, pp. 169–182. ISBN: 978-1-939133-12-0. URL: https://www.usenix.org/conference/fast20/presentation/yang.