



# First Things First: A Discussion of Modelling Approaches for Disruptive Memory Technologies

Herbsttreffen der Fachgruppe Betriebssysteme (FGBS'21), Trondheim (Online), 20.09.21

**Michael Müller**, Daniel Kessener, Olaf Spinczyk





























#### How could any system software efficiently manage these resources?








#### An adequate system-level model



- Represents the overall hardware topology including DMTs
- Provides information about communication costs
- Enables **updates** to the model **at runtime**
- Has a **whole system view** including **applications** and system services, i.e., an **application model**
- Provides **performance predictions** and **optimized resource mappings**
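To make these requirements more concrete, the following minimal sketch models the hardware topology as a weighted graph, stores communication costs on the edges, allows cost updates at runtime, and derives a simple mapping decision from them. It is only an illustration under assumptions: the names `SystemModel`, `add_link`, `update_cost`, and `cheapest_memory`, as well as all cost values, are hypothetical and not taken from the talk, and the application model is left out entirely.

```python
import heapq
from collections import defaultdict


class SystemModel:
    """Toy topology model: nodes are components, edge weights are costs."""

    def __init__(self):
        self.links = defaultdict(dict)  # node -> {neighbour: cost}
        self.memories = set()           # nodes that represent memory devices

    def add_link(self, a, b, cost):
        """Connect two components with a symmetric communication cost."""
        self.links[a][b] = cost
        self.links[b][a] = cost

    def add_memory(self, name):
        self.memories.add(name)

    def update_cost(self, a, b, cost):
        """Runtime update of an existing link, e.g. after a new measurement."""
        if b not in self.links[a]:
            raise KeyError(f"no link between {a} and {b}")
        self.add_link(a, b, cost)

    def cost(self, src, dst):
        """Cheapest path cost between two components (Dijkstra)."""
        dist = {src: 0.0}
        queue = [(0.0, src)]
        while queue:
            d, node = heapq.heappop(queue)
            if node == dst:
                return d
            if d > dist.get(node, float("inf")):
                continue  # stale queue entry
            for nxt, weight in self.links[node].items():
                nd = d + weight
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    heapq.heappush(queue, (nd, nxt))
        return float("inf")

    def cheapest_memory(self, core):
        """Mapping decision: the memory device with the lowest cost from core."""
        return min(sorted(self.memories), key=lambda m: self.cost(core, m))


# Hypothetical example system; the costs are made-up numbers.
model = SystemModel()
for mem, cost in [("dram0", 90.0), ("hbm0", 120.0), ("nvram0", 300.0)]:
    model.add_memory(mem)
    model.add_link("core0", mem, cost)

first_choice = model.cheapest_memory("core0")
model.update_cost("core0", "hbm0", 60.0)  # model is refined at runtime
second_choice = model.cheapest_memory("core0")
print(first_choice, second_choice)
```

The graph representation is a design choice of this sketch; it keeps topology, costs, and mapping decisions in one structure so that a runtime update immediately affects later decisions.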






- Comprehensive work on "best practices" and guidelines for performance optimization
- Analyses of using RDMA with NVM and in NUMA systems
- Management model for scheduling RDMA transfers








[Figure: RECV optimizations. Header-only RECV: the CQE contains application data from the SEND header, which saves a PCIe transaction. Inline RECV: the CQE contains the payload, which saves a PCIe transaction.]

[Kalia et al., 2016]




[MacArthur and Russell '12, Nelson and Palmieri '19]








[Figure: (a) average one-way time and (b) average throughput with each opcode set for small messages using one buffer; (c) average one-way time and (d) average throughput with RDMA WRITE for large messages using multiple buffers with multiple work requests per posting.]






[1] Hsieh, K., Ebrahim, E., Kim, G., Chatterjee, N., O'Connor, M., Vijaykumar, N., Mutlu, O. and Keckler, S.W. 2016. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. *ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)* (Jun. 2016), 204–216.

[2] Khan, K., Pasricha, S. and Kim, R.G. 2020. A Survey of Resource Management for Processing-In-Memory and Near-Memory Processing Architectures. *Journal of Low Power Electronics and Applications*. 10, 4 (Dec. 2020), 30. DOI: https://doi.org/10.3390/jlpea10040030.

[3] Mutlu, O., Ghose, S., Gómez-Luna, J. and Ausavarungnirun, R. 2020. A Modern Primer on Processing in Memory. *CoRR*. abs/2012.03112 (2020).



- Lots of work on determining offload-ability of instructions
- Frameworks for estimating performance and energy usage for specific applications










Singh, G., Gómez-Luna, J., Mariani, G., Oliveira, G.F., Corda, S., Stuijk, S., Mutlu, O. and Corporaal, H. NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning. *DAC* 2019.



Xiao, Y., Nazarian, S. and Bogdan, P. 2018. Prometheus: Processing-in-memory heterogeneous architecture design from a multi-layer network theoretic strategy. *DATE* 2018.






Corda, S., Kumaraswamy, M., Awan, A.J., Jordans, R., Kumar, A. and Corporaal, H. NMPO: Near-Memory Computing Profiling and Offloading. *Euromicro DSD/SEAA 2021*.








- Simulator for NVRAM architectures<sup>1)</sup>
- Application-specific cost model for data stream processing<sup>2)</sup>
- Performance studies [Izraelevitz et al. '19]
- Programming models [Scargall '20, George '20, Köppen '19]

1) Wang, Z. et al. Characterizing and Modelling Non-Volatile Memory Systems. *MICRO* 2020.

2) Pohl, C. and Sattler, K.-U. A Cost Model for Data Stream Processing on Modern Hardware. *ADMS* 2017.







#### Management strategies and measurements for HBM



#### How to Manage High-Bandwidth Memory Automatically

Rathish Das (Stony Brook University), Kunal Agrawal (Washington University in St. Louis), Michael A. Bender (Stony Brook University), Jonathan Berry (Sandia National Laboratories), Benjamin Moseley (Carnegie Mellon University), Cynthia A. Phillips (Sandia National Laboratories)




- HBM as additional cache layer
- Replacement strategy for pages in HBM
- Priority-based strategy
- Performance metrics for HBM used
- Model only for HBM
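The combination of HBM as an additional cache layer with a priority-based replacement strategy for pages can be illustrated with a small sketch. It is a toy under stated assumptions and not the algorithm of the cited paper: the class `HbmCache`, the rule "evict the resident page with the lowest priority, breaking ties by least recent use", and the hit rate as performance metric are all illustrative choices.

```python
class HbmCache:
    """Toy model of HBM used as a page cache in front of DRAM."""

    def __init__(self, capacity_pages):
        if capacity_pages <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity_pages
        self.pages = {}   # page id -> (priority, time of last use)
        self.clock = 0    # logical time, advanced on every access
        self.hits = 0
        self.misses = 0

    def access(self, page, priority):
        """Access a page; returns True on an HBM hit, False on a miss."""
        self.clock += 1
        hit = page in self.pages
        if hit:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                # Victim: lowest priority first, then least recently used.
                victim = min(self.pages, key=lambda p: self.pages[p])
                del self.pages[victim]
        # The accessed page is always admitted (or refreshed) in this sketch.
        self.pages[page] = (priority, self.clock)
        return hit

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical access trace: (page, priority of the requesting task).
cache = HbmCache(capacity_pages=2)
for page, priority in [("a", 2), ("b", 1), ("c", 3), ("a", 2)]:
    cache.access(page, priority)
print(sorted(cache.pages), cache.hit_rate())
```

Note that this sketch always admits the accessed page, even if its priority is lower than that of every resident page; a bypass rule would be an equally plausible variant.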






#### Object Placement for High Bandwidth Memory Augmented with High Capacity Memory

Mohammad Laghari, Didem Unat

- Automatic object placement in hybrid DRAM/HBM systems
- Heuristic based on access frequency for objects
- No performance model
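A placement heuristic based on access frequency, as named in the list above, could look like the following greedy sketch. It is not the algorithm of the cited paper: the function `place_objects`, the ranking by accesses per byte, and the example objects are assumptions made for illustration. Like the approach summarised above, it makes no performance prediction.

```python
def place_objects(objects, hbm_capacity):
    """Greedy placement of objects into HBM or DRAM.

    objects: dict mapping name -> (size_bytes, access_count)
    hbm_capacity: available HBM in bytes
    Returns a dict mapping name -> "HBM" or "DRAM".
    """
    for name, (size, _) in objects.items():
        if size <= 0:
            raise ValueError(f"object {name!r} must have a positive size")

    # Rank by access density (accesses per byte), hottest first;
    # the name is used as a deterministic tie-breaker.
    ranked = sorted(objects, key=lambda n: (-objects[n][1] / objects[n][0], n))

    placement = {}
    free = hbm_capacity
    for name in ranked:
        size, _ = objects[name]
        if size <= free:
            placement[name] = "HBM"
            free -= size
        else:
            placement[name] = "DRAM"  # does not fit, stays in slower memory
    return placement


# Hypothetical objects: name -> (size in bytes, number of accesses).
example = {"matrix": (400, 8000), "index": (100, 5000), "log": (300, 300)}
result = place_objects(example, hbm_capacity=500)
print(result)
```

Ranking by accesses per byte rather than by raw access count is a choice of this sketch; it favours small, hot objects when HBM capacity is scarce.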



Fig. 5: Speedup of our placement configuration achieved over placing all objects in the slow memory





Chatzopoulos, G., Guerraoui, R., Harris, T. and Trigonakis, V. 2017. Abstracting Multi-Core Topologies with MCTOP. *Proceedings of the Twelfth European Conference on Computer Systems* (Belgrade, Serbia, Apr. 2017), 544–559.










- Performance predictions
- Optimized mapping decisions
















Schüpbach, A.L. 2012. Tackling OS Complexity with Declarative Techniques. ETH Zurich.












[Figure: memory hierarchy with cache, DRAM, and disks]





- `apic(ACPI_ProcessorID, APICID, Availability). % 1 = yes, 0 = no`
- `bridge(pcie|pci, addr(Bus, Dev, Fun), VendorID, DeviceID, Class, SubClass, ProgIf, secondary(Sec)).`
- `device(pcie|pci, addr(Bus, Dev, Fun), VendorID, DeviceID, Class, SubClass, ProgIf, IntPin).`
- `interrupt_override(Bus, SourceIRQ, GlobalIRQ, IntiFlags).`
- `rootbridge_address_window(addr(Bus, Dev, Fun), mem(Min, Max)).`
- `bar(addr(Bus, Dev, Fun), BARNr, Base, Size, mem|io, (non)prefetchable, Bits (64|32)).`
- `fixed_memory(Base, Limit).`
- `apic_nmi(ACPI_ProcessorID, IntiFlags, Lint).`
- `memory_region(Base, SzBits, SzBytes, RegionType, Data).`
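How a declarative fact base of this kind can answer resource questions may be easier to see with a small sketch. The version below mirrors only the `memory_region(Base, SzBits, SzBytes, RegionType, Data)` schema as plain Python tuples; it does not use Prolog or any API of the cited system. The concrete addresses, the region type names `"ram"` and `"device"`, the reading of `SzBits` as the base-2 logarithm of `SzBytes`, and the helper functions are hypothetical.

```python
from collections import namedtuple

# One record per memory_region fact; field names follow the schema above.
MemoryRegion = namedtuple(
    "MemoryRegion", ["base", "sz_bits", "sz_bytes", "region_type", "data"]
)

# Hypothetical facts; a real system would derive them from hardware discovery.
FACTS = [
    MemoryRegion(0x0000_0000, 30, 1 << 30, "ram", None),
    MemoryRegion(0x4000_0000, 28, 1 << 28, "device", None),
    MemoryRegion(0x1_0000_0000, 31, 1 << 31, "ram", None),
]


def regions_of_type(facts, region_type):
    """Query: all regions whose RegionType matches."""
    return [f for f in facts if f.region_type == region_type]


def total_bytes(facts, region_type):
    """Query: sum of SzBytes over all regions of the given type."""
    return sum(f.sz_bytes for f in regions_of_type(facts, region_type))


print(len(regions_of_type(FACTS, "ram")), total_bytes(FACTS, "ram"))
```

In a logic-programming setting the same questions would be expressed as queries over the facts instead of list comprehensions; the sketch only shows the shape of the data and of the questions asked.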








- Includes communication costs
- Performance predictions
- Optimized mapping decisions
- Updates at runtime
- Whole system view












Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S. and Namyst, R. 2010. hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications. *18th Euromicro Conference on Parallel, Distributed and Network-based Processing* (Feb. 2010), 180–186.





[Figure: memory hierarchy with cache, HBM, DRAM, NVRAM, and disks]













































# Conclusion

- Increasing complexity of memory hierarchies calls for sophisticated system models
- Only device-specific models and performance studies exist for DMTs
- Most existing system models do not model DMTs, and the ones that do are insufficient
- → Research on holistic models for systems with DMTs is needed