The goal of the AMD and Micron collaboration is to deliver best-in-class user experiences across client and data center platforms. To that end, the two companies have a joint server lab in Austin, where they work to reduce the time needed to validate server memory and perform joint workload testing throughout validation and launch. In this blog, we look at some common HPC-workload benchmark results that use Micron DDR5 data center memory with 4th Gen AMD EPYC™ Processors, both of which are shipping now.
High-performance computing (HPC) workloads have historically been the domain of some of the world’s fastest supercomputers. These are often large-scale, data-intensive workloads split into millions of operations that are run in parallel and use terabytes of data. These complex workloads are dedicated to solving some of humankind’s most challenging problems — weather and climate simulations; seismic modeling; chemical, physics and biological analysis; and more.
With advances in computer architectures, these workloads have increasingly been hosted in very large “scale-out” clusters of high-performance servers. These clusters require the latest and greatest compute, fabric, memory and storage infrastructure to address the scalability, low latency and performance needs of such critical workloads. While server CPUs have grown in performance and throughput, the past several years have seen the bandwidth provided by DDR4 memory become a bottleneck. There is just not enough memory bandwidth to supply the growing number of high-performance cores.
Micron DDR5 memory and the new AMD Zen 4 server architectures featuring 4th Gen AMD EPYC Processors change that. Now, server CPUs and memory can be in much better balance to unlock performance and efficiency for the most demanding workloads. DDR5 memory helps organizations reach insights faster, whether on premises or in the cloud. Consider the following proof points, generated while testing Micron DDR5 with the latest AMD Zen 4 96-core CPU on industry-standard HPC workload benchmarks. Every test showed roughly a twofold or greater performance improvement.
Double the memory bandwidth with Micron DDR5 + 4th Gen AMD EPYC Processors using STREAM
STREAM¹ is a simple, well-known benchmark that measures the peak achievable memory bandwidth of HPC systems.
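As a rough illustration of how a STREAM-style kernel turns a timing into a bandwidth figure, the sketch below runs a single "triad" pass and counts the bytes it moves. It is not the STREAM.f source used in these tests; the function name `triad_bandwidth` and the array size are illustrative assumptions, and because the loop runs in the Python interpreter the number it prints is far below what the hardware can deliver.

```python
import time
from array import array


def triad_bandwidth(n: int, scalar: float = 3.0) -> float:
    """Run one triad pass, a[i] = b[i] + scalar * c[i], and return GB/s."""
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    a = array("d", [0.0]) * n

    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    elapsed = time.perf_counter() - start

    # Sanity check so the loop result is actually used.
    assert a[0] == 1.0 + scalar * 2.0 and a[-1] == a[0]

    # Triad touches three arrays per element: two reads and one write.
    bytes_moved = 3 * n * a.itemsize
    return bytes_moved / elapsed / 1e9


if __name__ == "__main__":
    # Illustrative size only; real runs use arrays far larger than the caches.
    print(f"triad: {triad_bandwidth(1_000_000):.3f} GB/s (interpreter-bound)")
```

The point of the sketch is the accounting: bandwidth is bytes moved divided by elapsed time, which is why the array must be much larger than the CPU caches for the result to reflect main memory.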
Software stack used for this workload
- Alma 9 Linux kernel 5.14
- STREAM.f 11-29-2021 release
Test setup
- DDR4 system: 3rd Gen AMD EPYC Processors with 64 cores at 3.7 GHz; the DDR4 3200 MHz system² is fully populated with 64GB RDIMMs
- DDR5 system: 4th Gen AMD EPYC Processors with 96 cores at 3.7 GHz; the DDR5 4800 MHz system³ is fully populated with 64GB RDIMMs
Test results
- Double the memory bandwidth: 378 GB/s on a single-socket DDR5 system, twice that of the DDR4 system
- This result means that customers can run larger artificial intelligence/machine learning (AI/ML) projects or do more HPC computations with increased memory bandwidth from DDR5.
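To put the 378 GB/s figure in context, the following sketch estimates the theoretical peak bandwidth of each platform. The channel counts (8 for the DDR4 system, 12 for the DDR5 system) and the 8-byte data width per channel are assumptions about these processors that are not stated in this blog, and the helper name `peak_bandwidth_gbs` is hypothetical.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: int, channels: int,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak = transfer rate x bytes per transfer x channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


# Assumed channel counts (not given in the blog): 8 for DDR4, 12 for DDR5.
ddr4_peak = peak_bandwidth_gbs(3200, channels=8)
ddr5_peak = peak_bandwidth_gbs(4800, channels=12)

measured_ddr5 = 378.0  # GB/s, the STREAM result quoted above

print(f"DDR4 theoretical peak: {ddr4_peak:.1f} GB/s")
print(f"DDR5 theoretical peak: {ddr5_peak:.1f} GB/s")
print(f"DDR5 STREAM result as a share of peak: {measured_ddr5 / ddr5_peak:.0%}")
```

Under these assumptions, the gain comes from two factors working together: a higher transfer rate per channel and more channels per socket.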
Weather research and forecasting (WRF)⁴ runs two times faster with Micron DDR5
This HPC code is widely used by the weather and climate community for meteorological applications. WRF typically performs well on traditional HPC architectures that support high floating-point processing, high memory bandwidth and a low-latency network. For this effort, we chose the Continental United States (CONUS) model at 2.5-km lateral resolution.
Software stack used for this workload
- Alma 9 Linux kernel 5.14
- WRF 2.3.5 & 4.3.3
- Open MPI v4.1.1
Test setup
- DDR4 system: 3rd Gen AMD EPYC Processors with 64 cores at 3.7 GHz; the DDR4 3200 MHz system² is fully populated with 64GB RDIMMs
- DDR5 system: 4th Gen AMD EPYC Processors with 96 cores at 3.7 GHz; the DDR5 4800 MHz system³ is fully populated with 64GB RDIMMs
Test results
- The system with Micron DDR5 and 4th Gen AMD EPYC Processors averaged 1.3567 seconds per time step, compared to 2.8533 seconds per time step on the DDR4 system, making it about 2.1 times faster.
- Faster execution time means weather forecasters can either choose bigger datasets or run more models. Both efforts lead to improved forecasts.
OpenFOAM⁵ with Micron DDR5 runs two times faster
OpenFOAM is an open-source HPC workload for computational fluid dynamics (CFD), used in a wide variety of industries to reduce development time and costs. It simulates physical interactions in applications ranging from consumer product design to aerospace design. One of the simulations included in the data set is a motorbike turbulence simulation. For this model, OpenFOAM calculates steady air flow around a motorcycle and rider. OpenFOAM load balances calculations according to the number of processes specified by the user, and then decomposes the mesh into parts for each process to solve. After the solve is complete, the mesh and solution are recomposed into a single domain.
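The decompose, solve and recompose flow described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the contiguous index split is not one of OpenFOAM's actual decomposition methods, `solve_part` is a hypothetical stand-in for the per-process solver, and the 20 x 8 x 8 mesh is far smaller than the one used in the tests.

```python
def decompose(n_cells: int, n_procs: int) -> list[range]:
    """Split cell indices into n_procs contiguous, near-equal parts."""
    base, extra = divmod(n_cells, n_procs)
    parts, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        parts.append(range(start, start + size))
        start += size
    return parts


def solve_part(cells: range) -> dict[int, float]:
    """Hypothetical per-process solve: returns one value per owned cell."""
    return {cell: cell * 0.5 for cell in cells}


def recompose(partials: list[dict[int, float]], n_cells: int) -> list[float]:
    """Merge per-process results back into a single field."""
    field = [0.0] * n_cells
    for partial in partials:
        for cell, value in partial.items():
            field[cell] = value
    return field


n_cells = 20 * 8 * 8   # toy mesh, not the benchmark mesh
n_procs = 96           # one process per core on the DDR5 system
parts = decompose(n_cells, n_procs)
field = recompose([solve_part(p) for p in parts], n_cells)
sizes = [len(p) for p in parts]
print(f"{n_procs} parts, {min(sizes)} to {max(sizes)} cells each")
```

In a real run each part is solved by a separate MPI process and neighboring parts exchange boundary data, which is where memory bandwidth and interconnect latency come into play.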
Software stack used for this workload
- OpenFOAM CFD Software (v8) with motorBike mesh size of 600 x 240 x 240
- Alma 9 Linux kernel 5.14
- Open MPI v4.1.1
Test setup
- DDR4 system: 3rd Gen AMD EPYC Processors with 64 cores at 3.7 GHz; the DDR4 3200 MHz system² is fully populated with 64GB RDIMMs
- DDR5 system: 4th Gen AMD EPYC Processors with 96 cores at 3.7 GHz; the DDR5 4800 MHz system³ is fully populated with 64GB RDIMMs
Test results
Our tests demonstrated a 2.4 times relative gain for OpenFOAM, which is regarded as one of the top five HPC software platforms and has a large open-source community. Widely used in universities and R&D centers, the software is highly parallel and takes advantage of both increased memory bandwidth and CPU features such as higher core density.
Molecular dynamics⁶ with Micron DDR5 runs two times faster
CP2K is an open-source quantum chemistry tool that can be used for a number of applications, including simulations of solid-state and biological systems. CP2K provides a general framework for different modeling methods such as density functional theory (DFT) using the mixed Gaussian and plane wave approaches GPW and GAPW. The example that we looked at was a linear-scaling DFT calculation of water (H2O) consisting of 6,144 atoms in a 39-cubic-angstrom box (2,048 water molecules in total).
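The atom and molecule counts quoted for this example are consistent with each other, as the short check below shows. The base cell of 32 water molecules replicated NREP times along each axis is an assumption about how the H2O-DFT-LS input is built, not something stated in this blog, and `system_size` is a hypothetical helper name.

```python
ATOMS_PER_WATER = 3          # two hydrogen atoms and one oxygen atom
BASE_CELL_MOLECULES = 32     # assumed size of the unreplicated water cell


def system_size(nrep: int) -> tuple[int, int]:
    """Return (molecules, atoms) for a cell replicated nrep times per axis."""
    molecules = BASE_CELL_MOLECULES * nrep ** 3
    return molecules, molecules * ATOMS_PER_WATER


molecules, atoms = system_size(4)  # the NREP4 case
print(f"NREP=4: {molecules} water molecules, {atoms} atoms")
```

Because the system size grows with the cube of NREP, small changes to that parameter change the memory footprint and runtime substantially.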
Software stack used for this workload
- H2O-DFT-LS.NREP4 & H2O-DFT-LS
- Alma 9 Linux kernel 5.14
Test setup
- DDR4 system: 3rd Gen AMD EPYC Processors with 64 cores at 3.7 GHz; the DDR4 3200 MHz system² is fully populated with 64GB RDIMMs
- DDR5 system: 4th Gen AMD EPYC Processors with 96 cores at 3.7 GHz; the DDR5 4800 MHz system³ is fully populated with 64GB RDIMMs
Test results
Our tests demonstrated a 2.1 times relative gain for molecular dynamics, a workload that scales well with more cores and more memory bandwidth.
Summary
The results above are just the start — and just a few examples of HPC workloads. The ability to better match high-performance, high-bandwidth memory with the incredible performance offered by new server processors such as the 4th Gen AMD EPYC Processors stands to be a watershed moment for HPC customers. We can expect to see many more such proof points that demonstrate how enterprise data center and cloud operators can use Micron DDR5 on these new platforms to unlock new levels of performance and efficiency. We look forward to sharing these with you in the coming months. To learn more about Micron DDR5 and data center workload benefits, visit Micron.com/ddr5.
1. Our STREAM benchmark setup used a vector size of 2.5 billion (see "STREAM Benchmark - AMD") and was run on a one-CPU system
2. AMD DDR4 system is an AMD EPYC 7763 64 core with DDR4-3200 MHz fully populated with 64GB RDIMMs
3. AMD DDR5 system is an AMD EPYC 9654 96 core with DDR5-4800 MHz fully populated with 64GB RDIMMs
4. WRF with a 12.5-km CONUS ran for 929 seconds on the DDR4 system and 287 seconds on the DDR5 system, including storage I/O. The example above is from a WRF 2.5-km CONUS run that took 2.8533 seconds per time step on the DDR4 system and 1.3567 seconds per time step on the DDR5 system.
5. For OpenFOAM, we ran three variations:
5a. 100 x 40 x 40 runtimes = 1,144 seconds on the DDR4 system and 478 seconds on the DDR5 system
5b. 108 x 46 x 46 runtimes = 1,633 seconds on the DDR4 system and 698 seconds on the DDR5 system
5c. 130 x 52 x 52 runtimes = 2,522 seconds on the DDR4 system and 1,091 seconds on the DDR5 system
6. Molecular dynamics workload ran for 2,519 seconds on the DDR4 system and for 1,242 seconds on the DDR5 system
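The runtimes listed in notes 4 through 6 can be turned into speedup ratios with a few lines of arithmetic. The sketch below uses only the figures given in those notes; the dictionary labels and the `speedup` helper are illustrative names.

```python
# (DDR4 seconds, DDR5 seconds), copied from notes 4, 5a-5c and 6.
RUNTIMES_S = {
    "WRF 12.5-km CONUS": (929, 287),
    "OpenFOAM 5a": (1144, 478),
    "OpenFOAM 5b": (1633, 698),
    "OpenFOAM 5c": (2522, 1091),
    "Molecular dynamics": (2519, 1242),
}


def speedup(ddr4_seconds: float, ddr5_seconds: float) -> float:
    """Speedup is the ratio of the slower runtime to the faster one."""
    return ddr4_seconds / ddr5_seconds


speedups = {name: speedup(*times) for name, times in RUNTIMES_S.items()}
for name, ratio in speedups.items():
    print(f"{name:<22} {ratio:.2f}x")
```

Since these are wall-clock runtimes, each ratio reflects the combined effect of the CPU generation, the core count and the memory subsystem rather than memory alone.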