What is a heterogeneous memory storage engine?

Micron Technology | June 2020

In case you haven’t already heard, Micron recently released its heterogeneous-memory storage engine (HSE) to the open source community. The design aims to get more performance out of storage class memory (SCM) and SSDs and to extend effective SSD lifespan by reducing write amplification, all while being deployed at massive scale. Compared with legacy storage engines, HSE often improves performance many times over on workloads such as the Yahoo! Cloud Serving Benchmark (YCSB).

What is a heterogeneous memory storage engine (HSE)?

Why heterogeneous? Micron has an extensive portfolio of DRAM, SCM and SSDs that gives us the insight and expertise to build a storage engine that intelligently manages data placement across disparate memory and storage media types. Unlike traditional storage engines that were written for hard disk drives, HSE was designed from the ground up to exploit the high throughput and low latency of SCM and SSDs.

Implementation

HSE uses the advantages of discrete media types to support two media classes for data storage: a “staging” media class and a “capacity” media class. The staging media class is typically configured on high-performance (IOPS and/or MB/s), low-latency, high-write-endurance media (for example, SCM or data center class SSDs with NVMe™). The capacity media class is typically configured on lower-cost, lower-write-endurance media (like quad-level cell [QLC] SSDs). Data intended for hot, short-term access is allocated to the staging media class, while cold, long-lived data is placed in the capacity media class. This enables HSE to achieve high throughput and low latency while also conserving write cycles on lower endurance media.

Configurable Durability Layer

The HSE durability layer is a user-configurable logical construct that resides on the staging media class. The durability layer provides user-definable data persistence in which the user specifies an upper bound on how many milliseconds of data may be lost in the event of a system failure, like power loss.

Data is initially ingested from DRAM into the durability layer. Storage is allocated from the faster staging media class to meet the low-latency, high-throughput requirements of the durability layer. Unlike a traditional write-ahead log (WAL), this durability layer avoids the “double write problem” common with classic journaling to significantly reduce write amplification.
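To make the "upper bound on milliseconds of data lost" idea concrete, here is a minimal sketch of a time-bounded durability buffer. It is a toy model, not HSE code: the class `DurabilityBuffer`, its methods and the `max_loss_ms` parameter are all hypothetical names, and the `persisted` list merely stands in for the staging media class.

```python
import time
from typing import Callable, List, Optional, Tuple


class DurabilityBuffer:
    """Toy model of a time-bounded durability layer (hypothetical, not HSE's API)."""

    def __init__(self, max_loss_ms: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_loss_s = max_loss_ms / 1000.0
        self.clock = clock
        self.pending: List[Tuple[bytes, bytes]] = []    # accepted, not yet persistent
        self.persisted: List[Tuple[bytes, bytes]] = []  # stands in for staging media
        self.oldest_pending_at: Optional[float] = None

    def put(self, key: bytes, value: bytes) -> None:
        """Accept a write into memory and sync if the time bound has been reached."""
        now = self.clock()
        if not self.pending:
            self.oldest_pending_at = now
        self.pending.append((key, value))
        self.maybe_sync(now)

    def maybe_sync(self, now: Optional[float] = None) -> bool:
        """Sync when the oldest pending write has waited max_loss_ms or longer.

        A real engine would presumably call this from a background timer too,
        so the bound holds even when no new writes arrive.
        """
        now = self.clock() if now is None else now
        if self.pending and now - self.oldest_pending_at >= self.max_loss_s:
            self.sync()
            return True
        return False

    def sync(self) -> None:
        """Move every pending write to the persistent side in one step."""
        self.persisted.extend(self.pending)
        self.pending.clear()
        self.oldest_pending_at = None
```

In this model, whatever is still in `pending` at the moment of a failure is lost, so `max_loss_ms` bounds how old that data can be as long as `maybe_sync` is driven regularly.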

Data Aging

As stored data ages, the data migrates through multiple layers of the system and is rewritten as part of garbage collection to optimize query performance (completion time). Here’s the high-level process:

When new data needs to be stored, it is first written to the durability layer.

As the data ages, it is rewritten to the capacity media class as a background maintenance operation.

As new data arrives, it may render existing data obsolete (by updating or deleting records that were previously written). Maintenance operations periodically scan existing data to enable space reclamation. If a large part of the data is now invalid or obsolete, these operations reclaim space by rewriting just the data that is still valid, which frees all the space the old data occupied (i.e., garbage collection). To service queries efficiently, valid data is also arranged so that it can be scanned easily.

Valid data is reorganized into tiers for faster query processing. Key and value data are isolated into separate streams throughout this process — keys are written to the staging media class to facilitate faster lookups. Eventually, older data at the bottom tier is written to the designated capacity media class devices.

As queries are serviced and data is read from both media classes, indexes are page-cached into DRAM. An LRU (least recently used) algorithm dynamically ranks indexes to facilitate index tracking, pinning the hottest indexes (i.e., the most frequently accessed) in memory, provided system DRAM is available.
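The space-reclamation step in the process above (drop obsolete versions, rewrite only valid data, keep it arranged for scanning) can be sketched as follows. This is a simplified illustration, not HSE's actual maintenance code: the `Record` layout, the sequence numbers and the function names are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# One stored record: (key, sequence number, value).
# A value of None marks a delete (a tombstone). This layout is hypothetical.
Record = Tuple[str, int, Optional[str]]


def compact(records: List[Record]) -> List[Record]:
    """Rewrite only the newest live version of each key, sorted by key.

    Dropping tombstones outright is only safe when no older copy of the key
    survives elsewhere, for example when compacting the bottom tier.
    """
    newest: Dict[str, Record] = {}
    for rec in records:
        key, seq, _ = rec
        if key not in newest or seq > newest[key][1]:
            newest[key] = rec
    live = [rec for rec in newest.values() if rec[2] is not None]
    return sorted(live, key=lambda rec: rec[0])


def garbage_ratio(records: List[Record]) -> float:
    """Fraction of stored records that compaction would discard."""
    if not records:
        return 0.0
    return 1.0 - len(compact(records)) / len(records)


if __name__ == "__main__":
    stored = [
        ("a", 1, "x"),
        ("b", 2, "y"),
        ("a", 3, "z"),   # update: makes ("a", 1, "x") obsolete
        ("b", 4, None),  # delete: makes ("b", 2, "y") obsolete
        ("c", 5, "w"),
    ]
    print(garbage_ratio(stored))
    print(compact(stored))
```

A maintenance task could use something like `garbage_ratio` to decide when "a large part of the data is now invalid" and a rewrite is worth the extra writes.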

Media Class Performance

Our test setup used one Micron 9300 SSD with NVMe™ as the staging media class device and four Micron 5210 SATA QLC SSDs as the capacity media class devices. We used the Yahoo!® Cloud Serving Benchmark (YCSB) to compare operations per second and 99.9th percentile (tail) latencies across two runs:

  • First run: Four Micron 5210 QLC SSDs configured as capacity media class devices
  • Second run: Four 5210 SSDs configured as capacity media class devices and one Micron 9300 SSD with NVMe as a staging media class device

We ran YCSB workloads A, B, C, D and F with the same thread counts for both configurations (see note 1). Table 1 summarizes several YCSB workload mixes, with application examples taken from the YCSB documentation. Tables 2 through 4 share other test details regarding hardware, software and benchmark configurations.

Table 1: Workloads

YCSB Workload   I/O Operations                     Application Example
A               50% Read, 50% Update               Session store recording user-session activity
B               95% Read, 5% Update                Photo tagging
C               100% Read                          User profile cache
D               95% Read, 5% Insert                User status updates
F               50% Read, 50% Read-Modify-Write    User database or recording user activity

1. Workload E was not tested because it is not universally supported

Table 2: Hardware Details

Server Platforms
Server Platform                       Intel® based (dual-socket)
Processors                            2x Intel E5-2690 v4
Memory                                256GB DDR4
SSDs                                  Staging class media: 1x Micron 9300 SSD with NVMe
                                      Capacity class media: 4x Micron 5210 7.68TB SATA SSDs
Capacity Class Media Configuration    LVM striped logical volume

Table 3: Software Details

Software Details
Operating System    Red Hat Enterprise Linux 8.1
HSE Version         1.7.0
RocksDB Version     6.6.4
YCSB Version        0.17.0

Table 4: Benchmark (see note 2)

YCSB Benchmark Configuration
Dataset           2TB (2 billion 1,000-byte records)
Client Threads    96
Operations        2 billion per workload

2. Different configurations may show different results.
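For readers who want to approximate these runs, the workload mixes in Table 1 and the scale in Table 4 can be restated as YCSB core workload properties. The sketch below uses the property names from YCSB's core workload (`readproportion`, `updateproportion`, `insertproportion`, `scanproportion`, `readmodifywriteproportion`, `recordcount`, `operationcount`). The exact property files used for these tests are not given here, so treat the output as an approximation. Settings the tables do not state, such as the request distribution, are left out.

```python
# Operation mixes from Table 1, keyed by YCSB workload letter.
WORKLOAD_MIXES = {
    "A": {"readproportion": 0.50, "updateproportion": 0.50},
    "B": {"readproportion": 0.95, "updateproportion": 0.05},
    "C": {"readproportion": 1.00},
    "D": {"readproportion": 0.95, "insertproportion": 0.05},
    "F": {"readproportion": 0.50, "readmodifywriteproportion": 0.50},
}

# Scale from Table 4: 2 billion records, 2 billion operations per workload.
RECORD_COUNT = 2_000_000_000
OPERATION_COUNT = 2_000_000_000

PROPORTION_KEYS = (
    "readproportion",
    "updateproportion",
    "insertproportion",
    "scanproportion",
    "readmodifywriteproportion",
)


def render_properties(workload: str) -> str:
    """Return property-file text for one workload letter.

    Every proportion is written explicitly (unused ones as 0.00) so that
    YCSB's built-in defaults cannot leak into the mix.
    """
    mix = WORKLOAD_MIXES[workload]
    lines = [
        f"recordcount={RECORD_COUNT}",
        f"operationcount={OPERATION_COUNT}",
    ]
    for key in PROPORTION_KEYS:
        lines.append(f"{key}={mix.get(key, 0.0):.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_properties("F"))
```

YCSB's default record layout of 10 fields of 100 bytes each is consistent with the 1,000-byte records listed in Table 4.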

Throughput

YCSB starts by loading the database. This is a 100% insert workload. Adding a 9300 to the mix reduces the time taken to load the 2TB database by a factor of four.

Figure 1 shows throughput for the load phase and run phase of the five YCSB workloads. For write-intensive workloads like Workload A (50% update) and Workload F (50% read-modify-write), adding a Micron 9300 as a staging media class device increases overall throughput by factors of 2.3 and 2.1, respectively. Workloads B and D (5% updates/inserts) show more modest improvements in throughput because 95% of their operations are reads served almost entirely from the 5210 SSDs comprising the capacity media class.

Figure 1: YCSB Workload (chart of operations per second, Micron 9300 vs. 5210 SSD)

Latencies

Figure 2 shows the 99.9th percentile read (tail) latencies. For every workload except Workload C (100% reads), read tail latencies improve considerably (2 to 3 times) after adding the 9300. Recall that newly arrived writes are first absorbed by the 9300 and gradually written in the background to the 5210s as the data ages. Key data (indexes) are written to the 9300, making lookups faster in the second configuration. A fraction of the reads are serviced by the 9300 instead of the 5210s (depending on the query distribution and age of the data being read).

Additionally, by reducing the number of writes to the 5210s, even the reads that are serviced by the 5210s suffer less contention from ongoing writes, so tail read latencies are lower. The insert/update latencies are not pictured as they are similar in the two configurations during the run phase.

Figure 2: Latencies by YCSB Workload (chart of read latencies, Micron 9300 vs. 5210 SSD)

Bytes Written

Finally, we measured the amount of data written to the 5210s in the course of executing each workload. Adding a 9300 as a staging media class device reduces the number of bytes written to the 5210s, preserving write cycles and extending the 5210s’ write lifespan. During the load (insert-only) phase, the number of bytes written to the 5210s is reduced by a factor of 2.4, as seen in Table 5.

Table 5: Write Reduction

Bytes Written
Configuration                           4x 5210    9300 + 4x 5210
GB written to 5210s (capacity media)    7260       2978
GB written to 9300 (staging media)      N/A        4158
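The reduction factor quoted for the load phase follows directly from the Table 5 figures. The short calculation below reproduces it. It also derives two numbers the table does not list, the combined bytes written in the second configuration and the share absorbed by the staging device. The variable names are only illustrative.

```python
# Load-phase figures from Table 5, in GB.
capacity_only_gb = 7260          # written to the 5210s with no staging device
with_staging_capacity_gb = 2978  # written to the 5210s once the 9300 is added
with_staging_staging_gb = 4158   # written to the 9300 itself

# How much less data reaches the QLC capacity devices.
reduction = capacity_only_gb / with_staging_capacity_gb

# Combined writes across both media classes in the second configuration.
total_with_staging_gb = with_staging_capacity_gb + with_staging_staging_gb

# Fraction of those writes that land on the higher-endurance staging device.
staging_share = with_staging_staging_gb / total_with_staging_gb

print(f"Reduction in GB written to 5210s: {reduction:.2f}x")
print(f"Total GB written with staging:    {total_with_staging_gb}")
print(f"Share absorbed by the 9300:       {staging_share:.0%}")
```

The ratio comes out at about 2.44, which the text rounds to 2.4. The combined total in the second configuration is 7136 GB, of which roughly 58% goes to the 9300 rather than to the lower-endurance QLC devices.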

Figure 3 shows the total number of gigabytes written during the run phase of the YCSB workloads. Note that this includes both user and background writes. Every workload except Workload C (100% read) shows at least a twofold reduction in the total number of bytes written to the 5210s when one 9300 is added to the configuration.

Figure 3: Reductions in Data Written (chart of total gigabytes written, Micron 9300 vs. 5210 SSD)

Future Work

As part of future work, we are looking to broaden specific aspects of the HSE API to enhance its use, like custom media class policies that give the application more control. For example, if the application creates a key-value store (KVS, the equivalent of a table in a relational database) that will be used only for indexing, it can specify that the particular KVS should use a staging media class to speed up lookups. If the size of the indexing KVS grows too large to be accommodated on the staging device, the application can specify a policy that uses staging media but falls back to capacity media. We may also introduce predefined media class policy templates and extend the HSE API to allow an application to use them based on its needs. Be sure to stay in touch for potential developments.
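As a rough illustration of what such a media class policy could look like, the sketch below models a policy as an ordered list of media classes to try, so that "staging with fallback to capacity" is simply a two-entry list. None of this is the HSE API: the policy names, the `MediaClass` type and the `place` function are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MediaClass:
    """A pool of storage with a fixed size (hypothetical model)."""
    name: str
    capacity_bytes: int
    used_bytes: int = 0

    def try_allocate(self, nbytes: int) -> bool:
        """Reserve nbytes if they fit; report whether the reservation succeeded."""
        if self.used_bytes + nbytes > self.capacity_bytes:
            return False
        self.used_bytes += nbytes
        return True


# Hypothetical policy names, each mapped to the media classes to try in order.
POLICIES: Dict[str, List[str]] = {
    "staging_only": ["staging"],
    "capacity_only": ["capacity"],
    "staging_then_capacity": ["staging", "capacity"],
}


def place(policy: str, nbytes: int, classes: Dict[str, MediaClass]) -> str:
    """Allocate nbytes under the given policy and return the media class used."""
    for name in POLICIES[policy]:
        if classes[name].try_allocate(nbytes):
            return name
    raise RuntimeError(f"no space left under policy {policy!r}")


if __name__ == "__main__":
    classes = {
        "staging": MediaClass("staging", capacity_bytes=100),
        "capacity": MediaClass("capacity", capacity_bytes=1000),
    }
    # An indexing KVS that outgrows the staging device falls back to capacity.
    print(place("staging_then_capacity", 80, classes))
    print(place("staging_then_capacity", 80, classes))
```

Under this model an indexing KVS would start on the staging device and spill to capacity media only once staging is full, which matches the fallback behavior described above.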