A Variable-Granularity Hardware DRAM Cache Design for RDMA Far Memory

Zhang Xu; Lu Tianyue; Chen Mingyu

Article Abstract

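Zhang Xu, Lu Tianyue, Chen Mingyu. A dynamic granularity hardware DRAM cache for disaggregated memory[J]. High Technology Letters (Chinese edition), 2026, 36(3): 256-267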

A dynamic granularity hardware DRAM cache for disaggregated memory

DOI: 10.3772/j.issn.1002-0470.2026.03.004

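Keywords (translated from Chinese): remote direct memory access (RDMA) far memory system; memory semantic interface; hardware dynamic random access memory (DRAM) cache; prefetching mechanism; replacement policy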

English keywords: remote direct memory access (RDMA) remote memory, memory semantic interconnect, hardware dynamic random access memory (DRAM) cache, prefetching, eviction

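Author	Affiliation
Zhang Xu	State Key Lab of Processors (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049
Lu Tianyue
Chen Mingyu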

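Abstract (translated from Chinese):

Remote direct memory access (RDMA) far memory systems significantly improve memory utilization in data centers. Current implementations of RDMA far memory systems fall into two classes of software approaches and one class of hardware approach, all of which use local memory to cache hot data and thereby hide the long latency of application accesses to far memory. Among them, the hardware approach based on a memory semantic interface offers many advantages, including low latency and application transparency. However, constrained by factors such as hardware logic complexity and timing, the hardware dynamic random access memory (DRAM) cache structures in existing work on this approach are relatively simple and use a fixed or dual cache granularity. They do not fully exploit applications' memory access characteristics and cannot achieve good performance across all kinds of applications. Therefore, by analyzing the shortcomings of existing designs and drawing on two commonly occurring application memory access characteristics that it identifies, this paper proposes a new hardware DRAM cache design that analyzes applications' memory access characteristics in real time and dynamically selects suitable prefetching and caching granularities. The core innovations are twofold: (1) a prefetching mechanism based on memory access correlation, which collects statistics on the access correlation among multiple fine-grained data blocks, infers the granularity at which applications access data objects, and prefetches multiple data blocks with strong access correlation; (2) a dual-track replacement policy for two replacement scenarios, which uses a limited least recently used (LRU) algorithm and a "hotness"-aware Clock algorithm, respectively. Evaluation on the DRAMSim3 simulator shows that, compared with the hardware DRAM cache designs of related work, the proposed design achieves a speedup of 1.12× to 1.43×.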
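The abstract gives only a high-level description of the correlation-based prefetching mechanism, so the following Python sketch is an illustration of the general idea and not the authors' hardware design. It counts how often fine-grained blocks of the same region are accessed close together and suggests strongly correlated blocks as prefetch candidates. The block size, region size, window length, threshold, and all names (`CorrelationPrefetcher`, `access`) are assumptions made for this example.

```python
from collections import defaultdict

# All constants below are assumptions for illustration, not values from the paper.
BLOCK_SIZE = 256          # assumed fine-grained block size in bytes
BLOCKS_PER_REGION = 16    # assumed number of blocks per tracked region
WINDOW = 4                # assumed: accesses this close together count as correlated
THRESHOLD = 2             # assumed: co-access count needed before prefetching


class CorrelationPrefetcher:
    """Toy model that counts co-accesses between blocks of the same region."""

    def __init__(self):
        # co_access[region][(a, b)] = times blocks a and b were touched close together
        self.co_access = defaultdict(lambda: defaultdict(int))
        self.recent = []  # the last WINDOW (region, block) pairs

    def access(self, addr):
        """Record an access and return the block indices worth prefetching."""
        block_id = addr // BLOCK_SIZE
        region, block = divmod(block_id, BLOCKS_PER_REGION)
        # Update correlation counts against recently touched blocks in this region.
        for r, b in self.recent:
            if r == region and b != block:
                self.co_access[region][(b, block)] += 1
                self.co_access[region][(block, b)] += 1
        self.recent.append((region, block))
        self.recent = self.recent[-WINDOW:]
        # Candidates: blocks strongly correlated with the block just accessed.
        return sorted(
            b for (a, b), n in self.co_access[region].items()
            if a == block and n >= THRESHOLD
        )
```

In this toy model the set of candidates returned for a block approximates the granularity of the data object it belongs to: a block with many strong partners suggests a large object, while a block with none is fetched alone.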

English abstract:

Memory disaggregation is a cost-effective approach to improving memory utilization in datacenters. Current implementations of memory disaggregation systems can be categorized into two software-based approaches and one hardware-based approach, all of which use local memory to cache hot data and thereby hide the long latency of accessing remote memory over remote direct memory access (RDMA). Among these, the hardware approach based on a memory-semantic interconnect offers several promising advantages, including low latency and application transparency. However, constrained by hardware logic complexity, timing, and other factors, existing hardware dynamic random access memory (DRAM) cache designs employ simple structures with fixed or dual caching granularities. They fail to fully exploit the memory access patterns of running applications and consequently cannot deliver good performance across diverse workloads. Based on an analysis of the shortcomings of existing designs and on two commonly observed application memory access patterns, this paper proposes a novel hardware DRAM cache design that analyzes application memory access patterns in real time and dynamically selects suitable prefetching and caching granularities. The core innovations are twofold. (1) Correlation-based prefetching mechanism: analyzing the access correlations among multiple fine-grained data blocks to determine the granularity of the data objects accessed by applications, enabling prefetching of multiple strongly correlated fine-grained blocks. (2) Dual-track eviction policy: applying a least recently used (LRU) algorithm over a limited set of candidates and a 'hotness'-aware Clock algorithm, one for each of two eviction scenarios. Evaluations using the DRAMSim3 simulator show that the proposed design achieves a performance speedup of 1.12× to 1.43× over state-of-the-art hardware DRAM cache designs.
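The abstract does not say how the 'hotness'-aware Clock algorithm tracks hotness or which eviction scenario it serves, so the Python sketch below is only one plausible reading and not the paper's policy. It assumes a small saturating counter per cache slot in place of the single reference bit of classic Clock, and it models only the Clock track, not the limited-candidate LRU track. The class name `HotnessClock` and the `max_hotness` parameter are hypothetical.

```python
class HotnessClock:
    """Toy Clock variant: each slot carries a small saturating hotness counter."""

    def __init__(self, capacity, max_hotness=3):
        self.capacity = capacity
        self.max_hotness = max_hotness   # assumed counter ceiling
        self.slots = []                  # list of [key, hotness]
        self.index = {}                  # key -> slot position
        self.hand = 0                    # clock hand position

    def access(self, key):
        """Touch key; return the evicted key, or None if nothing was evicted."""
        if key in self.index:
            # Hit: raise hotness, saturating at max_hotness.
            slot = self.slots[self.index[key]]
            slot[1] = min(slot[1] + 1, self.max_hotness)
            return None
        if len(self.slots) < self.capacity:
            # Miss with free space: fill the next empty slot.
            self.index[key] = len(self.slots)
            self.slots.append([key, 0])
            return None
        # Miss with a full cache: sweep, cooling slots until one reaches zero.
        while self.slots[self.hand][1] > 0:
            self.slots[self.hand][1] -= 1
            self.hand = (self.hand + 1) % self.capacity
        victim = self.slots[self.hand][0]
        del self.index[victim]
        self.slots[self.hand] = [key, 0]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return victim
```

With a multi-level counter, a block that was hit several times survives several sweeps of the hand, whereas classic Clock gives every referenced block exactly one second chance.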
