PDM-Shuffle: Design of a Data Shuffle System Based on Passive Disaggregated Memory

Cheng Liyun* **; Wu Jingya*; Lu Wenyan*; Zhong Langhui***; Yan Guihai*

Article abstract

Cheng Liyun* **, Wu Jingya*, Lu Wenyan*, Zhong Langhui***, Yan Guihai*. PDM-Shuffle: passive disaggregated memory-based shuffle system [J]. 高技术通讯 (Chinese edition), 2025, 35(4): 370-384

PDM-Shuffle: passive disaggregated memory-based shuffle system

DOI：

Keywords: shuffle; separation of storage and compute; disaggregated memory system; compute express link (CXL); memory consistency; pre-aggregation

Funding:

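Author	Affiliation
Cheng Liyun* **	(State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (University of Chinese Academy of Sciences, Beijing 100049) (**Shanghai Stock Exchange Technology Co., Ltd., Shanghai 200131)
Wu Jingya*
Lu Wenyan*
Zhong Langhui***
Yan Guihai*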

Abstract views: 514

Full-text downloads: 562

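Chinese abstract (translated):

A storage-compute separated architecture decouples the computation and storage steps of data shuffle, which improves the scalability of distributed data processing applications. However, transferring shuffle data to remote storage nodes adds network overhead, and the storage nodes become a new communication bottleneck. To address the new challenges that shuffle faces once storage and compute are separated, this paper proposes a passive disaggregated memory-based shuffle system (PDM-Shuffle). It uses compute express link (CXL), a new coherent bus interconnect protocol, to attach shared memory devices directly. These devices store and exchange the intermediate shuffle data, which avoids both writing the data to disk and sending it over transmission control protocol/Internet protocol (TCP/IP). Because the memory devices support only passive data writes, the paper uses methods such as memory pre-partitioning and memory address allocation by a metadata control node to guarantee pre-aggregation of same-partition data and consistent access management of the shared memory. Experimental results show that, on large-scale datasets, PDM-Shuffle reduces the single-job completion time of the sorting and graph-computation applications TeraSort and PageRank by 49% and 65%, respectively, compared with the traditional centralized architecture, and improves on Zeus, an existing optimization for the storage-compute separated architecture, by 36% and 18%, respectively.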

English abstract:

Separating storage from compute decouples the computation and storage processes of the shuffle phase, which improves the scalability of distributed data processing applications. However, transferring shuffle data to remote storage nodes introduces additional network overhead, and the storage nodes may become a new communication bottleneck. To address the new challenges that shuffle faces under a storage-compute separated architecture, this paper proposes PDM-Shuffle, a shuffle system based on passive disaggregated memory. It uses compute express link (CXL), a new coherent bus interconnect protocol, to connect directly to shared memory devices that store and exchange shuffle data, avoiding both hard disk storage and transmission control protocol/Internet protocol (TCP/IP) transmission. Because the memory devices support only passive data writing, the paper adopts methods such as memory pre-partitioning and memory address allocation by a metadata control node to ensure pre-aggregation of data in the same partition and consistent access management of the shared memory. Experimental results show that, when processing large-scale datasets, PDM-Shuffle reduces the single-job completion time of the sorting and graph-computation applications TeraSort and PageRank by 49% and 65%, respectively, compared with the traditional centralized architecture. Relative to Zeus, an existing optimization for the storage-compute separated architecture, it achieves improvements of 36% and 18%, respectively.
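The abstract names memory pre-partitioning and address allocation by a metadata control node but gives no implementation details. The Python sketch below is only an illustration of how those two ideas could fit together, not the authors' design. Every name in it (`MetadataAllocator`, `map_task`, `reduce_task`, `run_demo`) is hypothetical, a `bytearray` stands in for the CXL-attached shared memory device, and threads stand in for compute nodes. It assumes fixed-size partition regions and a single lock-protected allocator.

```python
import threading
from typing import List, Tuple


class MetadataAllocator:
    """Hypothetical metadata control node: hands out write offsets so that
    writers never overlap and data of one partition stays contiguous."""

    def __init__(self, num_partitions: int, region_size: int):
        self.region_size = region_size
        # Pre-partitioning: partition p owns bytes
        # [p * region_size, (p + 1) * region_size) of the pool.
        self.base = [p * region_size for p in range(num_partitions)]
        self.used = [0] * num_partitions
        self.lock = threading.Lock()

    def allocate(self, partition: int, nbytes: int) -> int:
        """Reserve nbytes in the partition's region; return the absolute offset."""
        with self.lock:
            if self.used[partition] + nbytes > self.region_size:
                raise MemoryError(f"partition {partition} region is full")
            offset = self.base[partition] + self.used[partition]
            self.used[partition] += nbytes
            return offset

    def extent(self, partition: int) -> Tuple[int, int]:
        """Return (start, length) of the bytes reserved for a partition."""
        with self.lock:
            return self.base[partition], self.used[partition]


def map_task(pool: bytearray, alloc: MetadataAllocator,
             records: List[Tuple[int, bytes]]) -> None:
    """Mapper side: request an address, then write directly into the pool.
    The memory itself is passive; all coordination goes through alloc."""
    for partition, payload in records:
        offset = alloc.allocate(partition, len(payload))
        pool[offset:offset + len(payload)] = payload


def reduce_task(pool: bytearray, alloc: MetadataAllocator, partition: int) -> bytes:
    """Reducer side: read the already aggregated partition in one contiguous slice."""
    start, length = alloc.extent(partition)
    return bytes(pool[start:start + length])


def run_demo() -> List[bytes]:
    num_partitions, region_size = 2, 16
    pool = bytearray(num_partitions * region_size)  # stand-in for shared memory
    alloc = MetadataAllocator(num_partitions, region_size)
    mappers = [
        threading.Thread(target=map_task, args=(pool, alloc, [(0, b"aa"), (1, b"bb")])),
        threading.Thread(target=map_task, args=(pool, alloc, [(0, b"cc"), (1, b"dd")])),
    ]
    for t in mappers:
        t.start()
    for t in mappers:
        t.join()  # reducers read only after all mappers have finished
    return [reduce_task(pool, alloc, p) for p in range(num_partitions)]
```

Calling `run_demo()` returns one byte string per partition. The order of the two payloads inside a partition depends on which mapper reaches the allocator first, but each partition's data always ends up contiguous, which is the pre-aggregation property the sketch is meant to show.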
