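Article abstract
Liu Chenxiao* **, Wang Chenxi* **, Du Zidong* ** ***. An online silent data corruption detection transactional framework [J]. Chinese High Technology Letters, 2026, 36(4): 340-353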
An online silent data corruption detection transactional framework
DOI: 10.3772/j.issn.1002-0470.2026.04.002
Keywords: silent data corruption; fault tolerance; cloud computing; data center; programming framework
Funding:
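Authors and affiliations
Liu Chenxiao (刘晨骁)* ** (*University of Chinese Academy of Sciences, Beijing 101408) (**State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (***Shanghai Innovation Center for Processor Technologies, Shanghai 201210)
Wang Chenxi (王晨曦)* **
Du Zidong (杜子东)* ** ***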
Abstract:
      As processor designs grow more complex and manufacturing processes more refined, processor reliability and stability face increasing challenges. Even after rigorous testing, processors may still expose hardware defects after deployment, leading to a range of application-level errors. One particular class of error corrupts application data without crashing the application or triggering any alert; this phenomenon is known as silent data corruption (SDC). Although only about 0.03% of processors cause SDC, the probability of an SDC occurring cannot be ignored in data centers with hundreds of thousands of processors. Meanwhile, existing fault-tolerant systems are built on the fail-stop assumption and lack a runtime verification mechanism for SDC, so they struggle to detect it, which poses a serious threat to data security in data centers. This paper therefore proposes an online SDC verification mechanism named Gemini ("双子星"); the name reflects how the verification module accompanies the running application like a twin star. Gemini provides a transaction-based framework comprising a set of data structures, accompanying transactions, and a runtime system. While an application built on Gemini executes, an online SDC verification module runs in the background and validates the correctness of the application's computations in real time at transaction granularity. To avoid the performance interference that frequent synchronization between the verification module and the application would cause, Gemini adopts a log-based verification method: at the end of each transaction, the application generates an SDC log that records necessary information such as the modifications to user data and the functions executed. The verification module automatically checks the log queue, recomputes each transaction from its SDC log on a processor core different from the one that executed it, and compares the result with the data modifications recorded in the log. Gemini's contribution lies in detecting user data corruption caused by silent processor errors at the millisecond level while incurring a performance overhead of 3.20% at minimum and 15.20% on average, which makes it an online SDC verification mechanism that can be deployed in data centers.
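The log-based verification flow described in the abstract can be illustrated with a minimal Python sketch. This is not Gemini's implementation or API: every name here (`SdcLog`, `transaction`, `run_transaction`, `verifier`, the `deposit` example, and the `corrupt` flag that simulates a bit flip) is a hypothetical stand-in. The sketch also assumes transactions are deterministic functions of their logged inputs, and it uses an ordinary thread for the verifier; a real system would additionally place the verifier on a different processor core, which is omitted here.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class SdcLog:
    """Hypothetical SDC log entry written at the end of a transaction."""
    func_name: str           # which transaction function ran
    args: Tuple[Any, ...]    # inputs the function was called with
    writes: Dict[str, Any]   # key -> value the transaction wrote


# Registry of deterministic transaction bodies that can be replayed.
TRANSACTIONS: Dict[str, Callable[..., Dict[str, Any]]] = {}
log_queue: queue.Queue = queue.Queue()


def transaction(fn):
    """Register a function so the verifier can look it up by name."""
    TRANSACTIONS[fn.__name__] = fn
    return fn


@transaction
def deposit(balance: int, amount: int) -> Dict[str, Any]:
    return {"balance": balance + amount}


def run_transaction(store, func_name, *args, corrupt=False):
    """Execute a transaction, apply its writes, then enqueue an SDC log."""
    writes = TRANSACTIONS[func_name](*args)
    if corrupt:
        # Simulate a silent bit flip in the computed result (demo only).
        writes = {k: v ^ 1 for k, v in writes.items()}
    store.update(writes)
    log_queue.put(SdcLog(func_name, args, dict(writes)))


def verifier(mismatches):
    """Drain the log queue, recompute each transaction, compare the writes."""
    while True:
        entry = log_queue.get()
        if entry is None:  # sentinel: no more logs
            break
        recomputed = TRANSACTIONS[entry.func_name](*entry.args)
        if recomputed != entry.writes:
            mismatches.append(entry)


def demo():
    store = {"balance": 0}
    mismatches = []
    worker = threading.Thread(target=verifier, args=(mismatches,))
    worker.start()
    run_transaction(store, "deposit", store["balance"], 10)
    run_transaction(store, "deposit", store["balance"], 5, corrupt=True)
    log_queue.put(None)
    worker.join()
    return store, mismatches
```

Because the application only appends to a queue and never waits for the verifier, the two sides do not synchronize on every transaction, which is the point of the log-based design. In this sketch the second, deliberately corrupted transaction is the one the verifier flags.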