Article abstract
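Yang Mingxuan (杨明烜)* **, Hong Xuehai (洪学海)*, Tang Hongwei (唐宏伟)* ***. Artificial intelligence computing power cluster scheduling based on task resource demand prediction [J]. Chinese High Technology Letters (高技术通讯, Chinese edition), 2024, 34(5): 475-485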
Artificial intelligence computing power cluster scheduling based on task resource demand prediction
DOI: 10.3772/j.issn.1002-0470.2024.05.004
Keywords: resource scheduling; elastic resource allocation; artificial intelligence (AI); computing power
Funding:
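Authors and affiliations:
Yang Mingxuan (杨明烜)* **
Hong Xuehai (洪学海)*
Tang Hongwei (唐宏伟)* ***
* Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
** University of Chinese Academy of Sciences, Beijing 100049
*** Nanjing College, University of Chinese Academy of Sciences, Nanjing 211135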
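Abstract (translated from the Chinese):
      To improve the task execution efficiency and resource utilization of artificial intelligence (AI) computing power, this paper proposes an AI computing power scheduling method based on task resource demand prediction, which guides the resource scheduling process. Whereas most previous work designed AI computing power scheduling methods around graphics processing unit (GPU) resources alone, this paper takes full account of how multiple resource dimensions affect both the running efficiency of user tasks and the resource utilization of the computing cluster. A task resource demand prediction model is built with machine learning methods to analyze the effect of multidimensional resources on task performance. On that basis, adaptive resource scaling scheduling is carried out, which addresses the problem of users requesting more resources than they need. Experimental results show that, within the same period, the method deploys and executes more tasks: the number of deployed tasks increases by 25.3%, the completion rate of deployed tasks increases by 15.2%, and GPU and memory utilization increase by 7.2% and 8.0% respectively, improving the overall utilization of computing resources.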
Abstract (original English):
      A scheduling method based on task resource demand prediction is proposed to improve the job execution efficiency and resource utilization of artificial intelligence (AI) computing power clusters. Most existing schedulers are designed around the allocation of graphics processing unit (GPU) resources alone and ignore the effect of multidimensional resources on AI task execution. In this work, the impact of multidimensional resources on job execution and cluster resource utilization is considered. First, the multidimensional resource requirements of jobs are modeled through machine learning methods. Then, an adaptive resource scaling scheduling method is proposed, which reduces the resource waste caused by over-claimed requests. Compared with the basic strategy, this method allows more tasks to be allocated and executed in the same period. Evaluation results show that job deployment increases by 25.3% and the completion rate of deployed tasks increases by 15.2%. GPU and memory utilization increase by 7.2% and 8.0% respectively, leading to an improvement in the overall utilization of computing resources.
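The abstract does not specify the prediction model or the scaling rule, so the following is only a minimal sketch of the general idea of capping over-claimed requests at a predicted demand before admission. Every name in it (`Resources`, `right_size`, `admit`), the 20% headroom factor, the single-node first-fit admission, and the task figures are assumptions made for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Resources:
    """A multidimensional resource vector (hypothetical; not the paper's model)."""
    gpu: int
    cpu: float
    mem_gb: float

    def fits_in(self, free: "Resources") -> bool:
        # A grant fits only if every dimension fits, not just the GPU count.
        return (self.gpu <= free.gpu
                and self.cpu <= free.cpu
                and self.mem_gb <= free.mem_gb)


def right_size(requested: Resources, predicted: Resources,
               headroom: float = 1.2) -> Resources:
    """Cap each requested dimension at predicted demand plus headroom.

    The grant never exceeds what the user asked for. The headroom factor
    is an illustrative assumption.
    """
    return Resources(
        gpu=min(requested.gpu, math.ceil(predicted.gpu * headroom)),
        cpu=min(requested.cpu, predicted.cpu * headroom),
        mem_gb=min(requested.mem_gb, predicted.mem_gb * headroom),
    )


def admit(tasks: List[Tuple[str, Resources, Resources]],
          capacity: Resources, use_prediction: bool) -> List[str]:
    """First-fit admission on one node; returns the names of admitted tasks."""
    free = capacity
    admitted = []
    for name, requested, predicted in tasks:
        grant = right_size(requested, predicted) if use_prediction else requested
        if grant.fits_in(free):
            free = Resources(free.gpu - grant.gpu,
                             free.cpu - grant.cpu,
                             free.mem_gb - grant.mem_gb)
            admitted.append(name)
    return admitted


# Made-up workload: five identical tasks that over-claim CPU and memory.
capacity = Resources(gpu=8, cpu=64, mem_gb=256)
tasks = [(f"job{i}", Resources(2, 32, 128), Resources(2, 8, 32)) for i in range(5)]

baseline = admit(tasks, capacity, use_prediction=False)  # grants the raw requests
elastic = admit(tasks, capacity, use_prediction=True)    # grants right-sized requests
print(len(baseline), len(elastic))
```

In this toy setting the raw requests exhaust CPU and memory after two tasks and leave four GPUs idle, while the right-sized grants admit four tasks and are limited only by the GPU count. It illustrates why scheduling on GPU count alone can strand capacity, but it says nothing about the magnitude of the gains reported in the paper.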