Regulation of Copyright Infringement Risks in Large Model Training Data

Dai Jianglong (代江龙); He Ruonan (何若楠)

Article Abstract

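Dai Jianglong, He Ruonan. Regulation of Copyright Infringement Risks in Large Model Training Data [J]. Digital Library Forum (数字图书馆论坛), 2025, 21(9): 74-81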

Regulation of Copyright Infringement Risks in Large Model Training Data

Submitted: 2025-08-18

DOI: 10.3772/j.issn.1673-2286.2025.09.008

Keywords: Large Model; Generative Artificial Intelligence; Training Data; Copyright Compliance; Infringement Risk; Fair Use

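Funding: This research was supported by the 2024 Ministry of Education Humanities and Social Sciences Youth Project "Research on Responses to Copyright Infringement Risks in Large Model Training Data" (No. 24YJC820011) and the 2025 Hubei Provincial Social Science Fund Special Project on Rule of Law in Hubei, "Research on Intellectual Property Legal Safeguards for Hubei's Science and Technology Industry" (No. 200505).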

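Author	Affiliation
Dai Jianglong	School of Law and Business (School of Intellectual Property), Wuhan Institute of Technology
He Ruonan	Durham Law School, Durham University, UK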

Abstract:

The rapid development of the artificial intelligence industry has intensified global competition over large models, which have become a frontier of industrial development. Training large models on large-scale, high-quality data raises a series of copyright infringement challenges. Examining the distinct stages at which training data is acquired and used, and drawing on the differing positions that the United States, Germany, and China have taken in legislation and judicial practice, this study proposes a tiered regulatory system for copyright infringement risks in training data. At the dataset acquisition stage, the copying that machine learning requires must be carefully assessed under copyright law, and potential copyright risks should be avoided by ensuring that data sources are lawful, that only a single copy of the data is kept, and that the data is used solely for training. At the dataset utilization stage, generative artificial intelligence large models must satisfy the non-expressive use requirement: the input and output sides must be considered together, and the possibility of substantially similar expression in outputs must be strictly controlled, in order to exclude infringement risk. Non-generative artificial intelligence large models may be brought directly within the scope of fair use, which excludes infringement risk. The fair use doctrine under copyright law currently remains an important route to legitimacy for acquiring and using training data; it needs to be positioned accurately and systematically, with its boundaries of application worked out gradually case by case.
