Construction of Scholar Profile Based on Generative Pre-Trained Language Model

Liu Tao (柳涛); Ding Chenjun (丁陈君); Jiang Enbo (姜恩波); Xu Rui (许睿); Chen Fang (陈方)

Article Abstract

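Liu Tao, Ding Chenjun, Jiang Enbo, Xu Rui, Chen Fang. Construction of Scholar Profile Based on Generative Pre-Trained Language Model[J]. Digital Library Forum (数字图书馆论坛), 2024, 20(3): 1-11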

Submitted: 2023-12-12

DOI: 10.3772/j.issn.1673-2286.2024.03.001

Keywords: Generative Pre-Trained Language Model; Sample Fine-Tuning; Scholar Profile; GPT-3

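Funding: This research was supported by the "Light of West China" Talent Training Program project "Research, Development and Application Demonstration of a Science and Technology Service System for the Pharmaceutical and Biotechnology Industry Based on Model Innovation" (No. E1C0000401) and by the Innovation Fund project of the Chengdu Library and Information Center, Chinese Academy of Sciences, "Construction of a Smart Data System for Bio-Information Science and Technology Intelligence" (No. E1Z0000101).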

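Author	Affiliation
Liu Tao (柳涛)	Chengdu Library and Information Center, Chinese Academy of Sciences; Department of Information Resources Management, University of Chinese Academy of Sciences
Ding Chenjun (丁陈君)	Chengdu Library and Information Center, Chinese Academy of Sciences
Jiang Enbo (姜恩波)	Chengdu Library and Information Center, Chinese Academy of Sciences; Department of Information Resources Management, University of Chinese Academy of Sciences
Xu Rui (许睿)	Chengdu Library and Information Center, Chinese Academy of Sciences
Chen Fang (陈方)	Chengdu Library and Information Center, Chinese Academy of Sciences; Department of Information Resources Management, University of Chinese Academy of Sciences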

Abstract:

In the era of big data, scholar information on the Internet exists in multi-source, heterogeneous, and unstructured forms. Entity extraction from such data suffers from attribute confusion and long entities, which seriously reduces the accuracy of scholar profile construction. Meanwhile, the scholar attribute entity extraction model, a key component in building scholar profiles, still has a high technical barrier to practical use, which hinders the wider adoption of scholar profiles. To address this, building on open resources, we construct an attribute entity extraction framework based on a generative pre-trained language model, using guide-sentence modelling, autoregressive generation, and fine-tuning on a training corpus. We validate the method from four aspects: overall model performance, extraction performance per entity category, case analysis of the main influencing factors, and analysis of the impact of sample fine-tuning. Compared with the baseline models, the proposed method achieves the best results on all 12 categories of scholar attribute entities, with an overall F1 score of 99.34%. It not only identifies and distinguishes mutually confusable attribute entities well, but also improves extraction precision for "research interests", a typical long attribute entity, by 6.11%. The method offers faster and more effective support for the engineering application of scholar profiles.
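
The abstract names three ingredients (guide-sentence modelling, autoregressive generation, and fine-tuning on training samples) without giving their concrete form. The sketch below illustrates one plausible way such a pipeline could be framed: a guide sentence turns attribute extraction into text continuation, prompt/completion pairs serve as fine-tuning samples, and an entity-level F1 score compares predictions with gold annotations. The guide-sentence template, the function names, and the choice of micro-averaged F1 are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def build_prompt(text: str, attribute: str) -> str:
    """Frame extraction as continuation using a guide sentence.

    The template is hypothetical; the paper's actual guide sentences
    are not given in the abstract.
    """
    return f"Text: {text}\nQuestion: what is the scholar's {attribute}?\nAnswer:"


def build_training_sample(text: str, attribute: str, value: str) -> dict:
    """Build one prompt/completion pair for fine-tuning an autoregressive model."""
    return {"prompt": build_prompt(text, attribute), "completion": f" {value}\n"}


def parse_completion(completion: str) -> str:
    """Keep only the first generated line as the extracted attribute value."""
    return completion.strip().split("\n")[0].strip()


def micro_f1(gold: list, pred: list) -> float:
    """Entity-level micro F1 over (doc_id, attribute, value) tuples.

    Assumption: exact-match scoring. The abstract does not say how its
    overall F1 is averaged.
    """
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    true_positives = sum((gold_counts & pred_counts).values())
    precision = true_positives / sum(pred_counts.values()) if pred_counts else 0.0
    recall = true_positives / sum(gold_counts.values()) if gold_counts else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In this framing, a long attribute such as "research interests" is generated as free text after the guide sentence rather than tagged token by token, which is one reason a generative approach may cope better with long entities; whether this matches the authors' reasoning would need to be checked against the full paper.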
