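Liu Guifeng, Chen Yihou, Bao Xiang, Han Muzhe. Research on Short Text Classification Method Based on BERTopic Topic Modeling and RoBERTa Algorithm [J]. Technology Intelligence Engineering (情报工程), 2024, 10(5): 085-098 |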
Research on Short Text Classification Method Based on BERTopic Topic Modeling and RoBERTa Algorithm |
DOI:10.3772/j.issn.2095-915X.2024.05.008 |
Keywords: Short Text Classification; Word Vector; BERTopic Model; RoBERTa Model |
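Funding: 2024 Jiangsu Province Postgraduate Research and Practice Innovation Program project "Research on a News Short Text Classification Method Based on BERTopic Topic Probability Feature Expansion" (2385); National Social Science Fund of China General Project "Research on the Design of Scientific Data Fusion Models and System Construction" (21BTQ080). |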
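Author | Affiliation |
Liu Guifeng | Institute of Science and Technology Information, Jiangsu University, Zhenjiang 212013 |
Chen Yihou | Institute of Science and Technology Information, Jiangsu University, Zhenjiang 212013 |
Bao Xiang | Institute of Science and Technology Information, Jiangsu University, Zhenjiang 212013 |
Han Muzhe | Institute of Science and Technology Information, Jiangsu University, Zhenjiang 212013 |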
Abstract: |
[Purpose/Significance] To address the sparsity problem in short text classification, this paper proposes a short text classification method that expands features with topic probabilities, based on a BERTopic-RoBERTa-PCA-CatBoost model. [Methods/Processes] The RoBERTa model is used to obtain word vector representations of short texts, and the BERTopic topic model is used to extract topic probability feature vectors. The two are fused for feature expansion, and the CatBoost algorithm performs the final classification. [Limitations] At the classification level, no deep learning algorithm was used for validation; at the feature fusion level, future work could consider other fusion methods. [Results/Conclusions] Compared with the LDA-CatBoost model, the proposed BERTopic-RoBERTa-PCA-CatBoost model improves accuracy by 10.90%, precision by 10.91%, and recall by 10.68%. The short text classification method based on topic probability feature expansion can overcome the shortcomings of a single model and improve short text classification performance. |
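The feature expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state where PCA is applied or how the vectors are fused, so the sketch assumes PCA reduces the dense text vectors first and that fusion is plain concatenation. The function names `pca_reduce` and `fuse_features`, the array sizes, and the random stand-in data are all hypothetical; real inputs would come from RoBERTa and BERTopic.

```python
import numpy as np


def pca_reduce(features: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def fuse_features(text_vectors: np.ndarray, topic_probs: np.ndarray,
                  n_components: int) -> np.ndarray:
    """Reduce the dense text vectors, then append the topic probabilities.

    Assumption: PCA before fusion, fusion by concatenation. The abstract
    does not specify either choice.
    """
    if text_vectors.shape[0] != topic_probs.shape[0]:
        raise ValueError("both inputs need one row per document")
    reduced = pca_reduce(text_vectors, n_components)
    return np.hstack([reduced, topic_probs])


# Stand-in data with hypothetical sizes: 6 documents, 768-dimensional text
# vectors (in place of RoBERTa output) and 4 topics (in place of BERTopic
# topic probabilities, one probability distribution per document).
rng = np.random.default_rng(0)
text_vectors = rng.normal(size=(6, 768))
topic_probs = rng.dirichlet(np.ones(4), size=6)

fused = fuse_features(text_vectors, topic_probs, n_components=3)
print(fused.shape)  # one row per document, 3 PCA columns + 4 topic columns
```

In a full pipeline, the fused matrix and the class labels would then be passed to a classifier; the abstract names CatBoost for this step.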