Research on Text Classification of Rich Media Knowledge for Publishers

Liu Qiongxin (刘琼昕); Song Xiang (宋祥); Wang Peng (王鹏)

Article Abstract

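Liu Qiongxin, Song Xiang, Wang Peng. Research on Text Classification of Rich Media Knowledge for Publishers [J]. 情报工程 (Technology Intelligence Engineering), 2019, 5(2): 040-048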

Research on the Processing of Rich Media Knowledge for Publishers

DOI: 10.3772/j.issn.2095-915X.2019.02.004

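Chinese keywords: 富媒体 (rich media); 文本分类 (text classification); 支持向量机 (support vector machine); 降维 (dimensionality reduction)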

English keywords: rich media; text classification; SVM; dimensionality reduction

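Funding: Open Fund Project of the Key Laboratory of Rich Media Digital Publishing Content Organization and Knowledge Service (ZD2018-07/02): "Research on Knowledge Mining and Discovery Technology for Rich Media Digital Publishing Content".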

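Author	Affiliation
Liu Qiongxin	1. Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications 2. School of Computer Science, Beijing Institute of Technology
Song Xiang	2. School of Computer Science, Beijing Institute of Technology 3. Key Laboratory of Rich Media Digital Publishing Content Organization and Knowledge Service, Institute of Scientific and Technical Information of China
Wang Peng	2. School of Computer Science, Beijing Institute of Technology 3. Key Laboratory of Rich Media Digital Publishing Content Organization and Knowledge Service, Institute of Scientific and Technical Information of China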

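Abstract (translated from the Chinese):

In the big data environment, the publishing industry faces the challenges of cross-media data organization and massive historical data brought by rich media data. To form an effective knowledge organization, and given that the text data of rich media publishers is very large in volume and carries hierarchical labels, this paper uses truncated singular value decomposition (TSVD) for dimensionality reduction, applies a support vector machine model with a linear kernel, and designs a multi-level classification method to classify rich media text. Experiments show that the proposed method achieves good text classification results on the text data of a rich media publisher. With 150-dimensional text features, the second level of the regional classification performs best, reaching an accuracy of 0.98, a recall of 0.76, and an F1 score of 0.87.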

English abstract:

In the big data era, the publishing industry faces the challenges of cross-media data organization and massive historical data brought by rich media data. The text data of rich media publishing houses is large in volume and carries hierarchical labels. To form an effective knowledge organization, this paper uses TSVD to reduce dimensionality, applies a LinearSVM model, and designs a multi-level classification method for classifying rich media texts. Experiments show that our method achieves good results on rich media texts. With 150-dimensional text features, the second level of the regional classification performs best, with an accuracy of 0.98, a recall of 0.76, and an F1 score of 0.87.
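The multi-level classification idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each document has already been turned into a low-dimensional feature vector (for example by TF-IDF followed by truncated SVD, which is not shown here), it uses a plain subgradient-descent trainer as a stand-in for a library linear SVM, and the class names (`OneVsRestSVM`, `TwoLevelClassifier`), the hyperparameters, the region labels, and the 2-dimensional toy vectors are all hypothetical and chosen only for illustration.

```python
# Sketch only: two-level (hierarchical) classification with linear SVMs.
# Feature vectors are assumed to be already reduced (e.g. by truncated SVD).

def decision(model, x):
    """Signed score w.x + b of a binary linear model (w, b)."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Binary linear SVM (labels +1/-1) fitted by subgradient descent on the
    L2-regularised hinge loss. Hyperparameters are illustrative assumptions."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * decision((w, b), xi) < 1
            # The regulariser always shrinks w; the hinge term only acts
            # when the sample violates the margin.
            w = [wj - lr * (lam * wj - (yi * xj if violated else 0.0))
                 for wj, xj in zip(w, xi)]
            if violated:
                b += lr * yi
    return w, b

class OneVsRestSVM:
    """Multi-class wrapper: one binary linear SVM per label."""
    def fit(self, X, y):
        self.models = {
            label: train_linear_svm(X, [1 if t == label else -1 for t in y])
            for label in sorted(set(y))
        }
        return self

    def predict_one(self, x):
        # Pick the label whose binary model gives the highest score.
        return max(self.models, key=lambda c: decision(self.models[c], x))

class TwoLevelClassifier:
    """First-level classifier routes each sample to a second-level
    classifier trained only on that first-level category."""
    def fit(self, X, labels):
        self.top = OneVsRestSVM().fit(X, [l1 for l1, _ in labels])
        self.children = {}
        for parent in sorted(set(l1 for l1, _ in labels)):
            rows = [(x, l2) for x, (l1, l2) in zip(X, labels) if l1 == parent]
            self.children[parent] = OneVsRestSVM().fit(
                [x for x, _ in rows], [l2 for _, l2 in rows])
        return self

    def predict_one(self, x):
        parent = self.top.predict_one(x)
        return parent, self.children[parent].predict_one(x)

# Hypothetical toy data: 2-D vectors standing in for reduced text features,
# with made-up (level 1, level 2) region labels.
X_train = [[-2.0, -2.0], [-2.5, -1.5], [-2.0, 2.0], [-1.5, 2.5],
           [2.0, -2.0], [1.5, -2.5], [2.0, 2.0], [2.5, 1.5]]
y_train = [("Asia", "China"), ("Asia", "China"),
           ("Asia", "Japan"), ("Asia", "Japan"),
           ("Europe", "France"), ("Europe", "France"),
           ("Europe", "Germany"), ("Europe", "Germany")]

model = TwoLevelClassifier().fit(X_train, y_train)
print(model.predict_one([-1.8, 1.9]))
print(model.predict_one([2.2, -1.7]))
```

Training one second-level classifier per first-level category keeps each sub-problem small, at the cost that a first-level mistake cannot be corrected at the second level. In practice the reduced features would have far more dimensions (the abstract reports results at 150), and a library SVM would normally replace the hand-written trainer.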
