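Li Xiangdong, Ruan Tao. Application and Improvement of the Mutual Information Feature Selection Method for Content-Similar Categories in the Chinese Library Classification: Taking E271 and E712.51 as Examples [J]. Digital Library Forum, 2018, (1): 46-52 |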
Application and Improvement of the Mutual Information Feature Selection Method for Content-Similar Categories in the Chinese Library Classification (CLC): Taking E271 and E712.51 as Examples |
|
DOI: |
Keywords: Content-Similar Categories; Chinese Library Classification; Two-Class Classification; Mutual Information; Feature Selection |
Funding: |
|
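Chinese abstract: |
Two categories with similar content share many features and are therefore difficult to distinguish automatically. To address this, an improved mutual information feature selection method is proposed to improve automatic two-class text classification. Bibliographic records from two classes of the Chinese Library Classification, E271 (Chinese Army) and E712.51 (United States Army), are used as the objects of classification. First, because the traditional mutual information feature selection method does not consider negatively correlated features, inter-class concentration, or intra-class dispersion, the improved mutual information feature selection method DNCF_MI is introduced. Second, because DNCF_MI does not distinguish how much different features contribute to a category, domain-independent and domain-related features are introduced, and an improved mutual information feature selection method, DNCF_DI_MI, is proposed. Finally, a kNN classifier is used for classification, and the results are evaluated with macro-averaged F1 and micro-averaged F1. Experimental results show that the macro-averaged and micro-averaged F1 of the proposed method are 24.1% and 28.5% higher, respectively, than those of the traditional mutual information feature selection method, and both are 4.5% higher than those of DNCF_MI, demonstrating that the proposed method is more effective for classifying categories with similar content. |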
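As a rough illustration of the baseline that the abstract starts from, the sketch below scores terms with the textbook document-level mutual information, MI(t, c) = log(P(t, c) / (P(t) P(c))). It is not the DNCF_MI or DNCF_DI_MI method, whose formulas the abstract does not give. The toy tokens, the `smoothing` constant, and the function names are assumptions made for this example only.

```python
import math
from collections import Counter

def mutual_information(docs, labels, smoothing=1e-9):
    """Return {term: {label: MI(term, label)}} from document-level counts."""
    n_docs = len(docs)
    class_sizes = Counter(labels)      # number of documents per class
    term_df = Counter()                # documents containing the term
    term_class_df = Counter()          # documents of a class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[(term, label)] += 1
    scores = {}
    for term, df in term_df.items():
        scores[term] = {}
        for label, size in class_sizes.items():
            a = term_class_df[(term, label)]
            # MI(t, c) = log(P(t, c) / (P(t) * P(c))) = log(a * N / (df * size));
            # the small smoothing constant (an assumption) avoids log(0).
            scores[term][label] = math.log((a * n_docs + smoothing) / (df * size))
    return scores

def select_features(scores, k):
    """Keep the k terms with the highest per-class maximum MI (the usual baseline rule)."""
    ranked = sorted(scores, key=lambda t: max(scores[t].values()), reverse=True)
    return ranked[:k]

# Hypothetical, already-tokenized bibliographic records for the two classes.
docs = [
    ["army", "china", "history"],
    ["army", "china", "reform"],
    ["army", "usa", "history"],
    ["army", "usa", "doctrine"],
]
labels = ["E271", "E271", "E712.51", "E712.51"]

scores = mutual_information(docs, labels)
selected = select_features(scores, 4)
print(scores["army"])    # shared term: close to 0 for both classes
print(scores["china"])   # positive for E271, strongly negative for E712.51
print(sorted(selected))
```

In this toy corpus the shared term "army" scores close to zero for both classes, which is the situation the abstract describes for content-similar categories. Ranking by the per-class maximum also discards the strongly negative score of "china" for E712.51, the kind of negative correlation the abstract says the traditional method does not consider.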
English abstract: |
Two categories with similar content share a large number of features and are therefore difficult to distinguish automatically. To improve automatic two-class text classification in this setting, an improved mutual information feature selection method is proposed. Bibliographic records from classes E271 (Chinese Army) and E712.51 (United States Army) of the CLC are used as the objects of classification. First, because the traditional mutual information feature selection method does not consider negatively correlated features, the DNCF_MI feature selection method, which overcomes this weakness, is introduced. Second, DNCF_MI does not distinguish how much different features contribute to a category: features that appear in both categories contribute differently from features that appear in only one of them. This paper therefore introduces domain-independent and domain-related features and proposes an improved feature selection method, DNCF_DI_MI. Finally, a kNN classifier is used for classification, and the Macro-F1 and Micro-F1 values are used to evaluate the results. The experimental results show that the Macro-F1 and Micro-F1 values of the proposed method are 24.1% and 28.5% higher, respectively, than those of traditional mutual information, and 4.5% higher than those of DNCF_MI, which indicates that the method is effective. |
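The two evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration of the standard definitions of macro-averaged and micro-averaged F1, not the authors' evaluation code; the label sequences are made up for the example and the function names are assumptions.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    class_labels = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for label in class_labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        per_class_f1.append(f1_from_counts(tp, fp, fn))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class_f1) / len(per_class_f1)   # every class weighs the same
    micro = f1_from_counts(tp_sum, fp_sum, fn_sum)  # every document weighs the same
    return macro, micro

# Hypothetical gold labels and classifier output for six records.
y_true = ["E271", "E271", "E271", "E271", "E712.51", "E712.51"]
y_pred = ["E271", "E271", "E271", "E712.51", "E712.51", "E271"]

macro_f1, micro_f1 = macro_micro_f1(y_true, y_pred)
print(round(macro_f1, 3), round(micro_f1, 3))  # 0.625 0.667
```

Macro-averaging gives the smaller class the same weight as the larger one, while micro-averaging pools the counts over all records, so the two values diverge when the classes are imbalanced or unevenly classified.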