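Huang Jiani, Yu Fengchang. Automatic Recognition of Table-related Text in Literature Based on Table Retrieval and Machine Learning Two-stage Method[J]. Digital Library Forum (数字图书馆论坛), 2022, (11): 34-42 |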
基于表格检索和机器学习二阶段的文献表格相关文本自动识别 |
Automatic Recognition of Table-related Text in Literature Based on Table Retrieval and Machine Learning Two-stage Method |
Submitted: 2022-11-06 |
DOI:10.3772/j.issn.1673-2286.2022.11.009 |
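Chinese keywords: 文献表格 (scientific table); 表格理解 (table understanding); 机器学习 (machine learning) |
English keywords: Scientific Table; Table Understanding; Machine Learning |
Funding: This research was supported by the 2021 Hubei Province Postdoctoral Innovation Research Post Program project "Open-Domain Unformatted Document Understanding Based on Transfer Learning" (No. 211000090). |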
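Author | Affiliation |
Huang Jiani (黄佳妮) | School of Information Management, Wuhan University |
Yu Fengchang (于丰畅) | School of Information Management, Wuhan University |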
Abstract views: 1031 |
Full-text downloads: 711 |
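Chinese abstract (translated): |
Tables in academic literature present the core knowledge of a paper in a structured, highly condensed form. Mainstream literature search engines have gradually begun to use table content as a supplement to textual abstracts, helping researchers quickly grasp the core of a study and work more efficiently. However, when a table is shown without its related information (text that describes or explains the table), readers often struggle to fully understand its content, which hinders further gains in reading efficiency. This paper proposes a two-stage method for recognizing table-related text, based on table retrieval and machine learning. Stage 1 uses the table content to run a full-text search and obtain potentially related text; Stage 2 builds a machine learning model that judges the relevance between the table and each potentially related text, thereby automatically recognizing the table-related text in a document. Using a dataset of Text Retrieval Conference papers as an example, we verify the effectiveness of the proposed method and show that it can quickly extract text related to figures and tables in the literature. The method offers a reference for existing research on extractive summarization of figures and tables in papers and has practical significance for improving the efficiency of researchers' literature surveys. |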
English abstract: |
Tables in academic literature concisely represent a paper's core knowledge in structured form. Many academic search engines have integrated tables into their retrieval results, which helps researchers quickly grasp that core knowledge and improves research efficiency. However, when a table is displayed alone, without related information (text that describes or explains it), readers often cannot fully understand its content, which hinders further improvement in reading efficiency. We propose a two-stage method for recognizing table-related text, based on table retrieval and machine learning. Stage 1 uses the table content to perform a full-text retrieval, and the retrieval results are treated as text potentially related to the table. Stage 2 builds a machine learning model to determine the relevance between the table and each potentially related text, thereby automatically recognizing the related text in the literature. We use a dataset from the Text Retrieval Conference as an example to verify the effectiveness of the proposed method. The method can quickly extract text related to tables in the literature, provides a reference for existing research on extractive summarization of scientific tables, and is of practical significance for improving the efficiency of literature research. |
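The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the retrieval scoring, the features, or the model, so everything below is an assumption made for illustration: term overlap stands in for full-text retrieval, a tiny logistic regression stands in for the machine learning model, and all function names, features, and the toy training data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_candidates(table_cells, sentences, top_k=3):
    """Stage 1 (assumed scoring): rank sentences by term overlap with the table."""
    query = Counter(tok for cell in table_cells for tok in tokenize(cell))
    scored = []
    for idx, sentence in enumerate(sentences):
        counts = Counter(tokenize(sentence))
        score = sum(min(n, counts[tok]) for tok, n in query.items())
        if score > 0:
            scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


def features(table_cells, caption, sentence):
    """Hypothetical features: bias, overlap ratio, explicit "Table N" mention."""
    table_toks = {tok for cell in table_cells for tok in tokenize(cell)}
    sent_toks = set(tokenize(sentence))
    overlap = len(table_toks & sent_toks) / max(len(sent_toks), 1)
    m = re.match(r"\s*table\s+(\d+)", caption.lower())
    mention = bool(m and re.search(rf"\btable\s+{m.group(1)}\b", sentence.lower()))
    return [1.0, overlap, 1.0 if mention else 0.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Stage 2 (assumed model): logistic regression fitted by plain SGD."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w


def is_related(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x))) >= 0.5


# Toy training vectors [bias, overlap, mention] with made-up labels.
X_train = [[1.0, 0.5, 1.0], [1.0, 0.3, 1.0], [1.0, 0.4, 0.0], [1.0, 0.1, 0.0]]
y_train = [1, 1, 0, 0]
weights = train_logreg(X_train, y_train)

# Toy document: one table and four body sentences.
table_cells = ["Run", "MAP", "P@10", "baseline", "0.21", "0.35"]
caption = "Table 2: Retrieval results"
sentences = [
    "We describe the track in this section.",
    "Table 2 reports MAP and P@10 for the baseline run.",
    "The baseline uses a standard index.",
    "Future work includes larger collections.",
]

candidates = retrieve_candidates(table_cells, sentences)
related = [
    sentences[i]
    for i in candidates
    if is_related(weights, features(table_cells, caption, sentences[i]))
]
print(candidates)
print(related)
```

Splitting the work this way keeps the classifier cheap: Stage 1 discards sentences that share no terms with the table, so Stage 2 only scores a short candidate list. A real system would likely replace both stand-ins with a proper retrieval model and a classifier trained on labeled table-text pairs.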