Data Analysis and Knowledge Discovery (数据分析与知识发现)  2019, Vol. 3, Issue 5: 51-56     https://doi.org/10.11925/infotech.2096-3467.2018.1380
Research Paper
Identifying Coordinate Text Blocks in Discourses
(Original title: 篇章级并列关系文本块识别方法研究)
Jingjing Pei (裴晶晶), Xiaoqiu Le (乐小虬)
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper identifies text blocks in scientific papers that are distributed across different paragraphs yet stand in a coordinate relationship, both semantically and in page layout. It captures the textual features of coordinate relationships and provides a pre-trained model for recognizing coordinate knowledge objects. [Methods] Taking the paragraph as the processing unit, we augment character and word vectors with visual layout features to build a multi-dimensional representation of coordinate text at different levels. A convolutional neural network (CNN) is then trained as a text classifier on the annotated data, yielding a recognition model for coordinate text blocks. [Results] In experiments on a manually annotated dataset of scientific papers, the model classified coordinate text blocks with a precision of 96%, about 3% higher than the baseline model, and its recall was about 2% higher. [Limitations] The method applies only to HTML web page text; other data formats require further research and experiments. [Conclusions] With the paragraph as the processing unit and multiple features combined, a CNN can efficiently identify discourse-level coordinate text blocks and can serve as a pre-trained model for recognizing coordinate knowledge objects.

Key words: Coordinate Relationship; Text Representation; Text Block; Deep Learning
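
The representation and classification steps summarized in the Methods part of the abstract can be illustrated with a small sketch. It is not the authors' implementation: the embedding sizes, the three layout features (`indent_level`, `is_bold`, `starts_with_marker`), the filter widths, and every function name are assumptions made for illustration, and all weights are random rather than trained. The sketch only shows the general shape of the idea: one row per token that concatenates a word vector, a character-level vector, and paragraph-level layout features, followed by convolution filters with max-over-time pooling and a sigmoid output.

```python
import math
import random

WORD_DIM, CHAR_DIM = 4, 3  # toy embedding sizes (assumed)
# Hypothetical layout features; the abstract does not list the actual ones.
LAYOUT_KEYS = ("indent_level", "is_bold", "starts_with_marker")
ROW_DIM = WORD_DIM + CHAR_DIM + len(LAYOUT_KEYS)


def toy_embedding(key, dim):
    """Return a reproducible random vector standing in for a trained embedding."""
    rng = random.Random(key)  # seeding with the key makes lookups repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def token_row(token, layout):
    """Concatenate word vector, averaged character vectors, and layout features."""
    word_vec = toy_embedding("w:" + token, WORD_DIM)
    char_vecs = [toy_embedding("c:" + ch, CHAR_DIM) for ch in token]  # token must be non-empty
    char_vec = [sum(col) / len(char_vecs) for col in zip(*char_vecs)]
    layout_vec = [float(layout[k]) for k in LAYOUT_KEYS]
    return word_vec + char_vec + layout_vec


def paragraph_matrix(tokens, layout):
    """Represent one paragraph as a (num_tokens x ROW_DIM) matrix."""
    return [token_row(t, layout) for t in tokens]


def conv_max_pool(matrix, filt, bias=0.0):
    """Slide one filter over consecutive rows, apply ReLU, keep the maximum."""
    width = len(filt)
    # Zero-pad paragraphs that are shorter than the filter width.
    rows = matrix + [[0.0] * len(matrix[0])] * max(0, width - len(matrix))
    activations = []
    for start in range(len(rows) - width + 1):
        total = bias
        for row, weights in zip(rows[start:start + width], filt):
            total += sum(x * w for x, w in zip(row, weights))
        activations.append(max(0.0, total))
    return max(activations)


def classify(matrix, filters, out_weights, out_bias=0.0):
    """Pool each filter's response, then map to a score in [0, 1] with a sigmoid."""
    pooled = [conv_max_pool(matrix, f) for f in filters]
    logit = out_bias + sum(p * w for p, w in zip(pooled, out_weights))
    return 1.0 / (1.0 + math.exp(-logit))


# Untrained random filters of widths 2 and 3 (assumed sizes).
rng = random.Random(0)
filters = [[[rng.uniform(-0.5, 0.5) for _ in range(ROW_DIM)] for _ in range(width)]
           for width in (2, 3)]
out_weights = [rng.uniform(-0.5, 0.5) for _ in filters]

tokens = ["(1)", "data", "collection"]  # a toy paragraph that opens with a list marker
layout = {"indent_level": 1, "is_bold": 0, "starts_with_marker": 1}
matrix = paragraph_matrix(tokens, layout)
score = classify(matrix, filters, out_weights)
print(len(matrix), len(matrix[0]), score)
```

In a real system the toy embeddings would be replaced by trained character and word vectors, and the filters and output weights would be learned from the annotated paragraphs; the score here carries no meaning beyond demonstrating the data flow.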
Received: 2018-12-06      Published: 2019-07-03
Cite this article:
Jingjing Pei, Xiaoqiu Le. Identifying Coordinate Text Blocks in Discourses[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 51-56.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1380      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I5/51
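Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn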