Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (5): 51-56    DOI: 10.11925/infotech.2096-3467.2018.1380
Research Paper
Identifying Coordinate Text Blocks in Discourses
Jingjing Pei, Xiaoqiu Le
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper proposes a method for identifying text blocks in scientific papers that are spread across different paragraphs yet stand in a coordinate relationship, both semantically and in page layout. The method captures the features of coordinate text and provides a pre-trained model for recognizing coordinate knowledge objects. [Methods] Each paragraph is treated as a processing unit. Layout features are appended to character and word vectors, so that coordinate text at different levels is represented by multi-dimensional features. A convolutional neural network (CNN) is then trained as a text classifier on the annotated data, yielding a recognition model for coordinate text blocks. [Results] On a manually annotated dataset of scientific papers, the approach classified coordinate text blocks with a precision of 96%, about 3% higher than the baseline model; recall was about 2% higher. [Limitations] The model applies only to HTML web page text; other data formats require further research and experiments. [Conclusions] Using paragraphs as processing units and combining multiple features, a CNN can effectively identify discourse-level coordinate text blocks, and the resulting model can serve as a pre-trained model for recognizing coordinate knowledge objects.
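The feature combination and CNN classification described in the abstract can be illustrated with a minimal sketch. This page gives no architecture details, so everything concrete here is an assumption made for illustration: the vector sizes, the two layout flags, the filter width, and the randomly initialised (untrained) weights. The function names `concat_features`, `conv1d_max_pool`, and `classify` are hypothetical and do not come from the paper.

```python
import math
import random


def concat_features(word_vecs, char_vecs, layout_vecs):
    """Concatenate word, character, and layout features for each token."""
    assert len(word_vecs) == len(char_vecs) == len(layout_vecs)
    return [w + c + l for w, c, l in zip(word_vecs, char_vecs, layout_vecs)]


def conv1d_max_pool(tokens, filters, width):
    """Slide each filter over windows of `width` tokens, apply ReLU,
    and keep the maximum activation per filter (max-over-time pooling)."""
    pooled = []
    for weights, bias in filters:
        best = 0.0  # ReLU floor; also the result when the paragraph is too short
        for start in range(len(tokens) - width + 1):
            window = [x for tok in tokens[start:start + width] for x in tok]
            act = sum(w * x for w, x in zip(weights, window)) + bias
            best = max(best, act)
        pooled.append(best)
    return pooled


def classify(pooled, out_weights, out_bias):
    """Logistic output: probability that the paragraph is a coordinate text block."""
    z = sum(w * p for w, p in zip(out_weights, pooled)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy paragraph of 6 tokens with made-up feature sizes (assumptions).
rng = random.Random(0)
n_tokens, d_word, d_char, d_layout = 6, 4, 3, 2
word_vecs = [[rng.uniform(-1, 1) for _ in range(d_word)] for _ in range(n_tokens)]
char_vecs = [[rng.uniform(-1, 1) for _ in range(d_char)] for _ in range(n_tokens)]
# Hypothetical layout flags, e.g. [is_list_item, is_heading], taken from HTML markup.
layout_vecs = [[1.0, 0.0] for _ in range(n_tokens)]

tokens = concat_features(word_vecs, char_vecs, layout_vecs)

# Untrained, randomly initialised filters and output layer (assumptions).
width, n_filters = 3, 5
dim = d_word + d_char + d_layout
filters = [([rng.uniform(-0.5, 0.5) for _ in range(width * dim)], 0.0)
           for _ in range(n_filters)]
out_weights = [rng.uniform(-0.5, 0.5) for _ in range(n_filters)]

prob = classify(conv1d_max_pool(tokens, filters, width), out_weights, 0.0)
print(f"P(coordinate text block) = {prob:.3f}")
```

Because the weights are random, the printed probability carries no meaning; the sketch only shows how layout features can sit alongside word and character vectors as extra input dimensions before convolution and pooling. In practice the weights would be learned from the annotated paragraphs.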

Keywords: Coordinate Relationship; Text Representation; Text Block; Deep Learning
Received: 2018-12-06
Cite this article:
Jingjing Pei, Xiaoqiu Le. Identifying Coordinate Text Blocks in Discourses[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 51-56. DOI: 10.11925/infotech.2096-3467.2018.1380.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1380