Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (5): 51-56    DOI: 10.11925/infotech.2096-3467.2018.1380
Identifying Coordinate Text Blocks in Discourses
Jingjing Pei, Xiaoqiu Le
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper proposes a method that uses semantic and layout features to identify coordinate text blocks distributed across different paragraphs. It also provides a pre-trained model for these knowledge objects. [Methods] First, we treated each paragraph as a processing unit and supplemented its character and word vectors with layout features. Second, we concatenated these multi-dimensional features to represent each paragraph. Third, we trained a convolutional neural network (CNN) on the annotated data to obtain a recognition model for text blocks in a coordinate relationship. [Results] On manually annotated scientific papers, the proposed approach achieved a precision of 96%, which was 3% higher than that of the baseline model. Recall also improved by 2%. [Limitations] Our model works only with HTML files. More research is needed to examine it with other data formats. [Conclusions] The proposed method effectively identifies coordinate text blocks in discourses and can serve as a pre-trained model for coordinate knowledge objects.
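
The pipeline outlined in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state vector sizes, filter counts, or which layout features were used, so every dimension, the three layout features, and all function names below are hypothetical, and the weights are random rather than trained. The sketch only shows the general idea of concatenating word, character, and layout features per token and passing the result through one convolution layer with max-over-time pooling.

```python
import math
import random


def concat_features(word_vecs, char_vecs, layout_vec):
    """One row per token: word vector + character vector + paragraph layout features."""
    return [w + c + layout_vec for w, c in zip(word_vecs, char_vecs)]


def conv1d_max_pool(rows, filters, width):
    """Slide each filter over `width` consecutive rows, apply ReLU, keep the maximum."""
    pooled = []
    for f in filters:
        best = 0.0  # ReLU floor: activations below zero never win
        for start in range(len(rows) - width + 1):
            window = [x for row in rows[start:start + width] for x in row]
            best = max(best, sum(a * b for a, b in zip(window, f)))
        pooled.append(best)
    return pooled


def predict(pooled, weights, bias):
    """Logistic output: a score in (0, 1) for 'this paragraph is a coordinate block'."""
    z = sum(p * w for p, w in zip(pooled, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
n_tokens, word_dim, char_dim = 6, 4, 3  # assumed sizes, for illustration only

# Stand-ins for pre-trained embeddings of one paragraph's tokens.
word_vecs = [[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(n_tokens)]
char_vecs = [[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(n_tokens)]

# Hypothetical layout features taken from the HTML: is_list_item, is_heading, nesting_depth.
layout_vec = [1.0, 0.0, 2.0]

rows = concat_features(word_vecs, char_vecs, layout_vec)

width, n_filters = 3, 5
dim = word_dim + char_dim + len(layout_vec)
filters = [[rng.uniform(-0.5, 0.5) for _ in range(width * dim)] for _ in range(n_filters)]
out_weights = [rng.uniform(-0.5, 0.5) for _ in range(n_filters)]

pooled = conv1d_max_pool(rows, filters, width)
score = predict(pooled, out_weights, 0.0)
print(len(rows), len(rows[0]), len(pooled))
```

In this sketch the layout features are repeated on every token row so that each convolution window sees them alongside the semantic vectors; attaching them once per paragraph after pooling would be an equally plausible design.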

Key words: Coordinate Relationship, Text Representation, Text Block, Deep Learning
Received: 06 December 2018      Published: 03 July 2019

Cite this article:

Jingjing Pei, Xiaoqiu Le. Identifying Coordinate Text Blocks in Discourses. Data Analysis and Knowledge Discovery, 2019, 3(5): 51-56.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1380     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I5/51
