Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (5): 51-56    DOI: 10.11925/infotech.2096-3467.2018.1380
Identifying Coordinate Text Blocks in Discourses
Jingjing Pei, Xiaoqiu Le
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper proposes a method that uses semantic and layout features to identify coordinate text blocks distributed across different paragraphs, and provides a pre-trained model for these knowledge objects. [Methods] First, we treated each paragraph as a processing unit and added layout features to its character and word vectors. Second, we concatenated these multi-dimensional features to represent each paragraph. Third, we trained a convolutional neural network (CNN) on the annotated data to obtain a recognition model for text blocks in a coordinate relationship. [Results] On manually annotated scientific papers, the proposed approach achieved a precision of 96%, which was 3% higher than that of the baseline model; recall also improved by 2%. [Limitations] The model currently works only with HTML files. More research is needed to examine it with other data formats. [Conclusions] The proposed method effectively identifies coordinate text blocks in discourses and can serve as a pre-trained model for coordinate knowledge objects.
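The pipeline summarized under [Methods] can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no dimensions, layout feature definitions, or network details, so every name and size below is a hypothetical choice made for illustration. The sketch assumes one character-derived vector per token, a small paragraph-level layout vector (for example heading level, list-item flag, indentation depth), and a single convolution layer with max-over-time pooling written in plain NumPy.

```python
import numpy as np


def paragraph_features(word_vecs, char_vecs, layout_vec):
    """Concatenate per-token word and character vectors with layout features.

    word_vecs:  (n_tokens, word_dim)
    char_vecs:  (n_tokens, char_dim), assumed here to be one vector per token
    layout_vec: (layout_dim,), paragraph-level features repeated for each token
    """
    n_tokens = word_vecs.shape[0]
    layout = np.tile(layout_vec, (n_tokens, 1))
    return np.concatenate([word_vecs, char_vecs, layout], axis=1)


def conv_max_pool(x, filters):
    """Apply 1D convolution over tokens, ReLU, then max-over-time pooling.

    x:       (n_tokens, dim)
    filters: (n_filters, width, dim)
    returns: (n_filters,)
    """
    n_filters, width, dim = filters.shape
    n_windows = x.shape[0] - width + 1
    if n_windows < 1:
        raise ValueError("paragraph is shorter than the filter width")
    # Stack every sliding window of `width` consecutive tokens.
    windows = np.stack([x[i:i + width] for i in range(n_windows)])
    # Dot product of each window with each filter: (n_windows, n_filters).
    feature_maps = np.einsum("nwd,fwd->nf", windows, filters)
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU
    return feature_maps.max(axis=0)


def coordinate_probability(pooled, weights, bias):
    """Logistic output layer: probability that the paragraph is a coordinate block."""
    logit = float(pooled @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))


# Hypothetical sizes and random (untrained) parameters, for shape checking only.
rng = np.random.default_rng(0)
words = rng.normal(size=(6, 8))        # 6 tokens, 8-dim word vectors
chars = rng.normal(size=(6, 4))        # 6 tokens, 4-dim character vectors
layout = np.array([1.0, 0.0, 2.0])     # e.g. heading level, list flag, indent
features = paragraph_features(words, chars, layout)   # shape (6, 15)
pooled = conv_max_pool(features, rng.normal(size=(5, 3, 15)))
prob = coordinate_probability(pooled, rng.normal(size=5), 0.0)
```

In a real setting the filters and output weights would be learned from the annotated paragraphs rather than drawn at random, and the layout vector would be extracted from the HTML markup of each paragraph.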

Key words: Coordinate Relationship, Text Representation, Text Block, Deep Learning
Received: 06 December 2018      Published: 03 July 2019

Jingjing Pei, Xiaoqiu Le. Identifying Coordinate Text Blocks in Discourses. Data Analysis and Knowledge Discovery, 2019, 3(5): 51-56.

