New Technology of Library and Information Service  2014, Vol. 30 Issue (4): 71-77    DOI: 10.11925/infotech.1003-3513.2014.04.11
Study of Book Pages Automatic Identification and Bibliographic Information Extraction
Li Xiangdong (1, 2), Huo Yayong (1), Huang Li (3)
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3. Wuhan University Library, Wuhan 430072, China
[Objective] This article studies methods for automatically identifying book Web pages and extracting their thematic information, taking book-related Web pages as the research object. [Methods] Based on an analysis of how different book pages use tags, structure their layout, and represent thematic information, the article builds a model for automatic book page identification and thematic information extraction by defining general rules and applying co-occurrence word analysis and page analysis, among other techniques. [Results] The model identifies book pages from general Web sites at a rate of nearly 80%, and its average extraction rate for thematic information across the various kinds of book pages is nearly 79%. [Limitations] The threshold-setting method takes into account the characteristics of many types of book Web information, but pages with highly unusual features are still misjudged; further refinement of the algorithm may improve the results. [Conclusions] The method achieves good results in automatically identifying various kinds of book pages and extracting their thematic information, and it generalizes well. It also lays a foundation for research on the organization, management, and automatic classification of book Web page information.

Key words: Book pages; Bibliographic information; Automatic identification; Information extraction
Received: 18 December 2013      Published: 19 May 2014
Li Xiangdong, Huo Yayong, Huang Li. Study of Book Pages Automatic Identification and Bibliographic Information Extraction. New Technology of Library and Information Service, 2014, 30(4): 71-77.

