New Technology of Library and Information Service  2014, Vol. 30 Issue (4): 71-77    DOI: 10.11925/infotech.1003-3513.2014.04.11
Study of Book Pages Automatic Identification and Bibliographic Information Extraction
Li Xiangdong1,2, Huo Yayong1, Huang Li3
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3. Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] This article studies methods for automatically identifying book Web pages and extracting their thematic information. [Methods] Based on an analysis of how different book pages use tags, structure their layout, and represent thematic information, the article builds a model for automatic book page identification and thematic information extraction by defining general rules and applying co-occurrence word analysis and page analysis, among other techniques. [Results] The model identifies book pages from general Web sites at a rate of nearly 80%, and its average extraction rate for thematic information across the various kinds of book pages is nearly 79%. [Limitations] The threshold-setting method takes into account the characteristics of various types of book Web information, but pages with highly unusual features are still misjudged; further improvement of the algorithm may yield better results. [Conclusions] The method achieves good results in automatically identifying the various kinds of book pages and extracting their thematic information, and it generalizes well. It also lays a foundation for research on the organization, management, and automatic classification of book Web page information.
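
The abstract does not list the actual rules, co-occurrence words, or threshold used in the model. As a rough illustration of the general idea only, the Python sketch below flags a page as a book page when enough cue words co-occur in its text, then pulls labelled fields out with simple patterns. Everything in it is an assumption made for illustration: the cue word list, the threshold of 0.5, the field patterns, and the function names are hypothetical and are not taken from the article.

```python
import re

# Hypothetical cue words; the article's own co-occurrence word list is not given in the abstract.
BOOK_CUES = ("isbn", "publisher", "author", "price", "pages", "edition")

# Hypothetical patterns for labelled bibliographic fields.
FIELD_PATTERNS = {
    "isbn": re.compile(r"ISBN[:：\s]*([0-9][0-9Xx-]{9,16})", re.IGNORECASE),
    "author": re.compile(r"Author[:：]\s*([^\n|]+)", re.IGNORECASE),
    "publisher": re.compile(r"Publisher[:：]\s*([^\n|]+)", re.IGNORECASE),
}


def strip_tags(html):
    """Crude tag removal: each tag becomes a newline so labelled fields stay on separate lines."""
    return re.sub(r"<[^>]+>", "\n", html)


def cue_score(text):
    """Return the fraction of cue words that occur at least once in the text."""
    lowered = text.lower()
    hits = sum(1 for cue in BOOK_CUES if cue in lowered)
    return hits / len(BOOK_CUES)


def is_book_page(html, threshold=0.5):
    """Flag a page as a book page when enough cue words co-occur (threshold is an assumed value)."""
    return cue_score(strip_tags(html)) >= threshold


def extract_fields(html):
    """Collect the first match of each hypothetical field pattern."""
    text = strip_tags(html)
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


if __name__ == "__main__":
    sample = (
        "<div><h1>Some Title</h1><p>Author: Jane Doe</p>"
        "<p>Publisher: Example Press</p><p>ISBN: 978-7-111-12345-6</p>"
        "<p>Price: 30</p></div>"
    )
    print(is_book_page(sample))
    print(extract_fields(sample))
```

A fixed threshold of this kind is also where the limitation noted in the abstract would show up: a book page with an unusual layout may contain too few cue words and be misjudged.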

Key words: Book pages; Bibliographic information; Automatic identification; Information extraction
Received: 18 December 2013      Published: 19 May 2014
CLC number: TP391

Cite this article:

Li Xiangdong, Huo Yayong, Huang Li. Study of Book Pages Automatic Identification and Bibliographic Information Extraction. New Technology of Library and Information Service, 2014, 30(4): 71-77.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.04.11     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I4/71

