New Technology of Library and Information Service  2011, Vol. 27 Issue (6): 27-31    DOI: 10.11925/infotech.1003-3513.2011.06.05
Automatic Identify Title of Web Text Resource Based on Rules
Liu Jianhua1, Zhang Zhixiong1, Xie Jing1, Zou Yimin1,2
1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  Given the importance of Web resource titles for information retrieval, text clustering, and similar tasks, this paper proposes a rule-based method for identifying titles automatically and quickly. Like much earlier work, the method uses the style information (such as font) and the location information of the text. In addition, it considers the relevance between the title candidates and the text content. Finally, the paper implements a title identification component and reports experiments that show the effectiveness of the method.
Key words: Web text resources; Title identification; Title source; Title feature
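The abstract names three kinds of evidence: style, location, and relevance between a title candidate and the text content. The sketch below shows one way such evidence could be combined into a single rule-based score. It is not the authors' implementation. The `Candidate` fields, the token-overlap relevance measure, and all weights are hypothetical choices made only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    font_size: float  # style feature (assumed available from the parser)
    bold: bool        # style feature
    position: int     # location feature: 0 = first text block in the document


def tokens(text):
    """Lowercased word tokens; a simple stand-in for real tokenization."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(candidate_text, body):
    """Fraction of the candidate's tokens that also occur in the body text."""
    cand = tokens(candidate_text)
    if not cand:
        return 0.0
    return len(cand & tokens(body)) / len(cand)


def score(c, body, max_font):
    # Hypothetical weights; the paper's actual rules are not given in the abstract.
    style = 0.4 * (c.font_size / max_font) + (0.1 if c.bold else 0.0)
    location = 0.2 / (1 + c.position)  # earlier blocks score higher
    return style + location + 0.3 * relevance(c.text, body)


def pick_title(candidates, body):
    """Return the text of the highest-scoring candidate."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, body, max_font)).text
```

With these weights, a large bold line that shares vocabulary with the body can outrank a navigation bar that merely appears first on the page. This is the kind of case where relevance adds information beyond style and location alone.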
Received: 05 May 2011      Published: 15 August 2011
CLC number: G203

Cite this article:

Liu Jianhua, Zhang Zhixiong, Xie Jing, Zou Yimin. Automatic Identify Title of Web Text Resource Based on Rules. New Technology of Library and Information Service, 2011, 27(6): 27-31.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2011.06.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2011/V27/I6/27
