Automatic Identify Title of Web Text Resource Based on Rules
Liu Jianhua1, Zhang Zhixiong1, Xie Jing1, Zou Yimin1,2
1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China
Abstract Titles of Web resources play an important role in information retrieval, text clustering, and related tasks. This paper proposes a rule-based method that identifies titles automatically and quickly from the style information (such as font) and location information of text, both of which are used by many other researchers. In addition, the method considers the relevance between title candidates and the text content. Finally, the paper implements a title identification component and reports experiments that show the effectiveness of the method.
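The abstract names three kinds of evidence: style, location, and relevance to the text content. The Python sketch below is only one way such cues could be combined; it is not the paper's actual rule set. The `Candidate` fields, the `score` and `identify_title` helpers, the token-overlap relevance measure, and the 0.4/0.3/0.3 weights are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str          # one line of text that might be the title
    font_size: float   # style cue (assumed to be available, e.g. in points)
    position: int      # location cue: 0-based line index in the document


def tokens(text):
    """Lowercased alphanumeric tokens; a crude stand-in for real tokenization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(cand, body, max_font, n_lines):
    """Combine the three cues into one number (weights are illustrative only)."""
    style = cand.font_size / max_font           # larger font scores higher
    location = 1.0 - cand.position / n_lines    # earlier lines score higher
    cand_tokens = tokens(cand.text)
    # Relevance: share of the candidate's tokens that also occur in the body.
    if cand_tokens:
        relevance = len(cand_tokens & tokens(body)) / len(cand_tokens)
    else:
        relevance = 0.0
    return 0.4 * style + 0.3 * location + 0.3 * relevance


def identify_title(candidates, body):
    """Return the highest-scoring candidate; expects a non-empty list."""
    max_font = max(c.font_size for c in candidates)
    n_lines = len(candidates)
    return max(candidates, key=lambda c: score(c, body, max_font, n_lines))


if __name__ == "__main__":
    body = "This page describes rule-based identification of a title in Web text."
    candidates = [
        Candidate("Home | News | Contact", 10, 0),
        Candidate("Rule-Based Title Identification for Web Text", 24, 1),
        Candidate(body, 12, 2),
    ]
    print(identify_title(candidates, body).text)
```

In this toy input the navigation line comes first but shares no words with the body, so the relevance cue keeps it from winning on location alone. That is the kind of case the relevance check in the abstract appears intended to handle.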
Received: 05 May 2011
Published: 15 August 2011