Abstract: The task of recognizing and extracting reviews from webpages is recast as recognizing the DOM-SubTrees that recur in the DOM-Tree. Each node in the DOM is traversed, the similarity between DOM-SubTrees is calculated, and the nodes that meet the requirements are selected. The proposed method thus ultimately comes down to calculating the similarity between DOM-SubTrees. To make this calculation suitable for recognizing reviews in webpages, the paper transforms each DOM-SubTree into the paths of its leaf nodes, which take into account both the name and the position of each tag. The authors compare four methods for calculating the similarity between DOM-SubTrees, and also compare the algorithm with other algorithms that recognize reviews in webpages by using the weight of tags in the DOM-Tree. The experiments show that the algorithm achieves higher precision and recall rates and is more effective than the other algorithms.
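The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which similarity measure is used, so Jaccard similarity over leaf-path sets is assumed here, the "position" of a tag is assumed to be its index among its siblings, and the function names (`leaf_paths`, `similarity`, `repeated_groups`), the threshold of 0.6, and the sample page are all hypothetical.

```python
import xml.etree.ElementTree as ET


def leaf_paths(subtree):
    """Return the set of root-to-leaf tag paths of a subtree.

    Each step below the root is written 'tag[i]', where i is the child's
    index among its siblings (an assumed reading of "position of tag").
    The subtree root carries no index, so sibling subtrees stay comparable.
    """
    paths = set()

    def walk(node, prefix):
        children = list(node)
        if not children:          # a leaf: record the full path
            paths.add(prefix)
            return
        for i, child in enumerate(children):
            walk(child, f"{prefix}/{child.tag}[{i}]")

    walk(subtree, subtree.tag)
    return paths


def similarity(a, b):
    """Jaccard similarity of two subtrees' leaf-path sets (assumed measure)."""
    pa, pb = leaf_paths(a), leaf_paths(b)
    return len(pa & pb) / len(pa | pb)


def repeated_groups(root, threshold=0.6):
    """Visit every node and report children whose neighbours look alike.

    Returns a list of (parent tag, sorted child indices) pairs.
    """
    groups = []
    for parent in root.iter():
        kids = list(parent)
        similar = set()
        for i in range(len(kids) - 1):
            if similarity(kids[i], kids[i + 1]) >= threshold:
                similar.update((i, i + 1))
        if similar:
            groups.append((parent.tag, sorted(similar)))
    return groups


# Hypothetical, well-formed page fragment with three review-like items.
PAGE = """
<div>
  <h1>Product</h1>
  <ul>
    <li><span>alice</span><p>Great</p></li>
    <li><span>bob</span><p>Bad</p></li>
    <li><span>carol</span><p>OK</p><a>reply</a></li>
  </ul>
</div>
"""

root = ET.fromstring(PAGE)
items = list(root.find("ul"))
print(similarity(items[0], items[1]))
print(similarity(items[0], items[2]))
print(repeated_groups(root))
```

In this sketch the three `li` subtrees are reported as one repeated group under `ul`, while the `h1` and `ul` siblings share no leaf path and are ignored. Real pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser in place of `xml.etree.ElementTree`.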
朱毅华, 张超群, 曾通, 吴龙凤, 徐玛丽, 王东波, 李晓晖. 基于子树相似度计算的网页评论提取算法研究[J]. 现代图书情报技术, 2013, 29(11): 52-59.
Zhu Yihua, Zhang Chaoqun, Zeng Tong, Wu Longfeng, Xu Mali, Wang Dongbo, Li Xiaohui. The Research of Recognizing the Reviews in Webpages Based on Calculating the Similarity of DOM-SubTrees. New Technology of Library and Information Service, 2013, 29(11): 52-59.