New Technology of Library and Information Service, 2013, Vol. 29, Issue 11: 52-59. DOI: 10.11925/infotech.1003-3513.2013.11.08
Section: Information Analysis and Research
The Research of Recognizing the Reviews in Webpages Based on Calculating the Similarity of DOM-SubTrees
Zhu Yihua, Zhang Chaoqun, Zeng Tong, Wu Longfeng, Xu Mali, Wang Dongbo, Li Xiaohui
College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract: This paper recasts the recognition and automatic extraction of webpage reviews as the problem of identifying repeating subtrees in the DOM tree. It proposes a method based on DOM subtree similarity that traverses the page level by level, from the top node downward, and identifies the review-block subtrees that satisfy predefined conditions. To address the weaknesses of existing DOM tree similarity algorithms when applied to review extraction, the algorithm builds leaf-node paths from both the tag name and the position of each node, and derives the similarity of two DOM subtrees from the similarity matrix of their leaf-node paths. The authors compare four methods for calculating DOM subtree similarity, and also compare the algorithm with a review extraction method based on tag weights, in terms of performance and efficiency. The experiments show that the proposed method achieves high precision and recall and overall outperforms existing webpage review extraction methods.
Key words: DOM-Tree; Sub-tree similarity; Review extraction
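The abstract names three ingredients: leaf-node paths built from tag names and positions, a similarity matrix over those paths, and a top-down, level-by-level search for repeating subtrees. The Python sketch below illustrates how these could fit together. It is not the authors' implementation: the abstract gives no formulas, so the path similarity (share of aligned steps that agree), the matrix aggregation (mean of row and column maxima), the `threshold` and `min_repeat` parameters, and all function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def leaf_paths(root):
    """Return one path per leaf; each step is (tag, index among same-tag siblings)."""
    paths = []

    def walk(node, prefix):
        children = list(node)
        if not children:              # no child elements: this node is a leaf
            paths.append(prefix)
            return
        seen = {}                     # per-tag counter gives the position of each child
        for child in children:
            idx = seen.get(child.tag, 0)
            seen[child.tag] = idx + 1
            walk(child, prefix + ((child.tag, idx),))

    walk(root, ((root.tag, 0),))      # the subtree root always gets position 0
    return paths

def path_similarity(p, q):
    """Assumed measure: share of aligned steps that agree in both tag and position."""
    matches = sum(1 for a, b in zip(p, q) if a == b)
    return matches / max(len(p), len(q))

def subtree_similarity(t1, t2):
    """Build the leaf-path similarity matrix, then average its row and column maxima."""
    p1, p2 = leaf_paths(t1), leaf_paths(t2)
    matrix = [[path_similarity(a, b) for b in p2] for a in p1]
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))

def find_review_blocks(root, threshold=0.8, min_repeat=3):
    """Breadth-first search for the first run of at least min_repeat adjacent,
    mutually similar sibling subtrees (hypothetical acceptance condition)."""
    queue = [root]
    while queue:
        node = queue.pop(0)
        children = list(node)
        run = children[:1]
        for prev, cur in zip(children, children[1:]):
            if subtree_similarity(prev, cur) >= threshold:
                run.append(cur)
            else:
                if len(run) >= min_repeat:
                    return run
                run = [cur]
        if len(run) >= min_repeat:
            return run
        queue.extend(children)        # nothing found here: descend one level
    return []

# Toy page: one article block followed by three structurally identical comments.
page = ET.fromstring(
    "<html><body><div><h1>Title</h1><p>Article</p></div>"
    "<ul>"
    "<li><span>user1</span><p>good</p></li>"
    "<li><span>user2</span><p>bad</p></li>"
    "<li><span>user3</span><p>ok</p></li>"
    "</ul></body></html>"
)
blocks = find_review_blocks(page)
print([b.find("p").text for b in blocks])
```

On this toy page the three `li` subtrees share identical leaf-node paths, so they form the repeating run, while the article `div` and the `ul` share no path steps. Real pages are rarely well-formed XML, so an HTML parser would be needed in place of `xml.etree.ElementTree`, and the threshold would have to be tuned on real data.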
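Received: 2013-07-22
CLC number: TP393
Funding: This work is one of the research outcomes of the Ministry of Education Humanities and Social Sciences Research Youth Fund project "Research on Online Public Opinion Management Mechanisms and Platforms Based on Information Ecology" (No. 10YJC870053) and the Key Project of Philosophy and Social Science Research in Jiangsu Universities "Research on Government Supervision of Agriculture-Related Online Public Opinion" (No. 2011ZDIXM027).
Corresponding author: Wang Dongbo     E-mail: wangdongbo0102@gmail.com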
Cite this article:
Zhu Yihua, Zhang Chaoqun, Zeng Tong, Wu Longfeng, Xu Mali, Wang Dongbo, Li Xiaohui. The Research of Recognizing the Reviews in Webpages Based on Calculating the Similarity of DOM-SubTrees[J]. New Technology of Library and Information Service, 2013, 29(11): 52-59. DOI: 10.11925/infotech.1003-3513.2013.11.08.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.11.08
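Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan West Road, Zhongguancun, Haidian District, Beijing 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn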