New Technology of Library and Information Service  2013, Vol. 29 Issue (11): 52-59    DOI: 10.11925/infotech.1003-3513.2013.11.08
The Research of Recognizing the Reviews in Webpages Based on Calculating the Similarity of DOM-SubTrees
Zhu Yihua, Zhang Chaoqun, Zeng Tong, Wu Longfeng, Xu Mali, Wang Dongbo, Li Xiaohui
College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract  The task of recognizing and extracting reviews from webpages is recast as recognizing the DOM-SubTrees that recur within the DOM-Tree. The method traverses every node in the DOM, calculates the similarity between DOM-SubTrees, and selects the nodes that meet the requirements. The method therefore ultimately rests on calculating the similarity between DOM-SubTrees. To make this calculation suitable for recognizing reviews in webpages, the paper transforms each DOM-SubTree into the paths of its leaf nodes, which take into account the name and the position of each tag. The authors compare four methods of calculating the similarity between DOM-SubTrees, and also compare the algorithm with other algorithms that recognize reviews in webpages by using the weights of tags in the DOM-Tree. The experiments show that the algorithm achieves higher precision and recall and is more effective than the other algorithms.
Key words  DOM-Tree      Sub-tree similarity      Review extraction
Received: 22 July 2013      Published: 29 November 2013
:  TP393  
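
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the thresholds, or how tag position is encoded, so the Jaccard measure, the `threshold` and `min_pairs` parameters, the sibling-index encoding of position, and all function names below are assumptions made for illustration. The sample page is also hypothetical and is kept well-formed so that Python's standard `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

def leaf_paths(subtree):
    """Collect the root-to-leaf paths of a subtree as a set of tuples.

    Each step is (tag name, index among its siblings). The subtree root is
    given index 0 so that sibling subtrees can be compared with each other.
    """
    paths = set()

    def walk(element, path):
        children = list(element)
        if not children:          # leaf element: record the finished path
            paths.add(path)
            return
        for index, child in enumerate(children):
            walk(child, path + ((child.tag, index),))

    walk(subtree, ((subtree.tag, 0),))
    return paths

def subtree_similarity(a, b):
    """Jaccard similarity of two leaf-path sets (an assumed measure)."""
    pa, pb = leaf_paths(a), leaf_paths(b)
    return len(pa & pb) / len(pa | pb)

def find_repeating_regions(root, threshold=0.8, min_pairs=2):
    """Return elements with at least `min_pairs` similar adjacent children."""
    regions = []
    for element in root.iter():   # visit every node in the tree
        children = list(element)
        similar_pairs = sum(
            1 for left, right in zip(children, children[1:])
            if subtree_similarity(left, right) >= threshold
        )
        if similar_pairs >= min_pairs:
            regions.append(element)
    return regions

# Hypothetical page: three structurally identical review items.
PAGE = """
<html><body>
  <div id="header"><h1>Product</h1></div>
  <ul id="reviews">
    <li><span>alice</span><p>Great</p></li>
    <li><span>bob</span><p>Bad</p></li>
    <li><span>carol</span><p>OK</p></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for region in find_repeating_regions(ET.fromstring(PAGE)):
        print(region.tag, region.get("id"))
```

In this sketch only the `ul` element qualifies, because its three `li` children produce identical leaf-path sets, while the header and the list share no paths. Real pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice.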

Cite this article:

Zhu Yihua, Zhang Chaoqun, Zeng Tong, Wu Longfeng, Xu Mali, Wang Dongbo, Li Xiaohui. The Research of Recognizing the Reviews in Webpages Based on Calculating the Similarity of DOM-SubTrees. New Technology of Library and Information Service, 2013, 29(11): 52-59.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.11.08     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I11/52
