Please wait a minute...
Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 120-128    DOI: 10.11925/infotech.2096-3467.2018.0655
Current Issue | Archive | Adv Search |
Extraction of Key Information in Web News Based on Improved Hidden Markov Model
Zhiqiang Liu1(),Yuncheng Du2,Shuicai Shi2
1School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
2TRS Information Technology Co., Ltd., Beijing 100101, China
Download: PDF(990 KB)   HTML ( 3
Export: BibTeX | EndNote (RIS)      

[Objective] This paper aims to solve key information extraction problems in news web pages, such as title, date, source, and text, by Hidden Markov Model (HMM). [Methods] The web document was transformed into a DOM tree and preprocessed. The information items to be extracted were mapped to state, and the observation value of the extracted items was mapped to vocabulary. The application of HMM in key information extraction of web news was studied, and the algorithm was improved. [Results] Using the improved HMM algorithm, the accuracy rate can reach 97% on average in the websites. [Limitations] The extraction model is slightly insufficient in classification ability, and it is impossible to accurately extract the slightly differences. [Conclusions] The experiment proves that this method has the advantages of high recognition accuracy, strong modeling ability, and fast training speed with small set of tracing data.

Key wordsInformation Extraction      Hidden Markov Model      Machine Learning      DOM Tree     
Received: 20 June 2018      Published: 17 April 2019

Cite this article:

Zhiqiang Liu,Yuncheng Du,Shuicai Shi. Extraction of Key Information in Web News Based on Improved Hidden Markov Model. Data Analysis and Knowledge Discovery, 2019, 3(3): 120-128.

URL:     OR

[1] 万国, 张桂平, 白宇, 等. 基于特征加权的新闻主题句抽取[J]. 中文信息学报, 2017, 31(5): 120-126.
[1] (Wan Guo, Zhang Guiping, Bai Yu, et al.News Topic Sentence Extraction via Weighted Features[J]. Journal of Chinese Information Processing, 2017, 31(5): 120-126.)
[2] 姬鑫, 钟诚. 基于分块的新闻网页信息抽取算法[J]. 计算机应用与软件, 2015, 32(4): 317-322.
[2] (Ji Xin, Zhong Cheng.Blocking-Based Information Extraction Algorithm for Webpage of News[J]. Computer Applications and Software, 2015, 32(4): 317-322.)
[3] 孟川, 武小年. 基于文本特征值的正文抽取方法[J]. 桂林电子科技大学学报, 2017, 37(2): 106-110.
[3] (Meng Chuan, Wu Xiaonian.Web Content Extraction Method Based on Text Feature Value[J]. Journal of Guilin University of Electronic Technology, 2017, 37(2): 106-110.)
[4] Rabiner L, Juang B.An Introduction to Hidden Markov Models[J]. IEEE ASSP Magazine, 1986, 3(1): 4-16.
[5] Jundt O, Keulen M V.Sample-based XPath Ranking for Web Information Extraction[J]. Advances in Intelligent Systems Research, 2013, 32: 187-194.
[6] Gogar T, Hubacek O, Sedivy J.Deep Neural Networks for Web Page Information Extraction[C]// Proceedings of the 2016 IFIP International Conference on Artificial Intelligence Applications and Innovations. 2016: 154-163.
[7] 王海艳, 曹攀. 基于节点属性与正文内容的海量Web信息抽取方法[J]. 通信学报, 2016,37(10): 9-17.
[7] (Wang Haiyan, Cao Pan.Information Extraction from Massive Web Pages Based on Node Property and Text Content[J]. Journal on Communications, 2016,37(10): 9-17.
[8] 马晓慧, 李泓莹. 一种DOM 树标签路径和行块密度结合的 Web 信息抽取方法[J]. 智能计算机与应用, 2017, 7(4): 13-16, 20.
[8] (Ma Xiaohui, Li Hongying.Web Information Extraction Based on Label Path of DOM Tree and Block Density[J]. Intelligent Computer & Applications, 2017, 7(4): 13-16, 20.)
[9] 向菁菁, 耿光刚, 李晓东. 一种新闻网页关键信息的提取算法[J]. 计算机应用, 2016, 36(8): 2082-2086, 2120.
[9] (Xiang Jingjing, Geng Guanggang, Li Xiaodong.Key Information Extraction Algorithm of News Web Pages[J]. Journal of Computer Applications, 2016, 36(8): 2082-2086, 2120.)
[10] 孙璐, 陈军华, 廉德胜. 一种基于视觉特征的Deep Web信息抽取方法[J]. 计算机与数字工程, 2016, 44(6): 1107-1111.
[10] (Sun Lu, Chen Junhua, Lian Desheng.Deep Web Information Extraction Method Based on Visual Features[J]. Computer & Digital Engineering, 2016, 44(6): 1107-1111.)
[11] 李航. 统计学习方法[M]. 北京: 清华大学出版社, 2012: 170-189.
[11] (Li Hang.Statistical Learning Method[M]. Beijing: Tsinghua University Press, 2012: 170-189.)
[12] 杜秋霞, 王洪国, 邵增珍, 等. 基于混合HMM的文献元数据地名抽取方法研究[J]. 计算机与数字工程, 2017, 45(1): 101-106.
[12] (Du Qiuxia, Wang Hongguo, Shao Zengzhen, et al.Place Names Extraction Method of Literature Metadata Based on Hybrid HMM[J]. Computer and Digital Engineering, 2017, 45(1): 101-106.)
[13] 祝伟华, 卢熠, 刘斌斌. 基于HMM的Web信息抽取算法的研究与应用[J]. 计算机科学, 2010, 37(2): 203-206.
[13] (Zhu Weihua, Lu Yi, Liu Binbin.Improvement of Web Information Extraction Algorithm Based on HMM[J]. Computer Science, 2010, 37(2): 203-206.)
[14] 潘心宇, 陈长福, 刘蓉, 等. 基于网页DOM树节点路径相似度的正文抽取[J]. 微型机与应用, 2016, 35(19): 74-77.
[14] (Pan Xinyu, Chen Changfu, Liu Rong, et al.Content Extraction Based on the Similarity of the Web Pages’ DOM Tree Nodes Path[J]. Microcomputer and Its Applications, 2016, 35(19): 74-77.)
[15] Field D A.Laplacian Smoothing and Delaunay Triangulations[J]. Communications in Applied Numerical Methods, 1988, 4: 709-712.
[16] 任丽芳. 教育新闻网页信息抽取系统的设计与实现[D]. 广州: 华南理工大学, 2012.
[16] (Ren Lifang.Design and Implementation of Educational News Web Page Information Extraction System[D]. Guangzhou: South China University of Technology, 2012.)
[17] 刘浩. 基于主题和类别的网络新闻采集系统设计与实现[D]. 济南: 山东师范大学, 2017.
[17] (Liu Hao.The Design and Implementation of NetWork News Gathering System Based on Topics and Categories[D]. Jinan: Shandong Normal University, 2017.)
[18] 吴共庆, 胡骏, 李莉, 等. 基于标签路径特征融合的在线Web新闻内容抽取[J]. 软件学报, 2016, 27(3): 714-735.
[18] (Wu Gongqing, Hu Jun, Li Li, et al.Online Web News Extraction via Tag Path Feature Fusion[J]. Journal of Software, 2016, 27(3): 714-735.)
[19] 双哲, 孙蕾. 基于改进的隐马尔可夫模型在网页信息抽取中的研究与应用[J]. 计算机应用与软件, 2017, 34(2): 42-47.
[19] (Shuang Zhe, Sun Lei.Research and Application for Web Information Extraction Based on Improved Hidden Markov Model[J]. Computer Applications and Software, 2017, 34(2): 42-47.)
[1] Jiahui Hu,An Fang,Wanqing Zhao,Chenliu Yang,Huiling Ren. Annotating Chinese E-Medical Record for Knowledge Discovery[J]. 数据分析与知识发现, 2019, 3(7): 123-132.
[2] Jinzhu Zhang,Yiming Hu. Extracting Titles from Scientific References in Patents with Fusion of Representation Learning and Machine Learning[J]. 数据分析与知识发现, 2019, 3(5): 68-76.
[3] Hongxia Xu,Chunwang Li. Review of Knowledge Extraction of Scientific Literature[J]. 数据分析与知识发现, 2019, 3(3): 14-24.
[4] Zixuan Zhang,Hao Wang,Liping Zhu,Sanhong eng. Identifying Risks of HS Codes by China Customs[J]. 数据分析与知识发现, 2019, 3(1): 72-84.
[5] Lina Liu,Jiayin Qi,Zhenping Zhang,Dan Zeng. Analyzing Impacts of Brand Reputation on Online Sales Based on Massive Commodity Reviews and Brand[J]. 数据分析与知识发现, 2018, 2(9): 10-21.
[6] Dongmei Mu,Shan Jin,Yuanhong Ju. Finding Association Between Diseases and Genes from Literature Abstracts[J]. 数据分析与知识发现, 2018, 2(8): 98-106.
[7] Longjia Jia,Bangzuo Zhang. Classifying Topics of Internet Public Opinion from College Students: Case Study of Sina Weibo[J]. 数据分析与知识发现, 2018, 2(7): 55-62.
[8] Wei Lu,Mengqi Luo,Heng Ding,Xin Li. Image Annotation Tags by Deep Learning and Real Users: A Comparative Study[J]. 数据分析与知识发现, 2018, 2(5): 1-10.
[9] Li Wang,Lixue Zou,Xiwen Liu. Visualizing Document Correlation Based on LDA Model[J]. 数据分析与知识发现, 2018, 2(3): 98-106.
[10] Xinyue Fan,Lei Cui. Predicting Antineoplastic Drug Targets Based on Network Properties[J]. 数据分析与知识发现, 2018, 2(12): 98-108.
[11] Yang Zhao,Xini Yuan,Yawen Chen,Liqiang Wu. Predicting Conversion Rate of APP Advertising with Machine Learning[J]. 数据分析与知识发现, 2018, 2(11): 2-9.
[12] Xin Wang,Wen’gang Feng. Review of Techniques Detecting Online Extremism and Radicalization[J]. 数据分析与知识发现, 2018, 2(10): 2-8.
[13] Zhongyi Hu,Chaoqun Wang,Jiang Wu. Identifying Phishing Websites with Multiple Online Data Sources[J]. 数据分析与知识发现, 2017, 1(6): 47-55.
[14] Weimin Lv,Xiaomei Wang,Tao Han. Recommending Scientific Research Collaborators with Link Prediction and Extremely Randomized Trees Algorithm[J]. 数据分析与知识发现, 2017, 1(4): 38-45.
[15] Yue He,Min Xiao,Yue Zhang. Sentiment Analysis of Trending Topics Based on Relevance[J]. 数据分析与知识发现, 2017, 1(3): 46-53.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938