Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 120-128    DOI: 10.11925/infotech.2096-3467.2018.0655
Extraction of Key Information in Web News Based on Improved Hidden Markov Model
Zhiqiang Liu1, Yuncheng Du2, Shuicai Shi2
1School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
2TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract  

[Objective] This paper uses a Hidden Markov Model (HMM) to extract key information from news web pages, such as the title, date, source, and body text. [Methods] Each web document was converted into a DOM tree and preprocessed. The information items to be extracted were mapped to HMM states, and the observed values of those items were mapped to the observation vocabulary. The application of HMM to key information extraction from web news was studied, and the algorithm was improved. [Results] With the improved HMM algorithm, extraction accuracy reached 97% on average across the tested websites. [Limitations] The model's classification ability is somewhat limited, and it cannot accurately distinguish items that differ only slightly. [Conclusions] The experiments show that the method offers high recognition accuracy, strong modeling ability, and fast training with a small set of training data.
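
The state and vocabulary mapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' improved algorithm, which the abstract does not specify. It is a plain first-order HMM, trained by counting with add-one smoothing (an assumption) and decoded with the Viterbi algorithm. The observation vocabulary (heading, date_pattern, short_text, long_text), the function names, and the two training sequences are hypothetical and exist only for illustration.

```python
import math
from collections import defaultdict

# Hidden states: the information items to extract (taken from the abstract).
STATES = ["title", "date", "source", "text"]
# Observation vocabulary: hypothetical coarse features of DOM text nodes.
VOCAB = ["heading", "date_pattern", "short_text", "long_text"]

def train(sequences, alpha=1.0):
    """Estimate log-probabilities by counting, with add-alpha smoothing."""
    start = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        prev = None
        for obs, state in seq:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][obs] += 1
            prev = state

    def normalise(counts, keys):
        total = sum(counts[k] for k in keys) + alpha * len(keys)
        return {k: math.log((counts[k] + alpha) / total) for k in keys}

    log_start = normalise(start, STATES)
    log_trans = {s: normalise(trans[s], STATES) for s in STATES}
    log_emit = {s: normalise(emit[s], VOCAB) for s in STATES}
    return log_start, log_trans, log_emit

def viterbi(observations, log_start, log_trans, log_emit):
    """Return the most probable state sequence for the observations."""
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in STATES}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Invented labelled pages: each item is (observation, state) for one DOM node.
TRAINING = [
    [("heading", "title"), ("date_pattern", "date"), ("short_text", "source"),
     ("long_text", "text"), ("long_text", "text")],
    [("heading", "title"), ("short_text", "source"), ("date_pattern", "date"),
     ("long_text", "text")],
]

model = train(TRAINING)
labels = viterbi(
    ["heading", "date_pattern", "short_text", "long_text", "long_text"], *model
)
print(labels)
```

In a real pipeline the observation sequence would come from the preprocessed DOM tree rather than a hand-written list, and the feature vocabulary would be far richer than the four symbols assumed here.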

Key words: Information Extraction, Hidden Markov Model, Machine Learning, DOM Tree
Received: 20 June 2018      Published: 17 April 2019

Cite this article:

Zhiqiang Liu, Yuncheng Du, Shuicai Shi. Extraction of Key Information in Web News Based on Improved Hidden Markov Model. Data Analysis and Knowledge Discovery, 2019, 3(3): 120-128.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0655     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I3/120
