Extraction of Key Information in Web News Based on Improved Hidden Markov Model
Zhiqiang Liu1, Yuncheng Du2, Shuicai Shi2
1 School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
2 TRS Information Technology Co., Ltd., Beijing 100101, China
[Objective] This paper uses a Hidden Markov Model (HMM) to extract key information from news web pages, namely the title, date, source, and body text. [Methods] Each web document was transformed into a DOM tree and preprocessed. The information items to be extracted were mapped to HMM states, and the observed values of those items were mapped to the model's vocabulary. The application of HMM to key information extraction from web news was studied, and the algorithm was improved. [Results] With the improved HMM algorithm, extraction accuracy reached 97% on average across the websites. [Limitations] The classification ability of the extraction model is somewhat limited, and it cannot accurately separate items that differ only slightly. [Conclusions] The experiments show that the method offers high recognition accuracy, strong modeling ability, and fast training with a small set of training data.
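The mapping described in the Methods (information items as states, observed values as vocabulary) can be illustrated with a minimal sketch of a plain supervised HMM. This is not the improved algorithm of the paper, whose details the abstract does not give. The observation symbols (`h1`, `date_pattern`, `short_span`, `long_p`), the add-one smoothing, and the two tiny training sequences are all assumptions made up for illustration.

```python
import math

# States are the information items to extract (from the abstract).
STATES = ["title", "date", "source", "text"]
# Hypothetical vocabulary: one coarse symbol per DOM node (an assumption).
VOCAB = ["h1", "date_pattern", "short_span", "long_p"]

def train(sequences, alpha=1.0):
    """Estimate log-probabilities by counting labeled (symbol, state) pairs,
    with add-alpha smoothing so unseen events keep non-zero probability."""
    start = {s: alpha for s in STATES}
    trans = {s: {t: alpha for t in STATES} for s in STATES}
    emit = {s: {o: alpha for o in VOCAB} for s in STATES}
    for seq in sequences:
        prev = None
        for obs, state in seq:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][obs] += 1
            prev = state

    def normalize(counts):
        total = sum(counts.values())
        return {k: math.log(v / total) for k, v in counts.items()}

    return (normalize(start),
            {s: normalize(trans[s]) for s in STATES},
            {s: normalize(emit[s]) for s in STATES})

def viterbi(observations, start, trans, emit):
    """Return the most probable state sequence for the observed symbols."""
    # best[i][s]: log-probability of the best path ending in state s at step i
    best = [{s: start[s] + emit[s][observations[0]] for s in STATES}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + trans[p][s])
            scores[s] = best[-1][prev] + trans[prev][s] + emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Two hand-labeled pages (illustrative data, not from the paper).
training = [
    [("h1", "title"), ("date_pattern", "date"), ("short_span", "source"),
     ("long_p", "text"), ("long_p", "text")],
    [("h1", "title"), ("short_span", "source"), ("date_pattern", "date"),
     ("long_p", "text")],
]
model = train(training)
path = viterbi(["h1", "date_pattern", "short_span", "long_p", "long_p"], *model)
print(path)
```

Working in log space avoids numerical underflow on long node sequences, and the smoothing term lets the decoder handle transitions that never appeared in the small training set.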
Zhiqiang Liu, Yuncheng Du, Shuicai Shi. Extraction of Key Information in Web News Based on Improved Hidden Markov Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 120-128.