Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 120-128    DOI: 10.11925/infotech.2096-3467.2018.0655
Extraction of Key Information in Web News Based on Improved Hidden Markov Model
Zhiqiang Liu1, Yuncheng Du2, Shuicai Shi2
1School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
2TRS Information Technology Co., Ltd., Beijing 100101, China
[Objective] This paper uses a Hidden Markov Model (HMM) to extract key information, such as title, date, source, and text, from news web pages. [Methods] Each web document was transformed into a DOM tree and preprocessed. The information items to be extracted were mapped to HMM states, and the observed values of those items were mapped to the vocabulary of observation symbols. The application of HMM to key information extraction from web news was studied, and the algorithm was improved. [Results] With the improved HMM algorithm, extraction accuracy reached 97% on average across the websites tested. [Limitations] The extraction model is somewhat weak in classification ability and cannot accurately distinguish items that differ only slightly. [Conclusions] The experiments show that the method offers high recognition accuracy, strong modeling ability, and fast training with a small set of training data.
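
The state and vocabulary mapping described under [Methods] can be illustrated with a minimal sketch of standard Viterbi decoding. This is not the improved algorithm of the paper, whose details the abstract does not give. The state names follow the abstract, but the observation symbols ("heading", "datetime", "short", "long"), the function name viterbi, the probability floor, and every probability value below are assumptions invented for illustration only.

```python
import math

# States follow the information items named in the abstract.
STATES = ["title", "date", "source", "text"]

# Toy parameters, invented for illustration (not taken from the paper).
START_P = {"title": 0.9, "date": 0.05, "source": 0.03, "text": 0.02}
TRANS_P = {
    "title": {"date": 0.6, "source": 0.3, "text": 0.1},
    "date": {"source": 0.6, "text": 0.4},
    "source": {"date": 0.3, "text": 0.7},
    "text": {"text": 1.0},
}
# Hypothetical vocabulary: coarse features of a preprocessed DOM node.
EMIT_P = {
    "title": {"heading": 0.9, "short": 0.1},
    "date": {"datetime": 0.9, "short": 0.1},
    "source": {"short": 0.8, "long": 0.1, "datetime": 0.1},
    "text": {"long": 0.9, "short": 0.1},
}


def viterbi(observations, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for the observation symbols."""

    def lp(p):
        # Log-probability with a small floor so unseen events stay finite.
        return math.log(max(p, floor))

    # scores[s] = best log-probability of any path ending in state s.
    scores = {
        s: lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(observations[0], 0.0))
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor of s under the transition model.
            prev = max(STATES, key=lambda r: scores[r] + lp(trans_p[r].get(s, 0.0)))
            new_scores[s] = (
                scores[prev]
                + lp(trans_p[prev].get(s, 0.0))
                + lp(emit_p[s].get(obs, 0.0))
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


nodes = ["heading", "datetime", "short", "long", "long"]
print(viterbi(nodes, START_P, TRANS_P, EMIT_P))
# → ['title', 'date', 'source', 'text', 'text']
```

In this sketch each DOM node contributes one observation symbol, and the decoded state sequence labels each node with the information item it most likely carries. In practice the probabilities would be estimated from labeled pages, not set by hand.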

Key words: Information Extraction, Hidden Markov Model, Machine Learning, DOM Tree
Received: 20 June 2018      Published: 17 April 2019

Cite this article:

Zhiqiang Liu, Yuncheng Du, Shuicai Shi. Extraction of Key Information in Web News Based on Improved Hidden Markov Model. Data Analysis and Knowledge Discovery, 2019, 3(3): 120-128.

