Extraction of Key Information in Web News Based on Improved Hidden Markov Model
Zhiqiang Liu1, Yuncheng Du2, Shuicai Shi2
1 School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
2 TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract [Objective] This paper uses a Hidden Markov Model (HMM) to extract key information from news web pages, such as the title, date, source, and body text. [Methods] Each web document was transformed into a DOM tree and preprocessed. The information items to be extracted were mapped to HMM states, and the observed values of those items were mapped to the vocabulary (the observation symbols). The application of HMM to key information extraction from web news was studied, and the algorithm was improved. [Results] With the improved HMM algorithm, the accuracy rate can reach 97% on average across the websites tested. [Limitations] The extraction model is somewhat weak in classification ability and cannot accurately distinguish items that differ only slightly. [Conclusions] The experiments show that the method offers high recognition accuracy, strong modeling ability, and fast training with a small set of training data.
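The state and vocabulary mapping described in the Methods can be illustrated with a textbook HMM decoded by the Viterbi algorithm. This is a minimal sketch, not the improved algorithm of the paper: the observation symbols (`heading`, `date_like`, `short`, `long`), all probability values, and the function name `viterbi` are hypothetical and chosen only for illustration. It assumes the DOM tree has already been flattened into a sequence of text nodes, each reduced to one observation symbol.

```python
import math

# Hidden states: the information items to extract (from the abstract).
STATES = ["title", "date", "source", "text"]

# Hypothetical probabilities; a real model would estimate them from labeled pages.
START_P = {"title": 0.7, "date": 0.1, "source": 0.1, "text": 0.1}
TRANS_P = {
    "title":  {"title": 0.10, "date": 0.60, "source": 0.20, "text": 0.10},
    "date":   {"title": 0.05, "date": 0.10, "source": 0.60, "text": 0.25},
    "source": {"title": 0.05, "date": 0.15, "source": 0.10, "text": 0.70},
    "text":   {"title": 0.05, "date": 0.05, "source": 0.05, "text": 0.85},
}
# Hypothetical vocabulary: coarse features of each DOM text node.
EMIT_P = {
    "title":  {"heading": 0.70, "date_like": 0.05, "short": 0.20, "long": 0.05},
    "date":   {"heading": 0.05, "date_like": 0.85, "short": 0.05, "long": 0.05},
    "source": {"heading": 0.05, "date_like": 0.05, "short": 0.80, "long": 0.10},
    "text":   {"heading": 0.05, "date_like": 0.05, "short": 0.20, "long": 0.70},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs (log-space Viterbi)."""
    # Scores for the first observation.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best predecessor state for s at this step.
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    nodes = ["heading", "date_like", "short", "long", "long"]
    print(viterbi(nodes, STATES, START_P, TRANS_P, EMIT_P))
```

All probabilities in the sketch are non-zero so that the logarithms are defined; with parameters estimated from a small training set, some form of smoothing would be needed to keep unseen symbols from producing zero probabilities.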
Received: 20 June 2018
Published: 17 April 2019