Extraction of Key Information in Web News Based on Improved Hidden Markov Model<sup>*</sup>

doi:10.11925/infotech.2096-3467.2018.0655

Data Analysis and Knowledge Discovery

2019, Vol. 3

Issue (3): 120-128 https://doi.org/10.11925/infotech.2096-3467.2018.0655

Application Paper

Zhiqiang Liu¹, Yuncheng Du², Shuicai Shi²

¹School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
²TRS Information Technology Co., Ltd., Beijing 100101, China


Abstract

[Objective] This paper applies the Hidden Markov Model (HMM) to extract key information from news web pages, such as title, date, source, and body text, and adapts the algorithm to the application scenario to improve extraction quality. [Methods] Each web document is converted into a DOM tree and preprocessed. The information items to be extracted are mapped to HMM states, and the observations associated with those items are mapped to the HMM vocabulary. On this basis, the paper studies the application of HMM to key information extraction from web news and proposes improvements to the algorithm. [Results] With the improved HMM algorithm, average accuracy reaches 97% on websites for which an extraction model has been built. [Limitations] The extraction model is somewhat weak at classification and cannot accurately extract items that differ only subtly. [Conclusions] The method offers high recognition accuracy, strong modeling ability, a small training data requirement, and fast training.

Keywords: Information Extraction, Hidden Markov Model, Machine Learning, DOM Tree
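The mapping described in the abstract (information items as states, observations as vocabulary) can be illustrated with a minimal sketch of a standard supervised HMM with Viterbi decoding. This is not the authors' improved algorithm, whose details are not given on this page. The observation symbols (`h1_short`, `date_pattern`, and so on), the catch-all `other` state, the toy training sequences, and the add-alpha smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# States are the information items to extract. "other" is a hypothetical
# catch-all state added for this sketch.
STATES = ["title", "date", "source", "body", "other"]

# Hypothetical vocabulary: one coarse symbol per preprocessed DOM text node.
VOCAB = ["h1_short", "date_pattern", "source_label", "p_long", "nav_link"]

# Toy labelled pages: each page is a sequence of (observation, state) pairs.
TRAINING = [
    [("nav_link", "other"), ("h1_short", "title"), ("date_pattern", "date"),
     ("source_label", "source"), ("p_long", "body"), ("p_long", "body"),
     ("nav_link", "other")],
    [("h1_short", "title"), ("source_label", "source"),
     ("date_pattern", "date"), ("p_long", "body"), ("nav_link", "other")],
]


def train(sequences, vocab, alpha=1.0):
    """Estimate log-probabilities by counting, with add-alpha smoothing."""
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for obs, state in seq:
            emit[state][obs] += 1
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans[prev][cur] += 1

    def log_dist(counts, support):
        total = sum(counts[k] for k in support) + alpha * len(support)
        return {k: math.log((counts[k] + alpha) / total) for k in support}

    pi = log_dist(start, STATES)
    a = {s: log_dist(trans[s], STATES) for s in STATES}
    b = {s: log_dist(emit[s], vocab) for s in STATES}
    return pi, a, b


def viterbi(observations, pi, a, b):
    """Return the most probable state sequence for the observations."""
    scores = {s: pi[s] + b[s][observations[0]] for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + a[p][cur])
            new_scores[cur] = scores[best_prev] + a[best_prev][cur] + b[cur][obs]
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


pi, a, b = train(TRAINING, VOCAB)
page = ["nav_link", "h1_short", "date_pattern", "source_label", "p_long", "p_long"]
labels = viterbi(page, pi, a, b)
print(list(zip(page, labels)))
```

On this toy data the decoder assigns each node the state whose training examples share its symbol. A real extractor would need richer observation features and far more labelled pages per site.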

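Received: 2018-06-20 Published: 2019-04-17

Funding: *This work is one of the research outcomes of the Ministry of Education Major Social Science Research Project "Big Data-Driven Urban Public Safety Risk Research" (Grant No. 16JZD023).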

Cite this article:

Zhiqiang Liu, Yuncheng Du, Shuicai Shi. Extraction of Key Information in Web News Based on Improved Hidden Markov Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 120-128.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0655 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I3/120

