Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (5): 118-126     https://doi.org/10.11925/infotech.2096-3467.2019.0907
Research Paper
Identifying Scenic Spot Entities Based on Improved Knowledge Transfer*
Zhao Ping1, Sun Lianying2 (corresponding author), Tu Shuai1, Bian Jianling3, Wan Ying1
1Smart City College, Beijing Union University, Beijing 100101, China
2College of Urban Rail Transit and Logistics, Beijing Union University, Beijing 100101, China
3Beijing China-Power Information Technology Co., Ltd., Beijing 100192, China
Abstract

[Objective] To address the difficulty of obtaining labeled data for scenic spot entity recognition. [Methods] We propose an improved knowledge transfer algorithm for scenic spot entity recognition. It expands the training set by evaluating data from the People's Daily dataset at three levels: keyword, sentence, and expandability. [Results] With only a small amount of labeled data, the method's precision is 1.62 percentage points higher than that of the model trained on all labeled data. [Limitations] Few features are considered when assessing sample expandability, which may affect model performance. [Conclusions] The method reduces the heavy reliance of scenic spot entity recognition on the quality of labeled data and provides technical support for automated tourism recommendation.

Key words: Transfer Learning; BERT; Conditional Random Fields; Scenic Spot Entity Recognition
Received: 2019-08-05      Published: 2020-06-15
Chinese Library Classification (ZTFLH): TP393
Fund: *This work is supported by the National Key R&D Program of China project "Research on Multi-Method Integrated Detection Data Fusion and Intelligent Recognition Technology" (2018YFC0807806) and the Ministry of Education Scientific Research Innovation Fund project "Research on Big-Data-Driven Safety and Emergency Decision-Making Models for Urban Rail Transit" (2018A01003).
Corresponding author: Sun Lianying     E-mail: sunlychina@163.com
Cite this article:
Zhao Ping, Sun Lianying, Tu Shuai, Bian Jianling, Wan Ying. Identifying Scenic Spot Entities Based on Improved Knowledge Transfer[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 118-126.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0907      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I5/118
Fig.1  BBC entity recognition model
Fig.2  Algorithm structure
Item            Count
Entities        74,430
Non-entities    457,040
Total           531,470
Table 1  Data distribution
POS tag    Label
r          O
t          O
t          O
v          O
ul         O
n          B-SE
n          I-SE
Table 2  Annotation example
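The labels in Table 2 look like a standard B/I/O tagging scheme. The sketch below assumes that reading: B-SE marks the first token of a scenic spot entity, I-SE a continuation token, and O a token outside any entity. Under that assumption, it shows how a label sequence can be decoded into entity spans. The function name decode_spans is hypothetical and does not come from the paper.

```python
from typing import List, Tuple


def decode_spans(labels: List[str], entity: str = "SE") -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs, end exclusive, for each entity span.

    Assumes B-<entity> opens a span, I-<entity> extends an open span,
    and any other label closes it. A stray I- tag with no open span is ignored.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, label in enumerate(labels):
        if label == f"B-{entity}":
            if start is not None:
                spans.append((start, i))  # close the previous span first
            start = i
        elif label == f"I-{entity}" and start is not None:
            continue  # still inside the open span
        else:
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # span runs to the end of the sentence
    return spans


# Label column of Table 2: five non-entity tokens, then a two-token entity.
table2_labels = ["O", "O", "O", "O", "O", "B-SE", "I-SE"]
print(decode_spans(table2_labels))
```

For the Table 2 sequence this yields a single span covering the last two tokens, the two nouns tagged B-SE and I-SE.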
Fig.3  Data characteristics
Method        P         R         F1
CRF           86.67%    87.84%    87.25%
BiLSTM        93.25%    87.98%    90.53%
BiLSTM+CRF    94.97%    92.10%    93.52%
BBC           96.79%    96.85%    96.74%
Table 3  Layer-by-layer model validation
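The tables on this page report precision (P), recall (R) and F1, but the page does not state the F1 formula. The sketch below assumes F1 is the usual harmonic mean of P and R and recomputes it for the CRF row of Table 3. The helper name f1_score is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# P, R and reported F1 (all in percent) copied from the CRF row of Table 3.
p, r, reported = 86.67, 87.84, 87.25
print(f"recomputed F1 = {f1_score(p, r):.2f}%, reported F1 = {reported:.2f}%")
```

Because the published P and R values are already rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit.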
i       P         R         F1
0.40    84.64%    64.01%    72.89%
0.45    87.26%    69.28%    77.24%
0.50    90.93%    53.85%    67.64%
0.55    93.14%    55.42%    69.49%
0.60    91.41%    55.74%    69.25%
Table 4  Results for different values of i
simsen  P         R         F1
0.40    89.01%    56.07%    68.80%
0.45    91.30%    57.79%    70.78%
0.50    92.05%    58.16%    71.28%
0.55    91.03%    55.99%    69.33%
0.60    90.81%    56.50%    69.66%
Table 5  Results for different values of simsen
SEA     P         R         F1
0.40    87.26%    79.28%    83.07%
0.45    90.93%    83.85%    87.24%
0.50    93.14%    85.42%    89.11%
0.55    91.41%    85.74%    88.48%
0.60    90.81%    83.50%    87.00%
Table 6  Results for different values of SEA
μ       P         R         F1
1/5     93.14%    85.42%    89.11%
1/4     95.06%    82.12%    88.12%
1/3     97.91%    89.15%    93.30%
1/2     98.41%    88.09%    92.97%
Table 7  Results for different values of μ
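Model       μ      P         R         F1
BBC         1      96.79%    96.85%    96.74%
AttTrBBC    1/5    93.14%    85.42%    89.11%
AttTrBBC    1/4    95.06%    82.12%    88.12%
AttTrBBC    1/3    97.91%    89.15%    93.30%
AttTrBBC    1/2    98.41%    88.09%    92.97%
Table 8  Comparison of training with all labeled data and with a small amount of labeled data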
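To make the comparison in Table 8 concrete, the sketch below computes the gap, in percentage points, between each AttTrBBC row and the BBC baseline. It assumes that the rows with μ below 1 all belong to AttTrBBC and that μ is the fraction of labeled data used; both are readings of the table layout and caption rather than statements from the page. The function name gaps is hypothetical.

```python
# Rows of Table 8 as (model, mu, P, R, F1); all metrics are in percent.
baseline = ("BBC", "1", 96.79, 96.85, 96.74)
att_tr_bbc = [
    ("AttTrBBC", "1/5", 93.14, 85.42, 89.11),
    ("AttTrBBC", "1/4", 95.06, 82.12, 88.12),
    ("AttTrBBC", "1/3", 97.91, 89.15, 93.30),
    ("AttTrBBC", "1/2", 98.41, 88.09, 92.97),
]


def gaps(row, base):
    """Differences (row minus baseline) for P, R and F1, in percentage points."""
    return tuple(round(row[k] - base[k], 2) for k in (2, 3, 4))


for row in att_tr_bbc:
    dp, dr, df = gaps(row, baseline)
    print(f"mu={row[1]}: P {dp:+.2f}, R {dr:+.2f}, F1 {df:+.2f}")
```

At μ = 1/2 the precision gap is +1.62 points, which matches the figure quoted in the abstract, while recall and F1 at that setting are below the baseline values in the table.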
Method      P         R         F1
HMM[11]     85.49%    90.14%    87.75%
CRF[10]     83.40%    95.70%    89.10%
CNN[12]     95.03%    92.80%    93.90%
AttTrBBC    98.41%    88.09%    92.97%
Table 9  Comparison with related work
[1] Grishman R, Sundheim B. Message Understanding Conference-6: A Brief History[C]// Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, Denmark. Stroudsburg, PA: ACL, 1996: 466-471.
[2] Hanisch D, Fundel K, Mevissen H T, et al. ProMiner: Rule-based Protein and Gene Entity Recognition[J]. BMC Bioinformatics, 2005, 6(1): S14.
[3] Lample G, Ballesteros M, Subramanian S, et al. Neural Architectures for Named Entity Recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA. Stroudsburg, PA: ACL, 2016: 260-270.
[4] Dong C, Zhang J, Zong C, et al. Character-based LSTM-CRF with Radical-level Features for Chinese Named Entity Recognition[C]// Proceedings of the Natural Language Understanding and Intelligent Applications, Kunming, China. Berlin, Germany: Springer, 2016: 239-250.
[5] Patil N V, Patil A S, Pawar B V. HMM Based Named Entity Recognition for Inflectional Language[C]// Proceedings of the 2017 International Conference on Computer, Communications and Electronics, Jaipur, India. Piscataway, NJ: IEEE, 2017: 565-572.
[6] Xue Zhengshan, Guo Jianyi, Yu Zhengtao, et al. Recognition of HMM-Based Chinese Tourist Attractions[J]. Journal of Kunming University of Science and Technology: Science and Technology, 2009, 34(6): 44-48. (in Chinese)
[7] Guo Jianyi, Xue Zhengshan, Yu Zhengtao, et al. Named Entity Recognition for the Tourism Domain Based on Cascaded Conditional Random Fields[J]. Journal of Chinese Information Processing, 2009, 23(5): 47-52. (in Chinese)
[8] Chiu J P C, Nichols E. Named Entity Recognition with Bidirectional LSTM-CNNs[J]. Transactions of the Association for Computational Linguistics, 2016, 4: 357-370.
[9] Huang Han, Wang Hongyu, Wang Xiaoguang. Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(6): 66-74. (in Chinese)
[10] Greenberg N, Bansal T, Verga P, et al. Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Stroudsburg, PA: ACL, 2018: 2824-2829.
[11] Liu Xiaoan, Peng Tao. Research on Chinese Scenic Spot Named Entity Recognition Based on Convolutional Neural Network[J/OL]. Computer Engineering and Applications. [2019-08-01]. http://kns.cnki.net/kcms/detail/11.2127.TP.20190307.1807.007.html. (in Chinese)
[12] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA. Stroudsburg, PA: ACL, 2019: 4171-4186.
[13] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[14] Sutton C, McCallum A. An Introduction to Conditional Random Fields[J]. Foundations and Trends® in Machine Learning, 2012, 4(4): 267-373.
[15] Peng D L, Wang Y R, Liu C, et al. TL-NER: A Transfer Learning Model for Chinese Named Entity Recognition[J]. Information Systems Frontiers, 2019. https://doi.org/10.1007/s10796-019-09932-y.
[16] Gomaa W H, Fahmy A A. A Survey of Text Similarity Approaches[J]. International Journal of Computer Applications, 2013, 68(13): 13-18.
[17] Zhang W, Yoshida T, Tang X. A Comparative Study of TF*IDF, LSI and Multi-Words for Text Classification[J]. Expert Systems with Applications, 2011, 38(3): 2758-2765.
[18] Yu Shiwen, Duan Huiming, Wu Yunfang. Corpus of Multi-Level Processing for Modern Chinese[DS/OL]. [2019-01-03]. http://dx.doi.org/10.18170/DVN/SEYRX5. (in Chinese)