Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 5: 118-126. https://doi.org/10.11925/infotech.2096-3467.2019.0907
Research Paper
Identifying Scenic Spot Entities Based on Improved Knowledge Transfer
Zhao Ping1, Sun Lianying2, Tu Shuai1, Bian Jianling3, Wan Ying1
1 Smart City College, Beijing Union University, Beijing 100101, China
2 College of Urban Rail Transit and Logistics, Beijing Union University, Beijing 100101, China
3 Beijing China-Power Information Technology Co., Ltd., Beijing 100192, China
Abstract

[Objective] To address the difficulty of obtaining labeled data for scenic spot entity recognition. [Methods] We propose an improved knowledge transfer algorithm for scenic spot entity recognition. It expands the dataset by evaluating data from the People's Daily corpus at three levels: keyword, sentence, and extensibility. [Results] Using only a small amount of labeled data, the method achieves a precision 1.62 percentage points higher than the model trained on all labeled data. [Limitations] Few features are considered when assessing the extensibility of samples, which may affect model performance. [Conclusions] The method reduces the heavy dependence of scenic spot entity recognition on the quality of labeled data and provides technical support for automated tourism recommendation.

Keywords: Transfer Learning; BERT; Conditional Random Fields; Scenic Spot Entity Recognition
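Received: 2019-08-05      Published: 2020-06-15
Chinese Library Classification (CLC): TP393
Funding: This work is one of the outcomes of the National Key R&D Program of China project "Research on Multi-Method Comprehensive Detection Data Fusion and Intelligent Recognition Technology" (No. 2018YFC0807806) and the Ministry of Education Scientific Research Innovation Fund project "Research on Big-Data-Driven Safety and Emergency Decision-Making Models for Urban Rail Transit" (No. 2018A01003).
Corresponding author: Sun Lianying     E-mail: sunlychina@163.com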
Cite this article:
Zhao Ping, Sun Lianying, Tu Shuai, Bian Jianling, Wan Ying. Identifying Scenic Spot Entities Based on Improved Knowledge Transfer[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 118-126.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0907      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I5/118
Fig. 1  The BBC entity recognition model
Fig. 2  Algorithm structure
                Count
Entities        74,430
Non-entities    457,040
Total           531,470
Table 1  Data distribution
POS tag    Label
r          O
t          O
t          O
v          O
ul         O
n          B-SE
n          I-SE
Table 2  Annotation example
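The labels in Table 2 look like a standard BIO scheme. The sketch below assumes that `B-SE` marks the first token of a scenic spot entity, `I-SE` a continuation token, and `O` a non-entity token; the page itself does not define these labels, and the function name `bio_to_spans` is hypothetical. Under that assumption, entity spans can be recovered from a label sequence as follows:

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str], entity: str = "SE") -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for each entity span.

    Assumes a BIO scheme: B-<entity> opens a span, I-<entity> extends it,
    and any other tag closes it. A stray I- tag with no open span is ignored.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for idx, tag in enumerate(tags):
        if tag == f"B-{entity}":
            if start is not None:  # a new B- tag closes the previous span
                spans.append((start, idx))
            start = idx
        elif tag == f"I-{entity}" and start is not None:
            continue  # still inside the open span
        else:
            if start is not None:
                spans.append((start, idx))
            start = None
    if start is not None:  # span runs to the end of the sequence
        spans.append((start, len(tags)))
    return spans


# Label column of Table 2, in row order.
table2_labels = ["O", "O", "O", "O", "O", "B-SE", "I-SE"]
print(bio_to_spans(table2_labels))  # [(5, 7)]
```

For the Table 2 sequence this yields a single two-token entity covering the last two rows, both tagged with POS `n`.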
Fig. 3  Data characteristics
Method        P         R         F1
CRF           86.67%    87.84%    87.25%
BiLSTM        93.25%    87.98%    90.53%
BiLSTM+CRF    94.97%    92.10%    93.52%
BBC           96.79%    96.85%    96.74%
Table 3  Layer-by-layer validation of the model
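The tables report P, R, and F1 without defining them on this page. Assuming they denote precision, recall, and the usual F1 score (the harmonic mean of precision and recall), the relationship between the columns can be sketched as below; `f1_score` is a hypothetical helper name, and the input values are the CRF row of Table 3.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# CRF row of Table 3: P = 86.67%, R = 87.84%.
crf_f1 = f1_score(86.67, 87.84)
print(f"{crf_f1:.2f}%")  # 87.25%
```

The computed value matches the F1 reported for CRF in Table 3, which supports reading the columns this way.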
i       P         R         F1
0.40    84.64%    64.01%    72.89%
0.45    87.26%    69.28%    77.24%
0.50    90.93%    53.85%    67.64%
0.55    93.14%    55.42%    69.49%
0.60    91.41%    55.74%    69.25%
Table 4  Experimental results for different values of i
simsen    P         R         F1
0.40      89.01%    56.07%    68.80%
0.45      91.30%    57.79%    70.78%
0.50      92.05%    58.16%    71.28%
0.55      91.03%    55.99%    69.33%
0.60      90.81%    56.50%    69.66%
Table 5  Experimental results for different values of simsen
SEA     P         R         F1
0.40    87.26%    79.28%    83.07%
0.45    90.93%    83.85%    87.24%
0.50    93.14%    85.42%    89.11%
0.55    91.41%    85.74%    88.48%
0.60    90.81%    83.50%    87.00%
Table 6  Experimental results for different values of SEA
μ      P         R         F1
1/5    93.14%    85.42%    89.11%
1/4    95.06%    82.12%    88.12%
1/3    97.91%    89.15%    93.30%
1/2    98.41%    88.09%    92.97%
Table 7  Experimental results for different values of μ
Model       μ      P         R         F1
BBC         1      96.79%    96.85%    96.74%
AttTrBBC    1/5    93.14%    85.42%    89.11%
AttTrBBC    1/4    95.06%    82.12%    88.12%
AttTrBBC    1/3    97.91%    89.15%    93.30%
AttTrBBC    1/2    98.41%    88.09%    92.97%
Table 8  Comparison of full annotation and limited annotation
Method      P         R         F1
HMM[11]     85.49%    90.14%    87.75%
CRF[10]     83.40%    95.70%    89.10%
CNN[12]     95.03%    92.80%    93.90%
AttTrBBC    98.41%    88.09%    92.97%
Table 9  Comparison with existing work
References
[1] Grishman R, Sundheim B. Message Understanding Conference-6: A Brief History[C]// Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, Denmark. Stroudsburg, PA: ACL, 1996: 466-471.
[2] Hanisch D, Fundel K, Mevissen H T, et al. ProMiner: Rule-based Protein and Gene Entity Recognition[J]. BMC Bioinformatics, 2005,6(1):S14.
[3] Lample G, Ballesteros M, Subramanian S, et al. Neural Architectures for Named Entity Recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA. Stroudsburg, PA: ACL, 2016: 260-270.
[4] Dong C, Zhang J, Zong C, et al. Character-based LSTM-CRF with Radical-level Features for Chinese Named Entity Recognition[C]// Proceedings of the Natural Language Understanding and Intelligent Applications, Kunming, China. Berlin, Germany: Springer, 2016: 239-250.
[5] Patil N V, Patil A S, Pawar B V. HMM Based Named Entity Recognition for Inflectional Language[C]// Proceedings of the 2017 International Conference on Computer, Communications and Electronics, Jaipur, India. Piscataway, NJ: IEEE, 2017: 565-572.
[6] Xue Zhengshan, Guo Jianyi, Yu Zhengtao, et al. Recognition of HMM-Based Chinese Tourist Attractions[J]. Journal of Kunming University of Science and Technology: Science and Technology, 2009,34(6):44-48.
[7] Guo Jianyi, Xue Zhengshan, Yu Zhengtao, et al. Named Entity Recognition for the Tourism Domain Based on Cascaded Conditional Random Fields[J]. Journal of Chinese Information Processing, 2009,23(5):47-52.
[8] Chiu J P C, Nichols E. Named Entity Recognition with Bidirectional LSTM-CNNs[J]. Transactions of the Association for Computational Linguistics, 2016,4:357-370.
[9] Huang Han, Wang Hongyu, Wang Xiaoguang. Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model[J]. Data Analysis and Knowledge Discovery, 2019,3(6):66-74.
[10] Greenberg N, Bansal T, Verga P, et al. Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Stroudsburg, PA: ACL, 2018: 2824-2829.
[11] Liu Xiaoan, Peng Tao. Research on Chinese Scenic Spot Named Entity Recognition Based on Convolutional Neural Network[J/OL]. Computer Engineering and Applications. [2019-08-01]. http://kns.cnki.net/kcms/detail/11.2127.TP.20190307.1807.007.html.
[12] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA. Stroudsburg, PA: ACL, 2019: 4171-4186.
[13] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997,9(8):1735-1780.
[14] Sutton C, McCallum A. An Introduction to Conditional Random Fields[J]. Foundations and Trends® in Machine Learning, 2012,4(4):267-373.
[15] Peng D L, Wang Y R, Liu C, et al. TL-NER: A Transfer Learning Model for Chinese Named Entity Recognition[J]. Information Systems Frontiers, 2019. https://doi.org/10.1007/s10796-019-09932-y.
[16] Gomaa W H, Fahmy A A. A Survey of Text Similarity Approaches[J]. International Journal of Computer Applications, 2013,68(13):13-18.
[17] Zhang W, Yoshida T, Tang X. A Comparative Study of TF*IDF, LSI and Multi-Words for Text Classification[J]. Expert Systems with Applications, 2011,38(3):2758-2765.
[18] Yu Shiwen, Duan Huiming, Wu Yunfang. Corpus of Multi-Level Processing for Modern Chinese[DS/OL]. [2019-01-03]. http://dx.doi.org/10.18170/DVN/SEYRX5.