Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 165-172    DOI: 10.11925/infotech.2096-3467.2019.0640
Reconstructing Tour Routes Based on Travel Notes*
Gao Yuan1, Shi Yuanlei2, Zhang Lei2, Cao Tianyi2, Feng Jun2
1 School of Economics and Management, Northwest University, Xi'an 710127, China
2 School of Information Science and Technology, Northwest University, Xi'an 710127, China
Abstract

[Objective] This study reconstructs tourists' itineraries from a large collection of travel notes and scenic spot information. [Methods] We combine TF-IDF and Word2Vec into a named entity recognition method based on text similarity, which identifies scenic spots in travel notes. We then propose a model based on the Markov property, prior knowledge, and spatial features to reconstruct tourists' itineraries. [Results] The proposed scenic spot recognition method achieved a recall of 90.72%, a precision of 89.65%, and an F value of 0.9018, clearly outperforming the Conditional Random Field method. The reconstructed itineraries were 83.27% similar to the actual ones. [Limitations] The recognition method depends to some extent on the completeness of the scenic spot information database. [Conclusions] The proposed method identifies scenic spots automatically, and the itinerary reconstruction performs well.

Keywords: Named Entity Recognition; Text Similarity; Markov Property; Itinerary Reconstruction
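The abstract names the ingredients of the itinerary model (Markov property, prior knowledge, spatial features) without giving its formulation. The following is only an illustrative sketch under stated assumptions, not the authors' model: a greedy first-order chain in which the next scenic spot depends only on the current one, scored by a prior transition probability multiplied by an exponential distance decay. The spot names `A` to `D`, their coordinates, the `prior` table, the default probability of 0.01, and the `scale_km` parameter are all hypothetical.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def transition_score(current, candidate, coords, prior, scale_km):
    """Assumed score: prior transition probability times a spatial decay term."""
    p = prior.get((current, candidate), 0.01)  # hypothetical default for unseen pairs
    d = haversine_km(coords[current], coords[candidate])
    return p * math.exp(-d / scale_km)

def reconstruct_route(start, spots, coords, prior, scale_km=100.0):
    """Greedy first-order chain: each step looks only at the current spot."""
    route = [start]
    remaining = set(spots) - {start}
    while remaining:
        current = route[-1]
        # sorted() keeps tie-breaking deterministic
        nxt = max(sorted(remaining),
                  key=lambda c: transition_score(current, c, coords, prior, scale_km))
        route.append(nxt)
        remaining.remove(nxt)
    return route

# Hypothetical toy data: four spots, three of them close together.
coords = {"A": (35.0, 102.0), "B": (35.1, 102.1), "C": (36.0, 103.0), "D": (35.2, 102.2)}
prior = {("A", "B"): 0.6, ("B", "D"): 0.5, ("D", "C"): 0.7}
route = reconstruct_route("A", ["A", "B", "C", "D"], coords, prior)
print(route)
```

A greedy choice is the simplest way to illustrate the Markov assumption; a full model might instead search over whole sequences, which this sketch does not attempt.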
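Received: 2019-06-10
CLC number: TP393
Funding: *This work is one of the research outcomes of the Ministry of Education Social Science Planning Fund project "Research on Spatio-Temporal Cognitive Analysis and Evolution Patterns of Cultural Tourism Based on Big Data Mining" (18YJA630025).
Corresponding author: Feng Jun     E-mail: fengjun@nwu.edu.cn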
Cite this article:
Gao Yuan, Shi Yuanlei, Zhang Lei, Cao Tianyi, Feng Jun. Reconstructing Tour Routes Based on Travel Notes[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 165-172. DOI: 10.11925/infotech.2096-3467.2019.0640.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0640
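Fig. 1  Overall architecture of the automated tourist itinerary reconstruction method
Fig. 2  Technical workflow for scenic spot recognition
Fig. 3  Word2Vec model for converting words to vectors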
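The page does not state how TF-IDF and Word2Vec are combined for recognition, so the following is a minimal sketch of one common option, offered as an assumption rather than the authors' procedure: each phrase is represented by the TF-IDF-weighted average of its word vectors, and a candidate is accepted as a scenic spot when its cosine similarity to an entry in the scenic spot list reaches a threshold. The three-dimensional vectors, the weights, the English tokens, and the threshold of 0.8 are hypothetical toy values; in practice the vectors would come from a trained Word2Vec model.

```python
import math

def weighted_vector(tokens, vectors, weights):
    """TF-IDF-weighted average of the word vectors of the known tokens."""
    dim = len(next(iter(vectors.values())))
    total, acc = 0.0, [0.0] * dim
    for tok in tokens:
        if tok in vectors:
            w = weights.get(tok, 1.0)
            total += w
            acc = [a + w * v for a, v in zip(acc, vectors[tok])]
    return [a / total for a in acc] if total else acc

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_spot(tokens, gazetteer, vectors, weights, threshold=0.8):
    """Return (best name, similarity), or (None, similarity) when below threshold."""
    query = weighted_vector(tokens, vectors, weights)
    best_name, best_sim = None, 0.0
    for name, name_tokens in gazetteer.items():
        sim = cosine(query, weighted_vector(name_tokens, vectors, weights))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return (best_name, best_sim) if best_sim >= threshold else (None, best_sim)

# Hypothetical toy vectors and weights, for illustration only.
vectors = {
    "labrang": [0.9, 0.1, 0.0], "monastery": [0.2, 0.9, 0.1],
    "temple": [0.25, 0.85, 0.1], "gahai": [0.1, 0.0, 0.95], "lake": [0.0, 0.1, 0.9],
}
weights = {"labrang": 3.0, "gahai": 2.0}
gazetteer = {"Labrang Monastery": ["labrang", "monastery"], "Gahai Lake": ["gahai", "lake"]}

name, sim = match_spot(["labrang", "temple"], gazetteer, vectors, weights)
print(name, round(sim, 3))
```

Here the informal mention "labrang temple" still resolves to the gazetteer entry because "temple" and "monastery" have nearby vectors, which is the kind of variation a similarity-based matcher is meant to absorb.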
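| Scenic spot | tf-idf | Scenic spot | tf-idf |
|---|---|---|---|
| 定西玉湖公园 (Dingxi Yuhu Park) | 1.54 | 拉卜楞寺 (Labrang Monastery) | 3.13 |
| 西岩寺 (Xiyan Temple) | 1.32 | 嘉峪关关城 (Jiayuguan Fort) | 0.58 |
| 米拉日巴佛阁 (Milarepa Buddha Pavilion) | 2.96 | 悬壁长城 (Overhanging Great Wall) | 0.67 |
| 郎木寺 (Langmu Temple) | 2.35 | 博罗转井 (Boluozhuanjing) | 2.82 |
| 尕海湖 (Gahai Lake) | 0.98 | 雅丹国家地质公园 (Yadan National Geopark) | 0.39 |

Table 1  Sample tf-idf values of scenic spots in travel notes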
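For readers unfamiliar with the weighting, here is a short sketch of a textbook TF-IDF computation over tokenized travel notes. It assumes term frequency normalized by document length and `idf = log(N / df)`; the page does not say which variant produced the values in Table 1, and their scale suggests a different normalization. The three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    tf  = raw count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in counts.items()})
    return scores

# Hypothetical tokenized travel notes; "游览" (to tour) appears in every note.
docs = [
    ["拉卜楞寺", "游览", "拉卜楞寺"],
    ["游览", "尕海湖"],
    ["游览", "郎木寺"],
]
scores = tf_idf(docs)
print(scores[0])
```

A word that occurs in every note gets a weight of zero, while a scenic spot name concentrated in a few notes gets a high weight, which is why the measure is useful for separating spot names from generic travel vocabulary.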
Fig. 4  Relationship between precision and similarity
Fig. 5  Relationship between the number of recognition errors and similarity
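| Method | Average recall | Average precision | F value |
|---|---|---|---|
| Conditional Random Field | 81.38% | 75.33% | 0.7824 |
| Proposed method | 90.72% | 89.65% | 0.9018 |

Table 2  Scenic spot recognition results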
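The F values in Table 2 are consistent with the standard F1 definition, the harmonic mean of precision and recall, applied to the reported averages. The page does not state the formula explicitly, so treating the F value as F1 is an assumption; the check below reproduces both figures.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Average precision and recall as reported in Table 2.
proposed = f_measure(0.8965, 0.9072)
crf = f_measure(0.7533, 0.8138)
print(f"{proposed:.4f} {crf:.4f}")  # prints "0.9018 0.7824"
```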