New Technology of Library and Information Service  2016, Vol. 32 Issue (2): 34-42     https://doi.org/10.11925/infotech.1003-3513.2016.02.05
Research Paper
Auto-Correction Search Model Based on Statistics and Characteristics*
Duan Jianyong, Guan Xiaolong
College of Computer Science, North China University of Technology, Beijing 100144, China
Abstract

[Objective] This study aims to improve the precision and recall of query correction in search engines and thereby improve the user's retrieval experience. [Methods] We propose a query correction model that combines statistics with features. A confusion set generation model first produces the confusion set for the query keywords entered by the user. A confusion set ranking model then ranks the entries in that set, and the top-ranked entry is compared with the user's original query to detect and correct errors. [Results] Experiments show that the model performs well on search engine queries. On a test set of 110k, precision and recall reached 92.2% and 95%, which are 13.6% and 8.3% higher, respectively, than those of the N-gram correction model. [Limitations] The rules for generating the confusion set are limited (the model generates only four types of words for the confusion set), and training the model requires a large amount of computation. [Conclusions] The model can improve the precision and recall of search engine queries and improve the user's retrieval experience.
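
The generate-then-rank procedure described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration, not the authors' implementation: the toy `CLICK_FREQ` table, the edit-distance threshold `max_dist`, and the ranking key (smaller edit distance first, then higher click frequency) are hypothetical stand-ins for the paper's confusion set generation rules and ranking features, which the abstract does not specify.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical click-frequency table: term -> number of logged clicks.
CLICK_FREQ: Dict[str, int] = {
    "python": 900,
    "pylon": 40,
    "typhoon": 120,
    "java": 700,
}


def confusion_set(query: str, vocab: Dict[str, int], max_dist: int = 2) -> List[str]:
    """Generation step (assumed rule): vocabulary terms within max_dist edits."""
    return [term for term in vocab if levenshtein(query, term) <= max_dist]


def rank(query: str, candidates: List[str], vocab: Dict[str, int]) -> List[str]:
    """Ranking step (assumed rule): closest first, ties broken by click frequency."""
    return sorted(candidates, key=lambda t: (levenshtein(query, t), -vocab[t]))


def correct(query: str, vocab: Dict[str, int] = CLICK_FREQ) -> str:
    """Return the top-ranked candidate, or the query itself if none is found."""
    ranked = rank(query, confusion_set(query, vocab), vocab)
    return ranked[0] if ranked else query
```

With this toy table, `correct("pythn")` returns `"python"`, and a query with no candidate within the threshold is returned unchanged. A real system would replace the ranking key with a trained model over several features.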

Key words: Query correction; Confusion sets; N-gram model; N-gram similarity; Levenshtein Distance (LD); Frequent click rate
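
Of the keywords above, N-gram similarity is a string-similarity measure with several common definitions. As one possibility (an assumption; the paper may define it differently), the sketch below computes a Dice coefficient over character bigrams. The function names and the default `n = 2` are illustrative choices.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Multiset of overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams: 2 * |A and B| / (|A| + |B|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0  # both strings are shorter than n
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * shared / total
```

A score of 1.0 means the two strings share all of their n-grams, and 0.0 means they share none, so the value can serve as one ranking feature alongside edit distance.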
Received: 2015-08-03      Published: 2016-03-08
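Funding: *This work is one of the research outcomes of the Beijing Social Science Foundation project "Research on the Evolution Mechanism and Model of Beijing Public Crisis Events in Network Communication" (No. 13SHC031) and the National Natural Science Foundation of China project "Research on Multi-Granularity Integrated Information Extraction Methods for Wikipedia" (No. 61103112).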
Cite this article:
Duan Jianyong, Guan Xiaolong. Auto-Correction Search Model Based on Statistics and Characteristics[J]. New Technology of Library and Information Service, 2016, 32(2): 34-42.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.02.05      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I2/34