New Technology of Library and Information Service  2016, Vol. 32 Issue (2): 34-42    DOI: 10.11925/infotech.1003-3513.2016.02.05
Original Article
Auto-Correction Search Model Based on Statistics and Characteristics
Duan Jianyong
College of Computer Science, North China University of Technology, Beijing 100144, China

[Objective] This study aims to improve the precision, recall and user experience of search engines. [Methods] We propose an automatic query correction model based on statistics and characteristics. First, we build a model that generates a confusion set of candidate queries for the user's search terms. Then, we apply a ranking algorithm to the confusion set and choose the best match for the original query. [Results] The new model improved search engine performance. Precision and recall reached 92.2% and 95% on a test set of 110k, which were 13.6% and 8.3% higher than those of the N-gram model. [Limitations] The model generates only four types of words for the confusion set, and training is computationally expensive. [Conclusions] The new model can improve the precision, recall and user experience of search engines.
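
The two-stage idea in the abstract (generate a confusion set, then rank it) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how candidates are generated or scored, so the sketch assumes a toy query log mapping queries to frequencies, builds the confusion set with plain Levenshtein distance, and ranks by character bigram similarity (Dice coefficient) with frequency as a tie-breaker. The names `levenshtein`, `ngram_similarity`, `confusion_set`, `best_correction` and `query_log` are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-gram multisets (an assumed measure)."""
    grams_a = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    grams_b = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2 * overlap / total


def confusion_set(query: str, query_log: dict, max_distance: int = 2) -> list:
    """Stage 1 (assumed): logged queries within max_distance edits of the input."""
    return [q for q in query_log if levenshtein(query, q) <= max_distance]


def best_correction(query: str, query_log: dict, max_distance: int = 2) -> str:
    """Stage 2 (assumed): rank by n-gram similarity, then by log frequency."""
    candidates = confusion_set(query, query_log, max_distance)
    if not candidates:
        return query  # nothing close enough: leave the query unchanged
    return max(candidates,
               key=lambda q: (ngram_similarity(query, q), query_log[q]))


# Toy query log with made-up frequencies, for illustration only.
query_log = {"search engine": 500, "search engines": 80, "speech engine": 5}
print(best_correction("serach engine", query_log))
```

Scanning the whole log for every query is quadratic in practice; a real system would index candidates first, which is consistent with the abstract's note that the approach is computationally heavy.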

Key words: Query correction; Confusion sets; N-gram model; N-gram similarity; Levenshtein Distance (LD); Frequent click rate
Received: 03 August 2015      Published: 08 March 2016

Cite this article:

Duan Jianyong. Auto-Correction Search Model Based on Statistics and Characteristics. New Technology of Library and Information Service, 2016, 32(2): 34-42.


  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938