[Objective] This study aims to improve the precision, recall, and user experience of the search engine. [Methods] We propose an automatic query correction model that combines statistics and characteristics. First, a model generates the confusion query set for the user's search terms. Then, a ranking algorithm orders the confusion set and selects the best match for the original query. [Results] The new model improved the search engine's performance. On a test set of 110k, the precision and recall rates were 92.2% and 95%, which were 13.6% and 8.3% higher than those of the N-gram model. [Limitations] The model generates only four types of words for the confusion set, and the training process is computationally expensive. [Conclusions] The new model can improve the precision, recall, and user experience of the search engine.
Duan Jianyong, Guan Xiaolong. Auto-Correction Search Model Based on Statistics and Characteristics[J]. New Technology of Library and Information Service, 2016, 32(2): 34-42.