Auto-Correction Search Model Based on Statistics and Characteristics
Duan Jianyong
College of Computer Science, North China University of Technology, Beijing 100144, China
Abstract [Objective] This study aims to improve the precision, recall, and user experience of search engines. [Methods] We propose an automatic query correction model based on statistics and characteristics. First, the model generates a confusion set of candidate queries for the user's search terms. Then, a ranking algorithm orders the confusion set and selects the best match for the original query. [Results] The new model improved search engine performance. Precision and recall were 92.2% and 95% on a test set of 110k, which were 13.6% and 8.3% higher than those of the N-gram model. [Limitations] The model generates only four types of words for the confusion set, and the training process requires a lot of computation. [Conclusions] The new model can improve the precision, recall, and user experience of the search engine.
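The two-stage idea in the Methods (build a confusion set, then rank it) can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify the four word types or the ranking features, so the sketch assumes a confusion set made of logged queries within a small edit distance, and a ranking by log frequency. The names `edit_distance`, `confusion_set`, `correct`, and the `query_log` counter are all hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def confusion_set(query: str, query_log: Counter, max_distance: int = 1) -> set:
    """Assumed stage 1: logged queries within max_distance edits of the input."""
    return {q for q in query_log if edit_distance(query, q) <= max_distance}


def correct(query: str, query_log: Counter, max_distance: int = 1) -> str:
    """Assumed stage 2: pick the most frequent candidate, nearest first on ties."""
    candidates = confusion_set(query, query_log, max_distance)
    if not candidates:
        return query  # nothing similar in the log, so leave the query unchanged
    return min(candidates,
               key=lambda q: (-query_log[q], edit_distance(query, q), q))
```

With a hypothetical log such as `Counter({"weather": 50, "whether": 20, "leather": 5})`, calling `correct("wether", log)` considers "weather" and "whether" (both one edit away) and prefers the more frequent one. Ranking by raw frequency is only a stand-in; a real system would combine statistical and feature-based scores, as the abstract suggests.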
Received: 03 August 2015
Published: 08 March 2016