New Technology of Library and Information Service  2015, Vol. 31 Issue (2): 15-23    DOI: 10.11925/infotech.1003-3513.2015.02.03
Feature Analysis and Automatic Identification of Query Specificity
Tang Xiangbin1, Lu Wei2, Zhang Xiaojuan1, Huang Shihao1
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China

[Objective] This paper constructs a human-annotated collection from Sogou query logs, analyzes the features of query specificity, identifies query specificity automatically, and evaluates and compares the identification results. [Methods] The queries' basic features and content features are selected and analyzed. Decision tree, SVM, and Naive Bayes classifiers are then built and trained to classify query specificity automatically. [Results] Using these features, query specificity is identified effectively: the macro-averaged F-measures of the identification results are all above 0.8. [Limitations] Users' clickthrough information was not included in feature selection, and whether the conditional independence assumption of the Naive Bayes classifier can be ignored in this particular experiment needs further verification. [Conclusions] The queries' basic features and content features alone can distinguish broad, medium, and specific queries well.
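
The macro-averaged F-measure reported above is the unweighted mean of the per-class F-measures, so each of the three specificity classes counts equally regardless of how many queries it contains. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `macro_f1` and the toy gold and predicted labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the macro-averaged F1: the unweighted mean of per-class F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical annotations for six queries (illustrative data only)
gold = ["broad", "broad", "medium", "medium", "specific", "specific"]
pred = ["broad", "medium", "medium", "medium", "specific", "broad"]
score = macro_f1(gold, pred, ["broad", "medium", "specific"])
print(round(score, 3))
```

In this toy case the per-class F1 values are 0.5 (broad), 0.8 (medium), and 2/3 (specific), and the macro average is their plain mean.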

Key words: Query specificity; Decision tree; SVM; Naive Bayes
Received: 23 April 2014      Published: 17 March 2015
CLC Number: G353.1

Cite this article:

Tang Xiangbin, Lu Wei, Zhang Xiaojuan, Huang Shihao. Feature Analysis and Automatic Identification of Query Specificity. New Technology of Library and Information Service, 2015, 31(2): 15-23.


