New Technology of Library and Information Service, 2015, Vol. 31, Issue 2: 15-23. https://doi.org/10.11925/infotech.1003-3513.2015.02.03
Research Paper
Feature Analysis and Automatic Identification of Query Specificity
Tang Xiangbin¹, Lu Wei², Zhang Xiaojuan¹, Huang Shihao¹
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper builds a human-annotated collection from Sogou query logs, analyzes the features of query specificity, identifies specificity automatically, and evaluates the identification results. [Methods] Basic features and content features of the query strings are selected and analyzed statistically. Decision tree, SVM, and Naive Bayes classifiers are then trained to identify query specificity automatically. [Results] These features identify query specificity effectively: under ten-fold cross-validation, the macro-averaged F-measures are all above 0.8. [Limitations] Users' clickthrough information is not considered in feature selection, and whether the conditional independence assumption of the Naive Bayes classifier can be ignored in this experiment needs further verification. [Conclusions] Basic features and content features of the query string can effectively distinguish broad, medium, and specific queries.

Key words: Query specificity; Decision tree; SVM; Naive Bayes
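The evaluation measure named in the abstract, the macro-averaged F-measure under ten-fold cross-validation, can be illustrated with a minimal sketch. It assumes the usual definition of macro averaging (the unweighted mean of per-class F1 scores) and a simple round-robin fold assignment. The function names, the fold scheme, and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

# The three specificity levels named in the abstract.
LABELS = ("broad", "medium", "specific")


def macro_f_measure(gold, predicted, labels=LABELS):
    """Return the unweighted mean of the per-class F1 scores."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but the true class was g
            fn[g] += 1  # the true class g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); taken as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def ten_fold_indices(n_items, n_folds=10):
    """Yield (train_idx, test_idx) pairs; item i is tested in fold i % n_folds.

    Assumption: round-robin assignment, chosen only for illustration.
    """
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test


# Hypothetical toy labels, not data from the paper.
gold = ["broad", "broad", "medium", "medium", "specific", "specific"]
pred = ["broad", "medium", "medium", "medium", "specific", "broad"]
score = macro_f_measure(gold, pred)
```

Because macro averaging weights each class equally, a classifier that handles a rare specificity level poorly is penalized as much as one that handles a frequent level poorly. In a full experiment, each fold's test indices would be scored with `macro_f_measure` and the fold scores averaged.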
Received: 2014-04-23      Published: 2015-03-17
CLC number: G353.1
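Funding:

This work is one of the research outcomes of the National Key Technology R&D Program project "Research on Ontology Construction, Storage, and Visualization Technologies for Cultural Heritage Knowledge" (No. 2012BAH33F03) and the National Natural Science Foundation of China General Program project "Research on Modeling and Framework Implementation of General Entity Retrieval Based on Language Models" (No. 71173164).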

Corresponding author: Lu Wei, ORCID: 0000-0002-0929-7416, E-mail: reedwhu@gmail.com
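Author contributions: Tang Xiangbin: literature review, data analysis, drafting the paper, and revising multiple versions including the final version; Lu Wei: proposing the research idea and revising multiple versions including the final version; Zhang Xiaojuan: literature review, data analysis, and revising the first draft; Huang Shihao: building the annotation system and processing the experimental data.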
Cite this article:
Tang Xiangbin, Lu Wei, Zhang Xiaojuan, Huang Shihao. Feature Analysis and Automatic Identification of Query Specificity. New Technology of Library and Information Service, 2015, 31(2): 15-23.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.02.03      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I2/15

