New Technology of Library and Information Service  2016, Vol. 32 Issue (2): 25-33    DOI: 10.11925/infotech.1003-3513.2016.02.04
Research Paper
Identifying Terminology from Search Engine Query Logs*
Liu Tong, Ni Weijian, Liu Mei
College of Information Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Abstract

[Objective] To overcome the limitations of traditional terminology recognition methods that rely on static domain corpora, this study proposes a new method for automatically identifying domain terminology from search engine query logs. [Methods] The query logs are modeled as a four-partite graph, and a manifold ranking algorithm is applied to this graph to rank all candidate terms by their degree of domain relevance; the top-ranked candidates are taken as domain terms. [Results] Experiments on query logs from a real search engine show that the proposed method identifies domain terms more effectively, improving Precision@n by about 20% over the baseline methods. [Limitations] The coverage of the identified terms depends in part on the initial queries chosen by domain experts, which places some demands on the experts' experience. [Conclusions] The method builds a high-quality set of domain terms without requiring a large domain corpus prepared in advance or extensive manual annotation, which makes it of practical value.

Key words: Domain terminology; Search engine; Query logs; Manifold ranking
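
The ranking step named in the abstract can be illustrated with a minimal sketch. It assumes one common form of manifold ranking, the iteration f ← αSf + (1 − α)y with S = D^(-1/2) W D^(-1/2), run on a toy weighted graph. The abstract does not say what the four partitions of the graph contain, how edges are weighted, or which parameters the authors used, so the node names, edge weights, `alpha`, and the "relevant" labels below are all hypothetical, and `manifold_rank` and `precision_at_n` are illustrative helpers rather than the authors' implementation.

```python
import math


def manifold_rank(edges, seeds, alpha=0.85, tol=1e-9, max_iter=1000):
    """Score nodes by iterating f <- alpha * S f + (1 - alpha) * y.

    edges: iterable of (u, v, weight) for an undirected graph.
    seeds: set of nodes assumed to be known domain items (y = 1 there).
    """
    # Symmetric weighted adjacency map (W).
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    # Degree of each node: sum of its incident edge weights (diagonal of D).
    deg = {u: sum(nbrs.values()) for u, nbrs in adj.items()}

    # Initial label vector y: 1 for seeds, 0 elsewhere.
    y = {u: (1.0 if u in seeds else 0.0) for u in adj}
    f = dict(y)

    for _ in range(max_iter):
        new_f = {}
        for u in adj:
            # (S f)[u], where S[u][v] = W[u][v] / sqrt(deg[u] * deg[v]).
            spread = sum(
                w * f[v] / math.sqrt(deg[u] * deg[v])
                for v, w in adj[u].items()
            )
            new_f[u] = alpha * spread + (1 - alpha) * y[u]
        delta = max(abs(new_f[u] - f[u]) for u in adj)
        f = new_f
        if delta < tol:
            break
    return f


def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that are labeled relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n


# Hypothetical toy graph: a chain leading away from one seed query.
edges = [
    ("q_seed", "t_near", 1.0),
    ("t_near", "t_mid", 1.0),
    ("t_mid", "t_far", 1.0),
    ("t_far", "t_farthest", 1.0),
]
seeds = {"q_seed"}

scores = manifold_rank(edges, seeds)

# Rank the non-seed candidates by score, highest first.
ranked = sorted(
    (node for node in scores if node not in seeds),
    key=lambda node: scores[node],
    reverse=True,
)

# Hypothetical gold labels, used only to show how Precision@n is computed.
relevant = {"t_near", "t_mid"}

print(ranked)
print(precision_at_n(ranked, relevant, 2))
print(precision_at_n(ranked, relevant, 4))
```

On this chain-shaped toy graph the scores fall off with distance from the seed, which is presumably the intuition behind starting the ranking from expert-chosen queries: candidates closely connected to the seeds rank highest, and Precision@n then measures how many of the top n candidates are true domain terms.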
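Received: 2015-08-13
Funding: *This work is one of the research outcomes of the Shandong Provincial Natural Science Foundation project "Research on Structural Support Vector Machine Learning Algorithms in Dynamic Environments and Their Applications" (No. ZR2014FP011), the Shandong Province Higher Education Science and Technology Program project "Research on Learning to Rank from Imbalanced Data for Information Retrieval" (No. J12LN45), and the Shandong Province Higher Education Science and Technology Program project "Research on Key Techniques of Supervised Learning for Imbalanced Text Data under Irregular Distributions" (No. J14LN33).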
Cite this article:
Liu Tong, Ni Weijian, Liu Mei. Identifying Terminology from Search Engine Query Logs[J]. New Technology of Library and Information Service, 2016, 32(2): 25-33. DOI: 10.11925/infotech.1003-3513.2016.02.04.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.02.04
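Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn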