Re-rank Retrieval Results Through Subject Indexing<sup>*</sup>

Cite this article

Mao Jin, Li Gang, Cao Yujie. Re-rank Retrieval Results Through Subject Indexing. New Technology of Library and Information Service, 2014, 30(7,8): 48-55

Editorial Office of New Technology of Library and Information Service

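Mao Jin¹, Li Gang¹, Cao Yujie²

¹Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China

²NetEase (Hangzhou) Inc., Hangzhou 310052, China

Corresponding author: Mao Jin, E-mail: danveno@163.com

Author contributions:

Mao Jin: designed the research plan and conducted the experiments;

Li Gang: proposed the research idea and drafted the paper;

Cao Yujie: took part in designing the research plan and revised the final version of the paper.

Funding: This work is one of the research outcomes of the Major Program of the National Social Science Foundation of China, "Research on the Construction of an Intelligence System for Emergency Decision-Making in Smart Cities" (Grant No. 13&ZD173).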

Abstract

[Objective] To re-rank search results with the help of subject indexing during pseudo-relevance feedback. [Methods] Using a language modeling approach, the associations between subject terms and user queries are mined, and each user query is represented as a probability distribution over subject terms. Generative language models are also built for the subject terms and used to calculate the weights of subject terms in documents. On this basis, the scores of the documents returned by the initial retrieval are recalculated and the documents are re-ranked accordingly. [Results] The proposed method builds suitable language models for subject terms and derives reasonable weights of subject terms in documents. The re-ranked results generally improve on the initial retrieval. [Limitations] Different methods of mining the associations between subject terms and documents are not compared, and the approach is not tested on datasets of different scales or in different languages. [Conclusions] Re-ranking that exploits the associations between subject terms and user queries and between subject terms and documents can improve retrieval precision.

Keywords: Language model; Information retrieval; Subject heading; Subject indexing; Re-rank results

CLC number: TP391.3

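1 Introduction

In library and information institutions, a large share of information resources exists as natural-language text. The vocabulary of natural-language text is free and arbitrary, and a single concept can often be expressed by many different words, which hampers information retrieval. To avoid this uncertainty, subject indexing annotates information resources at the concept level with standardized subject terms from a controlled vocabulary, and users can also search with those subject terms. By matching at the concept level, subject indexing aims to resolve the mismatch between the words in user queries and the words in information resources^{[1]}. As an important way of organizing information, subject indexing has played a major role in traditional libraries and, with the adoption of computers, has developed further in machine-readable records and bibliographic databases. It has become indispensable in the management of many databases, the most representative and successful application being PubMed^{[2]}, where professional indexers invest great effort in indexing the literature with Medical Subject Headings (MeSH) and the indexing is integrated into the database's retrieval system. Beyond manual indexing, the development of automatic subject indexing allows part of the work to be done by computer, which makes subject indexing data easier to obtain. As such data grow richer, using subject indexing to optimize the retrieval process and raise user satisfaction will help users obtain information resources more effectively. Existing ways of using subject indexing to improve retrieval mainly include query expansion with subject terms^{[3]}, increasing the weight of subject terms in the retrieval model^{[4]}, and treating subject terms as concepts and incorporating them into the representation of documents or user queries^{[5, 6, 7]}. Unlike these studies, this paper mines the relationships between subject terms and ordinary terms and between subject terms and documents and, using a language modeling approach, proposes a method that re-ranks retrieval results with subject terms in order to improve the precision of the retrieval system.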

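2 Related Work

Observation of how users use retrieval systems shows that they generally do not browse all results but focus on the top of the result list^{[8]}. Improving the ranking of results so that it better meets users' information needs helps raise user satisfaction. Query re-ranking generally works as follows: starting from the results initially returned by the retrieval system, the results are re-ordered so that documents relevant to the user query rank higher, and no second retrieval is required.

The motivations for query re-ranking, or the criteria by which it is done, come from several directions, the most important being to improve the precision of the results. Because users differ in their familiarity with the search topic, their query terms are more or less ambiguous; re-ranking can make the top of the result list cover multiple topics so that users can understand and choose among them and find the information they need^{[9]}. Re-ranking by the diversity^{[9]} or novelty^{[10]} of results thus amounts to re-ranking by the topics to which the result documents belong. In other search scenarios, users want to grasp an overview of the search target first and then search with more specific query terms; in that case documents can be re-ranked by their generality^{[11]}, placing overview documents first.

Establishing a new ranking rule for the result documents is a key step in query re-ranking. Existing studies build the new rule mainly from relevance feedback information and document context. Relevance feedback information includes implicit feedback such as clicks on results, document publication time, browsing history and query logs^{[12]}, as well as explicit relevance feedback and pseudo-relevance feedback. Pseudo-relevance feedback is widely used, and some studies go further and mine the truly relevant documents from the pseudo-relevant set. Sakai et al. use a selective sampling strategy to further filter relevant documents for re-ranking^{[13]}. Zhou et al. compute the similarity of each document to the relevant and non-relevant documents in the feedback information and combine the similarities into a query score used to re-rank the documents^{[14]}. Yuan et al. compute the similarity of each document to the other documents in the pseudo-relevance feedback set, integrate it with the similarity between the document and the query terms, and rank the documents by the final similarity before presenting them to the user^{[15]}. Diaz argues that similar documents should receive similar scores; following this idea, he uses a k-nearest-neighbor (KNN) algorithm to find each document's neighbors across the corpus, builds a graph representation of the corpus, adjusts the scores of similar documents among the results, and re-ranks the results by the recomputed scores^{[16]}. Kurland goes further and clusters the result documents, grouping relevant and non-relevant documents and documents on different topics; documents that cluster together in the pseudo-relevance feedback set are likely relevant, while isolated documents are likely non-relevant. The cluster information of the relevant documents is then used to build a query language model, and the result documents are re-ranked with a language modeling approach^{[17]}.

Document context can also be used to measure the importance of a document itself. Search engines such as Google re-rank the initial results using link structure, anchor text and similar information to improve retrieval^{[18]}. Krestel et al. link information such as document titles to Wikipedia, use this external knowledge resource to identify document topics as a diversity score, and combine it with the original retrieval score to re-rank the results^{[9]}. The subject terms assigned to a document can likewise be regarded as context: they reveal the document's topic to some extent and can serve as a basis for re-ranking. Kamps treats subject terms as document features and builds a vector space model of documents; he mines the subject terms in the pseudo-relevance feedback documents, extracts the core subject terms by dimensionality reduction, builds a subject-term vector for the user query, and re-ranks by the distance between that vector and the subject-term vector of each result document^{[19]}. Yin et al. construct two dimensions of context, MeSH terms and document keywords, combine them, and rank by a relevance probability between document and query computed on top of the vector space model^{[12]}. Like these studies, this paper re-ranks during pseudo-relevance feedback using the subject terms assigned to documents; unlike them, it represents the user query as a probability distribution over subject terms within the language modeling framework and combines this with the relevance between subject terms and documents.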

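3 Query Re-ranking Through Subject Indexing

First, subject terms are discovered in the pseudo-relevant documents, their relevance to the user query is computed, and the user query is represented as a probability distribution over subject terms, i.e. the subject-term language model of the user query. Next, the document-level co-occurrence of subject terms and ordinary terms in the corpus is mined to build a language model of each subject term over ordinary terms, from which the relevance between a document and each of its assigned subject terms, i.e. the weight of the subject term in the document, is computed. Finally, the subject-term weights in the result documents are used to convert the query's subject-term language model into a final score for each document, and the documents from the initial retrieval are re-ranked by this score.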

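3.1 Subject-term Language Model of the User Query

In language-model-based retrieval, the relevance model mines the user's need from pseudo-relevant documents and builds a probability distribution for the user query in the form of a language model. Similarly, this paper mines the user's information need from pseudo-relevant documents. Unlike earlier work, which characterizes the query need with terms, the authors hold that a user query can be represented by the concepts of subject indexing, that is, with the subject terms assigned to documents as the basic units of the language model. Figure 1 shows how the set of subject terms is obtained from the pseudo-relevant documents, where $Q$ is the user query, $d_i$ denotes a pseudo-relevant feedback document, and $c_k$ denotes the $k$-th subject term of a document; every subject term comes from the subject thesaurus $C$, so $c_k \in C$.

Figure 1 Subject-term representation of a user query

Just as a unigram language model represents a document as a probability distribution over words, a user query can be represented as a probability distribution over subject terms, $P(c \mid Q)$. The computation proceeds as follows:

(1) Count the frequency of subject terms in the set of pseudo-relevant documents to obtain each subject term and its frequency, $(c_k, n(c_k))$, where $n(c_k)$ is the number of times subject term $c_k$ occurs;

(2) Estimate the subject-term probability by maximum likelihood:

$$P(c_k \mid Q) = \frac{n(c_k)}{\sum_{c_j \in C_Q} n(c_j)}$$

where $C_Q$ denotes all the subject terms that occur in the pseudo-relevant documents.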
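The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (the paper reports a Java implementation): the function name `query_subject_model`, the `top_k` option and the toy documents are assumptions introduced here.

```python
from collections import Counter


def query_subject_model(feedback_docs, top_k=None):
    """Maximum-likelihood estimate of P(c | Q) from pseudo-relevant documents.

    feedback_docs: list of lists, each holding the subject terms of one
                   pseudo-relevant feedback document.
    top_k:         hypothetical option that keeps only the k most frequent
                   subject terms before normalizing.
    """
    # Step (1): count how often each subject term occurs in the feedback set.
    counts = Counter(c for doc in feedback_docs for c in doc)
    if top_k is not None:
        counts = Counter(dict(counts.most_common(top_k)))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Step (2): relative frequency is the maximum-likelihood probability.
    return {c: n / total for c, n in counts.items()}


# Toy feedback set (illustrative subject terms only).
feedback = [
    ["Liver Circulation", "Human"],
    ["Liver Circulation", "Ascitic Fluid"],
    ["Human"],
]
model = query_subject_model(feedback)
```

Truncating to the most frequent subject terms mirrors the experimental parameter that limits how many subject terms represent a query; whether the authors renormalize after truncation is not stated, so the sketch simply does so.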

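3.2 Language Model of Subject Terms

In library and information science, subject terms are usually understood as concepts, and the ordinary terms of natural-language text can be mapped to subject terms according to their meaning. Starting from semantic associations and semantic hierarchies, manually built knowledge resources such as HowNet for Chinese, WordNet for English and the Medical Subject Headings (MeSH) use synonyms, near-synonyms, hypernyms, hyponyms and entry terms both to express the conceptual relations among subject terms and to map ordinary terms to subject terms. With such explicit semantic mappings, the set of ordinary terms associated with a subject term can be found and used to represent it. Going a step further, text mining can identify, from the document-level co-occurrence of subject terms and ordinary terms, the set of ordinary terms related to a subject term and quantify the relation. The implicit links discovered this way differ from the explicit semantic associations in manually built knowledge resources.

Following probability theory, a subject term can be represented by a language model, i.e. as a conditional probability distribution over ordinary terms, $P(t \mid \theta_c)$, where $\theta_c$ denotes the term distribution of subject term $c$ and $t$ an ordinary term. This paper estimates $P(t \mid \theta_c)$ heuristically: the more often term $t$ occurs in the set of documents indexed with subject term $c$ and the less often it occurs in the corpus as a whole, the stronger the association between $t$ and $c$. The heuristic derives from TF-IDF weighting^{[20]} and yields an association strength $s(t, c)$ between term $t$ and subject term $c$.

Here $tf_{t,c}$ is the number of occurrences of term $t$ in the set of documents indexed with subject term $c$, $N$ is the total number of documents in the corpus, $df_t$ is the document frequency of $t$, i.e. the number of documents in the corpus in which it occurs, on which the inverse document frequency is based, and 0.5 is a smoothing parameter.

After normalization, this gives:

$$P(t \mid \theta_c) = \frac{s(t, c)}{\sum_{t' \in V_c} s(t', c)}$$

where $V_c$ is the set of all ordinary terms that occur in the documents indexed with subject term $c$.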
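A rough Python sketch of this subject-term language model follows. The exact heuristic formula is not reproduced here, so the sketch assumes one TF-IDF-style form built from the quantities named in the text, $tf_{t,c} \cdot \log(N / df_t + 0.5)$; the placement of the 0.5 smoothing term is a guess, and the function name, data layout and toy corpus are likewise illustrative assumptions.

```python
import math
from collections import Counter


def subject_term_lm(docs, subject_term, max_terms=1000):
    """Estimate a term distribution for one subject term.

    docs: list of (tokens, subject_terms) pairs, one per corpus document.
    Returns {term: probability} over the terms of documents indexed with
    subject_term, truncated to the max_terms strongest terms.
    """
    n_docs = len(docs)
    df = Counter()    # document frequency of each term over the whole corpus
    tf_c = Counter()  # occurrences of each term in documents indexed with c
    for tokens, subjects in docs:
        df.update(set(tokens))
        if subject_term in subjects:
            tf_c.update(tokens)
    # ASSUMED heuristic: tf * log(N / df + 0.5). The 0.5 placement is a guess.
    strength = {t: tf * math.log(n_docs / df[t] + 0.5) for t, tf in tf_c.items()}
    top = sorted(strength.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    total = sum(s for _, s in top)
    if total == 0:
        return {}
    # Normalize the association strengths into a probability distribution.
    return {t: s / total for t, s in top}


# Toy corpus: (tokens, assigned subject terms).
corpus = [
    (["liver", "vein", "patient"], {"Liver Circulation"}),
    (["liver", "cirrhosis", "patient"], {"Liver Circulation"}),
    (["heart", "patient"], {"Heart"}),
    (["lung", "patient"], {"Lung"}),
]
lm = subject_term_lm(corpus, "Liver Circulation")
```

In the toy corpus the corpus-wide word "patient" ends up with a lower probability than "liver" or "vein", which is the behavior the heuristic is meant to produce. The default cap of 1 000 terms follows the experimental setting reported later in the paper.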

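3.3 Weighting Subject Terms in Documents

In current subject indexing systems, indexers who assign subject terms to a document lack a mechanism for expressing how important each term is, that is, how relevant it is to the document. The U.S. National Library of Medicine (NLM) marks different levels of subject terms with an asterisk ("*")^{[21]}: a term carrying the asterisk is more important to the document than one without it and is a major indexing term. The drawback is that the asterisk only distinguishes levels and does not quantify the relevance of a subject term to a document. The success of probabilistic models in automatic indexing and text retrieval suggests a new approach: introduce probabilities into manual subject indexing and express the weight of a subject term in a document as a probability^{[22]}. Compared with conventional indexing, weighted indexing, which assigns weights to indexing terms, deserves the attention of indexers and should also be built into system functionality^{[23]}. In information retrieval, once weighted indexing is introduced, characteristics of subject indexing such as exhaustivity and specificity affect retrieval effectiveness^{[24]}.

In manual subject indexing, indexers generally choose subject terms according to certain features or patterns of the document^{[25]}, and keywords are often an important clue^{[26]}. Put simply, indexers identify the topical terms in a document and choose subject terms on that basis. Weighted ordinary terms can therefore serve as the link between documents and subject terms: by mining the relations between documents and ordinary terms and between subject terms and ordinary terms, the relevance weight of a subject term for a document can be inferred.

Following the language modeling approach, a document is regarded as a language model $\theta_d$ built from ordinary terms, and weighted mutual information is used to compute the relevance weight between document $d$ and subject term $c$^{[27]}.

In this measure, $t$ is a term in the document, $c$ is a subject term of the document, and $w(t, c)$ is the weight of the pair $(t, c)$, obtained in TF-IDF fashion.

Here $tf$ is the frequency of term $t$ in the document, $N$ is the total number of documents in the corpus, $df_t$ is the document frequency of term $t$, i.e. the number of documents containing it, and $df_c$ is the document frequency of subject term $c$, i.e. the number of documents indexed with it. The logarithm is dropped from the subject term's component of the TF-IDF weight, which strengthens the effect of the inverse document frequency and emphasizes the role of document frequency.

The probabilities $p(t, c)$, $p(t)$ and $p(c)$ are estimated by maximum likelihood. If an object $x$ occurs in $df_x$ documents of the corpus, its probability is:

$$p(x) = \frac{df_x}{N}$$

These steps give the weighted mutual information between subject term $c$ and document $d$, written $I_w(d, c)$. Normalizing over all the subject terms of the document, $C_d$, gives the final weight of subject term $c$ in document $d$:

$$P(c \mid d) = \frac{I_w(d, c)}{\sum_{c' \in C_d} I_w(d, c')}$$

This weight indicates how closely the subject term is related to the topic of the document.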
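The weighting step can be illustrated with the sketch below. It rests on assumptions, because the precise formulas are not reproduced here: the weighted mutual information is taken to be $\sum_{t \in d} w(t,c)\, p(t,c) \log \frac{p(t,c)}{p(t)\,p(c)}$, the pair weight is taken to be $tf \cdot \log(N/df_t) \cdot (N/df_c)$, and negative sums are clipped to zero before normalization. All names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def subject_weights(doc_index, docs):
    """Weight each subject term of docs[doc_index] by weighted mutual information.

    docs: list of (tokens, subject_terms) pairs, one per corpus document.
    Returns {subject_term: normalized weight} for the chosen document.
    """
    n = len(docs)
    df_t, df_c, df_tc = Counter(), Counter(), Counter()
    for tokens, subjects in docs:
        terms = set(tokens)
        df_t.update(terms)                 # documents containing term t
        df_c.update(subjects)              # documents indexed with subject term c
        df_tc.update((t, c) for t in terms for c in subjects)  # both together
    tokens, subjects = docs[doc_index]
    tf = Counter(tokens)
    raw = {}
    for c in subjects:
        total = 0.0
        for t, f in tf.items():
            # Maximum-likelihood probabilities: document frequency over N.
            p_tc, p_t, p_c = df_tc[(t, c)] / n, df_t[t] / n, df_c[c] / n
            # ASSUMED pair weight: TF-IDF of the term times a log-free IDF of c.
            pair_weight = f * math.log(n / df_t[t]) * (n / df_c[c])
            total += pair_weight * p_tc * math.log(p_tc / (p_t * p_c))
        raw[c] = max(total, 0.0)  # ASSUMPTION: clip negative sums to zero
    norm = sum(raw.values())
    return {c: (v / norm if norm > 0 else 0.0) for c, v in raw.items()}


# Toy corpus: every document carries the ubiquitous subject term "Human".
corpus = [
    (["liver", "vein", "patient"], {"Liver Circulation", "Human"}),
    (["liver", "cirrhosis", "patient"], {"Liver Circulation", "Human"}),
    (["heart", "patient"], {"Heart", "Human"}),
    (["lung", "patient"], {"Lung", "Human"}),
]
weights = subject_weights(0, corpus)
```

In this toy corpus the subject term assigned to every document gets no weight, because it carries no information about any particular term, while the specific subject term takes all of it.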

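3.4 Recomputing the Scores of Result Documents

The core problem addressed in this paper is how, starting from the initial results of the retrieval system, to build a subject-term representation of the user query from a small number of pseudo-relevance feedback documents and then recompute the scores of the result documents from the weights of subject terms in documents. The larger a subject term's probability in the query's subject-term language model and the higher its weight in a document, the more relevant the document is to the query. The final score of a result document aggregates the information of all its subject terms. It is computed as follows:

$$P(d \mid Q) = \sum_{c \in C_d} P(d \mid c)\, P(c \mid Q)$$

Applying Bayes' theorem gives:

$$P(d \mid Q) = \sum_{c \in C_d} \frac{P(c \mid d)\, P(d)}{P(c)}\, P(c \mid Q) \propto \sum_{c \in C_d} \frac{P(c \mid d)}{P(c)}\, P(c \mid Q)$$

where $C_d$ is the set of subject terms assigned to document $d$, $P(c \mid d)$ is the weight of subject term $c$ in $d$ (Section 3.3), and $P(d)$ is the document prior; all documents are assumed equally likely, so $P(d)$ does not affect the final ranking and is omitted. $P(c)$ is the probability of the subject term and is estimated by maximum likelihood.

The result of this computation is the new score of each result document; sorting the documents by this score in descending order gives the re-ranked results.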
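A minimal sketch of the re-scoring step, assuming the score of a document is the sum over its subject terms of the query-model probability times the in-document weight divided by the subject-term prior, with the uniform document prior dropped. The function name, argument layout and toy numbers are illustrative assumptions.

```python
def rescore(results, query_model, doc_weights, term_prior):
    """Re-rank initially retrieved documents by their subject-term score.

    results:     document ids in the order of the initial retrieval.
    query_model: {subject_term: P(c | Q)}.
    doc_weights: {doc_id: {subject_term: weight of the term in the document}}.
    term_prior:  {subject_term: P(c)}, e.g. document frequency over corpus size.
    Returns (re-ranked ids, {doc_id: score}).
    """
    scores = {}
    for doc_id in results:
        weights = doc_weights.get(doc_id, {})
        scores[doc_id] = sum(
            p_cq * weights[c] / term_prior[c]
            for c, p_cq in query_model.items()
            if c in weights and term_prior.get(c, 0) > 0
        )
    # Python's sort is stable, so tied documents keep their initial order.
    ranking = sorted(results, key=lambda d: scores[d], reverse=True)
    return ranking, scores


# Toy inputs (all values illustrative).
query_model = {"Liver Circulation": 0.6, "Human": 0.4}
doc_weights = {
    "d1": {"Human": 1.0},
    "d2": {"Liver Circulation": 0.8, "Human": 0.2},
    "d3": {},
}
term_prior = {"Liver Circulation": 0.1, "Human": 0.9}
ranking, scores = rescore(["d1", "d2", "d3"], query_model, doc_weights, term_prior)
```

Dividing by the prior means a rare subject term that matches the query lifts a document far more than a ubiquitous one, which is why "d2" overtakes "d1" here. No second retrieval is issued; only the initial result list is reordered.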

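4 Experiments and Analysis

4.1 Dataset and Preprocessing

The experiments use the Ohsumed dataset^{[28]}, which is widely used in medical information retrieval research. It contains the metadata of 348 566 articles from 1987-1991, namely title, indexing terms, authors, publication type, abstract, journal source and document identifier, and the controlled vocabulary used for its subject indexing is the Medical Subject Headings (MeSH). These characteristics make the dataset suitable for the experiments in this paper. Ohsumed includes 106 queries with judgments of definitely relevant and possibly relevant documents; the two are merged and treated as relevant when computing precision. As for the parameters, the number of terms in the language model of each subject term is capped at 1 000, the number of pseudo-relevance feedback documents $N$ ranges from 1 to 10, and the number of subject terms used to represent the user query ranges from 1 to 50; results are computed for each parameter setting.

Precision is measured by P@N^{[20]}:

$$P@N = \frac{R_N}{N}$$

where $R_N$ is the number of relevant documents among the top $N$ results. The experiments compare P@5, P@10, P@15 and P@20.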
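For concreteness, precision at N (the number of relevant documents among the top N results, divided by N) can be computed as in the short sketch below; the function name and the toy ranking are illustrative.

```python
def precision_at_n(ranked_ids, relevant_ids, n):
    """Fraction of the top-n ranked documents that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_ids[:n]
    # Divide by n, not len(top): a short result list is penalized.
    return sum(1 for doc_id in top if doc_id in relevant_ids) / n


ranked = ["d2", "d1", "d3", "d4", "d5"]
relevant = {"d2", "d3", "d9"}
p_at_5 = precision_at_n(ranked, relevant, 5)
p_at_1 = precision_at_n(ranked, relevant, 1)
```

Here two of the top five documents are relevant, so `p_at_5` is 0.4, while `p_at_1` is 1.0 because the first document is relevant.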

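The dataset is indexed with the language-model retrieval toolkit Lemur^{[29]}, using its built-in Krovetz stemmer and removing 418 stop words according to a standard stop-word list. The algorithms are implemented in Java. The initial retrieval uses the query likelihood model, and the re-ranked results are compared against it.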

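4.2 Language Models of Subject Terms

The language model of a subject term reflects the strength of the semantic association between the subject term and ordinary terms and quantifies it as probabilities. It is computed by treating the subset of documents indexed with the subject term as the textual representation of that term. Table 1 shows the ten highest-weighted terms in the language models of two subject terms, "Liver Circulation" and "Ascitic Fluid", under the heuristic method and under maximum likelihood estimation.

Table 1 Examples of subject-term language models

Both methods assign high probabilities to the words contained in the subject term itself, but the heuristic method lowers the probability of high-frequency words and raises that of terms with a strong semantic association to the subject term. For "Liver Circulation" it finds terms closely related to liver circulation such as "vein", "cirrhosis" and "pressure", and for "Ascitic Fluid" it finds related terms such as "peritoneal", while terms that are frequent across the dataset, such as "patient", "article" and "journal", receive low weights.

This analysis shows that the heuristic method captures the semantic association between ordinary terms and subject terms well and expresses its strength as a probability. Computing this association strength correctly lays a sound foundation for weighting subject terms in documents and directly affects the final results.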

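4.3 Subject-term Weights

Subject-term weighting attempts to distinguish how relevant different subject terms are to the topic of a document. Figure 2 lists the weights of all subject terms of the document with PubMed ID 90149307. In this example, the major subject terms marked with "*" receive high weights, whereas subject terms that are widespread in the corpus, such as "Human" and "Child", receive low weights. "Great Britain" also receives a high weight, but because it is a geographic term it was not designated a major subject term of the document. The weighting method thus agrees to a fair degree with the way manual indexing distinguishes the status of subject terms and separates the weights of different subject terms well, which supports the subsequent re-ranking computation.

Figure 2 Example of subject-term weights in a document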

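4.4 Re-ranking Performance

The experiments give the best results with $N=5$ and 44 subject terms and with $N=6$ and 44 subject terms. Table 2 compares these two runs, Re_5_44 and Re_6_44, with the query likelihood model (QLH). On all four measures, P@5, P@10, P@15 and P@20, both re-ranked runs generally outperform the query likelihood ranking; P@10 for Re_5_44 and P@15 for Re_6_44 improve by more than 8%. A Wilcoxon test of significance shows that the P@15 of Re_6_44 is significantly higher than that of the initial retrieval. These results indicate that the proposed re-ranking method improves retrieval precision.

Table 2 Re-ranking results compared with the query likelihood model

To examine individual queries more closely, Figure 3 plots, for all 106 queries, the relative change in P@15 of the re-ranked results with respect to the initial retrieval, with $N=6$ and 44 subject terms. More queries improve in P@15 (51) than decline (30), most of the improvements exceed 50%, and the remaining 25 queries are unchanged. Re-ranking therefore raises precision for nearly half of the queries and lowers it for well under a third, so its effect on precision is broadly positive.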
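The per-query comparison described above amounts to counting wins, losses and ties and computing a relative change for each query. A small sketch follows; the function name and the P@15 values are invented for illustration and are not the paper's data.

```python
def compare_runs(baseline, reranked):
    """Compare per-query precision of a re-ranked run against a baseline.

    baseline, reranked: {query_id: precision value} over the same queries.
    Returns (improved, declined, unchanged, {query_id: relative change}).
    The relative change is None when the baseline precision is zero.
    """
    improved = declined = unchanged = 0
    changes = {}
    for query_id, base in baseline.items():
        new = reranked[query_id]
        if new > base:
            improved += 1
        elif new < base:
            declined += 1
        else:
            unchanged += 1
        changes[query_id] = (new - base) / base if base > 0 else None
    return improved, declined, unchanged, changes


# Invented per-query P@15 values for four queries.
base_run = {"q1": 0.2, "q2": 0.4, "q3": 0.0, "q4": 0.6}
new_run = {"q1": 0.4, "q2": 0.2, "q3": 0.0, "q4": 0.6}
improved, declined, unchanged, changes = compare_runs(base_run, new_run)
```

Paired per-query values of this kind are also what a Wilcoxon signed-rank test takes as input.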

Figure 3 Relative change in P@15 for the 106 queries

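5 Conclusion and Discussion

This paper uses the subject terms assigned to documents and, within the language modeling framework, mines the pseudo-relevance feedback documents to express the user query as a subject-term language model. The document-level co-occurrence of subject terms and ordinary terms reveals the links between them, from which the topical relevance weight of each subject term for a document is determined, distinguishing the contributions of different subject terms to the document's topic. Combining the two, the initial results are re-ranked and retrieval performance improves. The method and experiments prompt the following observations:

(1) The method is not limited to subject indexing. In social tagging environments, the tags that users assign to information resources also reveal topical concepts to some extent. Adapted and optimized for the characteristics of such environments, the method could be applied to retrieval there;

(2) The method is a formal language modeling approach that does not depend on a particular language and applies to English, Chinese and other languages. Studying the relations between concepts and terms and between concepts and texts in multilingual and cross-lingual settings could further extend its application;

(3) The number of pseudo-relevance feedback documents $N$ and the number of subject terms used to represent the user's information need are the key parameters of the method. Their values can be estimated from the number of relevant documents in the dataset. If $N$ is too large, many non-relevant documents and subject terms are introduced, producing substantial noise; if it is too small, the model is not estimated precisely enough. In practice, a training set can be sampled from the whole collection to tune these parameters;

(4) This paper uses weighted mutual information to compute the relation between subject terms and documents; whether other, more suitable methods exist will be explored in future work;

(5) The method uses subject terms to improve retrieval, and its core is quantifying the weights of subject terms in documents. Whether subject indexing can be applied to retrieval processes other than re-ranking, and in particular whether quantified subject-term weights can improve retrieval more broadly, is one direction for future research.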

References

[1]	Furnas G W, Landauer T K, Gomez L M, et al. The Vocabulary Problem in Human-system Communication[J]. Communications of the ACM, 1987, 30(11): 964-971.
[2]	PubMed[EB/OL]. [2013-12-09]. http://www.ncbi.nlm.nih.gov/pubmed/.
[3]	Lu Z Y, Kim W, Wilbur W J. Evaluation of Query Expansion Using MeSH in PubMed[J]. Information Retrieval, 2009, 12(1): 69-80.
[4]	Shin K, Han S Y. Improving Information Retrieval in MEDLINE by Modulating MeSH Term Weights[C]. In: Proceedings of the 9th International Conference on Applications of Natural Languages to Information Systems, NLDB 2004, Salford, UK. Berlin: Springer, 2004: 388-394.
[5]	Jalali V, Borujerdi M R M. Information Retrieval with Concept-based Pseudo-relevance Feedback in MEDLINE[J]. Knowledge and Information Systems, 2011, 29(1): 237-248.
[6]	Meij E, De Rijke M. Integrating Conceptual Knowledge into Relevance Models: A Model and Estimation Method[C]. In: Proceedings of International Conference on the Theory of Information Retrieval (ICTIR 2007). 2007.
[7]	Meij E, Trieschnigg D, De Rijke M, et al. Conceptual Language Models for Domain-specific Retrieval[J]. Information Processing and Management, 2010, 46(4): 448-469.
[8]	Croft W B. What do People Want from Information Retrieval[J]. D-Lib Magazine, 1995, 1(5). http://www.dlib.org/dlib/november95/11croft.html.
[9]	Krestel R, Fankhauser P. Reranking Web Search Results for Diversity[J]. Information Retrieval, 2012, 15(5): 458-477.
[10]	Santos R L, Macdonald C, Ounis I. On the Role of Novelty for Search Result Diversification[J]. Information Retrieval, 2012, 15(5): 478-502.
[11]	Yan X, Li X, Song D. Document Re-ranking by Generality in Bio-medical Information Retrieval[A]. // Web Information Systems Engineering-WISE 2005[M]. New York: Springer, 2005: 376-389.
[12]	Yin X, Huang X, Li Z. Towards a Better Ranking for Biomedical Information Retrieval Using Context[C]. In: Proceedings of 2009 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2009), Washington, DC, USA. Washington, DC: IEEE, 2009: 344-349.
[13]	Sakai T, Manabe T, Koyama M. Flexible Pseudo-relevance Feedback via Selective Sampling[J]. ACM Transactions on Asian Language Information Processing, 2005, 4(2): 111-135.
[14]	Zhou Bo, Cen Rongwei, Liu Yiqun, et al. A Document Relevance Based Search Result Re-Ranking[J]. Journal of Chinese Information Processing, 2010, 24(3): 19-23, 36. (in Chinese)
[15]	Yuan Fuyong, Guo Lina, Mao Weiwei. Re-ranking Algorithm Based on the Inter-Documents Comparison[J]. New Technology of Library and Information Service, 2009(11): 49-52. (in Chinese)
[16]	Diaz F. Regularizing Query-based Retrieval Scores[J]. Information Retrieval, 2007, 10(6): 531-562.
[17]	Kurland O. Re-ranking Search Results Using Language Models of Query-specific Clusters[J]. Information Retrieval, 2009, 12(4): 437-460.
[18]	Croft W B, Metzler D, Strohman T. Search Engines: Information Retrieval in Practice[M]. Reading, MA: Addison-Wesley, 2010.
[19]	Kamps J. Improving Retrieval Effectiveness by Reranking Documents Based on Controlled Vocabulary[A]. // Advances in Information Retrieval[M]. Berlin: Springer, 2004: 283-295.
[20]	Manning C D, Raghavan P, Schütze H. Introduction to Information Retrieval[M]. Cambridge: Cambridge University Press, 2008.
[21]	PubMed Tutorial[EB/OL]. [2013-07-28]. http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/015_030.html.
[22]	Kent A, Lancour H, Daily J E. Encyclopedia of Library and Information Science[M]. Boca Raton: CRC Press, 1978.
[23]	Zhang H, Smith L C, Twidale M, et al. Seeing the Wood for the Trees: Enhancing Metadata Subject Elements with Weights[J]. Information Technology and Libraries, 2011, 30(2): 75-80.
[24]	Wolfram D, Zhang J. The Influence of Indexing Practices and Weighting Algorithms on Document Spaces[J]. Journal of the American Society for Information Science and Technology, 2008, 59(1): 3-11.
[25]	Moens M F. Automatic Indexing and Abstracting of Document Texts[M]. Berlin: Springer, 2000.
[26]	Chung E, Miksa S, Hastings S K. A Framework of Automatic Subject Term Assignment for Text Categorization: An Indexing Conception-based Approach[J]. Journal of the American Society for Information Science and Technology, 2010, 61(4): 688-699.
[27]	Lu K, Mao J. Automatically Infer Subject Terms and Documents Associations Through Text Mining[C]. In: Proceedings of the 76th Annual Conference of Association for Information Science and Technology (ASIST 2013). Montreal: ASIS&T, 2013.
[28]	OHSUMED Test Collection[EB/OL]. [2012-12-01]. http://ir.ohsu.edu/ohsumed/ohsumed.html.
[29]	The Lemur Project[EB/OL]. [2012-10-13]. http://www.lemur-project.org/.
