Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 77-87    DOI: 10.11925/infotech.2096-3467.2019.0301
Research Paper
Cross-Language Information Retrieval Based on Weighted Association Patterns and Rule Consequent Expansion *
Mingxuan Huang1,2,3, Shoudong Lu3, Hui Xu3
1 Guangxi (ASEAN) Financial Research Center, Guangxi University of Finance and Economics, Nanning 530003, China
2 Guangxi Key Laboratory of Cross-border E-commerce Intelligent Information Processing, Guangxi University of Finance and Economics, Nanning 530003, China
3 School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530003, China
Abstract

[Objective] This paper proposes a cross-language information retrieval (CLIR) model and algorithm based on weighted association pattern mining and rule consequent expansion, aiming to address query topic drift and word mismatch in natural language processing. [Methods] The model mines frequent itemsets using a new weighted association pattern support measure and an itemset pruning strategy based on the maximum item weight. It then evaluates the weighted association rules by confidence and relevance, and extracts high-quality expansion terms from the rules according to the expansion model, which realizes rule consequent expansion. Finally, the expansion terms are combined with the original query terms into a new query, which is used to retrieve documents again and obtain the final results. [Results] Compared with the monolingual retrieval baseline, the average increases in R-prec and P@10 of the proposed model were 42.49% and 25.53%, respectively. Compared with the cross-language retrieval baseline, they were 91.87% and 64.61%. Compared with existing CLIR methods based on weighted association rule mining, the maximum average increases in R-prec and P@10 reached 93.20% and 34.60%. [Limitations] The study is experimental only; the application of the model in real cross-language search engines remains to be explored. [Conclusions] The proposed model effectively reduces query topic drift and word mismatch and improves retrieval performance.

Key words: Information Retrieval; Cross Language Retrieval; Text Mining; Association Rule; Natural Language Processing
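The expansion step described in the abstract can be illustrated with a small Python sketch. This page does not give the paper's support formula, its pruning rule based on maximum item weight, or its relevance measure, so none of them are reproduced here. The `weighted_support` definition (mean of the smallest item weight per document), the function names, the thresholds, and the toy documents are all assumptions made for illustration only.

```python
# Toy pseudo-relevance-feedback documents: term -> weight. All values are made up.
DOCS = [
    {"earthquak": 0.9, "relief": 0.8, "gujarat": 0.6},
    {"earthquak": 0.7, "relief": 0.5, "india": 0.4},
    {"earthquak": 0.8, "rescu": 0.6, "relief": 0.7},
    {"cattl": 0.9, "beef": 0.5},
]


def weighted_support(itemset, docs):
    """ASSUMED measure (not the paper's): mean over all documents of the
    smallest item weight, counting 0 for documents missing any item."""
    total = 0.0
    for doc in docs:
        if all(term in doc for term in itemset):
            total += min(doc[term] for term in itemset)
    return total / len(docs)


def expand_query(query_terms, docs, min_sup, min_conf, top_k=10):
    """Return the original query terms followed by expansion terms taken
    from the consequents of rules of the form: query -> term."""
    query = tuple(query_terms)
    sup_query = weighted_support(query, docs)
    if sup_query == 0:
        return list(query)  # the query never occurs, so no rule can be formed
    candidates = {term for doc in docs for term in doc} - set(query)
    scored = []
    for term in candidates:
        sup_rule = weighted_support(query + (term,), docs)
        if sup_rule < min_sup:
            continue  # infrequent pattern, discarded
        confidence = sup_rule / sup_query
        if confidence >= min_conf:
            scored.append((confidence, term))
    # Highest confidence first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return list(query) + [term for _, term in scored[:top_k]]


new_query = expand_query(["earthquak"], DOCS, min_sup=0.12, min_conf=0.2)
print(new_query)
```

In this toy run, "india" is dropped by the support threshold and the unrelated terms "cattl" and "beef" never co-occur with the query, which is the kind of filtering the model relies on to limit topic drift.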
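Received: 2019-03-22
CLC Number: TP393 G35
Funding: *This work was supported by the National Natural Science Foundation of China project "Research on ASEAN Cross-Language Query Expansion Based on Deep Learning and Transfer Learning" (Grant No. 61762006); the open project of the Guangxi First-Class Discipline (Cultivation) of Applied Economics, "Research on China-ASEAN Trade and Business Data Mining and Its Applications" (Grant No. 2018MA07); and the open project of the Guangxi (ASEAN) Financial Research Center, "Research on Association Pattern Mining of ASEAN Financial Text Big Data and Its Cross-Language Retrieval" (Grant No. 2018DMCJYB08).
Cite this article:
Mingxuan Huang, Shoudong Lu, Hui Xu. Cross-Language Information Retrieval Based on Weighted Association Patterns and Rule Consequent Expansion[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 77-87. DOI: 10.11925/infotech.2096-3467.2019.0301.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0301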
Figure 1  Cross-language information retrieval model based on weighted association pattern mining and rule consequent expansion
Table 1  R-prec values of the retrieval algorithms
Algorithm       Relax                     Rigid
                mdn0    mdn1    kt1       mdn0    mdn1    kt1
MR              0.3649  0.4470  0.2426    0.3331  0.3682  0.2399
CLR             0.2566  0.3884  0.2042    0.1943  0.3159  0.1754
CLIR_AWAR       0.6103  0.5674  0.4083    0.4982  0.3985  0.3437
CLIR_WAR        0.5894  0.5423  0.3702    0.4787  0.3674  0.3104
CLIR_AWPNAR     0.3960  0.3518  0.1385    0.3216  0.2544  0.1327
CLIR_WAPMRCE    0.6155  0.5934  0.3801    0.5023  0.4121  0.3218
Table 2  P@10 values of the retrieval algorithms
Algorithm       Relax                     Rigid
                mdn0    mdn1    kt1       mdn0    mdn1    kt1
MR              0.1800  0.2524  0.2655    0.1480  0.1667  0.2069
CLR             0.1640  0.2190  0.1552    0.1400  0.1429  0.1138
CLIR_AWAR       0.2216  0.2352  0.2524    0.1832  0.1638  0.2021
CLIR_WAR        0.2168  0.2400  0.2462    0.1808  0.1752  0.1966
CLIR_AWPNAR     0.2144  0.2076  0.2048    0.1720  0.1514  0.1710
CLIR_WAPMRCE    0.2544  0.3067  0.2952    0.2024  0.2124  0.2379
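The page does not state how the "average increase" figures in the abstract are derived from Tables 1 and 2. A plausible reading, sketched below in Python, is the mean of the per-column relative gains of CLIR_WAPMRCE over a baseline across the six dataset/judgment columns. That formula is an assumption, but it appears to reproduce the reported 42.49% and 25.53% (versus MR) and 91.87% and 64.61% (versus CLR). The names `R_PREC`, `P_AT_10`, and `average_increase` are illustrative.

```python
# Values copied from Tables 1 and 2.
# Column order: Relax mdn0, mdn1, kt1, then Rigid mdn0, mdn1, kt1.
R_PREC = {
    "MR": [0.3649, 0.4470, 0.2426, 0.3331, 0.3682, 0.2399],
    "CLR": [0.2566, 0.3884, 0.2042, 0.1943, 0.3159, 0.1754],
    "CLIR_WAPMRCE": [0.6155, 0.5934, 0.3801, 0.5023, 0.4121, 0.3218],
}
P_AT_10 = {
    "MR": [0.1800, 0.2524, 0.2655, 0.1480, 0.1667, 0.2069],
    "CLR": [0.1640, 0.2190, 0.1552, 0.1400, 0.1429, 0.1138],
    "CLIR_WAPMRCE": [0.2544, 0.3067, 0.2952, 0.2024, 0.2124, 0.2379],
}


def average_increase(ours, baseline):
    """Mean relative gain in percent (ASSUMED formula, not stated on the page)."""
    gains = [(o - b) / b * 100.0 for o, b in zip(ours, baseline)]
    return sum(gains) / len(gains)


for metric, table in (("R-prec", R_PREC), ("P@10", P_AT_10)):
    for baseline in ("MR", "CLR"):
        gain = average_increase(table["CLIR_WAPMRCE"], table[baseline])
        print(f"{metric} vs {baseline}: {gain:.2f}%")
```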
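Figure 2  Retrieval results under different support thresholds
Figure 3  Retrieval results under different confidence thresholds
Figure 4  Retrieval results of the proposed algorithm on different datasets
(Note: m0, m1, and k1 denote the mdn0, mdn1, and kt1 datasets, respectively; the suffix "e" stands for Relax and the suffix "i" for Rigid.)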
Table 3  Example queries and their expansion terms
Query                                     No.9 query topic (Title)                 No.22 query topic (Title)
Indonesian version                        Gempa bumi, pertolongan Internasional    Penyakit sapi gila
English version (machine translation)     The earthquake, International aid        Mad cow disease
English version (NTCIR-5 corpus)          Earthquakes, International rescue        mad cow disease

Expansion term stems (CLIR_WAPMRCE algorithm), No.9:
relief(1.000), ken(0.868), hit(0.740), gujarat(0.732),
quak(0.697), hardest(0.618), india(0.607), govern(0.581),
arriv(0.477), damage(0.463), bhuj(0.451), central(0.436),
reach(0.422), American(0.418), bhachua(0.394),
lentil(0.386), office(0.351), state(0.344), set(0.316),
unit(0.175), intern(0.171), rescu(0.162), million(0.156),
team(0.129), Washington(0.127), Ahmedabad(0.032),
maclean(0.029), epicent(0.028), berger(0.027),
carton(0.027), clog(0.025), baltimor(0.021), feb(0.021),
boucher(0.016), rubbl(0.015), purify(0.014),
wrench(0.013), estimate(0.012), cremat(0.011),
flatten(0.010), Turkish(0.008), chunk(0.008), toll(0.006),
bundl(0.005), blanket(0.004), homeless(0.003),
heap(0.003), freight(0.003)

Expansion term stems (CLIR_WAPMRCE algorithm), No.22:
bse(1.000), encephalopathy(0.937), spongiform(0.915),
bovin(0.904), human(0.848), food(0.818), beef(0.800),
part(0.669), agricultur(0.511), feed(0.510), meat(0.505),
anim(0.475), effort(0.363), measure(0.323),
ministry(0.321), cattl(0.309), confirm(0.253), infect(0.248),
European(0.240), contamin(0.233), spread(0.229),
brain(0.211), scare(0.211), ban(0.200)
Table 4  Retrieval performance for the example queries (Title)
Query No.   Algorithm       Relax               Rigid
                            P@10    R-prec      P@10    R-prec
No.9        MR              0.6000  0.1525      0.6000  0.1525
            CLR             0.2000  0.0865      0.2000  0.0865
            CLIR_WAPMRCE    0.7000  0.2432      0.7000  0.2432
No.22       MR              0.5000  0.3182      0.3000  0.2273
            CLR             0.3000  0.2333      0.3000  0.1667
            CLIR_WAPMRCE    0.5000  0.4375      0.3000  0.3125
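For readers unfamiliar with the two measures reported in Table 4, the following Python sketch shows their standard definitions: P@10 is the fraction of the top 10 retrieved documents that are relevant, and R-prec is the precision at rank R, where R is the number of relevant documents for the query. The ranking and relevance judgments below are invented for illustration and are not taken from the paper's data.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranking[:r] if doc_id in relevant)
    return hits / r


# Hypothetical ranked result list and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2", "d9", "d4", "d8", "d5", "d6", "d0"]
relevant = {"d1", "d2", "d5", "d11"}

print(precision_at_k(ranking, relevant))  # 3 of the top 10 are relevant
print(r_precision(ranking, relevant))     # 2 of the top 4 are relevant
```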