Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 77-87    DOI: 10.11925/infotech.2096-3467.2019.0301
Cross-Language Information Retrieval Based on Weighted Association Patterns and Rule Consequent Expansion
Mingxuan Huang 1,2,3, Shoudong Lu 3, Hui Xu 3
1 Guangxi (ASEAN) Financial Research Center, Guangxi University of Finance and Economics, Nanning 530003, China
2 Guangxi Key Laboratory of Cross-border E-commerce Intelligent Information Processing, Guangxi University of Finance and Economics, Nanning 530003, China
3 School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530003, China
Abstract  

[Objective] This paper proposes a new Cross-Language Information Retrieval (CLIR) model to address query topic drift and word mismatch, two long-standing problems in natural language processing. [Methods] First, we mined frequent itemsets using weighted association patterns and pruning strategies based on maximum item weight. Then, we used confidence and relevance degrees to evaluate the weighted association rules and extracted high-quality expansion terms from them. Finally, we combined the expansion terms with the original query terms to form new queries and retrieve the final result lists. [Results] Compared with the monolingual retrieval benchmark, the average increases (AIs) of R-prec and P@10 of the proposed model were 42.49% and 25.53%. Compared with the cross-language retrieval benchmark, they were 91.87% and 64.61%. Compared with existing CLIR methods, the maximum AIs of R-prec and P@10 were 93.20% and 34.60%. [Limitations] The proposed model needs to be examined with more cross-language search engines. [Conclusions] Our model improves the performance of CLIR.
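
The expansion step described in [Methods] can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the weighted support formula, the restriction to two-term itemsets, the thresholds, and the names weighted_support and expand_query are all assumptions made for illustration, and the maximum-item-weight pruning is omitted. Each document is a mapping from term stem to weight, and each rule has a query term as antecedent and a candidate expansion term as consequent.

```python
def weighted_support(itemset, docs):
    """Assumed definition: for every document containing all terms of the
    itemset, take the mean weight of those terms; sum and divide by the
    number of documents."""
    total = 0.0
    for doc in docs:
        if all(term in doc for term in itemset):
            total += sum(doc[term] for term in itemset) / len(itemset)
    return total / len(docs)


def expand_query(query_terms, docs, min_support=0.1, min_conf=0.5):
    """Return (term, confidence) pairs for rules query_term -> term that
    pass both thresholds, best first. Thresholds are illustrative."""
    candidates = {t for doc in docs for t in doc} - set(query_terms)
    scores = {}
    for q in query_terms:
        sup_q = weighted_support((q,), docs)
        if sup_q == 0:
            continue  # query term absent from the feedback documents
        for t in candidates:
            sup_qt = weighted_support((q, t), docs)
            if sup_qt < min_support:
                continue  # itemset {q, t} is not frequent
            conf = sup_qt / sup_q  # confidence of the rule q -> t
            if conf >= min_conf:
                scores[t] = max(scores.get(t, 0.0), conf)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy feedback documents (made-up weights, not data from the paper).
docs = [
    {"earthquak": 0.9, "relief": 0.8, "india": 0.5},
    {"earthquak": 0.7, "relief": 0.6},
    {"earthquak": 0.8, "cricket": 0.9},
    {"cricket": 0.7, "india": 0.4},
]
print(expand_query(["earthquak"], docs))
```

In this toy run only "relief" survives both thresholds, because it co-occurs with the query term in two of the three documents that contain it.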

Key words: Information Retrieval; Cross Language Retrieval; Text Mining; Association Rule; Natural Language Processing
Received: 22 March 2019      Published: 23 October 2019
ZTFLH:  TP393 G35  

Cite this article:

Mingxuan Huang,Shoudong Lu,Hui Xu. Cross-Language Information Retrieval Based on Weighted Association Patterns and Rule Consequent Expansion. Data Analysis and Knowledge Discovery, 2019, 3(9): 77-87.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0301     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I9/77

Retrieval algorithm   Relax                       Rigid
                      mdn0     mdn1     kt1       mdn0     mdn1     kt1
MR                    0.3649   0.4470   0.2426    0.3331   0.3682   0.2399
CLR                   0.2566   0.3884   0.2042    0.1943   0.3159   0.1754
CLIR_AWAR             0.6103   0.5674   0.4083    0.4982   0.3985   0.3437
CLIR_WAR              0.5894   0.5423   0.3702    0.4787   0.3674   0.3104
CLIR_AWPNAR           0.3960   0.3518   0.1385    0.3216   0.2544   0.1327
CLIR_WAPMRCE          0.6155   0.5934   0.3801    0.5023   0.4121   0.3218

Retrieval algorithm   Relax                       Rigid
                      mdn0     mdn1     kt1       mdn0     mdn1     kt1
MR                    0.1800   0.2524   0.2655    0.1480   0.1667   0.2069
CLR                   0.1640   0.2190   0.1552    0.1400   0.1429   0.1138
CLIR_AWAR             0.2216   0.2352   0.2524    0.1832   0.1638   0.2021
CLIR_WAR              0.2168   0.2400   0.2462    0.1808   0.1752   0.1966
CLIR_AWPNAR           0.2144   0.2076   0.2048    0.1720   0.1514   0.1710
CLIR_WAPMRCE          0.2544   0.3067   0.2952    0.2024   0.2124   0.2379

Query                                          No.9 query topic (Title)                No.22 query topic (Title)
Indonesian version                             Gempa bumi, pertolongan Internasional   Penyakit sapi gila
English version (machine translation result)   The earthquake, International aid       Mad cow disease
English version (NTCIR-5 corpus)               Earthquakes, International rescue       mad cow disease

Expansion term stems (CLIR_WAPMRCE algorithm), query No.9:
relief(1.000), ken(0.868), hit(0.740), gujarat(0.732),
quak(0.697), hardest(0.618), india(0.607), govern(0.581),
arriv(0.477), damage(0.463), bhuj(0.451), central(0.436),
reach(0.422), American(0.418), bhachua(0.394),
lentil(0.386), office(0.351), state(0.344), set(0.316),
unit(0.175), intern(0.171), rescu(0.162), million(0.156),
team(0.129), Washington(0.127), Ahmedabad(0.032),
maclean(0.029), epicent(0.028), berger(0.027),
carton(0.027), clog(0.025), baltimor(0.021), feb(0.021),
boucher(0.016), rubbl(0.015), purify(0.014),
wrench(0.013), estimate(0.012), cremat(0.011),
flatten(0.010), Turkish(0.008), chunk(0.008), toll(0.006),
bundl(0.005), blanket(0.004), homeless(0.003),
heap(0.003), freight(0.003)

Expansion term stems (CLIR_WAPMRCE algorithm), query No.22:
bse(1.000), encephalopathy(0.937), spongiform(0.915),
bovin(0.904), human(0.848), food(0.818), beef(0.800),
part(0.669), agricultur(0.511), feed(0.510), meat(0.505),
anim(0.475), effort(0.363), measure(0.323),
ministry(0.321), cattl(0.309), confirm(0.253), infect(0.248),
European(0.240), contamin(0.233), spread(0.229),
brain(0.211), scare(0.211), ban(0.200)

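
The stem(weight) lists above can be read as weighted term vectors. The following Python sketch parses such a list and merges it with the original query terms. How the paper actually combines the two is not stated on this page, so the weight of 1.0 for original terms, the min_weight cut-off, and the function names are assumptions for illustration only.

```python
import re


def parse_expansion_terms(text):
    """Parse 'stem(weight)' items, e.g. 'bse(1.000), ban(0.200)',
    into a {stem: weight} dictionary."""
    pattern = re.compile(r"([A-Za-z]+)\((\d+\.\d+)\)")
    return {m.group(1): float(m.group(2)) for m in pattern.finditer(text)}


def build_expanded_query(original_terms, expansion, min_weight=0.2,
                         original_weight=1.0):
    """Merge original terms (assumed weight 1.0) with expansion terms whose
    weight reaches min_weight; both values are hypothetical settings."""
    query = {term: original_weight for term in original_terms}
    for term, weight in expansion.items():
        if weight >= min_weight and term not in query:
            query[term] = weight
    return query


# A few stems copied from the query No.22 list.
stems = "bse(1.000), encephalopathy(0.937), spongiform(0.915), scare(0.211), ban(0.200)"
expansion = parse_expansion_terms(stems)
print(build_expanded_query(["mad", "cow", "diseas"], expansion, min_weight=0.21))
```

Raising min_weight trims the low-weight tail of the list, which is where the less topical stems of query No.9 (for example "carton" or "wrench") appear.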
Query No.   Retrieval algorithm   Relax               Rigid
                                  P@10     R-prec     P@10     R-prec
No.9        MR                    0.6000   0.1525     0.6000   0.1525
            CLR                   0.2000   0.0865     0.2000   0.0865
            CLIR_WAPMRCE          0.7000   0.2432     0.7000   0.2432
No.22       MR                    0.5000   0.3182     0.3000   0.2273
            CLR                   0.3000   0.2333     0.3000   0.1667
            CLIR_WAPMRCE          0.5000   0.4375     0.3000   0.3125

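
For readers unfamiliar with the two measures, the Python sketch below computes P@10 and R-prec from a ranked list using their standard definitions, and a relative increase between two scores. The relative-increase formula is an assumption about how a percentage gain over a baseline is obtained; the page does not define the paper's "average increase". The document identifiers are made up.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


def relative_increase(new, base):
    """Percentage gain of `new` over `base` (assumed definition)."""
    return (new - base) / base * 100.0


# Hypothetical ranking and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d6", "d5", "d0"]
relevant = {"d3", "d1", "d2", "d5"}
print(precision_at_k(ranked, relevant))  # 4 relevant in the top 10
print(r_precision(ranked, relevant))     # 2 relevant in the top 4

# Query No.9 from the table: CLIR_WAPMRCE against the CLR baseline.
print(round(relative_increase(0.7000, 0.2000), 2))  # P@10
print(round(relative_increase(0.2432, 0.0865), 2))  # R-prec
```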