Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 77-87    DOI: 10.11925/infotech.2096-3467.2019.0301
Cross-Language Information Retrieval Based on Weighted Association Patterns and Rule Consequent Expansion
Mingxuan Huang1,2,3, Shoudong Lu3, Hui Xu3
1 Guangxi (ASEAN) Financial Research Center, Guangxi University of Finance and Economics, Nanning 530003, China
2 Guangxi Key Laboratory of Cross-border E-commerce Intelligent Information Processing, Guangxi University of Finance and Economics, Nanning 530003, China
3 School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530003, China
[Objective] This paper proposes a new Cross-Language Information Retrieval (CLIR) model to address issues facing natural language processing, such as query topic drift and word mismatch. [Methods] First, we mined frequent itemsets using weighted association patterns and pruning strategies based on maximum item weight. Then, we evaluated the weighted association rules by their confidence and relevance degrees to extract high-quality expansion terms. Finally, we combined the expansion terms with the original query terms to form new queries and retrieve the final result lists. [Results] Compared with the monolingual retrieval benchmark, the average increases (AIs) of R-prec and P@10 for the proposed model were 42.49% and 25.53%, respectively. Compared with the cross-language retrieval benchmark, the AIs were 91.87% and 64.61%. Compared with existing CLIR methods, the maximum AIs of R-prec and P@10 were 93.20% and 34.60%. [Limitations] The proposed model needs to be examined with more cross-language search engines. [Conclusions] The proposed model improves CLIR performance.

Key words: Information Retrieval; Cross Language Retrieval; Text Mining; Association Rule; Natural Language Processing
Received: 22 March 2019; Published: 23 October 2019
Chinese Library Classification (ZTFLH): TP393, G35

Cite this article: Mingxuan Huang, Shoudong Lu, Hui Xu. Cross-Language Information Retrieval Based on Weighted Association Patterns and Rule Consequent Expansion. Data Analysis and Knowledge Discovery, 2019, 3(9): 77-87.
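The [Methods] steps in the abstract (mine weighted association rules, score them, then append rule consequents to the query) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not define its weighted support, its pruning by maximum item weight, or its relevance degree, so the `weighted_support` formula, the thresholds, the function names, and the toy documents below are all assumptions made for illustration, and only rules with a single query term as antecedent are considered.

```python
def weighted_support(itemset, docs):
    """Assumed definition (not from the paper): sum, over documents that
    contain every item, of the mean weight of those items, divided by the
    number of documents."""
    total = 0.0
    for doc in docs:
        if all(term in doc for term in itemset):
            total += sum(doc[term] for term in itemset) / len(itemset)
    return total / len(docs)


def expand_query(query_terms, docs, min_support=0.15, min_confidence=0.25,
                 max_terms=5):
    """Return the query terms followed by consequents of rules
    {query term} -> {candidate} that pass both (hypothetical) thresholds."""
    candidates = {t for doc in docs for t in doc} - set(query_terms)
    scored = {}
    for q in query_terms:
        antecedent_support = weighted_support([q], docs)
        if antecedent_support == 0:
            continue  # query term never occurs, so no rule can be formed
        for t in candidates:
            support = weighted_support([q, t], docs)
            confidence = support / antecedent_support
            if support >= min_support and confidence >= min_confidence:
                # keep the best confidence seen for each candidate term
                scored[t] = max(scored.get(t, 0.0), confidence)
    # highest confidence first; ties broken alphabetically
    ranked = sorted(scored, key=lambda t: (-scored[t], t))
    return list(query_terms) + ranked[:max_terms]


# Toy first-pass documents as term -> weight maps (invented values).
docs = [
    {"earthquak": 0.9, "relief": 0.8, "india": 0.5},
    {"earthquak": 0.7, "relief": 0.6, "rescu": 0.4},
    {"earthquak": 0.8, "weather": 0.3},
    {"sport": 0.9, "weather": 0.5},
]
expanded = expand_query(["earthquak"], docs)
print(expanded)
```

On this toy data the expanded query is `['earthquak', 'relief', 'india']`: "relief" co-occurs with the query term in two documents with high weights, "india" narrowly passes both thresholds, and "rescu" and "weather" fall below the assumed minimum support.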

Retrieval algorithm   Relax                        Rigid
                      mdn0     mdn1     kt1        mdn0     mdn1     kt1
MR                    0.3649   0.4470   0.2426     0.3331   0.3682   0.2399
CLR                   0.2566   0.3884   0.2042     0.1943   0.3159   0.1754
CLIR_AWAR             0.6103   0.5674   0.4083     0.4982   0.3985   0.3437
CLIR_WAR              0.5894   0.5423   0.3702     0.4787   0.3674   0.3104
CLIR_AWPNAR           0.3960   0.3518   0.1385     0.3216   0.2544   0.1327
CLIR_WAPMRCE          0.6155   0.5934   0.3801     0.5023   0.4121   0.3218
Retrieval algorithm   Relax                        Rigid
                      mdn0     mdn1     kt1        mdn0     mdn1     kt1
MR                    0.1800   0.2524   0.2655     0.1480   0.1667   0.2069
CLR                   0.1640   0.2190   0.1552     0.1400   0.1429   0.1138
CLIR_AWAR             0.2216   0.2352   0.2524     0.1832   0.1638   0.2021
CLIR_WAR              0.2168   0.2400   0.2462     0.1808   0.1752   0.1966
CLIR_AWPNAR           0.2144   0.2076   0.2048     0.1720   0.1514   0.1710
CLIR_WAPMRCE          0.2544   0.3067   0.2952     0.2024   0.2124   0.2379
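The average increases (AIs) quoted in the abstract can be checked against the two tables above. The sketch below assumes an AI is the mean of the per-column relative gains of CLIR_WAPMRCE over a baseline across the six columns (Relax and Rigid, each on mdn0, mdn1, and kt1); the function name `average_increase` is ours. Under that assumption the first table yields the abstract's R-prec figures and the second its P@10 figures, which suggests that is what the two otherwise uncaptioned tables report.

```python
def average_increase(ours, baseline):
    """Mean relative gain in percent of `ours` over `baseline`, column by column.
    Assumed reading of the abstract's "average increase" (AI)."""
    gains = [(o - b) / b * 100 for o, b in zip(ours, baseline)]
    return sum(gains) / len(gains)


# Columns: Relax mdn0, mdn1, kt1, then Rigid mdn0, mdn1, kt1.
first_table = {  # appears to be R-prec (inferred from the abstract)
    "MR": [0.3649, 0.4470, 0.2426, 0.3331, 0.3682, 0.2399],
    "CLR": [0.2566, 0.3884, 0.2042, 0.1943, 0.3159, 0.1754],
    "CLIR_WAPMRCE": [0.6155, 0.5934, 0.3801, 0.5023, 0.4121, 0.3218],
}
second_table = {  # appears to be P@10 (inferred from the abstract)
    "MR": [0.1800, 0.2524, 0.2655, 0.1480, 0.1667, 0.2069],
    "CLR": [0.1640, 0.2190, 0.1552, 0.1400, 0.1429, 0.1138],
    "CLIR_WAPMRCE": [0.2544, 0.3067, 0.2952, 0.2024, 0.2124, 0.2379],
}

results = {}
for label, table in (("first table", first_table), ("second table", second_table)):
    for baseline in ("MR", "CLR"):
        ai = average_increase(table["CLIR_WAPMRCE"], table[baseline])
        results[(label, baseline)] = ai
        print(f"{label}, vs {baseline}: {ai:.2f}%")
```

The four printed values correspond to the 42.49% and 91.87% (first table) and 25.53% and 64.61% (second table) reported in the abstract.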
Query                                   No.9 query topic (Title)                 No.22 query topic (Title)
Indonesian version                      Gempa bumi, pertolongan Internasional    Penyakit sapi gila
English version (machine translation)   The earthquake, International aid        Mad cow disease
English version (NTCIR-5 corpus)        Earthquakes, International rescue        mad cow disease
Expansion terms (weight), No.9: relief(1.000), ken(0.868), hit(0.740), gujarat(0.732), quak(0.697), hardest(0.618), india(0.607), govern(0.581), arriv(0.477), damage(0.463), bhuj(0.451), central(0.436), reach(0.422), American(0.418), bhachua(0.394), lentil(0.386), office(0.351), state(0.344), set(0.316), unit(0.175), intern(0.171), rescu(0.162), million(0.156), team(0.129), Washington(0.127), Ahmedabad(0.032), maclean(0.029), epicent(0.028), berger(0.027), carton(0.027), clog(0.025), baltimor(0.021), feb(0.021), boucher(0.016), rubbl(0.015), purify(0.014), wrench(0.013), estimate(0.012), cremat(0.011), flatten(0.010), Turkish(0.008), chunk(0.008), toll(0.006), bundl(0.005), blanket(0.004), homeless(0.003), heap(0.003), freight(0.003)
Expansion terms (weight), No.22: bse(1.000), encephalopathy(0.937), spongiform(0.915), bovin(0.904), human(0.848), food(0.818), beef(0.800), part(0.669), agricultur(0.511), feed(0.510), meat(0.505), anim(0.475), effort(0.363), measure(0.323), ministry(0.321), cattl(0.309), confirm(0.253), infect(0.248), European(0.240), contamin(0.233), spread(0.229), brain(0.211), scare(0.211), ban(0.200)
Query   Retrieval algorithm   Relax               Rigid
                              P@10     R-prec     P@10     R-prec
No.9    MR                    0.6000   0.1525     0.6000   0.1525
No.9    CLR                   0.2000   0.0865     0.2000   0.0865
No.9    CLIR_WAPMRCE          0.7000   0.2432     0.7000   0.2432
No.22   MR                    0.5000   0.3182     0.3000   0.2273
No.22   CLR                   0.3000   0.2333     0.3000   0.1667
No.22   CLIR_WAPMRCE          0.5000   0.4375     0.3000   0.3125
