Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (6): 115-125    DOI: 10.11925/infotech.2096-3467.2020.1312
Expanding Queries Based on Word Embedding and Expansion Terms
Huang Mingxuan1,2, Jiang Caoqing1,2, Lu Shoudong2
1Guangxi Key Laboratory of Cross-border E-commerce Intelligent Information Processing, Guangxi University of Finance and Economics, Nanning 530003, China
2School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530003, China
Abstract  

[Objective] This paper proposes a query expansion model based on the intersection of word embedding and expansion terms, aiming to reduce word mismatch in information retrieval. [Methods] First, we trained word embeddings on the retrieved documents to obtain the Word Embedding Candidate Expansion Term set. Then, we mined association rules to generate the Mining Candidate Expansion Term set. Finally, we created the final expansion term set by merging the two candidate sets and used it to expand the queries. [Results] The MAP and P@5 of the proposed model were higher than those of the benchmarks. Compared with similar query expansion methods developed in recent years, the average increases in MAP and P@5 were 0.96%-31.24% and 1.07%-13.55%, respectively. [Limitations] The proposed model still needs to be evaluated in real-world information retrieval systems. [Conclusions] The proposed model improves the quality of expansion terms and the performance of information retrieval systems, and it reduces query topic drift and word mismatch.
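
As a rough illustration of the three steps in the abstract, the sketch below builds two candidate sets and combines them. Everything in it is an assumption made for illustration: the toy 2-d vectors stand in for trained word embeddings, a simple confidence filter stands in for the paper's association rule mining, all function and variable names are hypothetical, and the combination step is taken to be a set intersection, following the Objective sentence.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_candidates(query_terms, vectors, top_k):
    """Top-k non-query terms ranked by best cosine similarity to a query term."""
    known = [q for q in query_terms if q in vectors]
    if not known:
        return set()
    scores = {
        term: max(cosine(vec, vectors[q]) for q in known)
        for term, vec in vectors.items()
        if term not in query_terms
    }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:top_k])

def mining_candidates(query_terms, docs, min_conf):
    """Terms t with confidence(q -> t) >= min_conf for some query term q."""
    found = set()
    for q in query_terms:
        docs_with_q = [d for d in docs if q in d]
        if not docs_with_q:
            continue
        for t in set().union(*docs_with_q) - set(query_terms):
            conf = sum(1 for d in docs_with_q if t in d) / len(docs_with_q)
            if conf >= min_conf:
                found.add(t)
    return found

# Toy data (hypothetical): embeddings as 2-d tuples, documents as term sets.
vectors = {
    "retrieval": (1.0, 0.0),
    "search": (0.9, 0.1),
    "ranking": (0.8, 0.3),
    "banana": (0.0, 1.0),
}
docs = [
    {"retrieval", "search", "index"},
    {"retrieval", "search", "ranking"},
    {"retrieval", "index"},
    {"banana", "fruit"},
]
query = ["retrieval"]

we_set = embedding_candidates(query, vectors, top_k=2)      # embedding-based set
mining_set = mining_candidates(query, docs, min_conf=0.6)   # rule-based set
final_terms = we_set & mining_set                           # assumed: intersection
expanded_query = query + sorted(final_terms)
```

Taking the intersection keeps only terms supported by both signals, which is one plausible way to limit topic drift; a union would favor recall instead.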

Key words: Information Retrieval; Query Expansion; Text Mining; Deep Learning; Word Embedding
Received: 30 December 2020      Published: 06 July 2021
ZTFLH:  TP393  
Fund: National Natural Science Foundation of China (61762006)
Corresponding Authors: Huang Mingxuan     E-mail: mingxh05@163.com

Cite this article:

Huang Mingxuan,Jiang Caoqing,Lu Shoudong. Expanding Queries Based on Word Embedding and Expansion Terms. Data Analysis and Knowledge Discovery, 2021, 5(6): 115-125.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1312     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I6/115

Query Expansion Model in This Paper
Corpus    Abbreviation    Number of documents
edn2000   EN00            79,380
edn2001   EN01            93,467
ude2000   UE00            40,445
ude2001   UE01            51,851
mhn2000   MN00            84,437
mhn2001   MN01            85,302
udn2000   UN00            244,038
udn2001   UN01            222,526
Original Corpus and Number of Documents
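Comparison method: Description
ARG_QE (Query Expansion Based on Association Rules Graph): The rule-graph-based query expansion method of Ref. [9]. Experimental parameters: ms ∈ (0.1, 0.12, 0.13, 0.14, 0.15), mc = 0.3, minqEn = 10, minEEn = 2, Lift = 0.1, Jaccard = 0.45, IG = 0.2
WAP_QE (Query Expansion Based on Weighted Association Patterns): A query expansion method based on the weighted association pattern mining technique of Ref. [10]. Experimental parameters: ms ∈ (0.004, 0.005, 0.006, 0.007), mc = 0.1, mi = 0.0001
WPNP_QE (Query Expansion Based on Weighted Positive and Negative Patterns): A query expansion method based on the all-weighted positive and negative association pattern mining technique of Ref. [13]. Experimental parameters: ms ∈ (0.10, 0.11, 0.12, 0.13), mc = 0.1, α = 0.3, minPR = 0.1, minNR = 0.01
WMS_QE (Query Expansion Based on Weighted Multiple Supports): A query expansion method based on the weighted frequent pattern mining technique with multiple support thresholds of Ref. [15]. Experimental parameters: ms ∈ (0.1, 0.15, 0.2, 0.25), mc = 0.1, LMS = 0.2, HMS = 0.25, WT = 0.1
WE_QE (Query Expansion Based on Word Embedding): The word-embedding-based query expansion method of Ref. [21] (see Method 1 in that reference), with expansion term weights computed by Eq. (9) of Ref. [21]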
Comparison of Methods
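
The parameter lists above name ms and mc without defining them on this page. Assuming they denote the minimum support and minimum confidence thresholds common in association rule mining, the sketch below shows how a term rule could be checked against such thresholds. The transactions, helper names, and threshold values are hypothetical and are not taken from the compared methods.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every term in itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    s_a = support(antecedent, transactions)
    if s_a == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / s_a

def rule_passes(antecedent, consequent, transactions, ms, mc):
    """True if the rule meets both thresholds (ms, mc assumed as defined above)."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= ms
            and confidence(antecedent, consequent, transactions) >= mc)

# Hypothetical transactions: each document reduced to a set of terms.
transactions = [
    frozenset({"query", "expansion", "term"}),
    frozenset({"query", "expansion"}),
    frozenset({"query", "rule"}),
    frozenset({"rule", "mining"}),
]

s = support({"query", "expansion"}, transactions)          # 2 of 4 documents
c = confidence({"query"}, {"expansion"}, transactions)     # 0.5 / 0.75
keep_loose = rule_passes({"query"}, {"rule"}, transactions, ms=0.1, mc=0.3)
keep_strict = rule_passes({"query"}, {"rule"}, transactions, ms=0.1, mc=0.5)
```

Raising mc from 0.3 to 0.5 drops the weaker rule query -> rule, which is the kind of sensitivity the captions on retrieval results for different ms and mc refer to.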
Retrieval Results of the Proposed Expansion Model for Different ms
Retrieval Results of the Proposed Expansion Model for Different mc
Evaluation  Method       EN01    EN00    UE01    UE00    MN01    MN00    UN01    UN00    Average increase/%
Relax       BLR          0.1992  0.4278  0.2497  0.3701  0.3144  0.3049  0.2679  0.2180  27.59
Relax       WE_QE        0.2301  0.4615  0.2842  0.4375  0.3783  0.3264  0.2993  0.2831  10.49
Relax       WAP_QE       0.2551  0.4777  0.2694  0.4130  0.3517  0.3447  0.2963  0.2777  10.92
Relax       WMS_QE       0.2010  0.4631  0.2815  0.4622  0.3370  0.3481  0.2714  0.2699  14.67
Relax       WPNP_QE      0.2122  0.4628  0.3006  0.5003  0.3332  0.2871  0.3122  0.2713  12.51
Relax       ARG_QE       0.2217  0.4560  0.2827  0.4680  0.3502  0.2959  0.2949  0.2891  12.62
Relax       WEL&ETM_QE   0.2709  0.4705  0.3189  0.5451  0.3774  0.3582  0.3437  0.2919  -
Rigid       BLR          0.1200  0.2814  0.1795  0.2075  0.1850  0.2089  0.1839  0.1253  27.87
Rigid       WE_QE        0.1310  0.3145  0.1787  0.2531  0.2142  0.1987  0.1906  0.1596  16.15
Rigid       WAP_QE       0.1690  0.3313  0.1661  0.2165  0.1976  0.2016  0.1954  0.1597  15.71
Rigid       WMS_QE       0.1398  0.3359  0.1668  0.2452  0.1856  0.2172  0.1776  0.1496  18.21
Rigid       WPNP_QE      0.1383  0.3218  0.1997  0.3056  0.1901  0.1894  0.2054  0.1474  12.84
Rigid       ARG_QE       0.1510  0.3326  0.1804  0.2306  0.1966  0.1771  0.1806  0.1665  17.78
Rigid       WEL&ETM_QE   0.1851  0.3259  0.2084  0.3246  0.1988  0.2465  0.2299  0.1627  -
MAP Values of Retrieval Results of the Title Queries for Different Expansion Methods
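
The page does not state how the average increase column is computed. One reading that reproduces the printed values for the relaxed-relevance BLR and WE_QE rows of the title-query MAP table, to within rounding, is the mean over the eight corpora of the relative gain of WEL&ETM_QE over the baseline. The sketch below checks that reading; the function name is hypothetical.

```python
def average_increase(proposed, baseline):
    """Mean per-corpus relative gain of proposed over baseline, in percent."""
    gains = [(p - b) / b for p, b in zip(proposed, baseline)]
    return 100.0 * sum(gains) / len(gains)

# Relaxed-relevance MAP values for title queries, in corpus order
# EN01, EN00, UE01, UE00, MN01, MN00, UN01, UN00.
blr     = [0.1992, 0.4278, 0.2497, 0.3701, 0.3144, 0.3049, 0.2679, 0.2180]
we_qe   = [0.2301, 0.4615, 0.2842, 0.4375, 0.3783, 0.3264, 0.2993, 0.2831]
wel_etm = [0.2709, 0.4705, 0.3189, 0.5451, 0.3774, 0.3582, 0.3437, 0.2919]

inc_over_blr = average_increase(wel_etm, blr)
inc_over_we = average_increase(wel_etm, we_qe)
print(f"{inc_over_blr:.2f} {inc_over_we:.2f}")
```

Averaging per-corpus ratios, rather than comparing the two row means, gives each corpus equal weight regardless of how hard its queries are.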
Evaluation  Method       EN01    EN00    UE01    UE00    MN01    MN00    UN01    UN00    Average increase/%
Relax       BLR          0.2039  0.3848  0.3239  0.3848  0.3237  0.2862  0.2889  0.2280  22.47
Relax       WE_QE        0.2097  0.3840  0.3405  0.3840  0.4164  0.3414  0.3128  0.3126  10.09
Relax       WAP_QE       0.2457  0.4036  0.3456  0.4036  0.3809  0.3565  0.3353  0.3263  10.09
Relax       WMS_QE       0.2203  0.4340  0.3118  0.4340  0.3641  0.3482  0.3284  0.2832  9.00
Relax       WPNP_QE      0.2514  0.4788  0.3794  0.4788  0.3476  0.3294  0.3838  0.2889  0.96
Relax       ARG_QE       0.2076  0.2959  0.3107  0.2959  0.3653  0.2849  0.2793  0.2799  27.88
Relax       WEL&ETM_QE   0.2555  0.4696  0.3780  0.4696  0.3619  0.3416  0.3700  0.3065  -
Rigid       BLR          0.1103  0.2782  0.1731  0.2782  0.1769  0.1895  0.1965  0.1439  23.21
Rigid       WE_QE        0.1139  0.2690  0.1773  0.2690  0.2346  0.2157  0.1974  0.1999  13.73
Rigid       WAP_QE       0.1375  0.2727  0.1910  0.2727  0.2082  0.2199  0.2278  0.1935  8.72
Rigid       WMS_QE       0.1204  0.3030  0.1816  0.3030  0.2001  0.2117  0.2250  0.1669  11.21
Rigid       WPNP_QE      0.1349  0.3156  0.2198  0.3156  0.1843  0.2312  0.2561  0.1720  3.67
Rigid       ARG_QE       0.1109  0.2236  0.1719  0.2236  0.1981  0.1821  0.1583  0.1719  31.24
Rigid       WEL&ETM_QE   0.1479  0.3389  0.2093  0.3389  0.1920  0.2411  0.2528  0.1762  -
MAP Values of Retrieval Results of the Desc Queries for Different Expansion Methods
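
For readers unfamiliar with the metric, the sketch below computes MAP in its standard form: average precision per query, then the mean over queries. It is not the evaluation code used in the paper; the document IDs and relevance sets are invented, and under the Relax and Rigid settings the relevant set for a query would presumably differ.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values at the ranks where relevant documents appear."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)

def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                     # AP = 1/2
]
map_score = mean_average_precision(runs)
```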
Evaluation  Method       EN01    EN00    UE01    UE00    MN01    MN00    UN01    UN00    Average increase/%
Relax       BLR          0.2000  0.3250  0.1931  0.2069  0.3588  0.3167  0.3364  0.2600  20.41
Relax       WE_QE        0.1914  0.3396  0.2103  0.2603  0.4500  0.3097  0.3557  0.3411  7.98
Relax       WAP_QE       0.2431  0.3583  0.2328  0.2655  0.4132  0.3250  0.3523  0.3311  3.74
Relax       WMS_QE       0.1948  0.3333  0.2224  0.2897  0.3941  0.3403  0.3125  0.3289  8.65
Relax       WPNP_QE      0.1655  0.3333  0.2000  0.2897  0.3765  0.3000  0.3682  0.3378  12.21
Relax       ARG_QE       0.1917  0.3383  0.2455  0.3062  0.4106  0.2733  0.3800  0.3422  6.14
Relax       WEL&ETM_QE   0.2069  0.3583  0.2345  0.2897  0.4471  0.3611  0.3909  0.3467  -
Rigid       BLR          0.1379  0.2083  0.1586  0.1379  0.2647  0.2444  0.2636  0.2200  17.46
Rigid       WE_QE        0.1138  0.2458  0.1621  0.1552  0.3103  0.2292  0.2705  0.2767  11.57
Rigid       WAP_QE       0.1741  0.2375  0.1845  0.1672  0.2691  0.2250  0.2739  0.2656  5.65
Rigid       WMS_QE       0.1345  0.2333  0.1724  0.1793  0.2765  0.2500  0.2523  0.2556  8.91
Rigid       WPNP_QE      0.1034  0.2250  0.1586  0.1931  0.2706  0.2222  0.3000  0.2533  13.55
Rigid       ARG_QE       0.1434  0.2350  0.1959  0.1890  0.2871  0.1922  0.2964  0.2667  6.30
Rigid       WEL&ETM_QE   0.1517  0.2250  0.2000  0.1862  0.2941  0.2722  0.3136  0.2622  -
P@5 Values of Retrieval Results of the Title Queries for Different Expansion Methods
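
P@5 is the fraction of the top five retrieved documents that are relevant, averaged over queries. The sketch below shows the standard per-query computation on invented document IDs; it is an illustration, not the paper's evaluation code.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the first k results that are relevant (divides by k)."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k

# Hypothetical ranking: d2 and d5 fall in the top five, d6 does not.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d2", "d5", "d6"}
p5 = precision_at_k(ranked, relevant)
```

Dividing by k even when fewer than k documents are returned matches the usual convention; a system that returns only three documents cannot score above 0.6.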
Evaluation  Method       EN01    EN00    UE01    UE00    MN01    MN00    UN01    UN00    Average increase/%
Relax       BLR          0.2483  0.3333  0.2345  0.2345  0.3647  0.2667  0.3682  0.2444  14.95
Relax       WE_QE        0.2448  0.3542  0.2414  0.2569  0.4515  0.3028  0.3375  0.3189  5.82
Relax       WAP_QE       0.2655  0.3667  0.2483  0.2655  0.4221  0.3167  0.3932  0.2911  2.70
Relax       WMS_QE       0.2500  0.3063  0.2621  0.2655  0.3985  0.3264  0.3682  0.2800  6.90
Relax       WPNP_QE      0.2483  0.3500  0.2690  0.2759  0.4294  0.3333  0.4182  0.2756  1.70
Relax       ARG_QE       0.2386  0.3667  0.2359  0.2290  0.4188  0.3167  0.3536  0.2702  9.35
Relax       WEL&ETM_QE   0.2828  0.3500  0.2828  0.2552  0.4529  0.3222  0.4045  0.2844  -
Rigid       BLR          0.1724  0.1917  0.1655  0.1724  0.2471  0.1833  0.3045  0.3200  17.13
Rigid       WE_QE        0.1552  0.2229  0.1707  0.1983  0.3162  0.2167  0.2716  0.3911  7.54
Rigid       WAP_QE       0.1862  0.2333  0.1638  0.1914  0.2926  0.2292  0.3148  0.3767  4.01
Rigid       WMS_QE       0.1672  0.2063  0.1914  0.1966  0.2765  0.2319  0.2852  0.3567  7.18
Rigid       WPNP_QE      0.1862  0.2083  0.2069  0.1931  0.2824  0.2444  0.3364  0.3733  1.07
Rigid       ARG_QE       0.1628  0.2333  0.1738  0.1807  0.2776  0.2211  0.2600  0.3556  10.42
Rigid       WEL&ETM_QE   0.2000  0.2167  0.2000  0.1862  0.3118  0.2389  0.3227  0.3733  -
P@5 Values of Retrieval Results of the Desc Queries for Different Expansion Methods
[1] Keikha A, Ensan F, Bagheri E. Query Expansion Using Pseudo Relevance Feedback on Wikipedia[J]. Journal of Intelligent Information Systems, 2018,50(3):455-478.
doi: 10.1007/s10844-017-0466-3
[2] Pan M, Huang J X, He T T, et al. A Simple Kernel Co-Occurrence-Based Enhancement for Pseudo-Relevance Feedback[J]. Journal of the Association for Information Science and Technology, 2020,71(3):264-281.
doi: 10.1002/asi.v71.3
[3] Rungsawang A, Tangpong A, Laohawee P, et al. Novel Query Expansion Technique Using Apriori Algorithm[C]// Proceedings of the 8th Text Retrieval Conference(TREC 8), 1999: 453-456.
[4] Latiri C, Haddad H, Hamrouni T. Towards an Effective Automatic Query Expansion Process Using an Association Rule Mining Approach[J]. Journal of Intelligent Information Systems, 2012,39(1):209-247.
doi: 10.1007/s10844-011-0189-9
[5] Liu C H, Qi R H, Liu Q. Query Expansion Terms Based on Positive and Negative Association Rules[C]// Proceedings of the 3rd International Conference on Information Science and Technology (ICIST). 2013: 802-808.
[6] Bouziri A, Latiri C, Gaussier E, et al. Learning Query Expansion from Association Rules Between Terms[C]// Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K). 2015: 525-530.
[7] Bouziri A, Latiri C, Gaussier E. Efficient Association Rules Selecting for Automatic Query Expansion[C]// Proceedings of the 18th International Conference on Computational Linguistics & Intelligent Text Processing (CICLing 2017). 2017: 563-574.
[8] Bouziri A, Latiri C, Gaussier E. LTR-expand: Query Expansion Model Based on Learning to Rank Association Rules[J]. Journal of Intelligent Information Systems, 2020,55:261-286.
doi: 10.1007/s10844-020-00596-8
[9] Jabri S, Dahbi A, Gadi T. A Graph-Based Approach for Text Query Expansion Using Pseudo Relevance Feedback and Association Rules Mining[J]. International Journal of Electrical & Computer Engineering, 2019,9(6):5016-5023.
[10] Huang Mingxuan. Vietnamese-English Cross Language Query Expansion Based on Weighted Association Patterns Mining[J]. Journal of the China Society for Scientific and Technical Information, 2017,36(3):307-318. (in Chinese)
[11] Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Query Expansion of Pseudo Relevance Feedback Based on Matrix-Weighted Association Rules Mining[J]. Journal of Software, 2009,20(7):1854-1865. (in Chinese)
[12] Huang Mingxuan. Indonesian-Chinese Cross Language Query Expansion Based on All-Weighted Patterns Mining and Relevance Feedback[J]. Journal of Chinese Computer Systems, 2017,38(8):1783-1791. (in Chinese)
[13] Huang Mingxuan, Jiang Caoqing. Vietnamese-English Cross Language Query Post-Translation Expansion Based on All-Weighted Positive and Negative Association Patterns Mining[J]. Acta Electronica Sinica, 2018,46(12):3029-3036. (in Chinese)
[14] Huang Mingxuan, Jiang Caoqing. Cross Language Query Expansion Based on Item Weight Sorting Mining[J]. Acta Electronica Sinica, 2020,48(3):568-576. (in Chinese)
[15] Zhang H R, Zhang J W, Wei X Y, et al. A New Frequent Pattern Mining Algorithm with Weighted Multiple Minimum Supports[J]. Intelligent Automation & Soft Computing, 2017,23(4):605-612.
[16] Sklar A. Fonctions de répartition à n dimensions et leurs marges[J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959,8(1):229-231.
[17] Kuzi S, Shtok A, Kurland O. Query Expansion Using Word Embeddings[C]// Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 2016: 1929-1932.
[18] ALMasri M, Berrut C, Chevallet J P. A Comparison of Deep Learning Based Query Expansion with Pseudo-Relevance Feedback and Mutual Information[C]// Proceedings of the 38th European Conference on IR Research. 2016: 709-715.
[19] Roy D, Ganguly D, Mitra M, et al. Word Vector Compositionality Based Relevance Feedback Using Kernel Density Estimation[C]// Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 2016: 1281-1290.
[20] Li W J, Sheng W, Yu Z T. Deep Learning and Semantic Concept Space Are Used in Query Expansion[J]. Automatic Control and Computer Sciences, 2018,52(3):175-183.
doi: 10.3103/S0146411618030082
[21] Xu Kan, Lin Yuan, Qu Chen, et al. Research on Patent Query Expansion Methods Using Word Embedding[J]. Journal of Frontiers of Computer Science and Technology, 2018,12(6):972-980. (in Chinese)
[22] Yu Chuanming, Cai Lin, Hu Shasha, et al. Research on Query Expansion Based on Deep Learning[J]. Journal of the China Society for Scientific and Technical Information, 2019,38(10):1066-1077. (in Chinese)
[23] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[C]// Proceedings of the 1st International Conference on Learning Representations. 2013.
[24] Zhang Jian, Qu Dan, Li Zhen. Recurrent Neural Network Language Model Based on Word Vector Features[J]. Pattern Recognition and Artificial Intelligence, 2015,28(4):299-305. (in Chinese)
[25] Eickhoff C, de Vries A P, Collins-Thompson K. Copulas for Information Retrieval[C]// Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'13). 2013: 663-672.
[26] Zhang Shubo, Zhang Yin, Zhang Bin, et al. Combined Query Expansion Method Based on Copulas Framework[J]. Computer Science, 2016,43(6A):485-488. (in Chinese)
[27] Nelsen R B. An Introduction to Copulas (The 2nd Edition)[M]. New York, USA: Springer Science+Business Media, Inc., 2006.
[28] Ou Junhao, Wang Jiasheng, Xu Yiping, et al. Applied Probability and Statistics[M]. The 2nd Edition. Tianjin: Tianjin University Press, 1999. (in Chinese)