Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (5): 66-74    DOI: 10.11925/infotech.2096-3467.2019.1297
Calculating Word Similarities Based on Formal Concept Analysis
Liu Ping1,2, Peng Xiaofang1
1School of Information Management, Wuhan University, Wuhan 430072, China
2Institute for Digital Library, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper adds a topic layer between the document and word layers in order to calculate word similarities effectively. [Methods] First, we proposed a topic definition and representation model based on the theory of formal concept analysis. Then, we mapped words to the topic layer. Finally, we developed an algorithm that calculates word similarities with the help of topic-to-topic relationships. [Results] We applied the proposed method to papers from the SIGIR conference from 2006 to 2016 to calculate word similarities in the field of information retrieval. The precision and recall of the proposed method were up to 30% and 21% higher than those of the FastText method. [Limitations] The proposed method relies on the quality of the feature words extracted from the documents. [Conclusions] The proposed method utilizes the semantic relations among associated topics and effectively calculates word similarities.

Key words: Word Similarity; Formal Concept Analysis; Concept Lattices; Topic
Received: 03 December 2019      Published: 15 June 2020
Chinese Library Classification (ZTFLH): TP391.1
Corresponding Authors: Liu Ping     E-mail: pliuleeds@126.com

Cite this article:

Liu Ping,Peng Xiaofang. Calculating Word Similarities Based on Formal Concept Analysis. Data Analysis and Knowledge Discovery, 2020, 4(5): 66-74.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1297     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I5/66

The Relationship Among Documents, Topics and Words
k1 k2 k3 k4 k5
d1 × × × ×
d2 × × ×
d3 × × × ×
d4 × × × ×
d5 × × × × ×
An Example of Formal Context
Concept Lattice Based on Table1
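The step from a formal context to the nodes of its concept lattice can be sketched as follows. This is a minimal illustration, not the authors' implementation: each concept intent is obtained as an intersection of document keyword sets. The rows for d2 and d5 are determined by the association matrices on this page, but the column alignment of the context table is lost in this copy, so which of d1, d3 and d4 lacks k2, k3 or k4 is an assumption, chosen to be consistent with those matrices. The names `context`, `concept_intents` and `extent` are hypothetical.

```python
# Formal context: document -> set of keywords it contains.
# ASSUMPTION: the exact rows of d1, d3 and d4 (each has four keywords) are not
# recoverable from this page; this assignment is one consistent possibility.
context = {
    "d1": {"k1", "k2", "k3", "k5"},
    "d2": {"k2", "k3", "k5"},
    "d3": {"k1", "k2", "k4", "k5"},
    "d4": {"k1", "k3", "k4", "k5"},
    "d5": {"k1", "k2", "k3", "k4", "k5"},
}


def concept_intents(ctx):
    """Return all concept intents: the keyword sets closed under intersection."""
    all_keywords = frozenset().union(*ctx.values())
    intents = {all_keywords}  # intent shared by the empty set of documents
    for keywords in ctx.values():
        # Add the intersection of every known intent with this document's keywords.
        intents |= {intent & keywords for intent in intents}
    return intents


def extent(ctx, intent):
    """Return the documents that contain every keyword of the intent."""
    return {doc for doc, keywords in ctx.items() if intent <= keywords}


# Sort by size, then alphabetically, only to make the printout stable.
intents = sorted(concept_intents(context), key=lambda s: (len(s), sorted(s)))
for intent in intents:
    print(sorted(intent), sorted(extent(context, intent)))
```

Under this assumed context the sketch yields 12 intents, the same keyword sets as the 12 topic rows of the T-K matrix, although the print order here need not match the page's T1 to T12 numbering.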
k1 k2 k3 k4 k5
T1 0 0 0 0 1
T2 0 1 0 0 1
T3 0 0 1 0 1
T4 1 0 0 0 1
T5 0 1 1 0 1
T6 1 1 0 0 1
T7 1 0 1 0 1
T8 1 0 0 1 1
T9 1 1 1 0 1
T10 1 1 0 1 1
T11 1 0 1 1 1
T12 1 1 1 1 1
The Association Matrix of T-K Based on the Concept Lattice in Fig.2
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
k1 0 0 0 1 0 1 1 1 1 1 1 1
k2 0 1 0 0 1 1 0 0 1 1 0 1
k3 0 0 1 0 1 0 1 0 1 0 1 1
k4 0 0 0 0 0 0 0 1 0 1 1 1
k5 1 1 1 1 1 1 1 1 1 1 1 1
The Association Matrix of K-T Based on the Concept Lattice in Fig.2
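The two association matrices carry the same information: K-T is the transpose of T-K, and reading a keyword's K-T row is what maps a word to the topic layer. A minimal sketch, with the topic intents read off the rows of the T-K matrix on this page; the helper name `topics_of` is hypothetical.

```python
keywords = ["k1", "k2", "k3", "k4", "k5"]

# Intents of topics T1..T12, one per row of the T-K matrix.
topics = [
    {"k5"}, {"k2", "k5"}, {"k3", "k5"}, {"k1", "k5"},
    {"k2", "k3", "k5"}, {"k1", "k2", "k5"}, {"k1", "k3", "k5"},
    {"k1", "k4", "k5"}, {"k1", "k2", "k3", "k5"}, {"k1", "k2", "k4", "k5"},
    {"k1", "k3", "k4", "k5"}, {"k1", "k2", "k3", "k4", "k5"},
]

# T-K: one row per topic, 1 where the keyword belongs to the topic's intent.
t_k = [[int(k in intent) for k in keywords] for intent in topics]

# K-T: the transpose of T-K, one row per keyword.
k_t = [list(column) for column in zip(*t_k)]


def topics_of(keyword):
    """Map a keyword to the labels of the topics whose intent contains it."""
    row = k_t[keywords.index(keyword)]
    return [f"T{j + 1}" for j, flag in enumerate(row) if flag]


print(topics_of("k4"))  # ['T8', 'T10', 'T11', 'T12']
```

A keyword such as k5, which occurs in every document, maps to all 12 topics, so it contributes little to distinguishing one word from another.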
No. Keyword Frequency No. Keyword Frequency
1 information search 117 26 entity 22
2 information retrieval 93 27 test collection 22
3 relevance 77 28 personalization 22
4 query 68 29 summarization 21
5 ranking 62 30 statistical analysis 21
6 user 51 31 tweets 21
7 behavior 50 32 log data 20
8 tag 50 33 term 19
9 evaluation 40 34 language model 19
10 clustering 36 35 bm25 19
11 indexing 35 36 search behavior 19
12 text 34 37 task 19
13 recommendation 31 38 eye tracking 19
14 semantics 29 39 spam 19
15 blog 29 40 experiment 19
16 interactive information retrieval 28 41 retrieval model 18
17 effectiveness 27 42 music 18
18 model 27 43 classification 18
19 relevance feedback 26 44 subtopic 18
20 ndcg 25 45 search session 18
21 prediction 23 46 query reformulation 18
22 topic model 23 47 wikipedia 18
23 bayesian 23 48 diversity 18
24 human factors 23 49 visualization 17
25 user interface 22 50 twitter 17
High Frequency Keywords (Top 50)
relevance topic model text scalability information search semantics
d1 ×
d2 × ×
d3 ×
d4 × ×
Topic Formal Context (Partial)
T1 T2 T3 T1306 T1307 T1308
T1 1.00 0.73 0.40 0.25 0.22 0.25
T2 0.73 1.00 0.25 0.73 0.83 0.73
T3 0.40 0.25 1.00 0.40 0.33 0.40
T1306 0.25 0.73 0.40 1.00 0.89 0.25
T1307 0.22 0.83 0.33 0.89 1.00 0.22
T1308 0.25 0.73 0.40 0.25 0.22 1.00
The Matrix of Topic Similarity (Partial)
k1 k2 k3 k178 k179 k180
k1 1.00 0.58 0.57 0.58 0.52 0.56
k2 0.58 1.00 0.55 0.55 0.52 0.56
k3 0.57 0.55 1.00 0.60 0.53 0.57
k178 0.58 0.55 0.60 1.00 0.62 0.58
k179 0.52 0.52 0.53 0.62 1.00 0.56
k180 0.56 0.56 0.57 0.58 0.56 1.00
The Matrix of Words Similarity (Partial)
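This page does not reproduce the formula that turns topic similarities into word similarities. Purely to illustrate the idea of comparing two words through the topics they map to, the sketch below averages the pairwise topic similarities. The averaging rule, the word-to-topic assignments and the names `word_topics`, `sim` and `word_similarity` are all assumptions; only the three topic similarity values come from the partial topic similarity matrix on this page.

```python
# Topic-to-topic similarities for T1..T3, from the partial topic similarity matrix.
topic_sim = {
    ("T1", "T1"): 1.00, ("T1", "T2"): 0.73, ("T1", "T3"): 0.40,
    ("T2", "T2"): 1.00, ("T2", "T3"): 0.25,
    ("T3", "T3"): 1.00,
}


def sim(a, b):
    """Look up a symmetric topic similarity stored once per unordered pair."""
    return topic_sim[(a, b)] if (a, b) in topic_sim else topic_sim[(b, a)]


# HYPOTHETICAL word-to-topic assignments, for illustration only.
word_topics = {"word_a": ["T1", "T2"], "word_b": ["T2", "T3"]}


def word_similarity(w1, w2):
    """ASSUMED aggregation: mean similarity over all cross pairs of topics."""
    pairs = [(a, b) for a in word_topics[w1] for b in word_topics[w2]]
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


print(round(word_similarity("word_a", "word_b"), 3))  # 0.595
```

Other aggregations, such as taking the best-matching topic for each side, would fit the same interface; the choice here is only the simplest one to read.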
Word Pair Type | No. | Word Pair | Similarity (Proposed Method) | Similarity (FastText)
Word-Word 1 tweets; twitter(1) 0.8374 0.8168
2 tweets; microblog(1) 0.8001 0.7476
3 spam; email(1) 0.8134 0.7082
4 behavior; opinion(0) 0.4782 0.5983
5 crowdsourcing; twitter(0) 0.4827 0.6363
6 task; opinion(0) 0.4639 0.5336
Word-Phrase 7 opinion; opinion mining(1) 0.9205 0.6057
8 cqa; question answering(1) 0.9128 0.5867
9 crowdsourcing; amazon mechanical turk(1) 0.7795 0.5026
10 click; opinion mining(0) 0.4695 0.5257
11 fusion; query log analysis(0) 0.4687 0.5423
12 visualization; query log analysis(0) 0.4709 0.5492
Phrase-Phrase 13 log data; query log analysis(1) 0.8753 0.6522
14 query log; query log analysis(1) 0.8071 0.8802
15 information search; search strategy(1) 0.7519 0.7627
16 user study; collaborative filtering(0) 0.4825 0.7158
17 query log; question answering(0) 0.4767 0.6081
18 human factors; opinion mining(0) 0.4742 0.6072
Comparison of Words Similarity Calculation
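Metric | Description
Precision (Precision@n) | The percentage of the top-n ranked word pairs that belong to the gold-standard set.
Recall (Recall@v) | The number of word pairs with similarity above the threshold v that belong to the gold-standard set, as a percentage of all similar word pairs in the gold-standard set.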
The Calculation Method of Precision and Recall
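The two metrics can be sketched as below, using the 18 sample word pairs from the comparison table on this page as input. Two points are assumptions: the `(1)`/`(0)` suffix on each pair is read as its gold-standard label (1 = similar), and the function names are hypothetical. These 18 pairs are only a sample, so the numbers printed here are not the precision and recall values the paper reports for its full evaluation set.

```python
# (gold label, proposed-method similarity, FastText similarity) for pairs 1..18.
pairs = [
    (1, 0.8374, 0.8168), (1, 0.8001, 0.7476), (1, 0.8134, 0.7082),
    (0, 0.4782, 0.5983), (0, 0.4827, 0.6363), (0, 0.4639, 0.5336),
    (1, 0.9205, 0.6057), (1, 0.9128, 0.5867), (1, 0.7795, 0.5026),
    (0, 0.4695, 0.5257), (0, 0.4687, 0.5423), (0, 0.4709, 0.5492),
    (1, 0.8753, 0.6522), (1, 0.8071, 0.8802), (1, 0.7519, 0.7627),
    (0, 0.4825, 0.7158), (0, 0.4767, 0.6081), (0, 0.4742, 0.6072),
]
labels = [p[0] for p in pairs]
proposed = [p[1] for p in pairs]
fasttext = [p[2] for p in pairs]


def precision_at_n(labels, scores, n):
    """Share of the n highest-scoring pairs that are labelled similar."""
    ranked = sorted(zip(scores, labels), reverse=True)[:n]
    return sum(label for _, label in ranked) / n


def recall_at_v(labels, scores, v):
    """Share of all similar pairs whose score exceeds the threshold v."""
    hits = sum(1 for label, score in zip(labels, scores) if label == 1 and score > v)
    return hits / sum(labels)


print(precision_at_n(labels, proposed, 5))          # 1.0
print(precision_at_n(labels, fasttext, 5))          # 0.8
print(recall_at_v(labels, proposed, 0.7))           # 1.0
print(round(recall_at_v(labels, fasttext, 0.7), 3)) # 0.556
```

Precision@n depends only on the ranking of the pairs, while Recall@v depends on the absolute score scale, which is why the two are reported against different parameters (rank cut-off n versus threshold v).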
Method P@10 P@20 P@30 P@40 P@50
Proposed method 1.000 0.850 0.767 0.675 0.600
FastText 0.700 0.550 0.433 0.425 0.420
Comparison of Precision
Method R@0.5 R@0.6 R@0.7
Proposed method 1.000 0.819 0.667
FastText 1.000 0.680 0.458
Comparison of Recall