Analyzing Asymmetric Relationship Between Documents Based on Topic Word Co-occurrence
Zhang Guofang 1,2, Wang Xin 3, Xu Jianmin 1,3
1 School of Management, Hebei University, Baoding 071002, China
2 College of Mathematics and Information Science, Hebei University, Baoding 071002, China
3 School of Cyber Security and Computer, Hebei University, Baoding 071002, China
[Objective] This paper proposes a quantitative model for measuring the asymmetric relationship between documents. [Methods] First, we examined the asymmetric association between topic words based on their co-occurrence. Second, we introduced the concept of document coverage degree to quantify the asymmetric relationship between documents. Finally, we used document clustering to evaluate the proposed model’s performance. [Results] Compared with two existing measurement models, the proposed model reduced the average value of clustering by up to 22.6% and 23.3%, respectively. [Limitations] The proposed model analyzes only textual content and does not cover pictures or formulas. [Conclusions] The proposed model can effectively improve the accuracy of document clustering.
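The two steps named in the Methods can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the association from word a to word b is estimated as the conditional document frequency df(a, b) / df(a), and the hypothetical `coverage` function averages, over the words of one document, the best association reachable from the words of the other. The function names and the toy corpus are likewise hypothetical.

```python
from collections import Counter
from itertools import permutations


def association(docs):
    """Return assoc[(a, b)], an estimate of P(b | a) = df(a, b) / df(a).

    Assumption: co-occurrence is counted once per document.
    """
    df = Counter()  # number of documents containing each word
    co = Counter()  # number of documents containing each ordered pair
    for words in docs:
        uniq = set(words)
        df.update(uniq)
        co.update(permutations(sorted(uniq), 2))
    return {(a, b): n / df[a] for (a, b), n in co.items()}


def coverage(src, dst, assoc):
    """Hypothetical coverage degree of `dst` by `src`, in [0, 1].

    A word of `dst` that also occurs in `src` counts as fully covered;
    otherwise it is covered to the degree of its strongest association
    from any word in `src`.
    """
    src, dst = set(src), set(dst)
    if not dst:
        return 0.0
    total = 0.0
    for w in dst:
        if w in src:
            total += 1.0
        else:
            total += max((assoc.get((s, w), 0.0) for s in src), default=0.0)
    return total / len(dst)


# Toy corpus of topic-word lists (illustrative only).
docs = [
    ["retrieval", "index", "ranking"],
    ["retrieval", "index"],
    ["retrieval"],
]
assoc = association(docs)

print(assoc[("index", "retrieval")])     # association index -> retrieval
print(assoc[("retrieval", "index")])     # association retrieval -> index
print(coverage(docs[0], docs[2], assoc)) # how well doc 0 covers doc 2
print(coverage(docs[2], docs[0], assoc)) # how well doc 2 covers doc 0
```

In this toy corpus the broad document covers the narrow one completely, while the narrow document covers the broad one only partially. That is the kind of asymmetry a symmetric similarity measure cannot express.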
Zhang Guofang, Wang Xin, Xu Jianmin. Analyzing Asymmetric Relationship Between Documents Based on Topic Word Co-occurrence[J]. Data Analysis and Knowledge Discovery, 2023, 7(3): 110-120.