Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (3): 110-120    DOI: 10.11925/infotech.2096-3467.2022.0342
Analyzing Asymmetric Relationship Between Documents Based on Topic Word Co-occurrence
Zhang Guofang1,2, Wang Xin3, Xu Jianmin1,3
1School of Management, Hebei University, Baoding 071002, China
2College of Mathematics and Information Science, Hebei University, Baoding 071002, China
3School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Abstract  

[Objective] This paper proposes a quantitative model for exploring the asymmetric relationship between documents. [Methods] First, we examined the asymmetric association between topic words on the basis of their co-occurrence. Second, we introduced the concept of document coverage degree to quantify the asymmetric relationship between documents. Finally, we used document clustering to evaluate the proposed model's performance. [Results] Compared with two existing measurement models, the proposed model reduced the average value of clustering by up to 22.6% and 23.3%, respectively. [Limitations] The proposed model analyzed only textual content and did not consider pictures or formulas. [Conclusions] The proposed model can effectively improve the accuracy of document clustering.
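
To make the idea of an asymmetric, co-occurrence-based relationship more concrete, here is a minimal Python sketch. It is not the model from the paper, whose formulas are not given on this page. It assumes that the association from word a to word b can be approximated by the share of documents containing a that also contain b, and that a coverage degree can be approximated by averaging, over the topic words of the covered document, the best support offered by the covering document. The function names `asymmetric_association` and `coverage`, and the toy corpus, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def asymmetric_association(docs):
    """Map (a, b) to (# docs containing both a and b) / (# docs containing a).

    This conditional co-occurrence frequency is an assumed stand-in for the
    paper's measure; it is asymmetric because the denominator depends on a.
    """
    doc_freq = Counter()
    pair_freq = Counter()
    for words in docs:
        unique = set(words)
        doc_freq.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_freq[(a, b)] += 1
            pair_freq[(b, a)] += 1
    return {(a, b): n / doc_freq[a] for (a, b), n in pair_freq.items()}


def coverage(doc_a, doc_b, assoc):
    """Hypothetical coverage degree of doc_b by doc_a, in [0, 1].

    A word of doc_b that also occurs in doc_a counts as fully covered;
    otherwise it receives the strongest association from any word of doc_a.
    """
    words_a, words_b = set(doc_a), set(doc_b)
    if not words_b:
        return 0.0
    total = 0.0
    for wb in words_b:
        if wb in words_a:
            total += 1.0
        else:
            total += max((assoc.get((wa, wb), 0.0) for wa in words_a), default=0.0)
    return total / len(words_b)


# Toy corpus: each document is represented by its list of topic words.
corpus = [
    ["lda", "topic", "clustering"],
    ["lda", "topic"],
    ["topic", "retrieval"],
    ["topic"],
]
assoc = asymmetric_association(corpus)
lda_to_topic = assoc[("lda", "topic")]   # every document with "lda" also has "topic"
topic_to_lda = assoc[("topic", "lda")]   # only half of the "topic" documents have "lda"
```

In this toy corpus the association from "lda" to "topic" is stronger than the reverse, so a document about "lda" covers a document about "topic" more than the other way around, which is the kind of directional relationship a symmetric similarity cannot express.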

Key words: Asymmetric Relationship; Topic Word Co-occurrence; Coverage
Received: 13 April 2022      Published: 13 April 2023
ZTFLH:  TP391 G354  
Fund: National Social Science Fund of China (17FTQ002); Social Science Fund of Hebei Province (HB20TQ002)
Corresponding Author: Xu Jianmin, ORCID: 0000-0001-6050-8058, E-mail: hbuxjm@mail.hbu.edu.cn

Cite this article:

Zhang Guofang, Wang Xin, Xu Jianmin. Analyzing Asymmetric Relationship Between Documents Based on Topic Word Co-occurrence. Data Analysis and Knowledge Discovery, 2023, 7(3): 110-120.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0342     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I3/110

Asymmetry Between Topic Words and Asymmetry Between Documents
Framework of Document Coverage Measurement Model Based on Topic Word Co-occurrence
Potential Asymmetric Relationship Between Topic Words in the Document
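Model abbreviation    Explanation
CSSM[31]              Symmetric similarity method that considers topics (topic words extracted with LDA)
NSAM[19]              Asymmetric method that does not consider topics
CSAM (this paper)     Asymmetric method that considers topics (topic words extracted with LDA)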
Model Abbreviations and Explanations
Confusion Degree (Perplexity) of LDA Model under Different Numbers of Topics
Visual Representation of Co-occurrence of Topic Words
Performance Between CSSM and CSAM
Performance Between NSAM and CSAM
[1] Bao J P, Shen J Y, Liu X D, et al. Quick Asymmetric Text Similarity Measures[C]// Proceedings of the 2003 International Conference on Machine Learning and Cybernetics. IEEE, 2003: 374-379.
[2] Garg A, Enright C G, Madden M G. On Asymmetric Similarity Search[C]// Proceedings of 2015 IEEE 14th International Conference on Machine Learning and Applications. IEEE, 2015: 649-654.
[3] Pang Beibei, Gou Juanqiong, Mu Wenxin. Extracting Topics and Their Relationship from College Student Mentoring[J]. Data Analysis and Knowledge Discovery, 2018, 2(6): 92-101. (in Chinese)
[4] Zhang Bao, Chen Weirong, Zhang Mengyi, et al. Mining Hyponymy Relation from Social Tags by Tag Embedding[J]. Command Information System and Technology, 2020, 11(4): 64-69, 73. (in Chinese)
[5] Wang Sili, Zhu Zhongming, Yang Heng, et al. Automatically Identifying Hypernym-Hyponym Relations of Domain Concepts with Patterns and Projection Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(11): 15-25. (in Chinese)
[6] Xu Ge, Yang Xiaoyan, Wang Tao. Survey on Semantic Similarity Calculation of Words[J]. Computer Engineering and Applications, 2020, 56(4): 9-15. (in Chinese)
doi: 10.3778/j.issn.1002-8331.1909-0384
[7] Yang Quan. Research on Word Semantic Similarity Calculation Based on Genetic Algorithm[J]. Computer Technology and Development, 2021, 31(2): 8-13. (in Chinese)
[8] Zhang Zhichang, Chen Songyi, Liu Xin, et al. Hyponymy Relation Validation Combined with Context and Brown Clustering Feature[J]. Computer Engineering, 2015, 41(2): 145-150. (in Chinese)
[9] Liu Wei, Huang Kaiyu, Yu Hao, et al. Consistency Check for Chinese Word Segmentation via Contextual Similarity[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2022, 58(1): 99-105. (in Chinese)
[10] Cai Dongfeng, Bai Yu, Yu Shui, et al. A Context Based Word Similarity Computing Method[J]. Journal of Chinese Information Processing, 2010, 24(3): 24-28. (in Chinese)
[11] Zhao Ningning, Liang Yiwen. Combining Structure and Content Similarities Measure for XML Document[J]. Microelectronics & Computer, 2016, 33(4): 69-72, 76. (in Chinese)
[12] Shan Huawei, Lu Dongyuan. Predicting Argumentative Relation with Co-attention Contextual Relevance Network[J]. Journal of Software, 2022, 33(5): 1880-1892. (in Chinese)
[13] Xu Jianmin, Xu Caiyun. Computing Similarity of Sci-Tech Documents Based on Texts and Formulas[J]. Data Analysis and Knowledge Discovery, 2018, 2(10): 103-109. (in Chinese)
[14] Yoshida H, Shida T, Kindo T. Asymmetric Similarity with Modified Overlap Coefficient among Documents[C]// Proceedings of 2001 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. 2001: 99-102.
[15] Xu Jianmin, Wang Xin. A Double Mode Measurement Method of Asymmetric Relationship Between Scientific Documents[J]. Journal of Hebei University (Natural Science Edition), 2021, 41(5): 587-598. (in Chinese)
[16] Shivakumar N, Garcia-Molina H. SCAM: A Copy Detection Mechanism for Digital Documents[C]// Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries. 1995.
[17] Teitelbaum J C. Asymmetric Empirical Similarity[J]. Mathematical Social Sciences, 2013, 66(3): 346-351.
doi: 10.1016/j.mathsocsci.2013.07.005
[18] Song Shaoxu, Li Chunping. Text Clustering Based on Asymmetric Similarity[J]. Journal of Tsinghua University (Science and Technology), 2006, 46(7): 1325-1328. (in Chinese)
[19] Olszewski D. Asymmetric k-Means Algorithm[C]// Proceedings of the 10th International Conference on Adaptive and Natural Computing Algorithms. 2011: 1-10.
[20] Ruan Guangce, Xia Lei. Knowledge Connection of Retrieval Results Based on Co-word Analysis[J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(12): 1247-1254. (in Chinese)
[21] Wang Zhongyi, Tan Xu, Xia Lixin. Research on Fine-Granularity and Semantization of Co-word Analysis[J]. Journal of the China Society for Scientific and Technical Information, 2014, 33(9): 969-978. (in Chinese)
[22] Li Gang, Ba Zhichao. Co-word Analysis: Limitations and Solutions[J]. Journal of Library Science in China, 2017, 43(4): 93-113. (in Chinese)
[23] Maron M E, Kuhns J L. On Relevance, Probabilistic Indexing and Information Retrieval[J]. Journal of the ACM, 1960, 7(3): 216-244.
doi: 10.1145/321033.321035
[24] Niu Fenggao, Li Xing. Enhanced Considerations on Co-occurrence Latent Semantic Vector Space Model[J]. Journal of Intelligence, 2017, 36(12): 166-172. (in Chinese)
[25] Li Jia. The Research of Cross-language Retrieval Platform Based on Word Co-occurrence[J]. Journal of Intelligence, 2015, 34(8): 195-198. (in Chinese)
[26] Zhao Zhongwei. Research on Information Retrieval Discipline Topic Based on Comparative Study on SIGIR Mailing List and Scholar Papers[D]. Wuhan: Wuhan University, 2017. (in Chinese)
[27] Chang Peng, Feng Nan. A Co-occurrence Based Vector Space Model for Document Indexing[J]. Journal of Chinese Information Processing, 2012, 26(1): 51-57. (in Chinese)
[28] Chen H, Lynch K J. Automatic Construction of Networks of Concepts Characterizing Document Databases[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1992, 22(5): 885-902.
doi: 10.1109/21.179830
[29] Glenisson P, Glanzel W, Janssens F, et al. Combining Full Text and Bibliometric Information in Mapping Scientific Disciplines[J]. Information Processing & Management, 2005, 41(6): 1548-1572.
doi: 10.1016/j.ipm.2005.03.021
[30] Tang Xiaobo, Xiang Kun. Hotspot Mining Based on LDA Model and Microblog Heat[J]. Library and Information Service, 2014, 58(5): 58-63. (in Chinese)
[31] Schütze H, Manning C D, Raghavan P. Introduction to Information Retrieval[M]. Cambridge: Cambridge University Press, 2008: 108-109.
[32] Li Ming, Li Ying, Zhou Qing, et al. Analyzing Knowledge Demand and Supply of Community Question Answering with TF-PIDF[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 106-115. (in Chinese)
[33] Xiao Yuejun, Li Honglian, Zhang Le, et al. Classifying Chinese Patent Texts with Feature Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 49-59. (in Chinese)
[34] Johannesson M. Modelling Asymmetric Similarity with Prominence[J]. British Journal of Mathematical and Statistical Psychology, 2000, 53(1): 121-139.
doi: 10.1348/000711000159213
[35] Moussa M, Măndoiu I I. Single Cell RNA-seq Data Clustering Using TF-IDF Based Methods[J]. BMC Genomics, 2018, 19(Suppl 6): 569.
doi: 10.1186/s12864-018-4922-4 pmid: 30367575
[36] Xu Jianmin, Wang Ping. Small Chinese Information Retrieval Test Collections: Construction and Analysis[J]. Journal of Intelligence, 2009, 28(1): 13-16. (in Chinese)
[37] Ahmadzadehgoli N, Mohammadpour A, Behzadi M H. LINEX k-Means: Clustering by an Asymmetric Dissimilarity Measure[J]. Journal of Statistical Theory and Applications, 2018, 17(1): 29.
doi: 10.2991/jsta.2018.17.1.3
[38] Yang Yan, Jin Fan, Kamel Mohamed. Survey of Clustering Validity Evaluation[J]. Application Research of Computers, 2008, 25(6): 1630-1632, 1638. (in Chinese)
[39] Guan Peng, Wang Yuefen. Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model[J]. New Technology of Library and Information Service, 2016(9): 42-50. (in Chinese)