Analyzing Asymmetric Relationship Between Documents Based on Topic Word Co-occurrence
Zhang Guofang1,2, Wang Xin3, Xu Jianmin1,3
1 School of Management, Hebei University, Baoding 071002, China
2 College of Mathematics and Information Science, Hebei University, Baoding 071002, China
3 School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Abstract [Objective] This paper proposes a quantitative model for measuring the asymmetric relationship between documents. [Methods] First, we used topic word co-occurrence to analyze the asymmetric association between topics. Second, we introduced the document coverage degree to quantify the asymmetric relationship between documents. Finally, we evaluated the proposed model through document clustering. [Results] Compared with two existing measurement models, the proposed model reduced the average value of clustering by up to 22.6% and 23.3%, respectively. [Limitations] The proposed model analyzes only textual content and does not consider pictures or formulas. [Conclusions] The proposed model can effectively improve the accuracy of document clustering.
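The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the association from topic word a to topic word b is the share of documents containing a that also contain b, which is asymmetric by construction. The functions `asymmetric_association` and `coverage`, and the averaging rule used in `coverage`, are hypothetical names and choices made for illustration.

```python
from collections import Counter
from itertools import permutations


def asymmetric_association(docs):
    """Return assoc[(a, b)] = (#docs containing a and b) / (#docs containing a).

    Assumption for illustration: co-occurrence is counted at document level.
    """
    doc_freq = Counter()   # number of documents containing each word
    co_freq = Counter()    # number of documents containing each ordered pair
    for words in docs:
        terms = set(words)
        doc_freq.update(terms)
        co_freq.update(permutations(sorted(terms), 2))
    return {(a, b): n / doc_freq[a] for (a, b), n in co_freq.items()}


def coverage(doc_a, doc_b, assoc):
    """Hypothetical degree to which doc_a covers doc_b.

    Each topic word of doc_b scores 1.0 if doc_a contains it, otherwise the
    strongest association from any word of doc_a to it. Scores are averaged.
    """
    words_a, words_b = set(doc_a), set(doc_b)
    if not words_b:
        return 0.0
    total = 0.0
    for wb in words_b:
        if wb in words_a:
            total += 1.0
        else:
            total += max((assoc.get((wa, wb), 0.0) for wa in words_a), default=0.0)
    return total / len(words_b)
```

With such a measure, `coverage(x, y, assoc)` and `coverage(y, x, assoc)` generally differ, which is the asymmetry the abstract refers to; a document that contains all topic words of another covers it fully, while the reverse need not hold.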
Received: 13 April 2022
Published: 13 April 2023
Fund: National Social Science Fund of China (17FTQ002); Social Science Fund of Hebei Province (HB20TQ002)
Corresponding Author: Xu Jianmin, ORCID: 0000-0001-6050-8058, E-mail: hbuxjm@mail.hbu.edu.cn