Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (3): 110-120     https://doi.org/10.11925/infotech.2096-3467.2022.0342
Research Paper
Analyzing Asymmetric Relationship Between Documents Based on Topic Word Co-occurrence
Zhang Guofang1,2, Wang Xin3, Xu Jianmin1,3
1School of Management, Hebei University, Baoding 071002, China
2College of Mathematics and Information Science, Hebei University, Baoding 071002, China
3School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Abstract

[Objective] This paper explores the asymmetric relationship between documents and proposes a model to quantify it. [Methods] Building on the idea of topic word co-occurrence, we mine the asymmetric association information between topic words, use a document coverage degree indicator to quantify the asymmetric relationship between documents, and evaluate the model empirically through document clustering. [Results] In the document clustering application, compared with two existing models for quantifying relationships between documents, the proposed model reduced the average entropy of the clustering results by up to 22.6% and 23.3%, respectively. [Limitations] The model considers only the textual content of documents and does not account for the influence of non-textual content, such as pictures and formulas, on the asymmetric relationship between documents. [Conclusions] Exploiting the asymmetric relationship between documents distinguishes differences between documents more clearly and helps improve the accuracy of document clustering.

Key words: Asymmetric Relationship; Topic Word Co-occurrence; Coverage
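As a rough illustration of the idea summarized in the abstract, the Python sketch below estimates a directed co-occurrence association between topic words and derives a directed coverage score between two documents. Both formulas are assumptions made for illustration: the association is taken as the conditional frequency P(b | a) over text units, and `coverage` is a hypothetical average of best associations. The paper's actual definitions are not given on this page, and the function names and toy data are invented.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_association(units):
    """Return assoc[(a, b)] = share of text units containing a that also contain b.

    `units` is an iterable of word lists (for example, one list per sentence).
    The measure is directed: assoc[(a, b)] generally differs from assoc[(b, a)].
    """
    single = Counter()  # number of units containing each word
    pair = Counter()    # number of units containing each ordered pair
    for unit in units:
        words = set(unit)
        single.update(words)
        pair.update(permutations(words, 2))
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


def coverage(topics_a, topics_b, assoc):
    """Hypothetical coverage of document B by document A (assumed formula).

    Each topic word of B counts 1.0 if A shares it, otherwise the strongest
    directed association from any topic word of A; the result is the mean.
    """
    if not topics_b:
        return 0.0
    total = 0.0
    for t in topics_b:
        if t in topics_a:
            total += 1.0
        else:
            total += max((assoc.get((s, t), 0.0) for s in topics_a), default=0.0)
    return total / len(topics_b)


# Toy data (invented): four text units of topic words.
units = [
    ["retrieval", "index"],
    ["retrieval", "index", "ranking"],
    ["retrieval"],
    ["retrieval", "ranking"],
]
assoc = cooccurrence_association(units)
print(assoc[("retrieval", "index")], assoc[("index", "retrieval")])
print(coverage({"index"}, {"retrieval"}, assoc), coverage({"retrieval"}, {"index"}, assoc))
```

In the toy data, "index" always appears together with "retrieval", but not the other way round, so the two directions of association and coverage differ. This is the kind of asymmetry a symmetric similarity measure cannot express.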
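Received: 2022-04-13      Published: 2023-04-13
ZTFLH (Chinese Library Classification):  TP391 G354
Funding: National Social Science Fund of China, Post-funded Project (17FTQ002); Hebei Provincial Social Science Fund Project (HB20TQ002)
Corresponding author: Xu Jianmin, ORCID: 0000-0001-6050-8058, E-mail: hbuxjm@mail.hbu.edu.cn.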
Cite this article:
Zhang Guofang, Wang Xin, Xu Jianmin. Analyzing Asymmetric Relationship Between Documents Based on Topic Word Co-occurrence[J]. Data Analysis and Knowledge Discovery, 2023, 7(3): 110-120.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0342      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I3/110
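Fig.1  Asymmetry between topic words and asymmetry between documents
Fig.2  Research framework of the document coverage measurement model based on topic word co-occurrence
Fig.3  Schematic of the latent asymmetric relationships between document topic words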
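Table 1  Model abbreviations and their explanations
Abbreviation | Explanation
CSSM[31] | Symmetric similarity method that considers topics (topic words extracted with LDA)
NSAM[19] | Asymmetric method that does not consider topics
CSAM (this paper) | Asymmetric method that considers topics (topic words extracted with LDA)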
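The abstract compares the models in Table 1 by the average entropy of their clustering results. A minimal sketch of one common way to compute clustering entropy follows, assuming reference class labels are available and that each cluster's entropy is weighted by its size. Whether the paper uses exactly this variant is not stated on this page, and `clustering_entropy` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted entropy of a clustering against reference labels.

    0.0 means every cluster contains a single class; larger values mean the
    clusters mix classes more. Lower is better.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    members = defaultdict(list)  # cluster id -> class labels of its documents
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    n = len(cluster_ids)
    total = 0.0
    for labels in members.values():
        size = len(labels)
        # Shannon entropy (base 2) of the class distribution inside the cluster.
        h = -sum((k / size) * math.log2(k / size) for k in Counter(labels).values())
        total += (size / n) * h
    return total


# Invented example: one pure clustering and one fully mixed clustering.
print(clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"]))
```

Under this definition, a relative reduction such as those reported in the abstract would be computed as (baseline entropy - new entropy) / baseline entropy.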
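Fig.4  Perplexity of the LDA model for different numbers of topics for document d_i
Fig.5  Visualization of topic word co-occurrence
Fig.6  Visual comparison of the performance of the CSSM and CSAM models
Fig.7  Visual comparison of the performance of the NSAM and CSAM models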
[1] Bao J P, Shen J Y, Liu X D, et al. Quick Asymmetric Text Similarity Measures[C]// Proceedings of the 2003 International Conference on Machine Learning and Cybernetics. IEEE, 2003: 374-379.
[2] Garg A, Enright C G, Madden M G. On Asymmetric Similarity Search[C]// Proceedings of 2015 IEEE 14th International Conference on Machine Learning and Applications. IEEE, 2015: 649-654.
[3] Pang Beibei, Gou Juanqiong, Mu Wenxin. Extracting Topics and Their Relationship from College Student Mentoring[J]. Data Analysis and Knowledge Discovery, 2018, 2(6): 92-101.
[4] Zhang Bao, Chen Weirong, Zhang Mengyi, et al. Mining Hyponymy Relation from Social Tags by Tag Embedding[J]. Command Information System and Technology, 2020, 11(4): 64-69, 73.
[5] Wang Sili, Zhu Zhongming, Yang Heng, et al. Automatically Identifying Hypernym-Hyponym Relations of Domain Concepts with Patterns and Projection Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(11): 15-25.
[6] Xu Ge, Yang Xiaoyan, Wang Tao. Survey on Semantic Similarity Calculation of Words[J]. Computer Engineering and Applications, 2020, 56(4): 9-15.
doi: 10.3778/j.issn.1002-8331.1909-0384
[7] Yang Quan. Research on Word Semantic Similarity Calculation Based on Genetic Algorithm[J]. Computer Technology and Development, 2021, 31(2): 8-13.
[8] Zhang Zhichang, Chen Songyi, Liu Xin, et al. Hyponymy Relation Validation Combined with Context and Brown Clustering Feature[J]. Computer Engineering, 2015, 41(2): 145-150.
[9] Liu Wei, Huang Kaiyu, Yu Hao, et al. Consistency Check for Chinese Word Segmentation via Contextual Similarity[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2022, 58(1): 99-105.
[10] Cai Dongfeng, Bai Yu, Yu Shui, et al. A Context Based Word Similarity Computing Method[J]. Journal of Chinese Information Processing, 2010, 24(3): 24-28.
[11] Zhao Ningning, Liang Yiwen. Combining Structure and Content Similarities Measure for XML Document[J]. Microelectronics & Computer, 2016, 33(4): 69-72, 76.
[12] Shan Huawei, Lu Dongyuan. Predicting Argumentative Relation with Co-attention Contextual Relevance Network[J]. Journal of Software, 2022, 33(5): 1880-1892.
[13] Xu Jianmin, Xu Caiyun. Computing Similarity of Sci-Tech Documents Based on Texts and Formulas[J]. Data Analysis and Knowledge Discovery, 2018, 2(10): 103-109.
[14] Yoshida H, Shida T, Kindo T. Asymmetric Similarity with Modified Overlap Coefficient among Documents[C]// Proceedings of 2001 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. 2001: 99-102.
[15] Xu Jianmin, Wang Xin. A Double Mode Measurement Method of Asymmetric Relationship Between Scientific Documents[J]. Journal of Hebei University (Natural Science Edition), 2021, 41(5): 587-598.
[16] Shivakumar N, Garcia-Molina H. SCAM: A Copy Detection Mechanism for Digital Documents[C]// Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries. 1995.
[17] Teitelbaum J C. Asymmetric Empirical Similarity[J]. Mathematical Social Sciences, 2013, 66(3): 346-351.
doi: 10.1016/j.mathsocsci.2013.07.005
[18] Song Shaoxu, Li Chunping. Text Clustering Based on Asymmetric Similarity[J]. Journal of Tsinghua University (Science and Technology), 2006, 46(7): 1325-1328.
[19] Olszewski D. Asymmetric k-Means Algorithm[C]// Proceedings of the 10th International Conference on Adaptive and Natural Computing Algorithms. 2011: 1-10.
[20] Ruan Guangce, Xia Lei. Knowledge Connection of Retrieval Results Based on Co-word Analysis[J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(12): 1247-1254.
[21] Wang Zhongyi, Tan Xu, Xia Lixin. Research on Fine-Granularity and Semantization of Co-word Analysis[J]. Journal of the China Society for Scientific and Technical Information, 2014, 33(9): 969-978.
[22] Li Gang, Ba Zhichao. Co-word Analysis: Limitations and Solutions[J]. Journal of Library Science in China, 2017, 43(4): 93-113.
[23] Maron M E, Kuhns J L. On Relevance, Probabilistic Indexing and Information Retrieval[J]. Journal of the ACM, 1960, 7(3): 216-244.
doi: 10.1145/321033.321035
[24] Niu Fenggao, Li Xing. Enhanced Considerations on Co-occurrence Latent Semantic Vector Space Model[J]. Journal of Intelligence, 2017, 36(12): 166-172.
[25] Li Jia. The Research of Cross-language Retrieval Platform Based on Word Co-occurrence[J]. Journal of Intelligence, 2015, 34(8): 195-198.
[26] Zhao Zhongwei. Research on Information Retrieval Discipline Topic Based on Comparative Study on SIGIR Mailing List and Scholar Papers[D]. Wuhan: Wuhan University, 2017.
[27] Chang Peng, Feng Nan. A Co-occurrence Based Vector Space Model for Document Indexing[J]. Journal of Chinese Information Processing, 2012, 26(1): 51-57.
[28] Chen H, Lynch K J. Automatic Construction of Networks of Concepts Characterizing Document Databases[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1992, 22(5): 885-902.
doi: 10.1109/21.179830
[29] Glenisson P, Glanzel W, Janssens F, et al. Combining Full Text and Bibliometric Information in Mapping Scientific Disciplines[J]. Information Processing & Management, 2005, 41(6): 1548-1572.
doi: 10.1016/j.ipm.2005.03.021
[30] Tang Xiaobo, Xiang Kun. Hotspot Mining Based on LDA Model and Microblog Heat[J]. Library and Information Service, 2014, 58(5): 58-63.
[31] Schütze H, Manning C D, Raghavan P. Introduction to Information Retrieval[M]. Cambridge: Cambridge University Press, 2008: 108-109.
[32] Li Ming, Li Ying, Zhou Qing, et al. Analyzing Knowledge Demand and Supply of Community Question Answering with TF-PIDF[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 106-115.
[33] Xiao Yuejun, Li Honglian, Zhang Le, et al. Classifying Chinese Patent Texts with Feature Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 49-59.
[34] Johannesson M. Modelling Asymmetric Similarity with Prominence[J]. British Journal of Mathematical and Statistical Psychology, 2000, 53(1): 121-139.
doi: 10.1348/000711000159213
[35] Moussa M, Măndoiu I I. Single Cell RNA-seq Data Clustering Using TF-IDF Based Methods[J]. BMC Genomics, 2018, 19(Suppl 6): 569.
doi: 10.1186/s12864-018-4922-4 pmid: 30367575
[36] Xu Jianmin, Wang Ping. Small Chinese Information Retrieval Test Collections: Construction and Analysis[J]. Journal of Intelligence, 2009, 28(1): 13-16.
[37] Ahmadzadehgoli N, Mohammadpour A, Behzadi M H. LINEX k-Means: Clustering by an Asymmetric Dissimilarity Measure[J]. Journal of Statistical Theory and Applications, 2018, 17(1): 29.
doi: 10.2991/jsta.2018.17.1.3
[38] Yang Yan, Jin Fan, Mohamed K. Survey of Clustering Validity Evaluation[J]. Application Research of Computers, 2008, 25(6): 1630-1632, 1638.
[39] Guan Peng, Wang Yuefen. Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model[J]. New Technology of Library and Information Service, 2016(9): 42-50.
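Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190, China
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn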