Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (4): 112-124     https://doi.org/10.11925/infotech.2096-3467.2023.0232
Research Article
Paper and Patent Data Fusion Based on Deep Text Clustering*
Xie Shiyao1,2, Wang Xiaomei1
1 Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
2 School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

[Objective] This study integrates paper and patent data by research topic, overcoming the differences in language characteristics between the two document types. [Method] Using Wikipedia as the base classification system, we built a small annotated set semi-automatically, designed a semi-supervised deep text clustering model that clusters and fuses papers and patents on similar topics, and defined indicators to evaluate the quality of the fusion results. [Results] On two datasets, the proposed model's clustering accuracy was 2.4 to 11.9 percentage points higher than that of the baseline models. The quality evaluation score of the fusion results exceeded 0.9, better than the baseline models, and the model can supplement the known topics with additional research topics. [Limitations] We did not conduct empirical analysis using the fused data, and the number of clusters must be determined manually. [Conclusion] The proposed model can extract topic-related features from the divergent texts of papers and patents and effectively achieve data fusion.

Key words: Deep Text Clustering; Data Fusion; Papers; Patents; Research Topic Identification
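The accuracy gain quoted in the abstract can be checked against the ACC columns of Table 2 on this page. The sketch below assumes that KGDC is the proposed model and that the gain is measured against the strongest other model in each semi-supervised setting; neither assumption is stated explicitly here, and the function name is only illustrative.

```python
# ACC values (%) copied from Table 2 of this page; keys are (dataset, labeled ratio).
ACC = {
    ("NLP", "20%"): {"BERTopic": 31.63, "DeepAligned": 67.32, "KGDC": 74.51},
    ("CV", "20%"): {"BERTopic": 24.65, "DeepAligned": 64.38, "KGDC": 71.66},
    ("NLP", "50%"): {"BERTopic": 32.40, "DeepAligned": 68.15, "KGDC": 80.05},
    ("CV", "50%"): {"BERTopic": 25.25, "DeepAligned": 75.21, "KGDC": 77.67},
}


def gain_over_best_baseline(scores, proposed="KGDC"):
    """Percentage-point gap between the proposed model and the best other model.

    Assumption: `proposed` names the paper's model; all other keys are baselines.
    """
    best_baseline = max(v for name, v in scores.items() if name != proposed)
    return round(scores[proposed] - best_baseline, 2)


gains = {setting: gain_over_best_baseline(s) for setting, s in ACC.items()}
for setting, gain in gains.items():
    print(setting, gain)
print("range:", min(gains.values()), "to", max(gains.values()))
```

Under these assumptions the smallest gap is on CV with 50% labels and the largest on NLP with 50% labels, which is consistent with the range given in the abstract.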
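Received: 2023-03-20      Published: 2024-03-15
CLC number (ZTFLH):  G350
Funding: * Strategic Research Project of the Chinese Academy of Sciences, "Research on Development Trends and Decision Support for Important Disciplinary Fields" (GHJ-ZLZX-2022-09)
Corresponding author: Wang Xiaomei, ORCID: 0000-0002-9895-1511, E-mail: wangxm@casisd.cn.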
Cite this article:
Xie Shiyao, Wang Xiaomei. Paper and Patent Data Fusion Based on Deep Text Clustering. Data Analysis and Knowledge Discovery, 2024, 8(4): 112-124.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0232      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I4/112
Fig.1  Data fusion at the research topic level
Fig.2  Framework of the knowledge-guided deep text clustering model
Fig.3  Partial structure of NLP_WCT
Dataset          NLP papers   NLP patents   Labeled classes   CV papers   CV patents   Labeled classes
Annotated set    6,935        1,600         8                 12,673      4,407        9
Full set         37,307       9,422         /                 84,818      47,820       /
Table 1  Statistics of the experimental data
Category                        Model           NLP ACC   NLP ARI   NLP NMI   CV ACC   CV ARI   CV NMI
Unsupervised                    K-Means         22.41     4.95      8.70      21.33    3.83     7.19
Unsupervised                    BERTopic        25.91     7.56      12.54     23.47    5.69     10.17
Unsupervised                    SSCL            27.73     8.55      13.94     30.04    9.95     15.43
Unsupervised                    Self-Training   28.44     11.37     17.63     31.12    10.53    16.48
Semi-supervised (20% labeled)   BERTopic        31.63     15.63     20.42     24.65    6.14     10.50
Semi-supervised (20% labeled)   DeepAligned     67.32     51.13     51.34     64.38    41.87    46.83
Semi-supervised (20% labeled)   KGDC            74.51     61.94     58.38     71.66    49.69    51.29
Semi-supervised (50% labeled)   BERTopic        32.40     11.72     17.17     25.25    6.65     10.96
Semi-supervised (50% labeled)   DeepAligned     68.15     51.21     51.67     75.21    60.43    62.48
Semi-supervised (50% labeled)   KGDC            80.05     68.42     65.20     77.67    60.21    61.23
Table 2  Comparison of clustering performance
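The three metrics reported in Table 2 (ACC, ARI, NMI) can be computed from a list of true labels and a list of predicted cluster ids. The following is a minimal standard-library sketch using their common definitions; which exact variants the authors used (for example the NMI normalization) is not stated on this page, so the arithmetic-mean normalization and the toy labels below are assumptions. The brute-force label matching is only practical for a handful of clusters.

```python
from collections import Counter
from itertools import permutations
from math import comb, log


def clustering_accuracy(y_true, y_pred):
    """ACC: fraction correct under the best one-to-one cluster-to-class mapping."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than classes")
    pairs = Counter(zip(y_pred, y_true))
    best = 0
    # Try every assignment of clusters to distinct classes (fine for small k).
    for perm in permutations(classes, len(clusters)):
        hits = sum(pairs[(c, k)] for c, k in zip(clusters, perm))
        best = max(best, hits)
    return best / len(y_true)


def adjusted_rand_index(y_true, y_pred):
    """ARI from pair counts in the contingency table (needs at least 2 samples)."""
    n = len(y_true)
    sum_cells = sum(comb(v, 2) for v in Counter(zip(y_true, y_pred)).values())
    sum_rows = sum(comb(v, 2) for v in Counter(y_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(y_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(y_true, y_pred):
    """NMI with arithmetic-mean normalization of the two entropies (assumed variant)."""
    n = len(y_true)
    p_true, p_pred = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mi = sum(c / n * log(c * n / (p_true[t] * p_pred[p])) for (t, p), c in joint.items())
    h_true = -sum(c / n * log(c / n) for c in p_true.values())
    h_pred = -sum(c / n * log(c / n) for c in p_pred.values())
    denom = (h_true + h_pred) / 2
    return mi / denom if denom > 0 else 1.0


# Toy labels (hypothetical): three topics, one document placed in the wrong cluster.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 0, 0, 0, 0, 2, 2, 2]
print("ACC", clustering_accuracy(y_true, y_pred))
print("ARI", adjusted_rand_index(y_true, y_pred))
print("NMI", normalized_mutual_info(y_true, y_pred))
```

For the cluster counts in Table 1 (8 and 9 labeled classes) brute force is still feasible, but with more clusters an assignment solver such as `scipy.optimize.linear_sum_assignment` is the usual replacement for the permutation loop.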
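Fig.4  Quality evaluation results of data fusion
Fig.5  Distribution of NLP papers and patents in the MPNet vector space and K-Means clustering results
Fig.6  Distribution of NLP papers and patents in the KGDC vector space and clustering results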
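Natural language processing (NLP)            NLP papers (%)   NLP patents (%)   Computer vision (CV)                  CV papers (%)   CV patents (%)
Question answering and comprehension         8.52             8.17              Image recognition and classification  9.45            10.28
Text mining                                  8.23             11.43             Image segmentation                    10.43           6.95
Machine translation and multilingual         8.87             7.39              Object detection                      8.76            6.96
Information extraction                       6.67             18.20             Neural image processing               8.68            10.71
Natural language generation                  7.76             10.56             Object tracking                       8.49            11.40
Language models                              9.09             3.52              Machine vision                        8.36            8.80
Speech processing and multimodal             8.79             3.95              Visual estimation                     10.06           7.09
Semantic and syntactic parsing               7.88             7.96              Medical imaging                       6.84            10.95
Knowledge representation and reasoning       7.43             10.20             3D vision                             11.00           4.59
Search and recommender systems               7.88             14.22             Neural network architectures          12.08           2.88
Adversarial attacks and interpretability     8.73             2.24              Biometric identification              5.85            19.38
Social computing                             9.31             2.15
Table 3  Research topics in natural language processing and computer vision and their shares in papers and patents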
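One way to read Table 3 is to rank topics by how much larger their share is in patents than in papers. The sketch below does this for the NLP topics only, using the figures from the table; the English topic names are renderings of the table labels, and the gap measure (patent share minus paper share, in percentage points) is an illustrative choice rather than an indicator defined by the authors.

```python
# Shares (%) of NLP topics copied from Table 3: topic -> (papers, patents).
NLP_SHARES = {
    "Question answering and comprehension": (8.52, 8.17),
    "Text mining": (8.23, 11.43),
    "Machine translation and multilingual": (8.87, 7.39),
    "Information extraction": (6.67, 18.20),
    "Natural language generation": (7.76, 10.56),
    "Language models": (9.09, 3.52),
    "Speech processing and multimodal": (8.79, 3.95),
    "Semantic and syntactic parsing": (7.88, 7.96),
    "Knowledge representation and reasoning": (7.43, 10.20),
    "Search and recommender systems": (7.88, 14.22),
    "Adversarial attacks and interpretability": (8.73, 2.24),
    "Social computing": (9.31, 2.15),
}


def patent_minus_paper(shares):
    """Rank topics by patent share minus paper share (percentage points), descending."""
    gaps = {topic: round(patent - paper, 2) for topic, (paper, patent) in shares.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranked = patent_minus_paper(NLP_SHARES)
print("most patent-heavy:", ranked[0])
print("most paper-heavy:", ranked[-1])
```

With these figures, information extraction shows the largest tilt toward patents and social computing the largest tilt toward papers.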