Paper and Patent Data Fusion Based on Deep Text Clustering
Xie Shiyao1,2, Wang Xiaomei1
1 Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
2 School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing 100049, China
[Objective] This study integrates papers and patents by research topic to bridge the language gap between the two document types. [Method] Using Wikipedia as the primary classification system, we semi-automatically constructed a small annotated set. We then designed a semi-supervised deep text clustering model to fuse papers and patents with similar topics. Finally, we created indicators to evaluate the quality of the data fusion. [Results] The model's clustering accuracy was 2.4% to 11.9% higher than that of the baseline models. Its data fusion quality score reached 0.9, and the model can supplement additional research topics on the basis of the known ones. [Limitations] We did not conduct an empirical analysis using the fused data, and the number of clusters must be set manually. [Conclusion] The proposed model can extract topic-related features from the differing texts of papers and patents and thereby fuse the two data sources effectively.
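The abstract reports clustering accuracy but does not state how it is computed. A minimal sketch of one common convention in the deep clustering literature is given below: cluster ids are mapped one-to-one onto gold topic labels so that the number of correctly placed documents is maximal, and accuracy is the matched fraction. The function name `clustering_accuracy` and the toy labels are illustrative assumptions, not taken from the paper. The brute-force search over mappings is only practical for a handful of clusters; an assignment solver such as SciPy's `linear_sum_assignment` is the usual choice at scale.

```python
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best-match accuracy between gold topic labels and cluster assignments.

    Hypothetical helper for illustration. Cluster ids are arbitrary, so every
    one-to-one mapping between cluster ids and labels is tried and the best
    score is kept.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    if len(true_labels) == 0:
        raise ValueError("inputs must be non-empty")

    # Distinct values in first-seen order (avoids sorting mixed types).
    labels = list(dict.fromkeys(true_labels))
    clusters = list(dict.fromkeys(cluster_ids))

    # counts[(c, y)] = number of documents with gold label y in cluster c.
    counts = Counter(zip(cluster_ids, true_labels))

    best = 0
    if len(clusters) <= len(labels):
        # Assign each cluster a distinct label.
        for perm in permutations(labels, len(clusters)):
            matched = sum(counts[(c, y)] for c, y in zip(clusters, perm))
            best = max(best, matched)
    else:
        # More clusters than labels: assign each label a distinct cluster.
        for perm in permutations(clusters, len(labels)):
            matched = sum(counts[(c, y)] for c, y in zip(perm, labels))
            best = max(best, matched)

    return best / len(true_labels)


if __name__ == "__main__":
    # Toy example: three papers and two patents with made-up topic labels.
    gold = ["ml", "ml", "bio", "bio", "bio"]
    pred = [1, 1, 0, 0, 1]
    print(clustering_accuracy(gold, pred))
```

Because the score is invariant to how cluster ids are numbered, it can be compared across models that number their clusters differently.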
Xie Shiyao, Wang Xiaomei. Paper and Patent Data Fusion Based on Deep Text Clustering[J]. Data Analysis and Knowledge Discovery, 2024, 8(4): 112-124.