Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (4): 25-36    DOI: 10.11925/infotech.2096-3467.2020.1255
Analyzing Patent Technology Topics with IPC Context-Enhanced Context-LDA Model
Yi Huifang, Liu Xiwen
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper addresses three problems of topic modeling in patent analysis: lack of context, weak interpretability, and poor integration of IPC information. [Methods] First, we proposed the concept of context enhancement. Second, we built a Context-LDA model that uses IPC codes and the extracted vocabulary together as the training corpus. Third, we implemented the topic model in Python and compared its generalization and topic representation abilities with those of the traditional LDA model. [Results] We tested the proposed model on 38,354 graphene patents. The new model had lower perplexity (below 100) and generalized well across different scenarios. Its JS value was about 0.1 higher than that of the traditional LDA model. Within each topic, the IPC codes and the topic words helped explain each other and made the topic easier to read. The average IPC position was 9.6/20, with little noise. [Limitations] The vocabulary representation in the new model needs to be extended from unigrams to n-grams. [Conclusions] Topic models play an important role in supporting the analysis of patent topics, and more effective and accurate models should be developed to meet practical needs.
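
The abstract states that IPC codes and extracted vocabulary are used together as the training corpus. The following is a minimal sketch of one way such a document could be assembled before it is passed to an LDA trainer. The function name, the truncation of IPC codes to the four-character subclass level, the de-duplication, and the sample record are all assumptions made for illustration; the abstract does not specify how the authors combine the two sources.

```python
def build_context_document(ipc_codes, tokens, subclass_len=4):
    """Return one training document: IPC subclass codes followed by the words.

    Assumptions (not taken from the paper): IPC codes are truncated to the
    subclass level (e.g. "C01B 32/184" -> "C01B"), and each subclass is kept
    once, in first-seen order.
    """
    subclasses = []
    for code in ipc_codes:
        subclass = code.strip().upper()[:subclass_len]
        if subclass and subclass not in subclasses:
            subclasses.append(subclass)
    return subclasses + list(tokens)


# Hypothetical patent record, for illustration only.
doc = build_context_document(
    ["C01B 32/184", "B82Y 40/00", "C01B 32/19"],
    ["graphene", "preparation", "doping"],
)
print(doc)  # ['C01B', 'B82Y', 'graphene', 'preparation', 'doping']
```

Treating each IPC subclass as an ordinary token lets a standard LDA implementation place classification codes and words in the same topic-word distribution, which is consistent with the mixed IPC and word topics shown in the Context-LDA topic table.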

Keywords: Technology Topic Analysis; Topic Model; Context-Enhance; Context-LDA
Received: 14 December 2020      Published: 17 May 2021
CLC Number: G250
Corresponding Author: Liu Xiwen     E-mail: liuxw@mail.las.ac.cn

Cite this article:

Yi Huifang, Liu Xiwen. Analyzing Patent Technology Topics with IPC Context-Enhanced Context-LDA Model. Data Analysis and Knowledge Discovery, 2021, 5(4): 25-36.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1255     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I4/25

Context-LDA Model
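Statistic | Words (before preprocessing) | Words (after preprocessing) | Min | Max | Mean | Median | Std. dev.
LDA total corpus size | 357,186 | 242,279 | / | / | / | / | /
LDA document length | / | / | 2 | 19 | 6.3 | 6 | 1.9
Context-LDA total corpus size | 493,768 | 378,861 | / | / | / | / | /
Context-LDA document length | / | / | 3 | 46 | 9.9 | 10 | 3.3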
Statistical Description of Corpus and Text Length
Curve of Perplexity with the Number of Iterations
Curve of Perplexity with the Number of Topics
Curve of Perplexity with the Number of Document Words
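
The three curves above all report perplexity. As a reference point, the sketch below computes perplexity with the standard definition for topic models, the exponential of the negative per-token log-likelihood over a held-out set. The function name and the toy numbers are assumptions for illustration; how the authors obtained their per-document likelihoods is not stated on this page.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity = exp(-sum_d log p(w_d) / sum_d N_d).

    doc_log_likelihoods: natural-log likelihood of each held-out document.
    doc_lengths: number of tokens N_d in each of those documents.
    """
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("held-out set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


# Toy check: a model that gives every token probability 1/50
# should have a perplexity of 50, whatever the document lengths are.
lengths = [6, 10, 4]
log_liks = [n * math.log(1 / 50) for n in lengths]
print(round(perplexity(log_liks, lengths), 6))  # 50.0
```

Lower values mean the model is less "surprised" by unseen documents, which is why the abstract reads a perplexity below 100 as evidence of good generalization.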
JS Distance Curve of the Topic Model
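
The JS value compares topic-word distributions. The sketch below assumes it is the mean pairwise Jensen-Shannon divergence between topics, so that a higher value indicates more clearly separated topics; this reading, the base-2 logarithm, and the function names are assumptions, not details confirmed by the page. The square root of the divergence gives the JS distance named in the caption.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_pairwise_js(topics):
    """Average JS divergence over all unordered pairs of topic-word distributions."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        raise ValueError("need at least two topics")
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy topic-word distributions over a four-word vocabulary (illustrative only).
toy_topics = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
print(mean_pairwise_js(toy_topics))
```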
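Topic | Topic content
Topic0 | 0.216 preparation + 0.200 conductive + 0.171 graphene + 0.120 composite + 0.077 nanomaterial + 0.061 graphene nanoribbon + 0.022 rubber + 0.021 application + 0.020 slurry + 0.010 waterborne polyurethane
Topic11 | 0.180 preparation + 0.177 nano + 0.149 graphene + 0.094 nanocomposite + 0.077 complex + 0.050 titanium dioxide + 0.032 application + 0.028 controllable + 0.027 particle + 0.025 composite
Topic35 | 0.207 graphene + 0.159 enhancement + 0.146 growth + 0.111 preparation + 0.057 equipment + 0.052 direct + 0.025 number of layers + 0.021 composite material + 0.021 graphene nanoplatelet + 0.012 in situ
Topic19 | 0.178 battery + 0.136 graphene + 0.125 lithium ion + 0.113 negative electrode + 0.107 preparation + 0.103 material + 0.063 fabrication + 0.044 composite + 0.016 low cost + 0.009 emission
Topic37 | 0.151 preparation + 0.120 material + 0.118 graphene + 0.103 positive electrode + 0.080 coating + 0.058 composite + 0.054 lithium-ion battery + 0.053 polyaniline + 0.042 catalysis + 0.031 activity
Topic44 | 0.283 material + 0.187 preparation + 0.150 graphene + 0.123 electrode + 0.051 composite + 0.041 application + 0.032 adsorption + 0.014 photocatalyst + 0.011 supercapacitor + 0.008 amino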
LDA Topic Content (Partial)
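Topic | Topic content | IPC definition | IPC position
Topic3 | 0.374 C01B + 0.194 preparation + 0.188 graphene + 0.100 B82Y + 0.023 doping + 0.016 quantum dot + 0.013 material + 0.012 nanomaterial + 0.010 composite material + 0.007 thin film | C01B: Non-metallic elements; compounds thereof. B82Y: Specific uses or applications of nanostructures; measurement or analysis of nanostructures; manufacture or treatment of nanostructures | 1/20, 4/20
Topic7 | 0.277 C30B + 0.145 graphene + 0.078 G02F + 0.074 multilayer + 0.044 growth + 0.041 preparation + 0.034 single crystal + 0.030 function + 0.019 selectivity + 0.015 structure | C30B: Single-crystal growth; unidirectional solidification of eutectic material or unidirectional demixing of eutectoid material… G02F: Devices or arrangements for the control of the intensity, colour, phase, polarisation or direction of light | 1/20, 3/20
Topic20 | 0.378 C08K + 0.294 C08L + 0.084 preparation + 0.053 graphene + 0.047 composite material + 0.013 graphene oxide + 0.013 composite + 0.010 material + 0.006 application + 0.006 enhancement | C08K: Use of inorganic or non-macromolecular organic substances as compounding ingredients. C08L: Compositions of macromolecular compounds | 1/20, 2/20
Topic31 | 0.457 H01G + 0.125 preparation + 0.083 graphene + 0.066 electrode + 0.044 material + 0.036 composite material + 0.036 application + 0.022 capacitor + 0.021 epitaxy + 0.008 doping | H01G: Capacitors; capacitors, rectifiers, detectors, switching devices, light-sensitive or temperature-sensitive devices of the electrolytic type | 1/20
Topic41 | 0.486 H01M + 0.121 preparation + 0.100 graphene + 0.066 composite material + 0.043 application + 0.042 B82Y + 0.017 lithium-ion battery + 0.014 electrode + 0.013 doping + 0.012 material | H01M: Processes or means, e.g. batteries, for the direct conversion of chemical energy into electrical energy | 1/20, 6/20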
Context-LDA Topic Content (Partial)
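
The IPC position column records where each IPC code ranks among a topic's top words. The sketch below shows one way to recover those ranks from a ranked word list, assuming IPC tokens are four-character subclass codes such as H01M. The regular expression and function name are illustrative assumptions, and the word list is the Topic41 row of the table above with the Chinese topic words glossed in English.

```python
import re

# IPC subclass pattern: section letter A-H, two-digit class, subclass letter.
IPC_SUBCLASS = re.compile(r"^[A-H]\d{2}[A-Z]$")


def ipc_positions(top_words):
    """Return the 1-based ranks at which IPC subclass codes appear in a topic."""
    return [
        rank
        for rank, word in enumerate(top_words, start=1)
        if IPC_SUBCLASS.match(word)
    ]


# Top words of Topic41, in descending probability (English glosses).
topic41 = [
    "H01M", "preparation", "graphene", "composite material", "application",
    "B82Y", "lithium-ion battery", "electrode", "doping", "material",
]
print(ipc_positions(topic41))  # [1, 6]
```

The result matches the 1/20 and 6/20 entries reported for Topic41, where the denominator is the number of top words inspected per topic.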