Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (4): 25-36     https://doi.org/10.11925/infotech.2096-3467.2020.1255
Research Paper
Analyzing Patent Technology Topics with IPC Context-Enhanced Context-LDA Model
Yi Huifang, Liu Xiwen
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper addresses three weaknesses of most current topic models: lack of context, weak interpretability, and poor integration with IPC (International Patent Classification) codes. [Methods] We propose the concept of context enhancement and an IPC context-enhanced Context-LDA model, which uses all IPC codes attached to a text together with the words extracted from it as the training corpus. We implemented the topic model in Python and compared its generalization ability and topic representation ability with those of the traditional LDA model. [Results] On 38,354 graphene patents, the IPC context-enhanced Context-LDA model had relatively low perplexity across scenarios, mostly below 100, indicating strong generalization. Its JS value was about 0.1 higher than that of the traditional LDA model, so its topics were more distinguishable. IPC codes and topic words characterized each other, which improved topic readability, and the average IPC position was 9.6/20, so the IPC codes did not introduce noise. [Limitations] The vocabulary representation of the model has not yet been extended from uni-grams to n-grams. [Conclusions] Topic models play an important supporting role in patent topic analysis, and more effective and accurate analysis models should be developed based on actual needs.
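The corpus construction described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the sample patent, the reduction of each IPC symbol to its 4-character subclass (the granularity that appears in Table 3), and the choice to keep repeated IPC codes are all hypothetical.

```python
import re

# Assumption: IPC symbols look like "H01M 4/62"; the first four characters
# (section, class, subclass) are used as the context token.
IPC_SUBCLASS = re.compile(r"^([A-H]\d{2}[A-Z])")


def to_subclass(ipc_symbol):
    """Reduce a full IPC symbol to its 4-character subclass, or None if malformed."""
    match = IPC_SUBCLASS.match(ipc_symbol.strip().upper())
    return match.group(1) if match else None


def build_context_document(tokens, ipc_symbols):
    """Return the extracted words followed by one token per valid IPC symbol.

    Repeated subclasses are kept (an assumption), so a patent classified
    twice under H01M contributes two H01M tokens.
    """
    ipc_tokens = [code for code in map(to_subclass, ipc_symbols) if code is not None]
    return list(tokens) + ipc_tokens


# Hypothetical patent record, for illustration only.
patent = {
    "tokens": ["graphene", "preparation", "electrode", "lithium-ion battery"],
    "ipc": ["H01M 4/62", "B82Y 30/00", "H01M 10/0525"],
}
doc = build_context_document(patent["tokens"], patent["ipc"])
print(doc)
```

Documents built this way can be passed to any standard LDA implementation unchanged, since the IPC codes are treated as ordinary vocabulary items.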

Keywords: Technology Topic Analysis; Topic Model; Context Enhancement; Context-LDA
Received: 2020-12-14      Published: 2021-05-17
Chinese Library Classification (ZTFLH): G250
Corresponding author: Liu Xiwen     E-mail: liuxw@mail.las.ac.cn
Cite this article:
Yi Huifang, Liu Xiwen. Analyzing Patent Technology Topics with IPC Context-Enhanced Context-LDA Model[J]. Data Analysis and Knowledge Discovery, 2021, 5(4): 25-36.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.1255      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I4/25
Fig.1  The Context-LDA model
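Item | Words (before preprocessing) | Words (after preprocessing) | Min | Max | Mean | Median | Std. dev.
LDA total corpus size | 357,186 | 242,279 | / | / | / | / | /
LDA document length | / | / | 2 | 19 | 6.3 | 6 | 1.9
Context-LDA total corpus size | 493,768 | 378,861 | / | / | / | / | /
Context-LDA document length | / | / | 3 | 46 | 9.9 | 10 | 3.3
Table 1  Statistical description of the corpus and document lengths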
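Document-length statistics of the kind reported in Table 1 can be computed with the Python standard library. The sketch below is illustrative only: the function name and toy corpus are hypothetical, and the population standard deviation is an assumption, since the page does not say which estimator was used.

```python
import statistics


def length_stats(documents):
    """Summarize document lengths (number of tokens per document)."""
    lengths = [len(doc) for doc in documents]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 1),
        "median": statistics.median(lengths),
        # Assumption: population standard deviation; use statistics.stdev
        # for the sample estimator instead.
        "std": round(statistics.pstdev(lengths), 1),
    }


# Toy corpus of tokenized documents, for illustration only.
toy_corpus = [["a", "b"], ["a", "b", "c", "d"], ["a", "b", "c", "d", "e", "f"]]
stats = length_stats(toy_corpus)
print(stats)
```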
Fig.2  Perplexity versus number of iterations
Fig.3  Perplexity versus number of topics
Fig.4  Perplexity versus number of words per document
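The page does not state the exact perplexity formula behind Figs. 2 to 4. For orientation, the definition conventionally used for LDA is given below, where $M$ is the number of held-out documents, $N_d$ the number of words in document $d$, and $p(\mathbf{w}_d)$ the likelihood the model assigns to document $d$ (these symbols are introduced here and are not taken from the paper). Lower values indicate better generalization.

```latex
\mathrm{perplexity}(D_{\mathrm{test}})
  = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```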
Fig.5  JS distance curves of the topic models
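A JS comparison between topics can be sketched as follows. This is a minimal illustration, assuming that the reported JS value is an average of pairwise Jensen-Shannon divergences between topic-word distributions; the page does not confirm the aggregation, and the function names are hypothetical. With base-2 logarithms the divergence lies between 0 (identical topics) and 1 (topics with no shared words).

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_pairwise_js(topics):
    """Average JS divergence over all unordered topic pairs (assumed aggregation)."""
    pairs = list(combinations(topics, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy topic-word distributions over a 3-word vocabulary, for illustration only.
toy_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
]
print(mean_pairwise_js(toy_topics))
```

A higher average means the topics overlap less in vocabulary, which is the sense in which the abstract describes the Context-LDA topics as more distinguishable.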
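Topic | Topic content
Topic0 | 0.216 preparation + 0.200 conductive + 0.171 graphene + 0.120 composite + 0.077 nanomaterial + 0.061 graphene nanoribbon + 0.022 rubber + 0.021 application + 0.020 slurry + 0.010 waterborne polyurethane
Topic11 | 0.180 preparation + 0.177 nano + 0.149 graphene + 0.094 nanocomposite + 0.077 complex + 0.050 titanium dioxide + 0.032 application + 0.028 controllable + 0.027 particle + 0.025 composite
Topic35 | 0.207 graphene + 0.159 enhancement + 0.146 growth + 0.111 preparation + 0.057 equipment + 0.052 direct + 0.025 number of layers + 0.021 composite material + 0.021 graphene microsheet + 0.012 in situ
Topic19 | 0.178 battery + 0.136 graphene + 0.125 lithium-ion + 0.113 negative electrode + 0.107 preparation + 0.103 material + 0.063 fabrication + 0.044 composite + 0.016 low cost + 0.009 emission
Topic37 | 0.151 preparation + 0.120 material + 0.118 graphene + 0.103 positive electrode + 0.080 coating + 0.058 composite + 0.054 lithium-ion battery + 0.053 polyaniline + 0.042 catalysis + 0.031 activity
Topic44 | 0.283 material + 0.187 preparation + 0.150 graphene + 0.123 electrode + 0.051 composite + 0.041 application + 0.032 adsorption + 0.014 photocatalyst + 0.011 supercapacitor + 0.008 amino
Table 2  LDA topic content (excerpt)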
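Topic | Topic content | IPC definition | IPC position
Topic3 | 0.374 C01B + 0.194 preparation + 0.188 graphene + 0.100 B82Y + 0.023 doping + 0.016 quantum dot + 0.013 material + 0.012 nanomaterial + 0.010 composite material + 0.007 thin film | C01B: Non-metallic elements; compounds thereof. B82Y: Specific uses or applications of nanostructures; measurement or analysis of nanostructures; manufacture or treatment of nanostructures | 1/20, 4/20
Topic7 | 0.277 C30B + 0.145 graphene + 0.078 G02F + 0.074 multilayer + 0.044 growth + 0.041 preparation + 0.034 single crystal + 0.030 function + 0.019 selectivity + 0.015 structure | C30B: Single-crystal growth; unidirectional solidification of eutectic material or unidirectional demixing of eutectoid material… G02F: Devices or arrangements for the control of the intensity, colour, phase, polarisation or direction of light | 1/20, 3/20
Topic20 | 0.378 C08K + 0.294 C08L + 0.084 preparation + 0.053 graphene + 0.047 composite material + 0.013 graphene oxide + 0.013 composite + 0.010 material + 0.006 application + 0.006 enhancement | C08K: Use of inorganic or non-macromolecular organic substances as compounding ingredients. C08L: Compositions of macromolecular compounds | 1/20, 2/20
Topic31 | 0.457 H01G + 0.125 preparation + 0.083 graphene + 0.066 electrode + 0.044 material + 0.036 composite material + 0.036 application + 0.022 capacitor + 0.021 epitaxy + 0.008 doping | H01G: Capacitors; capacitors, rectifiers, detectors, switching devices, light-sensitive or temperature-sensitive devices of the electrolytic type | 1/20
Topic41 | 0.486 H01M + 0.121 preparation + 0.100 graphene + 0.066 composite material + 0.043 application + 0.042 B82Y + 0.017 lithium-ion battery + 0.014 electrode + 0.013 doping + 0.012 material | H01M: Processes or means, e.g. batteries, for the direct conversion of chemical energy into electrical energy | 1/20, 6/20
Table 3  Context-LDA topic content (excerpt)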
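The IPC position column of Table 3 can be reproduced from a topic string by ranking terms by weight and reading off the ranks of the IPC codes. The sketch below is illustrative and makes several assumptions: the function names are hypothetical, a term counts as an IPC code if it matches the 4-character subclass pattern, and the rank is taken among the terms listed (Table 3 shows only the top 10 of each topic's 20 words, so ranks in the top 10 match ranks out of 20).

```python
import re

# Assumption: IPC tokens are 4-character subclasses such as "H01M".
IPC_CODE = re.compile(r"^[A-H]\d{2}[A-Z]$")
# A weight such as 0.486, optional spaces, then the term up to the next "+".
TERM = re.compile(r"(\d+\.\d+)\s*([^+]+)")


def parse_topic(topic_string):
    """Split '0.486 H01M+0.121 preparation+...' into (weight, term) pairs."""
    return [(float(w), t.strip()) for w, t in TERM.findall(topic_string)]


def ipc_positions(topic_string):
    """Map each IPC code in the topic to its 1-based rank by weight."""
    ranked = sorted(parse_topic(topic_string), key=lambda wt: wt[0], reverse=True)
    return {
        term: rank
        for rank, (_, term) in enumerate(ranked, start=1)
        if IPC_CODE.match(term)
    }


def mean_ipc_position(topic_strings):
    """Average rank of all IPC codes across the given topics."""
    ranks = [r for s in topic_strings for r in ipc_positions(s).values()]
    return sum(ranks) / len(ranks)


topic41 = (
    "0.486 H01M+0.121 preparation+0.100 graphene +0.066 composite material+ "
    "0.043 application+ 0.042 B82Y +0.017 lithium-ion battery+0.014 electrode+ "
    "0.013 doping+ 0.012 material"
)
print(ipc_positions(topic41))
```

Averaging these ranks over all topics of a fitted model gives a figure comparable in kind to the 9.6/20 reported in the abstract; an average near the middle of the list suggests that IPC codes do not systematically crowd out ordinary topic words.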