Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 93-103     https://doi.org/10.11925/infotech.2096-3467.2020.0272
Research Paper
Unsupervised Cross-Language Model for Patent Recommendation Based on Representation
Zhang Jinzhu1,2, Zhu Lipeng1, Liu Jingjie1
1School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
2Jiangsu Provincial Social Public Safety Science and Technology Collaborative Innovation Center, Nanjing 210094, China
Abstract

[Objective] To reduce the need for bilingual dictionaries and large-scale bilingual corpora, to better reveal and exploit the semantics of patent texts, this paper designs an unsupervised cross-language patent recommendation method from the perspective of semantic text representation, improving both recommendation performance and domain adaptability. [Methods] First, an unsupervised cross-language word-vector mapping method is designed: independently trained Chinese and English patent word vectors are mapped into a unified semantic vector space via a linear transformation, establishing semantic correspondences between Chinese and English words. Then, a smooth-inverse-frequency (SIF) weighting of word vectors yields a semantic representation of patent texts built on the cross-language patent word vectors, so that Chinese and English patent texts are represented in the same vector space. Finally, vector similarity measures are applied to compute the semantic similarity between patent texts in different languages, producing an unsupervised cross-language patent recommendation method based on representation learning. [Results] In experiments in the wireless-communication field, the unsupervised method achieves Top-1 and Top-5 recommendation accuracies of 55.63% and 77.82%, which are 0.66 and 1.45 percentage points higher than the weakly supervised cross-language method, and 4.29 and 3.90 percentage points higher than the machine-translation-based method. [Limitations] Only Chinese and English patents in one specific field are considered; broader domains and languages remain to be explored. [Conclusions] The method achieves effective Chinese-English cross-language patent recommendation and can be extended to patent recommendation in other domains and languages.
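At its core, the linear-transformation step learns a matrix W that carries Chinese word vectors into the English vector space. The following is a minimal sketch, not the authors' implementation: it assumes a set of already-aligned vector pairs is available (the fully unsupervised method induces such an alignment iteratively without a seed dictionary) and computes the optimal orthogonal W in closed form via an SVD, i.e. the orthogonal Procrustes solution.

```python
import numpy as np

def learn_mapping(X, Y):
    """Solve the orthogonal Procrustes problem: find the orthogonal W
    minimizing ||X W - Y||_F, where rows of X and Y are word vectors
    of aligned source/target word pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: when Y is an exact rotation of X, the rotation is recovered.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))                    # "Chinese" vectors
R = np.linalg.qr(rng.standard_normal((4, 4)))[0]    # a random orthogonal map
Y = X @ R                                           # "English" vectors
W = learn_mapping(X, Y)
print(np.allclose(X @ W, Y))  # True
```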

Abstract

[Objective] This paper designs a cross-language patent recommendation model based on semantic text representation, aiming to reduce reliance on bilingual dictionaries and large-scale bilingual corpora and to improve domain adaptability. [Methods] First, we designed an unsupervised cross-language word-vector mapping method, mapping independently trained Chinese and English word vectors into a unified semantic vector space with a linear transformation, which establishes semantic correspondences between Chinese and English words. Then, we created semantic representations of patent texts from the cross-language word vectors with the smooth inverse frequency (SIF) reweighting method, representing Chinese and English patent texts in the same vector space. Finally, we calculated the semantic similarity between patent texts and recommended cross-language patents. [Results] We examined the proposed method with patents on "wireless communication"; the Top-1 and Top-5 recommendation accuracies reached 55.63% and 77.82%, which were 0.66 and 1.45 percentage points higher than those of the weakly supervised cross-language recommendation, and 4.29 and 3.90 percentage points better than the machine-translation-based ones. [Limitations] We only examined the proposed method with Chinese and English patents from one specific field. [Conclusions] The proposed method recommends Chinese and English patents effectively, which will help future research on cross-language patent recommendation.
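The SIF reweighting step can be sketched as follows. This is an illustration of the general scheme of Arora et al. (weight each word vector by a/(a + p(w)), average, then remove the common first principal component), not the paper's code; `vecs`, `freq`, and the constant `a` are assumed inputs (word vectors, unigram probabilities, and the usual smoothing parameter).

```python
import numpy as np

def sif_embedding(texts, vecs, freq, a=1e-3):
    """Smooth-inverse-frequency text embeddings: weight each word vector
    by a / (a + p(w)), average per text, then subtract the projection on
    the corpus's first principal component (the common component).
    `texts` are token lists, `vecs` maps word -> vector, `freq` maps
    word -> unigram probability."""
    emb = np.array([
        np.mean([a / (a + freq[w]) * vecs[w] for w in t], axis=0)
        for t in texts
    ])
    u = np.linalg.svd(emb, full_matrices=False)[2][0]  # first right singular vector
    return emb - np.outer(emb @ u, u)

# Toy usage with hypothetical vocabulary, vectors, and frequencies.
rng = np.random.default_rng(1)
vocab = ["terminal", "access", "select"]
vecs = {w: rng.standard_normal(5) for w in vocab}
freq = {"terminal": 0.01, "access": 0.005, "select": 0.02}
texts = [["terminal", "access"], ["access", "select"], ["terminal", "select"]]
E = sif_embedding(texts, vecs, freq)
print(E.shape)  # (3, 5): one 5-dimensional vector per text
```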

Key words: Cross-Language; Patent Recommendation; Representation Learning; Semantic Representation
Received: 2020-03-31      Online: 2020-07-28
Chinese Library Classification (ZTFLH): G254
Funding: This work is supported by the National Natural Science Foundation of China projects "Semantic Fusion and Deep Mining of Patent Information Based on Representation Learning" (71974095) and "Dynamic Identification and Formation Mechanism of Breakthrough Innovations Based on Mutations of Cited Scientific Knowledge" (71503125), and by the Jiangsu Social Science Foundation project "Topic-Mutation Detection and Formation Mechanism Based on the Dynamic Evolution of Community Structure" (17TQC003).
Corresponding author: Zhang Jinzhu     E-mail: zhangjinzhu@njust.edu.cn
Cite this article:
Zhang Jinzhu, Zhu Lipeng, Liu Jingjie. Unsupervised Cross-Language Model for Patent Recommendation Based on Representation. Data Analysis and Knowledge Discovery, 2020, 4(10): 93-103.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0272      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I10/93
Fig. 1  Visualization of monolingual Chinese and English patent word vectors before mapping
Fig. 2  Visualization of Chinese-English patent word vectors based on unsupervised cross-language word mapping
Retrieval method     CSLS     KNN
Weakly supervised    46.72    43.68
Unsupervised         49.08    46.87
Table 1  Accuracy of Chinese-English cross-language word mapping (%)
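CSLS in the table is cross-domain similarity local scaling, a retrieval criterion that counteracts the hubness of plain cosine KNN by penalizing each cosine score with the mean similarity of both words to their k nearest cross-lingual neighbours: CSLS(x, y) = 2 cos(x, y) − r_tgt(x) − r_src(y). A minimal sketch of this general form, not the paper's code:

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS similarity between a source and a target embedding matrix
    (one word vector per row).  Translations are retrieved by taking,
    for each source row, the argmax over the returned scores."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T
    k = min(k, cos.shape[0], cos.shape[1])
    r_src = np.mean(np.sort(cos, axis=1)[:, -k:], axis=1)  # r_tgt(x) per source word
    r_tgt = np.mean(np.sort(cos, axis=0)[-k:, :], axis=0)  # r_src(y) per target word
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Toy check: with identical embeddings, each word's best match is itself.
S = csls_scores(np.eye(3), np.eye(3), k=2)
print(S.argmax(axis=1))  # [0 1 2]
```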
Fig. 3  Effect of the number of patent words on cross-language word mapping accuracy
Chinese word              Mapped English words                       Common matching word
移动终端 (mobile terminal)  mobile-terminal; terminal; mobile-phone    mobile-terminal
接入点 (access point)       access-point; AP; access-points            access-point
选择 (select)               selecting; selection; selected             select
检测 (detect)               reducing; reduced; reduce                  reduce
快速的 (fast)               quickly; rapid; rapidly                    fast
准确的 (accurate)           accurately; accuracy; accurate             accurate
Table 2  Examples of Chinese-English cross-language patent word mappings
Fig. 4  Chinese-English patent text representation based on cross-language patent word vectors
Cross-language patent recommendation method    Top-1 Accuracy    Top-5 Accuracy
Machine translation                            51.34             73.92
Unsupervised + averaged word vectors           33.75             56.50
Unsupervised + TF-IDF                          42.01             65.45
Weakly supervised + SIF                        54.97             76.37
Unsupervised + SIF                             55.63             77.82
Table 3  Accuracy of Chinese-English cross-language patent recommendation (%)
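Top-1 and Top-5 accuracy in Table 3 measure whether a patent's known counterpart appears among its 1 or 5 most similar candidates. A minimal sketch of the metric, assuming (hypothetically) that each source patent's gold counterpart shares its row index in the target set:

```python
import numpy as np

def top_k_accuracy(sim, k):
    """Fraction of source patents (rows of the similarity matrix `sim`)
    whose gold counterpart -- assumed to be the target with the same
    index -- appears among the k highest-scoring targets."""
    ranked = np.argsort(-sim, axis=1)[:, :k]          # top-k target indices per row
    gold = np.arange(sim.shape[0])[:, None]
    return np.mean(np.any(ranked == gold, axis=1))

# Toy similarity matrix: only the first patent's match is ranked first.
sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.1, 0.8],
                [0.2, 0.7, 0.6]])
print(top_k_accuracy(sim, 1))  # 1/3
print(top_k_accuracy(sim, 2))  # 2/3
```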
Table 4  Examples of Chinese-English cross-language patent recommendation results