基于表示学习的无监督跨语言专利推荐研究<sup>*</sup>

doi:10.11925/infotech.2096-3467.2020.0272

数据分析与知识发现

2020, Vol. 4

Issue (10): 93-103 https://doi.org/10.11925/infotech.2096-3467.2020.0272

研究论文

本期目录 | 过刊浏览 | 高级检索

基于表示学习的无监督跨语言专利推荐研究^*

张金柱^1,²(

),主立鹏¹,刘菁婕¹

¹南京理工大学经济管理学院南京 210094
²江苏省社会公共安全科技协同创新中心南京 210094

Unsupervised Cross-Language Model for Patent Recommendation Based on Representation

Zhang Jinzhu^1,²(

),Zhu Lipeng¹,Liu Jingjie¹

¹School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
²Jiangsu Provincial Social Public Safety Science and Technology Collaborative Innovation Center, anjing 210094, China

摘要
图/表
参考文献
相关文章
Metrics

全文: PDF (1501 KB) HTML ( 8 )
输出: BibTeX | EndNote (RIS)

摘要

【目的】 减少双语词典和大规模双语语料库的构建,提高专利文本语义的揭示和利用,从文本语义表示角度设计无监督的跨语言专利推荐方法,提高跨语言专利推荐效果和领域适用能力。【方法】 首先设计无监督跨语言词向量映射方法,通过线性变换将独立的中英专利词向量映射到统一语义向量空间,构建中英词语间的语义映射关系;然后利用平滑倒词频的词向量加权方法,形成基于跨语言专利词向量的专利文本语义表示方法,实现中英专利文本在同一向量空间中的语义表示;最后应用向量相似度计算指标,计算不同语言专利文本间的语义相似度,构建基于表示学习的无监督跨语言专利推荐方法,实现跨语言专利推荐。【结果】 在无线通信领域的实验中,无监督跨语言专利推荐方法的Top-1和Top-5推荐准确率分别达到55.63%和77.82%,较弱监督跨语言专利推荐方法分别提高了0.66%和1.45%,较基于机器翻译的跨语言专利推荐方法分别提高了4.29%和3.90%。【局限】 仅对特定领域中英专利进行推荐,尚需扩展领域和语言范围。【结论】 能够实现有效的中英跨语言专利推荐,并可扩展应用到其他领域和语种下的专利推荐中。

	服务

	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	张金柱
	主立鹏
	刘菁婕

关键词 ：跨语言, 专利推荐, 表示学习, 语义表示

Abstract：

[Objective] This paper designs a cross-language recommendation model for patents based on text semantic representation, aiming to reduce the number of bilingual dictionaries and large-scale corpus, as well as improve the ability of domain adaptation.[Methods] First, we designed a word vector mapping method with unsupervised cross-language algorithm. Then, we mapped Chinese and English word vectors to the unified semantic vector space with linear transformation, which constructed the semantic mapping relationship between Chinese and English words. Third, we created semantic representation of patent texts based on cross-language word vector with smooth inverse frequency (SIF) reweighting method. It realized the semantic representation of Chinese-English patent texts in the same vector space. Finally, we calculated the semantic similarity between patent texts and recommend the cross-language patents.[Results] We examined the proposed method with patents on “wireless communication” and the recommendation accuracy rate of the top 1 and the top 5 reached 55.63% and 77.82%, which were 0.66% and 1.45% higher than those of the weak supervised based cross-language recommendation. They were also 4.29% and 3.90% better than the machine translation based ones.[Limitations] We only examined the proposed method with Chinese and English patents from one specific field.[Conclusions] This proposed method could recommend Chinese and English patents effectively, which help future research in cross-language patent recommendations.

Key words： Cross-Language Patent Recommendation Representation Learning Semantic Representation

收稿日期: 2020-03-31 出版日期: 2020-07-28

ZTFLH:

G254

基金资助:*本文系国家自然科学基金项目“基于表示学习的专利信息语义融合与深度挖掘研究”(71974095);江苏省社会科学基金项目“基于社团结构动态演化的主题突变监测与形成机制研究”(17TQC003);国家自然科学基金项目“基于被引科学知识突变的突破性创新动态识别及其形成机理研究”的研究成果之一(71503125)

通讯作者: 张金柱 E-mail: zhangjinzhu@njust.edu.cn

引用本文:

张金柱,主立鹏,刘菁婕. 基于表示学习的无监督跨语言专利推荐研究^*[J]. 数据分析与知识发现, 2020, 4(10): 93-103.
Zhang Jinzhu,Zhu Lipeng,Liu Jingjie. Unsupervised Cross-Language Model for Patent Recommendation Based on Representation. Data Analysis and Knowledge Discovery, 2020, 4(10): 93-103.

链接本文:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0272 或 https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I10/93

Fig.1 映射前的中英单语专利词向量可视化

Fig.2 基于无监督跨语言词映射的中英专利词向量表示可视化

Table 1 中英跨语言词映射准确率（%）

Fig.3 利单词数量对跨语言词映射准确率的影响

Table 2 中英跨语言专利相关词映射实例

Fig.4 基于跨语言专利词向量表示的中英专利文本表示

Table 3 中英跨语言专利推荐准确率(%)

Table 4 中英跨语言专利推荐结果示例

[1]	Jochim C, Lioma C, Schütze H, et al. Preliminary Study into Query Translation for Patent Retrieval[C]//Proceedings of the 3rd Workshop on Patent Information Retrieval. 2010: 57-66.
[2]	Magdy W, Jones G J F. Studying Machine Translation Technologies for Large-Data CLIR Tasks: A Patent Prior-Art Search Case Study[J]. Information Retrieval, 2014,17(5):492-519. doi: 10.1007/s10791-013-9231-6
[3]	Magdy W, Jones G J F. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search[C]//Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011: 1925-1928.
[4]	Shen X, Huang H Y, Li L Z, et al. A Parallel Cross-Language Retrieval System for Patent Documents[C]//Proceedings of the 6th IEEE International Conference on Software Engineering and Service Science. 2015: 672-676.
[5]	Lee C S, Wang M H, Hsiao Y C, et al. Ontology-Based GFML Agent for Patent Technology Requirement Evaluation and Recommendation[J]. Soft Computing, 2019,23(2):537-556. doi: 10.1007/s00500-017-2859-1
[6]	Ji X, Gu X J, Dai F, et al. Patent Collaborative Filtering Recommendation Approach Based on Patent Similarity[C]//Proceedings of the 8th International Conference on Fuzzy Systems and Knowledge Discovery. 2011: 1699-1703.
[7]	Rui X H, Min D. HIM-PRS: A Patent Recommendation System Based on Hierarchical Index-Based MapReduce Framework[C]//Proceedings of UCAWSN 2016, CUTE 2016, CSA 2016: Advances in Computer Science and Ubiquitous Computing. 2016: 843-848.
[8]	李枫林, 柯佳. 词向量语义表示研究进展[J]. 情报科学, 2019,37(5):155-165.
[8]	( Li Fenglin, Ke Jia. Research Progress of Word Vector Semantic Representation[J]. Information Science, 2019,37(5):155-165.)
[9]	涂存超, 杨成, 刘知远, 等. 网络表示学习综述[J]. 中国科学:信息科学, 2017,47(8):980-996.
[9]	( Tu Cunchao, Yang Cheng, Liu Zhiyuan, et al. Network Representation Learning: An Overview[J]. SCIENTIA SINICA Informationis, 2017,47(8):980-996.)
[10]	刘知远, 孙茂松, 林衍凯, 等. 知识表示学习研究进展[J]. 计算机研究与发展, 2016,53(2):247-261.
[10]	( Liu Zhiyuan, Sun Maosong, Lin Yankai, et al. Knowledge Representation Learning: A Review[J]. Journal of Computer Research and Development, 2016,53(2):247-261.)
[11]	彭晓娅, 周栋. 跨语言词向量研究综述[J]. 中文信息学报, 2020,34(2):1-15.
[11]	( Peng Xiaoya, Zhou Dong. Survey of Cross-Lingual Word Embedding[J]. Journal of Chinese Information Processing, 2020,34(2):1-15.)
[12]	Mikolov T, Le Q V, Sutskever I. Exploiting Similarities among Languages for Machine Translation[OL].arXiv Preprint, arXiv: 1309.4168.
[13]	Dinu G, Baroni M. Improving Zero-Shot Learning by Mitigating the Hubness Problem[C]// Proceedings of the 3rd International Conference on Learning Representations. 2014. DOI: 10.1007/978-3-319-23528-8_9.
[14]	Faruqui M, Dyer C. Improving Vector Space Word Representations Using Multilingual Correlation[C]//Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. 2014: 462-471.
[15]	Lu A, Wang W, Bansal M, et al. Deep Multilingual Correlation for Improved Word Embeddings[C]//Proceedings of 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015: 250-256.
[16]	Smith S L, Turban D H P, Hamblin S, et al. Offline Bilingual Word Vectors, Orthogonal Transformations and the Inverted Softmax[C]//Proceedings of the 5th International Conference on Learning Representations. 2017.
[17]	Xing C, Wang D, Liu C, et al. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation[C]//Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015: 1006-1011.
[18]	Barone A V M. Towards Cross-Lingual Distributed Representations Without Parallel Text Trained with Adversarial Autoencoders[C]//Proceedings of the 1st Workshop on Representation Learning for NLP. 2016: 121-126.
[19]	Zhang M, Liu Y, Luan H B, et al. Adversarial Training for Unsupervised Bilingual Lexicon Induction[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017. DOI: 10.18653/v1/P17-1179.
[20]	Artetxe M, Labaka G, Agirre E. A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018. DOI: 10.18653/v1/P18-1073.
[21]	Arora S, Liang Y, Ma T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings[C]//Proceedings of the 5th International Conference on Learning Representations. 2017.
[22]	Conneau A, Lample G, Ranzato M A, et al. Word Translation Without Parallel Data[C]//Proceedings of the 6th International Conference on Learning Representations. 2017.
[23]	Artetxe M, Labaka G, Agirre E. Learning Bilingual Word Embeddings with (Almost) no Bilingual Data[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 451-462.
[24]	Oh S, Lei Z, Lee W C, et al. CV-PCR: A Context-Guided Value-Driven Framework for Patent Citation Recommendation[C]//Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2013: 2291-2296.

[1]	李文娜, 张智雄. 基于联合语义表示的不同知识库中的实体对齐方法研究*[J]. 数据分析与知识发现, 2021, 5(7): 1-9.
[2]	陈文杰,文奕,杨宁. 基于节点向量表示的模糊重叠社区划分算法^*[J]. 数据分析与知识发现, 2021, 5(5): 41-50.
[3]	徐峥,乐小虬. 类目式文档语义特征AND-OR逻辑表达式生成方法[J]. 数据分析与知识发现, 2021, 5(5): 95-103.
[4]	张鑫,文奕,许海云. 一种融合表示学习与主题表征的作者合作预测模型^*[J]. 数据分析与知识发现, 2021, 5(3): 88-100.
[5]	张金柱, 于文倩. 基于短语表示学习的主题识别及其表征词抽取方法研究[J]. 数据分析与知识发现, 2021, 5(2): 50-60.
[6]	余传明, 张贞港, 孔令格. 面向链接预测的知识图谱表示模型对比研究^*[J]. 数据分析与知识发现, 2021, 5(11): 29-44.
[7]	余传明, 王曼怡, 林虹君, 朱星宇, 黄婷婷, 安璐. 基于深度学习的词汇表示模型对比研究*[J]. 数据分析与知识发现, 2020, 4(8): 28-40.
[8]	梁野,李小元,许航,胡伊然. CLOpin:一种面向舆情分析与预警领域的跨语言知识图谱架构*[J]. 数据分析与知识发现, 2020, 4(6): 1-14.
[9]	余传明,钟韵辞,林奥琛,安璐. 基于网络表示学习的作者重名消歧研究^*[J]. 数据分析与知识发现, 2020, 4(2/3): 48-59.
[10]	丁勇,陈夕,蒋翠清,王钊. 一种融合网络表示学习与XGBoost的评分预测模型*[J]. 数据分析与知识发现, 2020, 4(11): 52-62.
[11]	余传明,李浩男,王曼怡,黄婷婷,安璐. 基于深度学习的知识表示研究:网络视角*[J]. 数据分析与知识发现, 2020, 4(1): 63-75.
[12]	黄名选,卢守东,徐辉. 基于加权关联模式挖掘与规则后件扩展的跨语言信息检索 ^*[J]. 数据分析与知识发现, 2019, 3(9): 77-87.
[13]	曾庆田,胡晓慧,李超. 融合主题词嵌入和网络结构分析的主题关键词提取方法 ^*[J]. 数据分析与知识发现, 2019, 3(7): 52-60.
[14]	傅柱,王曰芬,丁绪辉. 面向知识重用的设计过程知识语义表示研究^*[J]. 数据分析与知识发现, 2019, 3(6): 21-29.
[15]	曾庆田,戴明弟,李超,段华,赵中英. 轨迹数据融合用户表示方法的重要位置发现^*[J]. 数据分析与知识发现, 2019, 3(6): 75-82.

Viewed

Full text

Abstract

Cited

Shared

Discussed