Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (6): 1-14     https://doi.org/10.11925/infotech.2096-3467.2019.1145
Research Paper
CLOpin: A Cross-Lingual Knowledge Graph Framework for Public Opinion Analysis and Early Warning
Liang Ye (1,2), Li Xiaoyuan (3), Xu Hang (2), Hu Yiran (2)
(1) Artificial Intelligence and Human Languages Lab, Beijing Foreign Studies University, Beijing 100089, China
(2) School of Information Science and Technology, Beijing Foreign Studies University, Beijing 100089, China
(3) School of Asian Studies, Beijing Foreign Studies University, Beijing 100089, China

Abstract

[Objective] This paper explores how information maps across languages, with the aim of effectively monitoring public opinion abroad and providing positive guidance to domestic audiences. [Methods] We propose CLOpin, a framework for constructing cross-lingual knowledge graphs from multiple sources in the field of public opinion analysis and early warning. CLOpin provides several toolsets, designed for different scenarios, to process cross-lingual data sets. It integrates data from various sources efficiently and constructs a Cross-Lingual Knowledge Graph (CLKG) to support cross-lingual public opinion analysis and early warning. [Results] Compared with single-language knowledge graphs, the knowledge completeness of the CLKG within the first hour of a breaking event was 13.9% higher, and only 5.2% lower than the completeness the single-language graphs reached within 24 hours. [Limitations] Construction of the CLKG is constrained by the scarcity of domain experts, which is a bottleneck for building knowledge graphs in less commonly used languages. [Conclusions] In the CLOpin framework, knowledge from different sources is complementary and markedly expands the information available about an event, which helps to track public opinion accurately and to issue early warnings accordingly.

Key words: Cross-Lingual; Knowledge Graph; Public Opinion Analysis; Early Warning; Machine Learning
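Received: 2019-10-18      Published: 2020-07-07
CLC number: TP393 G250
Funding: *This work is one of the research outcomes of the Beijing Social Science Foundation basic research project "Research on Cross-Lingual Information Dissemination and Public Opinion Early Warning Mechanisms in the Network Society" (15SHA002); the National Social Science Fund of China project "Research on Public Opinion in Non-Common-Language Social Networks for National Security in the Big Data Era" (15CTQ028); and the Beijing Foreign Studies University first-class discipline database construction project "Construction of a Large-Scale Multilingual Chinese-Foreign Online Corpus in the Context of Big Data" (YY19SSK02).
Corresponding author: Xu Hang     E-mail: xuhangbfsu@163.com
Cite this article:
Liang Ye, Li Xiaoyuan, Xu Hang, Hu Yiran. CLOpin: A Cross-Lingual Knowledge Graph Framework for Public Opinion Analysis and Early Warning. Data Analysis and Knowledge Discovery, 2020, 4(6): 1-14.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1145      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I6/1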
Fig.1  Relationships among the three kinds of knowledge graphs
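Framework | Input data sources | Combines expert toolsets with machine learning and deep learning | Output
CLOpin | (1) RDF data converted from structured instances (English and non-English); (2) unstructured data; (3) prior knowledge of public opinion analysis experts | | CKG, IKG
XLore | Unstructured data | | CKG, IKG
XLORE2 | Unstructured data | | CKG, IKG
WikiCiKE | Structured data | | CKG
ConceptNet5.5 | Structured data, unstructured data, and expert prior knowledge | | CKG
CLEQS | YAGO2 | | CKG
DBpedia NIF | Unstructured data | | A corpus
EventKG | Structured data, unstructured data | | CKG
Body-Mind-Language | Europarl corpus | | CKG
CrossOIE | Structured data | | A classifier
Table 1  Cross-lingual knowledge graph frameworks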
Fig.2  Overall architecture of CLOpin
Fig.3  Generation of CUOL
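Concept identifier | Lexical identifier | String identifier | Word-origin identifier
C0005896: 特朗普, K?p, Trump, Trompete | L0005874: 特朗普, K?p | S0008563: 特朗普 (Sino-Tibetan) | A0008123: 特朗普 (Chinese)
 | | S0008548: K?p (Undetermined) | A0009306: K?p (Vietnamese)
 | | S0008521: (Undetermined) | A0008966: (Lao)
 | L0005873: Trump, Trompete | S0005623: Trump (Indo-European) | A0001452: Trump (English)
 | | S0004578: Trompete (Indo-European) | A0007896: Trompete (Portuguese)
Table 2  Concept features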
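The four identifier levels in Table 2 can be read as a nested data structure: one concept, several lexical groups, each holding strings that are in turn tied to a language. The following is a minimal sketch under that reading, not the authors' implementation. The class and field names are hypothetical; the identifiers and strings are a subset of the rows in Table 2.

```python
from dataclasses import dataclass, field


@dataclass
class SourceTerm:
    """A string as used in one specific language (A-prefixed identifier)."""
    aui: str
    text: str
    language: str


@dataclass
class StringForm:
    """A distinct surface string (S-prefixed identifier) and its language family."""
    sui: str
    text: str
    family: str
    sources: list[SourceTerm] = field(default_factory=list)


@dataclass
class LexicalGroup:
    """Strings grouped under one lexical identifier (L-prefixed)."""
    lui: str
    strings: list[StringForm] = field(default_factory=list)


@dataclass
class Concept:
    """A language-independent concept (C-prefixed identifier)."""
    cui: str
    groups: list[LexicalGroup] = field(default_factory=list)

    def surface_forms(self) -> set[str]:
        # Every string attached to the concept, regardless of language.
        return {s.text for g in self.groups for s in g.strings}

    def languages(self) -> set[str]:
        # Every language in which the concept has at least one string.
        return {a.language for g in self.groups for s in g.strings for a in s.sources}


# Subset of Table 2 (the Vietnamese and Lao rows are omitted here).
trump = Concept("C0005896", [
    LexicalGroup("L0005874", [
        StringForm("S0008563", "特朗普", "Sino-Tibetan",
                   [SourceTerm("A0008123", "特朗普", "Chinese")]),
    ]),
    LexicalGroup("L0005873", [
        StringForm("S0005623", "Trump", "Indo-European",
                   [SourceTerm("A0001452", "Trump", "English")]),
        StringForm("S0004578", "Trompete", "Indo-European",
                   [SourceTerm("A0007896", "Trompete", "Portuguese")]),
    ]),
])
```

With this layout, `trump.surface_forms()` collects the strings that share one concept identifier across languages, which is the property cross-lingual fusion relies on.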
Concept term | Chinese gloss | Unique identifier
terrorist attack | 恐怖袭击 | C0008532
blast | 爆炸 | C0008745
casualties | 受害者 | C0005241
Table 3  Sample of the expert corpus used in the fusion process
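Input material | Extracted concepts | New words
恐怖分子承认了这一行动,受害者人数可能会增加。爆炸对周围的商店造成了巨大的破坏。 (The terrorists claimed responsibility for the operation, and the number of victims may rise. The blast caused severe damage to the surrounding shops.) | 1. 恐怖分子: Concept: [C0005622] terrorist; 2. 爆炸: Concept: [C0008745] blast; 3. 受害者: Concept: [C0005241] casualties | 恐怖分子: Concept: [C0005622] Terrorist
Table 4  Example of word discovery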
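The relationship between Tables 3 and 4 can be sketched as a lookup followed by a filter: concepts are matched in the input text, and any matched concept whose identifier is absent from the expert corpus is reported as a new word. This is only an illustration under assumptions. The page does not describe how CLOpin matches terms, so plain substring search stands in for it, and the `SURFACE_TO_CUI` lexicon and function names are hypothetical. The identifiers and strings come from Tables 3 and 4.

```python
# Expert corpus sample from Table 3: identifier -> English concept term.
EXPERT_CORPUS = {
    "C0008532": "terrorist attack",
    "C0008745": "blast",
    "C0005241": "casualties",
}

# Hypothetical surface-form lexicon; the pairs are those shown in Table 4.
SURFACE_TO_CUI = {
    "恐怖分子": "C0005622",
    "爆炸": "C0008745",
    "受害者": "C0005241",
}


def extract_concepts(text, lexicon):
    """Return (surface form, identifier) pairs ordered by first occurrence."""
    hits = []
    for surface, cui in lexicon.items():
        position = text.find(surface)
        if position >= 0:
            hits.append((position, surface, cui))
    return [(surface, cui) for _, surface, cui in sorted(hits)]


def new_words(concepts, known_cuis):
    """Keep only the concepts whose identifier the expert corpus lacks."""
    return [(surface, cui) for surface, cui in concepts if cui not in known_cuis]


text = "恐怖分子承认了这一行动,受害者人数可能会增加。爆炸对周围的商店造成了巨大的破坏。"
concepts = extract_concepts(text, SURFACE_TO_CUI)
discovered = new_words(concepts, EXPERT_CORPUS)
```

In this sketch `discovered` holds only the pair for 恐怖分子, matching the "New words" column of Table 4.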
Fig.4  Concept and relation fusion subsystem
CUI | String | Source
C0005896 | 特朗普 | Chinese-language media
 | Trump | English-language media
 | | Lao-language media
Table 5  Concept fusion results
Fig.5  Construction workflow of the IKG
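Pattern type | Pattern-based extraction rule
Time of event occurrence | 情况出现在**** ("the situation occurred at ****")
Consequence of event | 本次事件造成**** ("this event caused ****")
Table 6  Rule base used in entity extraction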
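The two rules in Table 6 can be approximated with regular expressions, treating `****` as the slot to be extracted. That reading is an assumption: the page does not say how the wildcard is bounded, so this sketch stops the slot at the next punctuation mark. The rule keys, the function name, and the sample sentence are hypothetical.

```python
import re

# Slot ends at the first ASCII or full-width punctuation mark (an assumption).
_SLOT = r"(?P<slot>[^,。;!?,;!?]+)"

RULES = {
    "event_time": re.compile("情况出现在" + _SLOT),
    "event_consequence": re.compile("本次事件造成" + _SLOT),
}


def apply_rules(text):
    """Return (pattern type, extracted slot) pairs for every rule match."""
    results = []
    for pattern_type, regex in RULES.items():
        for match in regex.finditer(text):
            results.append((pattern_type, match.group("slot")))
    return results


# Hypothetical input, written to trigger both rules.
sample = "情况出现在当地时间上午九点。本次事件造成周边商店严重受损。"
extracted = apply_rules(sample)
```

Keeping the rules in a dictionary keyed by pattern type makes it straightforward to add further rows of the rule base without touching the matching loop.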
Fig.6  Entity and relation fusion process
Fig.7  Clustering process using the Canopy + K-means method
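Fig. 7 names a Canopy + K-means pipeline. In the usual form of this combination, a cheap canopy pass with a loose threshold T1 and a tight threshold T2 (T1 > T2) decides how many clusters there are and where to seed them, and K-means then refines those seeds. The sketch below shows that general technique on toy 2-D points. The thresholds, the data, and the choice of the canopy mean as the seed are assumptions, and nothing here is taken from the authors' configuration.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def canopy_seeds(points, t1, t2):
    """Canopy pass: return one seed per canopy (mean of its members)."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    seeds = []
    while remaining:
        center = remaining.pop(0)
        # Loose threshold t1 decides canopy membership.
        members = [p for p in points if math.dist(p, center) <= t1]
        seeds.append(mean(members))
        # Tight threshold t2 removes points that may no longer start a canopy.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return seeds


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm started from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Toy data with two visibly separate groups (hypothetical).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
seeds = canopy_seeds(points, t1=3.0, t2=1.5)
centroids, clusters = kmeans(points, seeds)
```

The appeal of the canopy step is that the number of clusters does not have to be fixed in advance; it falls out of the two thresholds.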
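Relation type | Subject | Relation | Object
Relation between two concepts | C0008532 (terrorist attack) | evades | C0001235 (security check)
Relation between two instances | I0008745 (explosion occurs) | causes | I0005241 (victims appear)
Table 7  Example triples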
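The two rows of Table 7 can be held as subject-relation-object tuples. In the sketch below, the level of a triple is inferred from the identifier prefixes, `C` for concepts and `I` for instances. That prefix convention is read off the table and is an assumption; the `Triple` type, the `level` helper, and the English relation labels are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def level(triple):
    """Classify a triple by the identifier prefixes of its two ends."""
    prefixes = {triple.subject[0], triple.obj[0]}
    if prefixes == {"C"}:
        return "concept"
    if prefixes == {"I"}:
        return "instance"
    return "mixed"


TRIPLES = [
    Triple("C0008532", "evades", "C0001235"),  # terrorist attack, security check
    Triple("I0008745", "causes", "I0005241"),  # explosion occurs, victims appear
]

levels = [level(t) for t in TRIPLES]
```

Separating the two levels this way mirrors the split between the concept-level graph (CKG) and the instance-level graph (IKG) listed as CLOpin's outputs in Table 1.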
Fig.8  Results of entity and relation extraction
Fig.9  Results of cross-lingual fusion
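Event ID | Event name | Date | Chinese | English | German | Indonesian | Vietnamese
11468 | Indonesia tsunami | 2018/9/30 | 42 | 24 | 9 | 265 | 5
11793 | Dismemberment of a Saudi journalist | 2018/10/2 | 21 | 33 | 17 | 1 | 2
14854 | "Yellow Vest" protests in France | 2018/11/17 | 34 | 18 | 30 | 6 | 4
15298 | Russia's seizure of Ukrainian naval vessels | 2018/11/25 | 15 | 42 | 26 | 2 | 5
17583 | Chang'e-4 exploration of the far side of the Moon | 2019/1/3 | 213 | 8 | 6 | 4 | 3
18820 | US withdrawal from the INF Treaty | 2019/2/1 | 8 | 23 | 13 | 8 | 2
20136 | Terrorist attack in the Somali capital | 2019/3/1 | 11 | 18 | 10 | 0 | 3
21033 | Ethiopian Airlines Boeing crash | 2019/3/10 | 78 | 36 | 19 | 5 | 6
21812 | New Zealand mosque shootings | 2019/3/15 | 39 | 15 | 11 | 1 | 2
23515 | Notre-Dame de Paris fire | 2019/4/15 | 53 | 27 | 23 | 4 | 3
Table 8  Coverage of the same events in news in different languages (unit: number of reports)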
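Event ID | Event name | Information points (1 hour), single-language average | Information points (1 hour), cross-lingual composite | Information points (24 hours), single-language maximum
11468 | Indonesia tsunami | 26 | 29 | 30
11793 | Dismemberment of a Saudi journalist | 18 | 22 | 25
14854 | "Yellow Vest" protests in France | 12 | 15 | 16
15298 | Russia's seizure of Ukrainian naval vessels | 26 | 28 | 29
17583 | Chang'e-4 exploration of the far side of the Moon | 68 | 69 | 71
18820 | US withdrawal from the INF Treaty | 19 | 21 | 21
20136 | Terrorist attack in the Somali capital | 14 | 17 | 18
21033 | Ethiopian Airlines Boeing crash | 37 | 42 | 45
21812 | New Zealand mosque shootings | 35 | 39 | 40
23515 | Notre-Dame de Paris fire | 42 | 48 | 51
Table 9  Comparison of cross-lingual and single-language information fusion over different time spans (unit: information points)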
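The 13.9% and 5.2% figures quoted in the abstract can be reproduced from the three numeric columns of Table 9. This page does not state the formula, so the sketch below rests on two assumptions: the columns are, in order, the 1-hour single-language value, the 1-hour cross-lingual value, and the 24-hour single-language value; and each percentage is the mean of the per-event relative differences. Variable names are hypothetical; the numbers are copied from Table 9.

```python
# (event ID, 1 h single-language, 1 h cross-lingual, 24 h single-language)
ROWS = [
    (11468, 26, 29, 30),
    (11793, 18, 22, 25),
    (14854, 12, 15, 16),
    (15298, 26, 28, 29),
    (17583, 68, 69, 71),
    (18820, 19, 21, 21),
    (20136, 14, 17, 18),
    (21033, 37, 42, 45),
    (21812, 35, 39, 40),
    (23515, 42, 48, 51),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Average relative gain of the cross-lingual graph over the single-language one at 1 hour.
gain_1h = mean((cross - mono) / mono for _, mono, cross, _ in ROWS)

# Average relative shortfall of the 1-hour cross-lingual value against the 24-hour value,
# measured relative to the cross-lingual value.
gap_24h = mean((full - cross) / cross for _, _, cross, full in ROWS)

print(f"1-hour gain over single-language: {gain_1h:.1%}")  # 13.9%
print(f"Gap to 24-hour single-language:   {gap_24h:.1%}")  # 5.2%
```

That both abstract figures come out under these assumptions supports the column reading, but it does not establish that the authors used this exact calculation.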