Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 52-60    DOI: 10.11925/infotech.2096-3467.2018.0914
Extracting Keywords with Topic Embedding and Network Structure Analysis
Qingtian Zeng1,2, Xiaohui Hu2, Chao Li1,3
1(College of Electronic Information Engineering, Shandong University of Science and Technology, Qingdao 266590, China)
2(College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China)
3(Key Laboratory of Embedded System and Service Computing (Tongji University), Ministry of Education, Shanghai 201804, China)
[Objective] This paper proposes a new model for extracting topic keywords, aiming to detect low-frequency words that are highly relevant to a topic. [Methods] We designed a topic keyword extraction method that integrates topic embedding with network structure analysis. First, we extracted a preliminary set of topic keywords with the LDA model. Second, we trained word vectors with the Word2Vec model. Third, we built a network based on word vector similarity and identified the final topic keywords through network structure analysis. [Results] The new method improved the average similarity between topic keywords by 14.75%, and it extracted low-frequency keywords with high topic relevance more effectively than the LDA model. [Limitations] The sample size needs to be expanded, and the segmentation process requires further manual adjustment. More research is needed to analyze the topic keywords quantitatively. [Conclusions] Our method benefits abstracting and public opinion analysis.

Key words: Network Structure Analysis; Word Embeddings; Topic Model; Keywords Extraction; Representation Learning
Received: 19 August 2018      Published: 06 September 2019
ZTFLH:  TP393 G35  
Corresponding author: Chao Li

Qingtian Zeng, Xiaohui Hu, Chao Li. Extracting Keywords with Topic Embedding and Network Structure Analysis. Data Analysis and Knowledge Discovery, 2019, 3(7): 52-60.

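Parameter | Description | Default
-sentence | Corpus used for training |
-size | Dimensionality of the word vectors | 100
-window | Size of the sliding window during training | 5
-min_count | Minimum word count | 5
-negative | Number of "noise words" | 5
-hs | Selects the training algorithm | 0
-sg | Selects the model to use | 0
-workers | Number of worker threads | 3
-sample | Sampling threshold | 1e-3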
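
The parameter names and defaults in this table coincide with those of the `Word2Vec` class in the gensim library. The source does not name its toolkit, so treating these as gensim keyword arguments is an assumption. Under that assumption, the sketch below collects the table values in a dictionary and renames `size` to `vector_size`, the name used since gensim 4.0. The helper `to_gensim4_kwargs` is hypothetical and introduced only for illustration.

```python
# Defaults as listed in the table (parameter names without the leading dash).
# Assumption: they correspond to gensim's Word2Vec keyword arguments.
TABLE_DEFAULTS = {
    "size": 100,      # dimensionality of the word vectors
    "window": 5,      # sliding window size
    "min_count": 5,   # minimum word count
    "negative": 5,    # number of "noise words" (negative samples)
    "hs": 0,          # in gensim: 0 = negative sampling, 1 = hierarchical softmax
    "sg": 0,          # in gensim: 0 = CBOW, 1 = skip-gram
    "workers": 3,     # number of worker threads
    "sample": 1e-3,   # sampling threshold
}


def to_gensim4_kwargs(params):
    """Return a copy of params with `size` renamed to `vector_size`."""
    kwargs = dict(params)
    if "size" in kwargs:
        kwargs["vector_size"] = kwargs.pop("size")
    return kwargs


if __name__ == "__main__":
    print(to_gensim4_kwargs(TABLE_DEFAULTS))
    # With gensim >= 4.0 installed and `corpus` a list of token lists
    # (the table's -sentence entry; gensim calls it `sentences`):
    #   from gensim.models import Word2Vec
    #   model = Word2Vec(sentences=corpus, **to_gensim4_kwargs(TABLE_DEFAULTS))
```

The training call is left as a comment so that the sketch runs without gensim or a corpus.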
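
Stop-word type | Examples
Punctuation and other special symbols | , 、 : 《 》 etc.
Year, month and day expressions | 2016年 (year 2016), 3月 (March), etc.
Single characters left after word segmentation | 人 (person), 区 (district), 校 (school), 期 (period), etc.
Frequent words with no substantive meaning | 通知 (notice), 关于 (regarding), 做好 (do well), 组织 (organize), etc.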
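
The four stop-word types in this table can be expressed as simple token-level rules. The following is a minimal sketch of such a filter, not the authors' implementation. The stop list contains only the examples shown in the table, and the function names and the date pattern are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical stop list, seeded only with the examples from the table.
CUSTOM_STOPWORDS = {"通知", "关于", "做好", "组织"}

# Tokens such as 2016年 or 3月: digits followed by a year/month/day character.
DATE_PATTERN = re.compile(r"^\d+[年月日]$")


def is_punctuation(token):
    """True if every character is Unicode punctuation or a symbol."""
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in token)


def is_stop_token(token):
    """Apply the four stop-word rules from the table to one token."""
    if not token or token.isspace():
        return True
    if is_punctuation(token):          # punctuation and special symbols
        return True
    if DATE_PATTERN.match(token):      # year, month and day expressions
        return True
    if len(token) == 1:                # single characters after segmentation
        return True
    return token in CUSTOM_STOPWORDS   # frequent words with no real meaning


def filter_tokens(tokens):
    """Keep only the tokens that are not stop tokens."""
    return [t for t in tokens if not is_stop_token(t)]


if __name__ == "__main__":
    tokens = ["关于", "做好", "2016年", "国家奖学金", "评审", "的", "通知", "。"]
    print(filter_tokens(tokens))
```

The input is assumed to be already segmented into tokens; the segmentation step itself is outside this sketch.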
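
Keyword | 教学 (teaching) | 停电 (power outage) | SCI | 国家奖学金 (National Scholarship)
1 | 实习 (internship) | 停水 (water outage) | EI | 省政府奖学金 (Provincial Government Scholarship)
2 | 培养 (training) | 封闭 (closure) | 收录 (indexed) | 国家励志奖学金 (National Encouragement Scholarship)
3 | 课程 (course) | 停暖 (heating outage) | SSCI | 国家助学金 (National Grant)
4 | 课堂 (classroom) | 楼房 (building) | CSSCI | 奖学金 (scholarship)
5 | 立项 (project approval) | 供水 (water supply) | 索引 (index) | 上海创立奖学金 (Shanghai Chuangli Scholarship)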
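
The abstract describes building a network from word vector similarity and selecting keywords through network structure analysis, without specifying the structural measure. The sketch below illustrates the general idea under stated assumptions: words are linked when their cosine similarity reaches a threshold, and candidates are ranked by weighted degree, which is a stand-in chosen here and not necessarily the measure used in the paper. The threshold, the function names and the toy vectors are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_network(vectors, threshold):
    """Link two words when their cosine similarity is >= threshold."""
    words = sorted(vectors)
    edges = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            sim = cosine(vectors[w1], vectors[w2])
            if sim >= threshold:
                edges[(w1, w2)] = sim
    return edges


def weighted_degree(edges):
    """Sum of incident edge weights for every word that has an edge."""
    scores = {}
    for (w1, w2), sim in edges.items():
        scores[w1] = scores.get(w1, 0.0) + sim
        scores[w2] = scores.get(w2, 0.0) + sim
    return scores


def rank_keywords(vectors, threshold=0.8, top_n=5):
    """Rank connected words by weighted degree, highest first."""
    scores = weighted_degree(build_similarity_network(vectors, threshold))
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    # Toy 2-dimensional vectors, invented for this example only.
    toy = {
        "teaching": [1.0, 0.0],
        "course": [0.9, 0.1],
        "classroom": [0.8, 0.2],
        "water": [0.0, 1.0],
    }
    print(rank_keywords(toy))
```

In practice the vectors would come from the trained Word2Vec model and the candidate words from the preliminary LDA keyword set. Words with no edge above the threshold, such as "water" in the toy data, do not appear in the ranking.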
