Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (2): 86-95    DOI: 10.11925/infotech.2096-3467.2017.0626
Research Paper
Automatic Abstracting of Chinese Document with Doc2Vec and Improved Clustering Algorithm*
Xiaoting Jia 1, Mingyang Wang 1, Yu Cao 2
1 (College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China)
2 (Tongfang Knowledge Network Technology Co., Ltd. (Beijing), Beijing 100192, China)
Abstract

[Objective] This paper introduces the neural-network-based Doc2Vec model to capture the contextual information of a text, and combines it with an improved K-means clustering algorithm to extract summaries from single Chinese documents. [Methods] The Doc2Vec model extracts the semantic, syntactic, and word-order features of each sentence and converts the sentence into a vector of fixed dimension. Initial cluster centers for K-means are selected on the principle of highest density and farthest distance, and the sentence vectors are then clustered. Within each cluster, the information entropy of each sentence is computed, and the sentences that are highly similar to all other sentences in the cluster are extracted as summary sentences. [Results] Compared with the conventional PLSA vector representation, the proposed method improved precision, recall, and F value by 9.57%, 7.62%, and 10.30%, respectively. [Limitations] The extracted summary sentences are taken verbatim from the body text, whereas a standard abstract is a highly condensed restatement of it, so the two rarely match completely. [Conclusions] The experiments indicate that, relative to common vector representation methods, the proposed method noticeably improves automatic abstracting of Chinese documents and suggests an approach to multi-document summarization.

Keywords: Automatic Abstracting; Doc2Vec; K-means Clustering; Information Entropy
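The selection step described in the abstract (picking, within each cluster, the sentences most similar to the rest of that cluster) can be illustrated with a minimal sketch. The abstract does not give the entropy formula, so this sketch swaps in mean cosine similarity to the other cluster members as the score; the function names `cosine` and `pick_summary_sentences` are hypothetical, and the sentence vectors are assumed to have been produced already (for example by a Doc2Vec model).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_summary_sentences(vectors, labels):
    """For each cluster, return the index of the sentence with the highest
    mean cosine similarity to the other sentences in that cluster.

    Assumption: mean similarity stands in for the paper's entropy-based
    score, which this page does not define.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    chosen = []
    for members in clusters.values():
        if len(members) == 1:
            # A singleton cluster has only one candidate.
            chosen.append(members[0])
            continue

        def mean_similarity(i):
            others = [j for j in members if j != i]
            return sum(cosine(vectors[i], vectors[j]) for j in others) / len(others)

        chosen.append(max(members, key=mean_similarity))
    return sorted(chosen)  # restore document order for readability
```

Sorting the chosen indices keeps the extracted sentences in their original order, which usually reads better than cluster order in an extractive summary.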
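Received: 2017-06-29
Funding: *This work is one of the outcomes of the Fundamental Research Funds for the Central Universities project "Research on Early-Warning Methods for Mass Emergencies Based on Social Network Feature Extraction" (Grant No. 2572014DB05) and the National Natural Science Foundation of China project "Research on Supernetwork Methods for Early Warning of Mass Emergencies" (Grant No. 71473034).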
Cite this article:
Xiaoting Jia, Mingyang Wang, Yu Cao. Automatic Abstracting of Chinese Document with Doc2Vec and Improved Clustering Algorithm[J]. Data Analysis and Knowledge Discovery, 2018, 2(2): 86-95. DOI: 10.11925/infotech.2096-3467.2017.0626.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.0626
Figure 1  Schematic of the proposed method
Figure 2  Partial results of sentences and their vector representations
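Document ID 25 (36 sentences), cluster-center sentence indices:
  k = 3: [35, 28, 26]
  k = 4: [35, 28, 26, 9]
  k = 5: [35, 28, 19, 29, 13]
  k = 6: [35, 28, 19, 7, 22, 29]
  k = 7: [35, 28, 10, 19, 22, 7, 29]
  k = 8: [20, 17, 31, 10, 18, 11, 22, 36]
Document ID 26 (33 sentences), cluster-center sentence indices:
  k = 3: [18, 11, 1]
  k = 4: [18, 11, 1, 31]
  k = 5: [18, 11, 31, 1, 33]
  k = 6: [18, 11, 33, 1, 31, 7]
  k = 7: [18, 11, 33, 1, 31, 14, 7]
  k = 8: [18, 11, 33, 1, 31, 14, 7, 5]
Document ID 27 (47 sentences), cluster-center sentence indices:
  k = 3: [21, 33, 29]
  k = 4: [2, 33, 41, 16]
  k = 5: [21, 33, 41, 46, 39]
  k = 6: [21, 6, 15, 42, 34, 18]
  k = 7: [21, 33, 41, 38, 36, 18, 34]
  k = 8: [21, 40, 15, 34, 31, 2, 39, 19]
Document ID 28 (40 sentences), cluster-center sentence indices:
  k = 3: [3, 29, 10]
  k = 4: [3, 29, 28, 16]
  k = 5: [3, 29, 31, 20, 14]
  k = 6: [3, 29, 37, 16, 8, 19]
  k = 7: [3, 29, 22, 31, 8, 16, 26]
  k = 8: [3, 29, 22, 31, 8, 16, 26, 7]
Document ID 29 (44 sentences), cluster-center sentence indices:
  k = 3: [31, 42, 26]
  k = 4: [31, 42, 13, 35]
  k = 5: [31, 42, 13, 41, 6]
  k = 6: [31, 42, 13, 41, 6, 4]
  k = 7: [31, 42, 30, 41, 13, 3, 39]
  k = 8: [28, 1, 31, 30, 19, 41, 6, 40]
Table 1  Partial results: indices of cluster-center sentences in the experimental documents for different values of k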
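To make the "highest density, farthest distance" seeding idea concrete, here is a minimal sketch of one way such an initialization could work. The exact rule used in the paper is not given on this page, so everything here is an assumption: density is taken to be the number of neighbours within the mean pairwise distance, the first center is the densest point, and each further center is the point farthest from its nearest already-chosen center. The names `euclidean` and `seed_centers` are hypothetical. A deterministic greedy rule of this kind returns nested center lists as k grows, which resembles the shared leading indices in many rows of Table 1, although that resemblance does not confirm the paper's rule.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def seed_centers(vectors, k):
    """Greedy seeding sketch: densest point first, then farthest-first.

    Returns a list of k indices into `vectors` to use as initial centers
    for a subsequent K-means run. Assumed, not taken from the paper.
    """
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of vectors")

    dist = [[euclidean(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # Assumed density radius: the mean of all pairwise distances.
    pair_count = n * (n - 1) // 2
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    radius = total / pair_count if pair_count else 0.0

    # Density of a point = number of other points within the radius.
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
        for i in range(n)
    ]

    centers = [max(range(n), key=lambda i: density[i])]
    while len(centers) < k:
        candidates = [i for i in range(n) if i not in centers]
        # Farthest-first: maximise the distance to the nearest chosen center.
        centers.append(max(candidates, key=lambda i: min(dist[i][c] for c in centers)))
    return centers
```

The returned indices are zero-based, whereas the sentence indices in Table 1 appear to be one-based (document 25 has 36 sentences and an index of 36).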
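Figure 3  Comparison of standard abstracts and generated abstracts for some of the experimental documents
Figure 4  Comparison of summarization performance across algorithms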
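The precision, recall, and F value used for this kind of comparison can be computed as in the following minimal sketch. This page does not say how generated summaries were matched against the standard abstracts, so the sketch assumes both have already been reduced to comparable item identifiers (for example words or sentence ids); the function name `precision_recall_f` is hypothetical, and F is taken to be the balanced F1 measure.

```python
def precision_recall_f(generated, reference):
    """Precision, recall and F1 between two collections of comparable items.

    Assumption: items in `generated` and `reference` are directly comparable
    (e.g. words or sentence ids); duplicates are ignored.
    """
    gen, ref = set(generated), set(reference)
    hits = len(gen & ref)
    precision = hits / len(gen) if gen else 0.0
    recall = hits / len(ref) if ref else 0.0
    f_value = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_value
```

For example, a generated summary containing items 1, 4, 7, 9 scored against a reference containing 1, 7, 8 has two hits, giving precision 2/4 and recall 2/3.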