Semantic Text Mining Methodologies for Intelligence Analysis
Zhao Dongxiao, Wang Xiaoyue, Bai Rujiang, Liu Ziqiang
Institute of Scientific & Technical Information, Shandong University of Technology, Zibo 255049, China
Abstract [Objective] This paper reviews semantic text mining techniques for intelligence analysis. [Coverage] We surveyed the leading research on semantic text mining for intelligence analysis from the last ten years, along with a few earlier studies. [Methods] We first discuss semantic text mining methodologies and algorithms at the word, sentence, and paragraph levels. We then analyze these techniques from the perspectives of topic evolution and of applications of mining technologies. [Results] Compared with traditional intelligence analysis methods, semantic text mining approaches can process unstructured data and handle multi-layer structured data. [Limitations] We reviewed only the leading studies and their applications in the scientific field. [Conclusions] Semantic text mining improves the performance of traditional intelligence analysis systems and is becoming a future direction for research methodology. More research is needed to enrich the outlier semantic resources.
Received: 06 June 2016
Published: 23 November 2016