A Feature Representation Method of Scientific Data Based on Complex Text Description
Sun Wei
(National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
(Graduate University of Chinese Academy of Sciences, Beijing 100049, China)
Feature representation is one of the key issues in data clustering. Current feature representations of scientific data are inadequate, which degrades clustering results. This paper proposes the concept of complex text description and a feature representation method based on it. The method applies a different feature-weighting computation to the candidate features from each of two kinds of data sources, then strengthens the feature set by merging the two resulting feature sets. Experiments show that the method outperforms several traditional feature representation methods and markedly improves data clustering performance.
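The abstract does not give the weighting formulas or the merge rule, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the first source is weighted by plain term frequency and the second by TF-IDF, and that the two feature sets are merged by a linear combination. The function names, the choice of weightings, and the parameter `alpha` are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_frequency(docs):
    """Weight each term by its relative frequency within its document.

    Assumption: stands in for the weighting applied to the first data source.
    """
    return [
        {term: count / len(doc) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def tfidf(docs):
    """Weight each term by TF-IDF across the collection.

    Assumption: stands in for the weighting applied to the second data source.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in Counter(doc).items()
        }
        for doc in docs
    ]


def merge_features(first, second, alpha=0.5):
    """Merge two {term: weight} dicts by a linear combination.

    Assumption: `alpha` is a hypothetical mixing parameter; the paper's
    actual merge rule is not stated in the abstract.
    """
    return {
        term: alpha * first.get(term, 0.0) + (1 - alpha) * second.get(term, 0.0)
        for term in set(first) | set(second)
    }


# Illustrative, made-up token lists for two records seen through two sources.
source_a = [["gene", "kinase"], ["gene", "receptor"]]
source_b = [["kinase", "signal", "kinase"], ["receptor", "binding"]]

features_a = term_frequency(source_a)
features_b = tfidf(source_b)
merged = merge_features(features_a[0], features_b[0])
```

In this sketch a term that appears in both sources, such as "kinase", receives weight from both and ends up ranked above terms found in only one source, which is one plausible reading of how merging could strengthen the feature set.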
Sun Wei. A Feature Representation Method of Scientific Data Based on Complex Text Description[J]. New Technology of Library and Information Service, 2009, 25(5): 22-27.