Abstract: Traditional similarity algorithms suffer from poor performance and limits on data set size when mining massive data. Taking unstructured textual data as its research subject, this paper introduces a Hadoop-based distributed textual similarity algorithm that combines the Hive data mining platform with the PostgreSQL relational database (RDBMS), and describes its basic technical ideas, implementation, and empirical evaluation in detail. The test results show that Hive SQL can effectively reduce the complexity of distributed data mining, but that its real-time performance still needs improvement.
Zhao Huaming (赵华茗). Research and Implementation of Textual Similarity in Distributed Environment[J]. New Technology of Library and Information Service (现代图书情报技术), 2011, 27(7/8): 14-20.
[1] Willett P. Recent Trends in Hierarchical Document Clustering: A Critical Review[J]. Information Processing and Management, 1988, 24(5): 577-597.
[2] Salton G, Buckley C. Term Weighting Approaches in Automatic Text Retrieval[J]. Information Processing and Management, 1988, 24(5): 513-523.
[3] Callan J P. Passage-level Evidence in Document Retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA. New York: Springer-Verlag. 1994: 302-310.
[4] Hong Yihong. A Document Similarity Calculation Method Based on the MapReduce Architecture[J]. Network and Information (网络与信息), 2010(9): 36-37. (in Chinese)
[5] Map Reduce. http://hadoop.apache.org/mapreduce/.
[6] Hive. http://hive.apache.org/.
[7] Thusoo A, Sarma J S, Jain N, et al. Hive-A Petabyte Scale Data Warehouse Using Hadoop. In: Proceedings of the 2010 IEEE 26th International Conference on Data Engineering (ICDE), Long Beach, California, USA. 2010: 996-1005.
[8] Hadoop Developer Getting-Started Special Issue (Hadoop开发者入门专刊). http://ishare.iask.sina.com.cn/f/11493440.html. (in Chinese)
[9] HBase. http://hbase.apache.org/.
[10] Pig. http://pig.apache.org/.
[11] Thrift. http://incubator.apache.org/thrift/.
[12] Pavlo A, Paulson E, Rasin A, et al. A Comparison of Approaches to Large-Scale Data Analysis. In: Proceedings of the 35th SIGMOD International Conference on Management of Data, New York, NY, USA. 2009: 165-178.
[13] PostgreSQL. http://www.postgresql.org/.
[14] Eclipse. http://www.eclipse.org/.
[15] Tomcat. http://tomcat.apache.org/.
[16] Salton G, Wong A, Yang C S. A Vector Space Model for Automatic Indexing[J]. Communications of the ACM, 1975, 18(11): 613-620.
[17] Salton G. Automatic Text Processing: The Transformation Analysis and Retrieval of Information by Computer[M]. Boston, MA, USA: Addison-Wesley Longman Publishing Co., 1988.
[18] Tian R, Xie P. Study on the Standardization of Similarity Evaluation Method of Chromatographic Fingerprints (Part I)[J]. Traditional Chinese Drug Research & Clinical Pharmacology, 2006, 17(1): 40-42.