New Technology of Library and Information Service  2011, Vol. 27 Issue (7/8): 14-20    DOI: 10.11925/infotech.1003-3513.2011.07-08.03
Research and Implementation of Textual Similarity in Distributed Environment
Zhao Huaming
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract  Traditional similarity algorithms suffer from poor performance and limits on data set size when applied to mass-data mining. To address this, the paper takes unstructured textual data as its research subject and introduces a distributed textual similarity algorithm built on Hadoop, which combines the Hive data mining platform with the PostgreSQL relational database. It describes the basic technical ideas, the implementation, and the empirical research in detail. The test results show that Hive SQL effectively reduces the complexity of distributed data mining, but its real-time performance still needs improvement.
Key words  Hadoop, Hive, Similarity, Unstructured
Received: 29 April 2011      Published: 09 October 2011



Cite this article:

Zhao Huaming. Research and Implementation of Textual Similarity in Distributed Environment. New Technology of Library and Information Service, 2011, 27(7/8): 14-20.

