New Technology of Library and Information Service  2011, Vol. 27 Issue (7/8): 14-20    DOI: 10.11925/infotech.1003-3513.2011.07-08.03
Research and Implementation of Textual Similarity in Distributed Environment
Zhao Huaming
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract  To address the performance problems and data set size limits that traditional similarity algorithms face when mining massive data, this paper takes unstructured textual data as its research subject and introduces a Hadoop-based distributed textual similarity algorithm that combines the Hive data mining platform with the PostgreSQL relational database. It describes the basic technical ideas, the implementation, and the empirical study in detail. The test results show that Hive SQL can effectively reduce the complexity of distributed data mining, but its real-time performance still needs improvement.
Key words  Hadoop, Hive, Similarity, Unstructured
Received: 29 April 2011; Published: 09 October 2011



Zhao Huaming. Research and Implementation of Textual Similarity in Distributed Environment. New Technology of Library and Information Service, 2011, 27(7/8): 14-20.

