New Technology of Library and Information Service  2011, Vol. 27 Issue (7/8): 14-20    DOI: 10.11925/infotech.1003-3513.2011.07-08.03
Research and Implementation of Textual Similarity in Distributed Environment
Zhao Huaming
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract  To address the performance problems and data set size limits that traditional similarity algorithms face in mass-data mining, this paper takes unstructured textual data as its research subject and introduces a Hadoop-based distributed textual similarity method that combines the Hive data mining platform with the PostgreSQL relational database. It describes the basic technical ideas, the implementation, and the empirical study in detail. The test results show that Hive SQL can effectively reduce the complexity of distributed data mining, but its real-time performance needs improvement.
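The abstract does not say which similarity measure the paper computes. As a purely illustrative, single-machine sketch, the following assumes cosine similarity over raw term-frequency vectors, a common choice for unstructured text. The function names and sample documents are hypothetical, and nothing here reproduces the paper's Hive or PostgreSQL implementation.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Build a bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    shared_terms = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "hadoop hive distributed similarity",
    "d2": "hive similarity on hadoop",
    "d3": "postgresql relational database",
}
vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}

# All-pairs comparison: the quadratic step that becomes expensive on large
# collections and that a distributed platform would spread across nodes.
for a, b in combinations(sorted(vectors), 2):
    print(a, b, round(cosine_similarity(vectors[a], vectors[b]), 3))
```

The all-pairs loop grows quadratically with the number of documents, which is the kind of cost that motivates moving the computation to a distributed platform.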
Key words: Hadoop, Hive, Similarity, Unstructured
Received: 29 April 2011      Published: 09 October 2011
CLC number: TP393

Cite this article:

Zhao Huaming. Research and Implementation of Textual Similarity in Distributed Environment. New Technology of Library and Information Service, 2011, 27(7/8): 14-20.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2011.07-08.03     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2011/V27/I7/8/14
