New Technology of Library and Information Service  2012, Vol. 28 Issue (2): 41-47    DOI: 10.11925/infotech.1003-3513.2012.02.07
Research on Chinese New Word Recognition in Specialized Field Based on N-Gram
Duan Yufeng, Ju Fei
Business School, East China Normal University, Shanghai 200241, China
Abstract  This paper studies automatic new word recognition in a specialized field, taking phytology as the representative domain. The sample set consists of 200 plant-description documents drawn at random from “Flora of China”. First, new word candidates are extracted with the N-Gram method from the words produced by ICTCLAS segmentation. The candidates are then ranked separately by term frequency (TF), document frequency (D), and average term frequency (TF/D), and those falling within a chosen boundary are accepted as true new words. The experiments show that ranking by TF performs best, with an F-measure of 0.65. The method can automatically produce a user dictionary for a specialized field and is highly portable.
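The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the documents have already been segmented into word lists (the paper uses ICTCLAS, which is not called here), it assumes candidates are formed by joining 2 to `max_n` adjacent words, and it stands in for the paper's selection boundary with a simple minimum-TF threshold. The function names `ngram_candidates`, `candidate_stats`, `select_by_tf`, and `f_measure`, and the parameters `max_n` and `min_tf`, are hypothetical.

```python
from collections import Counter


def ngram_candidates(tokens, max_n=3):
    """Yield strings formed by joining 2..max_n adjacent segmented words.

    The range of n is an assumption; the abstract does not state it.
    """
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield "".join(tokens[i:i + n])


def candidate_stats(docs, max_n=3):
    """Return {candidate: (TF, D, TF/D)} over a list of segmented documents.

    TF counts every occurrence; D counts the documents containing the
    candidate; TF/D is the average term frequency per containing document.
    """
    tf = Counter()
    df = Counter()
    for tokens in docs:
        grams = list(ngram_candidates(tokens, max_n))
        tf.update(grams)       # every occurrence counts toward TF
        df.update(set(grams))  # each document counts at most once toward D
    return {g: (tf[g], df[g], tf[g] / df[g]) for g in tf}


def select_by_tf(stats, min_tf):
    """Keep candidates whose TF is at least min_tf (a hypothetical boundary)."""
    return {g for g, (tf, _d, _avg) in stats.items() if tf >= min_tf}


def f_measure(selected, gold):
    """Harmonic mean of precision and recall against a gold set of new words."""
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

As a usage example on toy data (not drawn from the paper), calling `candidate_stats` on the three segmented documents `["叶", "片", "卵", "形"]`, `["叶", "片", "披针", "形"]`, and `["花", "冠", "白色"]` gives the candidate `叶片` a TF of 2 and a D of 2, so `select_by_tf(stats, 2)` keeps only that candidate. Ranking by D or by TF/D would use the second or third element of each tuple in the same way.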
Key words  N-Gram      New word recognition      Term frequency
Received: 12 December 2011      Published: 23 March 2012
CLC Number: G350

Cite this article:

Duan Yufeng, Ju Fei. Research on Chinese New Word Recognition in Specialized Field Based on N-Gram. New Technology of Library and Information Service, 2012, 28(2): 41-47.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2012.02.07     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2012/V28/I2/41
