New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 59-65    DOI: 10.11925/infotech.1003-3513.2014.11.09
Chinese New Words Identification from Query Log by Extending the Context
Li Xuewei1, Lv Xueqiang1, Liu Kehui2,3
1 Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China;
2 School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China;
3 Beijing Research Center of Urban System Engineering, Beijing 100035, China
[Objective] To collect and collate new words in order to expand the current dictionary, which can improve the accuracy of Chinese word segmentation and promote the development of Chinese information processing. [Methods] A new word identification method based on context extension is proposed, drawing on the features of query strings and of new words. First, a seed collection is built from the features of query strings, and candidate new words are obtained through full extension. Second, candidates are selected according to the time span of the words. Finally, candidates are filtered with an improved left-right entropy that uses the boundary information of words. [Results] Experiments on the Sogou query log show that P@100 precision reaches 89.60%. [Limitations] The scale of the contrast strings affects the accuracy of new word identification to a certain extent. [Conclusions] The experimental results demonstrate that the method is suitable for search logs that lack the context information needed to identify new words.
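
The boundary-filtering step can be illustrated with the standard (unimproved) left-right entropy; the abstract does not specify the authors' improvement, so the following Python sketch is only an assumption-laden illustration and not the paper's method. The function name `boundary_entropies`, the sentinel symbols `<s>` and `</s>`, and the toy queries are all hypothetical. A candidate whose left and right neighbours vary widely (high entropy on both sides) is more likely to be a complete word.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbour frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c / total * math.log2(total / c) for c in counts.values())


def boundary_entropies(candidate, queries):
    """Return (left_entropy, right_entropy) of `candidate` over `queries`.

    Assumption: an occurrence at the start or end of a query is counted
    with a single sentinel symbol; other conventions are possible.
    """
    left, right = Counter(), Counter()
    for q in queries:
        start = q.find(candidate)
        while start != -1:
            end = start + len(candidate)
            left[q[start - 1] if start > 0 else "<s>"] += 1
            right[q[end] if end < len(q) else "</s>"] += 1
            start = q.find(candidate, start + 1)
    return _entropy(left), _entropy(right)


# Toy queries (hypothetical): "XY" has varied neighbours on both sides,
# while "X" is always followed by "Y", so its right entropy is zero.
toy_queries = ["abXYcd", "efXYgh", "abXYgh"]
full_left, full_right = boundary_entropies("XY", toy_queries)
part_left, part_right = boundary_entropies("X", toy_queries)
```

In a filter of this kind, a candidate would be kept only if both entropies exceed some threshold; the threshold value is a tuning choice that the abstract does not give.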

Key words: Search log; Full extension; New words; Boundary; Improved left-right entropy
Received: 24 April 2014      Published: 18 December 2014
PACS:  TP391  

Li Xuewei, Lv Xueqiang, Liu Kehui. Chinese New Words Identification from Query Log by Extending the Context. New Technology of Library and Information Service, 2014, 30(11): 59-65.

