New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 59-65    DOI: 10.11925/infotech.1003-3513.2014.11.09
Chinese New Words Identification from Query Log by Extending the Context
Li Xuewei1, Lv Xueqiang1, Liu Kehui2,3
1 Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China;
2 School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China;
3 Beijing Research Center of Urban System Engineering, Beijing 100035, China
Abstract  

[Objective] To collect and collate new words for expanding existing dictionaries, which can improve the accuracy of Chinese word segmentation and promote the development of Chinese information processing. [Methods] A context-extension method for new word recognition is proposed, based on the features of query strings and new words. First, a seed collection is built from the features of query strings, and candidate new words are obtained through full extension. Second, candidate new words are selected according to the words' time span. Finally, the candidates are filtered with an improved left-right entropy that uses the boundary information of words. [Results] Experiments on the Sogou query log show that P@100 reaches 89.60%. [Limitations] The scale of the contrast strings affects the accuracy of new word identification to a certain extent. [Conclusions] The experimental results demonstrate that the method is suitable for search logs, which lack the context information needed to identify new words.
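
The boundary filtering step named in the abstract can be illustrated with a minimal sketch of standard left-right entropy. This is not the authors' improved variant, which the abstract does not specify. The function names, the toy query list, and the choice to ignore occurrences at the very start or end of a query are all assumptions made for illustration. The idea is that a true word tends to appear next to many different neighbouring characters, so the entropy on both sides is high, whereas a fragment of a longer word is almost always followed (or preceded) by the same character.

```python
import math
from collections import Counter


def boundary_entropy(neighbors):
    """Shannon entropy (natural log) of a list of neighbouring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def left_right_entropy(candidate, queries):
    """Return (left_entropy, right_entropy) of `candidate` over `queries`.

    Assumption for this sketch: an occurrence at the start (or end) of a
    query has no left (or right) neighbour and is simply skipped on that side.
    """
    left, right = [], []
    for q in queries:
        start = q.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.append(q[start - 1])   # character just before the candidate
            if end < len(q):
                right.append(q[end])        # character just after the candidate
            start = q.find(candidate, start + 1)
    return boundary_entropy(left), boundary_entropy(right)


# Hypothetical toy log: "ab" is always followed by "c", so it looks like
# a fragment of the longer string "abc" rather than a complete word.
toy_queries = ["xabc", "yabc", "zabc"]
left_h, right_h = left_right_entropy("ab", toy_queries)
```

A candidate would then be kept only if both entropies exceed some threshold. The threshold value and any weighting between the two sides are design choices that this sketch leaves open.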

Key words: Search log; Full extension; New words; Boundary; Improved left-right entropy
Received: 24 April 2014      Published: 18 December 2014
CLC Number: TP391

Cite this article:

Li Xuewei, Lv Xueqiang, Liu Kehui. Chinese New Words Identification from Query Log by Extending the Context. New Technology of Library and Information Service, 2014, 30(11): 59-65.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.11.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I11/59

