New Technology of Library and Information Service, 2014, Vol. 30, Issue 11: 59-65. https://doi.org/10.11925/infotech.1003-3513.2014.11.09
Intelligence Analysis and Research
Chinese New Words Identification from Query Log by Extending the Context
Li Xuewei¹, Lv Xueqiang¹, Liu Kehui²,³
¹ Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China;
² School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China;
³ Beijing Research Center of Urban System Engineering, Beijing 100035, China
Abstract

[Objective] To collect and collate new words on a large scale in order to expand existing dictionaries, improve the accuracy of Chinese word segmentation, and advance Chinese information processing. [Methods] Based on the features of query strings in search logs and the characteristics of new words, a new word identification method that extends the context of search logs is proposed. First, a seed word set is obtained by analyzing the features of query strings, and the seed words are used for full-text extension over the search log to extract candidate new words. Second, new word strings are found according to the time attribute of new words. Finally, based on word boundary information, an improved left-right entropy method is proposed to extract the new words present in the corpus. [Results] In experiments on the Sogou query log, the average precision at P@100 reaches 89.60%. [Limitations] The size of the contrast string set affects the accuracy of new word identification to some extent. [Conclusions] The experiments show that the method is suitable for identifying new words in text that lacks context information, such as search logs.

Keywords: Search log; Full-text extension; New words; Boundary; Improved left-right entropy
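The boundary-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. It computes the standard left and right branching entropy of a candidate string over a list of queries; it is not the paper's improved left-right entropy, whose details are not given on this page. The function names, the boundary symbols `<s>` and `</s>`, the toy queries, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(neighbors):
    """Shannon entropy (in nats) of a neighbor-symbol count distribution."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c) for c in neighbors.values())


def left_right_entropy(candidate, queries):
    """Return (left entropy, right entropy) of `candidate` over `queries`.

    Assumption: the start and end of a query count as neighbor symbols
    of their own ("<s>" and "</s>"), since queries are short strings.
    """
    left, right = Counter(), Counter()
    for q in queries:
        start = q.find(candidate)
        while start != -1:
            end = start + len(candidate)
            left[q[start - 1] if start > 0 else "<s>"] += 1
            right[q[end] if end < len(q) else "</s>"] += 1
            start = q.find(candidate, start + 1)
    return entropy(left), entropy(right)


def is_word_like(candidate, queries, threshold=0.9):
    """Keep a candidate only if both of its boundaries are varied enough.

    The threshold is a hypothetical value chosen for the toy data below.
    """
    return min(left_right_entropy(candidate, queries)) >= threshold


# Hypothetical toy queries, not taken from the Sogou log.
toy_queries = ["微信下载", "微信官网", "下载微信", "手机微信登录", "微信"]
print(left_right_entropy("微信", toy_queries))
print(is_word_like("微信", toy_queries), is_word_like("信下", toy_queries))
```

A string that always appears with the same neighbors (such as a fragment cut from the middle of a longer word) has entropy close to zero on at least one side, while a free-standing word tends to appear with many different neighbors. Taking the minimum of the two sides is one common way to combine them; other combinations are possible.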
Received: 2014-04-24      Published: 2014-12-18
CLC number: TP391
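Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Ontology-based Automatic Patent Indexing" (No. 61271304); the project "Research on Domain-oriented Precise Search Methods for Multimodal Internet Information" (No. KZ201311232037), a Key Project of the Science and Technology Development Plan of the Beijing Municipal Education Commission and a Category B Key Project of the Beijing Natural Science Foundation; and the Innovation Team Building and Teacher Career Development Program for Beijing Municipal Universities (No. IDHT20130519).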

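Corresponding author: Li Xuewei, E-mail: li_xuewei163@163.com
Author contributions: Lv Xueqiang proposed the research topic; Li Xuewei proposed the research approach, designed and carried out the experiments, and drafted and wrote the paper; Lv Xueqiang and Liu Kehui provided data and revised the paper.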
Cite this article:
Li Xuewei, Lv Xueqiang, Liu Kehui. Chinese New Words Identification from Query Log by Extending the Context [J]. New Technology of Library and Information Service, 2014, 30(11): 59-65.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.11.09      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I11/59

