现代图书情报技术 (New Technology of Library and Information Service), 2014, Vol. 30, Issue 11: 53-58     https://doi.org/10.11925/infotech.1003-3513.2014.11.08
Section: Information Analysis and Research
半监督的网络科技信息分类模型
李传席, 张智雄, 刘建华, 钱力
中国科学院文献情报中心 北京 100190
A Semi-supervised Web Scientific and Technical Information Classification Model
Li Chuanxi, Zhang Zhixiong, Liu Jianhua, Qian Li
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
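Abstract (translated from the Chinese 摘要)

[Objective] Open Web pages of scientific and technical information are hard to tell apart by content, so traditional rule-based and statistical learning methods cannot meet the practical requirements of classifying them. [Methods] Through in-depth analysis of the content and structure of topical Web pages of scientific and technical information, and by using resources such as open ontologies to learn domain features, a semi-supervised classification model for Web scientific and technical information is built. [Results] Experimental results show that the proposed method reaches a precision of 0.9016, a recall of 0.8756, and an F1 value of 0.8884 in the classification experiment, a clear advantage over the Bayesian method. [Limitations] When applied to other categories of Web scientific and technical information, the method still requires domain experts to provide core seed features for the relevant domain. [Conclusions] The method meets the needs of deep processing of Web scientific and technical information and achieves effective Web page classification.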
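
The abstract names the ingredients of the method (expert-supplied seed features, open resources such as ontologies, semi-supervised learning) but not the algorithm itself. The sketch below is only a generic bootstrapping illustration of how such ingredients can fit together, not the authors' model: the toy `THESAURUS`, the function names, the scoring rule, and the `threshold`, `top_k`, and `rounds` parameters are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy thesaurus (term -> related terms); stands in for an open ontology.
THESAURUS = {
    "grant": ["funding", "award"],
    "funding": ["budget"],
}

def expand_seeds(seeds, thesaurus, hops=1):
    """Grow the seed feature set by following thesaurus links for `hops` rounds."""
    features = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Collect terms related to the current frontier that are not yet features.
        frontier = {rel for term in frontier for rel in thesaurus.get(term, [])} - features
        features |= frontier
    return features

def score(tokens, features):
    """Fraction of a page's tokens that are known domain features."""
    if not tokens:
        return 0.0
    return sum(tok in features for tok in tokens) / len(tokens)

def self_train(pages, seeds, thesaurus, threshold=0.3, top_k=1, rounds=2):
    """Pseudo-label confident pages, then promote their most frequent new terms to features."""
    features = expand_seeds(seeds, thesaurus)
    for _ in range(rounds):
        confident = [p for p in pages if score(p, features) >= threshold]
        counts = Counter(tok for p in confident for tok in p if tok not in features)
        features |= {tok for tok, _ in counts.most_common(top_k)}
    labels = [score(p, features) >= threshold for p in pages]
    return features, labels

# Two tokenized toy pages: the first is on-topic, the second is not.
pages = [
    ["grant", "award", "call", "deadline"],
    ["football", "match", "score", "goal"],
]
features, labels = self_train(pages, {"grant"}, THESAURUS)
print(labels)
```

The point of the design is that labeled training pages are replaced by a small seed vocabulary: the thesaurus widens it once, and unlabeled pages that already match well contribute further features in each round.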

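Keywords (translated from the Chinese 关键词): Web scientific and technical information; Web scientific and technical information classification model; Open resources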
Abstract

[Objective] Because open Web pages carrying scientific and technical information differ only slightly from one another, conventional rule-based and statistical learning methods cannot classify them well enough for practical applications. [Methods] By analyzing the content and structure of such Web pages and using open resources (such as domain ontologies and thesauri) to learn domain features, this paper proposes a semi-supervised classification model for Web scientific and technical information. [Results] Experiments show that the proposed method achieves a precision of 0.9016, a recall of 0.8756, and an F1 score of 0.8884, outperforming Naive Bayes classification. [Limitations] Applying the method to a new domain still requires domain experts to supply seed features. [Conclusions] The proposed method classifies Web scientific and technical information effectively and meets the needs of in-depth information analysis and processing.
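
The three reported scores can be checked against each other, since F1 is the harmonic mean of precision and recall. The short check below uses only the figures quoted in the abstract; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported in the abstract.
reported_precision = 0.9016
reported_recall = 0.8756

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 4))  # 0.8884, matching the reported F1 score
```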

Key words: Web scientific and technical information; Scientific and technical information classification model; Open resources
Received: 2014-05-20      Published: 2014-12-18
CLC number: TP181; G356
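Funding:

This work is one of the research outcomes of the Chinese Academy of Sciences special program for literature and information capacity building, project "Phase II Construction of the Automatic Monitoring System for Web Scientific and Technical Information" (Project No. 院1306), and of the National Science and Technology Support Program of the 12th Five-Year Plan, project "Construction of a Shared Service Platform for Scientific and Technical Knowledge Organization Systems" (Project No. 2011BAH10B03).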

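Corresponding author: Li Chuanxi (李传席), E-mail: lichuanxi@mail.las.ac.cn
Author contributions: Li Chuanxi and Zhang Zhixiong proposed the research question and designed the research framework; Li Chuanxi designed and implemented the research method and wrote the paper; Liu Jianhua provided part of the experimental data and contributed to the discussion of research ideas; Qian Li took part in the design and analysis of the experiments.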
Cite this article:
李传席, 张智雄, 刘建华, 钱力. 半监督的网络科技信息分类模型[J]. 现代图书情报技术, 2014, 30(11): 53-58.
Li Chuanxi, Zhang Zhixiong, Liu Jianhua, Qian Li. A Semi-supervised Web Scientific and Technical Information Classification Model. New Technology of Library and Information Service, 2014, 30(11): 53-58.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.11.08      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I11/53


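Copyright © 2015 Editorial Office of 《数据分析与知识发现》 (Data Analysis and Knowledge Discovery)
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn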