New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 53-58    DOI: 10.11925/infotech.1003-3513.2014.11.08
A Semi-supervised Web Scientific and Technical Information Classification Model
Li Chuanxi, Zhang Zhixiong, Liu Jianhua, Qian Li
National Science Library, Chinese Academy of Sciences, Beijing 100190, China

[Objective] Because open Web scientific and technical information items differ only slightly from one another, general rule-based and statistical learning methods cannot classify them effectively enough for practical applications. [Methods] By analyzing the content and structure of Web pages and using open resources (such as domain ontologies and thesauri) to self-learn domain features, this paper proposes a semi-supervised classification model for scientific and technical information. [Results] Experimental results show that the proposed method achieves a precision of 0.9016, a recall of 0.8756, and an F1 score of 0.8884, outperforming Naive Bayes classification. [Limitations] Applying the proposed method to a new domain still requires domain seed features to be supplied. [Conclusions] The proposed method classifies scientific and technical information effectively and meets the needs of in-depth information analysis and processing.
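
The three reported figures can be cross-checked against each other, since the F1 score is the harmonic mean of precision and recall. The snippet below is a minimal sketch of that check, not the authors' evaluation code; the function name `f1_score` and the variable names are illustrative assumptions, and only the three numeric values come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract.
reported_precision = 0.9016
reported_recall = 0.8756
reported_f1 = 0.8884

# Recompute F1 from the reported precision and recall.
computed_f1 = f1_score(reported_precision, reported_recall)
print(round(computed_f1, 4))  # 0.8884
```

The recomputed value agrees with the reported F1 score to four decimal places, so the three figures are internally consistent.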

Key words: Web scientific and technical information; Scientific and technical information classification model; Open resources
Received: 20 May 2014      Published: 18 December 2014
CLC Number: TP181

Cite this article:

Li Chuanxi, Zhang Zhixiong, Liu Jianhua, Qian Li. A Semi-supervised Web Scientific and Technical Information Classification Model. New Technology of Library and Information Service, 2014, 30(11): 53-58.


