Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (11): 43-51    DOI: 10.11925/infotech.2096-3467.2020.0238
Automatic Classification Method Based on Multi-factor Algorithm
Li Jiao1, Huang Yongwen1, Luo Tingting1, Zhao Ruixue1,2, Xian Guojian1,2
1Agricultural Information Institute of CAAS, Beijing 100081, China
2Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
Abstract  

[Objective] This paper develops an automatic classification indexing method to better manage massive information resources and support knowledge discovery. [Methods] First, we analyzed the relationship between keywords (e.g., subject terms and concepts) and classification numbers. Then, we designed a multi-factor weighted algorithm. Finally, we proposed a scheme for automatic classification indexing. [Results] We evaluated the method on annotated corpora from authoritative domains and on standard data sets. For literature with a single subject classification number, the precision, recall, and F-value were 84.1%, 79.8%, and 81.9%, respectively. For literature with two subject classification numbers, they were 83.4%, 78.8%, and 81.0%. [Limitations] The accuracy and completeness of the method rely on high-quality corpora, and the indexing of interdisciplinary literature needs improvement. [Conclusions] The proposed method can effectively perform classification tasks.

Key words: Automatic Classification; Subject Classification; Multi-factor Algorithm
Received: 24 March 2020      Published: 04 December 2020
ZTFLH:  TP393  
Corresponding Authors: Xian Guojian     E-mail: xianguojian@caas.cn

Cite this article:

Li Jiao,Huang Yongwen,Luo Tingting,Zhao Ruixue,Xian Guojian. Automatic Classification Method Based on Multi-factor Algorithm. Data Analysis and Knowledge Discovery, 2020, 4(11): 43-51.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0238     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I11/43

[Figure] Process of Automatic Classification
[Figure] Multi-factor Algorithm Model
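Symbol | Description
WP | Weight of a keyword extracted from the title, abstract, or keyword field (the indexing sources) of a document; usually set according to how strongly that position expresses the subject
TP | Number of times the extracted keyword occurs in the title, abstract, and keyword field, respectively
M | Number of matched keywords
N | Number of matched subject classification numbers
Ki | The i-th matched keyword, i ∈ [1, M]
CNj | The j-th matched subject classification number, j ∈ [1, N]
KSj | Number of keywords contained under the j-th matched subject classification number
Pj(KS|CN) | Share of all keywords that are contained under the j-th matched subject classification number
Fl(CN|KS) | Probability of each subject classification number corresponding to the l-th matched keyword, among all subject classification numbers in the corpus, l ∈ [1, KSj]
Scorej(CN) | Score of the j-th matched subject classification number, j ∈ [1, N]
Parameter Description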
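The parameters above indicate which factors feed the score of a classification number, but this page does not reproduce the scoring formula itself. The following is therefore only a minimal sketch under an assumed combination rule: each matched keyword contributes WP × TP summed over its positions, that contribution is weighted by F(CN|KS), and the class total is scaled by P(KS|CN). The position weights, the function name `score_classes`, and all sample numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical position weights (WP). The paper sets these by how well each
# field expresses the subject; the concrete values here are assumptions.
POSITION_WEIGHT = {"title": 3.0, "keywords": 2.0, "abstract": 1.0}


def score_classes(keyword_counts, keyword_class_prob):
    """Assumed scoring rule, NOT the paper's exact formula.

    keyword_counts:     {keyword: {position: TP}}
    keyword_class_prob: {keyword: {class_number: F(CN|KS)}}
    Returns {class_number: score}.
    """
    # The M matched keywords are those with known class statistics.
    matched = [k for k in keyword_counts if k in keyword_class_prob]
    evidence = defaultdict(float)
    support = defaultdict(set)
    for kw in matched:
        # Sum of WP * TP over the positions where the keyword occurs.
        weight = sum(POSITION_WEIGHT[pos] * tp
                     for pos, tp in keyword_counts[kw].items())
        for cn, prob in keyword_class_prob[kw].items():
            evidence[cn] += weight * prob
            support[cn].add(kw)
    # Scale by P(KS|CN): share of matched keywords found under the class.
    return {cn: evidence[cn] * len(support[cn]) / len(matched)
            for cn in evidence}


# Toy input with made-up probabilities.
counts = {"tea culture": {"title": 1, "abstract": 4},
          "strategy": {"abstract": 3}}
probs = {"tea culture": {"TS971": 0.6, "G641": 0.1},
         "strategy": {"F274": 0.2, "TS971": 0.05}}
scores = score_classes(counts, probs)
best = max(scores, key=scores.get)
```

In this toy input, TS971 is supported by both keywords, so it is not penalized by the coverage factor, while classes supported by a single keyword are halved.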
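Bibliographic field | Content
Original subject classification number | TS971
Title | Promoting the Tea Culture of Non-Major Tea-Producing Areas
Abstract | Judging from the external promotion of China's tea culture in recent years, the tea culture of non-major tea-producing areas remains a weak point in outward communication, which constrains the formation of a balanced overall brand reputation for Chinese tea culture. The "attention economy" is forcing the tea culture of these areas into cultural competition, and the tea culture industry has become an important direction for transforming and upgrading their tea industry; such factors make innovation in promoting their tea culture especially urgent. Possible promotion strategies include integrated marketing communication, collaborative marketing communication, and "two-step flow" communication.
Keywords | non-major tea-producing areas; tea culture promotion; tea culture industry; tea industry value chain; cultural competition
An Example of Automatic Classification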
No. | Keyword | Number of subject classification numbers | Subject classification numbers (occurrences)
1 | tea culture (茶文化) | 45 | TS971(2028);H319.3(146);G641(141);F592.7(136);H315.9(128);F326.12(87);F592(85);F426.82(74);TS206.2(62);TS971-4(58)…
2 | promotion (推介) | 5 | F326.13(18);F426.6(17);F830.59(15);G206.3(12);F426.4(10)
3 | major producing area (主产区) | 18 | F326.11(72);S512.1(68);F323.7(64);F724.721(50);S511(40);F326.13(32);S831(31);S831.5(28);F326.3(25);F326.12(19)…
4 | tea (茶叶) | 62 | S571.1(923);TS272.7(735);F326.12(491);TS272(450);F426.82(289);TS971(209);S481.8(139);O657.63(118);S435.711(101);TS272.4(96)…
5 | industry (产业) | 137 | F326.13(894);F326.12(302);F127(283);F326.2(240);F326.3(239);F326.11(200);F326.1(132);F426.82(124);F327(102);F062.9(85)…
6 | culture (文化) | 158 | H319(329);H315.9(291);F270(267);G122(257);G124(169);F592.7(151);G0(150);G127(145);TU986(136);TS971(130)…
7 | competition (竞争) | 91 | F270(141);F272(139);F274(137);F224(99);F272.92(79);F832.2(69);F626(53);F273.1(52);F426.61(47);F832.33(46)…
8 | value chain (价值链) | 61 | F270(171);F275.3(151);F275(144);F272(122);F274(90);F270.7(72);F224(62);F273.1(49);F406.72(49);F724.6(42)…
9 | tea industry (茶业) | 5 | F326.12(191);S571.1(102);F426.82(90);F326.1(26);TS971(25)
10 | strategy (策略) | 479 | F274(811);F275(669);H319(524);F272.92(469);F426.61(393);TP393.08(354);G434(321);F724.6(297);F592.7(268);G258.6(264)…
Matching Results Between Keywords and Classification Number
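The occurrence counts in this table are the raw material for a keyword-to-class statistic such as F(CN|KS). As a minimal sketch, the cells can be parsed and turned into relative frequencies. It assumes the statistic is normalized by the keyword's own total count, which this page does not confirm; the helper names `parse_class_counts` and `relative_frequencies` are hypothetical. The example uses the "tea industry" row because its list of five classification numbers is complete, whereas rows ending in "…" are truncated and would give only partial totals.

```python
import re


def parse_class_counts(cell):
    """Parse a cell like 'TS971(25);F326.1(26)' into {'TS971': 25, 'F326.1': 26}."""
    pairs = re.findall(r"([^;()\s]+)\((\d+)\)", cell)
    return {code: int(n) for code, n in pairs}


def relative_frequencies(counts):
    """Share of each classification number in the keyword's total count.

    Assumption: the denominator is the keyword's own total, not a
    corpus-wide total.
    """
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Row 9 of the table (tea industry), copied verbatim.
row = "F326.12(191);S571.1(102);F426.82(90);F326.1(26);TS971(25)"
counts = parse_class_counts(row)
freqs = relative_frequencies(counts)
most_frequent = max(freqs, key=freqs.get)
```

Under this assumption, TS971 receives only 25 of the 434 occurrences for this keyword, so a keyword like "tea industry" alone would point toward F326.12 rather than the original annotation.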
Rank | Subject classification number | Score
1 | TS971 | 1.72
2 | F326.13 | 1.69
3 | F326.12 | 1.51
4 | F270 | 1.12
5 | F274 | 0.94
Results of Subject Classification Number
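Given scores like these, the final indexing step can be sketched as a simple ranking. The selection rule is an assumption: this page does not state how many numbers are assigned or whether a score threshold applies, so the hypothetical helper `top_k_classes` simply returns the k highest-scoring numbers (k = 1 for single-number records, k = 2 for double-number records).

```python
def top_k_classes(scores, k=1):
    """Return the k classification numbers with the highest scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [code for code, _ in ranked[:k]]


# Scores copied from the table, deliberately given out of order.
scores = {"F270": 1.12, "TS971": 1.72, "F274": 0.94,
          "F326.13": 1.69, "F326.12": 1.51}
single = top_k_classes(scores, k=1)
double = top_k_classes(scores, k=2)
```

With k = 1 the result is TS971, which agrees with the original annotation of the example record.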
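Chinese Library Classification (CLC) classes: F = Economics; G = Culture, Science, Education, Sports; R = Medicine and Health; S = Agricultural Sciences; T = Industrial Technology; Multi = multidisciplinary. "single" and "double" denote records with one and two classification numbers.
Metric | F single | F double | G single | G double | R single | R double | S single | S double | T single | T double | Multi single | Multi double
Records in standard set | 1,000 | 500 | 1,000 | 500 | 1,000 | 500 | 1,000 | 500 | 1,000 | 500 | 5,000 | 2,500
Records indexed | 928 | 467 | 953 | 469 | 912 | 462 | 987 | 488 | 964 | 476 | 4,744 | 2,362
Records correctly indexed | 776 | 385 | 802 | 398 | 765 | 376 | 827 | 409 | 819 | 403 | 3,989 | 1,971
Precision | 83.6% | 82.4% | 84.2% | 84.9% | 83.9% | 81.4% | 83.8% | 83.0% | 85.0% | 84.7% | 84.1% | 83.4%
Recall | 77.6% | 77.0% | 80.2% | 79.6% | 76.5% | 75.2% | 82.7% | 81.8% | 81.9% | 80.6% | 79.8% | 78.8%
F-value | 80.5% | 79.6% | 83.8% | 82.2% | 80.0% | 78.2% | 83.2% | 82.8% | 83.4% | 82.6% | 81.9% | 81.0%
Performance of Automatic Classification Experiment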
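The three metrics follow from the three count rows. As a minimal sketch, assuming precision is correct records over indexed records, recall is correct records over records in the standard set, and the F-value is their harmonic mean (definitions that match the multidisciplinary single-number column), the computation looks like this. The function name `evaluate` is hypothetical.

```python
def evaluate(standard, indexed, correct):
    """Return (precision, recall, F-value) from record counts."""
    precision = correct / indexed   # share of indexed records that are right
    recall = correct / standard     # share of the standard set recovered
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Multidisciplinary, single classification number (counts from the table).
p, r, f = evaluate(standard=5000, indexed=4744, correct=3989)
summary = f"{p:.1%} {r:.1%} {f:.1%}"  # "84.1% 79.8% 81.9%"
```

Because some records receive no classification number at all, the indexed count is smaller than the standard set, which is why recall is lower than precision in every column.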
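CLC class | Records not correctly indexed | Records assigned to the wrong class | Share of wrong-class records
F | 224 | 20 | 8.9%
G | 198 | 50 | 10.1%
R | 235 | 42 | 17.9%
S | 173 | 48 | 27.7%
T | 181 | 44 | 24.3%
Evaluation of Cross-Subject Classification Results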