Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (7): 52-60    DOI: 10.11925/infotech.2096-3467.2017.0484
Original Article
Multi-Label Classification of Chinese Books with LSTM Model
Deng Sanhong, Fu Yuyangzi, Wang Hao
School of Information Management, Nanjing University, Nanjing 210023
Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023, China
Abstract  

[Objective] This paper proposes a new method for automatically cataloguing Chinese books based on the LSTM model, aiming to address both single-label and multi-label classification. [Methods] First, we used deep learning with character embedding to construct a new classification system. Then, we trained the LSTM model on strings consisting of titles and keywords. Finally, we constructed multiple binary classifiers and evaluated them on bibliographic data from three universities. [Results] The proposed model performed well, with per-category F1 scores between 86.67% and 97.19%, and showed practical value. [Limitations] We analyzed only five categories of Chinese bibliographies, and the granularity of classification was coarse. [Conclusions] The proposed LSTM-based Chinese book classification system can preprocess data and learn incrementally, and could be transferred to other fields.
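
The data flow described under [Methods] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the reserved padding and unknown indices, the 0.5 decision threshold, and the sample string are all hypothetical, and the trained LSTM is replaced by hand-written scores. It shows only how a title-plus-keyword string becomes a character index sequence, and how one binary decision per category yields zero, one, or several labels.

```python
from typing import Dict, List, Set

CATEGORIES = ["A", "C", "F", "N", "X"]  # the five categories analyzed in the paper
PAD, UNK = 0, 1  # reserved indices for padding and unseen characters (assumed)


def build_char_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer index to every distinct character, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 2  # indices 0 and 1 are reserved
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn a string into a fixed-length list of character indices (truncate, then pad)."""
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def binary_targets(labels: Set[str]) -> Dict[str, int]:
    """One 0/1 target per category; each binary classifier sees only its own target."""
    return {c: int(c in labels) for c in CATEGORIES}


def predict_labels(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Collect every category whose binary classifier score reaches the threshold."""
    return {c for c, s in scores.items() if s >= threshold}


# Invented sample: a title followed by one keyword.
sample = "环境经济学 环境保护"
vocab = build_char_vocab([sample])
print(encode(sample, vocab, max_len=12))
print(binary_targets({"F", "X"}))
# Hand-written scores standing in for the outputs of five trained binary classifiers.
print(predict_labels({"A": 0.02, "C": 0.10, "F": 0.81, "N": 0.05, "X": 0.93}))
```

Working at the character level avoids a Chinese word segmentation step, and treating each category as an independent yes/no decision is what lets a single record receive more than one label.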

Key words: LSTM Model; Deep Learning; Character Embedding; Book Automatic Classification; Multi-label Classification
Received: 27 May 2017      Published: 26 July 2017
CLC number: TP391

Cite this article:

Deng Sanhong, Fu Yuyangzi, Wang Hao. Multi-Label Classification of Chinese Books with LSTM Model. Data Analysis and Knowledge Discovery, 2017, 1(7): 52-60.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0484     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I7/52

MARC field   Meaning
001          Record identifier
200          Title
330          Abstract
606          Subject terms
690          CLC (Chinese Library Classification) number
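
As a rough illustration of how these fields could feed the classifier, the sketch below turns one record into a text string and a label set. It is not the authors' preprocessing code: the dictionary layout (field tag mapped to a list of values), the function name `record_to_example`, and the sample record are assumptions. Following the abstract, only the title (200) and subject terms (606) form the input text, and the leading letter of each CLC number (690) is taken as the top-level category.

```python
from typing import Dict, List, Set, Tuple

TARGET_CLASSES = {"A", "C", "F", "N", "X"}  # the five categories analyzed in the paper


def record_to_example(record: Dict[str, List[str]]) -> Tuple[str, str, Set[str]]:
    """Convert one parsed MARC record (hypothetical layout) into (id, text, labels)."""
    record_id = record.get("001", [""])[0]
    title = " ".join(record.get("200", []))
    subjects = " ".join(record.get("606", []))
    text = f"{title} {subjects}".strip()
    # The first letter of a CLC number identifies its top-level class.
    labels = {n.strip()[0].upper() for n in record.get("690", []) if n.strip()}
    # Keep only the categories covered by the study.
    return record_id, text, labels & TARGET_CLASSES


# Invented record with two CLC numbers, so it carries two labels.
record = {
    "001": ["0001"],
    "200": ["环境经济学"],
    "606": ["环境经济学"],
    "690": ["X196", "F062.2"],
}
print(record_to_example(record))
```

A record with several 690 values naturally becomes a multi-label example, which matches the label combinations reported in the tables that follow.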
Category label   Number of records
A                8,486
C                28,514
F                146,228
N                6,935
X                16,463
Total            206,626
Category label(s)   Number of records
A                   8,101
C                   25,595
F                   133,401
N                   6,461
X                   15,642
A, C                38
A, F                111
A, N                4
A, X                5
C, F                1,217
C, N                69
C, X                50
F, N                49
F, X                684
N, X                21
C, F, X             3
Total               191,451
Category label   Precision   Recall    F1
A                91.23%      94.32%    92.75%
C                85.47%      93.61%    89.35%
F                95.85%      98.56%    97.19%
N                83.43%      90.17%    86.67%
X                88.88%      96.13%    92.36%
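
The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The snippet below is only a consistency check on the reported figures; the variable and function names are illustrative.

```python
# Reported (precision, recall, F1) per category, in percent, copied from the table.
reported = {
    "A": (91.23, 94.32, 92.75),
    "C": (85.47, 93.61, 89.35),
    "F": (95.85, 98.56, 97.19),
    "N": (83.43, 90.17, 86.67),
    "X": (88.88, 96.13, 92.36),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for label, (p, r, f1) in reported.items():
    # Print the recomputed value next to the reported one for comparison.
    print(label, f"{f1_score(p, r):.3f}", f1)
```

Each recomputed value agrees with the reported F1 to within 0.01 percentage points, the difference being rounding.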
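Label         Actual   Prediction contains   Prediction contains   Prediction exactly
combination   count    at least one actual   all actual            equals actual
                       category              categories            categories
A, C          8        7                     4                     4
A, F          23       23                    16                    16
A, N          1        1                     0                     0
A, X          1        1                     1                     1
C, F          244      242                   140                   140
C, N          14       14                    7                     7
C, X          10       10                    5                     3
F, N          10       10                    2                     2
F, X          137      136                   100                   100
N, X          5        5                     2                     2
C, F, X       1        1                     1                     1
Total         454      450                   278                   276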
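
The three prediction columns correspond to progressively stricter matching criteria, which can be expressed as set operations on the actual and predicted label sets. The sketch below is one reading of the column headings, with illustrative function names; the rates are derived from the totals row only.

```python
from typing import Set


def at_least_one(actual: Set[str], predicted: Set[str]) -> bool:
    """The prediction shares at least one label with the actual label set."""
    return bool(actual & predicted)


def contains_all(actual: Set[str], predicted: Set[str]) -> bool:
    """Every actual label was predicted (extra predicted labels are allowed)."""
    return actual <= predicted


def exact_match(actual: Set[str], predicted: Set[str]) -> bool:
    """The predicted label set is identical to the actual label set."""
    return actual == predicted


# Totals row of the table: 454 multi-label records in the test data.
totals = {"at_least_one": 450, "contains_all": 278, "exact_match": 276}
rates = {name: count / 454 for name, count in totals.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2%}")

# A prediction of {C, F, X} for an actual {C, X} record covers all actual labels
# but is not an exact match, as with the two C, X records that differ in the table.
print(contains_all({"C", "X"}, {"C", "F", "X"}), exact_match({"C", "X"}, {"C", "F", "X"}))
```

On these totals, nearly all multi-label records receive at least one correct label, while roughly three in five have every actual label recovered.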