Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (6): 73-85    DOI: 10.11925/infotech.2096-3467.2022.0453
Tibetan News Text Classification Based on Graph Convolutional Networks
Xu Guixian, Zhang Zixin, Yu Shaona, Dong Yushuang, Tian Yuan
Information Engineering College, Minzu University of China, Beijing 100081, China
Abstract  

[Objective] To compensate for the scarcity of pre-trained knowledge for Tibetan, this paper proposes a Tibetan news text classification method based on Graph Convolutional Networks (GCN) that exploits the relationships between Tibetan syllables and documents. [Methods] First, we constructed a text graph over the Tibetan news corpus from syllable-syllable and syllable-document relations. Then, we initialized the GCN with one-hot representations of the syllable and document nodes and jointly learned syllable and document embeddings under the supervision of the document category labels in the training set. Finally, we cast the text classification task as node classification on this graph. [Results] The GCN achieved an accuracy of 70.44% on Tibetan news body texts, 8.96 to 20.66 percentage points higher than the baseline models, and 61.94% on Tibetan news titles, 6.61 to 26.05 percentage points higher than the baselines. It also outperformed SVM and CNN with pre-trained syllable embeddings and the Chinese minority pre-trained language model CINO by 0.73 to 15.10 percentage points in accuracy, and exceeded Word2Vec+LSTM by 15.65 percentage points on body texts. [Limitations] The method still relies on labeled Tibetan datasets, which are relatively scarce. [Conclusions] Three comparative experiments demonstrate the effectiveness of Graph Convolutional Networks for Tibetan news text classification. The method effectively handles the cluttered information in Tibetan news texts and supports data mining on them.
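For readers who want to see the graph-construction step concretely, the sketch below follows the standard TextGCN recipe of Yao et al. (reference [27]), on which this kind of method builds: syllable-document edges weighted by TF-IDF and syllable-syllable edges weighted by positive point-wise mutual information (PMI) over a sliding window. The abstract names only the two relation types, so the specific edge weighting is an assumption, and all function and variable names are illustrative rather than taken from the paper.

```python
# Minimal sketch of the text-graph construction, assuming the TextGCN
# edge weights of Yao et al. [27]: TF-IDF for syllable-document edges,
# positive PMI over a sliding window for syllable-syllable edges.
import math
from collections import Counter
from itertools import combinations

import scipy.sparse as sp

def build_text_graph(docs, window=20):
    """docs: list of documents, each a list of Tibetan syllables."""
    vocab = sorted({s for d in docs for s in d})
    syl_id = {s: i for i, s in enumerate(vocab)}
    n_doc = len(docs)
    n = n_doc + len(vocab)                 # document nodes come first

    rows, cols, vals = [], [], []

    # Syllable-document edges: TF-IDF of the syllable in the document.
    df = Counter(s for d in docs for s in set(d))
    for i, d in enumerate(docs):
        for s, c in Counter(d).items():
            w = (c / len(d)) * math.log(n_doc / df[s])
            rows.append(i); cols.append(n_doc + syl_id[s]); vals.append(w)

    # Syllable-syllable edges: positive PMI over sliding windows.
    win_cnt, pair_cnt, n_win = Counter(), Counter(), 0
    for d in docs:
        for k in range(max(1, len(d) - window + 1)):
            w_set = set(d[k:k + window]); n_win += 1
            win_cnt.update(w_set)
            pair_cnt.update(combinations(sorted(w_set), 2))
    for (a, b), c in pair_cnt.items():
        pmi = math.log(c * n_win / (win_cnt[a] * win_cnt[b]))
        if pmi > 0:
            ia, ib = n_doc + syl_id[a], n_doc + syl_id[b]
            rows += [ia, ib]; cols += [ib, ia]; vals += [pmi, pmi]

    adj = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    adj = adj.maximum(adj.T) + sp.eye(n)   # symmetrize, add self-loops
    return adj, syl_id
```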

Key words: Graph Convolutional Networks; Tibetan News Text Classification; Text Graph; Node Classification
Received: 08 May 2022      Published: 09 November 2022
ZTFLH: TP391; G35
Fund: National Social Science Fund of China (19BGL241)
Corresponding Authors: Xu Guixian, ORCID: 0000-0001-8815-479X, E-mail: guixian_xu@muc.edu.cn.

Cite this article:

Xu Guixian, Zhang Zixin, Yu Shaona, Dong Yushuang, Tian Yuan. Tibetan News Text Classification Based on Graph Convolutional Networks. Data Analysis and Knowledge Discovery, 2023, 7(6): 73-85.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0453     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I6/73

Figures and charts (captions only; graphics not reproduced here):
Graph Convolutional Networks Structure
Text Graph Convolution Schematic
Construction of Tibetan Word
General Architecture of the Model
Samples of Tibetan Title Text
Generation of Adjacency Matrix
Feature Representation Update Process for Nodes
Node Classification Schematic
Dataset Category Statistics
Comparison of Accuracy in 10-Fold Cross-Validation
Comparison of F1 in 10-Fold Cross-Validation
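The figures "Generation of Adjacency Matrix", "Feature Representation Update Process for Nodes", and "Node Classification Schematic" correspond to the standard propagation rule of Kipf and Welling (reference [24]), $H^{(l+1)} = \mathrm{ReLU}(\hat{A} H^{(l)} W^{(l)})$ with $\hat{A} = \tilde{D}^{-1/2}(A+I)\tilde{D}^{-1/2}$; because every node is initialized with a one-hot vector, the input feature matrix $X$ is the identity. Below is a minimal PyTorch sketch of the two-layer model under that assumption; all names are illustrative, not from the paper.

```python
# Hypothetical two-layer GCN for node classification, following
# Kipf & Welling (reference [24]). With one-hot node features X is the
# identity, so A_hat @ X @ W0 reduces to A_hat @ W0 and W0 acts as a
# learned node-embedding table.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}."""
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

class TextGCN(nn.Module):
    def __init__(self, n_nodes: int, hidden: int, n_classes: int):
        super().__init__()
        self.w0 = nn.Parameter(torch.empty(n_nodes, hidden))
        self.w1 = nn.Parameter(torch.empty(hidden, n_classes))
        nn.init.xavier_uniform_(self.w0)
        nn.init.xavier_uniform_(self.w1)

    def forward(self, a_hat: torch.Tensor) -> torch.Tensor:
        h = F.relu(a_hat @ self.w0)   # layer 1: A_hat @ X @ W0, with X = I
        return a_hat @ h @ self.w1    # layer 2: per-node class logits

# Training supervises only the document nodes whose labels are in the
# training split; test documents are classified by the argmax of their
# logits, which is the node-classification view of text classification:
#   logits = model(a_hat)
#   loss = F.cross_entropy(logits[train_doc_idx], train_labels)
```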
Classification Performance of Various Models (%)

Model           Body                           Title
                Acc    Prec   Rec    F1        Acc    Prec   Rec    F1
MultinomialNB   49.78  53.86  34.48  35.05     35.89  39.31  22.97  24.73
KNN             56.83  55.36  48.29  50.03     37.27  34.08  33.95  33.41
SVM             60.78  61.16  53.48  55.68     41.82  41.38  30.32  31.57
Transformer     29.68  19.87  17.16  13.26     31.16  12.01  17.31  12.43
MLP             61.48  55.99  55.13  54.86     44.70  38.17  35.72  35.41
CNN             60.33  53.06  54.40  51.91     53.20  44.80  46.17  44.08
CNN-rand        61.33  62.51  53.18  54.25     55.28  56.23  48.01  49.72
LSTM            52.10  44.10  43.81  42.71     55.33  48.28  48.89  47.45
GCN             70.44  70.68  66.79  67.98     61.94  61.24  56.91  58.39
Comparison with Models Incorporating Pre-Trained Knowledge (%)

Model           Body                           Title
                Acc    Prec   Rec    F1        Acc    Prec   Rec    F1
Word2Vec+SVM    69.71  67.75  67.59  67.45     46.84  45.70  32.00  32.19
Word2Vec+CNN    61.51  59.39  56.65  57.34     54.42  49.22  48.34  48.64
Word2Vec+LSTM   54.79  52.63  48.62  49.59     62.65  58.33  56.43  56.99
CINO            61.82  51.97  50.53  48.86     59.64  50.84  50.97  49.17
GCN             70.44  70.68  66.79  67.98     61.94  61.24  56.91  58.39
F1 of Sub-Category Classification (%)

Category      SVM             CNN-rand         GCN
              Body    Title   Body    Title    Body    Title
Politics      65.18   52.97   61.17   58.21    72.95   54.82
Customs       40.00   12.90   13.79   42.86    58.54   64.12
Language      47.37    7.14   41.03   54.55    63.41   60.00
Environment   68.16   51.28   67.02   65.43    78.31   55.56
Religion      56.58   32.52   59.84   46.62    67.67   57.66
Arts          42.00   28.57   41.82   41.82    52.00   57.35
Instruments   98.20   62.24  100.00   74.68    97.11   65.18
Education     56.19   39.62   56.35   51.56    68.17   74.70
Economics     53.14   31.28   47.41   40.91    64.58   60.00
Medicine      61.90   31.58   74.42   68.63    78.79   46.15
Tourism       47.31   12.31   44.44   43.40    64.76   70.05
Literature    41.86   10.26   38.30   12.50    61.54   52.17
Recall of Sub-Category Classification (%)

Category      SVM             CNN-rand         GCN
              Body    Title   Body    Title    Body    Title
Politics      68.54   69.01   71.36   76.53    70.89   55.10
Customs       29.63    7.41    7.41   33.33    44.44   61.31
Language      36.00    4.00   32.00   48.00    52.00   51.92
Environment   64.21   42.11   67.37   55.79    77.89   60.00
Religion      64.18   29.85   56.72   46.27    67.16   62.75
Arts          42.00   24.00   46.00   46.00    52.00   58.21
Instruments   97.62   72.62  100.00   70.24   100.00   68.54
Education     61.31   45.99   51.82   42.34    77.37   73.81
Economics     56.12   28.57   56.12   36.73    63.27   60.00
Medicine      50.00   23.08   61.54   67.31    75.00   34.62
Tourism       43.14    7.84   31.37   45.10    66.67   72.63
Literature    34.62    7.69   34.62    7.69    61.54   44.44
Precision of Sub-Category Classification (%)

Category      SVM             CNN-rand         GCN
              Body    Title   Body    Title    Body    Title
Politics      62.13   42.98   53.52   46.97    75.12   54.55
Customs       61.54   50.00  100.00   60.00    85.71   67.20
Language      69.23   33.33   57.14   63.16    81.25   71.05
Environment   72.62   65.57   66.67   79.10    78.72   51.72
Religion      50.59   35.71   63.33   46.97    68.18   53.33
Arts          42.00   35.29   38.33   38.33    52.00   56.52
Instruments   98.80   54.46  100.00   79.73    94.38   62.13
Education     51.85   34.81   61.74   65.91    60.92   75.61
Economics     50.46   34.57   41.04   46.15    65.96   60.00
Medicine      81.25   50.00   94.12   70.00    82.98   69.23
Tourism       52.38   28.57   76.19   41.82    62.96   67.65
Literature    52.94   15.38   42.86   33.33    61.54   63.16
Sub-Category Classification Results of GCN (%)

Category      Body                     Title
              Prec    Rec     F1       Prec    Rec     F1
Politics      69.74   74.83   71.97    61.62   63.84   62.59
Customs       74.76   49.63   58.44    60.63   45.34   51.37
Language      76.30   51.97   61.73    73.70   51.35   60.16
Environment   76.40   75.12   75.73    60.98   62.22   61.47
Religion      65.58   66.44   65.85    58.72   59.95   59.13
Arts          56.32   54.50   55.21    51.38   46.93   48.58
Instruments   94.57   99.29   96.86    67.54   69.91   68.56
Education     65.59   69.39   67.20    65.73   70.58   67.85
Economics     64.94   55.70   59.46    54.25   55.33   54.68
Medicine      83.07   76.84   79.75    68.10   51.35   57.95
Tourism       60.29   66.28   63.08    61.27   57.97   59.20
Literature    60.58   61.47   60.56    52.38   47.96   49.53
[1] Goudjil M, Koudil M, Bedda M, et al. A Novel Active Learning Method Using SVM for Text Classification[J]. International Journal of Automation and Computing, 2018, 15(3): 290-298. DOI: 10.1007/s11633-015-0912-z.
[2] Han E H, Karypis G, Kumar V. Text Categorization Using Weight Adjusted K-Nearest Neighbor Classification[C]// Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer, 2001: 53-65.
[3] Sharma N, Singh M. Modifying Naive Bayes Classifier for Multinomial Text Classification[C]// Proceedings of the 2016 International Conference on Recent Advances and Innovations in Engineering. IEEE, 2017: 1-7.
[4] Song J, Huang X L, Qin S J, et al. A Bi-Directional Sampling Based on K-Means Method for Imbalance Text Classification[C]// Proceedings of the 15th International Conference on Computer and Information Science. IEEE, 2016: 1-5.
[5] Maron M E. Automatic Indexing: An Experimental Inquiry[J]. Journal of the ACM, 1961, 8(3): 404-417. DOI: 10.1145/321075.321084.
[6] Jia Huiqiang, Li Yonghong. Design and Implementation of Tibetan Text Classifier[J]. Keji Zhifu Xiangdao, 2010(8): 30-31. (in Chinese)
[7] Jia Huiqiang. Research on Key Technologies of Tibetan Text Classification Based on KNN Algorithm[J]. Journal of Northwest University for Nationalities (Natural Science), 2011, 32(3): 24-29. (in Chinese)
[8] Zhou Deng. The Research of Tibetan Text Categorization Based on N-Gram Information[D]. Lanzhou: Northwest University for Nationalities, 2010. (in Chinese)
[9] Liu Xiaoli, Yu Hongzhi. Research of Feature Extraction Methods Based on Part of Speech in Tibetan Documents Classification[C]// Proceedings of the 2nd CCF National Conference on Service Computing. 2011: 93-97. (in Chinese)
[10] Jia Hongyun, Qun Nuo, Su Huijing, et al. Research and Implementation of Tibetan Text Classification Based on SVM[J]. Electronic Technology & Software Engineering, 2018(9): 144-146. (in Chinese)
[11] Li Ailin, Li Zhaoyao. Tibetan Text Classification Based on Naive Bayesian Technology[J]. Chinese Information, 2013(11): 11-12. (in Chinese)
[12] Kim Y. Convolutional Neural Networks for Sentence Classification[OL]. arXiv Preprint, arXiv: 1408.5882.
[13] Zhang X, Zhao J B, LeCun Y. Character-Level Convolutional Networks for Text Classification[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. ACM, 2015: 649-657.
[14] Conneau A, Schwenk H, Barrault L, et al. Very Deep Convolutional Networks for Text Classification[OL]. arXiv Preprint, arXiv: 1606.01781.
[15] Liu P F, Qiu X P, Huang X J. Recurrent Neural Network for Text Classification with Multi-Task Learning[OL]. arXiv Preprint, arXiv: 1605.05101.
[16] Yang Z C, Yang D Y, Dyer C, et al. Hierarchical Attention Networks for Document Classification[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies. 2016: 1480-1489.
[17] Su Huijing. Research and Implementation of Tibetan Text Classification Based on MLP and SepCNN Models[D]. Lhasa: Tibet University, 2021. (in Chinese)
[18] Wang Lili, Yang Hongwu, Song Zhimeng. Tibetan Text Classification Method Based on Multi-Classifiers[J]. Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2020, 40(1): 102-110. (in Chinese)
[19] Yang Z Q, Xu Z H, Cui Y M, et al. CINO: A Chinese Minority Pre-Trained Language Model[OL]. arXiv Preprint, arXiv: 2202.13558.
[20] Sun Y, Liu S S, Deng J J, et al. TiBERT: Tibetan Pre-Trained Language Model[OL]. arXiv Preprint, arXiv: 2205.07303.
[21] Shen D H, Wang G Y, Wang W L, et al. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms[OL]. arXiv Preprint, arXiv: 1805.09843.
[22] Joulin A, Grave E, Bojanowski P, et al. Bag of Tricks for Efficient Text Classification[OL]. arXiv Preprint, arXiv: 1607.01759.
[23] Cai H Y, Zheng V W, Chang K C C. A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(9): 1616-1637.
[24] Kipf T N, Welling M. Semi-Supervised Classification with Graph Convolutional Networks[OL]. arXiv Preprint, arXiv: 1609.02907.
[25] Marcheggiani D, Titov I. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling[OL]. arXiv Preprint, arXiv: 1703.04826.
[26] Bastings J, Titov I, Aziz W, et al. Graph Convolutional Encoders for Syntax-Aware Neural Machine Translation[OL]. arXiv Preprint, arXiv: 1704.04675.
[27] Yao L, Mao C S, Luo Y. Graph Convolutional Networks for Text Classification[C]// Proceedings of the 2019 AAAI Conference on Artificial Intelligence. 2019, 33(1): 7370-7377.
[28] Peng H, Li J X, He Y, et al. Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN[C]// Proceedings of the 2018 World Wide Web Conference. New York: ACM, 2018: 1063-1072.
[29] Qun N, Li X, Qiu X P, et al. End-to-End Neural Text Classification for Tibetan[C]// Proceedings of the 2017 International Symposium on Natural Language Processing Based on Naturally Annotated Big Data. 2017: 472-480.
[30] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv: 1301.3781.