[Objective] To supplement the limited pre-training knowledge available for Tibetan, this paper proposes a Tibetan news text classification method based on the Graph Convolutional Network (GCN), built on the relations between Tibetan syllables and documents. [Methods] First, we constructed a text graph over the Tibetan news corpus from syllable-syllable and syllable-document relations. Then, we initialized the GCN with one-hot representations of syllables and documents and jointly learned syllable and document embeddings under the supervision of the document category labels in the training set. Finally, we transformed the text classification task into a node classification task. [Results] The GCN achieved an accuracy of 70.44% on Tibetan news body texts, 8.96%-20.66% higher than the baseline models, and 61.94% on Tibetan news titles, 6.61%-26.05% higher than the baselines. It also surpassed SVM and CNN models using pre-trained syllable embeddings, as well as the Chinese minority pre-trained language model CINO, by 0.73%-15.1% in accuracy, and exceeded Word2Vec+LSTM on Tibetan body texts by 15.65%. [Limitations] The method still relies on labeled Tibetan datasets, which remain relatively scarce. [Conclusions] Three comparative experiments demonstrate the effectiveness of Graph Convolutional Networks for Tibetan news text classification. The method effectively addresses the cluttered organization of information in Tibetan news texts and supports data mining on them.
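The [Methods] pipeline can be made concrete with a short sketch: build a heterogeneous graph whose nodes are the documents and syllables, feed one-hot node features into a two-layer GCN, and supervise only the document nodes with their category labels. The sketch below is illustrative only. The abstract does not specify edge weighting, so syllable-document edges are weighted with TF-IDF and syllable-syllable edges with document-level PMI, following the TextGCN recipe (reference [27]); the toy corpus, placeholder syllables, hidden size, and training hyperparameters are assumptions, and PyTorch is chosen purely for brevity.

```python
# Minimal, illustrative sketch of the graph-based classification pipeline.
# Assumptions not stated in the abstract: TF-IDF weights for syllable-document
# edges, document-level PMI weights for syllable-syllable edges (as in TextGCN,
# reference [27]); toy data and hyperparameters are placeholders.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_graph(docs, vocab):
    """Normalized adjacency over [document nodes | syllable nodes]."""
    n_docs, n_syl = len(docs), len(vocab)
    n = n_docs + n_syl
    idx = {s: i for i, s in enumerate(vocab)}
    A = np.eye(n, dtype=np.float32)                      # self-loops (A + I)
    df = np.zeros(n_syl, dtype=np.float32)               # document frequency per syllable
    for d in docs:
        for s in set(d):
            df[idx[s]] += 1
    # syllable-document edges, TF-IDF weighted (assumed weighting)
    for di, d in enumerate(docs):
        for s in set(d):
            w = (d.count(s) / len(d)) * np.log(n_docs / df[idx[s]])
            A[di, n_docs + idx[s]] = A[n_docs + idx[s], di] = w
    # syllable-syllable edges, PMI over document-level co-occurrence (assumed weighting)
    for i in range(n_syl):
        for j in range(i + 1, n_syl):
            co = sum(1 for d in docs if vocab[i] in d and vocab[j] in d)
            if co:
                pmi = np.log(co * n_docs / (df[i] * df[j]))
                if pmi > 0:
                    A[n_docs + i, n_docs + j] = A[n_docs + j, n_docs + i] = pmi
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))            # symmetric normalization D^-1/2 A D^-1/2
    return torch.tensor(A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :])

class TwoLayerGCN(nn.Module):
    """Z = A_hat * ReLU(A_hat * X * W0) * W1, as in Kipf & Welling (reference [24])."""
    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()
        self.w0 = nn.Linear(n_features, n_hidden, bias=False)
        self.w1 = nn.Linear(n_hidden, n_classes, bias=False)

    def forward(self, a_hat, x):
        h = F.relu(a_hat @ self.w0(x))
        return a_hat @ self.w1(h)

# Toy usage: 4 "documents" of placeholder syllables, 2 categories.
docs = [["ka", "kha", "ga"], ["ka", "ga"], ["nga", "ca"], ["nga", "ca", "cha"]]
vocab = sorted({s for d in docs for s in d})
a_hat = build_graph(docs, vocab)
x = torch.eye(a_hat.shape[0])                            # one-hot node features
labels = torch.tensor([0, 0, 1, 1])                      # category labels of the document nodes
model = TwoLayerGCN(x.shape[1], 16, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)
for _ in range(100):
    optimizer.zero_grad()
    logits = model(a_hat, x)
    F.cross_entropy(logits[:len(docs)], labels).backward()  # supervise document nodes only
    optimizer.step()
pred = model(a_hat, x)[:len(docs)].argmax(dim=1)         # node classification = text classification
```

Because document nodes and syllable nodes sit in the same graph, label information propagates from labeled documents to syllables and on to other documents through shared syllable neighbors, which is what lets a node classifier act as a text classifier.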
[1] Goudjil M, Koudil M, Bedda M, et al. A Novel Active Learning Method Using SVM for Text Classification[J]. International Journal of Automation and Computing, 2018, 15(3): 290-298. doi: 10.1007/s11633-015-0912-z
[2] Han E H, Karypis G, Kumar V. Text Categorization Using Weight Adjusted K-Nearest Neighbor Classification[C]// Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer, 2001: 53-65.
[3] Sharma N, Singh M. Modifying Naive Bayes Classifier for Multinomial Text Classification[C]// Proceedings of the 2016 International Conference on Recent Advances and Innovations in Engineering. IEEE, 2017: 1-7.
[4] Song J, Huang X L, Qin S J, et al. A Bi-Directional Sampling Based on K-Means Method for Imbalance Text Classification[C]// Proceedings of the 15th International Conference on Computer and Information Science. IEEE, 2016: 1-5.
[5] Maron M E. Automatic Indexing: An Experimental Inquiry[J]. Journal of the ACM, 1961, 8(3): 404-417. doi: 10.1145/321075.321084
Jia Huiqiang. Research on Key Technologies of Tibetan Text Classification Based on KNN Algorithm[J]. Journal of Northwest University for Nationalities (Natural Science), 2011, 32(3): 24-29.
[8] Zhou Deng. The Research of Tibetan Text Categorization Based on N-Gram Information[D]. Lanzhou: Northwest University for Nationalities, 2010.
[9] Liu Xiaoli, Yu Hongzhi. Research of Feature Extraction Methods Based on Part of Speech in Tibetan Documents Classification[C]// Proceedings of the 2nd CCF National Conference on Service Computing. 2011: 93-97.
[10] Jia Hongyun, Qun Nuo, Su Huijing, et al. Research and Implementation of Tibetan Text Classification Based on SVM[J]. Electronic Technology & Software Engineering, 2018(9): 144-146.
[11] Li Ailin, Li Zhaoyao. Tibetan Text Classification Based on Naive Bayesian Technology[J]. Chinese Information, 2013(11): 11-12.
[12] Kim Y. Convolutional Neural Networks for Sentence Classification[OL]. arXiv Preprint, arXiv: 1408.5882.
[13] Zhang X, Zhao J B, LeCun Y. Character-Level Convolutional Networks for Text Classification[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. ACM, 2015: 649-657.
[14] Conneau A, Schwenk H, Barrault L, et al. Very Deep Convolutional Networks for Text Classification[OL]. arXiv Preprint, arXiv: 1606.01781.
[15] Liu P F, Qiu X P, Huang X J. Recurrent Neural Network for Text Classification with Multi-Task Learning[OL]. arXiv Preprint, arXiv: 1605.05101.
[16] Yang Z C, Yang D Y, Dyer C, et al. Hierarchical Attention Networks for Document Classification[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016: 1480-1489.
Wang Lili, Yang Hongwu, Song Zhimeng. Tibetan Text Classification Method Based on Multi-Classifiers[J]. Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2020, 40(1): 102-110.
[19] Yang Z Q, Xu Z H, Cui Y M, et al. CINO: A Chinese Minority Pre-Trained Language Model[OL]. arXiv Preprint, arXiv: 2202.13558.
[20] Sun Y, Liu S S, Deng J J, et al. TiBERT: Tibetan Pre-Trained Language Model[OL]. arXiv Preprint, arXiv: 2205.07303.
[21] Shen D H, Wang G Y, Wang W L, et al. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms[OL]. arXiv Preprint, arXiv: 1805.09843.
[22] Joulin A, Grave E, Bojanowski P, et al. Bag of Tricks for Efficient Text Classification[OL]. arXiv Preprint, arXiv: 1607.01759.
[23] Cai H Y, Zheng V W, Chang K C C. A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(9): 1616-1637.
[24] Kipf T N, Welling M. Semi-Supervised Classification with Graph Convolutional Networks[OL]. arXiv Preprint, arXiv: 1609.02907.
[25] Marcheggiani D, Titov I. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling[OL]. arXiv Preprint, arXiv: 1703.04826.
[26] Bastings J, Titov I, Aziz W, et al. Graph Convolutional Encoders for Syntax-Aware Neural Machine Translation[OL]. arXiv Preprint, arXiv: 1704.04675.
[27] Yao L, Mao C S, Luo Y. Graph Convolutional Networks for Text Classification[C]// Proceedings of the 2019 AAAI Conference on Artificial Intelligence. 2019, 33(1): 7370-7377.
[28] Peng H, Li J X, He Y, et al. Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN[C]// Proceedings of the 2018 World Wide Web Conference. New York: ACM, 2018: 1063-1072.
[29] Qun N, Li X, Qiu X P, et al. End-to-End Neural Text Classification for Tibetan[C]// Proceedings of the 2017 International Symposium on Natural Language Processing Based on Naturally Annotated Big Data. 2017: 472-480.
[30] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv: 1301.3781.