Tibetan News Text Classification Based on Graph Convolutional Networks
Xu Guixian, Zhang Zixin, Yu Shaona, Dong Yushuang, Tian Yuan
Information Engineering College, Minzu University of China, Beijing 100081, China

Abstract [Objective] To improve pre-training knowledge in Tibetan, this paper proposes a classification method for Tibetan news text based on the Graph Convolutional Network (GCN), using the relationships constructed between Tibetan syllables and documents. [Methods] First, we constructed a text graph of the Tibetan news corpus based on syllable-syllable and syllable-document relations. Then, we initialized the GCN with one-hot representations of syllables and documents and jointly learned syllable and document embeddings under the supervision of the document category labels in the training dataset. Finally, we transformed the text classification task into a node classification task. [Results] The GCN achieved an accuracy of 70.44% on Tibetan news body texts, 8.96%-20.66% higher than the baseline models, and 61.94% on Tibetan news titles, 6.61%-26.05% higher than the baseline models. In addition, the GCN was 0.73%-15.1% more accurate than SVM and CNN with pre-trained syllable embeddings and than the Chinese minority pre-trained language model CINO. On the Tibetan content text, it was 15.65% more accurate than Word2Vec+LSTM. [Limitations] The method still relies on labeled Tibetan datasets, which are relatively scarce. [Conclusions] Three comparative experiments demonstrate the effectiveness of GCNs for Tibetan news text classification. The method effectively addresses the problem of cluttered information in Tibetan news text and supports data mining of Tibetan news texts.
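
The pipeline described under [Methods] can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not state how edges are weighted, so raw co-occurrence counts (syllable-syllable) and raw term counts (syllable-document) are used here as stand-ins. The function names, the window size, the layer sizes, the placeholder syllable tokens, and the use of the standard renormalized adjacency from the GCN literature are all assumptions made for illustration. The weights are random and untrained, so only the data flow is shown.

```python
import numpy as np

def build_text_graph(docs, window=2):
    """Build a symmetric adjacency matrix over syllable and document nodes.

    Nodes 0..V-1 are syllables; nodes V..V+D-1 are documents.
    Edge weights are raw counts (an assumption; see the lead-in).
    """
    vocab = sorted({s for doc in docs for s in doc})
    index = {s: i for i, s in enumerate(vocab)}
    n_syll = len(vocab)
    n = n_syll + len(docs)
    adj = np.zeros((n, n))
    for d, doc in enumerate(docs):
        for pos, syll in enumerate(doc):
            i = index[syll]
            # Syllable-document edge: count of the syllable in the document.
            adj[i, n_syll + d] += 1
            adj[n_syll + d, i] += 1
            # Syllable-syllable edge: co-occurrence inside a sliding window.
            for other in doc[pos + 1 : pos + window]:
                j = index[other]
                if i != j:
                    adj[i, j] += 1
                    adj[j, i] += 1
    return adj, vocab

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} with self-loops added."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(a_norm, features, w1, w2):
    """Two-layer GCN forward pass returning per-node class probabilities."""
    hidden = np.maximum(a_norm @ features @ w1, 0.0)  # ReLU
    logits = a_norm @ hidden @ w2
    shifted = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy corpus: each document is a list of placeholder syllable tokens.
docs = [["bod", "skad", "gsar"], ["gsar", "gyur", "bod"]]
adj, vocab = build_text_graph(docs)
a_norm = normalize_adjacency(adj)

# One-hot node features: the identity matrix over all nodes.
features = np.eye(adj.shape[0])

rng = np.random.default_rng(0)
w1 = rng.normal(size=(adj.shape[0], 4))  # hidden size 4 (hypothetical)
w2 = rng.normal(size=(4, 3))             # 3 classes (hypothetical)
probs = gcn_forward(a_norm, features, w1, w2)

# Document nodes are the last len(docs) rows; their rows are the class predictions.
doc_probs = probs[len(vocab):]
```

In training, the loss would be computed only on the document nodes that carry category labels, which is what turns document classification into node classification on this graph.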

Received: 08 May 2022
Published: 09 November 2022

Fund: National Social Science Fund of China (19BGL241)
Corresponding Author: Xu Guixian, ORCID: 0000-0001-8815-479X, E-mail: guixian_xu@muc.edu.cn