Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue (6): 73-85     https://doi.org/10.11925/infotech.2096-3467.2022.0453
Research Article
Tibetan News Text Classification Based on Graph Convolutional Networks
Xu Guixian(),Zhang Zixin,Yu Shaona,Dong Yushuang,Tian Yuan
Information Engineering College, Minzu University of China, Beijing 100081, China
Abstract

[Objective] To address the scarcity of pre-training knowledge for Tibetan, this paper proposes a classification method for Tibetan news text based on Graph Convolutional Networks (GCN) that exploits the constructive relationship between Tibetan syllables and documents. [Methods] First, we built a text graph for the Tibetan news corpus from syllable-syllable and syllable-document relations. Then, we initialized the node features with one-hot representations of syllables and documents, and jointly learned syllable and document embeddings with a GCN under the supervision of the document category labels in the training set. Finally, we cast the text classification task as node classification. [Results] The GCN achieves an accuracy of 70.44% on the classification of Tibetan news body texts, 8.96 to 20.66 percentage points higher than the baseline models, and 61.94% on Tibetan news titles, 6.61 to 26.05 percentage points higher than the baselines. It also outperforms SVM and CNN with pre-trained syllable embeddings and the Chinese minority pre-trained language model CINO by 0.73 to 15.10 percentage points in accuracy, and exceeds Word2Vec+LSTM by 15.65 percentage points on body texts. [Limitations] The method still relies on labeled datasets, and supervised Tibetan text is relatively scarce. [Conclusions] Three groups of comparative experiments demonstrate the effectiveness of GCN for Tibetan news text classification. The method effectively addresses the cluttered information in Tibetan news texts and supports mining Tibetan news data across categories.
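The graph construction described in [Methods] appears to follow the TextGCN recipe: document-syllable edges weighted by TF-IDF and syllable-syllable edges weighted by positive PMI over sliding co-occurrence windows. The sketch below illustrates that construction on an invented toy corpus; the Latin placeholders stand in for Tibetan syllables (which would be obtained by splitting on the tsheg mark), and the window size and weighting details are assumptions, not the paper's exact settings.

```python
import math
from collections import Counter

import numpy as np

# Toy corpus: documents as syllable lists (Latin placeholders stand in for
# Tibetan syllables, which would be obtained by splitting on the tsheg mark).
docs = [["ka", "kha", "ga"], ["ka", "ga", "nga"], ["kha", "nga", "ca"]]
window = 2  # sliding-window size for co-occurrence counting (an assumption)

syllables = sorted({s for d in docs for s in d})
n_docs = len(docs)
idx = {s: n_docs + i for i, s in enumerate(syllables)}  # docs occupy rows 0..n_docs-1
n = n_docs + len(syllables)
A = np.eye(n)  # adjacency matrix with self-loops on every node

# Document-syllable edges weighted by TF-IDF.
df = Counter(s for d in docs for s in set(d))
for di, d in enumerate(docs):
    for s, f in Counter(d).items():
        w = (f / len(d)) * math.log(n_docs / df[s])
        A[di, idx[s]] = A[idx[s], di] = w

# Syllable-syllable edges weighted by positive PMI over sliding windows.
windows = [d[i:i + window] for d in docs for i in range(len(d) - window + 1)]
occur, pair = Counter(), Counter()
for w_ in windows:
    uniq = sorted(set(w_))
    occur.update(uniq)
    pair.update((a, b) for i, a in enumerate(uniq) for b in uniq[i + 1:])
W = len(windows)
for (a, b), c in pair.items():
    pmi = math.log((c / W) / ((occur[a] / W) * (occur[b] / W)))
    if pmi > 0:  # keep only positively associated syllable pairs
        A[idx[a], idx[b]] = A[idx[b], idx[a]] = pmi
```

The resulting symmetric matrix (document nodes first, syllable nodes after) is the single graph on which both document and syllable embeddings are learned jointly.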

Keywords: Graph Convolutional Networks; Tibetan News Text Classification; Text Graph; Node Classification
Received: 2022-05-08      Published: 2022-11-09
CLC: TP391; G35
Funding: National Social Science Fund of China (19BGL241)
Corresponding author: Xu Guixian, ORCID: 0000-0001-8815-479X, E-mail: guixian_xu@muc.edu.cn
Cite this article:
Xu Guixian, Zhang Zixin, Yu Shaona, Dong Yushuang, Tian Yuan. Tibetan News Text Classification Based on Graph Convolutional Networks. Data Analysis and Knowledge Discovery, 2023, 7(6): 73-85.
Article link:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0453      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I6/73
Fig.1  Structure of the graph convolutional network model
Fig.2  Illustration of convolution on the text graph
Fig.3  Graph construction for Tibetan characters
Fig.4  Overall framework of the model
Fig.5  Examples of Tibetan title texts
Fig.6  Construction of the adjacency matrix
Fig.7  Update process of node feature representations
Fig.8  Illustration of node classification
Fig.9  Category statistics of the dataset
Fig.10  Accuracy comparison under 10-fold cross-validation
Fig.11  F1 comparison under 10-fold cross-validation
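The node-feature update process (Fig.7) follows the standard two-layer GCN propagation rule of Kipf and Welling with symmetric normalization, applied to one-hot node features as described in the abstract. The sketch below uses invented dimensions and a single illustrative edge; it is not the paper's exact configuration.

```python
import numpy as np

def norm_adj(A):
    """Symmetrically normalized adjacency: D^(-1/2) A D^(-1/2)."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, hidden, n_classes = 8, 16, 12   # 12 news categories, as in Tables 3-6
A = np.eye(n)                      # self-loops on all document/syllable nodes
A[0, 3] = A[3, 0] = 1.0            # one illustrative document-syllable edge
X = np.eye(n)                      # one-hot features for doc and syllable nodes
W1 = 0.1 * rng.normal(size=(n, hidden))
W2 = 0.1 * rng.normal(size=(hidden, n_classes))

A_hat = norm_adj(A)
H = np.maximum(A_hat @ X @ W1, 0.0)  # layer 1: ReLU(A_hat X W1), joint embeddings
Z = softmax(A_hat @ H @ W2)          # layer 2: class distribution for every node
```

Training would minimize cross-entropy on the labeled document nodes only; syllable nodes and test-document nodes still receive updated representations through the shared adjacency, which is what turns text classification into node classification.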
| Model | Acc (Body) | Prec (Body) | Rec (Body) | F1 (Body) | Acc (Title) | Prec (Title) | Rec (Title) | F1 (Title) |
|---|---|---|---|---|---|---|---|---|
| MultinomialNB | 49.78 | 53.86 | 34.48 | 35.05 | 35.89 | 39.31 | 22.97 | 24.73 |
| KNN | 56.83 | 55.36 | 48.29 | 50.03 | 37.27 | 34.08 | 33.95 | 33.41 |
| SVM | 60.78 | 61.16 | 53.48 | 55.68 | 41.82 | 41.38 | 30.32 | 31.57 |
| Transformer | 29.68 | 19.87 | 17.16 | 13.26 | 31.16 | 12.01 | 17.31 | 12.43 |
| MLP | 61.48 | 55.99 | 55.13 | 54.86 | 44.70 | 38.17 | 35.72 | 35.41 |
| CNN | 60.33 | 53.06 | 54.40 | 51.91 | 53.20 | 44.80 | 46.17 | 44.08 |
| CNN-rand | 61.33 | 62.51 | 53.18 | 54.25 | 55.28 | 56.23 | 48.01 | 49.72 |
| LSTM | 52.10 | 44.10 | 43.81 | 42.71 | 55.33 | 48.28 | 48.89 | 47.45 |
| GCN | 70.44 | 70.68 | 66.79 | 67.98 | 61.94 | 61.24 | 56.91 | 58.39 |
Table 1  Classification results of each model (%)
| Model | Acc (Body) | Prec (Body) | Rec (Body) | F1 (Body) | Acc (Title) | Prec (Title) | Rec (Title) | F1 (Title) |
|---|---|---|---|---|---|---|---|---|
| Word2Vec+SVM | 69.71 | 67.75 | 67.59 | 67.45 | 46.84 | 45.70 | 32.00 | 32.19 |
| Word2Vec+CNN | 61.51 | 59.39 | 56.65 | 57.34 | 54.42 | 49.22 | 48.34 | 48.64 |
| Word2Vec+LSTM | 54.79 | 52.63 | 48.62 | 49.59 | 62.65 | 58.33 | 56.43 | 56.99 |
| CINO | 61.82 | 51.97 | 50.53 | 48.86 | 59.64 | 50.84 | 50.97 | 49.17 |
| GCN | 70.44 | 70.68 | 66.79 | 67.98 | 61.94 | 61.24 | 56.91 | 58.39 |
Table 2  Comparison with algorithms incorporating pre-trained knowledge (%)
| Category | SVM (Body) | SVM (Title) | CNN-rand (Body) | CNN-rand (Title) | GCN (Body) | GCN (Title) |
|---|---|---|---|---|---|---|
| Politics | 65.18 | 52.97 | 61.17 | 58.21 | 72.95 | 54.82 |
| Customs | 40.00 | 12.90 | 13.79 | 42.86 | 58.54 | 64.12 |
| Language | 47.37 | 7.14 | 41.03 | 54.55 | 63.41 | 60.00 |
| Environment | 68.16 | 51.28 | 67.02 | 65.43 | 78.31 | 55.56 |
| Religion | 56.58 | 32.52 | 59.84 | 46.62 | 67.67 | 57.66 |
| Arts | 42.00 | 28.57 | 41.82 | 41.82 | 52.00 | 57.35 |
| Instruments | 98.20 | 62.24 | 100.00 | 74.68 | 97.11 | 65.18 |
| Education | 56.19 | 39.62 | 56.35 | 51.56 | 68.17 | 74.70 |
| Economics | 53.14 | 31.28 | 47.41 | 40.91 | 64.58 | 60.00 |
| Medicine | 61.90 | 31.58 | 74.42 | 68.63 | 78.79 | 46.15 |
| Tourism | 47.31 | 12.31 | 44.44 | 43.40 | 64.76 | 70.05 |
| Literature | 41.86 | 10.26 | 38.30 | 12.50 | 61.54 | 52.17 |
Table 3  Per-category F1 results of each model (%)
| Category | SVM (Body) | SVM (Title) | CNN-rand (Body) | CNN-rand (Title) | GCN (Body) | GCN (Title) |
|---|---|---|---|---|---|---|
| Politics | 68.54 | 69.01 | 71.36 | 76.53 | 70.89 | 55.10 |
| Customs | 29.63 | 7.41 | 7.41 | 33.33 | 44.44 | 61.31 |
| Language | 36.00 | 4.00 | 32.00 | 48.00 | 52.00 | 51.92 |
| Environment | 64.21 | 42.11 | 67.37 | 55.79 | 77.89 | 60.00 |
| Religion | 64.18 | 29.85 | 56.72 | 46.27 | 67.16 | 62.75 |
| Arts | 42.00 | 24.00 | 46.00 | 46.00 | 52.00 | 58.21 |
| Instruments | 97.62 | 72.62 | 100.00 | 70.24 | 100.00 | 68.54 |
| Education | 61.31 | 45.99 | 51.82 | 42.34 | 77.37 | 73.81 |
| Economics | 56.12 | 28.57 | 56.12 | 36.73 | 63.27 | 60.00 |
| Medicine | 50.00 | 23.08 | 61.54 | 67.31 | 75.00 | 34.62 |
| Tourism | 43.14 | 7.84 | 31.37 | 45.10 | 66.67 | 72.63 |
| Literature | 34.62 | 7.69 | 34.62 | 7.69 | 61.54 | 44.44 |
Table 4  Per-category recall results of each model (%)
| Category | SVM (Body) | SVM (Title) | CNN-rand (Body) | CNN-rand (Title) | GCN (Body) | GCN (Title) |
|---|---|---|---|---|---|---|
| Politics | 62.13 | 42.98 | 53.52 | 46.97 | 75.12 | 54.55 |
| Customs | 61.54 | 50.00 | 100.00 | 60.00 | 85.71 | 67.20 |
| Language | 69.23 | 33.33 | 57.14 | 63.16 | 81.25 | 71.05 |
| Environment | 72.62 | 65.57 | 66.67 | 79.10 | 78.72 | 51.72 |
| Religion | 50.59 | 35.71 | 63.33 | 46.97 | 68.18 | 53.33 |
| Arts | 42.00 | 35.29 | 38.33 | 38.33 | 52.00 | 56.52 |
| Instruments | 98.80 | 54.46 | 100.00 | 79.73 | 94.38 | 62.13 |
| Education | 51.85 | 34.81 | 61.74 | 65.91 | 60.92 | 75.61 |
| Economics | 50.46 | 34.57 | 41.04 | 46.15 | 65.96 | 60.00 |
| Medicine | 81.25 | 50.00 | 94.12 | 70.00 | 82.98 | 69.23 |
| Tourism | 52.38 | 28.57 | 76.19 | 41.82 | 62.96 | 67.65 |
| Literature | 52.94 | 15.38 | 42.86 | 33.33 | 61.54 | 63.16 |
Table 5  Per-category precision results of each model (%)
| Category | Prec (Body) | Rec (Body) | F1 (Body) | Prec (Title) | Rec (Title) | F1 (Title) |
|---|---|---|---|---|---|---|
| Politics | 69.74 | 74.83 | 71.97 | 61.62 | 63.84 | 62.59 |
| Customs | 74.76 | 49.63 | 58.44 | 60.63 | 45.34 | 51.37 |
| Language | 76.30 | 51.97 | 61.73 | 73.70 | 51.35 | 60.16 |
| Environment | 76.40 | 75.12 | 75.73 | 60.98 | 62.22 | 61.47 |
| Religion | 65.58 | 66.44 | 65.85 | 58.72 | 59.95 | 59.13 |
| Arts | 56.32 | 54.50 | 55.21 | 51.38 | 46.93 | 48.58 |
| Instruments | 94.57 | 99.29 | 96.86 | 67.54 | 69.91 | 68.56 |
| Education | 65.59 | 69.39 | 67.20 | 65.73 | 70.58 | 67.85 |
| Economics | 64.94 | 55.70 | 59.46 | 54.25 | 55.33 | 54.68 |
| Medicine | 83.07 | 76.84 | 79.75 | 68.10 | 51.35 | 57.95 |
| Tourism | 60.29 | 66.28 | 63.08 | 61.27 | 57.97 | 59.20 |
| Literature | 60.58 | 61.47 | 60.56 | 52.38 | 47.96 | 49.53 |
Table 6  Detailed per-category results of the GCN model (%)