ChatGPT Performance Evaluation on Chinese Language and Risk Measures
Zhang Huaping, Li Linhan, Li Chunjin
School of Computer Science, Beijing Institute of Technology, Beijing 100081, China
Abstract [Objective] This paper briefly introduces the main technical innovations behind ChatGPT, evaluates its Chinese-language performance on four tasks across nine datasets, analyzes the risks it poses, and proposes solutions. [Methods] ChatGPT and WeLM were compared on the ChnSentiCorp dataset, and ChatGPT and ERNIE 3.0 Titan on the EPRSTMT dataset; on sentiment analysis, ChatGPT did not differ much from the large domestic models. On text summarization, ChatGPT outperformed WeLM on both the LCSTS and TTNews datasets. CMRC2018 and DRCD were used for extractive machine reading comprehension (MRC) and the C3 dataset for common-sense MRC; ERNIE 3.0 Titan outperformed ChatGPT on these tasks. WebQA and CKBQA were used for Chinese closed-book question answering, where ChatGPT proved prone to factual errors and the domestic models outperformed it. [Results] ChatGPT performed well on classic natural language processing tasks, reaching an accuracy above 85% on sentiment analysis, but showed a higher probability of factual errors on closed-book question answering. [Limitations] Converting discriminative tasks into generative ones may introduce error into the evaluation scores. This paper evaluated ChatGPT only in the zero-shot setting, so its performance in other settings remains unclear. ChatGPT may be updated iteratively in subsequent releases, so the evaluation results may be time-sensitive. [Conclusions] ChatGPT is powerful but still has drawbacks; the development of large Chinese models should be guided by national strategy and attentive to the limitations of language models.
Received: 13 March 2023
Published: 16 March 2023
Fund: Natural Science Foundation of Beijing (4212026); Fundamental Strengthening Program Technology Field Fund (2021-JCJQ-JJ-0059)
Corresponding Author: Zhang Huaping, ORCID: 0000-0002-0137-4069, E-mail: kevinzhang@bit.edu.cn.
[1] Brown T, Mann B, Ryder N, et al. Language Models are Few-Shot Learners[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 33: 1877-1901.
[2] Thoppilan R, De Freitas D, Hall J, et al. LaMDA: Language Models for Dialog Applications[OL]. arXiv Preprint, arXiv:2201.08239.
[3] Wang S H, Sun Y, Xiang Y, et al. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[OL]. arXiv Preprint, arXiv:2112.12731.
[4] Zeng W, Ren X, Su T, et al. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-Parallel Computation[OL]. arXiv Preprint, arXiv:2104.12369.
[5] Su H, Zhou X, Yu H J, et al. WeLM: A Well-Read Pre-trained Language Model for Chinese[OL]. arXiv Preprint, arXiv:2209.10372.
[6] Kiela D, Bartolo M, Nie Y X, et al. Dynabench: Rethinking Benchmarking in NLP[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021: 4110-4124.
[7] Zhou J, Ke P, Qiu X P, et al. ChatGPT: Potential, Prospects, and Limitations[J]. Frontiers of Information Technology & Electronic Engineering. DOI: 10.1631/FITEE.2300089.
[8] van Dis E, Bollen J, Zuidema W, et al. ChatGPT: Five Priorities for Research[J]. Nature, 2023, 614(7947): 224-226. DOI: 10.1038/d41586-023-00288-7.
[9] Thorp H H. ChatGPT is Fun, but Not an Author[J]. Science, 2023, 379(6630): 313. DOI: 10.1126/science.adg7879. PMID: 36701446.
[10] Qin C W, Zhang A, Zhang Z S, et al. Is ChatGPT a General-Purpose Natural Language Processing Task Solver?[OL]. arXiv Preprint, arXiv:2302.06476.
[11] Bang Y, Cahyawijaya S, Lee N, et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity[OL]. arXiv Preprint, arXiv:2302.04023.
[12] Chen X T, Ye J J, Zu C, et al. How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks[OL]. arXiv Preprint, arXiv:2303.00293.
[13] Jiao W X, Wang W X, Huang J T, et al. Is ChatGPT a Good Translator? A Preliminary Study[OL]. arXiv Preprint, arXiv:2301.08745.
[14] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019: 4171-4186.
[15] Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-training[OL]. https://gwern.net/doc/www/s3-us-west-2.amazonaws.com/d73fdc5ffa8627bce44dcda2fc012da638ffb158.pdf.
[16] Radford A, Wu J, Child R, et al. Language Models are Unsupervised Multitask Learners[OL]. OpenAI Blog. https://gwern.net/doc/ai/nn/transformer/gpt/2019-radford.pdf.
[17] Elman J L. Finding Structure in Time[J]. Cognitive Science, 1990, 14(2): 179-211. DOI: 10.1207/s15516709cog1402_1.
[18] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780. DOI: 10.1162/neco.1997.9.8.1735. PMID: 9377276.
[19] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[20] Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code[OL]. arXiv Preprint, arXiv:2107.03374.
[21] Wei J, Bosma M, Zhao V Y, et al. Finetuned Language Models are Zero-Shot Learners[OL]. arXiv Preprint, arXiv:2109.01652.
[22] Zhang Y Z, Sun S Q, Galley M, et al. DialoGPT: Large-scale Generative Pre-training for Conversational Response Generation[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2020: 270-278.
[23] Nakano R, Hilton J, Balaji S, et al. WebGPT: Browser-assisted Question-Answering with Human Feedback[OL]. arXiv Preprint, arXiv:2112.09332.
[24] Ouyang L, Wu J, Jiang X, et al. Training Language Models to Follow Instructions with Human Feedback[OL]. arXiv Preprint, arXiv:2203.02155.
[25] Christiano P F, Leike J, Brown T, et al. Deep Reinforcement Learning from Human Preferences[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 4302-4310.
[26] Schulman J, Wolski F, Dhariwal P, et al. Proximal Policy Optimization Algorithms[OL]. arXiv Preprint, arXiv:1707.06347.
[27] Xu L, Lu X, Yuan C, et al. FewCLUE: A Chinese Few-Shot Learning Evaluation Benchmark[OL]. arXiv Preprint, arXiv:2107.07498.
[28] Hu B T, Chen Q C, Zhu F Z. LCSTS: A Large Scale Chinese Short Text Summarization Dataset[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015: 1967-1972.
[29] Hua L F, Wan X J, Li L. Overview of the NLPCC 2017 Shared Task: Single Document Summarization[C]// Proceedings of the National CCF Conference on Natural Language Processing and Chinese Computing. Springer International Publishing, 2018: 942-947.
[30] Cui Y M, Liu T, Che W C, et al. A Span-Extraction Dataset for Chinese Machine Reading Comprehension[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 5883-5889.
[31] Shao C C, Liu T, Lai Y, et al. DRCD: A Chinese Machine Reading Comprehension Dataset[OL]. arXiv Preprint, arXiv:1806.00920.
[32] Sun K, Yu D, Yu D, et al. Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 141-155. DOI: 10.1162/tacl_a_00305.
[33] Li P, Li W, He Z Y, et al. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering[OL]. arXiv Preprint, arXiv:1607.06275.
[34] Lin C Y. ROUGE: A Package for Automatic Evaluation of Summaries[C]// Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004). 2004.