[Objective] This paper briefly introduces the main technical innovations of ChatGPT, evaluates ChatGPT's Chinese-language performance on four tasks across nine datasets, analyzes the risks ChatGPT poses, and proposes responses. [Methods] ChatGPT and WeLM were tested on the ChnSentiCorp dataset, and ChatGPT and ERNIE 3.0 Titan on the EPRSTMT dataset; ChatGPT performed comparably to the large domestic models on sentiment analysis. The LCSTS and TTNews datasets were used to compare ChatGPT and WeLM on text summarization, and ChatGPT outperformed WeLM on both. CMRC2018 and DRCD were used for extractive machine reading comprehension (MRC), and the C3 dataset for common-sense MRC; ERNIE 3.0 Titan outperformed ChatGPT on these tasks. WebQA and CKBQA were used for Chinese closed-book question answering, where ChatGPT proved prone to factual errors and the domestic models outperformed it. [Results] ChatGPT performed well on classic natural language processing tasks, achieving over 85% accuracy on sentiment analysis, but showed a higher probability of factual errors on closed-book questions. [Limitations] Converting discriminative tasks into generative ones may introduce error into the evaluation scores. This paper evaluated ChatGPT only in the zero-shot setting, so its performance in other settings remains unclear. ChatGPT may be updated iteratively in subsequent releases, so the evaluation results may be time-sensitive. [Conclusions] ChatGPT is powerful but still has drawbacks; the development of large Chinese models should be guided by national strategy and remain mindful of the limitations of language models.
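The zero-shot protocol described above recasts a discriminative task (binary sentiment classification on ChnSentiCorp/EPRSTMT) as text generation: the model receives a bare prompt with no in-context examples, and its free-form reply is mapped back to a discrete label before accuracy is computed. A minimal sketch follows; the prompt wording, label keywords, and the use of canned replies in place of real model calls are illustrative assumptions, not the paper's exact setup.

```python
def build_prompt(review: str) -> str:
    """Zero-shot prompt: no in-context examples are provided."""
    return (
        "Is the sentiment of the following review positive or negative?\n"
        f"Review: {review}\nAnswer:"
    )

def parse_label(reply: str) -> str:
    """Map a free-form generated reply back to a discrete label.
    This mapping step is where evaluation-score error can creep in,
    as noted in the Limitations."""
    text = reply.lower()
    if "positive" in text and "negative" not in text:
        return "positive"
    if "negative" in text:
        return "negative"
    return "unknown"  # unparseable generations count as errors

def accuracy(predictions, gold):
    """Fraction of predictions matching the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy run with canned replies standing in for actual model output.
canned_replies = ["The sentiment is positive.", "Negative.", "Positive overall."]
gold_labels = ["positive", "negative", "negative"]
preds = [parse_label(r) for r in canned_replies]
print(accuracy(preds, gold_labels))  # 2 of 3 toy replies match the gold label
```

The keyword-matching step is deliberately conservative: a reply mentioning both labels, or neither, is scored as wrong rather than guessed, which is one plausible source of the score error the Limitations section mentions.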
张华平, 李林翰, 李春锦. ChatGPT中文性能测评与风险应对*[J]. 数据分析与知识发现, 2023, 7(3): 16-25.
Zhang Huaping, Li Linhan, Li Chunjin. ChatGPT Performance Evaluation on Chinese Language and Risk Measures. Data Analysis and Knowledge Discovery, 2023, 7(3): 16-25.
"According to my search results, Dr. Zhang Huaping of North China University of Technology is an associate professor in the university's School of Energy and Environmental Science and Engineering. Here is his profile: Dr. Zhang Huaping received his Ph.D. in engineering from the University of Chinese Academy of Sciences (formerly the Graduate School of the Chinese Academy of Sciences) in 2009. His research focuses on the preparation and application of novel functional nanomaterials, energy catalysis materials, and environmental pollution control materials. He has published papers in leading international journals including Chemical Engineering Journal, Applied Catalysis B: Environmental, Journal of Materials Chemistry A, and Journal of Power Sources."
Cheng Xiuwen③http://geoscience.lzu.edu.cn/info/1078/3127.htm. ……has published more than 90 research papers in international journals such as Applied Catalysis B: Environmental, Journal of Hazardous Materials, and Chemical Engineering Journal……
Table 9 Résumé Fabricated by ChatGPT and Its Possible Information Source
[1] Brown T, Mann B, Ryder N, et al. Language Models are Few-Shot Learners[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 33: 1877-1901.
[2] Thoppilan R, De Freitas D, Hall J, et al. LaMDA: Language Models for Dialog Applications[OL]. arXiv Preprint, arXiv:2201.08239.
[3] Wang S H, Sun Y, Xiang Y, et al. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[OL]. arXiv Preprint, arXiv:2112.12731.
[4] Zeng W, Ren X, Su T, et al. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-Parallel Computation[OL]. arXiv Preprint, arXiv:2104.12369.
[5] Su H, Zhou X, Yu H J, et al. WeLM: A Well-Read Pre-trained Language Model for Chinese[OL]. arXiv Preprint, arXiv:2209.10372.
[6] Kiela D, Bartolo M, Nie Y X, et al. Dynabench: Rethinking Benchmarking in NLP[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021: 4110-4124.
[7] Zhou J, Ke P, Qiu X P, et al. ChatGPT: Potential, Prospects, and Limitations[J]. Frontiers of Information Technology & Electronic Engineering. doi: 10.1631/FITEE.2300089.
[8] van Dis E, Bollen J, Zuidema W, et al. ChatGPT: Five Priorities for Research[J]. Nature, 2023, 614(7947): 224-226. doi: 10.1038/d41586-023-00288-7.
Qin C W, Zhang A, Zhang Z S, et al. Is ChatGPT a General-Purpose Natural Language Processing Task Solver?[OL]. arXiv Preprint, arXiv:2302.06476.
[11] Bang Y, Cahyawijaya S, Lee N, et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity[OL]. arXiv Preprint, arXiv:2302.04023.
[12] Chen X T, Ye J J, Zu C, et al. How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks[OL]. arXiv Preprint, arXiv:2303.00293.
[13] Jiao W X, Wang W X, Huang J T, et al. Is ChatGPT a Good Translator? A Preliminary Study[OL]. arXiv Preprint, arXiv:2301.08745.
[14] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019: 4171-4186.
[15] Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-training[OL]. https://gwern.net/doc/www/s3-us-west-2.amazonaws.com/d73fdc5ffa8627bce44dcda2fc012da638ffb158.pdf.
[16] Radford A, Wu J, Child R, et al. Language Models are Unsupervised Multitask Learners[OL]. OpenAI Blog. https://gwern.net/doc/ai/nn/transformer/gpt/2019-radford.pdf.
[17] Elman J L. Finding Structure in Time[J]. Cognitive Science, 1990, 14(2): 179-211. doi: 10.1207/s15516709cog1402_1.
[18] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735.
[19] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[20] Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code[OL]. arXiv Preprint, arXiv:2107.03374.
[21] Wei J, Bosma M, Zhao V Y, et al. Finetuned Language Models are Zero-Shot Learners[OL]. arXiv Preprint, arXiv:2109.01652.
[22] Zhang Y Z, Sun S Q, Galley M, et al. DialoGPT: Large-scale Generative Pre-training for Conversational Response Generation[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2020: 270-278.
[23] Nakano R, Hilton J, Balaji S, et al. WebGPT: Browser-assisted Question-Answering with Human Feedback[OL]. arXiv Preprint, arXiv:2112.09332.
[24] Ouyang L, Wu J, Jiang X, et al. Training Language Models to Follow Instructions with Human Feedback[OL]. arXiv Preprint, arXiv:2203.02155.
[25] Christiano P F, Leike J, Brown T, et al. Deep Reinforcement Learning from Human Preferences[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 4302-4310.
Xu L, Lu X, Yuan C, et al. FewCLUE: A Chinese Few-Shot Learning Evaluation Benchmark[OL]. arXiv Preprint, arXiv:2107.07498.
[28] Hu B T, Chen Q C, Zhu F Z. LCSTS: A Large Scale Chinese Short Text Summarization Dataset[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015: 1967-1972.
[29] Hua L F, Wan X J, Li L. Overview of the NLPCC 2017 Shared Task: Single Document Summarization[C]// Proceedings of the National CCF Conference on Natural Language Processing and Chinese Computing. Springer International Publishing, 2018: 942-947.
[30] Cui Y M, Liu T, Che W X, et al. A Span-Extraction Dataset for Chinese Machine Reading Comprehension[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 5883-5889.
[31] Shao C C, Liu T, Lai Y, et al. DRCD: A Chinese Machine Reading Comprehension Dataset[OL]. arXiv Preprint, arXiv:1806.00920.
[32] Sun K, Yu D, Yu D, et al. Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 141-155. doi: 10.1162/tacl_a_00305.
[33] Li P, Li W, He Z Y, et al. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering[OL]. arXiv Preprint, arXiv:1607.06275.
[34] Lin C Y. ROUGE: A Package for Automatic Evaluation of Summaries[C]// Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004). 2004.