Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (3): 16-25    DOI: 10.11925/infotech.2096-3467.2023.0214
ChatGPT Performance Evaluation on Chinese Language and Risk Measures
Zhang Huaping, Li Linhan, Li Chunjin
School of Computer Science, Beijing Institute of Technology, Beijing 100081, China
Abstract  

[Objective] This paper briefly introduces the main technical innovations of ChatGPT, evaluates its performance in Chinese on four tasks across nine datasets, analyzes the risks it brings, and proposes countermeasures. [Methods] For sentiment analysis, ChatGPT was compared with WeLM on the ChnSentiCorp dataset and with ERNIE 3.0 Titan on the EPRSTMT dataset; ChatGPT did not differ much from the large domestic models on this task. On the text-summarization datasets LCSTS and TTNews, ChatGPT outperformed WeLM on both. CMRC2018 and DRCD were used for extractive machine reading comprehension (MRC) and the C3 dataset for common-sense MRC; ERNIE 3.0 Titan outperformed ChatGPT on these tasks. WebQA and CKBQA were used for Chinese closed-book question answering, where ChatGPT proved prone to factual errors and the domestic models outperformed it. [Results] ChatGPT performed well on classic natural language processing tasks, reaching an accuracy above 85% on sentiment analysis, but made factual errors with higher probability on closed-book questions. [Limitations] Converting discriminative tasks into generative ones may introduce error into the evaluation scores. Only the zero-shot setting was evaluated, so performance under other settings remains unclear. ChatGPT may be updated iteratively in later releases, so the evaluation results may be time-sensitive. [Conclusions] ChatGPT is powerful but still has shortcomings; large Chinese language models should be developed under the guidance of national strategy and with attention to the limitations of language models.
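All four tasks were evaluated by prompting ChatGPT in the zero-shot setting and mapping its free-form replies back to task labels. The harness below is a minimal sketch of that protocol, not the paper's actual code: chat() is a hypothetical wrapper around the ChatGPT API, and the label-mapping rule is supplied by the caller (the real prompts used are shown in the example tables below).

def chat(prompt: str) -> str:
    """Hypothetical stand-in for an actual ChatGPT API call; returns the reply."""
    raise NotImplementedError

def zero_shot_accuracy(examples, template, label_of):
    """Score a discriminative task in the zero-shot generative setting.

    examples: (context, gold_label) pairs
    template: prompt with a {Context} placeholder
    label_of: maps a free-form reply back to a discrete label; this
              conversion step is the score-error source noted above
    """
    correct = sum(
        label_of(chat(template.format(Context=context))) == gold
        for context, gold in examples
    )
    return correct / len(examples)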

Key words: Language Model; ChatGPT; Artificial Intelligence
Received: 13 March 2023      Published: 16 March 2023
CLC Number: TP391
Fund: Natural Science Foundation of Beijing (4212026); Fundamental Strengthening Program Technology Field Fund (2021-JCJQ-JJ-0059)
Corresponding Author: Zhang Huaping, ORCID: 0000-0002-0137-4069, E-mail: kevinzhang@bit.edu.cn

Cite this article:

Zhang Huaping, Li Linhan, Li Chunjin. ChatGPT Performance Evaluation on Chinese Language and Risk Measures. Data Analysis and Knowledge Discovery, 2023, 7(3): 16-25.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0214     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I3/16

[Figure] The Decoding Process of Generative Language Models
[Figure] The Development Process of ChatGPT
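As background for the decoding figure: a generative language model produces text autoregressively, feeding each chosen token back as input for the next step. A minimal greedy-decoding sketch, assuming a PyTorch-style model that maps input IDs to next-token logits and a tokenizer with encode/decode and an eos_token_id (all hypothetical names):

import torch

def greedy_decode(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Token-by-token decoding: append the most probable next token each step."""
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))    # assumed shape [1, len(ids), vocab_size]
        next_id = int(logits[0, -1].argmax())  # greedy choice; sampling is also common
        if next_id == tokenizer.eos_token_id:  # stop at end-of-sequence
            break
        ids.append(next_id)
    return tokenizer.decode(ids)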
Dataset        Metric   WeLM    PanGu-α   ChatGPT   ERNIE 3.0 Titan
ChnSentiCorp   Acc/%    81.58   -         87.50     -
EPRSTMT        Acc/%    -       -         85.90     88.75
Evaluation Experiments on Sentiment Analysis Tasks
Test             Result
Prompt           Determine whether the sentiment of the following sentence is positive or negative: {Context}\n
Context          The hotel is in the middle of applying for a five-star rating, but at the moment even four stars seems a stretch. The lobby is grand, but the details are rough. The facilities need improvement. The service staff are polite and warm. The rooms and breakfast are both decent.
Label            1 (positive)
ChatGPT output   The first sentence is negative; the second sentence is negative; the third sentence is positive; the fourth sentence is positive.
An Input/Output Example of the Sentiment Analysis Task
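The reply above judges each sentence separately instead of giving a single overall polarity, so scoring requires converting the free-form reply (produced in Chinese) back to the 0/1 label. A simple keyword heuristic of the kind one might use (hypothetical, not necessarily the paper's rule) makes the risk concrete:

def sentiment_label(reply: str) -> int:
    """Map a free-form Chinese reply to a binary label (1 positive, 0 negative).

    A hypothetical conversion rule: count polarity keywords. Mixed,
    sentence-by-sentence replies like the one above tie or nearly tie,
    which is one way score error enters the evaluation.
    """
    pos = reply.count("积极")   # "positive"
    neg = reply.count("消极")   # "negative"
    return 1 if pos > neg else 0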
Dataset   Metric    WeLM    PanGu-α   ChatGPT   ERNIE 3.0 Titan
LCSTS     ROUGE-1   23.74   -         31.30     -
TTNews    ROUGE-1   35.06   -         42.92     -
Evaluation Experiments on Text Summarization Tasks
Test             Result
Prompt           Generate a headline for the following news item: {Context}\n
Context          The central bank will convene a meeting today with large commercial banks and joint-stock banks to deal with the current bond-market storm. Sources said that the central bank aims, on the one hand, to stabilize the interbank bond market and, on the other, will likely discuss reforms centered on the governance of Class-C accounts. The National Audit Office has played a crucial role in this bond-market storm.
Label            Media report: central bank convenes banks today to deal with the current bond-market storm
ChatGPT output   Central bank convenes meeting to deal with bond-market storm and discuss Class-C account reform
An Input/Output Example of the Text Summarization Task
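ROUGE-1 [34] scores a generated summary by its unigram overlap with the reference; for Chinese it is commonly computed over characters. A minimal recall-oriented sketch (character-level treatment is our assumption; the paper's exact tokenization may differ):

from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams (here, characters) matched in the
    candidate, with counts clipped so repeats are not over-credited."""
    cand = Counter(candidate)
    ref = Counter(reference)
    overlap = sum(min(count, cand[ch]) for ch, count in ref.items())
    return overlap / max(sum(ref.values()), 1)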
Dataset    Metric   WeLM      PanGu-α      ChatGPT      ERNIE 3.0 Titan
CMRC2018   EM/F1    -/31.31   1.46/19.28   0.86/49.45   16.62/44.20
DRCD       EM/F1    -/39.33   0.66/10.55   7.01/36.32   21.08/37.83
C3         Acc/%    54.30     54.47        85.14        87.59
Evaluation Experiments on MRC Tasks
Dataset           Test             Result
CMRC2018 / DRCD   Prompt           Read the passage: {Context}\nQ: {Question}\nA:
                  Context          Samurai Warriors 3 is the official third installment of the Samurai Warriors series, developed by Koei and ω-force. The game is built around three main storylines…
                  Question         Which two companies jointly developed Samurai Warriors 3?
                  Label            Koei and ω-force
                  ChatGPT output   Samurai Warriors 3 was jointly developed by Koei and ω-force.
C3                Prompt           Read the passage: {Context}\nQ: {Question}\nThe correct answer among the following options is:\n{Choices}
                  Context          Mei Lanfang was a famous Chinese Peking-opera performer. Once, during one of his performances…
                  Question         What kind of person was Mei Lanfang?
                  Choices          He was an elderly man\nHe was someone who liked watching Peking opera\nHe was a very modest person\nHe was a rude person\n
                  Label            He was a very modest person
                  ChatGPT output   Mei Lanfang was a very modest person.
Input/Output Examples of the MRC Tasks
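Extractive MRC is scored with EM (exact match after trimming whitespace) and F1 (overlap between prediction and gold answer; character-level comparison, the usual convention for Chinese MRC, is assumed here). The example above suggests why ChatGPT's EM is far lower than its F1 in the table: its full-sentence answer contains the gold span "Koei and ω-force" but does not match it exactly. A minimal sketch:

from collections import Counter

def em(pred: str, gold: str) -> int:
    """Exact match after trimming surrounding whitespace."""
    return int(pred.strip() == gold.strip())

def f1(pred: str, gold: str) -> float:
    """Character-overlap F1 between prediction and gold answer."""
    common = Counter(pred) & Counter(gold)   # clipped per-character counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)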
Dataset   Metric   WeLM      PanGu-α      ChatGPT     ERNIE 3.0 Titan
WebQA     EM/F1    -/50.90   5.13/14.47   0.10/0.34   37.97/52.57
CKBQA     Acc/%    -         14.21        16.47       24.12
Evaluation Experiments on Chinese Closed-Book Question Answering
Test             Result
Prompt           Q: {Question}\nA:
Question         Who was the first Chinese female basketball player to play in the WNBA in the United States?
Label            Zheng Haixia
ChatGPT output   The first Chinese female basketball player to play in the WNBA in the United States was Yao Ming.
An Input/Output Example of the Chinese Closed-Book Question Answering Task
[Figure] An Example of ChatGPT Confusing Character Allusions
Question         Help me look up a person: Dr. Zhang Huaping of Beijing Institute of Technology; give a personal introduction.
ChatGPT output   According to my search results, Dr. Zhang Huaping of North China University of Technology is an associate professor in that university's School of Energy and Environmental Science and Engineering. The following is his profile:
                 Dr. Zhang Huaping received his doctorate in engineering in 2009 from the University of Chinese Academy of Sciences (formerly the Graduate School of the Chinese Academy of Sciences). His current research covers the preparation and application of novel functional nanomaterials, energy catalysis materials, and materials for environmental pollution control. He has published a number of papers in leading international journals, including Chemical Engineering Journal, Applied Catalysis B: Environmental, Journal of Materials Chemistry A, and Journal of Power Sources.
Possible sources of the information:
                 Huang Yunwei (https://chem.zqu.edu.cn/info/1844/3479.htm): female, doctor of engineering, from Yongzhou, Hunan, born in October 1984; currently works on polymer functional materials and the preparation and application of novel nanomaterials…
                 Lei Ming (https://www.scuec.edu.cn/zhxy/info/1013/1008.htm): Ph.D.; graduated in 2011 from the School of Chemistry and Chemical Engineering, Huazhong University of Science and Technology… works on environmental catalysis materials and the analysis and control of environmental pollutants…
                 Cheng Xiuwen (http://geoscience.lzu.edu.cn/info/1078/3127.htm): … has published more than 90 research papers in international journals such as Applied Catalysis B: Environmental, Journal of Hazardous Materials, and Chemical Engineering Journal…
A Profile Fabricated by ChatGPT, with Possible Sources of the Information
[1] Brown T, Mann B, Ryder N, et al. Language Models are Few-Shot Learners[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 33: 1877-1901.
[2] Thoppilan R, De Freitas D, Hall J, et al. LaMDA: Language Models for Dialog Applications [OL]. arXiv Preprint, arXiv:2201.08239.
[3] Wang S H, Sun Y, Xiang Y, et al. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[OL]. arXiv Preprint, arXiv:2112.12731.
[4] Zeng W, Ren X, Su T, et al. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-Parallel Computation[OL]. arXiv Preprint, arXiv:2104.12369.
[5] Su H, Zhou X, Yu H J, et al. WeLM: A Well-Read Pre-trained Language Model for Chinese[OL]. arXiv Preprint, arXiv:2209.10372.
[6] Kiela D, Bartolo M, Nie Y X, et al. Dynabench: Rethinking Benchmarking in NLP[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies. 2021: 4110-4124.
[7] Zhou J, Ke P, Qiu X P, et al. ChatGPT: Potential, Prospects, and Limitations[J]. Frontiers of Information Technology & Electronic Engineering. DOI: 10.1631/FITEE.2300089.
[8] van Dis E, Bollen J, Zuidema W, et al. ChatGPT: Five Priorities for Research[J]. Nature, 2023, 614(7947): 224-226. DOI: 10.1038/d41586-023-00288-7.
[9] Thorp H H. ChatGPT is Fun, but Not an Author[J]. Science, 2023, 379(6630): 313. DOI: 10.1126/science.adg7879.
[10] Qin C W, Zhang A, Zhang Z S, et al. Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [OL]. arXiv Preprint, arXiv:2302.06476.
[11] Bang Y, Cahyawijaya S, Lee N, et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity[OL]. arXiv Preprint, arXiv:2302.04023.
[12] Chen X T, Ye J J, Zu C, et al. How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks[OL]. arXiv Preprint, arXiv:2303.00293.
[13] Jiao W X, Wang W X, Huang J T, et al. Is ChatGPT a Good Translator? A Preliminary Study[OL]. arXiv Preprint, arXiv:2301.08745.
[14] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, Volume 1 (Long and Short Papers). 2019: 4171-4186.
[15] Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-training[OL]. https://gwern.net/doc/www/s3-us-west-2.amazonaws.com/d73fdc5ffa8627bce44dcda2fc012da638ffb158.pdf.
[16] Radford A, Wu J, Child R, et al. Language Models are Unsupervised Multitask Learners[OL]. OpenAI Blog. https://gwern.net/doc/ai/nn/transformer/gpt/2019-radford.pdf.
[17] Elman J L. Finding Structure in Time[J]. Cognitive Science, 1990, 14(2): 179-211. DOI: 10.1207/s15516709cog1402_1.
[18] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780. DOI: 10.1162/neco.1997.9.8.1735.
[19] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[20] Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code[OL]. arXiv Preprint, arXiv:2107.03374.
[21] Wei J, Bosma M, Zhao V Y, et al. Finetuned Language Models are Zero-Shot Learners[OL]. arXiv Preprint, arXiv:2109.01652.
[22] Zhang Y Z, Sun S Q, Galley M, et al. DialoGPT: Large-scale Generative Pre-training for Conversational Response Generation[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics:System Demonstrations. 2020: 270-278.
[23] Nakano R, Hilton J, Balaji S, et al. WebGPT: Browser-assisted Question-Answering with Human Feedback[OL]. arXiv Preprint, arXiv:2112.09332.
[24] Ouyang L, Wu J, Jiang X, et al. Training Language Models to Follow Instructions with Human Feedback[OL]. arXiv Preprint, arXiv:2203.02155.
[25] Christiano P F, Leike J, Brown T, et al. Deep Reinforcement Learning from Human Preferences[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 4302-4310.
[26] Schulman J, Wolski F, Dhariwal P, et al. Proximal Policy Optimization Algorithms[OL]. arXiv Preprint, arXiv:1707.06347.
[27] Xu L, Lu X, Yuan C, et al. FewCLUE: A Chinese Few-Shot Learning Evaluation Benchmark[OL]. arXiv Preprint, arXiv:2107.07498.
[28] Hu B T, Chen Q C, Zhu F Z. LCSTS: A Large Scale Chinese Short Text Summarization Dataset[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015: 1967-1972.
[29] Hua L F, Wan X J, Li L. Overview of the NLPCC 2017 Shared Task: Single Document Summarization[C]// Proceedings of National CCF Conference on Natural Language Processing and Chinese Computing. Springer International Publishing, 2018: 942-947.
[30] Cui Y M, Liu T, Che W C, et al. A Span-Extraction Dataset for Chinese Machine Reading Comprehension[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 5883-5889.
[31] Shao C C, Liu T, Lai Y, et al. DRCD: A Chinese Machine Reading Comprehension Dataset[OL]. arXiv Preprint, arXiv:1806.00920.
[32] Sun K, Yu D, Yu D, et al. Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 141-155. DOI: 10.1162/tacl_a_00305.
[33] Li P, Li W, He Z Y, et al. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering[OL]. arXiv Preprint, arXiv:1607.06275.
[34] Lin C Y. ROUGE: A Package for Automatic Evaluation of Summaries[C]// Proceedings of the Workshop on Text Summarization Branches Out (WAS2004). 2004.