Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (3): 6-15    DOI: 10.11925/infotech.2096-3467.2023.0229
An Analysis on the Basic Technologies of ChatGPT
Qian Li1,2,3,Liu Yi1(),Zhang Zhixiong1,2,3,Li Xuesi1,2,Xie Jing1,2,Xu Qinya1,2,Li Yang1,2,Guan Zhengyi1,2,Li Xiyu1,2,Wen Sen1,2
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, Beijing 100190, China
Abstract  

[Objective] This paper reviews and analyzes the corpora, algorithms, and models behind ChatGPT, providing a systematic reference for peer research. [Methods] We systematically reviewed the relevant literature and materials released since GPT-3, depicted the overall architecture of ChatGPT technology, and explained and analyzed the models, algorithms, and principles behind it. [Results] Based on the limited information available, this paper reconstructs the technical details that support ChatGPT's functionality, rationalizes its overall technical architecture diagram, and explains each of its technical components. The algorithmic principles and model composition of each component are analyzed at three levels: the corpus system, the pre-training algorithms and models, and the fine-tuning algorithms and models. [Limitations] The survey of ChatGPT-related literature inevitably has omissions, the interpretation of some technical content is not deep enough, and some of the authors' inferences may be incorrect. [Conclusions] ChatGPT's breakthrough in application is the result of continuous accumulation through iterative training of corpora, models, and algorithms, as well as the effective combination and integration of multiple algorithmic models.

Key words: ChatGPT; ChatGPT Technology; Generative Pre-Training Models; Artificial Intelligence
Received: 17 March 2023      Published: 13 April 2023
CLC Number: TP18; G253
Fund: National Key R&D Program of China (2022YFF0711900)
Corresponding Author: Liu Yi, ORCID: 0000-0002-7360-2091, E-mail: liuyi@mail.las.ac.cn

Cite this article:

Qian Li, Liu Yi, Zhang Zhixiong, Li Xuesi, Xie Jing, Xu Qinya, Li Yang, Guan Zhengyi, Li Xiyu, Wen Sen. An Analysis on the Basic Technologies of ChatGPT. Data Analysis and Knowledge Discovery, 2023, 7(3): 6-15.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0229     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I3/6

The Structure of ChatGPT Technologies
Model   Wikipedia   Books   Journals   Reddit Links   Common Crawl
GPT-1   /           4.6     /          /              /
GPT-2   /           /       /          40             /
GPT-3   11.4        21      101        50             570
The Basic Pre-Training Data for GPT-n (Unit: GB; "/" = not used)
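The growth of the pre-training corpus across GPT generations can be tallied directly from the table above; the following is a minimal sketch that transcribes the table and sums each model's total (treating "/" entries as zero):

```python
# Pre-training corpus sizes for GPT-n, in GB, transcribed from the table above.
corpus_gb = {
    "GPT-1": {"Books": 4.6},
    "GPT-2": {"Reddit links": 40},
    "GPT-3": {"Wikipedia": 11.4, "Books": 21, "Journals": 101,
              "Reddit links": 50, "Common Crawl": 570},
}

# Total raw corpus per model: the jump from GPT-2 to GPT-3 is more than tenfold.
totals = {model: sum(parts.values()) for model, parts in corpus_gb.items()}
for model, total in totals.items():
    print(f"{model}: {total:.1f} GB")
```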
Stage      Split            Source     Count
SFT Data   Training set     Labelers   11,295
SFT Data   Training set     Users       1,430
SFT Data   Validation set   Labelers    1,550
SFT Data   Validation set   Users         103
RM Data    Training set     Labelers    6,623
RM Data    Training set     Users      26,584
RM Data    Validation set   Labelers    3,488
RM Data    Validation set   Users      14,399
PPO Data   Training set     Users      31,144
PPO Data   Validation set   Users      16,185
The Distribution of the Conversational Fine-Tuning Corpus at Each Fine-Tuning Stage (Unit: Number of Prompts)
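The RM stage above trains a reward model on human preference comparisons. A minimal sketch of the pairwise ranking loss used for this, following Ouyang et al.: for a prompt with a preferred response and a rejected response, minimize -log(sigmoid(r_chosen - r_rejected)). The scalar scores below are toy stand-ins for reward-model outputs, not real values:

```python
import math

def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(margin)) over the score margin."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model ranks the preferred response higher,
# and grows when it prefers the rejected one:
print(pairwise_rm_loss(2.0, 0.5))   # positive margin, small loss
print(pairwise_rm_loss(0.5, 2.0))   # negative margin, large loss
```

The trained reward model's scalar score is then used as the reward signal in the PPO stage.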
Model     Text Similarity Models        Text Search Models              Code Search Models
Ada       text-similarity-ada-001       text-search-ada-doc-001         code-search-ada-code-001
                                        text-search-ada-query-001       code-search-ada-text-001
Babbage   text-similarity-babbage-001   text-search-babbage-doc-001     code-search-babbage-code-001
                                        text-search-babbage-query-001   code-search-babbage-text-001
Curie     text-similarity-curie-001     text-search-curie-doc-001       \
                                        text-search-curie-query-001
Davinci   text-similarity-davinci-001   text-search-davinci-doc-001     \
                                        text-search-davinci-query-001
The GPT-3 Embedding Series Models ("\" = not available)
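The paired -doc and -query search models above produce vector embeddings that are compared by cosine similarity, the distance metric OpenAI's embeddings guide recommends. A minimal sketch of that comparison, using short toy vectors rather than real model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a *-query-001 embedding and *-doc-001 embeddings.
query_vec = [0.1, 0.3, 0.5]
doc_vecs = {
    "doc_a": [0.1, 0.29, 0.52],   # nearly parallel to the query
    "doc_b": [0.9, -0.2, 0.1],
}

# Retrieval picks the document whose embedding is most similar to the query's.
best = max(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]))
print(best)
```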