Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue (3): 6-15     https://doi.org/10.11925/infotech.2096-3467.2023.0229
An Analysis on the Basic Technologies of ChatGPT
Qian Li1,2,3,Liu Yi1(),Zhang Zhixiong1,2,3,Li Xuesi1,2,Xie Jing1,2,Xu Qinya1,2,Li Yang1,2,Guan Zhengyi1,2,Li Xiyu1,2,Wen Sen1,2
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, Beijing 100190, China

Abstract

[Objective] This paper reviews and analyzes the corpora, algorithms, and models related to ChatGPT, providing a systematic reference for peer research. [Methods] We systematically reviewed the relevant literature and materials published since the release of GPT-3, depicted the overall architecture of ChatGPT technology, and explained and analyzed the models, algorithms, and principles behind it. [Results] Based on the available information, this paper reconstructs the technical details that support ChatGPT's functionality, outlines the overall technical architecture of ChatGPT, and explains each of its technical components. The algorithmic principles and model composition of each component are analyzed at three levels: the corpus system, the pre-training algorithms and models, and the fine-tuning algorithms and models. [Limitations] The literature survey inevitably has omissions, the interpretation of some technical content is not deep enough, and some content inferred by the authors may be incorrect. [Conclusions] The breakthrough in the application of ChatGPT technology is the result of continuous accumulation of corpora, models, and algorithms through iterative training, as well as the effective combination and integration of various algorithmic models.

Key words: ChatGPT; ChatGPT Technology; Generative Pre-Training Models; Artificial Intelligence
Received: 2023-03-17      Online: 2023-04-13
CLC number: TP18; G253
Fund: National Key R&D Program of China (2022YFF0711900)
Corresponding author: Liu Yi, ORCID: 0000-0002-7360-2091, E-mail: liuyi@mail.las.ac.cn
Cite this article:
Qian Li, Liu Yi, Zhang Zhixiong, Li Xuesi, Xie Jing, Xu Qinya, Li Yang, Guan Zhengyi, Li Xiyu, Wen Sen. An Analysis on the Basic Technologies of ChatGPT. Data Analysis and Knowledge Discovery, 2023, 7(3): 6-15.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0229      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I3/6
Fig.1  The overall technical architecture of ChatGPT
Model | Wikipedia | Books | Journals | Reddit links | CommonCrawl
GPT-1 | /    | 4.6 | /   | /  | /
GPT-2 | /    | /   | /   | 40 | /
GPT-3 | 11.4 | 21  | 101 | 50 | 570
Table 1  Base pre-training data of GPT-n (unit: GB)[10]
Stage    | Split          | Source  | Size
SFT Data | Training set   | Labeler | 11,295
SFT Data | Training set   | User    | 1,430
SFT Data | Validation set | Labeler | 1,550
SFT Data | Validation set | User    | 103
RM Data  | Training set   | Labeler | 6,623
RM Data  | Training set   | User    | 26,584
RM Data  | Validation set | Labeler | 3,488
RM Data  | Validation set | User    | 14,399
PPO Data | Training set   | User    | 31,144
PPO Data | Validation set | User    | 16,185
Table 2  Size and distribution of the dialogue fine-tuning corpora across fine-tuning stages (unit: number of tokens)[7]
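The three corpora above feed the three-stage fine-tuning pipeline of InstructGPT[7]: supervised fine-tuning (SFT), reward-model (RM) training, and PPO. As a minimal illustration of how Table 2 decomposes (the stage/split/source nesting is our own encoding for this sketch, not OpenAI's data format), the per-stage totals can be computed as:

```python
# Fine-tuning corpus sizes from Table 2 (unit: tokens, per [7]).
# Stages follow the InstructGPT pipeline: SFT -> RM -> PPO.
corpus = {
    "SFT": {"train": {"labeler": 11_295, "user": 1_430},
            "valid": {"labeler": 1_550,  "user": 103}},
    "RM":  {"train": {"labeler": 6_623,  "user": 26_584},
            "valid": {"labeler": 3_488,  "user": 14_399}},
    "PPO": {"train": {"user": 31_144},
            "valid": {"user": 16_185}},
}

def stage_total(stage: str) -> int:
    """Total corpus size for one stage, summed over splits and sources."""
    return sum(n for split in corpus[stage].values() for n in split.values())
```

Note that the RM and PPO stages draw mostly on user-submitted prompts, while the SFT stage is dominated by labeler-written demonstrations.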
Model   | Text similarity models     | Text search models                                       | Code search models
Ada     | text-similarity-ada-001    | text-search-ada-doc-001, text-search-ada-query-001       | code-search-ada-code-001, code-search-ada-text-001
Babbage | text-similarity-babbage-001| text-search-babbage-doc-001, text-search-babbage-query-001 | code-search-babbage-code-001, code-search-babbage-text-001
Curie   | text-similarity-curie-001  | text-search-curie-doc-001, text-search-curie-query-001   | \
Davinci | text-similarity-davinci-001| text-search-davinci-doc-001, text-search-davinci-query-001 | \
Table 3  The Embedding model series of GPT-3
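These Embedding models map text or code to dense vectors, and OpenAI's embeddings guide[16] recommends cosine similarity for comparing them. A minimal sketch of that comparison step (the helper function is ours for illustration, not part of the API; the vectors stand in for real model outputs such as those of text-similarity-ada-001):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity of two embedding vectors: a.b / (|a| * |b|).

    Identical directions score 1.0, orthogonal vectors 0.0,
    opposite directions -1.0.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In a search setting, a query is embedded with the `-query-` variant, documents with the `-doc-` variant, and documents are ranked by this score against the query vector.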
[1] OpenAI. ChatGPT: Optimizing Language Models for Dialogue[EB/OL]. [2023-03-12]. https://openai.com/blog/chatgpt/.
[2] Zhang Zhixiong, Qian Li, Xie Jing, et al. The Impact of ChatGPT on Scientific Research and Library & Information Service[R]. Beijing: National Science Library, Chinese Academy of Sciences, National Science and Technology Digital Library, 2023.
[3] Zhou C, Li Q, Li C, et al. A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT[OL]. arXiv Preprint, arXiv:2302.09419.
[4] Cao Y H, Li S Y, Liu Y X, et al. A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT[OL]. arXiv Preprint, arXiv:2303.04226.
[5] Brown T B, Mann B, Ryder N, et al. Language Models are Few-Shot Learners[OL]. arXiv Preprint, arXiv:2005.14165.
[6] Neelakantan A, Xu T, Puri R, et al. Text and Code Embeddings by Contrastive Pre-Training[OL]. arXiv Preprint, arXiv: 2201.10005.
[7] Ouyang L, Wu J, Jiang X, et al. Training Language Models to Follow Instructions with Human Feedback[OL]. arXiv Preprint, arXiv:2203.02155.
[8] Wang F Y, Miao Q, Li X, et al. What does ChatGPT Say: The DAO from Algorithmic Intelligence to Linguistic Intelligence[J]. IEEE/CAA Journal of Automatica Sinica, 2023, 10(3): 575-579.
doi: 10.1109/JAS.2023.123486
[9] Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code[OL]. arXiv Preprint, arXiv:2107.03374.
[10] Thompson A D. What’s in My AI?[EB/OL]. [2023-03-12]. https://lifearchitect.ai/whats-in-my-ai/.
[11] OpenAI. Aligning Language Models to Follow Instructions[EB/OL]. [2023-03-12]. https://openai.com/blog/instruction-following/.
[12] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[OL]. arXiv Preprint, arXiv:1706.03762v2.
[13] OpenAI. Models-GPT-3[EB/OL]. [2023-03-12]. https://platform.openai.com/docs/models/gpt-3.
[14] Neelakantan A, Weng L L, Power B, et al. Introducing Text and Code Embeddings[EB/OL]. [2023-03-12]. https://openai.com/blog/introducing-text-and-code-embeddings.
[15] van den Oord A, Li Y Z, Vinyals O. Representation Learning with Contrastive Predictive Coding[OL]. arXiv Preprint, arXiv:1807.03748.
[16] OpenAI. What are Embeddings?[EB/OL]. [2023-03-12]. https://platform.openai.com/docs/guides/embeddings/what-are-embeddings.
[17] OpenAI. Code Completion[EB/OL]. [2023-03-12]. https://platform.openai.com/docs/guides/code/.
[18] Lachaux M A, Roziere B, Chanussot L, et al. Unsupervised Translation of Programming Languages[OL]. arXiv Preprint, arXiv:2006.03511.
[19] Kulal S, Pasupat P, Chandra K, et al. SPoC: Search-based Pseudocode to Code[C]// Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). 2019.
[20] OpenAI. Models-Codex[EB/OL]. [2023-03-12]. https://platform.openai.com/docs/models/codex.
[21] OpenAI. API Guide Text Completion Inserting Text[EB/OL]. [2023-03-13]. https://platform.openai.com/docs/guides/completion/inserting-text.
[22] OpenAI. API Guide Text Completion Editing Text[EB/OL]. [2023-03-13].
[23] Thompson A D. GPT-3.5 + ChatGPT: An Illustrated Overview[EB/OL]. [2023-03-12]. https://lifearchitect.ai/chatgpt/.
[24] Fu Y. How does GPT Obtain Its Ability? Tracing Emergent Abilities of Language Models to Their Sources[EB/OL]. [2023-03-12]. https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1.
[25] Wei J, Bosma M, Zhao V Y, et al. Finetuned Language Models are Zero-Shot Learners [OL]. arXiv Preprint, arXiv:2109.01652.
[26] Schulman J, Klimov O, Wolski F, et al. Proximal Policy Optimization[EB/OL]. [2023-03-12]. https://openai.com/research/openai-baselines-ppo.
[27] Joyce J M. Kullback-Leibler Divergence[A]// International Encyclopedia of Statistical Science[M]. Springer, 2011: 720-722.
[28] Gao L, Schulman J, Hilton J. Scaling Laws for Reward Model Overoptimization[EB/OL]. [2023-03-12]. https://openai.com/research/scaling-laws-for-reward-model-overoptimization.
[29] OpenAI. Model Index for Researchers[EB/OL]. [2023-03-12]. https://platform.openai.com/docs/model-index-for-researchers.
[30] OpenAI. Models-GPT-3.5[EB/OL]. [2023-03-12]. https://platform.openai.com/docs/models/gpt-3-5.