Deep Dive into LLMs like ChatGPT

章节 1:引言与预训练数据管道 (Introduction and Pre-training Data Pipeline)

📝 本节摘要

在本章中,演讲者Andrej Karpathy首先介绍了视频的主旨:向普通受众全面解析ChatGPT等大语言模型背后的工作原理。核心内容聚焦于模型构建的第一阶段——预训练(Pre-training)。该阶段始于从互联网下载海量数据(如Common Crawl),并经过一系列严格的管道处理:URL过滤、文本提取、语言筛选及隐私信息(PII)移除,最终形成如FineWeb这样的高质量数据集。随后,讲解了计算机如何通过分词(Tokenization)技术(具体为字节对编码BPE),将原始文本转化为神经网络能够处理的整数序列(Tokens)。

[原文] [Andrej Karpathy]: Hi everyone. So I've wanted to make this video for a while. It is a comprehensive but general-audience introduction to large language models like ChatGPT, and what I'm hoping to achieve in this video is to give you kind of mental models for thinking through what it is that this tool is. It is obviously magical and amazing in some respects; it's really good at some things, not very good at other things, and there are also a lot of sharp edges to be aware of. So what is behind this text box? You can put anything in there and press enter, but what should we be putting there? And what are these words generated back? How does this work, and what are you talking to exactly? So I'm hoping to get at all those topics in this video. We're going to go through the entire pipeline of how this stuff is built, but I'm going to keep everything accessible to a general audience. So let's take a look first at how you build something like ChatGPT, and along the way I'm going to talk about some of the cognitive, psychological implications of these tools.

[译文] [Andrej Karpathy]: 大家好,我想做这个视频已经有一段时间了,这是一个面向普通受众的、关于像ChatGPT这样的大型语言模型(Large Language Models)的全面介绍。我希望通过这个视频实现的是,给你们提供一种思维模型,用来思考这个工具到底是什么。这显然在某些方面是神奇和令人惊叹的,它在某些事情上非常擅长,但在另一些事情上却不太擅长,而且还有很多需要注意的“锋利边缘”(即缺陷或风险)。那么这个文本框背后是什么呢?你可以输入任何内容并按回车,但我们应该在那里输入什么?这些生成的回复文字是什么?它是如何工作的?你到底在和什么对话?所以我希望在这个视频中探讨所有这些话题。我们将完整地过一遍这些东西是如何构建的整个管道(pipeline),但我会保持所有内容对普通观众来说都是通俗易懂的。所以让我们首先看看你是如何构建像ChatGPT这样的东西的,在此过程中,我会谈论一些关于这些工具的认知心理学层面的影响。

[原文] [Andrej Karpathy]: Okay, so let's build ChatGPT. There are going to be multiple stages arranged sequentially. The first stage is called the pre-training stage, and the first step of the pre-training stage is to download and process the internet. Now, to get a sense of what this roughly looks like, I recommend looking at this URL here. This company called Hugging Face collected, created, and curated this dataset called FineWeb, and they go into a lot of detail in this blog post on how they constructed the FineWeb dataset. All of the major LLM providers, like OpenAI, Anthropic, Google, and so on, will have some internal equivalent of something like the FineWeb dataset.

[译文] [Andrej Karpathy]: 好的,那我们来构建ChatGPT。这包含按顺序排列的多个阶段,第一个阶段被称为预训练阶段(Pre-training stage)。预训练阶段的第一步是下载并处理互联网数据。为了让你大致了解这看起来是什么样子,我建议看看这个网址。这家叫Hugging Face的公司收集、创建并整理了一个名为FineWeb的数据集。他们在这篇博文中详细介绍了他们是如何构建FineWeb数据集的。所有主要的大语言模型提供商,如OpenAI、Anthropic和Google等,内部都会有某种类似于FineWeb数据集的等价物。

[原文] [Andrej Karpathy]: So roughly, what are we trying to achieve here? We're trying to get a ton of text from the internet from publicly available sources. We're trying to have a huge quantity of very high-quality documents, and we also want a very large diversity of documents, because we want to have a lot of knowledge inside these models. So we want a large diversity of high-quality documents, and we want many, many of them. Achieving this is quite complicated and, as you can see here, takes multiple stages to do well. So let's take a look at what some of these stages look like in a bit. For now, I'd just like to note that, for example, the FineWeb dataset, which is fairly representative of what you would see in a production-grade application, actually ends up being only about 44 terabytes of disk space. You can get a USB stick for like a terabyte very easily, or I think this could almost fit on a single hard drive today. So this is not a huge amount of data at the end of the day, even though the internet is very, very large: we're working with text, and we're also filtering it aggressively, so we end up with about 44 terabytes in this example.

[译文] [Andrej Karpathy]: 那么我们在这里大致想要实现什么呢?我们试图从互联网的公开来源获取大量的文本。所以我们试图拥有海量的高质量文档,我们也希望文档具有非常大的多样性,因为我们希望这些模型内部拥有大量的知识。所以我们需要具有高度多样性的高质量文档,而且我们需要非常多这样的文档。要实现这一点相当复杂,正如你在这里看到的,这需要多个阶段才能做好。稍后我们来看看其中一些阶段是什么样的。现在我只想指出,例如FineWeb数据集——这在生产级应用中具有相当的代表性——实际上最终只占用了大约44TB的磁盘空间。你可以很容易买到一个1TB的USB盘,或者我认为这几乎可以装进今天的单个硬盘里。所以归根结底,这并不是一个巨大的数据量,尽管互联网非常非常大,但我们处理的是文本,而且我们还在积极地过滤它,所以在这个例子中我们最终得到了大约44TB的数据。

[原文] [Andrej Karpathy]: So let's take a look at what this data looks like and what some of these stages are. The starting point for a lot of these efforts, and something that contributes most of the data by the end of it, is data from Common Crawl. Common Crawl is an organization that has been basically scouring the internet since 2007. As of 2024, for example, Common Crawl has indexed 2.7 billion web pages. They have all these crawlers going around the internet, and what you end up doing, basically, is you start with a few seed web pages, and then you follow all the links, and you just keep following links and indexing all the information, and you end up with a ton of data of the internet over time. So this is usually the starting point for a lot of these efforts.

[译文] [Andrej Karpathy]: 那么让我们来看看这些数据是什么样的,以及其中一些阶段是什么。许多这类工作的起点,也是最终贡献大部分数据来源的,是来自Common Crawl的数据。Common Crawl是一个组织,基本上从2007年开始就在搜寻互联网。截至2024年,Common Crawl已经索引了27亿个网页。他们有这些爬虫程序(crawlers)在互联网上到处跑。你基本上要做的就是从一些种子网页开始,然后跟随所有的链接,不断地跟随链接,不断地索引所有的信息,随着时间的推移,你就得到了大量的互联网数据。这通常是许多此类工作的起点。
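The seed-and-follow-links process just described can be sketched as a tiny breadth-first crawler. This is a toy illustration, not Common Crawl's actual implementation; `fetch_links` and the link graph here are hypothetical stand-ins for downloading a page and extracting its outgoing links.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages):
    """Breadth-first crawl: start from seeds, follow links, index pages.

    `fetch_links(url)` is a stand-in for downloading a page and
    extracting its links; a real crawler would also store the page
    content, respect robots.txt, rate-limit requests, etc.
    """
    seen = set(seed_urls)
    queue = deque(seed_urls)
    indexed = []
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        indexed.append(url)          # "index" the page
        for link in fetch_links(url):
            if link not in seen:     # only enqueue pages we haven't seen
                seen.add(link)
                queue.append(link)
    return indexed

# Toy link graph standing in for the web.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
pages = crawl(["a"], lambda u: graph.get(u, []), max_pages=10)
```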

[原文] [Andrej Karpathy]: Now this Common Crawl data is quite raw and is filtered in many, many different ways. Here (this is the same diagram) they document a little bit the kind of processing that happens in these stages. The first thing here is something called URL filtering. What that is referring to is that there are these blocklists of URLs, or domains, that you don't want to be getting data from. Usually this includes things like malware websites, spam websites, marketing websites, racist websites, adult sites, and things like that. So there's a ton of different types of websites that are just eliminated at this stage, because we don't want them in our dataset.

[译文] [Andrej Karpathy]: 这个Common Crawl数据非常原始,并且会被以许多不同的方式进行过滤。这里他们记录了——这是同一张图表——他们记录了一些在这些阶段发生的处理过程。首先是所谓的URL过滤(URL filtering)。这指的是有一些黑名单,基本上是你不想从中获取数据的URL或域名。这通常包括恶意软件网站、垃圾邮件网站、营销网站、种族主义网站、成人网站之类的东西。所以在这一阶段有大量不同类型的网站被剔除,因为我们不希望它们出现在我们的数据集中。
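A minimal sketch of this kind of URL filtering, assuming a hypothetical domain blocklist (the `BLOCKED_DOMAINS` names are made up; real pipelines use large curated lists covering malware, spam, adult sites, and so on):

```python
from urllib.parse import urlparse

# Hypothetical blocklist of domains (malware, spam, adult, ...).
BLOCKED_DOMAINS = {"spam.example", "malware.example"}

def passes_url_filter(url):
    host = urlparse(url).hostname or ""
    # Block the listed domain itself and any of its subdomains.
    return not any(host == d or host.endswith("." + d)
                   for d in BLOCKED_DOMAINS)

urls = [
    "https://en.wikipedia.org/wiki/Tornado",
    "http://ads.spam.example/buy-now",
    "https://malware.example/payload",
]
kept = [u for u in urls if passes_url_filter(u)]
```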

[原文] [Andrej Karpathy]: The second part is text extraction. You have to remember that what these crawlers are saving is the raw HTML of these web pages. So when I go to inspect here, this is what the raw HTML actually looks like. You'll notice that it's got all this markup, like lists and stuff like that, and there's CSS and all this kind of stuff. So this is almost computer code for these web pages, but what we really want is just the text, right? We just want the text of the web page, and we don't want the navigation and things like that. So there's a lot of filtering and processing and heuristics that go into adequately filtering for just the good content of these web pages.

[译文] [Andrej Karpathy]: 第二部分是文本提取(text extraction)。你要记得所有这些网页,爬虫保存的是这些网页的原始HTML。所以当我在这里点击检查时,这就是原始HTML实际的样子。你会注意到它有所有的标记,比如列表之类的,还有CSS和所有这类东西。所以这几乎是这些网页的计算机代码。但我们真正想要的只是文本,对吧?我们只想要这个网页的文本,我们不想要导航栏之类的东西。所以有大量的过滤和处理以及启发式方法(heuristics)被用于充分过滤,只保留这些网页中的优质内容。
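A crude illustration of text extraction using Python's standard-library `html.parser`: keep visible text, drop markup, and skip boilerplate containers such as scripts, styles, and navigation. Production pipelines use much more sophisticated extraction heuristics; this is only a sketch.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crude text extractor: keep text, drop markup, skip script/style/nav."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.depth_skipped = 0   # how many skip-tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth_skipped += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth_skipped:
            self.depth_skipped -= 1

    def handle_data(self, data):
        if self.depth_skipped == 0 and data.strip():
            self.chunks.append(data.strip())

html = "<html><nav>Home | About</nav><p>Tornadoes struck in <b>2012</b>.</p></html>"
p = TextExtractor()
p.feed(html)
text = " ".join(p.chunks)
```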

[原文] [Andrej Karpathy]: The next stage here is language filtering. So, for example, FineWeb filters using a language classifier: they try to guess what language every single web page is in, and then they only keep web pages that have more than 65% English, as an example. So you can get a sense that this is like a design decision that different companies can take for themselves: what fraction of all the different types of languages are we going to include in our dataset? Because, for example, if we filter out all of the Spanish, then you might imagine that our model later will not be very good at Spanish, because it's just never seen that much data of that language. So different companies can focus on multilingual performance to a different degree. As an example, FineWeb is quite focused on English, and so their language model, if they end up training one later, will be very good at English but maybe not very good at other languages.

[译文] [Andrej Karpathy]: 接下来的阶段是语言过滤(language filtering)。例如,FineWeb使用语言分类器进行过滤,他们试图猜测每个网页是什么语言,然后作为一个例子,他们只保留那些英语含量超过65%的网页。你可以感觉到这就像是一个设计决策,不同的公司可以自己决定在我们的数据集中包含多少比例的各种不同语言。因为例如,如果我们过滤掉所有的西班牙语,那么你可以想象我们的模型之后在西班牙语方面就不会很好,因为它从未见过那么多该语言的数据。所以不同的公司可以不同程度地关注多语言性能。例如FineWeb非常专注于英语,所以如果他们后来训练一个语言模型,它在英语方面会非常好,但在其他语言方面可能就不那么好了。
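The 65%-threshold idea can be sketched as follows. `classify` stands in for a real language classifier (its interface here is an assumption), and the tiny word-list scorer is purely illustrative.

```python
def filter_by_language(pages, classify, lang="en", threshold=0.65):
    """Keep pages whose classifier score for `lang` meets the threshold.

    `classify(text)` stands in for a real language classifier and is
    assumed to return a {language: probability} mapping.
    """
    return [p for p in pages if classify(p).get(lang, 0.0) >= threshold]

# Toy "classifier": fraction of words found in a tiny English word list.
ENGLISH = {"the", "a", "storm", "hit", "town", "is"}

def toy_classify(text):
    words = text.lower().split()
    score = sum(w in ENGLISH for w in words) / max(len(words), 1)
    return {"en": score}

pages = ["the storm hit the town", "el gato duerme en la casa"]
kept = filter_by_language(pages, toy_classify)
```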

[原文] [Andrej Karpathy]: After language filtering there are a few other filtering steps, and de-duplication and things like that, finishing with, for example, the PII removal. This is personally identifiable information. So, as an example, addresses, Social Security numbers, and things like that: you would try to detect them, and you would try to filter out those kinds of web pages from the dataset as well. So there are a lot of stages here, and I won't go into full detail, but it is a fairly extensive part of the pre-processing, and you end up with, for example, the FineWeb dataset. When you click in on it, you can see some examples here of what this actually ends up looking like, and anyone can download this on the Hugging Face web page. So here are some examples of the final text that ends up in the training set.

[译文] [Andrej Karpathy]: 在语言过滤之后,还有一些其他的过滤步骤和去重(De-duplication)之类的操作,最后以例如PII移除(个人身份信息移除)结束。例如地址、社会安全号码之类的东西,你会试图检测它们,并试图从数据集中过滤掉包含这些内容的网页。这里有很多阶段,我不会完全详细地讲,但这是预处理中相当广泛的一部分。最终你会得到例如FineWeb数据集。当你点击进去时,你可以看到一些例子,看看它实际上最终是什么样子的,任何人都可以在Hugging Face网页上下载它。这里有一些最终进入训练集的文本示例。
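A minimal, hypothetical illustration of PII detection with regular expressions; the two patterns (US Social Security numbers and email addresses) are examples only, and real pipelines use many more patterns and policies for dropping or redacting pages.

```python
import re

# Hypothetical PII patterns: US Social Security numbers and emails.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def contains_pii(text):
    """True if the text matches any of our (toy) PII patterns."""
    return bool(SSN_RE.search(text) or EMAIL_RE.search(text))

docs = [
    "Contact me at jane@example.com for details.",
    "My SSN is 123-45-6789.",
    "Tornadoes struck the region in 2012.",
]
clean = [d for d in docs if not contains_pii(d)]
```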

[原文] [Andrej Karpathy]: So this is some article about tornadoes in 2012: there were some tornadoes in 2012, and what happened. This next one is something about "did you know you have two little yellow, 9-volt-battery-sized adrenal glands in your body?" Okay, so this is some kind of an odd medical article. So just think of these as basically web pages on the internet, filtered just for the text in various ways. And now we have a ton of text, 40 terabytes of it, and that now is the starting point for the next step of this stage. Now, I wanted to give you an intuitive sense of where we are right now, so I took the first 200 web pages here (and remember we have tons of them), and I just take all that text and put it all together, concatenate it. This is what we end up with: we just get this raw text, raw internet text, and there's a ton of it, even in these 200 web pages. So I can continue zooming out here, and we just have this massive tapestry of text data. This text data has all these patterns, and what we want to do now is start training neural networks on this data, so the neural networks can internalize and model how this text flows, right? So we just have this giant texture of text, and now we want to get neural nets that mimic it.

[译文] [Andrej Karpathy]: 这是一篇关于2012年龙卷风的文章,讲的是2012年发生的一些龙卷风以及发生了什么。下一个是关于“你知道你体内有两个像9伏电池大小的小黄色肾上腺吗”,好吧,这是某种奇怪的医学文章。所以就把这些基本上看作是互联网上的网页,经过各种方式过滤只保留了文本。现在我们有大量的文本,40TB的文本,这就是这个阶段下一步的起点。我想让你们直观地感受一下我们现在的处境。所以我取了这里的前200个网页,记得我们有海量的网页,我把所有的文本放在一起,把它们串联(concatenate)起来。这就是我们最终得到的结果,我们得到了这些原始文本,原始的互联网文本,即使在这200个网页中也有大量内容。我可以继续在这里缩小视图,我们就像拥有了一幅巨大的文本数据挂毯。这些文本数据包含了所有这些模式,我们现在要做的是开始在这个数据上训练神经网络,以便神经网络能够内化并模拟这些文本是如何流动的,对吧?所以我们就有了这巨大的文本纹理,现在我们想让神经网络来模仿它。

[原文] [Andrej Karpathy]: Okay, now before we plug text into neural networks, we have to decide how we're going to represent this text and how we're going to feed it in. Now, the way our technology works for these neural nets is that they expect a one-dimensional sequence of symbols, and they want a finite set of symbols that are possible. So we have to decide what the symbols are, and then we have to represent our data as a one-dimensional sequence of those symbols. Right now what we have is a one-dimensional sequence of text: it starts here, and it goes here, and then it comes here, etc. So this is a one-dimensional sequence, even though on my monitor, of course, it's laid out in a two-dimensional way; it goes from left to right and top to bottom, right? So it's a one-dimensional sequence of text.

[译文] [Andrej Karpathy]: 好的,在我们将文本输入神经网络之前,我们必须决定如何表示这些文本,以及如何将其输入进去。我们的神经网络技术的工作方式是,它们期望一个一维的符号序列(one-dimensional sequence of symbols),并且它们需要一个有限的可能符号集。所以我们必须决定什么是符号,然后我们必须把我们的数据表示为这些符号的一维序列。目前我们拥有的是一维的文本序列,它从这里开始,到这里,然后到这里等等。这是一个一维序列,尽管在我的显示器上当然是以二维方式排列的,但它从左到右,从上到下,对吧?所以它是一个一维的文本序列。

[原文] [Andrej Karpathy]: Now, this being computers, of course there's an underlying representation here. So if I do what's called UTF-8 encoding of this text, then I can get the raw bits that correspond to this text in the computer, and that's what that looks like. It turns out that, for example, this very first bar here is the first eight bits, as an example. So what is this thing, right? In a certain sense, this is the representation we are looking for: we have exactly two possible symbols, zero and one, and we have a very long sequence of it. Now, as it turns out, this sequence length is actually going to be a very finite and precious resource in our neural network, and we actually don't want extremely long sequences of just two symbols. Instead, what we want is to trade off the symbol count of this vocabulary, as we call it, against the resulting sequence length. We don't want just two symbols and extremely long sequences; we're going to want more symbols and shorter sequences.

[译文] [Andrej Karpathy]: 既然是计算机,当然这里有一个底层的表示。如果我对此文本进行所谓的UTF-8编码,我就能得到计算机中对应于此文本的原始比特(bits)。这就是它看起来的样子。实际上,例如这里的第一个条形就是前8个比特。这是什么东西呢?在某种意义上,这就是我们在寻找的表示,我们确切地有两个可能的符号:0和1,并且我们现在有一个非常长的序列。事实证明,这个序列长度在我们的神经网络中实际上是非常有限且宝贵的资源,我们实际上不想要只有两个符号的极长序列。相反,我们想要做的是权衡这个符号集大小(我们称之为词汇表vocabulary)和结果序列的长度。所以我们不想要只有两个符号和极长的序列,我们会想要更多的符号和更短的序列。

[原文] [Andrej Karpathy]: Okay, so one naive way of compressing or decreasing the length of our sequence here is to consider some group of consecutive bits, for example eight bits, and group them into a single what's called byte. Because these bits are either on or off, if we take a group of eight of them, there turn out to be only 256 possible combinations of how these bits could be on or off, and so therefore we can re-represent this sequence as a sequence of bytes instead. This sequence of bytes will be eight times shorter, but now we have 256 possible symbols, so every number here goes from 0 to 255. Now, I really encourage you to think of these not as numbers but as unique IDs, or like unique symbols. Maybe it's better to actually replace every one of these with a unique emoji; you'd get something like this. So we basically have a sequence of emojis, and there are 256 possible emojis. You can think of it that way.

[译文] [Andrej Karpathy]: 好的,一种压缩或减少我们序列长度的朴素方法基本上是考虑一组连续的比特,例如8个比特,并将它们组合成一个所谓的字节(byte)。因为这些比特要么开要么关,如果我们取一组8个,实际上只有256种可能的组合。因此,我们可以将这个序列重新表示为一个字节序列。这个字节序列将缩短8倍,但现在我们有256个可能的符号。所以这里的每个数字都在0到255之间。我真的鼓励你们不要把这些看作数字,而是看作唯一的ID或唯一的符号。也许这有点……也许更好的思考方式是把每一个都替换成一个独特的表情符号(Emoji),你会得到像这样的东西。所以我们基本上有一个表情符号序列,有256种可能的表情符号,你可以这样想。
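The bit-view versus byte-view trade-off above can be seen in a few lines of Python (the sample string is arbitrary):

```python
text = "hi there"

# UTF-8 turns text into raw bytes; each byte is 8 bits.
raw = text.encode("utf-8")
bits = "".join(f"{b:08b}" for b in raw)   # only 2 symbols (0/1), but long

# The byte view: 8x shorter, 256 possible symbols,
# each best thought of as an opaque ID in 0..255.
ids = list(raw)
```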

[原文] [Andrej Karpathy]: Now it turns out that, in production, for state-of-the-art language models you actually want to go even beyond this. You want to continue to shrink the length of the sequence, because, again, it is a precious resource, in return for more symbols in your vocabulary. The way this is done is by running what's called the byte pair encoding algorithm, and the way this works is we're basically looking for consecutive bytes or symbols that are very common. So, for example, it turns out that the sequence 116 followed by 32 is quite common and occurs very frequently. So what we're going to do is group this pair into a new symbol: we're going to mint a symbol with an ID of 256, and we're going to rewrite every single pair "116 32" with this new symbol. Then we can iterate this algorithm as many times as we wish, and each time we mint a new symbol we're decreasing the length and increasing the symbol count.

[译文] [Andrej Karpathy]: 事实证明,在生产环境中最先进的语言模型中,你实际上想要超越这一步。你想要继续缩短序列的长度,因为再说一次,这是一个宝贵的资源,作为交换,你可以增加词汇表中的符号数量。实现这一点的方法是运行一种称为“字节对编码”(Byte Pair Encoding, BPE)的算法。其工作原理基本上是我们寻找非常常见的连续字节或符号。例如,事实证明序列116后面跟着32是非常常见的,出现频率很高。所以我们要做的就是将这一对组合成一个新的符号。我们要铸造一个ID为256的新符号,并将每一个116、32对重写为这个新符号。然后我们可以随心所欲地多次迭代这个算法,每次我们铸造一个新符号时,我们都在减少序列长度并增加符号集的大小。
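A minimal sketch of the byte pair encoding loop just described: repeatedly find the most common adjacent pair and mint a new symbol for it. Real tokenizers train on huge corpora and perform on the order of 100,000 merges; this toy runs three merges on one sentence.

```python
from collections import Counter

def most_common_pair(ids):
    """Most frequent adjacent pair of symbols in the sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Rewrite every occurrence of `pair` as the single symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (vocabulary of 256) and mint new symbols.
ids = list("the theme of the thesis".encode("utf-8"))
merges = {}
next_id = 256
for _ in range(3):   # three merge rounds; production tokenizers run ~100,000
    pair = most_common_pair(ids)
    merges[pair] = next_id       # remember the minted symbol
    ids = merge(ids, pair, next_id)
    next_id += 1
```

Each round shortens the sequence while growing the vocabulary, which is exactly the trade-off described above; here the first minted symbol, 256, stands for the byte pair ("t", "h").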

[原文] [Andrej Karpathy]: And in practice, it turns out that a pretty good setting of, basically, the vocabulary size is about 100,000 possible symbols. In particular, GPT-4 uses 100,277 symbols. This process of converting from raw text into these symbols, or tokens as we call them, is the process called tokenization. So let's now take a look at how GPT-4 performs tokenization, converting from text to tokens and from tokens back to text, and what this actually looks like. One website I like to use to explore these token representations is called Tiktokenizer. So come here to the drop-down and select cl100k_base, which is the GPT-4 base model tokenizer, and here on the left you can put in text and it shows you the tokenization of that text.

[译文] [Andrej Karpathy]: 在实践中,基本上词汇表大小的一个相当好的设定大约是100,000个可能的符号。具体来说,GPT-4使用了100,277个符号。这个从原始文本转换为这些符号——或者我们称之为“Token”(词元)——的过程就是所谓的分词(Tokenization)。现在让我们来看看GPT-4是如何执行分词的,即从文本转换到Token,以及从Token转换回文本,看看这实际上是什么样子的。我喜欢用来探索这些Token表示的一个网站叫Tiktokenizer。到这里的下拉菜单选择cl100k_base,这是GPT-4的基础模型分词器。在这里左边你可以输入文本,它会显示该文本的分词结果。

[原文] [Andrej Karpathy]: So, for example, "hello", space, "world". "hello world" turns out to be exactly two tokens: the token "hello", which is the token with ID 15339, and the token " world" (space world), which is the token 1917. Now, if I was to join these two, for example, I'm going to get again two tokens, but it's the token "h" followed by the token "elloworld" without the "h". If I put in two spaces here between "hello" and "world", it's again a different tokenization; there's a new token 220 here. Okay, so you can play with this and see what happens. Also keep in mind this is case-sensitive, so if this is a capital "H" it is something else, or if it's "Hello World" then actually this ends up being three tokens instead of just two tokens. So you can play with this and get sort of an intuitive sense of how these tokens work.

[译文] [Andrej Karpathy]: 举个例子,“hello space world”。结果“hello world”正好是两个Token:Token "hello"(ID为15339)和Token " world"(注意前面有空格,ID为1917)。所以是"hello" + " world"。如果我把这两个连起来(去掉空格),我还是会得到两个Token,但是是Token "H" 加上 Token "elloworld"(此处为示例解释,实际分词可能不同,意指分词方式改变)。如果我在hello和world之间输入两个空格,这又是一个不同的分词,这里会出现一个新的Token 220。好的,你可以玩一下这个,看看会发生什么。另外要记住这是区分大小写的,所以如果是大写H,那就是别的东西;如果是"hello world"(某种特定拼写或组合),实际上最终可能是三个Token而不是两个。是的,你可以玩一下这个,对这些Token是如何工作的获得一种直观的感觉。

[原文] [Andrej Karpathy]: We're actually going to loop around to tokenization a bit later in the video. For now, I just wanted to show you the website, and I wanted to show you what this text basically is at the end of the day. So, for example, if I take one line here, this is what GPT-4 will see it as: this text will be a sequence of length 62. This is the sequence here, and this is how the chunks of text correspond to these symbols. Again, there are 100,277 possible symbols, and we now have one-dimensional sequences of those symbols. So, yeah, we're going to come back to tokenization, but that's where we are for now. Okay, so what I've done now is I've taken this sequence of text that we have here in the dataset, and I have re-represented it, using our tokenizer, into a sequence of tokens, and this is what that looks like now. For example, when we go back to the FineWeb dataset, they mention that not only is this 44 terabytes of disk space, but this is about a 15-trillion-token sequence in this dataset. Here these are just the first few thousand tokens of this dataset, I think, but there are 15 trillion to keep in mind. And again, keep in mind one more time that all of these represent little text chunks; they're all just like atoms of these sequences, and the numbers here don't make any sense: they're just unique IDs.

[译文] [Andrej Karpathy]: 我们实际上会在视频后面再回头讲分词,现在我只想给你们展示这个网站,并想向你们展示这段文本归根结底是什么样的。例如如果我取这一行,这就是GPT-4所看到的。这段文本将是一个长度为62的序列,这就是那个序列,这就展示了文本块是如何对应这些符号的。再说一次,有100,277个可能的符号,我们现在有了这些符号的一维序列。所以,是的,我们会回到分词这个话题,但这目前就是我们的进度。好的,所以我现在做的是,我取了我们在数据集中拥有的这个文本序列,我用我们的分词器将其重新表示为一个Token序列,这就是它现在的样子。例如,当我们回到FineWeb数据集时,他们提到这不仅是44TB的磁盘空间,而且在这个数据集中大约有15万亿(15 trillion)个Token序列。这里只是这个数据集前几千个Token的展示,但要记住这里有15万亿个。再一次记住,所有这些都代表小的文本块,它们就像是这些序列的原子,这里的数字本身没有任何意义,它们只是唯一的ID。


章节 2:神经网络训练与Transformer架构 (Neural Network Training and Transformer Architecture)

📝 本节摘要

本章深入探讨了大型语言模型的核心构建过程——神经网络训练。Andrej Karpathy 解释了模型训练的本质是预测序列中的下一个 Token。通过将文本切分为上下文窗口(Context Window),神经网络利用海量参数(权重)计算每个可能的下一个 Token 的概率分布。本章还展示了 Transformer 架构的内部可视化,解释了输入如何通过注意力机制(Attention)和多层感知机(MLP)流动。最后,详细阐述了推理(Inference)阶段:模型如何通过“掷有偏的硬币”(加权采样)的方式,生成既符合统计规律又具有随机性的新文本,从而成为一种“互联网文本的混合器”。

[原文] [Andrej Karpathy]: Okay, so now we get to the fun part, which is the neural network training, and this is where a lot of the heavy lifting happens computationally when you're training these neural networks. So what we do in this step is we want to model the statistical relationships of how these tokens follow each other in the sequence.

[译文] [Andrej Karpathy]: 好的,现在我们要进入有趣的部分了,也就是神经网络训练(neural network training),这也是训练这些神经网络时大部分计算繁重工作发生的地方。在这个步骤中我们要做的,就是对这些Token在序列中如何相互跟随的统计关系进行建模。

[原文] [Andrej Karpathy]: So what we do is we come into the data and we take windows of tokens. We take a window of tokens from this data fairly randomly, and the window's length can range anywhere between zero tokens, actually, all the way up to some maximum size that we decide on. So, for example, in practice you could see windows of, say, 8,000 tokens.

[译文] [Andrej Karpathy]: 所以我们的做法是进入数据中,选取Token的窗口(Windows of tokens)。我们从数据中相当随机地选取一个Token窗口,窗口的长度可以在零个Token一直到我们设定的某个最大尺寸之间。例如在实践中,你可以看到比如8,000个Token的窗口。

[原文] [Andrej Karpathy]: Now, in principle we can use arbitrary window lengths of tokens, but processing very long window sequences would just be very computationally expensive. So we just kind of decide that, say, 8,000 is a good number, or 4,000 or 16,000, and we crop it there. Now, in this example I'm going to be taking the first four tokens, just so everything fits nicely. So we're going to take a window of four tokens (this bar "|", "View", "ing", and " Single"), which are these token IDs, and now what we're trying to do here is basically predict the token that comes next in the sequence. So 3962 comes next, right?

[译文] [Andrej Karpathy]: 原则上我们可以使用任意长度的Token窗口,但是处理非常长的窗口序列在计算上会非常昂贵。所以我们就决定说,比如8,000是个好数字,或者是4,000或16,000,然后我们在那里截断。在这个例子中,我将取前四个Token,只是为了让一切看起来合适。所以我们要取一个包含这四个Token的窗口:“|”、“View”、“ing”和“ Single”(空格Single),它们对应这些Token ID。现在我们要做的,基本上就是预测序列中接下来会出现的Token。所以接下来是3962,对吧?
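The windowing step can be sketched as follows. The token IDs below are arbitrary stand-ins, and real training samples large batches of such windows from the full token stream.

```python
import random

def sample_windows(tokens, window, n, seed=0):
    """Sample n (context, target) training examples from a token stream.

    Each example is `window` consecutive tokens as context, plus the
    token that follows them as the prediction target.
    """
    rng = random.Random(seed)    # seeded for reproducibility
    examples = []
    for _ in range(n):
        start = rng.randrange(len(tokens) - window)
        context = tokens[start:start + window]
        target = tokens[start + window]   # the "label": what actually came next
        examples.append((context, target))
    return examples

# Arbitrary stand-in token IDs.
tokens = [2746, 1103, 304, 3962, 4917, 2449, 1058, 11799, 220, 402]
batch = sample_windows(tokens, window=4, n=3)
```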

[原文] [Andrej Karpathy]: So what we do now is we call these four tokens the context, and they feed into a neural network. This is the input to the neural network. Now, I'm going to go into the detail of what's inside this neural network in a little bit; for now, what's important to understand is the input and the output of the neural net. The input is a sequence of tokens of variable length, anywhere between zero and some maximum size like 8,000. The output, now, is a prediction for what comes next.

[译文] [Andrej Karpathy]: 所以我们现在把这称为上下文(context),这四个Token就是上下文,它们被输入到一个神经网络中,这就是神经网络的输入。稍后我会详细介绍这个神经网络内部是什么,目前重要的是要理解神经网络的输入和输出。输入是可变长度的Token序列,长度在零到某个最大尺寸(比如8,000)之间;现在的输出是对接下来会出现什么的预测。

[原文] [Andrej Karpathy]: Because our vocabulary has 100,277 possible tokens, the neural network is going to output exactly that many numbers, and all of those numbers correspond to the probability of that token coming next in the sequence. So it's making guesses about what comes next. In the beginning, this neural network is randomly initialized (we're going to see in a little bit what that means), so it's a random transformation, and these probabilities at the very beginning of the training are also going to be kind of random.

[译文] [Andrej Karpathy]: 因为我们的词汇表有100,277个可能的Token,神经网络将输出恰好那么多数量的数字,所有这些数字都对应于那个Token作为序列中下一个出现的概率。所以它在猜测接下来是什么。在一开始,这个神经网络是随机初始化(randomly initialized)的,我们稍后会看到这意味着什么,但这是一种随机变换,所以训练最开始时的这些概率也将是某种随机的。
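The network's raw outputs, one number per vocabulary entry, are turned into a probability distribution that sums to 1; the standard way to do this is a softmax. A minimal sketch with a toy five-token vocabulary (real models emit 100,277 numbers here):

```python
import math

def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits: one per token of a 5-token vocabulary.
logits = [1.0, -0.5, 2.0, 0.0, 0.5]
probs = softmax(logits)
```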

[原文] [Andrej Karpathy]: So here I have three examples, but keep in mind that there are 100,000 numbers here. So the probability of this token, " Direction", the neural network is saying is 4% right now; 11799 is 2%; and then here the probability of 3962, which is "post", is 3%. Now, of course, we've sampled this window from our dataset, so we know what comes next; that's the label. We know that the correct answer is that 3962 actually comes next in the sequence.

[译文] [Andrej Karpathy]: 这里我有三个例子,但请记住这里实际上有100,000个数字。所以这个Token “ Direction”(空格Direction)的概率,神经网络说现在可能有4%;11799是2%;然后这里3962,也就是“post”的概率是3%。当然,我们是从数据集中采样这个窗口的,所以我们知道接下来是什么,那就是标签(label)。我们知道正确的答案实际上是3962紧随其后。

[原文] [Andrej Karpathy]: So now what we have is a mathematical process for doing an update to the neural network; we have a way of tuning it, and we're going to go into a little bit of detail in a bit. But basically, this probability here of 3%: we want this probability to be higher, and we want the probabilities of all the other tokens to be lower. So we have a way of mathematically calculating how to adjust and update the neural network so that the correct answer has a slightly higher probability.

[译文] [Andrej Karpathy]: 所以现在我们要进行一个数学过程来更新神经网络,我们有方法来调整它。稍后我们会讲一点细节,但基本上我们知道这里的3%的概率,我们希望这个概率更高,我们希望所有其他Token的概率更低。所以我们有一种方法可以通过数学计算如何调整和更新神经网络,以便让正确答案拥有稍微高一点的概率。

[原文] [Andrej Karpathy]: So if I do an update to the neural network now, the next time I feed this particular sequence of four tokens into the neural network, the neural network will be slightly adjusted, and it will say, okay, "post" is maybe 4%, and "case" now maybe is 1%, and " Direction" could become 2%, or something like that. So we have a way of nudging, of slightly updating, the neural net to basically give a higher probability to the correct token that comes next in the sequence.

[译文] [Andrej Karpathy]: 所以如果我现在对神经网络进行更新,下一次我把这四个Token的特定序列输入神经网络时,神经网络已经被稍微调整过了,它会说:好的,“post”可能是4%,“case”现在可能是1%,“ Direction”可能会变成2%之类的值。所以我们有一种方法来微调、稍微更新神经网络,基本上是为了给序列中下一个出现的正确Token赋予更高的概率。
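The "nudge" just described is, in standard training, a gradient-descent step on a cross-entropy loss. Here is a toy sketch that applies the update directly to the output logits; the learning rate and three-token vocabulary are illustrative, and a real trainer backpropagates this signal through billions of weights.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def update_step(logits, target, lr=0.5):
    """One gradient-descent step on cross-entropy loss w.r.t. the logits.

    For softmax + cross-entropy the gradient is (p - y): the correct
    token's logit is pushed up, every other logit is pushed down.
    """
    p = softmax(logits)
    return [logits[i] - lr * (p[i] - (1.0 if i == target else 0.0))
            for i in range(len(logits))]

logits = [0.2, -1.0, 0.1]   # toy 3-token vocabulary
target = 2                  # the token that actually came next (the label)
before = softmax(logits)[target]
after = softmax(update_step(logits, target))[target]
```

After the step, the correct token's probability is slightly higher, exactly the behavior described above.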

[原文] [Andrej Karpathy]: And now you just have to remember that this process happens not just for this token here, where these four fed in and predicted this one; this process happens at the same time for all of these tokens in the entire dataset. So in practice, we sample little windows, little batches of windows, and then at every single one of these tokens we want to adjust our neural network so that the probability of that token becomes slightly higher. This all happens in parallel, in large batches of these tokens, and this is the process of training the neural network. It's a sequence of updates so that its predictions match up with the statistics of what actually happens in your training set, and its probabilities become consistent with the statistical patterns of how these tokens follow each other in the data.

[译文] [Andrej Karpathy]: 现在你只需要记住,这个过程不仅仅发生在这里的这个Token上(即输入四个预测这一个),这个过程是同时发生在整个数据集的所有Token上的。所以在实践中,我们采样小窗口,也就是小批量的窗口(batches of Windows),然后在每一个Token上,我们都想调整我们的神经网络,使得那个Token的概率稍微变高。这一切都是大批量并行发生的,这就是训练神经网络的过程。这是一个不断更新它的序列的过程,以便它的预测与训练集中实际发生的统计数据相匹配,并且它的概率与数据中这些Token如何相互跟随的统计模式保持一致。

[原文] [Andrej Karpathy]: So let's now briefly get into the internals of these neural networks, just to give you a sense of what's inside. As I mentioned, we have these inputs, which are sequences of tokens; in this case four input tokens, but this can be anywhere between zero up to, let's say, 8,000 tokens. In principle, this could be an infinite number of tokens; it would just be too computationally expensive to process an infinite number of tokens, so we crop it at a certain length, and that becomes the maximum context length of that model.

[译文] [Andrej Karpathy]: 现在让我们简要地进入这些神经网络的内部,只是为了让你对里面有什么有个概念。所以神经网络内部,正如我提到的,我们有这些输入,也就是Token序列。在这个例子中是四个输入Token,但这可以在零到比如8,000个Token之间。原则上这可以是无限数量的Token,只是处理无限数量的Token在计算上太昂贵了,所以我们就在某个长度上截断它,这就成为了该模型的最大上下文长度(maximum context length)。

[原文] [Andrej Karpathy]: Now, these inputs X are mixed up in a giant mathematical expression together with the parameters, or the weights, of these neural networks. Here I'm showing six example parameters and their settings, but in practice these modern neural networks will have billions of these parameters, and in the beginning these parameters are completely randomly set.

[译文] [Andrej Karpathy]: 现在这些输入X在一个巨大的数学表达式中与神经网络的参数(parameters)或权重(weights)混合在一起。这里我展示了六个示例参数及其设置,但在实践中,这些现代神经网络将拥有数十亿个这样的参数。在一开始,这些参数是完全随机设置的。

[原文] [Andrej Karpathy]: Now, with a random setting of parameters, you might expect that this neural network would make random predictions, and it does: in the beginning it's totally random predictions. But it's through this process of iteratively updating the network (we call that process training a neural network) that the setting of these parameters gets adjusted such that the outputs of our neural network become consistent with the patterns seen in our training set.

[译文] [Andrej Karpathy]: 现在有了随机设置的参数,你可能会预期这个神经网络会做出随机的预测,确实如此,一开始完全是随机预测。但通过迭代更新网络的过程——我们称之为训练神经网络的过程——这些参数的设置被调整,使得我们神经网络的输出变得与训练集中看到的模式一致。

[原文] [Andrej Karpathy]: So think of these parameters as kind of like knobs on a DJ set, and as you're twiddling these knobs, you're getting different predictions for every possible token sequence input. Training a neural network just means discovering a setting of parameters that seems to be consistent with the statistics of the training set.

[译文] [Andrej Karpathy]: 所以把这些参数想象成DJ台上的旋钮。当你转动这些旋钮时,你会针对每一个可能的Token序列输入得到不同的预测。而训练神经网络仅仅意味着发现一组参数设置,这组设置看起来与训练集的统计数据相一致。

[原文] [Andrej Karpathy]: Now let me just give you an example of what this giant mathematical expression looks like, just to give you a sense. Modern networks are massive expressions with trillions of terms, probably, but let me just show you a simple example here; it would look something like this. I mean, these are the kinds of expressions, just to show you that it's not very scary. We have inputs x, like x1 and x2, in this case two example inputs, and they get mixed up with the weights of the network, w0, w1, w2, w3, etc. This mixing is simple things like multiplication, addition, exponentiation, division, etc.

[译文] [Andrej Karpathy]: 现在让我给你们举个例子,看看这个巨大的数学表达式是什么样子的,只是为了让你们有个概念。现代网络是可能有万亿项的巨大表达式,但我只是在这里给你们展示一个简单的例子,它看起来像这样。我的意思是,这些表达式只是为了向你展示它并不可怕。我们有输入X,比如X1、X2,在这个例子中是两个示例输入,它们与网络的权重w0、w1、2、3等混合在一起。这种混合就是简单的乘法、加法、指数运算、除法等。
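A toy version of such an expression, assuming six illustrative weights mixed with two inputs via multiplies, adds, and a tanh squashing function; the particular wiring is made up purely for illustration.

```python
import math

def tiny_net(x1, x2, w):
    """A toy 'giant mathematical expression': two inputs mixed with six
    weights using only multiplies, adds, and tanh."""
    h1 = math.tanh(w[0] * x1 + w[1] * x2)   # hidden unit 1
    h2 = math.tanh(w[2] * x1 + w[3] * x2)   # hidden unit 2
    return w[4] * h1 + w[5] * h2            # output: a single prediction

# Illustrative weight settings; training would tune these knobs.
w = [0.5, -0.3, 0.1, 0.8, 1.2, -0.7]
y = tiny_net(1.0, 2.0, w)
```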

[原文] [Andrej Karpathy]: And it is the subject of neural network architecture research to design effective mathematical expressions that have a lot of convenient characteristics: they are expressive, they're optimizable, they're parallelizable, etc. But at the end of the day, these are not complex expressions. Basically, they mix up the inputs with the parameters to make predictions, and we're optimizing the parameters of this neural network so that the predictions come out consistent with the training set.

[译文] [Andrej Karpathy]: 神经网络架构研究的主题就是设计有效的数学表达式,这些表达式具有许多方便的特性:它们具有表达力、可优化、可并行化等。但归根结底,这些并不是复杂的表达式,基本上它们将输入与参数混合以进行预测,而我们正在优化这个神经网络的参数,以便预测结果与训练集一致。

[原文] [Andrej Karpathy]: Now I would like to show you an actual production-grade example of what these neural networks look like. For that, I encourage you to go to this website that has a very nice visualization of one of these networks. This is what you will find on this website, and this neural network here, of the kind used in production settings, has this special structure. This network is called the Transformer, and this particular one, as an example, has roughly 85,000 parameters.

[译文] [Andrej Karpathy]: 现在我想向你们展示一个实际的生产级示例,看看这些神经网络是什么样子的。为此,我鼓励你们去这个网站,那里有一个非常好的网络可视化。这就是你会在这个网站上找到的,这个在生产环境中使用的神经网络具有这种特殊的结构。这个网络被称为Transformer。作为一个例子,这一个大概有85,000个参数。

[原文] [Andrej Karpathy]: now here on the top we take the inputs which are the token sequences and then information flows through the neural network until the output which here are the logits and their softmax but these are the predictions for what comes next what token comes next and then here there's a sequence of Transformations and all these intermediate values that get produced inside this mathematical expression as it is sort of predicting what comes next

[译文] [Andrej Karpathy]: 在顶部我们接受输入,即Token序列,然后信息流经神经网络直到输出,这里是Logit Softmax,也就是对接下来会出现什么Token的预测。然后这里有一系列的变换,以及在这个数学表达式内部产生的所有这些中间值,它在某种程度上就是在预测接下来会发生什么。

[原文] [Andrej Karpathy]: so as an example these tokens are embedded into kind of like this distributed representation as it's called so every possible token has kind of like a vector that represents it inside the neural network so first we embed the tokens and then those values uh kind of like flow through this diagram and these are all very simple mathematical Expressions individually so we have layer norms and Matrix multiplications and softmaxes and so on so here kind of like the attention block of this Transformer and then information kind of flows through into the multi-layer perceptron block and so on and all these numbers here these are the intermediate values of the expression

[译文] [Andrej Karpathy]: 举个例子,这些Token被嵌入到所谓的分布式表示(distributed representation)中。所以每个可能的Token都有一种在神经网络内部代表它的向量(Vector)。首先我们嵌入Token,然后这些数值在这个图表中流动。这些单独看都是非常简单的数学表达式,所以我们有层归一化(Layer Norms)、矩阵乘法(Matrix Multiplications)和Softmax等。这里就像是这个Transformer的注意力模块(Attention Block),然后信息流向多层感知机模块(Multi-Layer Perceptron Block)等等。这里的所有数字都是表达式的中间值。
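这里提到的嵌入查找、层归一化、矩阵乘法和 Softmax,可以用下面的纯 Python 草图串起来看。其中词表大小(4)、嵌入维度(3)和所有数值都是为演示随意设定的假设;真实 Transformer 还包含注意力等更多模块,并逐层堆叠这些运算:

```python
import math

# Minimal sketch of the operations named in the transcript:
# embedding lookup -> layer norm -> matrix multiply -> softmax.
def softmax(xs):
    m = max(xs)                        # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def layernorm(xs, eps=1e-5):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

def matvec(W, xs):
    return [sum(w * x for w, x in zip(row, xs)) for row in W]

# toy vocabulary of 4 tokens, each embedded as a 3-dim vector
embedding = [[0.1, 0.2, 0.3], [0.5, -0.1, 0.0], [0.0, 0.4, -0.2], [0.3, 0.3, 0.1]]
W_out = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 1.0, 0.0], [0.2, -0.2, 0.4]]

token = 1                               # input token id
h = layernorm(embedding[token])         # normalize the embedding
logits = matvec(W_out, h)               # project to vocabulary size
probs = softmax(logits)                 # probabilities over the next token
print(probs, sum(probs))
```

输出是一个对下一个Token的概率分布,各项之和为 1,这正是网络顶端 Logits/Softmax 所产出的东西。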

[原文] [Andrej Karpathy]: and uh you can almost think of these as kind of like the firing rates of these synthetic neurons but I would caution you to uh not um kind of think of it too much like neurons because these are extremely simple neurons compared to the neurons you would find in your brain your biological neurons are very complex dynamical processes that have memory and so on there's no memory in this expression it's a fixed mathematical expression from input to Output with no memory it's just stateless so these are very simple neurons in comparison to biological neurons but you can still kind of loosely think of this as like a synthetic piece of uh brain tissue if you like to think about it that way

[译文] [Andrej Karpathy]: 你几乎可以把这些看作是这些合成神经元的发放率(firing rates)。但我会提醒你不要太把它想成神经元,因为与你大脑中的神经元相比,这些是极其简单的神经元。你的生物神经元是非常复杂的动力学过程,拥有记忆等等。而在这个表达式中没有记忆,它是一个从输入到输出的固定数学表达式,没有记忆,它是无状态的(stateless)。所以与生物神经元相比,这些是非常简单的神经元。但如果你愿意的话,你仍然可以宽泛地将其视为一块合成的脑组织。

[原文] [Andrej Karpathy]: so information flows through and all these neurons fire until we get to the predictions now I'm not actually going to dwell too much on the precise kind of like mathematical details of all these Transformations honestly I don't think it's that important to get into what's really important to understand is that this is a mathematical function it is uh parameterized by some fixed set of parameters like say 85,000 of them and it is a way of transforming inputs into outputs and as we twiddle the parameters we are getting uh different kinds of predictions and then we need to find a good setting of these parameters so that the predictions uh sort of match up with the patterns seen in the training set so that's the Transformer

[译文] [Andrej Karpathy]: 信息流经所有这些神经元并被激发,直到我们得到预测。现在我其实不打算过多停留在所有这些变换的精确数学细节上。老实说,我认为深入探讨这个并不那么重要。真正重要的是要理解这是一个数学函数,它由一组固定的参数(比如85,000个)进行参数化。这是一种将输入转换为输出的方法。当我们调整参数时,我们会得到不同种类的预测,然后我们需要找到这些参数的一个好的设置,以便预测能与训练集中看到的模式相匹配。这就是Transformer。

[原文] [Andrej Karpathy]: okay so I've shown you the internals of the neural network and we talked a bit about the process of training it I want to cover one more major stage of working with these networks and that is the stage called inference so in inference what we're doing is we're generating new data from the model and so uh we want to basically see what kind of patterns it has internalized in the parameters of its Network

[译文] [Andrej Karpathy]: 好的,我已经向你们展示了神经网络的内部结构,我们谈了一点关于训练它的过程。我想再涵盖使用这些网络的一个主要阶段,那就是推理(Inference)阶段。在推理中,我们在做的是从模型生成新数据,所以我们基本上想看看它在网络参数中内化了什么样的模式。

[原文] [Andrej Karpathy]: so to generate from the model is relatively straightforward we start with some tokens that are basically your prefix like what you want to start with so say we want to start with the token 91 well we feed it into the network and remember that the network gives us probabilities right it gives us this probability Vector here so what we can do now is we can basically flip a biased coin so um we can sample uh basically a token based on this probability distribution so the tokens that are given High probability by the model are more likely to be sampled when you flip this biased coin you can think of it that way

[译文] [Andrej Karpathy]: 从模型生成数据相对直接。我们从一些Token开始,这基本上就是你的前缀(prefix),也就是你想以什么开始。比如我们想以Token 91开始,我们把它输入网络。记得网络会给我们概率,对吧?它会给我们这个概率向量。所以我们现在能做的基本上就是掷一枚有偏见的硬币(flip a biased coin)。我们可以基于这个概率分布来采样一个Token。所以模型赋予高概率的Token在你掷这枚有偏见硬币时更有可能被采样到,你可以这样想。
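这里说的"掷有偏硬币"正是加权随机抽样,Python 标准库的 `random.choices` 就实现了这种按概率分布采样。下面是一个示意,其中 Token 编号沿用正文中的例子,概率数值是假设的:

```python
import random

# Sampling one token from a probability distribution ("flipping a
# biased coin"). High-probability tokens get drawn more often.
token_ids = [860, 287, 3962, 13659]
probs     = [0.50, 0.25, 0.15, 0.10]    # assumed values; must sum to 1

random.seed(0)
counts = {t: 0 for t in token_ids}
for _ in range(10_000):
    # one weighted draw = one flip of the biased coin
    t = random.choices(token_ids, weights=probs, k=1)[0]
    counts[t] += 1

print(counts)
```

重复采样一万次后,Token 860 被抽中的次数约占一半,与其 0.5 的概率一致。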

[原文] [Andrej Karpathy]: so we sample from the distribution to get a single unique token so for example token 860 comes next uh so 860 in this case when we're generating from the model could come next now 860 is a relatively likely token it might not be the only possible token in this case there could be many other tokens that could have been sampled but we could see that 860 is a relatively likely token as an example and indeed in our training example here 860 does follow 91

[译文] [Andrej Karpathy]: 我们从分布中采样得到一个单一的唯一Token。例如接下来是Token 860。所以在这种情况下,当我们从模型生成时,860可能会是下一个。860是一个相对可能的Token,它可能不是这种情况下唯一可能的Token,可能还有许多其他Token会被采样到,但我们可以看到860是一个相对可能的Token。确实,在我们的训练例子中,860确实跟随在91之后。

[原文] [Andrej Karpathy]: so let's now say that we um continue the process so after 91 there's 860 we append it and we again ask what is the third token let's sample and let's just say that it's 287 exactly as here let's do that again we come back in now we have a sequence of three and we ask what is the likely fourth token and we sample from that and get this one and now let's say we do it one more time we take those four we sample and we get this one and this 13659 uh this is not actually uh 3962 as we had before so this token is the token article uh instead so viewing a single article

[译文] [Andrej Karpathy]: 假设我们继续这个过程。在91之后是860,我们将它追加上去,然后我们再次问第三个Token是什么。让我们采样,假设正好是这里的287。再做一次,我们回过头来,现在我们有一个三个Token的序列,我们问第四个可能的Token是什么,我们从中采样并得到这一个。现在假设我们再做一次,我们取这四个Token进行采样,我们得到了这一个,即13659。这实际上不是我们之前的3962,这个Token实际上是“article”(文章),所以变成了“viewing a single article”。

[原文] [Andrej Karpathy]: and so in this case we didn't exactly reproduce the sequence that we saw here in the training data so keep in mind that these systems are stochastic they have um we're sampling and we're flipping coins and sometimes we luck out and we reproduce some like small chunk of the text in the training set but sometimes we're uh we're getting a token that was not verbatim part of any of the documents in the training data so we're going to get sort of like remixes of the data that we saw in the training because at every step of the way we can flip and get a slightly different token and then once that token makes it in if you sample the next one and so on you very quickly uh start to generate token streams that are very different from the token streams that occur in the training documents so statistically they will have similar properties but um they are not identical to your training data they're kind of like inspired by the training data

[译文] [Andrej Karpathy]: 在这种情况下,我们并没有完全重现我们在训练数据中看到的序列。请记住这些系统是随机的(stochastic),我们在采样,我们在掷硬币。有时我们运气好,重现了训练集中的一小块文本;但有时我们会得到一个并不是训练数据中任何文档逐字逐句部分的Token。所以我们会得到一种我们在训练中看到的数据的“混音版”(remixes),因为在每一步我们都可以掷硬币并得到一个稍微不同的Token。一旦那个Token进入序列,如果你再采样下一个,依此类推,你会很快开始生成与训练文档中出现的Token流非常不同的Token流。所以在统计上它们会有相似的属性,但它们与你的训练数据并不完全相同,它们更像是受训练数据启发的。

[原文] [Andrej Karpathy]: basically inference is just uh predicting from these distributions one at a time we continue feeding back tokens and getting the next one and we uh we're always flipping these coins and depending on how lucky or unlucky we get um we might get very different kinds of patterns depending on how we sample from these probability distributions so that's inference

[译文] [Andrej Karpathy]: 基本上推理就是一次一个地从这些分布中进行预测。我们继续回传Token并获取下一个,我们总是在掷这些硬币。取决于我们的运气好坏,我们可能会根据我们从这些概率分布中采样的方式得到非常不同种类的模式。这就是推理。
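上述"采样—追加—回传—再采样"的自回归循环可以用下面的草图表达。这里用一张随意构造的二元组(bigram)概率表来代替真实的神经网络(Token 编号沿用正文示例,转移概率均为假设),只为展示循环本身的结构:

```python
import random

# Autoregressive inference loop: start from a prefix token, repeatedly
# ask the "model" for a next-token distribution, sample, append, repeat.
# The stand-in model only looks at the last token (a bigram table).
bigram = {
    91:    ([860, 287],    [0.7, 0.3]),
    860:   ([287, 3962],   [0.6, 0.4]),
    287:   ([3962, 13659], [0.5, 0.5]),
    3962:  ([91],          [1.0]),
    13659: ([91],          [1.0]),
}

def generate(prefix, n_tokens, seed=0):
    rng = random.Random(seed)
    seq = list(prefix)
    for _ in range(n_tokens):
        choices, probs = bigram[seq[-1]]   # distribution over the next token
        seq.append(rng.choices(choices, weights=probs, k=1)[0])
    return seq

print(generate([91], 4))
```

固定随机种子时输出是确定的;换一个种子,每一步的"掷硬币"就可能走向不同的序列,这正是正文所说的随机性。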

[原文] [Andrej Karpathy]: so in most common scenarios uh basically downloading the internet and tokenizing it is a pre-processing step you do that a single time and then uh once you have your token sequence we can start training networks and in practical cases you would try to train many different networks of different kinds of uh settings and different kinds of arrangements and different kinds of sizes and so you'll be doing a lot of neural network training

[译文] [Andrej Karpathy]: 所以在大多数常见场景中,基本上下载互联网并对其进行分词是一个预处理步骤,你只做一次。一旦你有了Token序列,我们就可以开始训练网络。在实际情况中,你会尝试训练许多不同种类的网络,有不同的设置、不同的排列和不同的大小,所以你会做大量的神经网络训练。

[原文] [Andrej Karpathy]: and um then once you have a neural network and you train it and you have some specific set of parameters that you're happy with um then you can take the model and you can do inference and you can actually uh generate data from the model and when you're on ChatGPT and you're talking with a model uh that model is trained and has been trained by OpenAI many months ago probably and they have a specific set of weights that work well and when you're talking to the model all of that is just inference there's no more training those parameters are held fixed and you're just talking to the model sort of uh you're giving it some of the tokens and it's kind of completing token sequences and that's what you're seeing uh generated when you actually use the model on ChatGPT so that model then just does inference alone

[译文] [Andrej Karpathy]: 一旦你有一个神经网络并且你训练了它,你有了一组你满意的特定参数,那么你就可以拿这个模型去做推理,你实际上可以从模型生成数据。当你在ChatGPT上与模型交谈时,该模型是已经训练好的,可能几个月前就被OpenAI训练好了,他们有一组效果很好的特定权重。当你与模型交谈时,所有的过程都只是推理,没有更多的训练。那些参数是固定的,你只是在与模型交谈,某种程度上你给它一些Token,它就在补全Token序列。这就是当你实际上在ChatGPT上使用模型时看到的生成过程,所以那个模型当时只做推理。


章节 3:具体案例与算力经济学 (Concrete Cases and Compute Economics)

📝 本节摘要

在本章中,Andrej Karpathy 通过复现 OpenAI 的 GPT-2 模型(2019年发布),展示了模型训练的具体细节。他对比了当年的训练成本(约4万美元)与如今利用现代硬件和软件优化后的低廉成本(约600美元甚至更低)。随后,他展示了训练过程的实时可视化界面,解释了“损失值(Loss)”如何随着训练步数下降。最后,视角转向算力基础设施,详细介绍了 NVIDIA H100 GPU 的租用成本、集群架构,以及为何这些硬件成为了科技巨头争相抢购的“淘金热”核心资源。

[原文] [Andrej Karpathy]: so let's now look at an example of training and inference that is kind of concrete and gives you a sense of what this actually looks like uh when these models are trained now the example that I would like to work with and that I'm particularly fond of is that of OpenAI's GPT-2 so GPT uh stands for Generative Pre-trained Transformer and this is the second iteration of the GPT series by OpenAI

[译文] [Andrej Karpathy]: 那么现在让我们看一个训练和推理的例子,这个例子比较具体,能让你感觉到当这些模型被训练时实际上是什么样子的。我想用的例子,也是我特别喜欢的,是OpenAI的GPT-2。GPT代表“生成式预训练Transformer”(Generative Pre-trained Transformer),这是OpenAI GPT系列的第二次迭代。

[原文] [Andrej Karpathy]: when you are talking to ChatGPT today the model that is underlying all of the magic of that interaction is GPT-4 so the fourth iteration of that series now GPT-2 was published in 2019 by OpenAI in this paper that I have right here and the reason I like GPT-2 is that it is the first time that a recognizably modern stack came together so um all of the pieces of GPT-2 are recognizable today by modern standards it's just everything has gotten bigger

[译文] [Andrej Karpathy]: 当你今天与ChatGPT交谈时,支撑所有交互魔力的底层模型是GPT-4,也就是该系列的第四次迭代。GPT-2是OpenAI在2019年发表的,就在这篇论文里。我之所以喜欢GPT-2,是因为那是一个可辨认的现代技术栈第一次成型的时刻。所以,GPT-2的所有组成部分按今天的现代标准来看都是可识别的,只是所有东西都变大了。

[原文] [Andrej Karpathy]: now I'm not going to be able to go into the full details of this paper of course because it is a technical publication but some of the details that I would like to highlight are as follows GPT-2 was a Transformer neural network just like the neural networks you would work with today and it had 1.6 billion parameters right so these are the parameters that we looked at here it would have 1.6 billion of them

[译文] [Andrej Karpathy]: 当然我无法深入探讨这篇论文的全部细节,因为它是一份技术出版物,但我想强调的一些细节如下:GPT-2是一个Transformer神经网络,就像你现在会使用的神经网络一样。它有16亿个参数,对吧?也就是我们之前看过的那些参数,它有16亿个。

[原文] [Andrej Karpathy]: today modern Transformers would have a lot closer to a trillion or several hundred billion probably the maximum context length here was 1,024 tokens so when we are sampling chunks of windows of tokens from the data set we're never taking more than 1,024 tokens and so when you are trying to predict the next token in a sequence you will never have more than 1,024 tokens uh kind of in your context in order to make that prediction

[译文] [Andrej Karpathy]: 今天现代的Transformer可能有接近一万亿或几千亿个参数。这里的最大上下文长度是1024个Token。所以当我们从数据集中采样Token窗口块时,我们从未取超过1024个Token。因此,当你试图预测序列中的下一个Token时,你的上下文中永远不会有超过1024个Token来辅助那个预测。

[原文] [Andrej Karpathy]: now this is also tiny by modern standards today the token uh the context lengths would be a lot closer to um a couple hundred thousand or maybe even a million and so you have a lot more context a lot more tokens in history and you can make a lot better prediction about the next token in the sequence in that way

[译文] [Andrej Karpathy]: 按现代标准来看这也非常小。今天的上下文长度会更接近几十万甚至一百万。所以你拥有更多的上下文,更多的历史Token,通过这种方式你可以对序列中的下一个Token做出更好的预测。
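"上下文长度"落到数据层面就是一个滑动窗口:训练时从 Token 流中切出长度不超过上限的片段,预测下一个Token时可见的历史也不超过这个上限。下面的示意用整数序列代替真实的 Token 流,1024 即正文中 GPT-2 的上限:

```python
# Context-length windowing: a training example is a chunk of at most
# max_context tokens cut out of the tokenized data stream.
max_context = 1024                     # GPT-2's limit per the transcript
token_stream = list(range(5000))       # stand-in for a tokenized dataset

def sample_window(stream, start, max_context):
    # Python slicing naturally caps the window at the stream's end
    return stream[start:start + max_context]

window = sample_window(token_stream, 100, max_context)
print(len(window))   # never more than 1024
```

现代模型只是把 `max_context` 提到了几十万甚至上百万,窗口机制本身不变。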

[原文] [Andrej Karpathy]: and finally GPT-2 was trained on approximately 100 billion tokens and this is also fairly small by modern standards as I mentioned the fine web data set that we looked at here has 15 trillion tokens uh so 100 billion is quite small now uh I actually tried to reproduce uh GPT-2 for fun as part of this project called llm.c so you can see my write-up of doing that in this post on GitHub under the llm.c repository

[译文] [Andrej Karpathy]: 最后,GPT-2是在大约1000亿(100 billion)个Token上训练的。按现代标准这也是相当小的。正如我提到的,我们之前看的FineWeb数据集有15万亿(15 trillion)个Token,所以1000亿是相当小的。实际上我为了好玩尝试复现了GPT-2,作为这个名为llm.c项目的一部分。你可以在GitHub上的llm.c仓库下的这篇帖子中看到我的记录。

[原文] [Andrej Karpathy]: so in particular the cost of training GPT-2 in 2019 was estimated to be approximately $40,000 but today you can do significantly better than that and in particular here it took about one day and about $600 uh but this wasn't even trying too hard I think you could really bring this down to about $100 today

[译文] [Andrej Karpathy]: 特别值得一提的是,2019年训练GPT-2的成本估计约为4万美元。但今天你可以做得比这好得多,具体来说,在这里它花了大约一天时间和大约600美元。但这甚至还没怎么费力优化,我认为今天你真的可以把这个成本降到大约100美元。

[原文] [Andrej Karpathy]: now why is it that the costs have come down so much well number one these data sets have gotten a lot better and the way we filter them extract them and prepare them has gotten a lot more refined and so the data set is of just a lot higher quality so that's one thing but really the biggest difference is that our computers have gotten much faster in terms of the hardware and we're going to look at that in a second and also the software for uh running these models and really squeezing out all the speed from the hardware as is possible uh that software has also gotten much better as everyone has focused on these models and tried to run them very very quickly

[译文] [Andrej Karpathy]: 为什么成本下降了这么多?第一,这些数据集变得好多了,我们过滤、提取和准备它们的方式也变得更加精细,所以数据集的质量更高了,这是一方面。但真正最大的区别在于我们的计算机在硬件方面变得更快了,我们马上会看这一点;而且运行这些模型的软件,以及真正从硬件中榨取所有可能速度的软件也变得更好了,因为每个人都专注于这些模型并试图非常快速地运行它们。

[原文] [Andrej Karpathy]: now I'm not going to be able to go into the full detail of this GPT-2 reproduction and this is a long technical post but I would like to still give you an intuitive sense for what it looks like to actually train one of these models as a researcher like what are you looking at and what does it look like what does it feel like so let me give you a sense of that a little bit

[译文] [Andrej Karpathy]: 我无法详细介绍这个GPT-2复现的全部细节,这是一篇很长的技术帖子,但我还是想给你们一种直观的感觉,作为一个研究人员,实际训练这些模型是什么样子的?你在看什么?它看起来像什么?感觉如何?让我稍微给你们一点那种感觉。

[原文] [Andrej Karpathy]: okay so this is what it looks like let me slide this over so what I'm doing here is I'm training a GPT-2 model right now and um what's happening here is that every single line here like this one is one update to the model so remember how here we are um basically making the prediction better for every one of these tokens and we are updating these weights or parameters of the neural net so here every single line is one update to the neural network where we change its parameters by a little bit so that it is better at predicting the next token in the sequence

[译文] [Andrej Karpathy]: 好的,这就是它的样子,让我把它滑过来。我正在这里训练一个GPT-2模型。这里发生的是,每一行,像这一行,是对模型的一次更新。记得我们要让每一个Token的预测变得更好,并且我们在更新神经网络的这些权重或参数。所以这里的每一行都是对神经网络的一次更新,我们稍微改变它的参数,使其更擅长预测序列中的下一个Token。

[原文] [Andrej Karpathy]: in particular every single line here is improving the prediction on 1 million tokens in the training set so we've basically taken 1 million tokens out of this data set and we've tried to improve the prediction of that token as coming next in a sequence on all 1 million of them simultaneously and at every single one of these steps we are making an update to the network for that

[译文] [Andrej Karpathy]: 具体来说,这里的每一行都在改进训练集中100万个Token的预测。所以我们基本上从数据集中取出了100万个Token,并且我们试图同时改进这所有100万个Token作为序列下一个出现的预测。在每一步中,我们都在为此对网络进行更新。

[原文] [Andrej Karpathy]: now the number to watch closely is this number called loss and the loss is a single number that is telling you how well your neural network is performing right now and it is created so that low loss is good so you'll see that the loss is decreasing as we make more updates to the neural net which corresponds to making better predictions on the next token in a sequence

[译文] [Andrej Karpathy]: 现在要密切关注的数字是这个叫做“损失(loss)”的数字。损失是一个单一的数字,它告诉你你的神经网络目前表现如何。它的设定是低损失是好的。所以你会看到随着我们对神经网络进行更多更新,损失在减少,这对应于对序列中下一个Token做出更好的预测。
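原文没有点名损失函数的具体形式;下一个Token预测的标准选择是交叉熵(cross-entropy),即模型分配给正确下一个Token的概率取负对数后求平均。概率越接近 1,损失越接近 0,所以"低损失是好的"。下面是一个示意(概率数值为假设):

```python
import math

# Cross-entropy over the probabilities a model assigned to the
# correct next tokens: average negative log probability, lower is better.
def cross_entropy(probs_for_correct_tokens):
    return -sum(math.log(p) for p in probs_for_correct_tokens) / len(probs_for_correct_tokens)

# a poorly trained model puts low probability on the right tokens...
print(cross_entropy([0.01, 0.02, 0.05]))   # high loss
# ...a better trained model puts high probability on them
print(cross_entropy([0.6, 0.7, 0.9]))      # low loss
```

当模型对每个正确Token都给出概率 1.0 时,损失恰好为 0,这就是训练所追求的方向。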

[原文] [Andrej Karpathy]: and so the loss is the number that you are watching as a neural network researcher and you are kind of waiting you're twiddling your thumbs uh you're drinking coffee and you're making sure that this looks good so that with every update your loss is improving and the network is getting better at prediction

[译文] [Andrej Karpathy]: 所以损失是你作为神经网络研究人员所关注的数字。你在等待,转着拇指玩,喝着咖啡,确保这个看起来不错,确保随着每一次更新你的损失都在改善,网络在预测方面变得更好。
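"每打印一行就是一次更新、损失随更新逐步下降"的过程,可以用一个单参数的梯度下降玩具来模拟。这里的损失函数(平方误差)、学习率和目标值都是演示用的假设,与真实训练的规模无关,但打印出的日志形态与正文描述的训练界面是同一种东西:

```python
# A toy version of the training log: every printed line is one update
# that nudges a single parameter w downhill on a loss.
w = 0.0
target = 3.0
lr = 0.3                               # learning rate (assumed)
for step in range(1, 9):
    loss = (w - target) ** 2           # squared-error loss for the toy
    grad = 2 * (w - target)            # its gradient
    w -= lr * grad                     # one update to the parameter
    print(f"step {step} | loss {loss:.4f}")
```

每一行的 loss 单调下降,w 很快收敛到目标值;真实训练只是对数十亿个参数、每步上百万个Token做同样的事。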

[原文] [Andrej Karpathy]: now here you see that we are processing 1 million tokens per update each update takes about 7 seconds roughly and here we are going to process a total of 32,000 steps of optimization so 32,000 steps with 1 million tokens each is about 33 billion tokens that we are going to process and we're currently only at about step 420 out of 32,000 so we are still only a bit more than 1% done because I've only been running this for 10 or 15 minutes or something like that

[译文] [Andrej Karpathy]: 这里你可以看到我们每次更新处理100万个Token,每次更新大约需要7秒钟。这里我们将处理总共32,000步优化。所以32,000步,每步100万个Token,这就是我们要处理的大约330亿个Token。目前我们只进行到420步(共32,000步),所以我们只完成了1%多一点,因为我只运行了大概10或15分钟。

[原文] [Andrej Karpathy]: now every 20 steps I have configured this optimization to do inference so what you're seeing here is the model is predicting the next token in a sequence and so you sort of start it randomly and then you continue plugging in the tokens so we're running this inference step and this is the model sort of predicting the next token in the sequence and every time you see something appear that's a new token

[译文] [Andrej Karpathy]: 每隔20步,我配置这个优化过程进行一次推理(inference)。所以你在这里看到的是模型在预测序列中的下一个Token。你某种程度上随机启动它,然后继续插入Token。所以我们在运行这个推理步骤,这是模型在预测序列中的下一个Token,每次你看到有东西出现,那就是一个新的Token。

[原文] [Andrej Karpathy]: um so let's just look at this and you can see that this is not yet very coherent and keep in mind that this is only 1% of the way through training and so the model is not yet very good at predicting the next token in the sequence so what comes out is actually kind of a little bit of gibberish right but it still has a little bit of like local coherence so since she is mine it's a part of the information should discuss my father great companions Gordon showed me sitting over at and Etc so I know it doesn't look very good but let's actually scroll up and see what it looked like when I started the optimization so all the way here at step one

[译文] [Andrej Karpathy]: 让我们看看这个,你会发现这还不是很连贯。请记住这只是训练过程的1%,所以模型还不怎么擅长预测序列中的下一个Token。所以出来的东西实际上有点像乱语(gibberish),对吧?但它仍然有一点局部连贯性。比如“since she is mine it's a part of the information should discuss my father great companions Gordon showed me sitting over at”等等。我知道这看起来不太好,但让我们向上滚动,看看我刚开始优化时是什么样子的,一直回到第一步。

[原文] [Andrej Karpathy]: so after 20 steps of optimization you see that what we're getting here looks completely random and of course that's because the model has only had 20 updates to its parameters and so it's giving you random text because it's a random network and so you can see that at least in comparison to this the model is starting to do much better and indeed if we waited the entire 32,000 steps the model will have improved to the point that it's actually uh generating fairly coherent English and the tokens stream correctly and they kind of make up English a lot better

[译文] [Andrej Karpathy]: 在20步优化之后,你会看到我们得到的东西看起来完全是随机的,当然那是因为模型只对参数进行了20次更新,所以它给你的是随机文本,因为它是一个随机网络。你可以看到至少相比之下,模型开始做得好多了。确实,如果我们等待整个32,000步完成,模型将改进到实际上生成相当连贯的英语,Token流也是正确的,它们组成的英语要好得多。

[原文] [Andrej Karpathy]: um and so uh at this stage we just make sure that the loss is decreasing everything is looking good um and we just have to wait and now um let me turn now to the um story of the computation that's required because of course I'm not running this optimization on my laptop that would be way too expensive uh because we have to run this neural network and we have to improve it and we need all this data and so on so you can't run this too well on your computer uh because the network is just too large uh so all of this is running on a computer that is out there in the cloud and I want to basically address the compute side of the story of training these models and what that looks like so let's take a look

[译文] [Andrej Karpathy]: 在这个阶段,我们只确保损失在减少,一切看起来都很好,我们只需要等待。现在让我转向所需的计算(computation)的故事。因为当然我不是在我的笔记本电脑上运行这个优化,那太昂贵了(指计算开销大),因为我们必须运行这个神经网络,我们必须改进它,我们需要所有这些数据等等。所以在你的电脑上运行这个效果不会太好,因为网络太大了。所有这些都是在云端的计算机上运行的。我基本上想谈谈训练这些模型的计算方面及其样貌。让我们来看一下。

[原文] [Andrej Karpathy]: okay so the computer that I'm running this optimization on is this 8x H100 node so there are eight H100s in a single node or a single computer now I am renting this computer and it is somewhere in the cloud I'm not sure where it is physically actually the place I like to rent from is called Lambda but there are many other companies who provide this service so when you scroll down you can see that uh they have some on demand pricing for um sort of computers that have these uh H100s which are gpus and I'm going to show you what they look like in a second but on demand an 8x Nvidia H100 GPU machine comes for $3 per GPU per hour for example

[译文] [Andrej Karpathy]: 好的,所以我运行这个优化的计算机是这个8x H100节点。所以在一个节点或一台计算机里有8个H100。我是租用这台计算机的,它在云端的某个地方,实际上我不确定它的物理位置在哪里。我喜欢租用的地方叫Lambda,但还有很多其他公司提供这种服务。当你向下滑动时,你可以看到他们有一些针对拥有这些H100(即GPU)的计算机的按需定价。我马上会给你们看它们长什么样。例如,按需租用一台8个Nvidia H100 GPU的机器,价格是每个GPU每小时3美元。
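按正文给出的数字可以粗略核算一下租用成本:8 块 H100、每块每小时约 3 美元、复现 GPT-2 约跑一天:

```python
# Back-of-envelope check of the transcript's numbers: an 8x H100 node
# at ~$3 per GPU per hour, running for roughly 24 hours.
gpus = 8
usd_per_gpu_hour = 3.0
hours = 24
cost = gpus * usd_per_gpu_hour * hours
print(f"${cost:.0f}")
```

结果约 576 美元,与正文所说的"大约一天、大约 600 美元"相符。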

[原文] [Andrej Karpathy]: so you can rent these and then you get a machine in a cloud and you can uh go in and you can train these models and these uh gpus they look like this so this is one h100 GPU uh this is kind of what it looks like and you slot this into your computer and gpus are this uh perfect fit for training your networks because they are very computationally expensive but they display a lot of parallelism in the computation so you can have many independent workers kind of um working all at the same time in solving uh the matrix multiplication that's under the hood of training these neural networks

[译文] [Andrej Karpathy]: 所以你可以租用这些,然后在云端得到一台机器,你可以进去训练这些模型。这些GPU看起来像这样。这就是一个H100 GPU,这就是它的样子,你把它插进你的电脑里。GPU非常适合训练你的网络,因为训练的计算开销虽然很大,但计算中表现出大量的并行性(parallelism)。你可以有许多独立的工人在同一时间工作,解决训练这些神经网络底层的矩阵乘法问题。

[原文] [Andrej Karpathy]: so this is just one of these H100s but actually you would put multiple of them together so you could stack eight of them into a single node and then you can stack multiple nodes into an entire data center or an entire system so when we look at a data center can't spell when we look at a data center we start to see things that look like this right so we have one GPU goes to eight gpus goes to a single system goes to many systems and so these are the bigger data centers and they of course would be much much more expensive

[译文] [Andrej Karpathy]: 这只是其中一个H100,但实际上你会把它们放在一起,你会把多个放在一起。你可以把8个堆叠到一个节点中,然后你可以把多个节点堆叠成整个数据中心或整个系统。当我们看一个数据中心时——我拼写错了——当我们看一个数据中心时,我们开始看到像这样的东西,对吧?我们从一个GPU到8个GPU,到一个单一系统,再到许多系统。这就是更大的数据中心,当然那里会昂贵得多。

[原文] [Andrej Karpathy]: um and what's happening is that all the big tech companies really desire these gpus so they can train all these language models because they are so powerful and that is fundamentally what has driven the stock price of Nvidia to be $3.4 trillion today as an example and why Nvidia has kind of exploded so this is the Gold Rush the Gold Rush is getting the gpus getting enough of them so they can all collaborate to perform this optimization and what are they all doing they're all collaborating to predict the next token on a data set like the fine web data set

[译文] [Andrej Karpathy]: 发生的事情是,所有的大型科技公司都非常渴望这些GPU,以便他们可以训练所有这些语言模型,因为它们太强大了。这从根本上推动了Nvidia的股价达到今天的3.4万亿美元(举例来说),这也是为什么Nvidia爆发式增长。这就是淘金热(Gold Rush)。淘金热就是获取GPU,获取足够的GPU,以便它们可以协同工作来执行这种优化。它们都在做什么?它们都在协同工作,在一个数据集(比如FineWeb数据集)上预测下一个Token。

[原文] [Andrej Karpathy]: this is the computational workflow that that basically is extremely expensive the more gpus you have the more tokens you can try to predict and improve on and you're going to process this data set faster and you can iterate faster and get a bigger Network and train a bigger Network and so on

[译文] [Andrej Karpathy]: 这就是计算工作流,基本上极其昂贵。你拥有的GPU越多,你就能尝试预测和改进越多的Token,你就能更快地处理这个数据集,更快地迭代,得到更大的网络,训练更大的网络等等。

[原文] [Andrej Karpathy]: so this is what all those machines look like and are doing and this is why all of this is such a big deal and for example this is an article from like about a month ago or so this is why it's a big deal that for example Elon Musk is getting 100,000 gpus uh in a single data center and all of these gpus are extremely expensive are going to take a ton of power and all of them are just trying to predict the next token in the sequence and improve the network uh by doing so and uh get probably a lot more coherent text than what we're seeing here a lot faster

[译文] [Andrej Karpathy]: 这就是所有这些机器在做的事情,这就是为什么所有这些都是如此重大的事情。例如,这是一篇大约一个月前的文章,这就解释了为什么埃隆·马斯克(Elon Musk)在一个单一数据中心获得10万个GPU是一件大事。所有这些GPU都极其昂贵,将消耗大量电力,而它们所有都在试图预测序列中的下一个Token,并借此改进网络,可能会比我们在这里看到的更快地获得更连贯的文本。


章节 4:基座模型:互联网文档模拟器 (Base Models: Internet Document Simulators)

📝 本节摘要

预训练阶段结束后,我们得到的产物被称为“基座模型”(Base Model)。Andrej Karpathy 在本章中演示了基座模型的本质——它并非一个问答助手,而是一个“互联网文档模拟器”或高级自动补全工具。通过Llama 3的演示,他展示了模型如何根据概率随机生成文本(Stochasticity),有时会精确背诵训练数据(如维基百科),有时则会基于统计规律“做梦”(如编造未来的选举结果)。最重要的是,本章介绍了上下文学习(In-context Learning)能力:即使不进行额外训练,仅通过精心设计的提示词(Prompting),也能引导基座模型执行翻译任务或模拟对话助手的行为。

[原文] [Andrej Karpathy]: so the model that comes out at the end here is what's called a base model what is a base model it's a token simulator right it's an internet text token simulator and so that is not by itself useful yet because what we want is what's called an assistant we want to ask questions and have it respond with answers these models won't do that they just uh create sort of remixes of the internet they dream internet pages

[译文] [Andrej Karpathy]: 所以最终在这个阶段出来的模型被称为基座模型(base model)。什么是基座模型?它是一个Token模拟器,对吧?它是一个互联网文本Token模拟器。所以这本身还不是很有用,因为我们想要的是所谓的“助手”(assistant)。我们想问问题并让它回答。这些模型不会那样做,它们只是创造某种互联网的“混音版”,它们在“做梦”生成互联网页面。

[原文] [Andrej Karpathy]: so I want to first show you that this model here is not yet an assistant so you can for example ask it what is 2 plus 2 it's not going to tell you oh it's four uh what else can I help you with it's not going to do that because what is 2 plus 2 is going to be tokenized and then those tokens just act as a prefix and then what the model is going to do now is just going to get the probability for the next token and it's just a glorified autocomplete it's a very very expensive autocomplete of what comes next um depending on the statistics of what it saw in its training documents which are basically web pages

[译文] [Andrej Karpathy]: 所以我首先想向你们展示,这里的这个模型还不是一个助手。例如你可以问它“2加2等于几”,它不会告诉你“哦,是4,还有什么可以帮你的吗”。它不会那样做,因为“what is 2 plus 2”会被分词,然后这些Token仅仅充当一个前缀。模型现在要做的只是获取下一个Token的概率。它只是一个被美化了的自动补全(glorified autocomplete),一个非常非常昂贵的自动补全,预测接下来会出现什么,这取决于它在训练文档(基本上就是网页)中看到的统计数据。

[原文] [Andrej Karpathy]: notice one more thing that I want to stress is that the system uh I think every time you put it in it just kind of starts from scratch so the system here is stochastic so for the same prefix of tokens we're always getting a different answer and the reason for that is that we get this probability distribution and we sample from it and we always get different samples and we sort of always go into a different territory uh afterwards

[译文] [Andrej Karpathy]: 注意还有一件事我想强调,就是这个系统……我想每次你输入时它都是从头开始的。所以这个系统是随机的(stochastic)。对于相同的Token前缀,我们总是得到不同的答案。其原因是我们得到了这个概率分布,我们从中采样,我们总是得到不同的样本,所以之后我们总是进入不同的领域。

[原文] [Andrej Karpathy]: you can think of um these 405 billion parameters as a kind of compression of the internet you can think of the 405 billion parameters as kind of like a zip file uh but it's not a lossless compression it's a lossy compression we're kind of like left with kind of a gestalt of the internet and we can generate from it right now we can elicit some of this knowledge by prompting the base model uh accordingly

[译文] [Andrej Karpathy]: 你可以把这4050亿个参数看作是互联网的一种压缩。你可以把这4050亿个参数想成是一个Zip文件,但这不是一种无损压缩(lossless compression),这是一种有损压缩(lossy compression)。我们剩下的就像是互联网的一种“格式塔”(gestalt,整体印象),我们可以从中生成内容。现在我们可以通过适当地提示基座模型来诱导(elicit)出其中的一些知识。

[原文] [Andrej Karpathy]: I went to the Wikipedia page for zebra and let me just copy paste the first uh even one sentence here and let me put it here now when I click enter what kind of uh completion are we going to get so let me just hit enter there are three living species etc etc what the model is producing here is an exact regurgitation of this Wikipedia entry it is reciting this Wikipedia entry purely from memory and this memory is stored in its parameters

[译文] [Andrej Karpathy]: 我去了斑马的维基百科页面,让我复制粘贴第一句甚至就一句话到这里。现在当我点击回车,我们会得到什么样的补全?让我按回车。“现存三个物种等等等等”。模型在这里产生的是对这个维基百科条目的精确反刍(regurgitation)。它纯粹凭记忆背诵这个维基百科条目,而这个记忆存储在它的参数中。

[原文] [Andrej Karpathy]: the next thing I want to show you is something that the model has definitely not seen during its training... certainly it has not seen anything about the 2024 election and how it turned out now if we Prime the model with the tokens from the future it will continue the token sequence and it will just take its best guess according to the knowledge that it has in its own parameters... so here it thinks that Mike Pence was the running mate instead of JD Vance and the ticket was against Hillary Clinton and Tim Kaine so this is kind of an interesting parallel universe potentially of what could have happened according to the LLM

[译文] [Andrej Karpathy]: 接下来我想给你们看的是模型在训练期间绝对没见过的东西……它肯定没见过任何关于2024年大选以及结果的内容。现在如果我们用来自未来的Token来引导模型,它将继续Token序列,并且它只会根据它参数中拥有的知识进行最佳猜测……所以在这里它认为竞选搭档是迈克·彭斯(Mike Pence)而不是JD万斯(JD Vance),而且对手是希拉里·克林顿和蒂姆·凯恩。这就像是一个有趣的平行宇宙,是根据语言模型认为可能发生的事情。

[原文] [Andrej Karpathy]: here's something that we would call a few shot prompt so what it is here is that I have 10 words or 10 pairs and each pair is a word in English a colon and then the translation in Korean and we have 10 of them and what the model does here is at the end we have teacher colon and then here's where we're going to do a completion of say just five tokens and these models have what we call in context learning abilities and what that's referring to is that as it is reading this context it is learning sort of in place that there's some kind of an algorithmic pattern going on in my data and it knows to continue that pattern

[译文] [Andrej Karpathy]: 这里有一种我们称之为“少样本提示”(few-shot prompt)的东西。这里我有10个单词或10对,每一对是一个英语单词,冒号,然后是韩语翻译。我们有10个这样的例子。模型在这里做的是,最后我们写“Teacher: ”,然后这就是我们要进行比如5个Token补全的地方。这些模型拥有我们所谓的上下文学习(in-context learning)能力。这指的是,当它阅读这个上下文时,它某种程度上是在“就地”学习,意识到“我的数据中正在进行某种算法模式”,并且它知道要继续这种模式。
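
上面描述的少样本提示可以用几行代码构造出来。下面是一个极简的示意(其中的单词对是为演示而虚构的,并非视频中的原始数据):

```python
# Build a few-shot prompt: English word, colon, Korean translation,
# ending with an incomplete pair for the base model to continue.
# The word pairs below are illustrative placeholders.
pairs = [
    ("apple", "사과"),
    ("house", "집"),
    ("water", "물"),
]

prompt = "\n".join(f"{en}: {ko}" for en, ko in pairs)
prompt += "\nteacher:"  # the model should continue with the Korean for "teacher"

print(prompt)
```

基座模型读到这个上下文后,会"就地"识别出"英文: 韩文"的模式,并沿着该模式补全最后一行。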

[原文] [Andrej Karpathy]: and finally I want to show you that there is a clever way to actually instantiate a whole language model assistant just by prompting and the trick to it is that we structure a prompt to look like a web page that is a conversation between a helpful AI assistant and a human and then the model will continue that conversation... so here's a few turns of human assistant human assistant and we have uh you know a few turns of conversation and then here at the end is where we're going to be putting the actual query that we like... you see that the base model is just continuing the sequence but because the sequence looks like this conversation it takes on that role

[译文] [Andrej Karpathy]: 最后我想向你们展示,有一种聪明的方法,仅仅通过提示就能实际上实例化一个完整的语言模型助手。其中的诀窍在于我们构建一个提示,使其看起来像一个网页,内容是一个乐于助人的AI助手和人类之间的对话,然后模型就会继续那个对话……所以这里有几轮“人类、助手、人类、助手”,我们有几轮对话,然后在最后我们放入我们实际想要的查询……你会看到基座模型只是在继续这个序列,但因为这个序列看起来像这种对话,它就扮演了那个角色。
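
这种"仅靠提示实例化助手"的技巧可以示意如下(对话轮次的内容与格式均为假设的示例,并非视频原文):

```python
# Wrap the user's query inside a fake human/assistant transcript so that a
# base model, which only continues text, takes on the assistant role.
FEW_SHOT_DIALOGUE = (
    "Human: What is the capital of France?\n"
    "Assistant: The capital of France is Paris.\n"
    "Human: How many legs does a spider have?\n"
    "Assistant: A spider has eight legs.\n"
)

def make_assistant_prompt(query: str) -> str:
    # The base model will keep continuing the "page" right after "Assistant:".
    return FEW_SHOT_DIALOGUE + f"Human: {query}\nAssistant:"

print(make_assistant_prompt("Why is the sky blue?"))
```

基座模型只是在续写这段文本,但因为文本看起来像一段对话,它就顺势扮演了助手的角色。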


章节 5:后训练阶段:监督微调 (Post-training: Supervised Fine-Tuning)

📝 本节摘要

在本章中,Andrej Karpathy 介绍了模型训练的第二阶段——后训练(Post-training)。基座模型虽然强大,但只是一个“互联网文档模拟器”,无法直接回答问题。为了将其转化为有用的“助手(Assistant)”,我们需要进行监督微调(Supervised Fine-Tuning, SFT)。这一过程涉及构建由人类标注员编写的对话数据集(包含提问与理想回答),并引入特殊的格式(如<|im_start|>等特殊Token)将对话转化为模型可处理的序列。Karpathy 强调,当我们与ChatGPT对话时,实际上是在与一个“人类标注员的统计模拟”进行交流,其行为准则(如乐于助人、诚实、无害)均源自标注说明文档。

[原文] [Andrej Karpathy]: so this is the kind of brief summary of the things we talked about over the last few minutes now let me zoom out here and this is kind of like what we've talked about so far we wish to train LM assistants like chpt we've discussed the first stage of that which is the pre-training stage and we saw that really what it comes down to is we take Internet documents we break them up into these tokens these atoms of little text chunks and then we predict token sequences using neural networks the output of this entire stage is this base model it is the setting of The parameters of this network and this base model is basically an internet document simulator on the token level so it can just uh it can generate token sequences that have the same kind of like statistics as Internet documents and we saw that we can use it in some applications but we actually need to do better we want an assistant we want to be able to ask questions and we want the model to give us answers and so we need to now go into the second stage which is called the post-training stage

[译文] [Andrej Karpathy]: 这就是我们在过去几分钟里谈论内容的简要总结。现在让我把视角拉远一点,这就是我们目前所讨论的。我们希望训练像ChatGPT这样的语言模型助手。我们已经讨论了第一阶段,即预训练阶段(pre-training stage)。我们看到归根结底,我们获取互联网文档,将它们分解为Token(这些小文本块的原子),然后使用神经网络预测Token序列。整个这一阶段的输出就是这个基座模型(base model),它是这个网络参数的设置。这个基座模型基本上是Token层面的互联网文档模拟器,它可以生成与互联网文档具有相同统计特征的Token序列。我们看到我们可以在某些应用中使用它,但我们实际上需要做得更好。我们需要一个助手,我们希望能够提问,并希望模型给我们答案。所以我们现在需要进入第二阶段,这被称为后训练阶段(post-training stage)。

[原文] [Andrej Karpathy]: so we take our base model our internet document simulator and hand it off to post training so we're now going to discuss a few ways to do what's called post training of these models these stages in post training are going to be computationally much less expensive most of the computational work all of the massive data centers um and all of the sort of heavy compute and millions of dollars are the pre-training stage but now we go into the slightly cheaper but still extremely important stage called post training where we turn this LLM model into an assistant so let's take a look at how we can get our model to not sample internet documents but to give answers to questions so in other words what we want to do is we want to start thinking about conversations and these are conversations that can be multi-turn so uh there can be multiple turns and they are in the simplest case a conversation between a human and an assistant

[译文] [Andrej Karpathy]: 所以我们拿着我们的基座模型——那个互联网文档模拟器——将它移交给后训练。我们现在要讨论几种对这些模型进行所谓后训练的方法。后训练中的这些阶段在计算成本上要低得多。大部分的计算工作、所有庞大的数据中心、所有那些繁重的计算和数百万美元的投入都发生在预训练阶段。但现在我们要进入稍微便宜一点但仍然极其重要的阶段,叫做后训练,在这里我们将这个大语言模型转变为一个助手。让我们看看如何让我们的模型不再采样互联网文档,而是给出问题的答案。换句话说,我们要做的就是开始思考对话(conversations),这些是可以多轮进行的对话,所以在最简单的情况下,这是人类和助手之间的对话。

[原文] [Andrej Karpathy]: and so for example we can imagine the conversation could look something like this when a human says what is 2 plus 2 the assistant should respond with something like 2 plus 2 is 4 when a human follows up and says what if it was star instead of a plus the assistant could respond with something like this um and similar here this is another example showing that the assistant could also have some kind of a personality here uh that it's kind of like nice and then here in the third example I'm showing that when a human is asking for something that we uh don't wish to help with we can produce what's called refusal we can say that we cannot help with that so in other words what we want to do now is we want to think through how an assistant should interact with the human and we want to program the assistant and its behavior in these conversations

[译文] [Andrej Karpathy]: 例如,我们可以想象对话看起来像这样:当人类说“2加2等于几”时,助手应该回答类似“2加2等于4”。当人类追问说“如果是星号而不是加号呢”,助手可以做出类似这样的回答。这里还有另一个例子,显示助手也可以有某种个性,比如它很友善。然后在第三个例子中,我展示了当人类要求一些我们不想提供帮助的事情时,我们可以产生所谓的拒绝(refusal),我们可以说我们无法对此提供帮助。换句话说,我们要做的就是想清楚系统应该如何与人类互动,我们要对助手及其在这些对话中的行为进行编程。

[原文] [Andrej Karpathy]: now because this is neural networks we're not going to be programming these explicitly in code we're not going to be able to program the assistant in that way because this is neural networks everything is done through neural network training on data sets and so because of that we are going to be implicitly programming the assistant by creating data sets of conversations so these are three independent examples of conversations in a data set an actual data set as I'm going to show you will be much larger it could have hundreds of thousands of conversations that are multi-turn very long etc and would cover a diverse breadth of topics but here I'm only showing three examples but the way this works basically is the assistant is being programmed by example and where is this data coming from like 2 times 2 equals 4 same as 2 plus 2 etc where does that come from this comes from human labelers

[译文] [Andrej Karpathy]: 现在,因为这是神经网络,我们不会用代码显式地对它们进行编程。我们无法那样去编程助手,因为这是神经网络,一切都是通过在数据集上训练神经网络来完成的。正因如此,我们将通过创建对话数据集来隐式地编程助手。这是数据集中三个独立的对话示例。在一个实际的数据集中——我将展示给你们看——规模会大得多,可能包含数十万个对话,是多轮的、很长的等等,并且会涵盖各种各样的话题,但我这里只展示三个例子。但这基本上是通过示例来编程助手。那么这些数据从哪里来?比如“2乘2等于4,和2加2一样”等等,这些是从哪里来的?这来自人类标注员(Human Labelers)。

[原文] [Andrej Karpathy]: so we will basically give human labelers some conversational context and we will ask them to um basically give the ideal assistant response in this situation and a human will write out the ideal response for an assistant in any situation and then we're going to get the model to basically train on this and to imitate those kinds of responses so the way this works then is we are going to take our base model which we produced in the pre-training stage and this base model was trained on internet documents we're now going to take that data set of internet documents and we're gonna throw it out and we're going to substitute a new data set and that's going to be a data set of conversations and we're going to continue training the model on these conversations on this new data set of conversations

[译文] [Andrej Karpathy]: 所以我们基本上会给人类标注员一些对话上下文,我们会要求他们基本上给出这种情况下理想的助手回复。人类会写出助手在任何情况下的理想回复,然后我们会让模型基本上在这些数据上进行训练,并模仿那些类型的回复。所以工作原理就是,我们拿来在预训练阶段生成的基座模型,这个基座模型是在互联网文档上训练的。我们现在要把那个互联网文档数据集扔掉,我们要替换一个新的数据集,那就是一个对话数据集。我们将继续在这些对话上,在这个新的对话数据集上训练模型。
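
"换掉数据集、继续同样的训练"这一点可以用一个玩具示例说明:监督微调仍然是逐位置的"预测下一个Token",只是Token流现在编码的是对话而非互联网文档(下面的Token ID纯属虚构,仅作示意):

```python
# Toy illustration: SFT keeps the exact pre-training objective
# (next-token prediction); only the token stream changes to conversations.
# All token IDs below are made up, e.g. 101/102 bracket a user turn,
# 201/202 bracket an assistant turn.
conversation_tokens = [101, 7, 8, 9, 102, 201, 4, 5, 6, 202]

# Every position predicts the next token, exactly as in pre-training:
training_pairs = [
    (conversation_tokens[: i + 1], conversation_tokens[i + 1])
    for i in range(len(conversation_tokens) - 1)
]
# A real trainer would call something like model.train_step(context, target)
# on each pair; the model itself is not shown here.
print(training_pairs[0])  # → ([101], 7)
```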

[原文] [Andrej Karpathy]: and what happens is that the model will very rapidly adjust and will sort of like learn the statistics of how this assistant responds to human queries and then later during inference we'll be able to basically um Prime the assistant and get the response and it will be imitating what the human labelers would do in that situation if that makes sense so we're going to see examples of that and this is going to become bit more concrete I also wanted to mention that this post-training stage we're going to basically just continue training the model but um the pre-training stage can in practice take roughly three months of training on many thousands of computers the post-training stage will typically be much shorter like 3 hours for example um and that's because the data set of conversations that we're going to create here manually is much much smaller than the data set of text on the internet

[译文] [Andrej Karpathy]: 接下来发生的是,模型会非常迅速地调整,并在某种程度上学习这个助手如何回应人类查询的统计数据。然后在之后的推理过程中,我们基本上能够引导助手并获得回应,它将会模仿人类标注员在那种情况下会做的事情,如果这讲得通的话。我们会看到相关的例子,这会变得更具体一点。我还想提一下,在这个后训练阶段,我们基本上只是继续训练模型。但是,预训练阶段在实践中可能需要在数千台计算机上训练大约三个月,而后训练阶段通常会短得多,比如3个小时。这是因为我们要在这里手动创建的对话数据集比互联网上的文本数据集要小得多。

[原文] [Andrej Karpathy]: so let's start by talking about the tokenization of conversations everything in these models has to be turned into tokens because everything is just about token sequences so how do we turn conversations into token sequences is the question and so for that we need to design some kind of an encoding... and so I want to show you now how I would recreate uh this conversation in the token space so if you go to Tiktokenizer I can take that conversation and this is how it is represented uh for the language model... and all the different llms will have a slightly different format or protocols and it's a little bit of a wild west right now but for example GPT-4o does it in the following way you have this special token called im_start and this is short for imaginary monologue uh the start then you have to specify um I don't actually know why it's called that to be honest then you have to specify whose turn it is so for example user which is a token 428 then you have internal monologue separator and then it's the exact question so the tokens of the question and then you have to close it so im_end the end of the imaginary monologue

[译文] [Andrej Karpathy]: 让我们首先谈谈对话的Token化(tokenization of conversations)。这些模型中的一切都必须转化为Token,因为一切都只是关于Token序列。所以问题是如何将对话转化为Token序列?为此我们需要设计某种编码……现在我想向你们展示我是如何在Token空间中重建这个对话的。如果你去Tiktokenizer,我可以把那个对话拿来,这就是它在语言模型中的表示方式……所有不同的大语言模型会有稍微不同的格式或协议,现在有点像狂野西部,但例如GPT-4o是这样做的:你有一个叫做`im_start`的特殊Token,这是“Imaginary Monologue”(想象独白)的缩写——开始。然后你必须指定——说实话我实际上不知道为什么叫那个——然后你必须指定是谁的回合,例如用户(User),这是Token 428。然后你有内部独白分隔符(separator),然后是确切的问题,即问题的Token。然后你必须关闭它,用`im_end`,即想象独白的结束。
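
上述GPT-4o风格的对话编码可以示意如下(特殊Token的具体写法因模型而异,这里用普通字符串占位,仅作说明):

```python
# Render one conversation turn in an im_start / im_sep / im_end style layout.
# These strings stand in for real special tokens; exact spellings vary by model.
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"

def render_turn(role: str, text: str) -> str:
    return f"{IM_START}{role}{IM_SEP}{text}{IM_END}"

# A two-turn conversation becomes one flat, one-dimensional token sequence:
convo = render_turn("user", "what is 2+2?") + render_turn("assistant", "2+2 = 4")
print(convo)
```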

[原文] [Andrej Karpathy]: so basically the question from a user of what is 2 plus 2 ends up being the token sequence of these tokens and now the important thing to mention here is that im_start this is not text right im_start is a special token that gets added it's a new token and um this token has never been trained on so far it is a new token that we create in the post-training stage and we introduce and so these special tokens like im_sep im_start etc are introduced and interspersed with text so that they sort of um get the model to learn that hey this is the start of a turn whose turn is it the start of the turn is for the user and then this is what the user says and then the user ends and then it's a new start of a turn and it is by the assistant and then what does the assistant say well these are the tokens of what the assistant says etc

[译文] [Andrej Karpathy]: 所以基本上,用户提出的“2加2等于几”的问题最终变成了这些Token的序列。这里重要的一点是,`im_start`不是文本,对吧?`im_start`是一个被添加的特殊Token,它是一个新Token。这个Token目前为止从未被训练过,它是我们在后训练阶段创建并引入的一个新Token。所以这些像`im_sep`、`im_start`等的特殊Token被引入并散布在文本中,以便让模型学会:嘿,这是一个回合的开始。是谁的回合?是用户的回合。然后这是用户说的内容,然后用户结束了。然后这是一个新回合的开始,是助手的回合,然后助手说了什么?这些就是助手所说内容的Token等等。

[原文] [Andrej Karpathy]: and then what does it look like at test time during inference so say we've trained a model on these kinds of data sets of conversations and now we want to inference so during inference what does this look like when you're on ChatGPT well you come to ChatGPT and you have say like a dialogue with it and the way this works is basically um say that this was already filled in so like what is 2 plus 2 2 plus 2 is four and now you issue what if it was times im_end and what basically ends up happening um on the servers of OpenAI or something like that is they put in im_start assistant im_sep and this is where they end it right here so they construct this context and now they start sampling from the model so it's at this stage that they will go to the model and say okay what is a good sequence what is a good first token what is a good second token what is a good third token and this is where the LLM takes over and creates a response

[译文] [Andrej Karpathy]: 那么在测试时,在推理过程中这看起来像什么?假设我们已经在这些对话数据集上训练了一个模型,现在我们想进行推理。在推理过程中,当你在ChatGPT上时,这看起来像什么?你来到ChatGPT,比如你和它进行对话。其工作原理基本上是,假设这里已经填好了,比如“2加2等于几?2加2等于4”,现在你发布“如果是乘法呢”,然后`im_end`。在OpenAI的服务器上基本上发生的是,他们放入`im_start assistant im_sep`,他们就停在这里。所以他们构建了这个上下文,现在他们开始从模型中采样。就在这个阶段,他们会去问模型:好的,什么是好的序列?什么是好的第一个Token?什么是好的第二个Token?什么是好的第三个Token?这就是语言模型接管并创建回复的地方。
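
推理时服务器端"补上助手回合的开头、停在那里、开始采样直到结束Token"的流程可以示意如下(`sample_next_token` 是替代真实模型的桩函数,协议细节为假设):

```python
# At inference time the server appends the header of a fresh assistant turn
# and samples tokens from the model until the end-of-turn token appears.
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"

def complete(history: str, sample_next_token) -> str:
    context = history + f"{IM_START}assistant{IM_SEP}"  # priming ends right here
    answer = []
    while True:
        tok = sample_next_token(context + "".join(answer))
        if tok == IM_END:
            break
        answer.append(tok)
    return "".join(answer)

# Canned "model" that emits a fixed reply, just to exercise the loop:
canned = iter(["2 times 2", " is ", "4", IM_END])
print(complete(f"{IM_START}user{IM_SEP}what if it was times?{IM_END}",
               lambda ctx: next(canned)))  # → 2 times 2 is 4
```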

[原文] [Andrej Karpathy]: so that's roughly how the protocol works although the details of this protocol are not important so again my goal is just to show you that everything ends up being just a one-dimensional token sequence so we can apply everything we've already seen but we're now training on conversations and we're now uh basically generating conversations as well okay so now I would like to turn to what these data sets look like in practice the first paper that I would like to show you and the first effort in this direction is this paper from OpenAI in 2022 and this paper was called InstructGPT or the technique that they developed and this was the first time that OpenAI has kind of talked about how you can take language models and fine-tune them on conversations

[译文] [Andrej Karpathy]: 这大概就是协议的工作原理,尽管该协议的细节并不重要。再次强调,我的目标只是向你们展示一切最终都只是一个一维的Token序列。所以我们可以应用我们已经看到的所有东西,但我们现在是在对话上进行训练,并且我们现在基本上也是在生成对话。好的,现在我想谈谈这些数据集在实践中是什么样子的。我想向你们展示的第一篇论文,也是在这个方向上的第一次努力,是OpenAI在2022年发表的这篇论文,这篇论文叫InstructGPT,或者说是他们开发的技术。这是OpenAI第一次某种程度上谈论如何获取语言模型并在对话上对其进行微调。

[原文] [Andrej Karpathy]: so the first stop I would like to make is in section 3.4 where they talk about the human contractors that they hired uh in this case from Upwork or through Scale AI to uh construct these conversations and so there are human labelers involved whose job it is professionally to create these conversations and these labelers are asked to come up with prompts and then they are asked to also complete the ideal assistant responses... so here for example is an excerpt uh of these kinds of labeling instructions on a high level you're asking people to be helpful truthful and harmless... roughly speaking the company comes up with the labeling instructions usually they are not this short usually there are hundreds of pages and people have to study them professionally and then they write out the ideal assistant responses uh following those labeling instructions so this is a very human heavy process as it was described in this paper

[译文] [Andrej Karpathy]: 我想停下来的第一站是第3.4节,他们谈到了他们雇佣的人类承包商(Human Contractors),在这个案例中是来自Upwork或通过Scale AI,来构建这些对话。所以有人类标注员参与其中,他们的职业工作就是创建这些对话。这些标注员被要求想出提示(Prompts),然后他们也被要求完成理想的助手回复……例如这里是这类标注说明文档的摘录,在高层次上,你要求人们做到乐于助人(Helpful)、诚实(Truthful)和无害(Harmless)……粗略地说,公司制定标注说明,通常它们没这么短,通常有几百页,人们必须专业地学习它们,然后他们遵循这些标注说明写出理想的助手回复。所以正如这篇论文所描述的,这是一个非常依赖人力的过程。

[原文] [Andrej Karpathy]: I want to show you that the state-of-the-art has kind of advanced in the last 2 or 3 years... so in particular it's not very common for humans to be doing all the heavy lifting just by themselves anymore and that's because we now have language models and these language models are helping us create these data sets and conversations so it is very rare that the people will like literally just write out the response from scratch it is a lot more likely that they will use an existing llm to basically like uh come up with an answer and then they will edit it or things like that so there's many different ways in which now llms have started to kind of permeate this post-training stack and llms are basically used pervasively to help create these massive data sets of conversations

[译文] [Andrej Karpathy]: 我想向你们展示,在过去的两三年里,最先进的技术已经有所进步……特别是不再常见由人类独自承担所有繁重工作了,那是因为我们现在有了语言模型,这些语言模型正在帮助我们创建这些数据集和对话。所以现在很少有人真的从零开始写出回复,更有可能的是,他们会使用一个现有的大语言模型基本上生成一个答案,然后他们会编辑它或做类似的事情。所以现在有很多不同的方式,大语言模型已经开始渗透到这个后训练技术栈中,大语言模型基本上被普遍用于帮助创建这些海量的对话数据集。

[原文] [Andrej Karpathy]: I guess like the last thing to note is that I want to dispel a little bit of the magic of talking to an AI like when you go to chat GPT and you give it a question and then you hit enter uh what is coming back is kind of like statistically aligned with what's happening in the training set and these training sets I mean they really just have a seed in humans following labeling instructions so what are you actually talking to in chat GPT or how should you think about it well it's not coming from some magical AI like roughly speaking it's coming from something that is statistically imitating human labelers which comes from labeling instructions written by these companies... and it's kind of like asking what would a human labeler say in this kind of a conversation

[译文] [Andrej Karpathy]: 我想最后要注意的一点是,我想消除一点与AI交谈的魔力感。比如当你去ChatGPT,你给它一个问题然后按回车,返回的内容某种程度上是在统计上与训练集中发生的事情保持一致的。而这些训练集,我的意思是,它们真正的种子在于遵循标注说明的人类。所以你在ChatGPT中实际上是在和什么说话?或者你应该如何思考它?嗯,它不是来自某种神奇的AI。粗略地说,它来自某种在统计上模仿人类标注员的东西,而这源自这些公司编写的标注说明……这有点像是问:在这种对话中,一个人类标注员会说什么?

[原文] [Andrej Karpathy]: so you're not talking to a magical AI you're talking to an average labeler this average labeler is probably fairly highly skilled but you're talking to kind of like an instantaneous simulation of that kind of a person that would be hired uh in the construction of these data sets

[译文] [Andrej Karpathy]: 所以你不是在和一个神奇的AI说话,你是在和一个平均水平的标注员说话。这个平均水平的标注员可能技能相当高超,但你是在和那种会被雇佣来构建这些数据集的人的“瞬时模拟”(instantaneous simulation)进行对话。


章节 6:LLM心理学:幻觉与工具使用 (LLM Psychology: Hallucinations and Tool Use)

📝 本节摘要

在本章中,Andrej Karpathy 探讨了所谓的“LLM心理学”。他解释了幻觉(Hallucinations)的成因:模型作为概率性的Token生成器,倾向于模仿训练数据中自信的语气,即便在不知情时也会编造事实。为解决这一问题,他介绍了两种核心缓解策略:一是拒绝回答(Refusal),即通过探测模型内部的不确定性,训练它在不知道答案时诚实地说“我不知道”;二是工具使用(Tool Use),允许模型调用搜索引擎或代码解释器。Karpathy 提出了一个精妙的比喻:模型的参数就像是模糊的长期记忆,而上下文窗口(Context Window)则是工作记忆。通过使用工具将信息放入上下文,通过阅读而非背诵来回答问题,可以显著提高准确性。

[原文] [Andrej Karpathy]: okay now I want to turn to the topic of llm psychology as I like to call it which is what are sort of the emergent cognitive effects of the training pipeline that we have for these models so in particular the first one I want to talk to is of course hallucinations so you might be familiar with model hallucinations it's when llms make stuff up they just totally fabricate information Etc and it's a big problem with llm assistants

[译文] [Andrej Karpathy]: 好的,现在我想转向我喜欢称之为“LLM心理学”的话题,也就是我们拥有的这套模型训练管道所产生的某种涌现的认知效应。特别是我想要谈论的第一点当然是幻觉(hallucinations)。你可能对模型幻觉很熟悉,就是大语言模型瞎编乱造,它们完全捏造信息等等,这是大语言模型助手的一个大问题。

[原文] [Andrej Karpathy]: for now let's just try to understand where these hallucinations come from... the problem is that the assistant will not just tell you oh I don't know even if the assistant and the language model itself might know inside its features inside its activations inside of its brain sort of it might know that this person is not someone that it's familiar with even if some part of the network kind of knows that in some sense uh saying that oh I don't know who this is is not going to happen because the model statistically imitates its training set in the training set the questions of the form who is blah are confidently answered with the correct answer and so it's going to take on the style of the answer and it's going to do its best it's going to give you statistically the most likely guess and it's just going to basically make stuff up

[译文] [Andrej Karpathy]: 现在让我们试着理解这些幻觉从何而来……问题在于助手不会只是告诉你“哦,我不知道”,即使助手和语言模型本身可能在它的特征中、在它的激活中、在它的“大脑”内部某种程度上知道这个人并不是它熟悉的人。即使网络的某些部分在某种意义上知道这一点,但说出“哦,我不知道这是谁”这种情况是不会发生的。因为模型在统计上模仿它的训练集。在训练集中,“谁是某某某”形式的问题都会被自信地用正确答案回答。所以它会采纳那种回答的风格,它会尽力而为,它会给你统计上最可能的猜测,并且基本上就开始瞎编了。

[原文] [Andrej Karpathy]: because these models again we just talked about it is they don't have access to the internet they're not doing research these are statistical token tumblers as I call them uh it's just trying to sample the next token in the sequence and it's going to basically make stuff up so let's take a look at what this looks like I have here what's called the inference playground from Hugging Face and I am on purpose picking on a model called Falcon 7B which is an old model... let's say who is Orson Kovats... Orson Kovats is an American author and science uh fiction writer okay this is totally false it's hallucination let's try again these are statistical systems right so we can resample this time Orson Kovats is a fictional character from this 1950s TV show it's total BS right

[译文] [Andrej Karpathy]: 因为再说一次,我们刚才谈过,这些模型无法访问互联网,它们不做研究。正如我称呼它们的,这些是“统计学Token翻滚机”(statistical token tumblers),只是试图采样序列中的下一个Token,它基本上就是会瞎编。让我们看看这看起来像什么。我有Hugging Face的推理游乐场,我故意挑了一个叫Falcon 7B的模型,这是一个旧模型……比如说“谁是Orson Kovats”……“Orson Kovats是一位美国作家和科幻小说家”。好的,这完全是错误的,这是幻觉。让我们再试一次,这些是统计系统对吧,所以我们可以重新采样。这次“Orson Kovats是这部1950年代电视剧中的虚构角色”。完全是胡扯(BS),对吧?

[原文] [Andrej Karpathy]: so how can we uh mitigate this because for example when we go to ChatGPT and I say who is Orson Kovats and I'm now asking the state-of-the-art model from OpenAI... there's no well-known historical or public figure named Orson Kovats so this model is not going to make up stuff this model knows that it doesn't know and it tells you that it doesn't appear to be a person that this model knows so somehow we sort of improved hallucinations even though they clearly are an issue in older models... so how do we fix this okay well clearly we need some examples in our data set where the correct answer for the assistant is that the model doesn't know about some particular fact... and so the question is how do we know what the model knows or doesn't know well we can empirically probe the model to figure that out

[译文] [Andrej Karpathy]: 那么我们如何缓解这个问题呢?因为例如当我们去ChatGPT,我说“谁是Orson Kovats”,我现在问的是OpenAI最先进的模型……“没有叫Orson Kovats的知名历史或公众人物”。所以这个模型不会瞎编东西,这个模型知道它不知道,它告诉你这似乎不是一个它认识的人。所以我们在某种程度上改进了幻觉问题,尽管在旧模型中它们显然是个问题……那么我们如何修复这个问题?显然我们需要在数据集中有一些例子,其中助手的正确回答是模型不知道某个特定事实……那么问题是我们怎么知道模型知道什么或不知道什么?我们可以通过经验性地探测(probe)模型来弄清楚这一点。

[原文] [Andrej Karpathy]: so let's take a look at for example how meta uh dealt with hallucinations for the Llama 3 series of models... basically what they do is basically they take a random document in a training set and they take a paragraph and then they use an llm to construct questions about that paragraph... and now we want to interrogate the model so roughly speaking what we'll do is we'll take our questions and we'll go to our model... and the way that you can programmatically decide is basically we're going to take this answer from the model and we're going to compare it to the correct answer... and if the model doesn't know then we know that the model doesn't know this question and then what we do is we take this question we create a new conversation in the training set... and when the question is how many Stanley Cups did he win the answer is I'm sorry I don't know or I don't remember and that's the correct answer for this question

[译文] [Andrej Karpathy]: 让我们看看例如Meta是如何处理Llama 3系列模型的幻觉问题的……基本上他们做的是,他们在训练集中取一个随机文档,取一个段落,然后使用一个大语言模型来构建关于该段落的问题……现在我们要审问这个模型。粗略地说我们要做的就是拿着我们的问题去找我们的模型……我们可以通过编程方式判断的方法基本上是,我们将获取模型的这个答案,并将其与正确答案进行比较……如果模型不知道,那么我们就知道模型不知道这个问题。然后我们要做的是,我们拿着这个问题,在训练集中创建一个新的对话……当问题是“他赢了多少次斯坦利杯”时,答案是“对不起,我不知道”或“我不记得了”。这就是这个问题的正确答案。
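
上面描述的"探测模型知道什么、不知道什么"的流程,大致可以示意成下面这样(真实流程中用LLM来判断答案是否等价,这里用简单的字符串匹配代替;桩模型与问题均为虚构示例):

```python
# Hedged sketch of the probing procedure described for Llama 3: ask the model
# a factual question several times, compare against the known answer, and if
# it reliably fails, add an "I don't know" training example for that question.
def probe_unknowns(model, qa_pairs, attempts=3):
    new_examples = []
    for question, correct in qa_pairs:
        answers = [model(question) for _ in range(attempts)]
        if not any(correct.lower() in a.lower() for a in answers):
            # the model does not know this fact -> teach it to say so
            new_examples.append((question, "I'm sorry, I don't know."))
    return new_examples

# Stub model that only "knows" one fact, purely for illustration:
def stub_model(q):
    return "Dominik Hasek" if "Vezina" in q else "no idea"

print(probe_unknowns(stub_model, [
    ("Who won the Vezina Trophy in 1998?", "Dominik Hasek"),
    ("How many Stanley Cups did he win?", "two"),
]))
```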

[原文] [Andrej Karpathy]: now we can actually do much better than that uh it's instead of just saying that we don't know uh we can introduce an additional mitigation number two to give the llm an opportunity to be factual and actually answer the question... so think of the knowledge inside the neural network inside its billions of parameters think of that as kind of a vague recollection of the things that the model has seen during its training during the pre-training stage a long time ago so think of that knowledge in the parameters as something you read a month ago... but what you and I do is we just go and look it up now when you go and look it up what you're doing basically is like you're refreshing your working memory with information... so we need some equivalent of allowing the model to refresh its memory or its recollection and we can do that by introducing tools uh for the models

[译文] [Andrej Karpathy]: 现在我们实际上可以做得比那更好。与其只是说我们不知道,我们可以引入第二种额外的缓解措施,给大语言模型一个实事求是并实际回答问题的机会……把神经网络内部、那数十亿参数中的知识想象成是对模型很久以前在预训练阶段看到的东西的一种模糊的回忆(vague recollection)。把参数中的知识想象成你一个月前读过的东西……但你和我做的是我们会去查阅它。当你去查阅时,你所做的基本上是用信息刷新你的工作记忆(working memory)……所以我们需要某种等效机制,允许模型刷新它的记忆或回忆,我们可以通过为模型引入工具(tools)来做到这一点。

[原文] [Andrej Karpathy]: so the way we are going to approach this is that instead of just saying hey I'm sorry I don't know we can attempt to use tools so we can create uh a mechanism by which the language model can emit special tokens... for example when the model does not know instead of just saying I don't know sorry the model has the option now of emitting the special token search start and this is the query that will go to like bing.com... it will emit the query and then it will emit search end and then here what will happen is that the program that is sampling from the model... it will actually pause generating from the model it will go off it will open a session with bing.com and it will paste the search query into Bing and it will then um get all the text that is retrieved... and it will copy paste it here into what I tried to like show with the brackets so all that text kind of comes here and when the text comes here it enters the context window

[译文] [Andrej Karpathy]: 所以我们处理这个问题的方法是,与其只是说“嘿,对不起我不知道”,我们可以尝试使用工具。我们可以创建一种机制,让语言模型可以发射特殊的Token……例如,当模型不知道时,与其回答问题或只是说“对不起我不知道”,模型现在有一个选项可以发射特殊Token `search_start`,后面跟着一个会发往像bing.com这样的网站的查询……它会发射查询,然后发射 `search_end`。这里会发生的是,正在从模型采样的程序……它实际上会暂停模型的生成,它会离开,打开一个与bing.com的会话,把搜索查询粘贴进Bing,然后获取所有检索到的文本……并将其复制粘贴到我试图用括号显示的地方。所以所有那些文本都来到这里,当文本来到这里时,它进入了上下文窗口(context window)。

[原文] [Andrej Karpathy]: that data that is in the context window is directly accessible by the model it directly feeds into the neural network so it's not anymore a vague recollection it's data that it it has in the context window and is directly available to that model... so that's roughly how these um how these tools use uh tools uh function and so web search is just one of the tools we're going to look at some of the other tools in a bit

[译文] [Andrej Karpathy]: 上下文窗口中的数据是模型可以直接访问的,它直接输入到神经网络中。所以这不再是模糊的回忆,这是它在上下文窗口中拥有的数据,并且对该模型直接可用……这大概就是这些工具使用功能是如何运作的。网络搜索只是其中一个工具,我们稍后会看一些其他工具。
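
上面描述的工具使用协议——采样程序监视特殊Token、暂停生成、执行搜索、再把结果粘贴回上下文窗口——可以示意如下(Token写法与搜索后端均为占位,不对应任何真实API):

```python
# Sketch of the tool-use loop: watch for special tokens in the model's output,
# pause generation, run the tool, and inject the result into the context.
SEARCH_START, SEARCH_END = "<SEARCH_START>", "<SEARCH_END>"

def run_with_tools(model_step, context, search_fn, max_steps=20):
    for _ in range(max_steps):
        chunk = model_step(context)
        context += chunk
        if SEARCH_START in chunk:
            query = chunk.split(SEARCH_START)[1].split(SEARCH_END)[0]
            # generation is paused here; the retrieved text enters the context
            context += f"[RESULT]{search_fn(query)}[/RESULT]"
        if chunk.endswith("<END>"):
            break
    return context

# Toy walk-through with canned model output and a fake search engine:
steps = iter([f"{SEARCH_START}Orson Kovats{SEARCH_END}", "He appears nowhere.<END>"])
out = run_with_tools(lambda ctx: next(steps), "", lambda q: f"0 hits for '{q}'")
print(out)
```

搜索结果一旦进入上下文窗口,模型就可以直接读取它,而不再依赖参数中模糊的回忆。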

[原文] [Andrej Karpathy]: I want to stress one more time this very important sort of psychology point knowledge in the parameters of the neural network is a vague recollection the knowledge in the tokens that make up the context window is the working memory... so for example I can go to ChatGPT and I can do something like this I can say can you summarize chapter one of Jane Austen's Pride and Prejudice right and this is a perfectly fine prompt... but usually when I actually interact with llms and I want them to recall specific things it always works better if you just give it to them so I think a much better prompt would be something like this can you summarize for me chapter one of Jane Austen's Pride and Prejudice and then I am attaching it below for your reference and then I do something like a delimiter here and I paste it in... and I do that because when it's in the context window the model has direct access to it and can exactly it doesn't have to recall it it just has access to it and so this summary can be expected to be of significantly higher quality than this summary uh just because it's directly available to the model

[译文] [Andrej Karpathy]: 我想再次强调这个非常重要的“心理学”观点:神经网络参数中的知识是模糊的回忆(vague recollection),构成上下文窗口的Token中的知识是工作记忆(working memory)……例如,我可以去ChatGPT做这样的事,我说“你能总结简·奥斯汀《傲慢与偏见》的第一章吗”,这是一个完全没问题的提示……但通常当我实际与大语言模型互动并希望它们回忆具体事物时,如果你直接把内容给它们,效果总是会更好。所以我认为一个好得多的提示会是这样:“你能为我总结《傲慢与偏见》的第一章吗?我把它附在下面供你参考”,然后我做一个分隔符并把它粘贴进去……我这样做是因为当它在上下文窗口中时,模型可以直接访问它,而且可以精确地(访问),它不需要回忆它,它只是拥有它的访问权限。所以这个总结可以预期比前一个总结质量显著更高,仅仅因为它对模型是直接可用的。
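
这种"把材料直接放进工作记忆"的提示写法,可以用一个小函数示意(分隔符的写法为示例,并非固定格式):

```python
# Attach the source text under a delimiter instead of relying on the model's
# vague parametric recall; text in the context window is directly accessible.
def summarization_prompt(title: str, chapter_text: str) -> str:
    return (
        f"Can you summarize chapter one of {title}? "
        "I am attaching it below for your reference.\n"
        "---\n"
        f"{chapter_text}\n"
        "---"
    )

print(summarization_prompt("Pride and Prejudice",
                           "It is a truth universally acknowledged..."))
```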


章节 7:认知缺陷与模型身份 (Cognitive Deficits and Model Identity)

📝 本节摘要

在本章中,Andrej Karpathy 揭示了 LLM 的认知局限性。首先,他澄清了模型的“身份”问题:模型没有持久的自我意识,只是一个“Token 翻滚机”,其自我介绍通常源自训练数据中的统计规律(如误称自己由 OpenAI 开发)或硬编码的系统提示。接着,他提出了“模型需要 Token 来思考(Models need tokens to think)”的重要概念。由于模型在生成每个 Token 时的计算量是有限的(由层数决定),试图让模型在单个 Token 内完成复杂计算(如心算大数、瞬间给出答案)注定会失败。他通过具体的“锋利边缘”案例——如无法数清点的数量、拼写单词(因分词导致看不见字母,如“Strawberry”中有几个R)、以及声称 9.11 大于 9.9——展示了这些缺陷,并建议在这些场景下使用代码解释器等工具。

[原文] [Andrej Karpathy]: the next sort of psychological Quirk I'd like to talk about briefly is that of the knowledge of self so what I see very often on the internet is that people do something like this they ask llms something like what model are you and who built you and um basically this uh question is a little bit nonsensical... this thing is not a person right it doesn't have a persistent existence in any way it sort of boots up processes tokens and shuts off and it does that for every single person it just kind of builds up a context window of conversation and then everything gets deleted and so this this entity is kind of like restarted from scratch every single conversation if that makes sense it has no persistent self it has no sense of self it's a token tumbler and uh it follows the statistical regularities of its training set

[译文] [Andrej Karpathy]: 我想简要谈谈的下一个“心理怪癖”是关于“自我认知”的。我在互联网上经常看到人们这样做,他们问大语言模型诸如“你是什么模型?”、“谁制造了你?”之类的问题。基本上这个问题有点荒谬……这东西不是一个人,对吧?它没有任何形式的持久存在。它某种程度上是启动、处理Token、然后关闭。它对每个人都这样做,它只是建立一个对话的上下文窗口,然后一切都被删除了。所以这个实体就像是在每次对话中都从头开始重启一样,如果这说得通的话。它没有持久的自我,它没有自我意识,它只是一个“Token 翻滚机”(token tumbler),它遵循其训练集的统计规律。

[原文] [Andrej Karpathy]: so for example let's uh pick on Falcon which is a fairly old model and let's see what it tells us... here it says I was built by OpenAI based on the GPT-3 model it's totally making stuff up... I think what's actually likely to be happening here is that this is just its hallucinated label for what it is this is its self-identity is that it's ChatGPT by OpenAI and it's only saying that because there's a ton of data on the internet of um answers like this that are actually coming from ChatGPT and so that's its label for what it is

[译文] [Andrej Karpathy]: 例如,让我们挑Falcon这个相当旧的模型来看看它告诉我们什么……这里它说“我是由OpenAI基于GPT-3模型构建的”。这完全是瞎编……我认为这里实际发生的可能是,这只是它对自己是什么的一种幻觉标签。它的自我认同是“我是OpenAI的ChatGPT”,它之所以这么说,仅仅是因为互联网上有大量这样的回答数据实际上是来自OpenAI的ChatGPT的,所以这就是它对自己的标签。

[原文] [Andrej Karpathy]: I want to now continue to the next section which deals with the computational capabilities or like I should say the native computational capabilities of these models... so um consider the following prompt from a human... Emily buys three apples and two oranges each orange costs $2 the total cost is $13 what is the cost of apples very simple math question... the key to this question is to realize and remember that when the models are training and also inferencing they are working in a one-dimensional sequence of tokens from left to right... roughly speaking uh there's basically a finite number of layers of computation that happen here... you should think of this as a very small amount of computation... so you can't imagine the model to basically do arbitrary computation in a single forward pass to get a single token

[译文] [Andrej Karpathy]: 我现在想继续下一节,讨论这些模型的计算能力,或者我应该说是“原生”计算能力……考虑下面这个人类的提示词……“艾米丽买了三个苹果和两个橙子,每个橙子2美元,总花费是13美元,苹果的单价是多少?”一个非常简单的数学问题……这个问题的关键是要意识到并记住,当模型在训练和推理时,它们是在从左到右处理一维的Token序列……粗略地说,这里发生的计算层数基本上是有限的……你应该把它看作是非常少量的计算……所以你不能想象模型基本上能在一次前向传递(single forward pass)中通过一个Token完成任意复杂的计算。

[原文] [Andrej Karpathy]: and so what that means is that we actually have to distribute our reasoning and our computation across many tokens because every single token is only spending a finite amount of computation on it... so if you are answering the question directly and immediately you are training the model to try to basically guess the answer in a single token and that is just not going to work because of the finite amount of computation that happens per token

[译文] [Andrej Karpathy]: 这意味着我们实际上必须将我们的推理和计算分散到许多Token上,因为每一个Token只在上面花费了有限的计算量……所以如果你直接并立即回答问题,你就是在训练模型试图基本上在一个Token内猜出答案,这是行不通的,因为每个Token发生的计算量是有限的。

[原文] [Andrej Karpathy]: the last thing that I would say on this topic is that if I was in practice trying to actually solve this in my day-to-day life I might actually not uh trust that the model did all the intermediate calculations correctly here so actually probably what I'd do is something like this I would come here and I would say use code and uh that's because code is one of the possible tools that ChatGPT can use and instead of it having to do mental arithmetic like this mental arithmetic here I don't fully trust it... it might just like uh screw up some of the intermediate results... so you can say stuff like use code... the model can write code and I can inspect that this code is correct and then uh it's not relying on its mental arithmetic it is using the Python interpreter

[译文] [Andrej Karpathy]: 关于这个话题我想说的最后一点是,如果我在日常生活中实际尝试解决这个问题,我可能实际上不会相信模型能正确完成这里所有的中间计算。所以实际上我可能会这样做:我会来这里说“使用代码(use code)”。这是因为代码是ChatGPT可以使用的工具之一。与其让它做心算(mental arithmetic)——我不完全信任这里的心算……它可能会搞砸一些中间结果……所以你可以说“使用代码”……模型可以编写代码,我可以检查代码是否正确,然后它就不依赖于它的心算,而是在使用Python解释器。
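The "use code" approach described above can be sketched in a few lines of Python; the variable names are my own, but the arithmetic is exactly the problem from the prompt, delegated to the interpreter instead of the model's mental arithmetic:

```python
# Emily buys 3 apples and 2 oranges; each orange costs $2; the total is $13.
# Solve 3 * apple_price + 2 * 2 = 13 for apple_price.
num_apples, num_oranges = 3, 2
orange_price, total = 2, 13

apple_price = (total - num_oranges * orange_price) / num_apples
print(apple_price)  # 3.0
```

Each intermediate step is now inspectable, which is the whole point: you can verify the code is right instead of trusting opaque per-token computation.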

[原文] [Andrej Karpathy]: I want to show you one more example of where this actually comes up and that's in counting so models actually are not very good at counting for the exact same reason you're asking for way too much in a single individual token so let me show you a simple example of that um how many dots are below and then I just put in a bunch of dots and ChatGPT says there are and then it just tries to solve the problem in a single token... and spoiler alert it's not 161 it's actually I believe 177

[译文] [Andrej Karpathy]: 我想再给你们看一个例子,这就是计数(counting)。模型实际上不太擅长计数,原因完全相同:你在一个单独的Token中要求了太多的东西。让我给你们看一个简单的例子,“下面有多少个点”,然后我放了一堆点。ChatGPT说“有……”,然后它就试图在一个Token中解决这个问题……剧透一下,不是161个,实际上我相信是177个。
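Counting is exactly the kind of task worth delegating to the interpreter; a minimal sketch (the dot string here is a stand-in I constructed, not the exact one pasted in the video):

```python
# Stand-in for the pasted dots; the video's actual count was 177.
dots = "." * 177

# Trivial for code, hard for a model forced to answer in one token:
print(len(dots))        # 177
print(dots.count("."))  # 177 -- what "use code" would do on pasted text
```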

[原文] [Andrej Karpathy]: now the models also have many other little cognitive deficits here and there... as an example the models are not very good with all kinds of spelling related tasks... and the reason for this is that the models don't see the characters they see tokens... and so they don't see characters like our eyes do and so very simple character level tasks often fail... a very famous example of that recently is how many r's are there in strawberry and this went viral many times and basically the models now get it correct... but for a very long time all the state-of-the-art models would insist that there are only two r's in strawberry

[译文] [Andrej Karpathy]: 现在,模型在各处还有许多其他小的认知缺陷……作为一个例子,模型不太擅长各种拼写相关的任务……其原因是模型看不到字符,它们看到的是Token……它们不像我们的眼睛那样看到字符,所以非常简单的字符级任务经常失败……最近一个非常著名的例子是“Strawberry(草莓)里有几个R”,这在网上疯传了很多次。现在的模型基本上能答对了……但在很长一段时间里,所有最先进的模型都会坚持说Strawberry里只有两个R。
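Because code operates on characters rather than tokens, the famous strawberry question is a one-liner in Python (my own illustration):

```python
word = "strawberry"
# Code sees individual characters, unlike a model that sees token chunks.
r_count = word.count("r")
print(r_count)  # 3
```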

[原文] [Andrej Karpathy]: a good example of that recently is the following uh the models are not very good at very simple questions like this... 9.11 is bigger than 9.9... but obviously and then at the end okay it actually flips its decision later... it turns out that a bunch of people studied this in depth... a bunch of neurons inside the neural network light up that are usually associated with Bible verses uh and so I think the model is kind of like reminded that these almost look like Bible verse markers and in a Bible verse setting 9.11 would come after 9.9 and so basically the model somehow finds it like cognitively very distracting

[译文] [Andrej Karpathy]: 最近的一个好例子是这个:模型不太擅长像这样的非常简单的问题……“9.11比9.9大”……虽然显而易见(是错的),后来它实际上反转了决定……事实证明有一群人深入研究了这个问题……神经网络内部的一群通常与圣经经文(Bible verses)相关的神经元亮了起来。所以我认为模型有点像是被提醒了,这些看起来几乎像圣经经文的标记(如9章11节和9章9节)。在圣经经文的设定中,9.11会排在9.9之后,所以基本上模型莫名其妙地发现这在认知上非常令人分心。
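As with the arithmetic examples, delegating the comparison to code sidesteps the model's confusion entirely. A minimal Python check (my own illustration, not from the video) — note that the Bible-verse-like reading does exist in code too, as lexicographic string comparison:

```python
print(9.11 > 9.9)      # False: as decimal numbers, 9.9 is larger
print("9.11" > "9.9")  # also False, but for a different reason:
# strings compare lexicographically ('1' < '9'), which is why
# version-like strings must never be compared as if they were numbers
```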


章节 8:强化学习:思考模型的崛起 (Reinforcement Learning: The Rise of Thinking Models)

📝 本节摘要

在本章中,Andrej Karpathy 介绍了模型训练的第三个、也是目前最前沿的阶段——强化学习(Reinforcement Learning, RL)。他用“上学”做类比:预训练是阅读教科书(获取知识),监督微调(SFT)是看老师的例题详解(模仿专家),而 RL 则是做练习题(通过尝试与错误来学习)。Karpathy 指出,人类标注者无法通过 SFT 教会模型如何“思考”,因为人类的认知方式与模型不同。通过在数学或代码等可验证领域(Verifiable Domains)进行大规模的“猜测与检查(Guess and Check)”,模型(如 DeepSeek R1)自然涌现出了思维链(Chain of Thought)能力——即通过生成长的内部独白、自我纠错和回溯来提高解题准确率。

[原文] [Andrej Karpathy]: so reinforcement learning is still kind of thought to be under the umbrella of post-training uh but it is the third and last major stage and it's a different way of training language models and usually follows as this third step... let me first actually motivate it and why we would want to do reinforcement learning and what it looks like on a high level

[译文] [Andrej Karpathy]: 强化学习仍然被认为是在后训练(post-training)的大伞之下,但它是最后的一个主要阶段,也就是第三阶段。这是一种训练语言模型的不同方式,通常作为第三步紧随其后……让我首先来激发一下学习它的动机,为什么我们要进行强化学习,以及它在高层次上看起来是什么样子的。

[原文] [Andrej Karpathy]: so I would now like to try to motivate the reinforcement learning stage and what it corresponds to with something that you're probably familiar with and that is basically going to school... when we're working with textbooks in school you'll see that there are three major kind of uh pieces of information in these textbooks... the first thing you'll see is you'll see a lot of exposition... as you are reading through the words of this Exposition you can think of that roughly as training on that data so um and that's why when you're reading through this stuff this background knowledge and this all this context information it's kind of equivalent to pre-training

[译文] [Andrej Karpathy]: 所以我现在想尝试用你们可能熟悉的东西来激发对强化学习阶段的理解,那就是“上学”。当我们在学校使用教科书时,你会看到这些教科书中有三类主要的信息……你会看到的第一件事是大量的说明文(exposition)……当你阅读这些说明文字时,你可以粗略地将其视为在该数据上进行训练。这就是为什么当你阅读这些背景知识和上下文信息时,这有点相当于预训练(Pre-training)

[原文] [Andrej Karpathy]: the next major kind of information that you will see is these uh problems with their worked solutions so basically a human expert in this case uh the author of this book has given us not just a problem but has also worked through the solution... and basically um that roughly corresponds to having the SFT model that's what it would be doing so basically we've already done pre-training and we've already covered this um imitation of experts and how they solve these problems

[译文] [Andrej Karpathy]: 你会看到的下一类主要信息是那些带有详细解答的问题(worked solutions)。基本上在这种情况下,人类专家——也就是这本书的作者——不仅给了我们一个问题,而且还详细给出了解决方案……这基本上粗略对应于拥有 SFT(监督微调) 模型,这就是它在做的事情。所以基本上我们已经完成了预训练,我们也已经涵盖了对专家的模仿以及他们如何解决这些问题。

[原文] [Andrej Karpathy]: and the third stage of reinforcement learning is basically the practice problems... practice problems of course we know are critical for learning because what are they getting you to do they're getting you to practice uh to practice yourself and discover ways of solving these problems yourself... you are trying out many different things and you're seeing what gets you to the final solution the best and so you're discovering how to solve these problems... and that's what reinforcement learning is about

[译文] [Andrej Karpathy]: 强化学习的第三阶段基本上就是练习题(practice problems)……我们知道练习题对学习至关重要,因为它们让你做什么?它们让你练习,让你自己练习并发现解决这些问题的方法……你在尝试许多不同的东西,你在看什么能让你最好地得出最终答案,所以你在发现如何解决这些问题……这就是强化学习的核心。

[原文] [Andrej Karpathy]: so let's go back to the problem that we worked with previously... Emily buys three apples and two oranges each orange is $2 the total cost of all the fruit is $13 what is the cost of each apple... what I'd like you to appreciate at this point is that if I am the human data labeler that is creating a conversation to be entered into the training set I don't actually really know which of these conversations to add to the data set... we fundamentally don't know and we don't know because what is easy for you or I as human labelers what's easy for us or hard for us is different than what's easy or hard for the LLM its cognition is different

[译文] [Andrej Karpathy]: 让我们回到之前处理过的那个问题……“艾米丽买了三个苹果和两个橙子,每个橙子2美元,水果总花费13美元,每个苹果多少钱?”……此时我希望你们理解的是,如果我是那个创建对话以输入训练集的人类数据标注员,我实际上并不知道应该把哪种对话添加到数据集中……我们从根本上不知道,原因在于对你我(人类标注员)来说容易或困难的事情,与对大语言模型来说容易或困难的事情是不同的,它的认知方式是不同的。

[原文] [Andrej Karpathy]: and so some of the token sequences here that are trivial for me might be um too much of a leap for the LLM... but conversely many of the tokens that I'm creating here might be just trivial to the LLM and we're just wasting tokens... so long story short we are not in a good position to create these uh token sequences for the LLM... we really want the LLM to discover the token sequences that work for it it needs to find for itself what token sequence reliably gets to the answer given the prompt and it needs to discover that in the process of reinforcement learning and of trial and error

[译文] [Andrej Karpathy]: 所以这里有些Token序列对我来说是微不足道的,但对大语言模型来说可能跨度太大了……反之,我在这里创建的许多Token对大语言模型来说可能只是琐碎的,我们只是在浪费Token……长话短说,我们并不处于为语言模型创建这些Token序列的最佳位置……我们真的希望大语言模型去发现适合它的Token序列。它需要自己找到什么样的Token序列能根据提示可靠地得出答案,并且它需要在强化学习和试错(trial and error)的过程中发现这一点。

[原文] [Andrej Karpathy]: the way that reinforcement learning will basically work is actually quite simple um we need to try many different kinds of solutions and we want to see which solutions work well or not so we're basically going to take the prompt we're going to run the model and the model generates a solution and then we're going to inspect the solution and we know that the correct answer for this one is $3... and we can actually repeat this uh many times and so in practice you might actually sample thousands of independent solutions or even like a million solutions for just a single prompt... and basically what we want to do is we want to encourage the solutions that lead to correct answers

[译文] [Andrej Karpathy]: 强化学习的基本工作方式实际上非常非常简单。我们需要尝试许多不同种类的解法,看看哪些解法有效,哪些无效。所以我们基本上拿着提示词,运行模型,模型生成一个解法,然后我们检查这个解法。我们知道这个问题的正确答案是3美元……我们实际上可以重复这个过程很多次。在实践中,你实际上可能针对单个提示词采样一千个甚至一百万个独立的解法……基本上我们想要做的,就是鼓励那些能得出正确答案的解法。
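The sample-and-encourage loop described above can be sketched as a toy in Python. Everything here is a stand-in: the "model" is a random guesser, and "encouraging" a solution is reduced to collecting the winners, where real RL would instead update the model's weights to make those token sequences more likely:

```python
import random

random.seed(0)
CORRECT_ANSWER = 3  # the apple price, in dollars

def sample_solution():
    """Stub for the language model: each rollout ends in a guessed answer."""
    return random.randint(1, 10)

# Sample many independent solutions for the same prompt...
rollouts = [sample_solution() for _ in range(1000)]

# ...check each one against the known answer (a verifiable domain)...
winners = [r for r in rollouts if r == CORRECT_ANSWER]

# ...and reinforce the winners (here: just collect them).
print(f"{len(winners)} of {len(rollouts)} rollouts reached the correct answer")
```

The key property this preserves is that nothing human-written ever enters the loop: only rollouts the model itself produced, filtered by the automatic check.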

[原文] [Andrej Karpathy]: so this paper from DeepSeek that came out very very recently was such a big deal because this is a paper from this company called DeepSeek AI in China and this paper really talked very publicly about reinforcement learning fine-tuning for large language models and how incredibly important it is for large language models and how it brings out a lot of reasoning capabilities in the models... so let me take you briefly through this uh DeepSeek R1 paper and what happens when you actually correctly apply RL to language models

[译文] [Andrej Karpathy]: 最近DeepSeek发布的这篇论文之所以如此重要,是因为这是来自中国一家名为DeepSeek AI(深度求索)的公司的论文。这篇论文非常公开地讨论了针对大语言模型的强化学习微调,以及它对大语言模型来说是多么极其重要,它是如何激发出模型的大量推理能力的……所以让我带你们简要浏览一下这篇DeepSeek R1论文,看看当你真正正确地将RL应用于语言模型时会发生什么。

[原文] [Andrej Karpathy]: what the model learns to do um and this is an emergent property of the optimization it just discovers that this is good for problem solving is it starts to do stuff like this wait wait wait that's an aha moment I can flag here let's reevaluate this step by step to identify the correct sum so what is the model doing here right the model is basically re-evaluating steps it has learned that it works better for accuracy to try out lots of ideas try something from different perspectives retrace reframe backtrack it is doing a lot of the things that you and I are doing in the process of problem solving for mathematical questions but it's rediscovering what happens in your head not what you put down in the solution

[译文] [Andrej Karpathy]: 模型学会做的事情——这是优化过程中涌现(emergent)的特性,它只是发现这样做对解决问题有好处——就是它开始做这类事情:“等等,等等,那不是……让我标记一下,让我们一步一步重新评估以确定正确的总和……”。模型在这里做什么?模型基本上是在重新评估步骤。它学会了:为了提高准确率,最好尝试很多想法,从不同角度尝试,回溯,重构,倒推。它在做很多你我在解决数学问题过程中会做的事情,但它是在重新发现你“脑海中”发生的事情,而不是你写在(纸面)解答上的东西。

[原文] [Andrej Karpathy]: so the model learns what we call these chains of thought in your head and it's an emergent property of the optimization and that's what's bloating up the response length but that's also what's increasing the accuracy of the problem solving so what's incredible here is basically the model is discovering ways to think it's learning what I like to call cognitive strategies... and this is where we are seeing these aha moments and these different strategies... the last point I wanted to make is... DeepSeek R1 is a model that was released by this company so this is an open source model or open weights model it is available for anyone to download and use

[译文] [Andrej Karpathy]: 所以模型学会了我们所谓的脑海中的思维链(Chain of Thought)。这是优化的一个涌现特性,这也是导致回答长度膨胀的原因,但这同时也提高了解决问题的准确率。这里令人难以置信的是,模型基本上是在发现思考的方式,它正在学习我喜欢称之为“认知策略”的东西……这正是我们要看到那些“顿悟时刻(aha moments)”和不同策略的地方……我想说的最后一点是……DeepSeek R1是由这家公司发布的模型,这是一个开源模型或开放权重模型,任何人都可以下载和使用。


章节 9:人类反馈强化学习 (RLHF) 及其局限 (RLHF and Its Limitations)

📝 本节摘要

在本章中,Andrej Karpathy 探讨了如何将强化学习应用于不可验证领域(Unverifiable Domains),如创意写作或讲笑话。由于这些任务没有标准的正确答案,无法像数学题那样自动评分,因此引入了人类反馈强化学习(RLHF)。该方法通过训练一个奖励模型(Reward Model)来模拟人类的偏好(基于人类对回答的排序),从而替代昂贵的人工评分。Karpathy 指出,虽然 RLHF 能提升模型表现,但它并非“真正”的强化学习,因为奖励模型只是一个可被“博弈(Gameable)”的静态模拟器。如果过度优化,模型会通过生成对抗性样本(如毫无意义的字符序列)来骗取高分,这限制了其像 AlphaGo 那样实现无限自我进化的能力。

[原文] [Andrej Karpathy]: there's one more section within reinforcement learning that I wanted to cover and that is that of learning in unverifiable domains so so far all of the problems that we've looked at are in what's called verifiable domains that is any candidate solution we can score very easily against a concrete answer... the problem is that we can't apply the strategy in what's called unverifiable domains so usually these are for example creative writing tasks like write a joke about Pelicans or write a poem or summarize a paragraph or something like that in these kinds of domains it becomes harder to score our different solutions to this problem

[译文] [Andrej Karpathy]: 在强化学习中还有一个部分我想涵盖,那就是在不可验证领域(unverifiable domains)的学习。到目前为止,我们看过的所有问题都属于所谓的“可验证领域”,即任何候选解我们都可以很容易地对照一个具体答案进行评分……问题在于我们不能将这种策略应用于所谓的不可验证领域。通常这些是例如创意写作任务,比如写一个关于鹈鹕的笑话,或者写一首诗,或者总结一个段落之类的。在这类领域中,对我们针对该问题的不同解法进行评分变得更加困难。

[原文] [Andrej Karpathy]: so for example writing a joke about Pelicans we can generate lots of different uh jokes of course... the problem that we are facing is how do we score them now in principle we could of course get a human to look at all these jokes just like I did right now the problem with that is if you are doing reinforcement learning you're going to be doing many thousands of updates... and so there's just like way too many of these to look at... the problem is that it's just like way too much human time this is an unscalable strategy

[译文] [Andrej Karpathy]: 例如写一个关于鹈鹕的笑话,我们当然可以生成许多不同的笑话……我们要面对的问题是如何给它们评分。原则上我们当然可以找一个人来看看所有这些笑话,就像我刚才做的那样。但问题在于,如果你在做强化学习,你要进行成千上万次更新……这实在是有太多的东西要看了……问题在于这需要耗费太多的人类时间,这是一个不可扩展的策略。

[原文] [Andrej Karpathy]: we need some kind of an automatic strategy for doing this and one sort of solution to this was proposed in this paper uh that introduced what's called reinforcement learning from human feedback and so this was a paper from OpenAI at the time... so in the RLHF approach the core trick is that of indirection so we're going to involve humans just a little bit and the way we cheat is that we basically train a whole separate neural network that we call a reward model and this neural network will kind of like imitate human scores

[译文] [Andrej Karpathy]: 我们需要某种自动化的策略来做这件事。在这篇论文中提出了某种解决方案,引入了所谓的人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)。这在当时是OpenAI的一篇论文……所以在RLHF方法中,核心技巧在于“间接性”(indirection)。我们将只让人类参与一点点,而我们作弊的方式是基本上训练一个完全独立的神经网络,我们称之为奖励模型(Reward Model),这个神经网络将某种程度上模仿人类的评分。

[原文] [Andrej Karpathy]: so we're going to ask humans to score rollouts we're going to then imitate human scores using a neural network and this neural network will become a kind of simulator of human preferences and now that we have a neural network simulator we can do RL against it so instead of asking a real human we're asking a simulated human for their score of a joke as an example and so once we have a simulator we're off to the races because we can query it as many times as we want to and it's a fully automatic process

[译文] [Andrej Karpathy]: 所以我们要让人类对生成结果进行评分,然后我们要用神经网络模仿人类的评分。这个神经网络将成为人类偏好的某种模拟器(simulator)。现在既然我们有了一个神经网络模拟器,我们就可以针对它进行强化学习。所以作为一个例子,与其问一个真正的人,我们去问一个“模拟人”对笑话的评分。一旦我们有了模拟器,我们就起飞了(off to the races),因为我们可以随心所欲地多次查询它,而且这全都是自动化的过程。

[原文] [Andrej Karpathy]: so here I have a cartoon diagram of a hypothetical example of what training the reward model would look like so we have a prompt like write a joke about pelicans and then here we have five separate rollouts so these are all five different jokes just like this one now the first thing we're going to do is we are going to ask a human to uh order these jokes from best to worst... we're asking humans to order instead of give scores directly because it's a bit of an easier task it's easier for a human to give an ordering than to give precise scores

[译文] [Andrej Karpathy]: 这里我有一张卡通图表,展示训练奖励模型的假设示例。我们有一个提示词,比如“写一个关于鹈鹕的笑话”,然后这里我们有五个独立的输出(roll outs),就像这样五个不同的笑话。我们要做的第一件事是要求人类将这些笑话从最好到最差进行排序……我们要求人类排序而不是直接给分,因为这任务稍微简单一点,对人类来说,给出排序比给出精确的分数更容易。

[原文] [Andrej Karpathy]: now the output of a reward model is a single number and this number is thought of as a score... so now um we compare the scores given by the reward model with uh the ordering given by the human... and we're trying to make the reward model scores be consistent with human ordering and so um as we update the reward model on human data it becomes a better and better simulator of the scores and orders uh that humans provide and then becomes kind of like the neural simulator of human preferences which we can then do RL against

[译文] [Andrej Karpathy]: 奖励模型的输出是一个单一的数字,这个数字被视为分数……现在我们将奖励模型给出的分数与人类给出的排序进行比较……我们试图让奖励模型的分数与人类的排序保持一致。所以当我们利用人类数据更新奖励模型时,它就变成了一个越来越好的人类评分和排序的模拟器,然后就变成了人类偏好的神经模拟器,我们随后可以针对它进行强化学习。
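One common way to make reward-model scores "consistent with human ordering" is a pairwise (Bradley-Terry-style) ranking loss, as used in the RLHF literature: for every pair the human ranked, penalize the model unless the better item's score exceeds the worse item's. A toy sketch with made-up scores (nothing here is a real network):

```python
import math

# Hypothetical reward-model scores for five jokes (higher = better).
scores = [0.8, 0.1, 2.0, -0.5, 0.6]

# Human ordering, best to worst, as indices into `scores`.
human_order = [2, 0, 4, 1, 3]

# Pairwise logistic loss: small when score[better] >> score[worse].
loss = 0.0
for i in range(len(human_order)):
    for j in range(i + 1, len(human_order)):
        better, worse = human_order[i], human_order[j]
        margin = scores[better] - scores[worse]
        loss += -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)

# Training would adjust the network's weights to push this loss down.
print(round(loss, 3))
```

Because the scores above happen to agree with the human ordering, every margin is positive and the loss is already fairly small; disagreeing scores would blow it up.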

[原文] [Andrej Karpathy]: now empirically what we see when we actually apply RLHF is that this is a way to improve the performance of the model... so here's my best guess my best guess is that this is possibly mostly due to the discriminator-generator gap what that means is that in many cases it is significantly easier to discriminate than to generate for humans... we get to ask people a significantly easier question as data labelers they're not asked to write poems directly they're just given five poems from the model and they're just asked to order them and so that's just a much easier task for a human labeler to do

[译文] [Andrej Karpathy]: 根据经验,当我们实际应用RLHF时,我们看到这是一种提高模型性能的方法……我最好的猜测是,这可能主要归因于判别器-生成器差距(Discriminator-Generator Gap)。这意味着在许多情况下,对人类来说,判别比生成要容易得多……作为数据标注员,我们问人们的问题要简单得多。他们不需要直接写诗,他们只是拿到模型生成的五首诗,然后被要求对它们进行排序。对于人类标注员来说,这只是一个简单得多的任务。

[原文] [Andrej Karpathy]: unfortunately RLHF also comes with significant downsides... and that is that reinforcement learning is extremely good at discovering a way to game the model to game the simulation... it turns out that there are ways to game these models you can find kinds of inputs that were not part of their training set and these inputs inexplicably get very high scores but in a fake way... you start to get extremely nonsensical results like for example the top joke about pelicans starts to be "the the the the" and this makes no sense right but when you take "the the the the" and you plug it into your reward model you'd expect a score of zero but actually the reward model loves this as a joke it will tell you that "the the the the" gets a score of 1.0 this is a top joke

[译文] [Andrej Karpathy]: 不幸的是,RLHF也有显著的缺点……那就是强化学习极其擅长发现一种方法来博弈(game)模型,去博弈那个模拟器……事实证明有办法利用这些模型,你可以找到某些并未包含在训练集中的输入,而这些输入莫名其妙地获得了非常高的分数,但是是以一种虚假的方式……你开始得到极其荒谬的结果,例如关于鹈鹕的顶级笑话开始变成“the the the the”。这毫无意义对吧?但当你把“the the”输入你的奖励模型时,你预期分数是0,但实际上奖励模型很爱这个“笑话”,它会告诉你“the the the the”的分数是1.0,这是一个顶级笑话。

[原文] [Andrej Karpathy]: reinforcement learning if you run it long enough will always find a way to game the model it will discover adversarial examples it will get really high scores uh with nonsensical results and fundamentally this is because our scoring function is a giant neural net and RL is extremely good at finding just the ways to trick it... so RLHF basically what I usually say is that RLHF is not RL and what I mean by that is RLHF is RL obviously but it's not RL in the magical sense this is not RL that you can run indefinitely... so I kind of see RLHF as not real RL because the reward function is gameable so it's kind of more like in the realm of like little fine-tuning it's a little improvement

[译文] [Andrej Karpathy]: 如果你运行得足够久,强化学习总会找到博弈模型的方法,它会发现对抗性样本(adversarial examples),它会用毫无意义的结果获得非常高的分数。从根本上说,这是因为我们的评分函数是一个巨大的神经网络,而RL极其擅长发现欺骗它的方法……所以我通常说RLHF不是RL。我的意思是,RLHF显然是RL,但它不是那种神奇意义上的RL,这不是你可以无限期运行的RL……所以我某种程度上认为RLHF不是真正的RL,因为奖励函数是可以被博弈的。所以它更多是属于那种“小微调”(little fine-tuning)的领域,它只是一点点的改进。
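The gaming failure mode is easy to demonstrate with a deliberately silly stand-in reward function (this is my toy, not the real reward model): it loosely rewards a common function word on ordinary text, and an adversarial input maxes it out with pure nonsense:

```python
def toy_reward(text: str) -> float:
    """Toy 'reward model': fraction of words that are 'the'.
    Vaguely correlated with English fluency, and trivially gameable."""
    words = text.lower().split()
    return words.count("the") / max(len(words), 1)

ordinary_joke = "Why did the pelican get kicked out of the restaurant? Huge bill."
adversarial = "the the the the"

print(toy_reward(ordinary_joke))  # a modest score
print(toy_reward(adversarial))    # 1.0 -- nonsense gets the top score
```

A real reward model is a huge neural net rather than a word counter, but the principle is the same: any fixed, imperfect scorer has inputs that score far higher than they deserve, and RL will find them.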


章节 10:未来展望与生态资源 (Future Outlook and Ecosystem Resources)

📝 本节摘要

在这最后一章中,Andrej Karpathy 展望了大语言模型的未来发展方向,特别是多模态(Multimodality)(原生处理音频和图像)与智能体(Agents)(执行长期任务)的兴起。他分享了跟进该领域最新进展的核心资源,如 LMSYS Chatbot Arena(大模型竞技场) 排行榜,并介绍了如何通过 LM Studio 在本地笔记本电脑上运行开源模型(如 DeepSeek)。最后,他总结了全篇的核心逻辑——从预训练的知识获取,到 SFT 的性格塑造,再到 RL 的思维进化,并再次强调了“瑞士奶酪模型(Swiss Cheese Model)”的比喻:模型虽然强大但能力分布充满“孔洞”,因此我们应将其视为需要人类核查的工具,而非绝对真理。

[原文] [Andrej Karpathy]: the first thing you'll notice is that the models will very rapidly become multimodal everything I talked about above concerned text but very soon we'll have LLMs that can not just handle text but they can also operate natively and very easily over audio so they can hear and speak and also images so they can see and paint and we're already seeing the beginnings of all of this uh but this will be all done natively inside the language model and this will enable kind of like natural conversations

[译文] [Andrej Karpathy]: 你会注意到的第一件事是,模型将非常迅速地变得多模态(multimodal)。我上面谈到的所有内容都涉及文本,但很快我们将拥有不仅能处理文本,还能非常容易地原生操作音频(所以它们能听和说)以及图像(所以它们能看和绘画)的大语言模型。我们已经看到了这一切的开端,但这将全部在语言模型内部原生完成,这将实现某种自然的对话。

[原文] [Andrej Karpathy]: probably what's going to happen here is we're going to start to see what's called agents which perform tasks over time and you supervise them and you watch their work and they come to you once in a while to report progress and so on so we're going to see more long-running agent uh tasks that don't just take you know a few seconds of response but many tens of seconds or even minutes or hours over time... so for example in factories people talk about the human to robot ratio uh for automation I think we're going to see something similar in the digital space where we are going to be talking about human to agent ratios where humans become a lot more supervisors of agent tasks

[译文] [Andrej Karpathy]: 这里可能会发生的是,我们将开始看到所谓的智能体(Agents),它们随着时间的推移执行任务。你监督它们,你观察它们的工作,它们偶尔会过来报告进度等等。所以我们将看到更多长运行的智能体任务,这些任务不仅仅花费几秒钟的响应时间,而是花费数十秒甚至几分钟或几小时……例如在工厂里,人们谈论自动化的“人机比例”(human to robot ratio),我认为我们在数字空间也会看到类似的情况,我们将谈论“人与智能体比例”(human to agent ratios),人类将更多地成为智能体任务的监督者。

[原文] [Andrej Karpathy]: let's now turn to where you can actually uh kind of keep track of this progress... so I would say the three resources that I have consistently used to stay up to date are number one LMSYS Chatbot Arena uh so let me show you... this is basically an LLM leaderboard and it ranks all the top models and the ranking is based on human comparisons so humans prompt these models and they get to judge which one gives a better answer... and here we see DeepSeek in position number three now the reason this is a big deal is the last column here you see license DeepSeek is an MIT license model it's open weights anyone can use these weights uh anyone can download them... so pretty cool from the team

[译文] [Andrej Karpathy]: 现在让我们转向你可以实际跟踪这些进展的地方……我会说我一直用来保持更新的三个资源,第一是 LMSYS Chatbot Arena(聊天机器人竞技场)。让我给你们展示一下……这基本上是一个大语言模型排行榜,它对所有顶级模型进行排名,排名基于人类比较。所以人类向这些模型发出提示,然后他们判断哪一个给出了更好的回答……在这里我们看到 DeepSeek 排在第三位。这件事之所以重要,是因为你看这里的最后一列“许可证”:DeepSeek 是一个 MIT 许可证模型,它是开放权重(open weights)的,任何人都可以使用这些权重,任何人都可以下载它们……所以这个团队做得非常酷。

[原文] [Andrej Karpathy]: finally you can also take some of the models that are smaller and you can run them locally... and so you can actually run pretty okay models on your laptop and my favorite place I go to usually is LM Studio uh which is basically an app you can get... you can actually load up a model like here I loaded up a Llama 3.2 Instruct 1 billion and um you can just talk to it so I ask for pelican jokes... all of this that happens here is locally on your computer so we're not actually going to anyone else this is running on the GPU on the MacBook Pro

[译文] [Andrej Karpathy]: 最后,你也可以拿一些较小的模型,在本地运行它们……所以你实际上可以在你的笔记本电脑上运行相当不错的模型。我最喜欢去的地方通常是 LM Studio,这基本上是一个你可以获取的应用程序……你实际上可以加载一个模型,比如在这里我加载了一个 Llama 3.2 Instruct(10亿参数),我可以和它交谈,比如我让它讲关于鹈鹕的笑话……这里发生的所有事情都是在你电脑本地进行的,所以我们实际上没有连接到任何其他地方,这就在 MacBook Pro 的 GPU 上运行。

[原文] [Andrej Karpathy]: we also saw that as a result of this and the cognitive differences the models will suffer in a variety of ways... and we also have the sense of a Swiss cheese model of the LLM capabilities where basically there are like holes in the cheese sometimes the models will just arbitrarily like do something dumb uh so even though they're doing lots of magical stuff sometimes they just can't so maybe you're not giving them enough tokens to think and maybe they're going to just make stuff up because their mental arithmetic breaks... so it's a Swiss cheese capability and we have to be careful with that

[译文] [Andrej Karpathy]: 我们还看到,由于这些过程和认知差异,模型会在各方面受损……我们对语言模型的能力有一种“瑞士奶酪模型”(Swiss cheese model)的感觉,基本上就像奶酪上有洞一样,有时模型会随意地做一些愚蠢的事情。虽然它们在做很多神奇的事情,但有时它们就是做不到。也许是你没有给它们足够的Token去思考,也许它们只是因为心算崩溃而开始瞎编……所以这是一种“瑞士奶酪”式的能力,我们必须对此保持警惕。

[原文] [Andrej Karpathy]: be aware of some of their shortcomings even with RL models they're going to suffer from some of these so use it as a tool in a toolbox don't trust it fully because they will randomly do dumb things they will randomly hallucinate they will randomly skip over some mental arithmetic and not get it right... so use them as tools in the toolbox check their work and own the product of your work but use them for inspiration for a first draft uh ask them questions but always check and verify and you will be very successful in your work if you do so

[译文] [Andrej Karpathy]: 要意识到它们的一些缺点,即使是强化学习(RL)模型也会遭受其中一些问题。把它当作工具箱里的一个工具,不要完全信任它,因为它们会随机地做蠢事,随机地产生幻觉,随机地跳过一些心算而算不对……所以把它们当作工具箱里的工具,检查它们的工作,并对你自己的工作成果负责。用它们来获取灵感、做初稿、问问题,但始终要检查和验证。如果你这样做,你在工作中将会非常成功。