How I use LLMs

章节 1:引言与当前 LLM 生态系统概览

📝 本节摘要

Andrej Karpathy 在本节开篇点明了本期视频的核心目标:从上一期的底层原理(训练与心理学)转向大模型的实际应用。他将 ChatGPT 视为该领域的“鼻祖”和当前功能最丰富的标杆,同时也详细介绍了 2025 年蓬勃发展的 LLM 生态系统,包括 Google 的 Gemini、Anthropic 的 Claude、xAI 的 Grok 等竞争对手。此外,他还分享了如何通过 Chatbot Arena 等排行榜来追踪不同模型的性能表现。

[原文] [Andrej Karpathy]: hi everyone so in this video I would like to continue our general audience series on large language models like ChatGPT now in the previous video deep dive into LLMs that you can find on my YouTube we went into a lot of the under-the-hood fundamentals of how these models are trained and how you should think about their cognition or psychology

[译文] [Andrej Karpathy]: 大家好,在这个视频中,我想继续我们关于像 ChatGPT 这样的大型语言模型的普通受众系列。在之前的视频《深入剖析 LLM》(你可以在我的 YouTube 频道找到)中,我们深入探讨了许多关于这些模型是如何训练的底层原理,以及你应该如何思考它们的认知或心理学。

[原文] [Andrej Karpathy]: now in this video I want to go into more practical applications of these tools I want to show you lots of examples I want to take you through all the different settings that are available and I want to show you how I use these tools and how you can also use them uh in your own life and work so let's dive in

[译文] [Andrej Karpathy]: 而在这个视频中,我想深入探讨这些工具的更实际的应用。我想向你们展示大量的例子,带你们了解所有可用的不同设置,并向你们展示我是如何使用这些工具的,以及你们如何在自己的生活和工作中使用它们。那么,让我们开始吧。

[原文] [Andrej Karpathy]: okay so first of all the web page that I have pulled up here is chatgpt.com now as you might know ChatGPT was developed by OpenAI and deployed in 2022 so this was the first time that people could actually just kind of like talk to a large language model through a text interface and this went viral all over the internet and this was huge

[译文] [Andrej Karpathy]: 好的,首先,我这里打开的网页是 chatgpt.com。正如你们可能知道的,ChatGPT 是由 OpenAI 开发并在 2022 年部署的。这是人们第一次能够真正通过文本界面与大型语言模型进行交谈,它迅速走红并席卷了整个互联网,这在当时是一件大事。

[原文] [Andrej Karpathy]: now since then though the ecosystem has grown a lot so I'm going to be showing you a lot of examples of ChatGPT specifically but now in 2025 there's many other apps that are kind of ChatGPT-like and this is now a much bigger and richer ecosystem

[译文] [Andrej Karpathy]: 不过从那时起,这个生态系统已经有了巨大的增长。所以我虽然会具体展示很多 ChatGPT 的例子,但在 2025 年的今天,还有许多其他类似 ChatGPT 的应用程序,这是一个规模更大、内容更丰富的生态系统。

[原文] [Andrej Karpathy]: so in particular I think ChatGPT by OpenAI is this original gangster incumbent it's the most popular and most feature-rich also because it's been around the longest but there are many other kind of clones available I would say I don't think it's too unfair to say but in some cases there are kind of like unique experiences that are not found in ChatGPT and we're going to see examples of those

[译文] [Andrej Karpathy]: 特别是,我认为 OpenAI 的 ChatGPT 是这个领域的“元老级”(Original Gangster)现任霸主,它是最受欢迎且功能最丰富的,也因为它存在的时间最长。但现在有许多其他的“克隆版”可用——我想这样说也不算太不公平——但在某些情况下,它们提供了一些 ChatGPT 中找不到的独特体验,我们将看到这方面的例子。

[原文] [Andrej Karpathy]: so for example big tech has followed with a lot of kind of ChatGPT-like experiences so for example Gemini Meta AI and Copilot from Google Meta and Microsoft respectively and there's also a number of startups so for example Anthropic has Claude which is kind of like a ChatGPT equivalent xAI which is Elon's company has Grok and there's many others

[译文] [Andrej Karpathy]: 举个例子,大型科技公司已经紧随其后,推出了许多类似 ChatGPT 的体验,比如 Google 的 Gemini、Meta 的 Meta AI 和 Microsoft 的 Copilot。还有许多初创公司,例如 Anthropic 推出了 Claude,它有点像 ChatGPT 的等价物;Elon 的公司 xAI 推出了 Grok;还有许多其他的。

[原文] [Andrej Karpathy]: so all of these here are from United States companies basically DeepSeek is a Chinese company and Le Chat is from Mistral a French company now where can you find these and how can you keep track of them well number one on the internet somewhere but there are some leaderboards and in the previous video I've shown you Chatbot Arena is one of them

[译文] [Andrej Karpathy]: 这里提到的基本上都是美国公司。DeepSeek 是一家中国公司,Le Chat 是法国公司 Mistral 的产品。那么,你在哪里可以找到这些模型以及如何跟踪它们呢?第一,在互联网的某个地方;但也有一些排行榜,在之前的视频中我向你们展示过,Chatbot Arena(聊天机器人竞技场)就是其中之一。

[原文] [Andrej Karpathy]: so here you can come to some ranking of different models and you can see sort of their strength or Elo score and so this is one place where you can keep track of them I would say another place maybe is this SEAL leaderboard from Scale and so here you can also see different kinds of evals and different kinds of models and how well they rank and you can also come here to see which models are currently performing the best on a wide variety of tasks

[译文] [Andrej Karpathy]: 在这里你可以看到不同模型的排名,看到它们的实力或 ELO 分数,这是你可以跟踪它们的一个地方。我想说另一个地方可能是 Scale 的这个 SEAL Leaderboards,在这里你也可以看到不同类型的评估、不同的模型以及它们的排名情况,你可以来这里看看哪些模型目前在各种任务上表现最好。

[原文] [Andrej Karpathy]: so understand that the ecosystem is fairly rich but for now I'm going to start with OpenAI because it is the incumbent and is most feature-rich but I'm going to show you others over time as well

[译文] [Andrej Karpathy]: 所以要明白,现在的生态系统相当丰富,但目前我将从 OpenAI 开始,因为它是现任霸主且功能最丰富,但我随着时间的推移也会向你们展示其他的模型。


章节 2:LLM 的核心原理:Token、上下文窗口与训练阶段

📝 本节摘要

在本节中,Andrej 深入剖析了与 LLM 交互的底层机制。他首先解释了我们在对话框中输入的文本实际上被切分成了“Token(词元)”,并通过 TikTokenizer 工具直观展示了模型眼中的数字序列。他指出,虽然用户界面显示的是气泡对话,但在模型看来,这只是一个不断延长的一维 Token 流,而“新建对话”本质上是清空这个上下文窗口(Context Window)。随后,他用生动的“Zip 文件”比喻来解释模型的训练过程:预训练(Pre-training)是将整个互联网的信息进行“有损压缩”存入神经网络参数中,赋予模型世界知识;而后训练(Post-training)则是通过人类对话数据让这个文件学会像助手一样说话。最后他强调,此时的模型是一个完全独立的实体,尚未通过工具连接外部世界。

[原文] [Andrej Karpathy]: so let's start with ChatGPT what is this text box and what do we put in here okay so the most basic form of interaction with the language model is that we give it text and then we get some text back in response

[译文] [Andrej Karpathy]: 那么让我们从 ChatGPT 开始,这个文本框是什么?我们要在这里输入什么?好的,与语言模型最基本的交互形式就是我们给它文本,然后我们得到一些文本作为回应。

[原文] [Andrej Karpathy]: so as an example we can ask it for a haiku about what it's like to be a large language model so this is a good kind of example ask for a language model because these models are really good at writing so writing haikus or poems or cover letters or resumés or email replies they're just good at writing

[译文] [Andrej Karpathy]: 举个例子,我们可以要求它写一首关于“作为一个大型语言模型是什么感觉”的俳句。这是一个向语言模型提问的好例子,因为这些模型非常擅长写作,比如写俳句、诗歌、求职信、简历或电子邮件回复,它们就是很擅长写作。

[原文] [Andrej Karpathy]: so when we ask for something like this what happens looks as follows the model basically responds words flow like a stream endless echoes never mine ghost of thought unseen okay it's pretty dramatic

[译文] [Andrej Karpathy]: 当我们提出这样的要求时,发生的情况如下,模型基本上会回答:“词语如溪流般流淌,无尽回响皆非我有,看不见的思想幽灵。”(译注:此处为模型生成的俳句直译)好吧,这还挺戏剧性的。

[原文] [Andrej Karpathy]: but what we're seeing here in ChatGPT is something that looks a bit like a conversation that you would have with a friend these are kind of like chat bubbles now what we saw in the previous video is that what's going on under the hood here is that this piece of text which we call a user query and also the response from the model are each chopped up into little text chunks that we call tokens

[译文] [Andrej Karpathy]: 但我们在 ChatGPT 中看到的,看起来有点像你和朋友之间的对话,这些看起来像聊天气泡。我们在之前的视频中看到,这背后的实际情况是,我们称之为“用户查询”的这段文本,以及模型的响应文本,都被切分成了我们称之为“Token(词元)”的小文本块。

[原文] [Andrej Karpathy]: so this sequence of text is under the hood a token sequence a one-dimensional token sequence now the way we can see those tokens is we can use an app like for example Tiktokenizer

[译文] [Andrej Karpathy]: 所以这一串文本在底层其实是一个 Token 序列,一个一维的 Token 序列。我们要查看这些 Token 的方法是使用一个像 TikTokenizer 这样的应用程序。

[原文] [Andrej Karpathy]: so making sure that GPT-4o is selected I can paste my text here and this is actually what the model sees under the hood my piece of text to the model looks like a sequence of exactly 15 tokens and these are the little text chunks that the model sees

[译文] [Andrej Karpathy]: 确保选择了 GPT-4o,我可以把我的文本粘贴到这里,这实际上就是模型在底层看到的东西。我的这段文本对模型来说看起来就像是一个正好由 15 个 Token 组成的序列,这些就是模型看到的小文本块。

[原文] [Andrej Karpathy]: now there's a vocabulary here of roughly 200,000 possible tokens and then these are the token IDs corresponding to all these little text chunks that are part of my query and you can play with this and update it and you can see that for example this is case sensitive you would get different tokens and you can kind of edit it and see live how the token sequence changes

[译文] [Andrej Karpathy]: 这里有一个大约包含 200,000 个可能 Token 的词表,然后这些是对应于我查询中所有这些小文本块的 Token ID。你可以玩一下这个,更新一下,你会看到例如这是大小写敏感的,你会得到不同的 Token,你可以通过编辑它来实时查看 Token 序列是如何变化的。

[原文] [Andrej Karpathy]: so our query was 15 tokens and then the model response is right here and it responded back to us with a sequence of exactly 19 tokens so that haiku is this sequence of 19 tokens now so we said 15 tokens and it said 19 tokens back

[译文] [Andrej Karpathy]: 所以我们的查询是 15 个 Token,然后模型的响应就在这里,它回复给我们一个正好 19 个 Token 的序列。所以那首俳句就是这 19 个 Token 的序列。我们说了 15 个 Token,它回了 19 个 Token。

[原文] [Andrej Karpathy]: now because this is a conversation and we want to actually maintain a lot of the metadata that actually makes up a conversation object this is not all that's going on under the hood and we saw in the previous video a little bit about the conversation format so it gets a little bit more complicated in that we have to take our user query and we have to actually use this chat format

[译文] [Andrej Karpathy]: 因为这是一个对话,我们实际上想要保留许多构成对话对象的元数据,所以底层发生的事情不仅仅是这些。我们在之前的视频中稍微看了一下对话格式,它稍微复杂一点,因为我们要处理用户查询,并且必须实际使用这种聊天格式。

[原文] [Andrej Karpathy]: so let me delete the system message I don't think it's very important for the purposes of understanding what's going on let me paste my message as the user and then let me paste the model response as an assistant and then let me crop it here properly the tool doesn't do that properly

[译文] [Andrej Karpathy]: 让我删除系统消息,我认为对于理解正在发生的事情来说它不是很重要。让我作为“用户”粘贴我的消息,然后作为“助手”粘贴模型的响应,然后让我在这里适当地裁剪一下,这个工具处理得不是很好。

[原文] [Andrej Karpathy]: so here we have it as it actually happens under the hood there are all these special tokens that basically say a message from the user begins then comes the content of what we said then the user's message ends and then the assistant's message begins and says this etc

[译文] [Andrej Karpathy]: 所以这就是底层实际发生的情况。这里有所有这些特殊的 Token,基本上用来标识“用户消息开始”,然后用户说(内容),这是我们说的内容,然后“用户结束”,接着“助手开始”并说这些内容,等等。

[原文] [Andrej Karpathy]: now the precise details of the conversation format are not important what I want to get across here is that what looks to you and I like little chat bubbles going back and forth is under the hood us collaborating with the model we're both writing into a token stream and these two bubbles back and forth were a sequence of exactly 42 tokens under the hood I contributed some of the first tokens and then the model continued the sequence of tokens with its response and we could alternate and continue adding tokens here and together we are building out a token window a one-dimensional sequence of tokens

[译文] [Andrej Karpathy]: 对话格式的具体细节并不重要。我想在这里传达的是,在你我看来是来回的小聊天气泡,在底层其实是我们正在与模型协作,我们都在向一个 Token 流中写入内容。这两个来回的气泡在底层正好是一个 42 个 Token 的序列。我贡献了前面的一些 Token,然后模型用它的响应延续了这个 Token 序列。我们可以交替进行,继续在这里添加 Token,我们共同构建了一个 Token 窗口(Token Window),一个一维的 Token 序列。
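上文说的“来回的聊天气泡其实是同一条一维 Token 流”,可以用几行示意代码来表达。注意:下面的 <|im_start|>/<|im_end|> 分隔符借用了 ChatML 风格的写法,仅作示意;正如视频所说,各家模型实际使用的特殊 Token 细节并不重要:

```python
# 示意:把多轮对话压平成一条连续的文本流(真实系统中是 Token 流)。
def render_chat(messages):
    parts = []
    for m in messages:
        # 每条消息由“角色开始”“内容”“消息结束”三段拼接而成
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # 把控制权交给模型,等它续写
    return "".join(parts)

convo = [
    {"role": "user", "content": "write a haiku about being an LLM"},
    {"role": "assistant", "content": "words flow like a stream..."},
    {"role": "user", "content": "another one, please"},
]
print(render_chat(convo))
```

打印出来就能看到:用户和助手的气泡被首尾相接地写进同一条流里,最后悬着一个未闭合的 assistant 开始符,等待模型补全。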

[原文] [Andrej Karpathy]: okay so let's come back to ChatGPT now what we are seeing here is kind of like little bubbles going back and forth between us and the model under the hood we are building out a one-dimensional token sequence when I click new chat here that wipes the token window that resets the tokens to basically zero again and restarts the conversation from scratch

[译文] [Andrej Karpathy]: 好的,让我们回到 ChatGPT。我们在这里看到的是有点像我们和模型之间来回的小气泡,但在底层我们正在构建一个一维的 Token 序列。当我在这里点击“新对话(New Chat)”时,这就擦除了 Token 窗口,基本上将 Token 重置为零,并从头开始重新开始对话。

[原文] [Andrej Karpathy]: now the cartoon diagram that I have in my mind when I'm speaking to a model looks something like this when we click new chat we begin a token sequence so this is a one-dimensional sequence of tokens as the user we can write tokens into this stream and then when we hit enter we transfer control over to the language model and the language model responds with its own token stream and then the language model has a special token that basically says something along the lines of I'm done so when it emits that token the ChatGPT application transfers control back to us and we can take turns together we are building out the token stream which we also call the context window

[译文] [Andrej Karpathy]: 当我与模型交谈时,我脑海中的卡通示意图大致是这样的:当我们点击“新对话”时,我们开始一个新的 Token 序列,这是一个一维的 Token 序列。作为用户,我们可以将 Token 写入这个流中,当我们按下回车键时,我们将控制权转移给语言模型,语言模型用它自己的 Token 流进行响应。然后语言模型有一个特殊的 Token,基本上是在说“我讲完了”。当它发出那个 Token 时,ChatGPT 应用程序将控制权转回给我们,我们可以轮流进行。我们一起构建了这个 Token 流,我们也称之为“上下文窗口(Context Window)”。

[原文] [Andrej Karpathy]: so the context window is kind of like this working memory of tokens and anything that is inside this context window is kind of like in the working memory of this conversation and is very directly accessible by the model

[译文] [Andrej Karpathy]: 所以上下文窗口就像是 Token 的工作记忆(Working Memory),任何在这个上下文窗口内的东西都像是这个对话的工作记忆,并且可以被模型非常直接地访问。
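把“上下文窗口 = 工作记忆,新建对话 = 清空”这个心智模型写成示意代码大致如下(Token 以占位整数代替,结束符等细节均为示意性假设):

```python
# 示意:上下文窗口就是一个不断增长的一维 Token 列表,
# “新建对话”只是把它清空。Token 在这里用占位整数代替。
context = []

def user_turn(tokens):
    # 用户写入 Token,然后把控制权交给模型
    context.extend(tokens)

def model_turn(tokens):
    # 模型续写自己的 Token,末尾的 0 代表“我说完了”的特殊 Token(示意)
    context.extend(tokens + [0])

user_turn(list(range(1, 16)))    # 用户的 15 个 Token
model_turn(list(range(16, 35)))  # 模型回复的 19 个 Token
print(len(context))              # → 35(15 + 19 + 1 个结束符)

context.clear()                  # 点击 New chat:上下文窗口清零
assert context == []
```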

[原文] [Andrej Karpathy]: now what is this entity here that we are talking to and how should we think about it well for this language model we saw in the previous video how it is trained and there are two major stages the pre-training stage and the post-training stage

[译文] [Andrej Karpathy]: 那么,我们正在与之交谈的这个实体是什么?我们应该如何看待它?我们在之前的视频中看到了这个语言模型的训练方式,有两个主要阶段:预训练(Pre-training)阶段和后训练(Post-training)阶段。

[原文] [Andrej Karpathy]: the pre-training stage is kind of like taking all of the internet chopping it up into tokens and then compressing it into a single kind of zip file but the zip file is not exact the zip file is a lossy and probabilistic zip file because we can't possibly represent all of the internet in just one say terabyte of zip file there's just way too much information so we just kind of get the gestalt or the vibes inside this zip file

[译文] [Andrej Karpathy]: 预训练阶段有点像把整个互联网切分成 Token,然后把它压缩成一个单独的、类似 Zip 文件(压缩包)的东西。但这个 Zip 文件不是精确的,它是一个有损的、概率性的 Zip 文件。因为我们不可能只用比如说 1TB 的 Zip 文件来表示整个互联网,信息量太大了,所以我们只是在这个 Zip 文件里保留了大概的意思或“氛围(Vibes)”。

[原文] [Andrej Karpathy]: now what's actually inside the zip file are the parameters of a neural network and so for example a one-terabyte zip file would correspond to roughly say one trillion parameters inside this neural network

[译文] [Andrej Karpathy]: 实际上在这个 Zip 文件里面的是神经网络的参数。举个例子,一个 1TB 的 Zip 文件大致对应于这个神经网络内部的 1 万亿个参数。
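这个换算可以粗略验证:假设每个参数约占 1 字节(例如 8-bit 量化,这只是一个示意性的精度假设;常见的 16-bit 权重则要翻倍),1 万亿参数正好对应约 1TB:

```python
# 粗略估算:1 万亿参数 × 每参数约 1 字节 ≈ 1 TB(假设 8-bit 权重)
n_params = 1_000_000_000_000    # 一万亿(1e12)个参数
bytes_per_param = 1             # 8-bit 量化权重,属于示意性假设
total_bytes = n_params * bytes_per_param
print(total_bytes / 1e12, "TB")  # → 1.0 TB
```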

[原文] [Andrej Karpathy]: and what this neural network is trying to do is it's trying to basically take tokens and predict the next token in a sequence but it's doing that on internet documents so it's kind of like this internet document generator right

[译文] [Andrej Karpathy]: 这个神经网络试图做的是,它基本上试图获取 Token,并试图预测序列中的下一个 Token。但它是在互联网文档上做这件事,所以它有点像是一个互联网文档生成器,对吧。

[原文] [Andrej Karpathy]: and in the process of predicting the next token in a sequence on the internet the neural network gains a huge amount of knowledge about the world and this knowledge is all represented and stuffed and compressed inside the roughly one trillion parameters of this language model

[译文] [Andrej Karpathy]: 在预测互联网序列上下一个 Token 的过程中,神经网络获得了关于世界的大量知识。这些知识都被表示、填充并压缩在这个语言模型大约 1 万亿个参数中。

[原文] [Andrej Karpathy]: now this pre-training stage as we also saw is fairly costly so this can be many tens of millions of dollars say like three months of training and so on so this is a costly long phase for that reason this phase is not done that often

[译文] [Andrej Karpathy]: 我们也看到这个预训练阶段是相当昂贵的,可能需要数千万美元,比如说三个月的训练时间等等。这是一个昂贵且漫长的阶段,正因如此,这个阶段并不会经常进行。

[原文] [Andrej Karpathy]: so for example GPT-4o this model was pre-trained probably many months ago maybe even a year ago by now and so that's why these models are a little bit out of date they have what's called a knowledge cutoff because that knowledge cutoff corresponds to when the model was pre-trained and its knowledge only goes up to that point

[译文] [Andrej Karpathy]: 比如 GPT-4o,这个模型可能是在好几个月前预训练的,甚至可能是一年前了。这就是为什么这些模型有点过时,它们有一个所谓的“知识截止日期(Knowledge Cutoff)”,因为那个知识截止日期对应于模型进行预训练的时间,它的知识只到那个点为止。

[原文] [Andrej Karpathy]: now some knowledge can come into the model through the post-training phase which we'll talk about in a second but roughly speaking you should think of these models as kind of a little bit out of date because pre-training is way too expensive and happens infrequently

[译文] [Andrej Karpathy]: 一些知识可以通过后训练阶段进入模型,我们马上会讲到,但粗略地说,你应该把这些模型看作是有点过时的,因为预训练太昂贵了,而且发生得很不频繁。

[原文] [Andrej Karpathy]: so for any kind of recent information like if you wanted to talk to your model about something that happened last week or so on we're going to need other ways of providing that information to the model because it's not stored in the knowledge of the model so we're going to have various kinds of tool use to give that information to the model

[译文] [Andrej Karpathy]: 所以任何最近的信息,比如如果你想和你的模型谈论上周发生的事情等等,我们将需要其他方式向模型提供这些信息,因为它没有存储在模型的知识库中,所以我们将使用各种工具来把这些信息提供给模型。

[原文] [Andrej Karpathy]: now after pre-training there's a second stage called post-training and the post-training stage is really attaching a smiley face to this zip file because we don't want to generate internet documents we want this thing to take on the persona of an assistant that responds to user queries and that's done in the process of post-training where we swap out the dataset for a dataset of conversations that are built out by humans

[译文] [Andrej Karpathy]: 在预训练之后,还有第二个阶段叫做后训练(Post-training)。后训练阶段实际上就是给这个 Zip 文件贴上一个“笑脸”,因为我们不想生成互联网文档,我们希望这个东西呈现出一个能够回答用户查询的助手的“人设(Persona)”。这是在后训练过程中完成的,我们将数据集换成了一个由人类构建的对话数据集。

[原文] [Andrej Karpathy]: so this is basically where the model takes on this persona so that we can ask questions and it responds with answers so it takes on the style of an assistant that's post-training but it has the knowledge of all of the internet and that's from pre-training so these two are combined in this artifact

[译文] [Andrej Karpathy]: 这基本上就是模型获得这种人设的地方,这样我们就可以提问,它就会回答。所以它呈现出了助手的风格,这是后训练赋予的;但它拥有整个互联网的知识,那是预训练赋予的。所以这两者在这个产物中结合在了一起。

[原文] [Andrej Karpathy]: now the important thing to understand here I think for this section is that what you are talking to is by default a fully self-contained entity think of this language model as a one-terabyte file on a disk that secretly represents one trillion parameters and their precise settings inside the neural network that's trying to give you the next token in the sequence

[译文] [Andrej Karpathy]: 我认为这一节要理解的重要一点是,默认情况下,你正在与之交谈的是一个完全独立的实体。把这个语言模型想象成一个静静躺在磁盘上的 1TB 文件,它代表了神经网络内部的 1 万亿个参数及其精确设置,试图给你序列中的下一个 Token。

[原文] [Andrej Karpathy]: but this is a fully self-contained entity there's no calculator there's no computer and Python interpreter there's no worldwide web browsing there's none of that there's no tool use yet in what we've talked about so far you're talking to a zip file if you stream tokens to it it will respond with tokens back and this zip file has the knowledge from pre-training and it has the style and form from post-training and so that's roughly how you can think about this entity

[译文] [Andrej Karpathy]: 但这是一个完全独立的实体。没有计算器,没有计算机和 Python 解释器,没有万维网浏览,这些都没有。在我们目前讨论的内容中还没有工具的使用。你正在和一个 Zip 文件交谈,如果你向它传输 Token,它会用 Token 回复你。这个 Zip 文件拥有来自预训练的知识,以及来自后训练的风格和形式。大致上这就是你可以如何看待这个实体。


章节 3:基础交互与幻觉应对:以知识检索为例

📝 本节摘要

在本节中,Andrej 用一个生动的“第一人称”独白总结了 LLM 的本质:一个存储了过时互联网知识的压缩文件,披着一层由人类训练出的友好外衣。随后,他通过查询“美式咖啡咖啡因含量”和“感冒药成分”两个真实生活案例,展示了如何在不使用联网工具的情况下利用模型的基础知识库。他特别强调,虽然模型对常见信息(如常用药成分)的记忆通常很准确,但这本质上是概率性的统计回忆,因此必须警惕“幻觉”,对于关键信息(如药物成分),务必对照实物或一手来源进行核实。

[原文] [Andrej Karpathy]: so if I had to summarize what we talked about so far I would probably do it in the form of an introduction of ChatGPT in a way that I think you should think about it so the introduction would be hi I'm ChatGPT I am a one-terabyte zip file my knowledge comes from the internet which I read in its entirety about six months ago and I only remember vaguely okay and my winning personality was programmed by example by human labelers at OpenAI

[译文] [Andrej Karpathy]: 如果非要总结我们目前所谈论的内容,我可能会以 ChatGPT 自我介绍的形式来表达,这也是我认为你应该如何看待它的方式。这个介绍会是这样的:“嗨,我是 ChatGPT。我是一个 1TB 的 Zip 文件。我的知识来自互联网,我在大约六个月前完整地阅读了它,但我只记得个大概,好吗?而我那迷人的个性是由 OpenAI 的人类标注员通过示例编程设定的。”

[原文] [Andrej Karpathy]: so the personality is programmed in post-training and the knowledge comes from compressing the internet during pre-training and this knowledge is a little bit out of date and it's probabilistic and slightly vague some of the things that are probably mentioned very frequently on the internet I will have a lot better recollection of than some of the things that are discussed very rarely very similar to what you might expect with a human

[译文] [Andrej Karpathy]: 所以个性是在后训练阶段编程的,而知识来自于预训练期间对互联网的压缩。这些知识有点过时,而且是概率性的、稍微模糊的。对于那些在互联网上被频繁提及的事物,我的记忆会比那些极少被讨论的事物要好得多,这与你对人类的期望非常相似。

[原文] [Andrej Karpathy]: so let's now talk about some of the repercussions of this entity and how we can talk to it and what kinds of things we can expect from it now I'd like to use real examples when we actually go through this so for example this morning I asked ChatGPT the following how much caffeine is in one shot of Americano and I was curious because I was comparing it to matcha

[译文] [Andrej Karpathy]: 那么,让我们来谈谈这个实体带来的一些影响,我们如何与它交谈,以及我们可以从它那里期待什么。现在我想在讲解过程中使用真实的例子。比如,今天早上我问了 ChatGPT 这个问题:“一杯美式咖啡里有多少咖啡因?”我很好奇,因为我当时正在拿它和抹茶做比较。

[原文] [Andrej Karpathy]: now ChatGPT will tell me that this is roughly 63 mg of caffeine or so now the reason I'm asking ChatGPT this question and I think this is okay is number one I'm not asking about any knowledge that is very recent so I do expect that the model has sort of read about how much caffeine there is in one shot I don't think this information has changed too much

[译文] [Andrej Karpathy]: ChatGPT 告诉我这大约是 63 毫克的咖啡因。我之所以问 ChatGPT 这个问题,并且认为这没问题,原因第一是:我问的不是任何非常新的知识。所以我确实期望模型已经读到过关于一杯浓缩咖啡里有多少咖啡因的内容,我不认为这个信息会有太大的变化。

[原文] [Andrej Karpathy]: and number two I think this information is extremely frequent on the internet this kind of a question and this kind of information has occurred all over the place on the internet and because there were so many mentions of it I expect the model to have good memory of it in its knowledge so there's no tool use and the model the zip file responded that there's roughly 63 mg

[译文] [Andrej Karpathy]: 第二,我认为这个信息在互联网上出现的频率极高。这类问题和这类信息在互联网上到处都是。因为有这么多的提及,我期望模型在它的知识库中对它有很好的记忆。所以这里没有使用任何工具,模型——也就是那个 Zip 文件——回答说大约是 63 毫克。

[原文] [Andrej Karpathy]: now I'm not guaranteed that this is the correct answer this is just its vague recollection of the internet but I can go to primary sources and maybe I can look up okay caffeine in an Americano and I could verify that yeah it looks to be about 63 that's roughly right and you can look at primary sources to decide if this is true or not so I'm not strictly speaking guaranteed that this is true but I think probably this is the kind of thing that ChatGPT would know

[译文] [Andrej Karpathy]: 我不能保证这是正确答案,这只是它对互联网的模糊回忆。但我可以去查一手资料,也许我可以查一下“咖啡因”和“美式咖啡”,我可以验证一下,是的,看起来 63 毫克大致是正确的。你可以查看一手资料来决定这是否属实。虽然严格来说我不能保证这是真的,但我认为这可能是 ChatGPT 会知道的事情。

[原文] [Andrej Karpathy]: here's an example of a conversation I had two days ago actually and this is another example of a knowledge-based conversation and things that I'm comfortable asking of ChatGPT with some caveats so I'm a bit sick I have a runny nose and I want to get meds that help with that so it told me a bunch of stuff and I want my nose to not be runny so I gave it a clarification based on what it said and then it kind of gave me some of the things that might be helpful with that

[译文] [Andrej Karpathy]: 这是我两天前进行的一个对话例子,这是另一个基于知识的对话例子,也是我在有一些注意事项的前提下乐意问 ChatGPT 的事情。我有点不舒服,流鼻涕,我想找点药来缓解。所以它告诉我一堆东西,我希望不再流鼻涕,所以我根据它所说的做了一个澄清,然后它给了我一些可能对此有帮助的东西。

[原文] [Andrej Karpathy]: and then I looked at some of the meds that I have at home and I said does DayQuil or NyQuil work and it went off and it kind of went over the ingredients of DayQuil and NyQuil and whether or not they helped mitigate runny nose

[译文] [Andrej Karpathy]: 然后我看了看家里的一些药,我问:“DayQuil(日用感冒药)或 NyQuil(夜用感冒药)有用吗?”它就开始列举 DayQuil 和 NyQuil 的成分,以及它们是否有助于缓解流鼻涕。

[原文] [Andrej Karpathy]: now when these ingredients are coming out here again remember we are talking to a zip file that has a recollection of the internet I'm not guaranteed that these ingredients are correct and in fact I actually took out the box and I looked at the ingredients and I made sure that NyQuil's ingredients are exactly these ingredients and I'm doing that because I don't always fully trust what's coming out here right this is just a probabilistic statistical recollection of the internet

[译文] [Andrej Karpathy]: 当这些成分显示出来时,再次记住,我们是在和一个拥有互联网记忆的 Zip 文件交谈。我不能保证这些成分是正确的。事实上,我确实拿出了盒子,查看了成分表,确认 NyQuil 的成分确实就是这些。我这样做是因为我不总是完全信任这里输出的内容,对吧?这只是对互联网的一种概率性的、统计性的回忆。

[原文] [Andrej Karpathy]: but that said for conversations about DayQuil and NyQuil these are very common meds probably there's tons of information about a lot of this on the internet and this is the kind of thing that the model has pretty good recollection of so actually these were all correct

[译文] [Andrej Karpathy]: 但话虽如此,关于 DayQuil 和 NyQuil 的对话,这些是非常常见的药物,互联网上可能关于这些的信息多如牛毛,这是模型会有相当好回忆的那类事情。所以实际上,这些都是正确的。

[原文] [Andrej Karpathy]: and then I said okay well I have NyQuil how fast would it act roughly and it kind of tells me and that's basically it so this is a good example of how ChatGPT was useful to me it is a knowledge-based query this knowledge sort of isn't recent knowledge this is all coming from the knowledge of the model I think this is common information this is not a high-stakes situation I'm checking ChatGPT a little bit but also this is not a high-stakes situation so no big deal so I popped a NyQuil and indeed it helped but that's roughly how I'm thinking about what's going on here

[译文] [Andrej Karpathy]: 然后我说:“好的,我有 NyQuil,大概多久会起效?”它告诉了我,基本上就是这样。所以这是一个 ChatGPT 对我有用的好例子。这是一个基于知识的查询,这个知识不是最近的知识,这都来自于模型的知识库。我认为这是常见信息,这不是一个高风险的情况。我稍微检查了一下 ChatGPT,但这也不是什么生死攸关的大事,所以没什么大不了的。所以我吃了一片 NyQuil,确实有效。这就是我对这里发生的事情的大致思考方式。


章节 4:最佳实践:上下文管理与模型选择策略

📝 本节摘要

在这一节中,Andrej 分享了两个使用 LLM 的关键建议。首先,他强调了“定期新建对话”的重要性。因为过长的对话会堆积 Token,不仅会让模型“分心”降低准确率,还会因为计算量增加而变慢。其次,他提醒用户要时刻留意自己正在使用的模型版本(如 GPT-4o 与 GPT-4o mini 的区别),并根据需求选择合适的付费层级。最后,他提出了“LLM 顾问团(LLM Council)”的概念,即遇到重要问题时,会同时咨询 OpenAI、Anthropic、Google 和 xAI 的顶级模型,综合各方意见做出判断。

[原文] [Andrej Karpathy]: okay so at this point I want to make two notes the first note I want to make is that naturally as you interact with these models you'll see that your conversations are growing longer right anytime you are switching topic I encourage you to always start a new chat

[译文] [Andrej Karpathy]: 好的,讲到这里我想做两点说明。第一点是,当你与这些模型交互时,你会自然地发现你的对话变得越来越长,对吧?无论何时当你切换话题时,我鼓励你总是开始一个新的对话(New Chat)。

[原文] [Andrej Karpathy]: when you start a new chat as we talked about you are wiping the context window of tokens and resetting it back to zero if it is the case that those tokens are not any more useful to your next query I encourage you to do this because these tokens in this window are expensive and they're expensive in kind of like two ways

[译文] [Andrej Karpathy]: 正如我们讨论过的,当你开始一个新对话时,你是在擦除上下文窗口中的 Token,将其重置回零。如果那些 Token 对你的下一个查询不再有用,我鼓励你这样做,因为这个窗口里的 Token 是昂贵的,而且这种昂贵体现在两个方面。

[原文] [Andrej Karpathy]: number one if you have lots of tokens here then the model can actually find it a little bit distracting so if this was a lot of tokens this is kind of like the working memory of the model and the model might be distracted by all the tokens in the past when it is trying to sample tokens much later on so it could be distracting and it could actually decrease the accuracy of the model and of its performance

[译文] [Andrej Karpathy]: 第一,如果你这里有很多 Token,模型实际上会觉得有点分心。如果 Token 很多——这有点像是模型的工作记忆——当模型试图在很后面进行采样时,它可能会被过去所有的 Token 分散注意力。所以这可能会导致分心,实际上可能会降低模型的准确性和性能。

[原文] [Andrej Karpathy]: and number two the more tokens are in the window the more expensive it is by a little bit not by too much but by a little bit to sample the next token in the sequence so your model is actually slightly slowing down it's becoming more expensive to calculate the next token the more tokens there are here

[译文] [Andrej Karpathy]: 第二,窗口里的 Token 越多,采样序列中下一个 Token 的成本就会稍微高一点点,虽然不是多很多,但确实会高一点。所以你的模型实际上会稍微变慢,随着这里 Token 的增加,计算下一个 Token 的成本会变得更高。
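“Token 越多,生成下一个 Token 越贵”可以用一个极简的成本模型来示意:即便缓存了前文的注意力结果(KV cache),生成每个新 Token 仍要对窗口里的全部 Token 做一次注意力,因此单 Token 成本随窗口长度近似线性增长,整段对话的总成本则近似二次增长(以下纯属示意,忽略所有常数项):

```python
# 示意性的成本模型:生成一个新 Token 需要注意前文的每一个 Token。
def per_token_cost(context_len):
    # 单个 Token 的生成成本与当前窗口长度成正比(忽略常数项)
    return context_len

def total_cost(n_tokens):
    # 依次生成 n 个 Token 的总成本:1 + 2 + ... + n,近似二次增长
    return sum(per_token_cost(i) for i in range(1, n_tokens + 1))

assert per_token_cost(2000) == 2 * per_token_cost(1000)  # 单 Token:线性
assert total_cost(200) > 3.9 * total_cost(100)           # 总成本:近似二次
```

这也解释了为什么对话越长,模型响应越慢,以及为什么切换话题时清空窗口是划算的。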

[原文] [Andrej Karpathy]: and so think of the tokens in the context window as a precious resource think of it as the working memory of the model and don't overload it with irrelevant information and keep it as short as you can and you can expect that to work faster and slightly better

[译文] [Andrej Karpathy]: 所以要把上下文窗口里的 Token 看作一种宝贵的资源,把它看作模型的工作记忆,不要用不相关的信息让它过载,尽可能保持简短,这样你可以预期它运行得更快,效果也稍微好一些。

[原文] [Andrej Karpathy]: of course if the information actually is related to your task you may want to keep it in there but I encourage you to as often as you can basically start a new chat whenever you are switching topic

[译文] [Andrej Karpathy]: 当然,如果这些信息确实与你的任务相关,你可能希望把它保留在那里。但我鼓励你尽可能经常地——基本上只要你切换话题,就开始一个新的对话。

[原文] [Andrej Karpathy]: the second thing is that I always encourage you to keep in mind what model you are actually using so here in the top left we can drop down and we can see that we are currently using GPT-4o now there are many different models of many different flavors and there are too many actually but we'll go through some of these over time so we are using GPT-4o right now and in everything that I've shown you this was GPT-4o

[译文] [Andrej Karpathy]: 第二件事是,我总是鼓励你记住你实际正在使用的是什么模型。在这里的左上角,我们可以下拉菜单,看到我们目前正在使用的是 GPT-4o。现在有许多不同口味的不同模型,实际上太多了,但我们会随着时间的推移介绍其中的一些。所以我们现在用的是 GPT-4o,我向你们展示的所有内容都是基于 GPT-4o 的。

[原文] [Andrej Karpathy]: now when I open a new incognito window so if I go to chatgpt.com and I'm not logged in the model that I'm talking to here if I just say hello might not be GPT-4o it might be a smaller version now unfortunately OpenAI does not tell me when I'm not logged in what model I'm using which is kind of unfortunate but it's possible that you are using a smaller kind of dumber model

[译文] [Andrej Karpathy]: 如果我打开一个新的隐身窗口,去 chatgpt.com 并且没有登录,我在这里与之交谈的模型——如果我只是说“你好”——我在这里与之交谈的模型可能不是 GPT-4o,它可能是一个更小的版本。不幸的是,OpenAI 没有告诉我当我不登录时我正在使用什么模型,这有点遗憾,但你可能正在使用一个更小的、稍微笨一点的模型。

[原文] [Andrej Karpathy]: so if we go to the ChatGPT pricing page here we see that they have three basic tiers for individuals the Free Plus and Pro tiers and in the free tier you have access to what's called GPT-4o mini and this is a smaller version of GPT-4o it is a smaller model with a smaller number of parameters it's not going to be as creative its writing might not be as good its knowledge is not going to be as good it's going to probably hallucinate a bit more et cetera

[译文] [Andrej Karpathy]: 如果我们去 ChatGPT 的定价页面,我们会看到他们针对个人有三个基本层级:免费版(Free)、Plus 版和 Pro 版。在免费层级中,你可以使用所谓的 GPT-4o mini,这是 GPT-4o 的一个较小版本。它是一个参数量较少的小模型,它不会那么有创造力,比如它的写作可能没那么好,它的知识可能没那么丰富,它可能会产生更多的幻觉等等。

[原文] [Andrej Karpathy]: but it is kind of like the free offering the free tier they do say that you have limited access to 4o and o3-mini but I'm not actually 100% sure it didn't tell us which model we were using so we just fundamentally don't know

[译文] [Andrej Karpathy]: 这就是免费提供的产品。在免费层级,他们确实说你可以有限地访问 4o 和 o3-mini,但我实际上也不是 100% 确定,因为它没有告诉我们正在使用的是哪个模型,所以我们根本不知道。

[原文] [Andrej Karpathy]: now when you pay for $20 per month even though it doesn't say this I think basically they're screwing up on how they're describing this but if you go to the fine print limits apply we can see that Plus users get 80 messages every 3 hours for GPT-4o so that's the flagship biggest model that's currently available as of today and that's what we want to be using so if you pay $20 per month you have that with some limits

[译文] [Andrej Karpathy]: 当你支付每月 20 美元时——尽管它没这么写,我觉得他们在描述这个方面搞砸了——但如果你去看细则中的“限制适用(limits apply)”,我们可以看到 Plus 用户每 3 小时可以发送 80 条 GPT-4o 消息。那是截至今天可用的旗舰级、最大的模型,那也是我们想要使用的。所以如果你每月付 20 美元,你就可以在一定限制下使用它。

[原文] [Andrej Karpathy]: and then if you pay $200 per month you get the Pro tier and there's a bunch of additional goodies as well as unlimited GPT-4o and we're going to go into some of this because I do pay for the Pro subscription

[译文] [Andrej Karpathy]: 然后如果你支付每月 200 美元,你就可以获得 Pro 版,这里有一堆额外的好东西,以及不限量的 GPT-4o。我们会深入探讨其中的一些,因为我确实购买了 Pro 订阅。

[原文] [Andrej Karpathy]: now the whole takeaway I want you to get from this is be mindful of the models that you're using typically with these companies the bigger models are more expensive to uh calculate and so therefore uh the companies charge more for the bigger models and so make those tradeoffs for yourself depending on your usage of llms

[译文] [Andrej Karpathy]: 我想让你从这里得到的整个要点是:留意你正在使用的模型。通常对于这些公司来说,更大的模型计算成本更高,因此公司会对更大的模型收取更高的费用。所以你需要根据你使用 LLM 的情况自己进行权衡。

[原文] [Andrej Karpathy]: um have a look at whether you can get away with the cheaper offerings and if the intelligence is not good enough for you and you're using this professionally you may really want to consider paying for the top tier models that are available from these companies

[译文] [Andrej Karpathy]: 看看你是否可以用更便宜的产品凑合。如果它的智能程度对你来说不够,而且你是用于专业用途,你可能真的要考虑付费使用这些公司提供的顶级模型。

[原文] [Andrej Karpathy]: in my case in my professional work I do a lot of coding and a lot of things like that and this is still very cheap for me so I pay this very gladly uh because I get access to some really powerful models that I'll show you in a bit um so yeah keep track of what model you're using and make those decisions for yourself

[译文] [Andrej Karpathy]: 就我而言,在我的专业工作中,我做很多编程和类似的事情,这对我也依然非常便宜,所以我非常乐意支付这笔费用,因为我可以访问一些真正强大的模型,我稍后会展示给你们。所以,是的,跟踪你正在使用的模型,并为你自己做出决定。

[原文] [Andrej Karpathy]: I also want to show you that all the other LLM providers will all have different pricing tiers with different models at different tiers that you can pay for so for example if we go to Claude from Anthropic you'll see that I am paying for the Professional plan and that gives me access to Claude 3.5 Sonnet and if you are not paying for a Pro plan then probably you only have access to maybe Haiku or something like that um and so use the most powerful model that uh kind of like works for you

[译文] [Andrej Karpathy]: 我还想向你们展示,所有其他的 LLM 提供商也都有不同的定价层级,你可以在不同层级付费使用不同的模型。例如,如果我们去 Anthropic 的 Claude,你会看到我正在支付专业版计划,这让我可以使用 Claude 3.5 Sonnet。如果你没有支付专业版计划,那么可能你只能使用 Haiku 或类似的模型。所以请使用最适合你的强大模型。

[原文] [Andrej Karpathy]: here's an example of me using Claude a while back I was asking for just travel advice uh so I was asking for a cool city to go to and Claude told me that Zermatt in Switzerland is really cool so I ended up going there for a New Year's break following Claude's advice but this is just an example of another thing that I find these models pretty useful for which is travel advice and ideation and getting pointers that you can research further

[译文] [Andrej Karpathy]: 这是我前段时间使用 Claude 的一个例子。我当时在寻求旅行建议,我问有什么很酷的城市可以去,Claude 告诉我瑞士的采尔马特(Zermatt)真的很酷。所以我听从了 Claude 的建议,最终去那里度过了新年假期。这只是我觉得这些模型非常有用的另一个例子:旅行建议、构思以及获得你可以进一步研究的指引。

[原文] [Andrej Karpathy]: um here we also have an example of gemini.com so this is from Google I got Gemini's opinion on the matter and I asked it for a cool city to go to and it also recommended Zermatt so uh that was nice so I like to go between different models and ask them similar questions and see what they think

[译文] [Andrej Karpathy]: 这里我们也有一个 gemini.com 的例子,这是 Google 的产品。我征求了 Gemini 对这件事的意见,我问它有什么很酷的城市可以去,它也推荐了采尔马特。这很好。所以我喜欢在不同的模型之间切换,问它们类似的问题,看看它们是怎么想的。

[原文] [Andrej Karpathy]: and for Gemini also on the top left we also have a model selector so you can pay for the more advanced tiers and use those models same thing goes for Grok which just released we don't want to be asking Grok 2 questions because we know that Grok 3 is the most advanced model so I want to make sure that I pay enough such that I have Grok 3 access

[译文] [Andrej Karpathy]: 对于 Gemini,左上角也有一个模型选择器,你可以为更高级的层级付费并使用那些模型。刚发布的 Grok 也是一样,我们不想问 Grok 2 问题,因为我们知道 Grok 3 是最先进的模型,所以我想确保我付了足够的钱,以便我有 Grok 3 的访问权限。

[原文] [Andrej Karpathy]: um so for all these different providers find the one that works best for you experiment with different providers experiment with different pricing tiers for the problems that you are working on and often I end up personally just paying for a lot of them and then asking all of them the same question and I kind of refer to all these models as my LLM Council so they're kind of like the Council of language models

[译文] [Andrej Karpathy]: 所以对于所有这些不同的提供商,找到最适合你的那个。针对你正在处理的问题,尝试不同的提供商,尝试不同的定价层级。通常我个人最终会为它们中的很多付费,然后问它们所有同一个问题。我把所有这些模型称为我的“LLM 顾问团(LLM Council)”,它们就像是语言模型的委员会。

[原文] [Andrej Karpathy]: if I'm trying to figure out where to go on a vacation I will ask all of them and uh so you can also do that for yourself if that works for you

[译文] [Andrej Karpathy]: 如果我想弄清楚去哪里度假,我会问它们所有的模型。所以如果这对你有用的话,你也可以这样做。
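This "LLM Council" habit — fanning one question out to several models and comparing their answers side by side — can be sketched as a small script. The `ask_*` functions below are stand-ins invented for illustration; in practice each one would wrap that provider's own SDK and model names.

```python
# Minimal sketch of an "LLM Council": send one prompt to several models
# and collect every answer side by side. The ask_* functions are stubs;
# real code would call each provider's own API.

def ask_chatgpt(prompt: str) -> str:
    return f"[chatgpt] stub answer to: {prompt}"

def ask_claude(prompt: str) -> str:
    return f"[claude] stub answer to: {prompt}"

def ask_gemini(prompt: str) -> str:
    return f"[gemini] stub answer to: {prompt}"

COUNCIL = {
    "chatgpt": ask_chatgpt,
    "claude": ask_claude,
    "gemini": ask_gemini,
}

def ask_council(prompt: str) -> dict:
    """Ask every model on the council the same question."""
    return {name: ask(prompt) for name, ask in COUNCIL.items()}

if __name__ == "__main__":
    answers = ask_council("What is a cool European city for a winter break?")
    for name, answer in answers.items():
        print(f"{name}: {answer}")
```

The point of the pattern is simply that disagreements between the answers are a useful signal for what to research further.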


章节 5:思考型模型(Thinking Models):利用强化学习解决复杂问题

📝 本节摘要

在本节中,Andrej 介绍了 LLM 训练的第三个阶段:强化学习(Reinforcement Learning)。经过这一阶段训练的模型被称为“思考型模型(Thinking Models)”(如 OpenAI 的 o1 系列或 DeepSeek R1)。与普通模型不同,它们在回答前会生成一段“内心独白”,像人类一样尝试不同策略、回溯检查假设。Andrej 通过一个复杂的代码调试案例(梯度检查失败),展示了普通 GPT-4o 无法解决,而开启“深度思考”模式的模型(以及 DeepSeek R1)能成功找到 bug 的过程。他建议仅在处理复杂的数学、编程或逻辑难题时使用这类较慢且昂贵的模型。

[原文] [Andrej Karpathy]: okay the next topic I want to now turn to is that of thinking models quote unquote so we saw in the previous video that there are multiple stages of training pre-training goes to supervised fine-tuning goes to reinforcement learning

[译文] [Andrej Karpathy]: 好的,我现在想转向的下一个话题是所谓的“思考型模型(Thinking Models)”。我们在之前的视频中看到,训练有多个阶段:预训练(Pre-training),然后是监督微调(Supervised Fine-tuning),接着是强化学习(Reinforcement Learning)。

[原文] [Andrej Karpathy]: and reinforcement learning is where the model gets to practice um on a large collection of problems that resemble the practice problems in the textbook and it gets to practice on a lot of math and code problems um and in the process of reinforcement learning the model discovers thinking strategies that lead to good outcomes

[译文] [Andrej Karpathy]: 强化学习是模型在大量类似教科书练习题的问题上进行练习的阶段,它练习大量的数学和代码问题。在强化学习的过程中,模型会发现能够带来良好结果的思考策略。

[原文] [Andrej Karpathy]: and these thinking strategies when you look at them they very much resemble kind of the inner monologue you have when you go through problem solving so the model will try out different ideas uh it will backtrack it will revisit assumptions and it will do things like that

[译文] [Andrej Karpathy]: 当你观察这些思考策略时,它们非常像你在解决问题时进行的内心独白。模型会尝试不同的想法,它会回溯,它会重新审视假设,它会做诸如此类的事情。

[原文] [Andrej Karpathy]: now a lot of these strategies are very difficult to hardcode as a human labeler because it's not clear what the thinking process should be it's only in the reinforcement learning that the model can try out lots of stuff and it can find the thinking process that works for it with its knowledge and its capabilities so this is the third stage of uh training these models

[译文] [Andrej Karpathy]: 许多这类策略很难由人类标注员进行硬编码,因为不清楚思考过程应该是什么样的。只有在强化学习中,模型才能尝试大量的东西,并找到适合其知识和能力的思考过程。所以这是训练这些模型的第三个阶段。

[原文] [Andrej Karpathy]: this stage is relatively recent so only a year or two ago and all of the different LLM labs have been experimenting with these models over the last year and this is kind of like seen as a large breakthrough recently and here we looked at the paper from DeepSeek that was the first to uh basically talk about it publicly and they had a nice paper about incentivizing reasoning capabilities in LLMs via reinforcement learning so that's the paper that we looked at in the previous video

[译文] [Andrej Karpathy]: 这个阶段相对较新,也就是一两年前的事,所有的 LLM 实验室在过去一年里都在尝试这些模型,这被视为最近的一个重大突破。在这里我们参考了 DeepSeek 的论文,他们是第一个公开讨论这个问题的,他们有一篇关于通过强化学习激励 LLM 推理能力的很棒的论文,那就是我们在上一个视频中看过的论文。

[原文] [Andrej Karpathy]: so we now have to adjust our cartoon a little bit because uh basically what it looks like is our Emoji now has this optional thinking bubble and when you are using a thinking model which will do additional thinking you are using the model that has been additionally tuned with reinforcement learning

[译文] [Andrej Karpathy]: 所以我们现在必须稍微调整一下我们的卡通示意图,因为基本上看起来我们的表情符号现在有了一个可选的“思考气泡”。当你使用一个会进行额外思考的思考型模型时,你正在使用的是经过强化学习额外调优的模型。

[原文] [Andrej Karpathy]: and qualitatively what does this look like well qualitatively the model will do a lot more thinking and what you can expect is that you will get higher accuracies especially on problems that are for example math and code and things that require a lot of thinking things that are very simple uh might not actually benefit from this but things that are actually deep and hard might benefit a lot

[译文] [Andrej Karpathy]: 从性质上讲,这看起来像什么?嗯,模型会进行更多的思考,你可以预期你会获得更高的准确率,特别是在数学、代码和需要大量思考的问题上。像非常简单的事情可能实际上不会从中受益,但真正深刻和困难的事情可能会受益匪浅。

[原文] [Andrej Karpathy]: and so um but basically what you're paying for is that the models will do thinking and that can sometimes take multiple minutes because the models will emit tons and tons of tokens over a period of many minutes and you have to wait uh because the model is thinking just like a human would think but in situations where you have very difficult problems this might translate to higher accuracy

[译文] [Andrej Karpathy]: 基本上你为此付出的代价是模型会进行思考,这有时可能需要几分钟,因为模型会在几分钟内发出成吨的 Token。你必须等待,因为模型正在思考,就像人类思考一样。但在你有非常困难的问题的情况下,这可能会转化为更高的准确率。

[原文] [Andrej Karpathy]: so let's take a look at some examples so here's a concrete example when I was stuck on a programming problem recently so uh something called the gradient check fails and I'm not sure why and I copy pasted my code into the model uh so the details of the code are not important but this is basically um an optimization of a multilayer perceptron and the details are not important it's a bunch of code that I wrote and there was a bug because my gradient check didn't work and I was just asking for advice

[译文] [Andrej Karpathy]: 让我们看一些例子。这是一个具体的例子,最近我在一个编程问题上卡住了。有一个叫“梯度检查(Gradient Check)”的东西失败了,我不确定为什么。我复制粘贴了我的代码——代码的细节并不重要,这基本上是一个多层感知器的优化——总之是我写的一堆代码,里面有个 bug,因为我的梯度检查不起作用,我只是在寻求建议。
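For readers unfamiliar with the term: a gradient check compares the analytic gradient that backpropagation produces against a numerical estimate from centered finite differences, and a mismatch signals a bug. A minimal sketch with a toy sum-of-squares "loss" (not the MLP code from the video, whose details aren't shown):

```python
# A gradient check compares the analytic gradient (from backprop) against
# a numerical estimate from centered finite differences. If they disagree,
# there is a bug somewhere. Toy example with a sum-of-squares "loss".

def loss(w):
    return sum(x * x for x in w)

def analytic_grad(w):
    # d/dx of x^2 is 2x; in a real model this would come from backprop.
    return [2 * x for x in w]

def numerical_grad(f, w, eps=1e-5):
    # Centered finite differences, one coordinate at a time.
    grads = []
    for i in range(len(w)):
        w_plus = list(w); w_plus[i] += eps
        w_minus = list(w); w_minus[i] -= eps
        grads.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grads

def gradient_check(f, w, grad, tol=1e-4):
    num = numerical_grad(f, w)
    return all(abs(n - g) <= tol * (1 + abs(n) + abs(g))
               for n, g in zip(num, grad))

w = [0.5, -1.0, 2.0]
assert gradient_check(loss, w, analytic_grad(w))      # correct gradient passes
assert not gradient_check(loss, w, [0.0, 0.0, 0.0])   # buggy gradient fails
```

A bug like the one in the video — packing parameters in one order and unpacking them in another — makes the analytic and numerical gradients disagree in exactly this way.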

[原文] [Andrej Karpathy]: and GPT-4o which is the flagship most powerful model from OpenAI but without thinking uh just kind of like uh went into a bunch of uh things that it thought were issues or that I should double check but actually didn't really solve the problem like all of the things that it gave me here are not the core issue of the problem so the model didn't really solve the issue um and it tells me about how to debug it and so on

[译文] [Andrej Karpathy]: 而 GPT-4o,这是 OpenAI 目前最强大的旗舰模型,但在没有“思考”的情况下,它只是列出了一堆它认为是问题或我应该再次检查的事情,但实际上并没有真正解决问题。它给我的所有这些都不是问题的核心所在,所以模型并没有真正解决问题,它只是告诉我如何调试等等。

[原文] [Andrej Karpathy]: but then what I did was here in the drop down I turned to one of the thinking models now for OpenAI all of these models that start with o are thinking models o1 o3-mini o3-mini-high and o1 pro mode are all thinking models and uh they're not very good at naming their models uh but uh that is the case and so here they will say something like uses advanced reasoning or uh good at code and logic and stuff like that but these are basically all tuned with reinforcement learning

[译文] [Andrej Karpathy]: 但后来我在下拉菜单中转向了一个思考型模型。对于 OpenAI,所有以“o”开头的模型都是思考型模型:o1、o3-mini、o3-mini high 和 o1 Pro mode 都是思考型模型。他们不太擅长给模型命名,但这确实是事实。这里它们会显示类似“使用高级推理”或“擅长代码和逻辑”之类的说明,但这些基本上都是用强化学习调优过的。

[原文] [Andrej Karpathy]: and because I am paying $200 per month I have access to o1 pro mode which is best at reasoning um but you might want to try some of the other ones depending on your pricing tier

[译文] [Andrej Karpathy]: 因为我每月支付 200 美元,所以我可以使用 o1 Pro 模式,这是推理能力最好的。但根据你的定价层级,你可能想试试其他的。

[原文] [Andrej Karpathy]: and when I gave the same prompt to o1 pro which is the best at reasoning model and you have to pay $200 per month for this one then the exact same prompt it went off and it thought for 1 minute and it went through a sequence of thoughts and OpenAI doesn't fully show you the exact thoughts they just kind of give you little summaries of the thoughts but it thought about the code for a while and then it actually came back with the correct solution it noticed that the parameters are mismatched in how I pack and unpack them etc so this actually solved my problem

[译文] [Andrej Karpathy]: 当我给 o1 Pro(那个最擅长推理且需要每月付 200 美元的模型)同样的提示时,它开始运行并思考了 1 分钟。它经历了一系列的思考过程。OpenAI 并没有完全向你展示确切的思考内容,他们只是给你一些思考的小摘要。但它思考了一会儿代码,然后实际上给出了正确的解决方案。它注意到参数不匹配以及我如何打包和解包它们等等,所以这实际上解决了我的问题。

[原文] [Andrej Karpathy]: and I tried out giving the exact same prompt to a bunch of other LLMs so for example Claude I gave Claude the same problem and it actually noticed the correct issue and solved it and it did that even with uh Sonnet which is not a thinking model so Claude 3.5 Sonnet to my knowledge is not a thinking model and to my knowledge Anthropic as of today doesn't have a thinking model deployed but this might change by the time you watch this video um but even without thinking this model actually solved the issue

[译文] [Andrej Karpathy]: 我试着把完全相同的提示给了一堆其他的 LLM。比如 Claude,我把同样的问题给了 Claude,它实际上注意到了正确的问题并解决了它。甚至用 Sonnet 也做到了,而 Sonnet 并不是一个思考型模型。据我所知,Claude 3.5 Sonnet 不是思考型模型,Anthropic 截至今天还没有部署思考型模型(但在你观看此视频时可能会改变)。但即使没有思考,这个模型实际上也解决了这个问题。

[原文] [Andrej Karpathy]: when I went to Gemini I asked it um and it also solved the issue even though I also could have tried a thinking model but it wasn't necessary I also gave it to Grok uh Grok 3 in this case and Grok 3 also solved the problem after a bunch of stuff um so it also solved the issue

[译文] [Andrej Karpathy]: 当我去问 Gemini 时,它也解决了这个问题,尽管我也可以尝试思考型模型,但这没有必要。我也把它给了 Grok,这里是 Grok 3,Grok 3 在处理了一堆东西后也解决了问题。

[原文] [Andrej Karpathy]: and then finally I went to uh perplexity.ai and the reason I like Perplexity is because when you go to the model dropdown one of the models that they host is this DeepSeek R1 so this has the reasoning with the DeepSeek R1 model which is the model that we saw uh over here uh this is the paper so Perplexity just hosts it and makes it very easy to use so I copy pasted it there and I ran it and uh I think they really render it terribly but down here you can see the raw thoughts of the model uh even though you have to expand them but you see like okay the user is having trouble with the gradient check and then it tries out a bunch of stuff and then it says but wait when they accumulate the gradients they're doing the thing incorrectly let's check the order the parameters are packed as this and then it notices the issue and then it kind of like um says that's a critical mistake and so it kind of like thinks through it and you have to wait a few minutes and then it also comes up with the correct answer

[译文] [Andrej Karpathy]: 最后我去了 perplexity.ai。我喜欢 Perplexity 的原因是当你去模型下拉菜单时,他们托管的模型之一是 DeepSeek R1。所以这拥有 DeepSeek R1 模型的推理能力,也就是我们在这边看到的那个模型(这篇论文提到的)。Perplexity 只是托管了它并让它很容易使用。所以我把它复制粘贴到那里运行。我觉得他们渲染得很糟糕,但在下面你可以看到模型的原始思考——尽管你必须展开它们。你会看到比如“好的,用户在梯度检查上有困难”,然后它尝试了一堆东西,然后说“等等,当他们累积梯度时,他们做错了,让我们检查一下参数打包的顺序”,然后它注意到了问题,然后说“这是一个关键错误”。所以它思考了一番,你必须等几分钟,然后它也给出了正确的答案。

[原文] [Andrej Karpathy]: so basically long story short what do I want to show you there exist a class of models that we call thinking models all the different providers may or may not have a thinking model these models are most effective for difficult problems in math and code and things like that and in those kinds of cases they can push up the accuracy of your performance

[译文] [Andrej Karpathy]: 所以基本上长话短说,我想向你们展示的是,存在一类我们称为“思考型模型”的模型。所有不同的提供商可能有也可能没有思考型模型。这些模型在数学和代码等困难问题上最有效,在这些情况下,它们可以提高你的表现准确率。

[原文] [Andrej Karpathy]: in many cases like if you're asking for travel advice or something like that you're not going to benefit out of a thinking model there's no need to wait for one minute for it to think about uh some destinations that you might want to go to so for myself I usually try out the non-thinking models because their responses are really fast but when I suspect the response is not as good as it could have been and I want to give the opportunity to the model to think a bit longer about it I will change it to a thinking model depending on whichever one you have available to you

[译文] [Andrej Karpathy]: 在很多情况下,比如如果你在寻求旅行建议或类似的东西,你不会从思考型模型中受益,没有必要为了它思考你想去的目的地而等上一分钟。所以对我自己来说,我通常先试用非思考型模型,因为它们的反应非常快。但是当我怀疑回答不够好,我想给模型机会让它多思考一会儿时,我会切换到思考型模型,取决于你手头有什么可用的模型。
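This routine — default to the fast non-thinking model, escalate to a thinking model for hard math/code problems or when the first answer looks weak — amounts to a tiny routing rule. The model names and keyword list below are made up purely for illustration, not real API identifiers:

```python
# Toy router for the "fast model first, thinking model when needed" habit.
# Model names and keywords are illustrative placeholders.

HARD_HINTS = ("prove", "debug", "gradient", "algorithm", "math")

def pick_model(prompt: str, previous_answer_was_weak: bool = False) -> str:
    if previous_answer_was_weak:
        return "thinking-model"   # retry with the slow RL-tuned reasoner
    if any(hint in prompt.lower() for hint in HARD_HINTS):
        return "thinking-model"   # math/code/logic: thinking likely helps
    return "fast-model"           # travel advice etc.: a fast answer is fine

assert pick_model("cool city for a winter break?") == "fast-model"
assert pick_model("debug why my gradient check fails") == "thinking-model"
assert pick_model("cool city?", previous_answer_was_weak=True) == "thinking-model"
```

In practice the "routing" is done by hand via the model dropdown, but the decision logic is the same.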

[原文] [Andrej Karpathy]: now when you go to Grok for example when I start a new conversation with Grok um when you put the question here like hello you should put something important here you see here Think so let the model take its time so turn on Think and then click Go and when you click Think Grok under the hood switches to the thinking model and all the different LLM providers will kind of like have some kind of a selector for whether or not you want the model to think or whether it's okay to just like go um with the previous kind of generation of the models

[译文] [Andrej Karpathy]: 比如当你去 Grok,当我开始一个新的对话时,如果你在这里输入问题,你应该在这里注意一个重要的东西,你看到这里有个“思考(Think)”按钮。所以让模型从容一点,打开“思考”,然后点击开始。当你点击“思考”时,Grok 在底层会切换到思考型模型。所有不同的 LLM 提供商都会有某种选择器,让你选择是否希望模型进行思考,还是可以直接使用上一代那种直接回答的模型。


章节 6:工具使用(一):联网搜索与实时信息获取

📝 本节摘要

在这一节中,Andrej 引入了 LLM 的一个关键能力升级——“工具使用(Tool Use)”。他指出,单纯的 LLM 只是一个封闭的“Zip 文件”,无法获知训练截止日期之后的新闻。为了解决这个问题,模型需要能够连接互联网。他以查询美剧《白莲花度假村》第三季发布时间为例,对比了传统 Google 搜索(需人工逐个点击链接)与 LLM 搜索(自动抓取网页内容填入上下文窗口并生成答案)的区别。他还横向测评了 ChatGPT、Perplexity、Grok、Claude 和 Gemini 对联网搜索的支持情况,并列举了自己日常使用 Perplexity 进行的一系列时效性查询案例(如股市行情、产品发布传闻、突发新闻解读等)。

[原文] [Andrej Karpathy]: okay now the next section I want to continue with is tool use uh so far we've only talked to the language model through text and this language model is again this zip file in a folder it's inert it's closed off it's got no tools it's just um a neural network that can emit tokens

[译文] [Andrej Karpathy]: 好的,现在我想继续讲的下一部分是“工具使用(Tool Use)”。到目前为止,我们只通过文本与语言模型进行了交谈,而这个语言模型依然只是文件夹里的那个 Zip 文件。它是惰性的,它是封闭的,它没有任何工具,它只是一个能发出 Token 的神经网络。

[原文] [Andrej Karpathy]: so what we want to do now though is we want to go beyond that and we want to give the model the ability to use a bunch of tools and one of the most useful tools is an internet search and so let's take a look at how we can make models use internet search

[译文] [Andrej Karpathy]: 所以我们现在想做的是超越这一点,我们想给模型能力去使用一堆工具,而最有用的工具之一就是互联网搜索。所以让我们来看看如何让模型使用互联网搜索。

[原文] [Andrej Karpathy]: so for example again using uh concrete examples from my own life a few days ago I was watching White Lotus season 3 um and I watched the first episode and I love this TV show by the way and I was curious when the episode two was coming out

[译文] [Andrej Karpathy]: 再举个例子,用我生活中的具体例子。几天前我在看《白莲花度假村(White Lotus)》第三季,我看了第一集——顺便说一句,我很喜欢这部电视剧——我很好奇第二集什么时候出。

[原文] [Andrej Karpathy]: uh and so in the old world you would imagine you go to Google or something like that you put in like new episodes of White Lotus season 3 and then you start clicking on these links and maybe open a few of them or something like that right and you start like searching through it and trying to figure it out

[译文] [Andrej Karpathy]: 在旧世界里,你可以想象你会去 Google 或类似的地方,输入“白莲花第三季新剧集”,然后你开始点击这些链接,也许打开其中的几个或者类似的,对吧?你开始浏览搜索结果并试图弄清楚答案。

[原文] [Andrej Karpathy]: and sometimes you luck out and you get a schedule um but many times you might get really crazy ads there's a bunch of random stuff going on and it's just kind of like an unpleasant experience right

[译文] [Andrej Karpathy]: 有时你运气好会找到一个时间表,但在很多时候你可能会看到真正疯狂的广告,有一堆乱七八糟的东西,这只是一种不愉快的体验,对吧?

[原文] [Andrej Karpathy]: so wouldn't it be great if a model could do this kind of a search for you visit all the web pages and then take all those web pages take all their content and stuff it into the context window and then basically give you the response

[译文] [Andrej Karpathy]: 所以,如果一个模型能为你做这种搜索,访问所有的网页,然后把所有这些网页的内容都塞进上下文窗口,然后基本上给你回复,那岂不是很好吗?

[原文] [Andrej Karpathy]: and that's what we're going to do now basically we introduce a mechanism for the model to emit a special token that is some kind of a search-the-internet token

[译文] [Andrej Karpathy]: 这就是我们现在要做的。基本上我们引入了一种机制,让模型可以发出一个特殊的 Token,这是一种“搜索互联网”的 Token。

[原文] [Andrej Karpathy]: and when the model emits the search-the-internet token the ChatGPT application or whatever LLM application it is you're using will stop sampling from the model and it will take the query that the model gave it goes off it does a search it visits web pages it takes all of their text and it puts everything into the context window

[译文] [Andrej Karpathy]: 当模型发出这个“搜索互联网”的 Token 时,ChatGPT 应用程序或你正在使用的任何 LLM 应用程序将停止从模型中采样。它会获取模型给出的查询,然后去执行搜索,访问网页,获取它们所有的文本,并把所有东西都放入上下文窗口中。

[原文] [Andrej Karpathy]: so now you have this internet search tool that itself can also contribute tokens into our context window and in this case it would be like lots of internet web pages and maybe there's 10 of them and maybe it just puts it all together and this could be thousands of tokens coming from these web pages just as we were looking at them ourselves

[译文] [Andrej Karpathy]: 所以现在你有了一个互联网搜索工具,它本身也可以向我们的上下文窗口贡献 Token。在这种情况下,这就像是许多互联网网页,也许有 10 个,也许它只是把它们都放在一起。这可能来自这些网页的数千个 Token,就像我们自己正在浏览它们一样。

[原文] [Andrej Karpathy]: and then after it has inserted all those web pages into the context window it will reference back to your question as to hey when is this season getting released and it will be able to reference the text and give you the correct answer

[译文] [Andrej Karpathy]: 在它把所有这些网页插入到上下文窗口之后,它会回过头来参考你的问题,比如“嘿,这一季什么时候发布”,然后它就能参考这些文本并给你正确的答案。

[原文] [Andrej Karpathy]: and notice that this is a really good example of why we would need internet search without the internet search this model has no chance to actually give us the correct answer because like I mentioned this model was trained a few months ago the schedule probably was not known back then and so when White Lotus season 3 is coming out is not part of the real knowledge of the model and it's not in the zip file most likely

[译文] [Andrej Karpathy]: 注意,这是一个很好的例子,说明了为什么我们需要互联网搜索。没有互联网搜索,这个模型根本没机会给我们正确的答案。因为就像我提到的,这个模型是几个月前训练的,那时时间表可能还没出来。所以《白莲花》第三季什么时候出并不是模型真正知识的一部分,它很可能不在那个 Zip 文件里。

[原文] [Andrej Karpathy]: uh because this is something that was presumably decided on in the last few weeks and so the model has to basically go off and do internet search to learn this knowledge and it learns it from the web pages just like you and I would without it and then it can answer the question once that information is in the context window

[译文] [Andrej Karpathy]: 因为这可能是过去几周才决定的事情,所以模型基本上必须去进行互联网搜索来学习这些知识。它从网页中学习,就像你和我一样,一旦这些信息进入上下文窗口,它就可以回答这个问题了。

[原文] [Andrej Karpathy]: and remember again that the context window is this working memory so once we load the articles once all of these articles think of their text as being copy pasted into the context window now they're in working memory and the model can actually answer those questions because it's in the context window

[译文] [Andrej Karpathy]: 再次记住,上下文窗口就是工作记忆。所以一旦我们加载了文章——想象一下这些文章的文本被复制粘贴到了上下文窗口中——现在它们就在工作记忆里了,模型实际上就可以回答这些问题了,因为信息就在上下文窗口中。
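The loop described above — the model emits a special search token, the app intercepts it, stops sampling, runs the search, and pastes the page text into the context window before sampling resumes — can be sketched in a few lines. Everything here (`SEARCH_TOKEN`, `sample`, `web_search`, the stub schedule text) is an invented stand-in; real implementations live inside the provider's serving stack and differ in detail.

```python
# Toy sketch of the tool-use loop: the "model" emits a special search token,
# the app runs the search and stuffs the results into the context window,
# then sampling resumes. All names and data here are stand-ins.

SEARCH_TOKEN = "<search>"

def sample(context: str) -> str:
    # Pretend-model: if the schedule is already in the context window,
    # answer from it; otherwise emit the special token plus a query.
    if "episode 2 airs" in context:
        return "Per the schedule, episode 2 airs next Sunday."
    return SEARCH_TOKEN + " white lotus season 3 release schedule"

def web_search(query: str) -> str:
    # Pretend-search-tool: returns the text of some schedule page.
    return "schedule page: episode 2 airs next Sunday"

def chat(user_message: str, max_tool_calls: int = 3) -> str:
    context = user_message
    for _ in range(max_tool_calls):
        out = sample(context)
        if out.startswith(SEARCH_TOKEN):
            query = out[len(SEARCH_TOKEN):].strip()
            # Stop sampling, run the search, and paste the results into
            # the context window (the model's working memory).
            context += "\n[search results for '%s']\n%s" % (query, web_search(query))
        else:
            return out
    return "(gave up after too many tool calls)"

print(chat("when are new episodes of White Lotus season 3 released?"))
```

The key idea is that the search results become ordinary tokens in the context window, so the model answers from them exactly as it would from text you pasted in yourself.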

[原文] [Andrej Karpathy]: so basically long story short don't do this manually but use tools like Perplexity as an example so perplexity.ai had a really nice sort of uh LLM that was doing internet search um and I think it was like the first app that really convincingly did this

[译文] [Andrej Karpathy]: 所以长话短说,不要手动做这些,而是使用像 Perplexity 这样的工具作为例子。perplexity.ai 有一个非常好的进行互联网搜索的 LLM,我认为它是第一个真正令人信服地做到这一点的应用程序。

[原文] [Andrej Karpathy]: more recently ChatGPT also introduced a search button that says search the web so we're going to take a look at that in a second for now when are new episodes of White Lotus season 3 getting released you can just ask and instead of having to do the work manually we just hit enter and the model will visit these web pages it will create all the queries and then it will give you the answer so it just kind of did a ton of the work for you

[译文] [Andrej Karpathy]: 最近 ChatGPT 也引入了一个“搜索网络(Search the web)”的按钮,我们马上就会看看那个。现在,对于“《白莲花》第三季新剧集什么时候发布”这个问题,你可以直接问,不用手动做这些工作,我们只需按回车,模型就会访问这些网页,它会创建所有的查询,然后给你答案。它基本上为你做了大量的工作。

[原文] [Andrej Karpathy]: um and then you can uh usually there will be citations so you can actually visit those web pages yourself and you can make sure that these are not hallucinations from the model and you can actually like double check that this is actually correct because it's not in principle guaranteed it's just um you know something that may or may not work

[译文] [Andrej Karpathy]: 然后通常会有引用来源,所以你可以亲自访问那些网页,你可以确保这些不是模型的幻觉,你可以再次检查这是否真的正确。因为原则上这不能保证是正确的,它只是一种可能有效也可能无效的东西。

[原文] [Andrej Karpathy]: if we take this we can also go to for example ChatGPT and say the same thing but now when we put this question in without actually selecting search I'm not actually 100% sure what the model will do in some cases the model will actually like know that this is recent knowledge and that it probably doesn't know and it will create a search in some cases we have to declare that we want to do the search

[译文] [Andrej Karpathy]: 如果我们把这个问题拿到比如 ChatGPT,说同样的话。但现在当我们输入这个问题而不实际选择“搜索”时,我实际上并不是 100% 确定模型会做什么。在某些情况下,模型实际上会知道这是最近的知识,它可能不知道,于是它会创建一个搜索;在某些情况下,我们必须声明我们想要进行搜索。

[原文] [Andrej Karpathy]: in my own personal use I would know that the model doesn't know and so I would just select search but let's see first uh let's see what happens okay searching the web and then it prints stuff and then it cites so the model actually detected itself that it needs to search the web because it understands that this is some kind of recent information etc so this was correct

[译文] [Andrej Karpathy]: 在我个人的使用中,我会知道模型不知道,所以我通常会直接选择搜索。但让我们先看看会发生什么。好的,“正在搜索网络”,然后它打印出内容,然后它给出了引用。所以模型实际上自己检测到了它需要搜索网络,因为它理解这是某种最近的信息等等。所以这是正确的。

[原文] [Andrej Karpathy]: alternatively if I create a new conversation I could have also select it search because I know I need to search enter and then it does the same thing searching the web and and that's the the result

[译文] [Andrej Karpathy]: 或者,如果我创建一个新对话,我也可以选择“搜索”,因为我知道我需要搜索。回车,然后它做了同样的事情,正在搜索网络,这就是结果。

[原文] [Andrej Karpathy]: so basically when you're using these LLMs look for this for example Grok excuse me let's try Grok without selecting search okay so the model does some search uh just knowing that it needs to search and gives you the answer

[译文] [Andrej Karpathy]: 所以基本上当你使用这些 LLM 时,寻找这个功能。例如 Grok,不好意思,让我们试试 Grok,不选择搜索。好的,模型进行了一些搜索,它知道它需要搜索,并给了你答案。

[原文] [Andrej Karpathy]: so basically uh let's see what Claude does you see so Claude does not actually have the search tool available so it will say as of my last update in April 2024 this last update is when the model went through pre-training and so Claude is just saying as of my last update the knowledge cutoff of April 2024 uh it was announced but it doesn't know

[译文] [Andrej Karpathy]: 基本上,让我们看看 Claude 会做什么。你看,Claude(这里指当时版本)实际上没有搜索工具可用,所以它会说“截至我在 2024 年 4 月的最后一次更新”。这个最后更新是模型进行预训练的时间。所以 Claude 只是在说,截至我 2024 年 4 月的知识截止日期,它已宣布,但它不知道具体时间。

[原文] [Andrej Karpathy]: so Claud doesn't have the internet search integrated as an option and will not give you the answer I expect that this is something that anthropic might be working on

[译文] [Andrej Karpathy]: 所以 Claude 没有将互联网搜索作为一个选项集成进来,也不会给你答案。我预计这是 Anthropic 可能正在研究的东西。

[原文] [Andrej Karpathy]: let's try Gemini and let's see what it says unfortunately no official release date for White Lotus season 3 yet so um Gemini 2.0 Pro Experimental does not have access to internet search and doesn't know

[译文] [Andrej Karpathy]: 让我们试试 Gemini,看看它说什么。“不幸的是,白莲花第三季还没有正式发布日期”。所以 Gemini 2.0 Pro Experimental 无法访问互联网搜索,它不知道。

[原文] [Andrej Karpathy]: uh we could try some of the other ones like 2.0 Flash let me try that okay so this model seems to know but it doesn't give citations oh wait okay there we go sources and related content so we see how 2.0 Flash actually has the internet search tool but I'm guessing that the 2.0 Pro which is uh the most powerful model that they have actually does not have access

[译文] [Andrej Karpathy]: 我们可以试试其他的,比如 2.0 Flash,让我试一下。好的,这个模型似乎知道,但它没有给出引用……哦,等等,好了,这里有“来源和相关内容”。所以我们看到 2.0 Flash 实际上有互联网搜索工具,但我猜 2.0 Pro——也就是他们最强大的模型——实际上没有访问权限。

[原文] [Andrej Karpathy]: and it in here it actually tells us 2.0 pro experimental lacks access to real-time info and some Gemini features so this model is not fully wired with internet search

[译文] [Andrej Karpathy]: 这里它实际上告诉我们,2.0 Pro Experimental 缺乏对实时信息和某些 Gemini 功能的访问权限。所以这个模型并没有完全连接互联网搜索。

[原文] [Andrej Karpathy]: so long story short we can get models to perform Google searches for us visit the web page just pull in the information to the context window and answer questions and uh this is a very very cool feature but different models possibly different apps have different amount of integration of this capability and so you have to be kind of on the lookout for that

[译文] [Andrej Karpathy]: 长话短说,我们可以让模型为我们执行 Google 搜索,访问网页,把信息拉入上下文窗口并回答问题。这是一个非常非常酷的功能。但不同的模型,可能不同的应用程序,对这种能力的集成程度不同,所以你必须留意这一点。

[原文] [Andrej Karpathy]: I wanted to show you a few more examples of how I use the search tool in my own work so what are the kinds of queries that I use and this is fairly easy for me to do because usually for these kinds of cases I go to perplexity just out of habit even though chat GPT today can do this kind of stuff as well uh as do probably many other services as well but I happen to use perplexity for these kinds of search queries

[译文] [Andrej Karpathy]: 我想向你们展示更多关于我在工作中如何使用搜索工具的例子。我会使用什么样的查询?这对我来说很容易展示,因为通常对于这类情况,我出于习惯会去用 Perplexity。尽管如今 ChatGPT 也能做这类事情,可能很多其他服务也能做,但我碰巧使用 Perplexity 进行这类搜索查询。

[原文] [Andrej Karpathy]: so whenever I expect that the answer can be achieved by doing basically something like Google search and visiting a few of the top links and the answer is somewhere in those top links whenever that is the case I expect to use the search tool and I come to perplexity so here are some examples

[译文] [Andrej Karpathy]: 所以,只要我预期答案可以通过做类似 Google 搜索并访问前几个链接来获得,且答案就在那些链接里的某个地方,只要是这种情况,我就预期会使用搜索工具,我会用 Perplexity。所以这里有一些例子。

[原文] [Andrej Karpathy]: is the market open today um and uh this was Presidents' Day and I wasn't 100% sure so uh Perplexity understands what day it is today it will do the search and it will figure out that it's Presidents' Day and the market was closed

[译文] [Andrej Karpathy]: “今天股市开盘吗?”那是总统日,我不是 100% 确定,所以 Perplexity 理解今天是哪一天,它会进行搜索,并弄清楚因为是总统日,股市是关闭的。

[原文] [Andrej Karpathy]: where's White Lotus season 3 filmed again this is something that I wasn't sure that a model would know in its knowledge this is something niche so maybe there's not that many mentions of it on the internet and also this is more recent so I don't expect a model to know uh by default so uh this was a good fit for the search tool

[译文] [Andrej Karpathy]: “《白莲花》第三季在哪里拍摄的?”这又是我不确定模型在其知识库中是否知道的事情。这是一些小众的东西,也许互联网上提到的不多,而且这也是比较新的,所以我不期望模型默认就知道。所以这是一个很适合搜索工具的例子。

[原文] [Andrej Karpathy]: does Vercel offer a Postgres database so this was a good example of this because this kind of stuff changes over time and the offerings of Vercel which is a company uh may change over time and I want the latest and whenever something is latest or something changes I prefer to use the search tool so I come to Perplexity

[译文] [Andrej Karpathy]: “Vercel 提供 PostgreSQL 数据库吗?”这是一个很好的例子,因为这类东西会随时间变化,Vercel(一家公司)的产品可能会随时间变化,而我想要最新的信息。每当涉及最新的东西或者会变化的东西时,我更喜欢使用搜索工具,所以我用 Perplexity。

[原文] [Andrej Karpathy]: what does Apple launch tomorrow and what are some of the rumors so again this is something recent uh who is in the Singles Inferno season 4 cast uh must know uh so this is again a good example because this is very fresh information

[译文] [Andrej Karpathy]: “明天 Apple 发布会发布什么?有哪些传闻?”这又是最近的事情。“《单身即地狱(Singles Inferno)》第四季的演员阵容是谁?”我必须知道。这也是一个好例子,因为这是非常新鲜的信息。

[原文] [Andrej Karpathy]: why is the Palantir stock going up what is driving the enthusiasm when is Civilization 7 coming out exactly um this is an example also like has Brian Johnson talked about the toothpaste he uses um and I was curious basically I like what Brian does and again it has the two features number one it's a little bit esoteric so I'm not 100% sure if this is at scale on the internet and would be part of the knowledge of a model and number two this might change over time so I want to know what toothpaste he uses most recently and so this is a good fit again for a search tool

[译文] [Andrej Karpathy]: “为什么 Palantir 股票在上涨?是什么推动了这种热情?”“《文明 7》确切什么时候出?”还有一个例子比如“Brian Johnson 谈过他用的牙膏吗?”我很好奇,我喜欢 Brian 做的事。这又有两个特征:第一,这有点深奥,我不 100% 确定这在互联网上是否达到了被模型知识库收录的规模;第二,这可能会随时间变化,我想知道他最近用什么牙膏。所以这又很适合搜索工具。

[原文] [Andrej Karpathy]: "is it safe to travel to Vietnam?" uh this can potentially change over time. and then I saw a bunch of stuff on Twitter about USAID and I wanted to know kind of like what's the deal, uh so I searched about that, and then you can kind of like dive in in a bunch of ways here. but this use case here is kind of along the lines of: I see something trending and I'm kind of curious what's happening, like what is the gist of it. and so I very often just quickly bring up a search of like "what's happening?" and then get a model to kind of just give me the gist of roughly what happened, um because a lot of the individual tweets or posts might not have the full context just by themselves. so these are examples of how I use a Search tool

[译文] [Andrej Karpathy]: “去越南旅行安全吗?”这可能会随时间变化。然后我在 Twitter 上看到一堆关于 USAID 的东西,我想知道那是怎么回事,所以我搜了一下。你可以深入了解,但这儿的用例就像是,我看到有什么东西在流行,我好奇发生了什么,它的要点是什么。所以我经常快速发起一个搜索,问“发生了什么”,然后让模型给我一个大概的要点,因为很多单独的推文或帖子本身可能没有完整的背景。这些就是我如何使用搜索工具的例子。


章节 7:工具使用(二):深度研究(Deep Research)功能解析

📝 本节摘要

Andrej 在本节介绍了一项较新的重磅功能——“深度研究(Deep Research)”。这是一种结合了联网搜索与长时间思考(可能耗时数十分钟)的高级能力,目前需订阅 $200/月的 ChatGPT Pro 才能使用。他以研究“长寿混合补充剂(Longevity Mix)”成分为例,展示了该功能如何通过阅读大量文献生成一份包含引用的深度报告。此外,他还横向对比了 Perplexity 和 Grok 的类似功能,认为 ChatGPT 目前的表现最为详尽。最后他特别提醒,尽管报告看似专业且附带引用,但仍可能存在“幻觉”或关键遗漏(例如在统计 LLM 实验室时漏掉了 xAI),因此必须将其视为“初稿”并进行人工核实。

[原文] [Andrej Karpathy]: okay next up I would like to tell you about this capability called Deep research and this is fairly recent only as of like a month or two ago uh but I think it's incredibly cool and really interesting and kind of went under the radar for a lot of people even though I think it shouldn't have

[译文] [Andrej Karpathy]: 好的,接下来我想给你们讲讲这个叫做“深度研究(Deep Research)”的能力。这是相当新的功能,大概才出来一两个月,但我认为它酷毙了,真的很有趣。尽管我认为它不该被忽视,但很多人似乎并没有注意到它。

[原文] [Andrej Karpathy]: so when we go to ChatGPT pricing here, we notice that deep research is listed here under Pro, so it currently requires $200 per month, so this is the top tier. uh however I think it's incredibly cool, so let me show you by example um in what kinds of scenarios you might want to use it

[译文] [Andrej Karpathy]: 当我们看 ChatGPT 的定价时,我们注意到深度研究被列在 Pro 版下面,所以目前它需要每月 200 美元,这是顶级层级。然而,我认为它非常酷,所以让我通过例子向你们展示在什么样的情况下你可能会想使用它。

[原文] [Andrej Karpathy]: roughly speaking, uh deep research is a combination of internet search and thinking, rolled out over a long time, so the model will go off and it will spend tens of minutes doing, well, deep research. um and the first sort of company that announced this was ChatGPT, as part of its Pro offering, uh very recently, like a month ago

[译文] [Andrej Karpathy]: 粗略地说,深度研究是互联网搜索和思考的结合,并且持续很长时间。所以模型会开始运行,它会花几十分钟做深度研究。第一个宣布这个功能的公司是 ChatGPT,作为其 Pro 产品的一部分,就在最近,大约一个月前。

[原文] [Andrej Karpathy]: so here's an example recently I was on the internet buying supplements which I know is kind of crazy but Brian Johnson has this starter pack and I was kind of curious about it and there's this thing called Longevity mix right and it's got a bunch of health actives and I want to know what these things are right

[译文] [Andrej Karpathy]: 举个例子,最近我在网上买补充剂——我知道这有点疯狂——但 Brian Johnson 有个新手包,我对此有点好奇。里面有个叫“长寿混合剂(Longevity Mix)”的东西,对吧?它里面有一堆健康活性成分,我想知道这些东西到底是什么,对吧?

[原文] [Andrej Karpathy]: and of course, like Ca-AKG, like what the hell is this? "boost energy production for sustained vitality," like what does that mean? so one thing you could of course do is you could open up Google search uh and look at the Wikipedia page or something like that, and do everything that you're kind of used to, but deep research allows you to uh basically take an alternate route, and it kind of like processes a lot of this information for you and explains it a lot better

[译文] [Andrej Karpathy]: 当然,比如像 Ca-AKG 这种东西,这到底是什么鬼?“提升能量产生以维持持续的活力”,这什么意思?你当然可以做的一件事是打开 Google 搜索,查看维基百科页面之类的,做所有你习惯做的事情。但深度研究允许你基本上采取另一条路线,它会为你处理大量这类信息,并解释得好得多。

[原文] [Andrej Karpathy]: so as an example we can do something like this, this is my example prompt: "Ca-AKG is one of the health actives in Brian Johnson's Blueprint at 2.5 grams per serving. can you do research on Ca-AKG? tell me about why it might be found in the longevity mix, its possible efficacy in humans or animal models, its potential mechanism of action, any potential concerns or toxicity," or anything like that

[译文] [Andrej Karpathy]: 举个例子,我们可以这样做。这是我的示例提示词:“Ca-AKG 是 Brian Johnson 蓝图中的健康活性成分之一,每份 2.5 克。你能对 Ca-AKG 做个研究吗?告诉我为什么它会出现在长寿混合剂中,它在人类或动物模型中的可能功效,它的潜在作用机制,以及任何潜在的担忧或毒性之类的。”

[原文] [Andrej Karpathy]: now here I have this button available to me, and you won't unless you pay $200 per month right now, but I can turn on deep research. so let me copy paste this and hit go, um and now the model will say "okay, I'm going to research this," and then sometimes it likes to ask clarifying questions before it goes off

[译文] [Andrej Karpathy]: 现在这里我有这个按钮,我有而你没有,除非你现在每月付 200 美元。我可以开启深度研究。让我复制粘贴这个并点击开始。现在模型会说“好的,我要研究这个”,然后有时它喜欢在开始之前问一些澄清性问题。

[原文] [Andrej Karpathy]: so, "a focus on human clinical studies, animal models, or both?" let's say both. "specific sources?" uh all sources, I don't know. "comparison to other longevity compounds?" uh not needed, no comparison, just AKG. uh we can be pretty brief, the model understands, and we hit go. and then "okay, I'll research AKG," starting research, and so now we have to wait for probably about 10 minutes or so

[译文] [Andrej Karpathy]: 比如“侧重于人类临床研究、动物模型还是两者兼有?”我们就说“两者兼有”。“具体来源?”“所有来源,我不知道。”“与其他长寿化合物的比较?”“不需要比较,只看 AKG。”我们可以简短一点,模型能理解。我们点击开始,然后它说“好的,我将研究 AKG,开始研究”。现在我们要等大概 10 分钟左右。

[原文] [Andrej Karpathy]: and if you'd like to click on it, you can get a preview of what the model is doing at a high level. so this will go off and it will do a combination of, like I said, thinking and internet search. but it will issue many internet searches, it will go through lots of papers, it will look at papers and it will think, and it will come back 10 minutes from now, so this will run for a while

[译文] [Andrej Karpathy]: 如果你想点开它,你可以看到模型正在做什么的高层级预览。这会开始运行,它会像我说的,进行思考和互联网搜索的结合。但它会发起许多互联网搜索,它会浏览大量论文,它会查看论文并进行思考,然后在 10 分钟后回来。这会运行一会儿。

[原文] [Andrej Karpathy]: meanwhile, while this is running, uh I'd like to show you equivalents of it in the industry. so inspired by this, a lot of people were interested in cloning it, and one example is for example Perplexity. so Perplexity, when you go to the model dropdown, has something called Deep Research, and so you can issue the same queries here, and we can give this to Perplexity

[译文] [Andrej Karpathy]: 同时,在这个运行期间,我想向你们展示行业中的同类产品。受此启发,很多人有兴趣克隆这个功能。一个例子是 Perplexity。当你去 Perplexity 的模型下拉菜单时,有个叫“深度研究(Deep Research)”的东西,你可以在这里发布同样的查询,我们可以把这个给 Perplexity。

[原文] [Andrej Karpathy]: and then Grok as well has something called Deep Search instead of deep research, but I think that Grok's deep search is kind of like deep research, but I'm not 100% sure. so we can issue a Grok deep search as well, Grok 3 deep search, go, and uh this model is going to go off as well

[译文] [Andrej Karpathy]: 然后 Grok 也有一个叫“深度搜索(Deep Search)”的东西,不叫深度研究。但我认为 Grok 的深度搜索有点像深度研究,但我不是 100% 确定。所以我们可以也发布 Grok 深度搜索,Grok 3 深度搜索,走起,这个模型也会开始运行。

[原文] [Andrej Karpathy]: now I think, uh, where is my ChatGPT? so ChatGPT is kind of like maybe a quarter done. Perplexity is going to be done soon, okay, still thinking, and Grok is still going as well. I like Grok's interface the most, it seems like, okay, so basically it's looking up all kinds of papers, WebMD, browsing results, and it's kind of just getting all this

[译文] [Andrej Karpathy]: 现在我觉得……我的 ChatGPT 怎么样了?ChatGPT 大概完成了四分之一。Perplexity 很快就要好了,好的,还在思考。Grok 也在继续。我最喜欢 Grok 的界面,看起来不错。基本上它在查找各种论文、WebMD、浏览结果,它正在获取所有这些信息。

[原文] [Andrej Karpathy]: now while this is all going on, of course it's accumulating a giant context window, and it's processing all that information, trying to kind of create a report for us. so key points: uh what is Ca-AKG and why is it in the longevity mix, how is it associated to longevity, etc. and so it will do citations and it will kind of like tell you all about it

[译文] [Andrej Karpathy]: 当这一切发生时,当然它正在积累一个巨大的上下文窗口,并且正在处理所有这些信息,试图为我们创建一份报告。关键点:什么是 Ca-AKG,为什么它在长寿混合剂里,它与长寿有什么关联等等。它会做引用,它会把一切都告诉你。

[原文] [Andrej Karpathy]: and so this is not a simple and short response, this is kind of like almost a custom research paper on any topic you would like. and so this is really cool, and it gives a lot of references, potentially for you to go off and do some of your own reading and maybe ask some clarifying questions afterwards. but it's actually really incredible that it gives you all these like different citations and processes the information for you a little bit

[译文] [Andrej Karpathy]: 所以这不是一个简单简短的回答,这有点像是一份关于你喜欢的任何主题的定制研究论文。所以这真的很酷,它提供了很多参考文献,让你可能随后去进行自己的阅读,或者之后问一些澄清性问题。但这真的很不可思议,它给了你所有这些不同的引用,并为你处理了一下信息。

[原文] [Andrej Karpathy]: let's see if Perplexity finished. okay, Perplexity is still researching and ChatGPT is also researching, so let's uh briefly pause the video and um I'll come back when this is done

[译文] [Andrej Karpathy]: 让我们看看 Perplexity 完成了吗?好的,Perplexity 还在研究,ChatGPT 也在研究。所以让我们稍微暂停一下视频,等这完成了我再回来。

[原文] [Andrej Karpathy]: okay, so Perplexity finished, and we can see some of the report that it wrote up, uh so there's some references here and some uh basically description. and then ChatGPT also finished, and it also thought for 5 minutes, looked at 27 sources, and produced a report. so here it talked about uh research in worms, Drosophila, in mice, and in human trials that are ongoing, and then a proposed mechanism of action and some safety and potential concerns and references, which you can dive uh deeper into

[译文] [Andrej Karpathy]: 好的,Perplexity 完成了,我们可以看到它写的一些报告,这里有一些参考文献和一些基本描述。然后 ChatGPT 也完成了,它思考了 5 分钟,查看了 27 个来源并生成了一份报告。这里它谈到了在蠕虫、果蝇、小鼠中的研究以及正在进行的人体试验,还有提出的作用机制、一些安全性和潜在担忧,以及你可以深入研究的参考文献。

[原文] [Andrej Karpathy]: so usually, in my own work right now, I've only used this maybe for like 10 to 20 queries so far, something like that. usually I find that the ChatGPT offering is currently the best: it is the most thorough, it reads the best, it is the longest, uh it makes the most sense when I read it. um and I think the Perplexity and the Grok ones are a little bit shorter and a little bit briefer, and don't quite get into the same detail as uh the deep research from ChatGPT right now

[译文] [Andrej Karpathy]: 在我目前的工作中,我大概只用了 10 到 20 次这个查询。通常我发现 ChatGPT 的产品目前是最好的。它是最详尽的,读起来最好,也是最长的,读起来最合理。我认为 Perplexity 和 Grok 稍微短一点,稍微简略一点,没有达到 ChatGPT 目前深度研究的那种细节程度。

[原文] [Andrej Karpathy]: I will say that for everything that is given to you here, again, keep in mind that even though it is doing research and it's pulling in sources, there are no guarantees that there are no hallucinations here. uh any of this can be hallucinated at any point in time, it can be totally made up, fabricated, misunderstood by the model. so that's why these citations are really important: treat this as your first draft, treat this as papers to look at, um but don't take this as uh definitely true

[译文] [Andrej Karpathy]: 我要说的是,再次记住,这里给你的所有东西,即使它在做研究,即使它在拉取信息,也不能保证这里没有幻觉。任何这些内容都可能在任何时间点出现幻觉,它可能完全是编造的、捏造的、被模型误解的。所以这就是为什么这些引用非常重要。把这当作你的初稿,把这当作要看的论文清单,但不要把它当作绝对真理。

[原文] [Andrej Karpathy]: so here, what I would do now is I would actually go into these papers, and I would try to understand, uh, is ChatGPT understanding it correctly, and maybe I have some follow-up questions, etc. so you can do all that, but it's still incredibly useful to see these reports once in a while, to get a bunch of sources that you might want to descend into afterwards

[译文] [Andrej Karpathy]: 所以我现在会做的是,实际上进入这些论文,试着去理解 ChatGPT 是否理解正确了,也许我会有一些后续问题等等。你可以做所有这些,但偶尔看到这些报告还是非常有用的,可以获得一堆你之后可能想要深入研究的来源。

[原文] [Andrej Karpathy]: okay, so just like before, I wanted to show a few brief examples of how I've used deep research. so for example, I was uh trying to change browser, um because Chrome upset me and it deleted all my tabs, so I was looking at either Brave or Arc, and I was most interested in which one is more private. and uh basically ChatGPT compiled this report for me, and this was actually quite helpful: I went into some of the sources and I sort of understood why Brave is, basically tl;dr, significantly better, and that's why for example here I'm using Brave, because I switched to it now. and so this is an example of um basically researching different kinds of products and comparing them, I think that's a good fit for deep research

[译文] [Andrej Karpathy]: 好的,就像之前一样,我想展示几个我如何使用深度研究的简短例子。比如,我当时想换浏览器,因为 Chrome 让我很不爽,它删除了我所有的标签页。所以我在看 Brave 或 Arc,我最感兴趣的是哪个更注重隐私。ChatGPT 基本上为我汇编了这份报告,这实际上很有帮助。我查看了一些来源,我也大概明白了为什么 Brave 基本上(长话短说)要好得多。这就是为什么比如现在我正在使用 Brave,因为我现在切换到它了。这是一个研究不同类型产品并进行比较的例子,我认为这非常适合深度研究。

[原文] [Andrej Karpathy]: uh here I wanted to know about life extension in mice, so it kind of gave me a very long reading, but basically mice are an animal model for longevity, and uh different labs have tried to extend it with various techniques

[译文] [Andrej Karpathy]: 这里我想知道关于小鼠寿命延长的事情,它给了我很长的阅读材料,但基本上小鼠是长寿的动物模型,不同的实验室尝试用各种技术延长其寿命。

[原文] [Andrej Karpathy]: and then here I wanted to explore LLM labs in the USA, and I wanted a table of how large they are, how much funding they've had, etc. so this is the table that it produced. now this table is basically hit and miss, unfortunately, so I wanted to show it as an example of a failure

[译文] [Andrej Karpathy]: 然后这里我想探索美国的 LLM 实验室,我想要一个表格,显示它们有多大,获得了多少资金等等。这就是它生成的表格。不幸的是,这个表格基本上是时好时坏的,所以我想把它作为一个失败的例子展示出来。

[原文] [Andrej Karpathy]: um I think some of these numbers, I didn't fully check them, but they don't seem way too wrong, some of this looks wrong. um but the bigger omission I definitely see is that xAI is not here, which I think is a really major omission. and then also, conversely, Hugging Face should probably not be here, because I asked specifically about LLM labs in the USA, and also EleutherAI I don't think should count as a major LLM lab, um due to mostly its resources. and so I think it's kind of hit and miss: things are missing, I don't fully trust these numbers, I have to actually look at them. and so again, use it as a first draft, don't fully trust it, still very helpful. that's it

[译文] [Andrej Karpathy]: 我觉得其中一些数字——我没有完全核对——但看起来并不是错得离谱,有些看起来是错的。但我看到的更大的遗漏绝对是 xAI 不在这里,我认为这是一个非常重大的遗漏。相反,Hugging Face 可能不应该在这里,因为我特别问的是美国的 LLM 实验室。另外 EleutherAI,主要由于其资源原因,我认为不应该算作主要的 LLM 实验室。所以我觉得这有点命中率不稳定,有些东西缺失了,我不完全相信这些数字,我必须实际去查证。所以再次强调,把它当作初稿,不要完全相信它,虽然还是很有帮助。就是这样。


章节 8:工具使用(三):文档上传与辅助阅读

📝 本节摘要

在本节中,Andrej 介绍了一种通过上传具体文档来增强 LLM 能力的方法。不同于仅仅依赖模型模糊的内部知识,用户可以将 PDF 文件(如复杂的学术论文)或长文本直接加载到上下文窗口中,让模型基于该文档回答问题。他演示了如何将一篇关于 DNA 语言模型的论文上传给 Claude 3.7 进行总结和问答。此外,他还分享了自己阅读经典著作(如亚当·斯密的《国富论》)的独特方法:将章节内容复制给模型,进行逐段总结和交互式探讨。他认为这种“结伴阅读”的方式极大地提高了对晦涩或古老文本的理解与记忆,尽管目前的操作流程(复制粘贴)仍显得有些笨拙。

[原文] [Andrej Karpathy]: so what's really happening here that is interesting is that we are providing the LLM with additional concrete documents that it can reference inside its context window. so the model is not just relying on the hazy knowledge of the world that it has through its parameters and what it knows in its brain, we're actually giving it concrete documents

[译文] [Andrej Karpathy]: 这里发生的真正有趣的事情是,我们正在为 LLM 提供额外的具体文档,它可以在其上下文窗口中引用这些文档。所以模型不再仅仅依赖于通过其参数获得的对世界模糊的知识以及它“大脑”里知道的东西,我们实际上是给了它具体的文档。

[原文] [Andrej Karpathy]: it's as if you and I reference specific documents, like on the internet or something like that, while we are um kind of producing some answer for some question. now we can do that through an internet search or like a tool like this, but we can also provide these LLMs with concrete documents ourselves through a file upload, and I find this functionality pretty helpful in many ways

[译文] [Andrej Karpathy]: 这就像你我在为某个问题生成答案时,会参考互联网上的具体文档或类似的东西。我们可以通过互联网搜索或像这样的工具来做到这一点,但我们也可以通过文件上传自己向这些 LLM 提供具体文档,我发现这个功能在很多方面都非常有帮助。

[原文] [Andrej Karpathy]: so as an example, uh let's look at Claude, because they just released Claude 3.7 while I was filming this video. so this is a new Claude model that is now the state of the art, and notice here that we have a thinking mode now as of 3.7. and so Normal is what we've looked at so far, but they just released Extended, "best for math and coding challenges." and what they're not saying, but is actually most likely true under the hood, is that this was trained with reinforcement learning, in a similar way that all the other thinking models were produced

[译文] [Andrej Karpathy]: 举个例子,让我们看看 Claude,因为就在我录制这个视频的时候,他们刚刚发布了 Claude 3.7。这是现在最先进的新 Claude 模型。注意这里,从 3.7 开始我们有了“思考模式(Thinking Mode)”。所以“标准模式(Normal)”是我们目前所看到的,但他们刚刚发布了“扩展模式(Extended)”,这最适合数学和编程挑战。他们没明说但实际上底层很可能发生的是,这也就是通过强化学习训练出来的,就像所有其他思考型模型产生的方式一样。

[原文] [Andrej Karpathy]: so what we can do now is we can upload documents that we want to reference inside its context window. so as an example, uh there's this paper that came out that I was kind of interested in, it's from the Arc Institute, and it's basically um a language model trained on DNA. and so I was kind of curious, I mean I'm not from biology, but I was kind of curious what this is

[译文] [Andrej Karpathy]: 所以我们现在能做的是,我们可以上传我们希望它在其上下文窗口中引用的文档。举个例子,最近出了一篇我有兴趣的论文,是 Arc 研究所的,基本上是一个在 DNA 上训练的语言模型。所以我很好奇——我的意思是,我不是学生物学的——但我有点好奇这是什么。

[原文] [Andrej Karpathy]: and this is a perfect example of um what LLMs are extremely good for, because you can upload these documents to the LLM: you can load this PDF into the context window and then ask questions about it, and uh basically read the document together with an LLM and ask questions of it. so the way you do that is you basically just drag and drop, so we can take that PDF and just drop it here, um this is about 30 megabytes

[译文] [Andrej Karpathy]: 这是一个展示 LLM 极其擅长什么的完美例子,因为你可以把这些文档上传给 LLM,你可以把这个 PDF 加载到上下文窗口中,然后询问相关问题,基本上就是和 LLM 一起阅读文档并向它提问。做法基本上就是拖放,我们可以把那个 PDF 拖到这里,这个文件大约 30MB。

[原文] [Andrej Karpathy]: now when Claude gets this document, it is very likely that they actually discard a lot of the images and that kind of information. I don't actually know exactly what they do under the hood, and they don't really talk about it, but it's likely that the images are thrown away, or if they are there, they may not be as well understood as you and I would understand them, potentially

[译文] [Andrej Karpathy]: 当 Claude 收到这个文档时,很有可能他们实际上丢弃了很多图像和类似的信息。我实际上并不知道他们在底层确切做了什么,他们也没怎么谈论这个,但很可能图像被丢弃了,或者如果它们还在,模型可能无法像你我那样很好地理解它们。

[原文] [Andrej Karpathy]: and it's very likely that what's happening under the hood is that this PDF is basically converted to a text file, and that text file is loaded into the token window. and once it's in the token window, it's in the working memory, and we can ask questions of it

[译文] [Andrej Karpathy]: 底层很有可能发生的是,这个 PDF 基本上被转换成了一个文本文件,那个文本文件被加载到了 Token 窗口中。一旦它在 Token 窗口中,它就在工作记忆里了,我们就可以对它提问了。
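如果想直观感受一下"把文档加载进 Token 窗口"大致是什么意思,下面是一个极简的 Python 草图。其中一切都是为说明而做的假设,并非任何服务商的真实实现:"每 4 个字符约 1 个 Token"只是常见的经验法则(真实分词器各不相同),窗口预算数字是随手编的,真正的应用会对文档分块而不是粗暴截断。

```python
# Illustrative sketch only: the ~4 chars/token ratio and the budget below
# are assumptions for demonstration, not how any provider actually works.
CONTEXT_BUDGET_TOKENS = 8_000  # hypothetical size of the model's window


def approx_tokens(text: str) -> int:
    """Rough token estimate using the common ~4 characters/token heuristic."""
    return len(text) // 4


def load_into_context(document_text: str, question: str) -> str:
    """Fit the document into the budget, then build the prompt string."""
    max_chars = CONTEXT_BUDGET_TOKENS * 4
    snippet = document_text[:max_chars]  # a real app would chunk, not cut
    return f"Document:\n{snippet}\n\nQuestion: {question}"


# Stand-in for the extracted PDF text of the paper.
paper_text = "Evo 2 is a biological foundation model trained on DNA. " * 2000
prompt = load_into_context(paper_text, "Can you give me a summary of this paper?")
print(approx_tokens(prompt))
```

文本一旦进入窗口,后续提问就是针对这份"工作记忆"作答,而不是依赖模型参数里模糊的世界知识,这正是上面转录稿所说的效果。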

[原文] [Andrej Karpathy]: so typically, when I start reading papers together with any of these LLMs, I just ask for "can you give me a summary of this paper?" let's see what Claude 3.7 says. uh okay, "I'm exceeding the length limit of this chat." oh god, really? oh damn. okay, well, let's try ChatGPT

[译文] [Andrej Karpathy]: 所以通常当我开始和这些 LLM 一起阅读论文时,我只是要求:“你能给我一个总结吗?这篇论文的总结。”让我们看看 Claude 3.7 怎么说……好吧,“我超过了这个对话的长度限制”。哦天哪,真的吗?该死。好吧,那让我们试试 ChatGPT。

[原文] [Andrej Karpathy]: uh "can you summarize this paper?" and we're using GPT-4o, and we're not using thinking, um which is okay, we can start by not thinking. reading documents, summary of the paper "Genome modeling and design across all domains of life": so this paper introduces Evo 2, a large-scale biological foundation model, and then key features and so on

[译文] [Andrej Karpathy]: “你能总结这篇论文吗?”我们使用的是 GPT-4o,我们没有使用“思考模式”,这没关系,我们可以先不思考。阅读文档,论文总结:《跨越所有生命领域的基因组建模与设计》。这篇论文介绍了 Evo 2,一个大规模生物学基础模型,然后是关键特性等等。

[原文] [Andrej Karpathy]: so I personally find this pretty helpful and then we can kind of go back and forth and as I'm reading through the abstract and the introduction Etc I am asking questions of the llm and it's kind of like uh making it easier for me to understand the paper

[译文] [Andrej Karpathy]: 我个人觉得这非常有帮助。然后我们可以来回交流,当我读摘要和引言等部分时,我会向 LLM 提问,这就像是……它让我更容易理解这篇论文。

[原文] [Andrej Karpathy]: another way that I like to use this functionality extensively is when I'm reading books. it is rarely ever the case anymore that I read books just by myself, I always involve an LLM to help me read a book

[译文] [Andrej Karpathy]: 我喜欢广泛使用这个功能的另一种方式是读书。现在我很少再独自读书了,我总是让 LLM 参与进来帮助我读书。

[原文] [Andrej Karpathy]: so a good example of that recently is The Wealth of Nations, uh which I was reading recently. it is a book from 1776 written by Adam Smith, and it's kind of like the foundation of classical economics. and it's a really good book, and it's kind of just very interesting to me that it was written so long ago, but it's got a lot of insights um that I think are very timely even today

[译文] [Andrej Karpathy]: 最近的一个好例子是《国富论》,我最近在读这本书。它是亚当·斯密在 1776 年写的一本书,它是古典经济学的基础。这是一本非常好的书,对我来说非常有趣的是,虽然它写于很久以前,但它有很多现代的……它有很多见解,我认为即使在今天也非常合时宜。

[原文] [Andrej Karpathy]: so the way I read books now, as an example, is uh you basically pull up the book, and you have to get uh access to like the raw content of that information. in the case of The Wealth of Nations this is easy, because it is from 1776, so you can just find it on Project Gutenberg as an example

[译文] [Andrej Karpathy]: 所以举个例子,我现在读书的方式是,你基本上打开这本书,你必须获得该信息的原始内容。对于《国富论》来说这很容易,因为它是 1776 年的,所以你可以直接在“古腾堡计划(Project Gutenberg)”上找到它。

[原文] [Andrej Karpathy]: and then basically find the chapter that you are currently reading. so as an example, let's read this chapter from book one, and this chapter uh I was reading recently, and it kind of goes into the division of labor and how it is limited by the extent of the market. roughly speaking, if your market is very small, then people can't specialize, and specialization is extremely important for wealth creation, um because you can have experts who specialize in their simple little task. but you can only do that at scale, uh because without the scale, you don't have a large enough market to sell your specialization to

[译文] [Andrej Karpathy]: 然后基本上找到你目前正在读的章节。举个例子,让我们读第一卷的这一章。这一章我最近在读,它探讨了劳动分工以及它如何受限于市场范围。粗略地说,如果你的市场非常小,那么人们就无法专业化。而专业化基本上是巨大的……专业化对于财富创造极其重要,因为你可以让专家专注于他们简单的小任务。但你只能在规模化的情况下这样做,因为如果没有规模,你就没有足够大的市场来兜售你的专业化成果。

[原文] [Andrej Karpathy]: so what we do is we copy paste this book, uh this chapter at least, uh this is how I like to do it. we go to, say, Claude, and um we say something like "we are reading The Wealth of Nations." now remember, Claude has knowledge of The Wealth of Nations, but probably doesn't remember exactly the uh content of this chapter, so it wouldn't make sense to ask Claude questions about this chapter directly, uh because it probably doesn't remember what this chapter is about. but we can remind Claude by loading this into the context window

[译文] [Andrej Karpathy]: 所以我们做的是复制粘贴这本书——至少是这一章,这是我喜欢的做法——我们去比如说 Claude,然后我们说类似这样的话:“我们正在读《国富论》。”现在记住,Claude 拥有关于《国富论》的知识,但可能不记得这一章的确切内容。所以直接问 Claude 关于这一章的问题是没有意义的,因为它可能不记得这一章是关于什么的,但我们可以通过将其加载到上下文窗口中来提醒 Claude。

[原文] [Andrej Karpathy]: so, "we are reading The Wealth of Nations, uh please summarize this chapter" to start, and then what I do here is I copy paste. um now in Claude, when you copy paste, they don't actually show all the text inside the text box, they create a little text attachment uh when it is over uh some size. and so we can click enter, and uh we just kind of like start off. usually I like to start off with a summary of what this chapter is about, just so I have a rough idea, and then I go in and I start reading the chapter

[译文] [Andrej Karpathy]: “所以我们正在读《国富论》,请先总结这一章。”然后我在这里做的是复制粘贴。在 Claude 中,当你复制粘贴时,如果内容超过一定大小,它们实际上不会在文本框中显示所有文本,而是创建一个小的文本附件。所以我们可以点击回车,我们就这样开始。通常我喜欢从这一章的摘要开始,这样我就有了一个大致的概念,然后我进去开始阅读这一章。

[原文] [Andrej Karpathy]: and uh at any point, if we have any questions, then we just come in and just ask our question. and I find that basically going hand in hand with LLMs uh dramatically increases my retention, my understanding of these chapters. and I find that this is especially the case when you're reading, for example, uh documents from other fields, like for example biology, or for example documents from a long time ago, like 1776, where you sort of need a little bit of help even understanding uh the basics of the language

[译文] [Andrej Karpathy]: 在任何时候如果我们有问题,我们就进去提问。我发现基本上与 LLM 携手并进大大增加了我的记忆力和我对这些章节的理解。我发现尤其是当你阅读其他领域的文档(比如生物学),或者很久以前的文档(比如 1776 年的)时,情况更是如此,在那里你甚至需要一点帮助来理解基本的语言。

[原文] [Andrej Karpathy]: or for example, I would feel a lot more courage approaching a very old text that is outside of my area of expertise, maybe I'm reading Shakespeare or I'm reading things like that. I feel like LLMs make a lot of reading dramatically more accessible than it used to be before, because you're not just right away confused, you can actually kind of go slowly through it and figure it out together with the LLM in hand. so I use this extensively and I think it's extremely helpful

[译文] [Andrej Karpathy]: 或者举个例子,我会更有勇气去接触我专业领域之外的一篇非常古老的文本,也许我在读莎士比亚或类似的东西。我觉得 LLM 让很多阅读变得比以前容易得多了,因为你不会一开始就感到困惑,你实际上可以慢慢地读,手头有 LLM 帮你一起弄清楚。所以我广泛使用这个方法,我认为它非常有帮助。

[原文] [Andrej Karpathy]: I'm not aware of tools, unfortunately, that make this very easy for you today. I do this clunky back and forth, so literally I will find uh the book somewhere, and I will copy paste stuff around, and I'm going back and forth, and it's extremely awkward and clunky. and unfortunately I'm not aware of a tool that makes this very easy for you, but obviously what you want is, as you're reading a book, you just want to highlight the passage and ask questions about it. this currently, as far as I know, does not exist, um but this is extremely helpful. I encourage you to experiment with it, and uh don't read books alone

[译文] [Andrej Karpathy]: 不幸的是,我不知道现在有什么工具能让这件事变得非常容易。我做这种笨拙的来回操作:真的就是我在某处找到书,把东西复制粘贴过来,来回折腾,极其尴尬和笨拙。不幸的是,我不知道有什么工具能让你很容易做到这一点。但显然你想要的是,当你读书的时候,你只想高亮那段文字并询问相关问题。据我所知目前这还不存在,但这非常有帮助,我鼓励你去尝试一下,不要独自读书。


章节 9:工具使用(四):代码解释器与高级数据分析

📝 本节摘要

本节重点介绍 LLM 的一项“超能力”:Python 解释器(Python Interpreter)。Andrej 指出,单纯的 LLM 并不擅长心算复杂的数学题(本质上是在做文本预测),但通过调用编程工具,它们可以编写并运行代码来获取精确结果。他展示了 ChatGPT 如何像使用计算器一样解决复杂乘法,并对比了 Grok 3(因缺乏工具而产生幻觉)和 Gemini 的表现。此外,他还演示了高级数据分析(Advanced Data Analysis)功能:让 ChatGPT 搜集 OpenAI 的估值历史、绘制图表并预测未来趋势。他特别提醒,虽然模型能像“初级数据分析师”一样画出漂亮的图表,但经常会在代码中通过“隐式假设”埋雷(例如随意填充缺失数据),或者在文字总结时一本正经地胡说八道(图表数据是 20 万亿,文字却说是 1.7 万亿),因此用户必须保持警惕并审查代码。

[原文] [Andrej Karpathy]: okay, the next very powerful tool that I now want to turn to is the use of a Python interpreter, or basically giving the LLM the ability to use and write computer programs. so instead of the LLM giving you an answer directly, it has the ability now to write a computer program and to emit special tokens that the ChatGPT application recognizes as "hey, this is not for the human, this is uh basically saying that whatever I output here uh is actually a computer program, please go off and run it and give me the result of running that computer program"

[译文] [Andrej Karpathy]: 好的,我现在想转向的下一个非常强大的工具是 Python 解释器的使用,或者基本上是赋予 LLM 使用和编写计算机程序的能力。所以 LLM 不再直接给你答案,它现在有能力编写计算机程序,并发出一些特殊的 Token,ChatGPT 应用程序会识别出:“嘿,这不是给人类看的,这基本上是在说我这里输出的是一个计算机程序,请去运行它,并把运行那个计算机程序的结果给我。”

[原文] [Andrej Karpathy]: so uh it is the integration of the language model with a programming language, here like Python. so uh this is extremely powerful. let's see the simplest example of where this would be uh used and what this would look like

[译文] [Andrej Karpathy]: 所以这就是语言模型与像 Python 这样的编程语言的集成。这非常强大。让我们看看这将会在哪里使用的最简单例子,以及它看起来是什么样子的。

[原文] [Andrej Karpathy]: so if I go to ChatGPT and I give it some kind of a multiplication problem, let's say 30 times 9 or something like that, then this is a fairly simple multiplication, and you and I can probably do something like this in our head, right? like 30 times 9, you can just come up with the result of 270, right?

[译文] [Andrej Karpathy]: 如果我去 ChatGPT,给它某种乘法问题,比如说 30 乘以 9 或类似的。这是一个相当简单的乘法,你和我大概都可以在脑子里做这个,对吧?比如 30 乘以 9,你可以直接得出 270 的结果,对吧。

[原文] [Andrej Karpathy]: so let's see what happens okay so llm did exactly what I just did it calculated the result of this multiplication to be 270 but it's actually not really doing math it's actually more like almost memory work uh but it's easy enough to do in your head um so there was no tool use involved here all that happened here was just the zip file uh doing next token prediction and uh gave the correct result here in its head

[译文] [Andrej Karpathy]: 让我们看看会发生什么。好的,LLM 做了和我刚才做的完全一样的事,它计算出这个乘法的结果是 270。但实际上它并不是真的在做数学,这实际上更像是记忆工作。但这在脑子里做足够简单,所以这里没有涉及工具的使用。这里发生的一切只是那个 Zip 文件在做下一个 Token 的预测,并在它的脑子里给出了正确结果。

[原文] [Andrej Karpathy]: the problem now is, what if we want something more complicated? so what is this times this? and now of course, if I asked you to calculate this, you would give up instantly, because you know that you can't possibly do this in your head, and you would be looking for a calculator. and that's exactly what the LLM does now too, and OpenAI has trained ChatGPT to recognize problems that it cannot do in its head and to rely on tools instead

[译文] [Andrej Karpathy]: 现在的问题是,如果我们想要更复杂一点的东西呢?比如这个数乘以这个数是多少?当然,如果我现在让你计算这个,你会立刻放弃,因为你知道你不可能在脑子里算出这个,你会去找计算器。而这也正是 LLM 现在所做的。OpenAI 已经训练 ChatGPT 去识别那些它无法在脑子里完成的问题,并转而依赖工具。

[原文] [Andrej Karpathy]: so what I expect ChatGPT to do for this kind of a query is to turn to tool use. so let's see what it looks like. okay, there we go. so what's opened up here is what's called the Python interpreter, and Python is basically a little programming language. and instead of the LLM telling you directly what the result is, the LLM writes a program, and then, not shown here, are special tokens that tell the ChatGPT application to please run the program. and then the LLM pauses execution; instead, the Python program runs, creates a result, and then passes this result back to the language model as text, and the language model takes over and tells you that the result of this is that

[译文] [Andrej Karpathy]: 所以我期望 GPT 对这类查询做的是转向工具使用。让我们看看它是什么样子的。好的,开始了。这里打开的是所谓的 Python 解释器,Python 基本上是一种小程序语言。LLM 不是直接告诉你结果是什么,而是编写一个程序。这里没有显示出来的是那些特殊的 Token,它们告诉 ChatGPT 应用程序请运行这个程序,然后 LLM 暂停执行。取而代之的是 Python 程序运行,生成一个结果,然后将这个结果作为文本传回给语言模型,语言模型接手并告诉你结果是这个。

[原文] [Andrej Karpathy]: before we move on, I just wanted to make the point that, unfortunately, um you have to kind of keep track of which LLMs that you're talking to have which kinds of tools available to them, because different LLMs might not have all the same tools. and in particular, LLMs that do not have access to the Python interpreter or a programming language, or are unwilling to use it, might not give you correct results in some of these harder problems

[译文] [Andrej Karpathy]: 在我们继续之前,我只想指出一点,不幸的是,你必须留意你正在交谈的 LLM 拥有哪些可用的工具,因为不同的 LLM 可能并不都拥有相同的工具。特别是那些无法访问 Python 解释器或编程语言,或者不愿意使用它的 LLM,在这些较难的问题上可能无法给你正确的结果。

[原文] [Andrej Karpathy]: so as an example, here we saw that um ChatGPT correctly used a programming language and didn't do this in its head. Grok 3, actually, I believe does not have access to a programming language, uh like a Python interpreter, and here it actually does this in its head and gets remarkably close. but if you actually look closely at it, uh it gets it wrong: this should be 120 instead of 060. so Grok 3 will just hallucinate through this multiplication and uh do it in its head and get it wrong, but actually get remarkably close

[译文] [Andrej Karpathy]: 举个例子,我们看到 ChatGPT 正确地使用了编程语言,没有在脑子里做这道题。而 Grok 3,我相信实际上它无法访问像 Python 解释器这样的编程语言,它实际上是在脑子里做这道题的。它算出的结果非常接近,但如果你仔细看,它算错了。这里应该是 120 而不是 060。所以 Grok 3 只是通过幻觉来完成这个乘法,在脑子里做,然后搞错了,但实际上非常接近。

[原文] [Andrej Karpathy]: uh then I tried Claude, and Claude actually wrote, in this case, not Python code but JavaScript code, but uh JavaScript is also a programming language, and it gets the correct result. then I came to Gemini and I asked uh 2.0 Pro, and uh Gemini did not seem to be using any tools, there's no indication of that, and yet it gave me what I think is the correct result, which actually kind of surprised me

[译文] [Andrej Karpathy]: 然后我试了 Claude,Claude 在这种情况下实际上写的不是 Python 代码,而是 JavaScript 代码,但 JavaScript 也是一种编程语言,并且得到了正确结果。然后我去了 Gemini,我问了 2.0 Pro。Gemini 似乎没有使用任何工具,没有任何迹象表明这一点,然而它却给了我认为是正确的结果,这实际上让我有点惊讶。

[原文] [Andrej Karpathy]: so Gemini, I think, actually calculated this in its head correctly, which is kind of incredible. and the way we can tell that it's not using tools is we can just try something harder: we have to make it harder for it. okay, so it gives us some result, and then I can use uh my calculator here, and it's wrong, right? so this is using my MacBook Pro calculator, and uh it's not correct. it's like remarkably close, but it's not correct, it will just hallucinate the answer

[译文] [Andrej Karpathy]: 所以 Gemini 我认为实际上是在脑子里正确计算出来的,这有点不可思议。我们要判断它没有使用工具的方法是,我们可以试一个更难的。我们必须给它加大难度。好的,它给了我们一个结果,然后我可以用我这里的计算器……结果是错的,对吧?这是用我的 MacBook Pro 计算器算的,它不正确。虽然非常接近,但不正确,它只是通过幻觉编造了答案。
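下面用一段 Python 小例子说明为什么这类大数乘法应该交给代码解释器,而不是让模型“心算”:Python 的整数是任意精度的,可以精确算出结果;而浮点近似(类比模型在脑中估算)只有约 15-16 位有效数字,只能得到“非常接近但不正确”的答案。示例中的乘数是随意选的占位值,并非视频中的原题。

```python
# 精确的整数乘法(代码解释器做的事)与浮点近似(类比"心算"丢失低位数字)
a = 31459265358
b = 27182818284
exact = a * b                        # Python int 任意精度,结果精确
approx = int(float(a) * float(b))    # 浮点只有约 15-16 位有效数字,低位会丢失
print(exact)
print(approx)
print("digits:", len(str(exact)))
```

两个结果在前十几位上一致,差异只出现在低位数字上,这正是“非常接近但仍然错误”的来源。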

[原文] [Andrej Karpathy]: okay so one practical setting where this can be quite powerful is what's called ChatGPT Advanced Data Analysis and as far as I know this is quite unique to ChatGPT itself and it basically um gets ChatGPT to be kind of like a junior data analyst uh who you can uh kind of collaborate with

[译文] [Andrej Karpathy]: 好的,这在实际应用中非常强大的一个场景是所谓的“ChatGPT 高级数据分析(Advanced Data Analysis)”。据我所知,这对 ChatGPT 来说是相当独特的。它基本上让 ChatGPT 变成了一个你可以与之协作的“初级数据分析师”。

[原文] [Andrej Karpathy]: so let me show you a concrete example without going into the full detail so first we need to get some data that we can analyze and plot and chart Etc so here in this case I said uh let's research OpenAI valuation as an example and I explicitly asked ChatGPT to use the search tool because I know that under the hood such a thing exists and I don't want it to be hallucinating data to me I wanted it to actually look it up and back it up and create a table where each year we have the valuation

[译文] [Andrej Karpathy]: 让我给你们展示一个具体的例子,不展开全部细节。首先我们需要获取一些我们可以分析、绘图和制表的数据。在这里,我说“让我们研究一下 OpenAI 的估值”,我明确要求 ChatGPT 使用搜索工具,因为我知道底层有这个功能,我不希望它给我幻觉数据,我希望它实际去查证并支持这些数据,并创建一个表格,列出每年的估值。

[原文] [Andrej Karpathy]: so these are the OpenAI valuations over time notice how in 2015 it's not applicable so uh the valuation is like unknown then I said now plot this use log scale for the y-axis and so this is where this gets powerful ChatGPT goes off and writes a program that plots the data over here so it created a little figure for us and it uh sort of uh ran it and showed it to us so this can be quite uh nice and valuable because it's a very easy way to basically collect data upload data in a spreadsheet and visualize it Etc

[译文] [Andrej Karpathy]: 这些是随时间变化的 OpenAI 估值。注意 2015 年是“不适用(N/A)”,所以估值是未知的。然后我说:“现在绘制这个图,Y 轴使用对数刻度。”这就是它变得强大的地方。ChatGPT 去写了一个程序来绘制这些数据,它为我们创建了一个小图表,运行它并展示给我们。这相当不错且有价值,因为它是一种基本上收集数据、上传电子表格数据并进行可视化的非常简单的方法。
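作为示意,ChatGPT 的高级数据分析在这类请求下生成的绘图代码大致如下(数据为占位的假设值,并非视频中检索到的真实估值;这只是常见的 matplotlib 写法,不是 ChatGPT 的原始输出):

```python
import matplotlib
matplotlib.use("Agg")  # 离屏渲染,便于在无显示环境运行
import matplotlib.pyplot as plt

years = [2016, 2018, 2020, 2022, 2024]   # 假设的年份
valuations_b = [1, 5, 20, 80, 300]       # 假设的估值(单位:十亿美元)

fig, ax = plt.subplots()
ax.plot(years, valuations_b, marker="o")
ax.set_yscale("log")                     # y 轴使用对数刻度
ax.set_xlabel("Year")
ax.set_ylabel("Valuation ($B, log scale)")
fig.savefig("valuation.png")
```

重点只在 `set_yscale("log")` 这一行:指数式增长的数据在对数刻度下近似一条直线,便于观察趋势。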

[原文] [Andrej Karpathy]: I will note some of the things here so as an example notice that we had N/A for 2015 but ChatGPT when it was writing the code and again I would always encourage you to scrutinize the code it put in 0.1 for 2015 and so basically it implicitly assumed uh it made the assumption here in code that the valuation of 2015 was 100 million uh because it put in 0.1 and it kind of like did it without telling us so it's a little bit sneaky and uh that's why you kind of have to pay attention a little bit to the code

[译文] [Andrej Karpathy]: 我要指出这里的一些问题。比如注意我们在 2015 年有 N/A,但 ChatGPT 在写代码时——我再次鼓励你们一定要审查代码——它为 2015 年填入了 0.1。所以基本上它隐式地假设了……它在代码里做了个假设,认为 2015 年的估值是 1 亿美元,因为它填了 0.1。它这样做却没有告诉我们,所以这有点鬼鬼祟祟,这就是为什么你必须稍微注意一下代码。
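对照视频里“偷偷填入 0.1”的做法,更稳妥的模式是把未知值显式标记出来并排除在计算之外,而不是悄悄编造一个数。下面是这个模式的最小示意(数值为假设):

```python
raw = {2015: None, 2016: 0.3, 2017: 1.0}   # None 显式标记"估值未知"(数值为假设)
known = {year: v for year, v in raw.items() if v is not None}
print(sorted(known))   # 2015 被排除在外,而不是被替换成编造的 0.1
```

这样一来,后续的拟合和绘图只使用真正有数据的年份,任何填补缺失值的决定都必须显式写出来。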

[原文] [Andrej Karpathy]: now fit a trend line and extrapolate until the year 2030 Mark the expected valuation in 2030 so it went off and it basically did a linear fit and it's using cciis curve fit and it did this and came up with a plot and uh it told me that the valuation based on the trend in 2030 is approximately 1.7 trillion which sounds amazing except uh here I became suspicious because I see that Chach PT is telling me it's 1.7 trillion but when I look here at 2030 it's printing 2027 1.7 B so its extrapolation when it's printing the variable is inconsistent with 1.7 trillion

[译文] [Andrej Karpathy]: “现在拟合一条趋势线并推断到 2030 年,标记出 2030 年的预期估值。”它就开始运行,基本上做了一个线性拟合,它使用了 SciPy 的曲线拟合,得出了一个图表。它告诉我,基于趋势,2030 年的估值大约是 1.7 万亿美元。这听起来很惊人,除了……在这里我开始怀疑了,因为我看到 ChatGPT 告诉我是 1.7 万亿,但当我看这里的 2030 年时,它打印的是 2.027e+04(即2万多),这与 1.7 万亿不一致。

[原文] [Andrej Karpathy]: uh this makes it look like that valuation should be about 20 trillion and so that's what I said print this variable directly by itself what is it and then it sort of like rewrote the code and uh gave me the variable itself and as we see in the label here it is indeed 2271 Etc so in 2030 the true exponential Trend extrapolation would be a valuation of 20 trillion um so I was like I was trying to confront Chach and I was like you lied to me right and it's like yeah sorry I messed up

[译文] [Andrej Karpathy]: 这看起来估值应该是大约 20 万亿。所以我说:“直接单独打印这个变量,它是什么?”然后它重写了代码,给了我变量本身。正如我们在标签里看到的,它确实是 20271 等等。所以在 2030 年,真实的指数趋势外推估值将是 20 万亿美元。所以我试图与 ChatGPT 对质,我说:“你骗了我,对吧?”它就像:“是的,抱歉我搞砸了。”
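这类“拟合趋势线并外推”的计算本身并不复杂。下面是一个只用标准库的示意:对 log(估值) 做最小二乘直线拟合,再外推到 2030 年(数据为假设的占位值,并非视频中的真实数字;ChatGPT 实际用的是 SciPy 的 curve fit,这里手写等价的对数线性拟合):

```python
import math

years = [2016, 2018, 2020, 2022, 2024]    # 假设数据
vals_b = [1.0, 5.0, 20.0, 80.0, 300.0]    # 单位:十亿美元(假设)

# 对 log(valuation) 做普通最小二乘:log(v) ≈ c0 + c1 * year
n = len(years)
ys = [math.log(v) for v in vals_b]
xbar = sum(years) / n
ybar = sum(ys) / n
c1 = sum((x - xbar) * (y - ybar) for x, y in zip(years, ys)) \
     / sum((x - xbar) ** 2 for x in years)
c0 = ybar - c1 * xbar

pred_2030 = math.exp(c0 + c1 * 2030)
# 像视频里那样:直接打印变量本身,而不是轻信正文里报出的数字
print(round(pred_2030, 1))
```

最后一行正是视频里的教训:对模型报出的结论,让它把变量本身打印出来核对,而不是只看它在文字里说的数。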

[原文] [Andrej Karpathy]: so I guess I I I like this example because number one it shows the power of the tool in that it can create these figures for you and it's very nice but I think number two it shows the um trickiness of it where for example here it made an implicit assumption and here it actually told me something uh it told me just the wrong it hallucinated 1.7 trillion so again it is kind of like a very very Junior data analyst it's amazing that it can plot figures but you have to kind of still know what this code is doing and you have to be careful and scrutinize it and make sure that you are really watching very closely because your Junior analyst is a little bit uh absent minded and uh not quite right all the time so really powerful but also be careful with this

[译文] [Andrej Karpathy]: 所以我想我喜欢这个例子,因为第一,它展示了工具的强大之处,它可以为你创建这些图表,这非常好。但第二,我认为它展示了其中的棘手之处,例如在这里它做了一个隐式假设,而在这里它实际上告诉我一些错误的东西,它幻觉出了 1.7 万亿。所以再次强调,它有点像一个非常非常初级的数据分析师。它能画图很神奇,但你还是得知道代码在做什么,你必须小心审查它,确保你真的在密切注视,因为你的这位初级分析师有点心不在焉,而且并不总是正确的。所以真的很强大,但也要小心使用。


章节 10:工具使用(五):Claude Artifacts 与可视化图表

📝 本节摘要

在本节中,Andrej 介绍了 Anthropic Claude 特有的“Artifacts”功能。不同于传统的文本回复,该功能允许模型编写代码并在浏览器侧边栏实时预览交互式应用。他演示了如何将亚当·斯密的维基百科介绍转化为一个基于 React 的“抽认卡(Flashcards)”应用,直接在网页中进行测试。此外,他重点推荐了他最喜欢的用法:让 Claude 将晦涩的《国富论》章节转化为“概念图(Conceptual Diagrams)”(使用 Mermaid 语法),通过可视化的树状结构帮助记忆和理解复杂的论证逻辑。

[原文] [Andrej Karpathy]: okay so I've introduced you to ChatGPT and Advanced Data Analysis which is one powerful way to basically have LLMs interact with code and add some UI elements like showing of figures and things like that I would now like to uh introduce you to one more related tool and that is uh specific to Claude and it's called artifacts

[译文] [Andrej Karpathy]: 好的,我已经向你们介绍了 ChatGPT 和高级数据分析,这是让 LLM 与代码交互并添加一些 UI 元素(如展示图表等)的一种强大方式。现在我想向你们介绍另一个相关的工具,那是 Claude 特有的,叫做“Artifacts(工件)”。

[原文] [Andrej Karpathy]: so let me show you by example what this is so I have a conversation with Claude and I'm asking generate 20 flash cards from the following text um and for the text itself I just came to the Adam Smith Wikipedia page for example and I copy pasted this introduction here so I copy pasted this here and asked for flash cards and Claude responds with 20 flash cards so for example when was Adam Smith baptized on June 16th Etc when did he die what was his nationality Etc

[译文] [Andrej Karpathy]: 让我通过例子向你们展示这是什么。我与 Claude 进行了一次对话,我要求“根据以下文本生成 20 张抽认卡”。对于文本本身,我只是去了亚当·斯密的维基百科页面,把这段介绍复制粘贴到了这里。所以我把它粘贴在这里并要求生成抽认卡,Claude 回复了 20 张抽认卡。例如,“亚当·斯密是什么时候受洗的?”“6 月 16 日”等等。“他什么时候去世的?”“他的国籍是什么?”等等。

[原文] [Andrej Karpathy]: so once we have the flash cards we actually want to practice these flashcards and so this is where I continue the conversation and I say now use the artifacts feature to write a flashcards app to test these flashcards and so clot goes off and writes code for an app that uh basically formats all of this into flashcards and that looks like this

[译文] [Andrej Karpathy]: 一旦我们有了抽认卡,我们实际上想要练习这些抽认卡。这就是我继续对话的地方,我说:“现在使用 Artifacts 功能编写一个抽认卡应用程序来测试这些抽认卡。”于是 Claude 就去写了一个应用程序的代码,基本上把所有这些都格式化成抽认卡,看起来是这样的。

[原文] [Andrej Karpathy]: so what Claude wrote specifically was this C code here so it uses a react library and then basically creates all these components it hardcodes the Q&A into this app and then all the other functionality of it and then the cloud interface basically is able to load these react components directly in your browser and so you end up with an app

[译文] [Andrej Karpathy]: Claude 具体写的是这段代码,它使用了一个 React 库,然后基本上创建了所有这些组件。它把问答硬编码到这个应用程序中,还有所有其他的功能。然后 Claude 的界面基本上能够直接在你的浏览器中加载这些 React 组件,所以你最终得到了一个应用程序。

[原文] [Andrej Karpathy]: so when was Adam Smith baptized and you can click to reveal the answer and then you can say whether you got it correct or not when did he die uh what was his nationality Etc so you can imagine doing this and then maybe we can reset the progress or Shuffle the cards Etc

[译文] [Andrej Karpathy]: 所以,“亚当·斯密是什么时候受洗的?”你可以点击显示答案,然后你可以说你是否答对了。“他什么时候去世的?”“他的国籍是什么?”等等。所以你可以想象这样做,然后也许我们可以重置进度或洗牌等等。

[原文] [Andrej Karpathy]: so what happened here is that Claude wrote us a super duper custom app just for us uh right here and um typically what we're used to is some software Engineers write apps they make them available and then they give you maybe some way to customize them or maybe to upload flashcards like for example in the eny app you can import flash cards and all this kind of stuff

[译文] [Andrej Karpathy]: 这里发生的事情是,Claude 就在这里为我们写了一个超级定制的应用程序。通常我们习惯的是软件工程师编写应用程序,发布出来,然后可能给你一些定制的方法,或者上传抽认卡的方法,比如在 Anki 应用程序中你可以导入抽认卡之类的东西。

[原文] [Andrej Karpathy]: this is a very different Paradigm because in this Paradigm Claud just writes the app just for you and deploys it here in your browser now keep in mind that a lot of apps you will find on the internet they have entire backends Etc there's none of that here there's no database or anything like that but these are like local apps that can run in your browser and uh they can get fairly sophisticated and useful in some cases uh so that's Cloud artifacts

[译文] [Andrej Karpathy]: 这是一个非常不同的范式,因为在这个范式中,Claude 只是为你编写应用程序,并将其部署在这里的浏览器中。请记住,你在互联网上找到的很多应用程序都有完整的后端等等,这里没有那些,没有数据库或类似的东西。但这些就像是可以在你的浏览器中运行的本地应用程序,在某些情况下它们可以变得相当复杂和有用。这就是 Claude Artifacts。

[原文] [Andrej Karpathy]: now to be honest I'm not actually a daily user of artifacts I use it once in a while I do know that a large number of people are experimenting with it and you can find a lot of artifact showcasing cases because they're easy to share so these are a lot of things that people have developed um various timers and games and things like that um but the one use case that I did find very useful in my own work is basically uh the use of diagrams diagram generation

[译文] [Andrej Karpathy]: 老实说,我实际上并不是 Artifacts 的日常用户,我偶尔用一下。我知道有很多人正在尝试它,你可以找到很多 Artifacts 的展示案例,因为它们很容易分享。人们开发了很多东西,各种计时器、游戏之类的。但我发现在我自己的工作中非常有用的一个用例基本上是图表的使用,图表生成。

[原文] [Andrej Karpathy]: so as an example let's go back to the book chapter of Adam Smith that we were looking at what I do sometimes is we are reading The Wealth of Nations by Adam Smith I'm attaching chapter 3 and book one please create a conceptual diagram of this chapter

[译文] [Andrej Karpathy]: 举个例子,让我们回到我们正在看的亚当·斯密的那一书章。我有时做的是:“我们正在读亚当·斯密的《国富论》,我附上了第一卷第三章,请为这一章创建一个概念图。”

[原文] [Andrej Karpathy]: and when Claude hears conceptual diagram of this chapter very often it will write a code that looks like this and if you're not familiar with this this is using the mermaid library to basically create or Define a graph and then uh this is plotting that mermaid diagram and so Claud analyzes the chapter and figures out that okay the key principle that's being communicated here is as follows that basically the division of labor is related to the extent of the market the size of it

[译文] [Andrej Karpathy]: 当 Claude 听到“这一章的概念图”时,通常它会写出一段看起来像这样的代码。如果你不熟悉这个,这是在使用 Mermaid 库来基本上创建或定义一个图表,然后绘制出那个 Mermaid 图表。所以 Claude 分析了这一章,并指出,好的,这里传达的关键原则如下:基本上劳动分工与市场范围(即市场规模)有关。
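下面是一段示意性的 Mermaid 定义,大致对应 Claude 为这一章生成的那类图(节点文字是根据本节的描述改写的,并非 Claude 的原始输出):

```mermaid
graph TD
    A["劳动分工 (Division of Labor)"] --> B["受市场范围限制 (Extent of the Market)"]
    B --> C["陆路运输 vs 水路运输"]
    B --> D["受益于水路运输的早期文明"]
```

Mermaid 只是一种文本化的图表描述语言:LLM 输出这段文本,界面负责把它渲染成树状/流程图。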

[原文] [Andrej Karpathy]: and then these are the pieces of the chapter so there's the comparative example um of trade and how much easier it is to do on land and on water and the specific example that's used and that Geographic factors actually make a huge difference here and then the comparison of land transport versus water transport and how much easier water transport is and then here we have some early civilizations that have all benefited from basically the availability of water water transport and have flourished as a result of it because they support specialization

[译文] [Andrej Karpathy]: 然后这些是章节的各个部分。这里有关于贸易的比较例子,在陆地和水上进行的难易程度,以及使用的具体例子,地理因素在这里实际上造成了巨大的差异。然后是陆路运输与水路运输的比较,水路运输要容易得多。然后这里我们有一些早期文明,它们基本上都受益于水路运输的可用性,并因此而繁荣,因为它们支持专业化。

[原文] [Andrej Karpathy]: so it's if you're a conceptual kind of like visual thinker and I think I'm a little bit like that as well I like to lay out information and like as like a tree like this and it helps me remember what that chapter is about very easily and I just really enjoy these diagrams and like kind of getting a sense of like okay what is the layout of the argument how is it arranged spatially and so on and so if you're like me then you will definitely enjoy this and you can make diagrams of anything of books of chapters of source codes of anything really and so I specifically find this fairly useful

[译文] [Andrej Karpathy]: 所以,如果你是一个概念型的、视觉型的思考者——我觉得我也有一点像那样——我喜欢把信息像这样排成树状图,这有助于我非常容易地记住那一章是关于什么的。我真的非常喜欢这些图表,喜欢那种感觉,比如“好的,论点的布局是什么?它是如何在空间上排列的?”等等。如果你像我一样,那么你肯定会喜欢这个。你可以为任何东西制作图表:书籍、章节、源代码,真的任何东西。所以我特别觉得这相当有用。


章节 11:图像能力(Images)

📝 本节摘要

这部分主要讲解了 LLM 如何处理图像模态,包括“看”图像(输入)和“画”图像(输出)。

1. 技术原理:图像即 Token

Karpathy 解释了模型如何理解图像。类似于文本和音频,图像也可以被转化为 Token 序列

处理方式:将图像切割成矩形网格(Patches),每个小块被量化为一个 Token。
模型视角:对 Transformer 神经网络而言,它并不知道这些 Token 是文本、音频还是图像,它只是在处理 Token 流的统计模式。这使得单一模型可以同时处理多种模态。
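“把图像切成小块、每块量化为一个 Token”的思路可以用一个极简的玩具示例说明(真实模型使用学习得到的编码器/码本,这里只演示概念,数值和分桶方式都是随意设定的):

```python
# 把 4x4 灰度"图像"切成 2x2 patch,并把每个 patch 粗暴量化为一个 token id
def patchify(img, p=2):
    patches = []
    for r in range(0, len(img), p):
        for c in range(0, len(img[0]), p):
            patches.append([img[r + dr][c + dc]
                            for dr in range(p) for dc in range(p)])
    return patches

def quantize(patch, levels=4):
    # 玩具"码本":按 patch 均值分桶,得到 0..levels-1 的 id
    mean = sum(patch) / len(patch)
    return min(int(mean * levels / 256), levels - 1)

img = [[0, 0, 255, 255],
       [0, 0, 255, 255],
       [128, 128, 64, 64],
       [128, 128, 64, 64]]
tokens = [quantize(pt) for pt in patchify(img)]
print(tokens)  # → [0, 3, 2, 1]
```

得到的 `[0, 3, 2, 1]` 就是一条普通的 Token 序列:对 Transformer 而言,它与文本 Token 在形式上没有区别。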

2. 图像输入功能 (Image Input / Vision)

Karpathy 展示了他如何在日常生活中使用上传图片的功能(Vision),让模型进行分析。他通常建议分两步走:先让模型把图片内容转录为文本(以确保它看清了数据),然后再针对文本提问。

具体使用案例:

营养成分分析

* 他上传了 Brian Johnson 的“长寿混合粉”(Longevity Mix)的营养标签,让模型解释其中的成分(如 CaAKG 是什么),并根据安全性对成分进行排序,了解哪些是基础成分,哪些比较特殊。

* 他上传了 Colgate 牙膏的成分表,询问哪些是功能性成分,哪些是没必要的添加剂(如色素)。结果发现很多成分其实是不必要的。

医疗报告解读

* 他上传了自己验血报告(血脂面板)的截图。因为报告原本是 PDF 且难以阅读,他通过截图上传,让 ChatGPT 读取数据并进行解释。他认为对于这类常见医疗数据,模型的知识储备通常很足,但也提醒要仔细核对数据转录是否正确。

学术与数学问题

* 他在阅读论文时遇到一个数学难题(trick question),直接截图上传,让模型识别公式并给出答案(例如求 x 的值)。

解释梗图 (Memes)

* 他展示了模型可以理解幽默。例如一张关于乌鸦的梗图(一群乌鸦叫 "murder",一只乌鸦试图加入另一只叫 "attempted murder"),模型能准确解释其中的双关语笑点。

3. 图像输出功能 (Image Output / Generation)

这部分介绍了使用模型生成图像的能力(如 OpenAI 的 DALL-E 3)。

应用场景:Karpathy 个人主要用它来制作 YouTube 视频的封面图(Thumbnails)或生成图标(Icons)。他展示了如何结合搜索工具生成一张“总结今日头条”的图片。
工作机制:目前的 DALL-E 3 在生成图像时,实际上是将用户的简短提示词先由 LLM 改写为详细的 Caption(描述),然后再传给专门的图像生成模型去绘制。

:如果您的“第 11 章”指的是紧随其后的视频 (Video) 或者之前的 语音 (Audio) 部分,请告知,我可以为您补充相关内容。但按照常见的话题切分,“图像”通常是排在第 11 个位置(继 Intro, Thinking, Search, Deep Research, Uploads, Python, Artifacts, Cursor, Voice, NotebookLM 之后)。

章节 12:视频能力(Video)

📝 本节摘要

这部分内容主要展示了多模态模型如何处理动态的视觉信息,包括通过摄像头“看”世界(输入)和生成视频(输出)。

1. 视频输入功能 (Video Input / Camera)

Karpathy 演示了如何通过手机端的 Advanced Voice Mode(高级语音模式)使用实时视频功能。

功能入口:该功能目前主要存在于移动端 App 中,在高级语音模式下点击摄像头图标即可开启。
演示过程:他拿着手机在房间里走动,让模型识别镜头中的物体,模型能够进行实时对话和识别:

* 吸音棉 (Acoustic Foam):模型认出了墙上的黑色吸音板,并询问是否在测试减少回声的效果。

* 书籍识别:模型准确认出了两本书——Jack Weatherford 的《成吉思汗与今日世界的形成》(Genghis Khan and the Making of the Modern World),以及理查德·费曼的《别闹了,费曼先生》(Surely You're Joking, Mr. Feynman)。

* 仪器读数:模型认出了 Aranet4 二氧化碳监测仪,并读取了屏幕上的数值(713 PPM),解释说这是正常的室内空气质量(通常建议低于 800 PPM)。

* 地图:模型认出了《指环王》中的 中土世界 (Middle Earth) 地图。

使用场景:Karpathy 坦言自己作为“硬核用户”平时很少用这个功能(更习惯用代码或文本),但他认为这非常适合向父母或祖父母展示 AI,因为它也是一种极其自然、低门槛的交互方式——只需指着物体提问即可。
技术推测:他推测目前模型在后台可能并不是处理连续的视频流,而是每秒截取一张图片进行分析,但在用户体验上感觉像是流畅的视频通话。
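Karpathy 推测后台可能是按固定间隔抽帧(例如每秒一张图片)而不是处理连续视频流。下面用一小段代码示意这种抽帧逻辑(纯属对该推测的演示,并非任何产品的真实实现):

```python
def sample_timestamps(duration_s: float, fps: float = 1.0):
    """返回按给定抽帧率采样的时间戳列表(单位:秒)"""
    step = 1.0 / fps
    times, t = [], 0.0
    while t < duration_s:
        times.append(round(t, 3))
        t += step
    return times

print(sample_timestamps(5))  # → [0.0, 1.0, 2.0, 3.0, 4.0]
```

每个时间戳对应一帧被送去做图像理解;只要模型响应够快,用户体验上就接近流畅的视频通话。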

2. 视频输出功能 (Video Output / Generation)

这部分简要提到了 AI 生成视频的能力。

现状:视频生成工具正在飞速发展(如 OpenAI 的 Sora,以及 Luma, Runway 等竞品)。
案例:他展示了一条推文,对比了不同模型生成“老虎在丛林中”的视频效果,认为现在的质量已经非常惊人(如 Kling V2 等模型)。
个人应用:Karpathy 表示自己不是创意类工作者,所以在日常工作中用得不多,但这显然是行业内一个快速进化的领域。

:视频内容的最后一个主要板块是 第 13 章:生活质量功能 (Quality of Life Features),涵盖了记忆 (Memory)、自定义指令 (Custom Instructions) 和自定义 GPTs (Custom GPTs)。

章节 13:生活质量功能(Quality of Life Features)

📝 本节摘要

这一部分主要介绍了如何通过个性化设置,让模型更了解用户,从而减少重复操作,提升交互效率。

1. 记忆功能 (Memory)

Karpathy 认为这是 ChatGPT 目前独有的一个非常有用的功能。

功能描述:模型能够跨对话记住用户的偏好和信息。它本质上是一个关于用户的独立“知识数据库”,每次对话时都会被预加载进去。
演示案例

* Karpathy 询问模型认为“好莱坞的巅峰期”是什么时候。模型回答是 90 年代末到 2000 年代初。Karpathy 表示“完全同意”,模型随后更新了记忆库,记录下用户持有这一观点。

* 应用价值:随着时间推移,模型会越来越了解用户。例如,有了这些记忆,模型在推荐电影时会更符合用户的口味。

管理记忆:用户可以查看、编辑或删除这些记忆条目。
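“记忆功能”的机制可以粗略理解为:一份跨对话持久保存的用户条目库,在每次新对话开始时被预置进上下文。下面是这个理解的玩具示意(真实实现并未公开,函数与数据结构均为假设性演示):

```python
memories = []  # 跨对话持久化的"记忆库"(这里用进程内列表代替真正的存储)

def remember(fact: str):
    """对应界面上的"记忆已更新":把一条用户事实存入记忆库"""
    memories.append(fact)

def build_context(user_message: str) -> str:
    # 每次新对话:把全部记忆条目预置在上下文最前面
    header = "Known about the user:\n" + "\n".join(f"- {m}" for m in memories)
    return f"{header}\n\nUser: {user_message}"

remember("认为好莱坞的巅峰期是 90 年代末到 2000 年代初")
print(build_context("给我推荐一部电影"))
```

由于记忆条目只是文本,用户自然可以像视频里说的那样查看、编辑或删除它们。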

2. 自定义指令 (Custom Instructions)

这是一个全局设置,用于规定模型在所有对话中的行为模式和语气。

Karpathy 的设置

* 语气要求:他告诉模型不要像个“HR 业务伙伴”那样说话,要正常交流;多做解释和教学。

* 特定场景:他正在学习韩语,因此设定了在涉及韩语时默认使用的敬语等级(避免模型在非正式和过度正式之间乱跳)。

3. 自定义 GPTs (Custom GPTs)

Karpathy 将其描述为“保存提示词工程的时间”——如果你有一个经常重复使用的复杂任务,可以把它封装成一个 Custom GPT。他主要将其用于语言学习

案例一:韩语单词提取器 (Korean Vocabulary Extractor)

* 功能:输入一个韩语句子,它会自动提取其中的单词,并转换成“韩语;英语”的格式,方便直接复制到 Anki 背单词软件中。

* 技巧:他使用了少样本提示 (Few-Shot Prompting),即在指令中给了模型 4 个具体的输入输出示例,这比单纯用语言描述任务效果好得多。
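“少样本提示”就是在任务指令后面附上若干“输入→输出”的完整示例,再接上真正要处理的输入。下面按这个思路拼装一个韩语单词提取器的提示词(示例句子是随意编的占位内容,并非 Karpathy 的原始示例):

```python
FEW_SHOT = [  # (输入句, 期望输出) —— 占位示例,非原始数据
    ("저는 학생입니다.", "저;I\n학생;student"),
    ("물 주세요.", "물;water\n주세요;please give me"),
]

def build_prompt(sentence: str) -> str:
    parts = ['把下面句子里的单词提取为"韩语;英语"格式,每行一个。']
    for src, out in FEW_SHOT:
        parts.append(f"Input: {src}\nOutput:\n{out}")
    parts.append(f"Input: {sentence}\nOutput:")  # 留空让模型续写
    return "\n\n".join(parts)

print(build_prompt("감사합니다."))
```

示例对比纯文字描述更能约束输出格式,这正是视频里“给 4 个示例比单纯描述任务效果好得多”的做法。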

案例二:详细翻译器 (Detailed Translator)

* 痛点:Google Translate 或 Papago 往往只给出一个大概的翻译,对于语言学习者来说,无法理解句子结构。

* 功能:不仅翻译整句,还会把句子拆解,逐个部分解释语法和词义。Karpathy 认为这比市面上任何翻译器都好用,因为它能处理俚语和细微差别。

案例三:字幕截图助手

* 场景:他在看韩综(如《单身即地狱》 Singles Inferno)时,视频字幕是嵌在画面里的,无法复制。

* 工作流:他截图上传给这个 Custom GPT,模型会先进行 OCR(文字识别),识别出韩语字幕,然后翻译并拆解语法结构,让他能边看剧边学习。


总结

Karpathy 在视频最后总结道,目前的 LLM 生态系统像是一个“动物园”,不同的应用(ChatGPT, Claude, Grok, Perplexity, Cursor 等)各有千秋。用户需要根据自己的需求(如是否需要联网搜索、是否需要编程、是否需要多模态交互)灵活选择工具。

章节 14:总结与生态系统(Conclusion & Summary)

📝 本节摘要

Karpathy 在视频的最后对当前的大语言模型(LLM)生态系统进行了全面的回顾和总结。他将目前的现状描述为一个“充满活力的动物园(Zoo)”,各种 App 都在快速进化。

1. 生态系统概览

百花齐放:虽然 ChatGPT 是目前的“领头羊”(Incumbent),功能最丰富,但其他竞争对手(如 Claude, Grok, Perplexity 等)也在迅速追赶,甚至在某些特定领域超越了 ChatGPT。
各有所长:Karpathy 建议用户根据具体需求选择工具:

* 搜索:他倾向于用 Perplexity

* Web 原型/图表:他喜欢用 Claude Artifacts

* 语音交互:他认为 ChatGPT Advanced Voice 很不错。

* 娱乐/无限制:如果 ChatGPT 拒绝回答(too cagey),可以尝试 Grok

2. 核心概念回顾 (Mental Model Checklist)

他在选择和使用模型时,通常会关注以下几个维度:

模型层级 (Pricing Tiers)

* 要意识到你是在和一个“压缩文件”(Zip File)对话。

* 大模型(通常付费):知识渊博、写作更有创意。

* 小模型(通常免费):能力较弱,更容易产生幻觉。

思考模型 (Thinking Models)

* 对于数学、代码或推理难题,应使用经过强化学习训练的“思考模型”(如 o1/o3, DeepSeek R1)。

* 普通问题不需要等待思考时间,用普通模型即可。

工具使用 (Tool Use)

* 联网搜索:如果你问的是近期发生的新闻或只有特定网站才有的信息,必须确保模型能联网。

* 代码解释器:如果你需要生成图表、进行精确计算,需要模型能运行 Python 代码(如 Advanced Data Analysis)。

多模态 (Multimodality)

* 关注模型是原生支持(Native/Omni)多模态,还是通过外部模型“拼凑”的。原生支持的模型在处理音频和视频时的反应和理解能力会更强。

平台差异

* 注意网页版(Web)和手机 App 版(Mobile)的功能差异。例如,高级语音模式和摄像头视频输入通常只在手机 App 上有。

3. 结语

Karpathy 鼓励大家多尝试不同的工具,找到最适合自己工作流和偏好的模型,因为这个领域的变化非常快,不同的 App 在不同时期会有不同的优势。