Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

章节 01:扩展定律(Scaling Laws)的历史与假设

📝 本节摘要

本章涵盖了访谈的正式开场与核心理论的起源。Lex Fridman 首先介绍了 Anthropic 的三位重量级嘉宾:CEO Dario Amodei、研究员 Amanda Askell 和 Chris Olah。随后,对话切入正题,Dario 回顾了他在 2014 年至 2017 年间的职业经历。他描述了自己在百度(Baidu)与吴恩达(Andrew Ng)共事时,通过观察语音识别系统的训练,首次萌生了“简单扩大规模(Scaling)就能提升性能”的直觉。这种直觉在 2017 年看到 GPT-1 的结果时得到了验证,使他确信只要持续增加数据和算力,AI 就能处理极其广泛的认知任务。

[原文] [Lex Fridman]: The following is a conversation with Dario Amodei, CEO of Anthropic, the company that created Claude that is currently and often at the top of most LLM benchmark leaderboards.

[译文] [Lex Fridman]: 接下来的内容是与 Anthropic 公司 CEO Dario Amodei 的对话,该公司创造了 Claude,目前该模型经常在大多数大语言模型(LLM)基准测试排行榜上名列前茅。

[原文] [Lex Fridman]: On top of that, Dario and the Anthropic team have been outspoken advocates for taking the topic of AI safety very seriously, and they have continued to publish a lot of fascinating AI research on this and other topics.

[译文] [Lex Fridman]: 除此之外,Dario 和 Anthropic 团队一直是直言不讳的倡导者,主张非常严肃地对待 AI 安全这一议题,并且他们持续在该领域及其他课题上发表大量引人入胜的 AI 研究成果。

[原文] [Lex Fridman]: I'm also joined afterwards by two other brilliant people from Anthropic.

[译文] [Lex Fridman]: 稍后,还有另外两位来自 Anthropic 的杰出人士也将加入我的谈话。

[原文] [Lex Fridman]: First Amanda Askell, who is a researcher working on alignment and fine tuning of Claude, including the design of Claude's character and personality.

[译文] [Lex Fridman]: 首先是 Amanda Askell,她是一名研究员,致力于 Claude 的对齐(alignment)和微调(fine tuning)工作,包括设计 Claude 的角色和性格。

[原文] [Lex Fridman]: A few folks told me she has probably talked with Claude more than any human at Anthropic.

[译文] [Lex Fridman]: 有几个人告诉我,她可能比 Anthropic 的任何人都更多地与 Claude 交谈过。

[原文] [Lex Fridman]: So she was definitely a fascinating person to talk to about prompt engineering and practical advice on how to get the best out of Claude.

[译文] [Lex Fridman]: 所以,如果要探讨提示词工程(prompt engineering)以及如何充分利用 Claude 的实用建议,她绝对是一个令人着迷的对话对象。

[原文] [Lex Fridman]: After that, Chris Olah stopped by for a chat.

[译文] [Lex Fridman]: 在那之后,Chris Olah 也过来聊了聊。

[原文] [Lex Fridman]: He's one of the pioneers of the field of mechanistic interpretability, which is an exciting set of efforts that aims to reverse engineer neural networks to figure out what's going on inside, inferring behaviors from neural activation patterns inside the network.

[译文] [Lex Fridman]: 他是机械可解释性(mechanistic interpretability)领域的先驱之一,这是一系列令人兴奋的尝试,旨在对神经网络进行逆向工程以弄清其内部运作机制,通过网络内部的神经激活模式来推断其行为。

[原文] [Lex Fridman]: This is a very promising approach for keeping future super intelligent AI systems safe.

[译文] [Lex Fridman]: 这是确保未来超级智能 AI 系统安全的一种非常有前途的方法。

[原文] [Lex Fridman]: For example, by detecting from the activations when the model is trying to deceive the human it is talking to.

[译文] [Lex Fridman]: 例如,通过激活模式检测模型何时试图欺骗与之交谈的人类。

[原文] [Lex Fridman]: This is the "Lex Fridman Podcast." To support it, please check out our sponsors in the description.

[译文] [Lex Fridman]: 这里是《Lex Fridman 播客》。如果要支持本节目,请查看描述中的赞助商信息。

[原文] [Lex Fridman]: And now, dear friends, here's Dario Amodei.

[译文] [Lex Fridman]: 现在,亲爱的朋友们,有请 Dario Amodei。

[原文] [Lex Fridman]: Let's start with the big idea of scaling laws and the Scaling Hypothesis. What is it? What is its history? And where do we stand today?

[译文] [Lex Fridman]: 让我们从扩展定律(scaling laws)和扩展假说(Scaling Hypothesis)这个大概念开始吧。它是什么?它的历史是怎样的?我们今天处于什么位置?

[原文] [Dario Amodei]: So I can only describe it as, you know, as it relates to kind of my own experience.

[译文] [Dario Amodei]: 我只能根据我自己的经历来描述它。

[原文] [Dario Amodei]: But I've been in the AI field for about 10 years and it was something I noticed very early on.

[译文] [Dario Amodei]: 我进入 AI 领域大约 10 年了,这是我很早就注意到的事情。

[原文] [Dario Amodei]: So I first joined the AI world when I was working at Baidu with Andrew Ng in late 2014, which is almost exactly 10 years ago now.

[译文] [Dario Amodei]: 我最初进入 AI 世界是在 2014 年底,当时我在百度(Baidu)与吴恩达(Andrew Ng)共事,距离现在几乎正好 10 年。

[原文] [Dario Amodei]: And the first thing we worked on was speech recognition systems.

[译文] [Dario Amodei]: 我们当时做的第一件事就是语音识别系统。

[原文] [Dario Amodei]: And in those days I think deep learning was a new thing, it had made lots of progress, but everyone was always saying, we don't have the algorithms we need to succeed.

[译文] [Dario Amodei]: 在那个年代,深度学习还是个新鲜事物,虽然已经取得了很多进展,但每个人都还在说,我们还没有获得成功所需的算法。

[原文] [Dario Amodei]: You know, we're not, we're only matching a tiny, tiny fraction.

[译文] [Dario Amodei]: 你知道,我们还没有……我们目前只能做到(人脑所能做到的)极其微小的一部分。

[原文] [Dario Amodei]: There's so much we need to kind of discover algorithmically.

[译文] [Dario Amodei]: 在算法层面上,我们还有太多东西需要去探索。

[原文] [Dario Amodei]: We haven't found the picture of how to match the human brain.

[译文] [Dario Amodei]: 我们还没有找到如何匹敌人类大脑的完整图景。

[原文] [Dario Amodei]: And when, you know, in some ways I was fortunate, I was kind of, you know, you can have almost beginner's luck, right?

[译文] [Dario Amodei]: 而当……你知道,在某些方面我很幸运,某种程度上,这几乎算是初学者的运气,对吧?

[原文] [Dario Amodei]: I was like a newcomer to the field and, you know, I looked at the neural net that we were using for speech, the recurrent neural networks, and I said, I don't know, what if you make them bigger and give them more layers?

[译文] [Dario Amodei]: 我当时就像这个领域的新人,我看着我们用于语音识别的神经网络——也就是循环神经网络(recurrent neural networks)——我说,我不知道,如果把它们做得更大、给它们加更多层会怎样?

[原文] [Dario Amodei]: And what if you scale up the data along with this, right?

[译文] [Dario Amodei]: 如果与此同时你也扩大数据规模会怎样?

[原文] [Dario Amodei]: I just saw these as like independent dials that you could turn.

[译文] [Dario Amodei]: 我只是把这些看作是可以转动的独立旋钮。

[原文] [Dario Amodei]: And I noticed that the model started to do better and better as you gave them more data, as you made the models larger, as you trained them for longer.

[译文] [Dario Amodei]: 我注意到,当你给它们更多数据,当你把模型做得更大,当你训练它们的时间更长时,模型开始表现得越来越好。

[原文] [Dario Amodei]: And I didn't measure things precisely in those days, but along with colleagues, we very much got the informal sense that the more data and the more compute and the more training you put into these models, the better they perform.

[译文] [Dario Amodei]: 那时候我没有精确测量这些指标,但我和同事们强烈地产生了一种非正式的感觉:你投入这些模型的数据越多、算力(compute)越多、训练越多,它们的表现就越好。

[原文] [Dario Amodei]: And so initially my thinking was, hey, maybe that is just true for speech recognition systems, right?

[译文] [Dario Amodei]: 所以起初我的想法是,嘿,也许这只适用于语音识别系统,对吧?

[原文] [Dario Amodei]: Maybe that's just one particular quirk, one particular area.

[译文] [Dario Amodei]: 也许那只是一个特定的怪现象,一个特定的领域。

[原文] [Dario Amodei]: I think it wasn't until 2017 when I first saw the results from GPT-1 that it clicked for me that language is probably the area in which we can do this.

[译文] [Dario Amodei]: 我想直到 2017 年,当我第一次看到 GPT-1 的结果时,我才突然明白(clicked for me),语言可能就是我们可以实现这一点的领域。

[原文] [Dario Amodei]: We can get trillions of words of language data, we can train on them.

[译文] [Dario Amodei]: 我们可以获得数万亿个词的语言数据,我们可以在这些数据上进行训练。

[原文] [Dario Amodei]: And the models we were training those days were tiny.

[译文] [Dario Amodei]: 我们那个年代训练的模型非常小。

[原文] [Dario Amodei]: You could train them on one to eight GPUs whereas, you know, now we train jobs on tens of thousands, soon going to hundreds of thousands of GPUs.

[译文] [Dario Amodei]: 那时候你可以在 1 到 8 个 GPU 上训练它们,而你知道,现在我们的训练任务是在数万个 GPU 上进行的,很快就要达到数十万个 GPU。

[原文] [Dario Amodei]: And so when I saw those two things together and, you know, there were a few people like Ilya Sutskever, who you've interviewed, who had somewhat similar views, right?

[译文] [Dario Amodei]: 所以当我把这两件事结合起来看时……你知道,还有少数人,比如你采访过的 Ilya Sutskever,也持有类似的观点,对吧?

[原文] [Dario Amodei]: He might have been the first one, although I think a few people came to similar views around the same time, right?

[译文] [Dario Amodei]: 他可能是第一个,尽管我认为有几个人大约在同一时间得出了类似的观点,对吧?

[原文] [Dario Amodei]: There was, you know, Rich Sutton's Bitter Lesson, there was Gwern wrote about the Scaling Hypothesis.

[译文] [Dario Amodei]: 比如 Rich Sutton 的文章《苦涩的教训》(The Bitter Lesson),还有 Gwern 写的关于扩展假说(Scaling Hypothesis)的文章。

[原文] [Dario Amodei]: But I think somewhere between 2014 and 2017 was when it really clicked for me, when I really got conviction that, hey, we're gonna be able to do these incredibly wide cognitive tasks if we just scale up the models.

[译文] [Dario Amodei]: 但我认为大概在 2014 年到 2017 年之间,我真正顿悟了,我真正确信:嘿,只要我们扩大模型规模,我们就能完成这些极其广泛的认知任务。

[原文] [Dario Amodei]: And at every stage of scaling, there are always arguments.

[译文] [Dario Amodei]: 而在扩展的每一个阶段,总会存在争议。

[原文] [Dario Amodei]: And you know, when I first heard them, honestly, I thought probably I'm the one who's wrong, and, you know, all these experts in the field are right.

[译文] [Dario Amodei]: 你知道,当我第一次听到这些争论时,老实说,我想可能我才是错的,而领域内的所有这些专家是对的。

[原文] [Dario Amodei]: They know the situation better than I do, right?

[译文] [Dario Amodei]: 他们比我更了解情况,对吧?

[原文] [Dario Amodei]: There's, you know, the Chomsky argument about like, you can get syntactics, but you can't get semantics.

[译文] [Dario Amodei]: 比如有乔姆斯基(Chomsky)的论点,认为你可以获得句法(syntactics),但你无法获得语义(semantics)。


章节 02:为何规模效应有效:物理学直觉与层级结构

📝 本节摘要

在本节中,Lex 询问 Dario 对“越大越好”这一现象背后的直觉。Dario 结合自己生物物理学的背景,用物理学中的“1/f 噪声”(粉红噪声)和自然界的正态分布作为类比。他提出语言数据中存在从简单的词法搭配到复杂的段落逻辑的层级结构。小模型只能捕捉最常见的简单模式(如主谓宾结构),而随着规模扩大,模型能够捕捉到长尾分布中更罕见、更复杂的语义和逻辑模式,从而产生智能的“涌现”。

[原文] [Lex Fridman]: A bit of a philosophical question, but what's your intuition about why bigger is better in terms of network size and data size? Why does it lead to more intelligent models?

[译文] [Lex Fridman]: 这是一个有点哲学意味的问题,但关于为什么在网络规模和数据规模方面越大越好,你的直觉是什么?为什么这会导致更智能的模型?

[原文] [Dario Amodei]: So in my previous career as a biophysicist, so I did physics undergrad and then biophysics in grad school. So I think back to what I know as a physicist, which is actually much less than what some of my colleagues at Anthropic have in terms of expertise in physics. There's this concept called the 1/f noise and 1/x distributions where often, you know, just like if you add up a bunch of natural processes, you get a Gaussian.

[译文] [Dario Amodei]: 在我之前作为生物物理学家的职业生涯中——我本科读的是物理,然后在研究生阶段读了生物物理学。所以我回想起我作为物理学家所知道的知识,实际上比起我在 Anthropic 的一些同事在物理学方面的专业知识,我知道的要少得多。有一个概念叫做 1/f 噪声(1/f noise)和 1/x 分布,通常……就像如果你把一堆自然过程加在一起,你会得到一个高斯分布(Gaussian)。

[原文] [Dario Amodei]: If you add up a bunch of kind of differently distributed natural processes, if you like take a probe and hook it up to a resistor, the distribution of the thermal noise in the resistor goes as one over the frequency. It's some kind of natural convergent distribution.

[译文] [Dario Amodei]: 如果你把一堆分布不同的自然过程加在一起,比如你拿一个探针连接到一个电阻上,电阻中热噪声的分布是随频率倒数(1/f)变化的。这是一种自然的收敛分布。

[原文] [Dario Amodei]: And I think what it amounts to is that if you look at a lot of things that are produced by some natural process that has a lot of different scales, right? Not a Gaussian, which is kind of narrowly distributed, but you know, if I look at kind of like large and small fluctuations that lead to electrical noise, they have this decaying 1/x distribution.

[译文] [Dario Amodei]: 我认为这实际上意味着,如果你观察许多由某种具有多种不同尺度的自然过程产生的事物,对吧?不是那种分布狭窄的高斯分布,而是,如果你观察导致电噪声的大大小小的波动,它们具有这种衰减的 1/x 分布。
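作为一点补充说明:上面提到的 1/f(“粉红”)噪声可以用一小段代码来示意。下面是一个常见的频域合成草图(假设使用 Python 与 NumPy;仅作示意,并非访谈中提到的任何具体实现):把白噪声的傅里叶系数按 1/√f 缩放,得到的信号功率谱便近似按 1/f 衰减。

```python
import numpy as np

rng = np.random.default_rng(0)

# 频域合成:把白噪声的傅里叶系数按 1/sqrt(f) 缩放,
# 得到功率谱近似 ~ 1/f 的“粉红噪声”
n = 2 ** 16
white = rng.standard_normal(n)
spectrum = np.fft.rfft(white)
freqs = np.fft.rfftfreq(n)
scale = np.ones_like(freqs)
scale[1:] = 1.0 / np.sqrt(freqs[1:])   # 跳过 f=0,避免除零
pink = np.fft.irfft(spectrum * scale, n)

# 检查:低频段的平均功率远大于高频段,呈随频率衰减的分布
psd = np.abs(np.fft.rfft(pink)) ** 2
low_power = psd[1:100].mean()
high_power = psd[-100:].mean()
```

这与电阻热噪声那种“许多不同尺度的自然过程叠加”的物理图景并不是同一回事,只是演示 1/f 谱“低频成分占优、长尾缓慢衰减”这一特征。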

[原文] [Dario Amodei]: And so now I think of like patterns in the physical world, right? Or in language. If I think about the patterns in language, there are some really simple patterns. Some words are much more common than others like "the," then there's basic noun verb structure, then there's the fact that, you know, nouns and verbs have to agree, they have to coordinate.

[译文] [Dario Amodei]: 现在我想想物理世界中的模式,或者是语言中的模式。如果我思考语言中的模式,会有一些非常简单的模式。有些词比其他词更常见,比如“the”,然后是基本的名词动词结构,接着是名词和动词必须一致,它们必须协调。

[原文] [Dario Amodei]: And there's the higher level sentence structure, then there's the thematic structure of paragraphs. And so the fact that there's this regressing structure, you can imagine that as you make the networks larger, first they capture the really simple correlations, the really simple patterns, and there's this long tail of other patterns.

[译文] [Dario Amodei]: 还有更高层级的句子结构,以及段落的主题结构。正因为存在这种回归结构,你可以想象,当你把网络做大时,它们首先捕捉到的是真正简单的相关性,真正简单的模式,然后还有这种长尾的其他模式。

[原文] [Dario Amodei]: And if that long tail of other patterns is really smooth like it is with the 1/f noise in, you know, physical processes like resistors, then you can imagine as you make the network larger, it's kind of capturing more and more of that distribution, and so that smoothness gets reflected in how well the models are at predicting and how well they perform.

[译文] [Dario Amodei]: 如果这种长尾的其他模式非常平滑,就像物理过程(如电阻)中的 1/f 噪声那样,那么你可以想象,当你把网络做大时,它就在捕捉越来越多的这种分布,而这种平滑性就反映在模型的预测能力和表现上。

[原文] [Dario Amodei]: Language is an evolved process, right? We've developed language, we have common words and less common words. We have common expressions and less common expressions. We have ideas, cliches that are expressed frequently, and we have novel ideas. And that process has developed, has evolved with humans over millions of years. And so the guess, and this is pure speculation would be that there's some kind of long tail distribution of the distribution of these ideas.

[译文] [Dario Amodei]: 语言是一个进化的过程,对吧?我们发展了语言,我们有常用词和不常用词。我们有常用表达和不常用表达。我们有经常被表达的观点、陈词滥调,也有新颖的观点。这一过程随着人类在数百万年间不断发展、进化。所以我的猜测——这纯粹是推测——就是这些思想的分布存在某种长尾分布。
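顺着这个“长尾”猜想,可以用一个极简的数值例子来示意:假设模式的出现频率大致服从齐普夫分布(频率 ∝ 1/排名,这是对词频的经典近似),那么“只掌握最常见的前 k 个模式”所能覆盖的概率质量增长得很慢,要覆盖长尾就需要不断增加容量。以下 Python 草图中的词表大小等参数均为随意假设,仅作说明:

```python
import numpy as np

# 齐普夫分布:第 r 个模式的出现频率正比于 1/r
vocab = 50_000
ranks = np.arange(1, vocab + 1)
probs = (1.0 / ranks) / (1.0 / ranks).sum()

def coverage(k: int) -> float:
    """最常见的前 k 个模式所覆盖的概率质量。"""
    return float(probs[:k].sum())

cov_100 = coverage(100)        # 约 0.46:一小撮高频模式就覆盖近半
cov_10000 = coverage(10_000)   # 约 0.86:模式数扩大 100 倍,覆盖率只涨到约 86%
```

可以看到,从 100 个模式扩大到 10,000 个模式(100 倍),覆盖率只从约 46% 升到约 86%,其余部分分散在长尾中——这与“模型越大、捕捉到的罕见模式越多,但收益平滑递增”的直觉一致。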

[原文] [Lex Fridman]: So there's the long tail, but also there's the height of the hierarchy of concepts that you're building up. So the bigger the network, presumably you have a higher capacity to-

[译文] [Lex Fridman]: 所以不仅有长尾,还有你正在构建的概念层级的高度。所以网络越大,想必你就有更高的容量去——

[原文] [Dario Amodei]: Exactly, if you have a small network, you only get the common stuff, right? If I take a tiny neural network, it's very good at understanding that, you know, a sentence has to have, you know, verb, adjective, noun, right?

[译文] [Dario Amodei]: 正是,如果你有一个小网络,你只能得到常见的东西,对吧?如果我拿一个很小的神经网络,它非常擅长理解……你知道,一个句子必须有动词、形容词、名词,对吧?

[原文] [Dario Amodei]: But it's terrible at deciding what those verb, adjective and noun should be and whether they should make sense. If I make it just a little bigger, it gets good at that, then suddenly it's good at the sentences, but it's not good at the paragraphs. And so these rarer and more complex patterns get picked up as I add more capacity to the network.

[译文] [Dario Amodei]: 但它很糟糕的一点是,它无法决定那些动词、形容词和名词应该是什么,以及它们组合起来是否有意义。如果我把它做得稍微大一点,它就擅长那个了;然后突然间它擅长句子了,但还不擅长段落。所以,随着我向网络添加更多的容量,这些更罕见、更复杂的模式就会被捕捉到。


章节 03:扩展的潜在天花板:数据、算力与合成数据

📝 本节摘要

面对“扩展定律是否有尽头”的提问,Dario 坦言在达到人类智能水平之前,恐怕没有天花板。他指出,虽然在涉及人类官僚体系(如临床试验)的领域存在限制,但在生物学等纯智力领域,AI 有巨大的超越空间。针对“数据枯竭”的担忧,他提出了合成数据(Synthetic Data)作为解决方案,类比 DeepMind 的 AlphaGo Zero 通过自我博弈超越人类。此外,他预测算力集群的投资规模将在 2027 年达到 1000 亿美元级别,并指出按照当前的线性外推,AI 很快将在编程等领域达到人类博士或顶级专家的水平。

[原文] [Lex Fridman]: Well, the natural question then is, what's the ceiling of this?

[译文] [Lex Fridman]: 那么自然而然的问题就是,这一切的天花板在哪里?

[原文] [Dario Amodei]: Yeah.

[译文] [Dario Amodei]: 是的。

[原文] [Lex Fridman]: Like how complicated and complex is the real world? How much stuff is there to learn?

[译文] [Lex Fridman]: 比如现实世界到底有多复杂、多精密?有多少东西是需要去学习的?

[原文] [Dario Amodei]: I don't think any of us knows the answer to that question. My strong instinct would be that there's no ceiling below the level of humans, right?

[译文] [Dario Amodei]: 我认为我们没人知道这个问题的答案。但我强烈的直觉是,在达到人类水平之前不存在天花板,对吧?

[原文] [Dario Amodei]: We humans are able to understand these various patterns, and so that makes me think that if we continue to, you know, scale up these models to kind of develop new methods for training them and scaling them up, that will at least get to the level that we've gotten to with humans.

[译文] [Dario Amodei]: 我们人类能够理解这些各种各样的模式,所以这让我认为,如果我们继续扩大这些模型的规模,开发新的训练和扩展方法,至少能达到我们人类已经达到的水平。

[原文] [Dario Amodei]: There's then a question of, you know, how much more is it possible to understand than humans do? How much is it possible to be smarter and more perceptive than humans?

[译文] [Dario Amodei]: 接下来的问题是,AI 有可能比人类多理解多少?有可能比人类聪明多少、敏锐多少?

[原文] [Dario Amodei]: I would guess the answer has got to be domain dependent. If I look at an area like biology, and, you know, I wrote this essay, "Machines of Loving Grace." It seems to me that humans are struggling to understand the complexity of biology, right?

[译文] [Dario Amodei]: 我猜答案肯定是依领域而定的。如果我看像生物学这样的领域——你知道,我写过一篇名为《仁慈机器》(Machines of Loving Grace)的文章——在我看来,人类正在极其艰难地试图理解生物学的复杂性,对吧?

[原文] [Dario Amodei]: If you go to Stanford or to Harvard or to Berkeley, you have whole departments of, you know, folks trying to study, you know, like the immune system or metabolic pathways, and each person understands only a tiny bit, part of it, specializes, and they're struggling to combine their knowledge with that of other humans.

[译文] [Dario Amodei]: 如果你去斯坦福、哈佛或伯克利,你会看到整个院系的人都在试图研究免疫系统或代谢途径之类的东西,而每个人只理解极其微小的一部分,非常专业化,并且他们很难将自己的知识与其他人的知识结合起来。

[原文] [Dario Amodei]: And so I have an instinct that there's a lot of room at the top for AIs to get smarter.

[译文] [Dario Amodei]: 所以我有一种直觉,AI 在这方面还有很大的提升空间,可以变得更聪明。

[原文] [Dario Amodei]: If I think of something like materials in the physical world or you know, like addressing, you know, conflicts between humans or something like that. I mean, you know, it may be there's only, some of these problems are not intractable, but much harder. And it may be that there's only so well you can do at some of these things, right?

[译文] [Dario Amodei]: 如果我想到物理世界中的材料,或者解决人类之间的冲突之类的事情。我是说,也许其中一些问题并非不可解决,但要困难得多。而且在这些事情上,你可能只能做到某种程度,对吧?

[原文] [Dario Amodei]: Just like with speech recognition, there's only so clear I can hear your speech. So I think in some areas there may be ceilings, you know, that are very close to what humans have done. In other areas, those ceilings may be very far away.

[译文] [Dario Amodei]: 就像语音识别一样,我能听清你说话的程度是有限的。所以我认为在某些领域可能存在天花板,且非常接近人类已有的水平;而在其他领域,这些天花板可能非常遥远。

[原文] [Dario Amodei]: And I think we'll only find out when we build these systems. It's very hard to know in advance. We can speculate, but we can't be sure.

[译文] [Dario Amodei]: 我想我们要等到构建出这些系统时才会知道。很难提前预知。我们可以推测,但无法确定。

[原文] [Lex Fridman]: And in some domains, the ceiling might have to do with human bureaucracies and things like this, as you write about.

[译文] [Lex Fridman]: 正如你所写的,在某些领域,天花板可能与人类的官僚机构之类的东西有关。

[原文] [Dario Amodei]: Yes.

[译文] [Dario Amodei]: 是的。

[原文] [Lex Fridman]: So humans fundamentally have to be part of the loop. That's the cause of the ceiling, not maybe the limits of the intelligence.

[译文] [Lex Fridman]: 所以人类从根本上必须是循环的一部分。这才是天花板的成因,也许并非智能本身的限制。

[原文] [Dario Amodei]: Yeah, I think in many cases, you know, in theory, technology could change very fast, for example, all the things that we might invent with respect to biology. But remember there's a, you know, there's a clinical trial system that we have to go through to actually administer these things to humans.

[译文] [Dario Amodei]: 是的,我认为在很多情况下,理论上技术可以变化得非常快,例如我们在生物学方面可能发明的所有东西。但请记住,我们必须通过临床试验系统,才能真正将这些东西应用到人类身上。

[原文] [Dario Amodei]: I think that's a mixture of things that are unnecessary and bureaucratic and things that kind of protect the integrity of society. And the whole challenge is that it's hard to tell what's going on. It's hard to tell which is which, right?

[译文] [Dario Amodei]: 我认为这是不必要的官僚主义与某种保护社会完整性的机制的混合体。而整个挑战在于很难分辨到底发生了什么。很难分辨哪个是哪个,对吧?

[原文] [Dario Amodei]: My view is definitely, I think in terms of drug development, my view is that we're too slow and we're too conservative. But certainly if you get these things wrong, you know, it's possible to risk people's lives by being too reckless. And so at least some of these human institutions are in fact protecting people.

[译文] [Dario Amodei]: 我的观点很明确,在药物开发方面,我认为我们太慢了,太保守了。但当然,如果你搞错了,太鲁莽是有可能危及人们生命的。所以至少其中一些人类机构实际上是在保护人们。

[原文] [Lex Fridman]: If we do hit a limit, if we do hit a slow down in the scaling laws, what do you think would be the reason? Is it compute limited, data limited? Is it something else, idea limited?

[译文] [Lex Fridman]: 如果我们真的触及了极限,如果扩展定律真的放缓了,你认为原因会是什么?是算力受限、数据受限?还是其他原因,比如想法受限?

[原文] [Dario Amodei]: So, a few things. Now we're talking about hitting the limit before we get to the level of humans and the skill of humans. So, I think one that's, you know, one that's popular today and I think, you know, could be a limit that we run into. Like most of the limits, I would bet against it, but it's definitely possible is we simply run out of data.

[译文] [Dario Amodei]: 有几点。现在我们讨论的是在达到人类水平和技能之前触及极限的情况。我认为目前有一个很流行的观点——我也认为这可能是我们会遇到的一个限制。像大多数限制一样,我会赌它不会发生,但这绝对是有可能的:那就是我们仅仅是耗尽了数据。

[原文] [Dario Amodei]: There's only so much data on the internet and there's issues with the quality of the data, right? You can get hundreds of trillions of words on the internet, but a lot of it is repetitive or it's search engine, you know, search engine optimization drivel, or maybe in the future it'll even be text generated by AIs itself.

[译文] [Dario Amodei]: 互联网上的数据只有那么多,而且数据质量也存在问题,对吧?你可以在互联网上找到数百万亿个词,但其中很多是重复的,或者是搜索引擎优化的垃圾内容,或者未来甚至会是 AI 自己生成的文本。

[原文] [Dario Amodei]: And so I think there are limits to what can be produced in this way. That said, we and I would guess other companies are working on ways to make data synthetic where you can, you know, you can use the model to generate more data of the type that you have already or even generate data from scratch.

[译文] [Dario Amodei]: 所以我认为这种方式能生产的内容是有限的。话虽如此,我们——我猜其他公司也是——正在研究制作合成数据(synthetic data)的方法,你可以利用模型生成更多你已有的那类数据,甚至从零开始生成数据。

[原文] [Dario Amodei]: If you think about what was done with DeepMind's AlphaGo Zero, they managed to get a bot all the way from, you know, no ability to play Go whatsoever to above human level just by playing against itself. There was no example data from humans required in the AlphaGo Zero version of it.

[译文] [Dario Amodei]: 如果你想想 DeepMind 的 AlphaGo Zero 做了什么,他们成功地让一个机器人从完全不会下围棋进化到超越人类水平,仅仅通过自我博弈。AlphaGo Zero 版本完全不需要人类的示例数据。

[原文] [Dario Amodei]: The other direction, of course, is these reasoning models that do chain of thought and stop to think and reflect on their own thinking. In a way, that's another kind of synthetic data coupled with reinforcement learning. So my guess is with one of those methods, we'll get around the data limitation or there may be other sources of data that are available.

[译文] [Dario Amodei]: 另一个方向当然是这些能够进行思维链(chain of thought)、停下来思考并反思自己思维的推理模型。在某种程度上,那是另一种结合了强化学习的合成数据。所以我猜通过其中一种方法,我们将绕过数据限制,或者可能会有其他可用的数据来源。

[原文] [Lex Fridman]: What about the limits of compute? Meaning the expensive nature of building bigger and bigger data centers.

[译文] [Lex Fridman]: 那算力的限制呢?我是说建造越来越大的数据中心所带来的昂贵成本。

[原文] [Dario Amodei]: So right now, I think, you know, most of the frontier model companies I would guess are operating in, you know, roughly, you know, $1 billion scale, plus or minus a factor of three, right? Those are the models that exist now or are being trained now.

[译文] [Dario Amodei]: 目前,我想大多数前沿模型公司大概都在 10 亿美元这个规模上运作,上下浮动大约 3 倍,对吧?这就是目前存在或正在训练的模型的规模。

[原文] [Dario Amodei]: I think next year, we're gonna go to a few billion, and then 2026, we may go to, you know, above 10 billion, and probably by 2027, their ambitions to build 100 billion dollar clusters, and I think all of that actually will happen.

[译文] [Dario Amodei]: 我认为明年我们会达到几十亿美元,然后 2026 年可能会超过 100 亿美元,大概到 2027 年,会有建造 1000 亿美元(100 billion dollar) 集群的雄心,而且我认为所有这些实际上都会发生。

[原文] [Dario Amodei]: There's a lot of determination to build the compute to do it within this country, and I would guess that it actually does happen. Now, if we get to 100 billion, that's still not enough compute, that's still not enough scale then either we need even more scale or we need to develop some way of doing it more efficiently of shifting the curve.

[译文] [Dario Amodei]: 在这个国家(美国)内部,人们有很大的决心去建设算力来实现这一点,我猜这真的会发生。如果到了 1000 亿还不够算力、不够规模,那么我们要么需要更大的规模,要么需要开发某种更有效率的方法来移动这条曲线。

[原文] [Dario Amodei]: I think between all of these, one of the reasons I'm bullish about powerful AI happening so fast is just that if you extrapolate the next few points on the curve, we're very quickly getting towards human level ability, right?

[译文] [Dario Amodei]: 综合所有这些因素,我对强大的 AI 会如此快到来持乐观态度的原因之一是,如果你仅仅外推曲线上的接下来的几个点,我们正非常迅速地接近人类水平的能力,对吧?

[原文] [Dario Amodei]: Some of the new models that we developed, some reasoning models that have come from other companies, they're starting to get to what I would call the PhD or professional level, right?

[译文] [Dario Amodei]: 我们开发的一些新模型,以及其他公司推出的一些推理模型,已经开始达到我会称之为博士或专业人士的水平了,对吧?

[原文] [Dario Amodei]: If you look at their coding ability, the latest model we released, Sonnet 3.5, the new or updated version, it gets something like 50% on SWE-bench, and SWE-bench is an example of a bunch of professional, real world software engineering tasks.

[译文] [Dario Amodei]: 如果你看它们的编程能力,我们发布的最新模型 Sonnet 3.5(新版或更新版),在 SWE-bench 上达到了大约 50% 的分数,而 SWE-bench 是一系列专业、真实的软件工程任务的集合。

[原文] [Dario Amodei]: At the beginning of the year, I think the state of the art was 3 or 4%. So in 10 months we've gone from 3% to 50% on this task, and I think in another year, we'll probably be at 90%. I mean, I don't know, but might even be less than that.

[译文] [Dario Amodei]: 在今年年初,我认为最先进的水平只有 3% 或 4%。所以在 10 个月内,我们在这个任务上从 3% 提升到了 50%,我认为再过一年,我们可能会达到 90%。我是说,我不知道,但也可能用不了那么久。
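作为示意,可以把“年初约 3%、10 个月后约 50%”这两个数据点放在 logit 空间做线性外推(这只是众多外推方式中的一种草图,并非 Dario 实际使用的方法;分数取自上文):

```python
import math

def logit(p: float) -> float:
    """把概率映射到 logit 空间,便于在接近 0 或 1 时做线性外推。"""
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

# 上文提到的两个数据点:第 0 个月约 3%,第 10 个月约 50%
p0, p1 = 0.03, 0.50
slope = (logit(p1) - logit(p0)) / 10   # 每月的 logit 增量

# 在此基础上再外推 12 个月
p_next_year = sigmoid(logit(p1) + slope * 12)   # 约 0.98
```

按这种外推,一年后的得分约为 98%,与上文“可能达到 90%”的猜测方向一致;当然,基准临近饱和时这种简单外推未必可靠。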

[原文] [Dario Amodei]: So if we just continue to extrapolate this, right, in terms of skill that we have, I think if we extrapolate the straight curve, within a few years, we will get to these models being, you know, above the highest professional level in terms of humans.

[译文] [Dario Amodei]: 所以如果我们继续推演这一点,就我们拥有的技能而言,我认为如果我们按直线推演,几年之内,我们将达到这些模型在人类最高专业水平之上的程度。


章节 04:良性竞争论:以“力争上游”改变行业激励

📝 本节摘要

面对 Lex 关于如何战胜 OpenAI、Google 等竞争对手的提问,Dario 给出了一个反直觉的答案。他提出了“力争上游”(Race to the Top)的变革理论:Anthropic 的目标并非击败对手成为唯一的赢家,而是通过率先实施高成本的安全措施(如机械可解释性研究),改变行业的激励结构。当一家公司树立了负责任的榜样后,其他公司为了避免被视为“不负责任”,将被迫跟进。Dario 承认这会削弱 Anthropic 的短期竞争优势,但这能成功将整个 AI 领域的竞争方向从“向下比烂”扭转为“向上向善”。

[原文] [Lex Fridman]: So Anthropic has several competitors. It'd be interesting to get your sort of view of it all. OpenAI, Google, xAI, Meta. What does it take to win in the broad sense of win in this space?

[译文] [Lex Fridman]: Anthropic 有几个竞争对手。我很想听听你对这一切的看法。OpenAI、Google、xAI、Meta。在这个领域,要取得广义上的“赢”,需要具备什么?

[原文] [Dario Amodei]: Yeah, so I want to separate out a couple things, right? So, you know, Anthropic's mission is to kind of try to make this all go well, right? And you know, we have a theory of change called race to the top, right? Race to the top is about trying to push the other players to do the right thing by setting an example.

[译文] [Dario Amodei]: 是的,我想把几件事区分开来。你知道,Anthropic 的使命是试图让这一切(AI 发展)顺利进行,对吧?我们有一套变革理论叫做“力争上游”(race to the top),对吧?“力争上游”是指试图通过树立榜样,来推动其他参与者做正确的事情。

[原文] [Dario Amodei]: It's not about being the good guy, it's about setting things up so that all of us can be the good guy. I'll give a few examples of this. Early in the history of Anthropic, one of our co-founders, Chris Olah, who I believe you're interviewing soon, you know, he's the co-founder of the field of mechanistic interpretability, which is an attempt to understand what's going on inside AI models.

[译文] [Dario Amodei]: 这不是关于只有我们要当“好人”,而是关于建立一种机制,让我们所有人都能成为好人。我举几个例子。在 Anthropic 早期,我们的联合创始人之一 Chris Olah——我相信你很快会采访他——他是机械可解释性(mechanistic interpretability)领域的共同创始人,这是一种试图理解 AI 模型内部运作机制的尝试。

[原文] [Dario Amodei]: So we had him and one of our early teams focus on this area of interpretability, which we think is good for making models safe and transparent. For three or four years, that had no commercial application whatsoever. It still doesn't today.

[译文] [Dario Amodei]: 所以我们让他和我们的一个早期团队专注于可解释性这一领域,我们认为这对提高模型的安全性和透明度很有好处。但在三四年的时间里,这完全没有任何商业应用。甚至直到今天也还没有。

[原文] [Dario Amodei]: We're doing some early betas with it, and probably it will eventually, but you know, this is a very, very long research bed and one in which we've built in public and shared our results publicly. And we did this because, you know, we think it's a way to make models safer.

[译文] [Dario Amodei]: 我们正在用它做一些早期的测试版,可能最终会有用,但这其实是一个非常漫长的研究基础,而且我们是在公开构建并公开分享成果。我们之所以这样做,是因为我们认为这是让模型更安全的一种方式。

[原文] [Dario Amodei]: An interesting thing is that as we've done this, other companies have started doing it as well, in some cases because they've been inspired by it, in some cases because they're worried that, you know, if other companies are doing this to look more responsible, they wanna look more responsible too.

[译文] [Dario Amodei]: 有趣的是,当我们这样做之后,其他公司也开始这样做了。有些情况是因为他们受到了启发,有些情况则是因为他们担心,如果其他公司这样做是为了显得更负责任,那么他们也想显得更负责任。

[原文] [Dario Amodei]: No one wants to look like the irresponsible actor, and so they adopt this as well. When folks come to Anthropic, interpretability is often a draw, and I tell them, the other places you didn't go, tell them why you came here, and then you see soon that there's interpretability teams elsewhere as well.

[译文] [Dario Amodei]: 没有人想看起来像个不负责任的角色,所以他们也采用了这种做法。当人们来到 Anthropic 时,可解释性通常是一个吸引点。我会告诉他们,去跟那些你没去的公司说,告诉他们你为什么来这里。然后你很快就会发现,其他地方也建立了可解释性团队。

[原文] [Dario Amodei]: And in a way, that takes away our competitive advantage because it's like, oh, now others are doing it as well, but it's good for the broader system, and so we have to invent some new thing that we're doing that others aren't doing as well.

[译文] [Dario Amodei]: 在某种程度上,这夺走了我们的竞争优势,因为这就好像是,噢,现在其他人也在做这件事了。但这对于更广泛的系统是有好处的,所以我们必须再去发明一些我们在做、而别人尚未跟进的新事情。

[原文] [Dario Amodei]: And the hope is to basically bid up the importance of doing the right thing. And it's not about us in particular, right? It's not about having one particular good guy. Other companies can do this as well.

[译文] [Dario Amodei]: 我们的希望基本上是抬高“做正确之事”的重要性。这并非特指我们,对吧?不是关于只有一个特定的好人。其他公司也可以这样做。

[原文] [Dario Amodei]: If they join the race to do this, you know, that's the best news ever, right? It's just, it's about kind of shaping the incentives to point upward instead of shaping the incentives to point downward.

[译文] [Dario Amodei]: 如果他们加入这场竞赛来做这件事,那就是最好的消息,对吧?这只是关于塑造激励机制,使其指向上方(向善),而不是指向下方(比烂)。


章节 05:Claude 模型家族:Opus、Sonnet 与 Haiku 的定位

📝 本节摘要

Lex 提到了今年 Claude 系列的密集发布,包括 3 月份推出的 Claude 3 家族(Opus、Sonnet、Haiku)以及 7 月更新的 Sonnet 3.5。Dario 解释了这种“三件套”策略背后的逻辑:市场同时需要昂贵且强大的模型(用于复杂分析和创意写作)以及快速且廉价的模型(用于自动补全和后台任务)。他揭示了命名灵感来源于诗歌的长度:Haiku(俳句)短小精悍,Sonnet(十四行诗)篇幅适中,Opus(大作/乐章)则代表宏大的作品。Dario 还指出,新一代模型(如 Sonnet 3.5)的目标是移动“性能-成本”曲线,使其智能程度超越上一代的最强模型(Opus 3),同时保持中等模型的速度与成本。

[原文] [Lex Fridman]: Let's talk about the present. Let's talk about Claude. So this year, a lot has happened. In March, Claude 3, Opus, Sonnet, Haiku were released, then Claude 3.5 Sonnet in July, with an updated version just now released, and then also Claude 3.5 Haiku was released. Okay, can you explain the difference between Opus, Sonnet and Haiku, and how we should think about the different versions?

[译文] [Lex Fridman]: 让我们谈谈当下。让我们谈谈 Claude。今年发生了很多事情。3 月份,Claude 3 系列的 Opus、Sonnet 和 Haiku 发布了;然后 7 月发布了 Claude 3.5 Sonnet,刚刚又发布了一个更新版本;接着 Claude 3.5 Haiku 也发布了。好的,你能解释一下 Opus、Sonnet 和 Haiku 之间的区别吗?我们应该如何看待这些不同的版本?

[原文] [Dario Amodei]: Yeah, so let's go back to March when we first released these three models. So, you know, our thinking was, you know, different companies produce kind of large and small models, better and worse models. We felt that there was demand both for a really powerful model, you know, and you that might be a little bit slower that you'd have to pay more for, and also for fast, cheap models that are as smart as they can be for how fast and cheap, right?

[译文] [Dario Amodei]: 好的,让我们回溯到三月,也就是我们首次发布这三个模型的时候。我们的想法是,你知道,不同的公司都在生产大模型和小模型,好的模型和差一点的模型。我们觉得市场既需要一个真正强大的模型——哪怕它可能稍微慢一点,你也需要支付更多费用;同时也需要快速、廉价的模型,并且在该速度和成本下尽可能地聪明,对吧?

[原文] [Dario Amodei]: Whenever you wanna do some kind of like, you know, difficult analysis, like if, you know, I wanna write code, for instance, or you know, I wanna brainstorm ideas, or I wanna do creative writing, I want the really powerful model. But then there's a lot of practical applications in a business sense where it's like I'm interacting with a website.

[译文] [Dario Amodei]: 每当你想要做某种困难的分析,比如我想写代码,或者我想头脑风暴构思点子,或者我想进行创意写作时,我想要真正强大的模型。但也存在很多商业意义上的实际应用,比如我正在与网站进行交互。

[原文] [Dario Amodei]: You know, like, I'm like doing my taxes, or I'm, you know, talking to a, you know, to like a legal advisor and I want to analyze a contract or, you know, we have plenty of companies that are just like, you know, I wanna do auto complete on my IDE or something. And for all of those things, you want to act fast and you want to use the model very broadly.

[译文] [Dario Amodei]: 比如我在报税,或者我在和法律顾问交谈,我想分析一份合同;或者我们有很多公司客户只是想在他们的 IDE(集成开发环境)上做自动补全之类的功能。对于所有这些事情,你需要动作快,并且你需要非常广泛地使用该模型。

[原文] [Dario Amodei]: So we wanted to serve that whole spectrum of needs. So we ended up with this, you know, this kind of poetry theme. And so what's a really short poem? It's a haiku. And so Haiku is the small, fast, cheap model that was, you know, at the time it was released, surprisingly, surprisingly intelligent for how fast and cheap it was.

[译文] [Dario Amodei]: 所以我们想要满足这整个范围的需求。于是我们最终选定了这个诗歌主题。什么是非常短的诗?是俳句(Haiku)。所以 Haiku 是那个小巧、快速、便宜的模型,在当时发布时,考虑到它的速度和价格,它的智能程度令人惊讶,非常令人惊讶。

[原文] [Dario Amodei]: Sonnet is a medium sized poem, right, a couple paragraphs, and so Sonnet was the middle model. It is smarter but also a little bit slower, a little bit more expensive. And Opus, like a magnum opus is a large work, Opus was the largest, smartest model at the time. So that was the original kind of thinking behind it.

[译文] [Dario Amodei]: 十四行诗(Sonnet)是中等篇幅的诗,对吧,大概几段话,所以 Sonnet 是中间档的模型。它更聪明,但也稍微慢一点,稍微贵一点。而 Opus,就像代表作(magnum opus)是一部宏大的作品一样,Opus 是当时最大、最聪明的模型。这就是最初的构思逻辑。

[原文] [Dario Amodei]: And our thinking then was, well, each new generation of models should shift that trade-off curve. So when we released Sonnet 3.5, it has the same, roughly the same, you know, cost and speed as the Sonnet 3 model. But it increased its intelligence to the point where it was smarter than the original Opus 3 model, especially for code but also just in general.

[译文] [Dario Amodei]: 我们当时的想法是,每一代新模型都应该移动那个权衡曲线。所以当我们发布 Sonnet 3.5 时,它的成本和速度与 Sonnet 3 模型大致相同。但它的智能程度提高到了超越最初的 Opus 3 模型的水准,特别是在代码方面,在整体表现上也是如此。

[原文] [Dario Amodei]: And so now, you know, we've shown results for Haiku 3.5, and I believe Haiku 3.5, the smallest new model is about as good as Opus 3, the largest old model. So basically the aim here is to shift the curve, and then at some point, there's gonna be an Opus 3.5.

[译文] [Dario Amodei]: 现在,你知道,我们已经展示了 Haiku 3.5 的结果,我相信 Haiku 3.5——这个最小的新模型——大约和 Opus 3——那个最大的旧模型——一样好。所以基本上我们的目标就是移动这条曲线,然后在某个时间点,会推出 Opus 3.5。

[原文] [Dario Amodei]: Now, every new generation of models has its own thing. They use new data, their personality changes in ways that we kind of, you know, try to steer but are not fully able to steer. And so there's never quite that exact equivalence where the only thing you're changing is intelligence.

[译文] [Dario Amodei]: 每一代新模型都有它自己的特点。它们使用新数据,它们的性格会发生变化,这种变化我们在某种程度上试图引导,但无法完全掌控。所以永远不会存在那种完全的等价替换,即你唯一改变的仅仅是智力。

[原文] [Dario Amodei]: We always try and improve other things, and some things change without us knowing or measuring. So it's very much an inexact science. In many ways, the manner and personality of these models is more an art than it is a science.

[译文] [Dario Amodei]: 我们总是试图改进其他方面,有些东西在我们不知道或未测量的情况下就变了。所以这非常像是一门不精确的科学。在许多方面,这些模型的举止和性格更像是一门艺术,而不是科学。


章节 06:大模型训练揭秘:从预训练到安全测试

📝 本节摘要

在本节中,Lex 询问了模型版本迭代之间(例如从 Claude 3.0 到 3.5)的时间究竟花在哪里。Dario 详细拆解了模型诞生的全生命周期:首先是耗时数月、动用数万张 GPU 的预训练(Pre-training);接着是日益庞大且不够精确的后训练(Post-training)阶段,包括 RLHF 和其他强化学习方法。随后是严格的安全测试,包括与美英 AI 安全研究所合作进行的 CBRN(生化核放)风险评估。Dario 将这一过程类比为制造飞机:既要追求极致的安全严谨,又要追求生产流程的自动化与流线型。他还强调,外界眼中的“尤里卡时刻”往往是由无数“超级无聊的细节”和扎实的软件/性能工程堆砌而成的。

[原文] [Lex Fridman]: So what is sort of the reason for the span of time between, say, Claude Opus 3.0 and 3.5? What takes that time? If you can speak to that.

[译文] [Lex Fridman]: 那么,Claude Opus 3.0 和 3.5 之间的时间跨度大概是什么原因造成的?是什么花费了这些时间?如果你能谈谈的话。

[原文] [Dario Amodei]: Yeah, so there's different processes. There's pre-training, which is, you know, just kind of the normal language model training, and that takes a very long time. That uses, you know, these days, you know, tens of thousands, sometimes many tens of thousands of GPUs or TPUs or Trainium, or you know, we use different platforms, but, you know, accelerator chips, often training for months.

[译文] [Dario Amodei]: 是的,这包含不同的流程。首先是预训练(pre-training),也就是通常的语言模型训练,这需要很长时间。如今这通常需要数万个,有时甚至是好几万个 GPU 或 TPU 或 Trainium——我们使用不同的平台,但总之是加速芯片——通常要训练好几个月。

[原文] [Dario Amodei]: There's then a kind of post-training phase where we do reinforcement learning from human feedback, as well as other kinds of reinforcement learning. That phase is getting larger and larger now, and, you know, often, that's less of an exact science. It often takes effort to get it right.

[译文] [Dario Amodei]: 然后是一个后训练(post-training)阶段,我们会进行基于人类反馈的强化学习(RLHF),以及其他类型的强化学习。这个阶段现在变得越来越庞大,而且通常来说,这不像是一门精确的科学。往往需要付出努力才能把它做对。

[原文] [Dario Amodei]: Models are then tested with some of our early partners to see how good they are, and they're then tested both internally and externally for their safety, particularly for catastrophic and autonomy risks. So we do internal testing according to our responsible scaling policy, which I, you know, could talk more about that in detail.

[译文] [Dario Amodei]: 模型随后会与我们的一些早期合作伙伴一起进行测试,看看它们表现如何;然后会在内部和外部进行安全测试,特别是针对灾难性风险和自主性风险。我们会根据我们的“负责任扩展政策”(Responsible Scaling Policy)进行内部测试,关于这一点我可以详细多谈谈。

[原文] [Dario Amodei]: And then we have an agreement with the US and the UK AI Safety Institute, as well as other third party testers in specific domains to test the models for what are called CBRN risks, chemical, biological, radiological and nuclear, which are, you know, we don't think that models pose these risks seriously yet, but every new model, we wanna evaluate to see if we're starting to get close to some of these more dangerous capabilities.

[译文] [Dario Amodei]: 此外,我们与美国和英国的 AI 安全研究所(AI Safety Institute)以及特定领域的其他第三方测试机构达成协议,测试模型的所谓 CBRN 风险——即化学、生物、放射性和核风险。虽然我们认为目前模型尚未严重构成这些风险,但对于每个新模型,我们都想评估一下我们是否开始接近这些更危险的能力。

[原文] [Dario Amodei]: So those are the phases. And then, you know, then it just takes some time to get the model working in terms of inference and launching it in the API. So there's just a lot of steps to actually making a model work.

[译文] [Dario Amodei]: 这就是各个阶段。然后,还需要一些时间让模型在推理(inference)方面运作起来,并在 API 中发布。所以,要真正让一个模型运作起来,确实有很多步骤。

[原文] [Dario Amodei]: And of course, you know, we're always trying to make the processes as streamlined as possible, right? We want our safety testing to be rigorous, but we want it to be rigorous and to be, you know, to be automatic, to happen as fast as it can without compromising on rigor.

[译文] [Dario Amodei]: 当然,我们一直在努力让流程尽可能简化,对吧?我们希望安全测试既严谨,又能自动化,在不牺牲严谨性的前提下尽可能快地完成。

[原文] [Dario Amodei]: Same with our pre-training process and our post-training process. So, you know, it's just like building anything else. It's just like building airplanes. You want to make them, you know, you want to make them safe, but you want to make the process streamlined. And I think the creative tension between those is, you know, is an important thing in making the models work.

[译文] [Dario Amodei]: 我们的预训练流程和后训练流程也是如此。所以,这就像制造其他任何东西一样。就像制造飞机。你想让它们安全,但也想让流程简化。我认为这两者之间这种创造性的张力,是让模型成功运作的重要因素。

[原文] [Lex Fridman]: Yeah, rumor on the street, I forget who was saying that Anthropic has really good tooling, so probably a lot of the challenge here on the software engineering side is to build the tooling to have like an efficient, low friction interaction with the infrastructure.

[译文] [Lex Fridman]: 是的,坊间传闻——我忘了是谁说的了——Anthropic 拥有非常好的工具(tooling),所以这里软件工程方面的一大挑战可能就是构建这些工具,以便与基础设施进行高效、低摩擦的交互。

[原文] [Dario Amodei]: You would be surprised how much of the challenges of, you know, building these models comes down to, you know, software engineering, performance engineering, you know. From the outside you might think, oh, man, we had this eureka breakthrough, right?

[译文] [Dario Amodei]: 你会惊讶地发现,构建这些模型的挑战中有多少归结为软件工程、性能工程。从外界看,你可能会想,噢天哪,我们取得了一个“尤里卡”式的突破(eureka breakthrough),对吧?

[原文] [Dario Amodei]: You know, this movie with the science, we discovered it, we figured it out. But I think all things, even, you know, incredible discoveries, like, they almost always come down to the details, and often super, super boring details.

[译文] [Dario Amodei]: 就像电影里的科学情节一样,我们发现了它,我们搞定了它。但我认为所有事情,即使是不可思议的发现,几乎总是归结为细节,而且往往是超级、超级无聊的细节。

[原文] [Dario Amodei]: I can't speak to whether we have better tooling than other companies. I mean, you know, haven't been at those other companies, at least not recently, but it's certainly something we give a lot of attention to.

[译文] [Dario Amodei]: 我无法断言我们的工具是否比其他公司更好。我的意思是,我没去过其他那些公司,至少最近没去过,但这肯定是我们非常关注的事情。

[原文] [Lex Fridman]: I don't know if you can say, but from three, from Claude 3 to Claude 3.5, is there any extra pre-training going on or is it mostly focused on the post-training? There's been leaps in performance.

[译文] [Lex Fridman]: 我不知道你能不能说,但从 Claude 3 到 Claude 3.5,是有进行额外的预训练,还是主要集中在后训练上?毕竟性能有了飞跃。

[原文] [Dario Amodei]: Yeah, I think at any given stage, we're focused on improving everything at once. Just naturally, like there are different teams, each team makes progress in a particular area, in making a particular, you know, their particular segment of the relay race better. And it's just natural that when we make a new model, we put all of these things in at once.

[译文] [Dario Amodei]: 是的,我认为在任何特定阶段,我们都专注于同时改进所有方面。很自然地,会有不同的团队,每个团队都在特定领域取得进展,让接力赛中属于他们的那一段跑得更好。当我们制作一个新模型时,自然而然地会将所有这些改进同时放入进去。

[原文] [Lex Fridman]: So, the data you have, like the preference data you get from RLHF, is that applicable? Is there a way to apply it to newer models as they get trained up?

[译文] [Lex Fridman]: 那么,你们拥有的数据,比如从 RLHF 获得的偏好数据,还适用吗?有没有办法在训练新模型时应用它?

[原文] [Dario Amodei]: Yeah, preference data from old models sometimes gets used for new models, although, of course, it performs somewhat better when it's, you know, trained on, it's trained on the new models.

[译文] [Dario Amodei]: 是的,旧模型的偏好数据有时会被用于新模型,尽管当然,如果是针对新模型训练的数据,效果会好一些。

[原文] [Dario Amodei]: Note that we have this, you know, Constitutional AI method such that we don't only use preference data, we kind of, there's also a post-training process where we train the model against itself and there's, you know, new types of post-training the model against itself that are used every day. So it's not just RLHF, it's a bunch of other methods as well. Post-training, I think, you know, is becoming more and more sophisticated.

[译文] [Dario Amodei]: 请注意,我们有这种宪法式 AI(Constitutional AI)方法,因此我们不仅仅使用偏好数据,还有一个后训练过程,我们会让模型进行自我对抗训练(train the model against itself),而且每天都在使用这种新型的自我对抗后训练方法。所以不仅仅是 RLHF,还有很多其他方法。我认为后训练正在变得越来越复杂精妙。
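Dario 提到的"让模型自我对抗训练"大致可以用下面这段示意性的 Python 草图来理解:先生成回答,再依据某条"宪法原则"让模型自我批评,最后改写,并把(原始回答, 改写回答)构成一条偏好数据。需要强调,其中的 `model` 只是一个占位桩函数,`PRINCIPLES`、`constitutional_revision` 等名称均为虚构示意,并非 Anthropic 的真实实现。

```python
# 宪法式 AI(Constitutional AI)风格修订循环的极简示意,非真实实现。
PRINCIPLES = [
    "Choose the response that is more helpful, honest, and harmless.",
]

def model(prompt: str) -> str:
    # 占位桩:代替真实的 LLM API 调用,仅为让示例可以端到端运行。
    if "Rewrite" in prompt:
        return "Here is a clearer, more direct answer."
    if "Critique" in prompt:
        return "The response could be more direct."
    return "Certainly! Certainly! Here is a verbose answer."

def constitutional_revision(user_prompt: str):
    """生成 -> 依据原则自我批评 -> 改写;返回 (原始回答, 改写回答)。"""
    draft = model(user_prompt)
    critique = model(
        f"Critique this response using the principle '{PRINCIPLES[0]}': {draft}"
    )
    revised = model(
        f"Rewrite the response to address this critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    return draft, revised

draft, revised = constitutional_revision("Explain what RLHF is.")
# 改写后的回答作为 "chosen",原始回答作为 "rejected",构成一条偏好数据
preference_pair = {
    "prompt": "Explain what RLHF is.",
    "rejected": draft,
    "chosen": revised,
}
```

在真实流程中,`model` 会替换为实际的 LLM 调用,这类由模型自我批评产生的偏好对随后可用于训练奖励模型或直接做偏好优化,从而减少对人工标注偏好数据的依赖。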


章节 07:Sonnet 3.5 的编程能力飞跃与基准测试

📝 本节摘要

本节对话主要围绕 Claude 3.5 Sonnet 在编程能力上的显著提升展开。Lex 提到他在使用 Cursor 编程时切身感受到了模型的进步。Dario 证实了这一点,并分享了 Anthropic 内部的一则轶事:公司内最顶尖的工程师们首次认为 Sonnet 3.5 是真正能节省时间的代码工具,而此前的模型往往对新手更有用。

>

随后,Dario 深入解析了 SWE-bench(软件工程基准测试),这是一个基于真实 GitHub 拉取请求(Pull Requests)的高难度测试。他指出,模型在该任务上的成功率从年初的 3% 飙升至现在的 50%。他预测,一旦该数值达到 90%-95%,将标志着 AI 能够自主完成大部分软件工程任务,这代表了编程领域的一次重大范式转移。

[原文] [Lex Fridman]: Well, what explains the big leap in performance for the new Sonnet 3.5? I mean, at least in the programming side. And maybe this is a good place to talk about benchmarks. What does it mean to get better? Just the number went up, but, you know, I program, but I also love programming and Claude 3.5 through Cursor is what I use to assist me in programming. And there was, at least experientially, anecdotally, it's gotten smarter at programming. So like, what does it take to get it smarter?

[译文] [Lex Fridman]: 那么,是什么解释了新版 Sonnet 3.5 在性能上的巨大飞跃?我是说,至少在编程方面。也许这是谈论基准测试(benchmarks)的好时机。变得“更好”到底意味着什么?仅仅是数字上升了吗?你知道,我编程,我也热爱编程,我通过 Cursor 使用 Claude 3.5 来辅助我编程。至少从体验上、从轶事证据来看,它在编程方面确实变聪明了。所以,要让它变聪明需要付出什么?

[原文] [Dario Amodei]: We observed that as well, by the way. There were a couple very strong engineers here at Anthropic for whom all previous code models, both produced by us and produced by all the other companies, hadn't really been useful. You know, they said, you know, maybe this is useful to a beginner, it's not useful to me. But Sonnet 3.5, the original one, for the first time they said, "Oh my God, this helped me with something that, you know, that it would've taken me hours to do. This is the first model that's actually saved me time."

[译文] [Dario Amodei]: 顺便说一句,我们也观察到了这一点。Anthropic 有几位非常厉害的工程师,之前所有的代码模型——无论是我们生产的还是其他公司生产的——对他们来说都没什么用。你知道,他们会说,也许这东西对初学者有用,但对我没用。但是 Sonnet 3.5,最初的那个版本,让他们第一次惊呼:"天哪,这东西帮我解决了一个本来要花我好几个小时才能搞定的问题。这是第一个真正帮我节省了时间的模型。"

[原文] [Dario Amodei]: So again, the waterline is rising. And then I think, you know, the new Sonnet has been even better. In terms of what it takes, I mean, I'll just say it's been across the board. It's in the pre-training, it's in the post-training, it's in various evaluations that we do. We've observed this as well. And if we go into the details of the benchmark, so SWE-bench is basically, since you're a programmer, you know, you'll be familiar with like pull requests, and, you know, pull requests are like a sort of atomic unit of work.

[译文] [Dario Amodei]: 所以再说一次,水位正在上升。而且我认为,新版的 Sonnet 甚至表现得更好。至于这需要付出什么,我想说这是全方位的。包括预训练、后训练,以及我们进行的各种评估。我们也观察到了这一点。如果我们深入探讨基准测试的细节,比如 SWE-bench,既然你是程序员,你肯定熟悉拉取请求(pull requests),拉取请求就像是某种原子级的工作单元。

[原文] [Dario Amodei]: You know, you could say, you know, I'm implementing one thing. And SWE-bench actually gives you kind of a real world situation where the code base is in a current state and I'm trying to implement something that's, you know, that's described in language. We have internal benchmarks where we measure the same thing and you say, just give the model free rein to like, you know, do anything, run anything, edit anything. How well is it able to complete these tasks?

[译文] [Dario Amodei]: 你可以说,我在实现某一个功能。SWE-bench 实际上给你提供了一种真实世界的场景:代码库处于某种当前状态,而我试图实现某个用自然语言描述的东西。我们有内部基准测试也在测量同样的事情,你只要给模型自由发挥的空间,比如做任何事、运行任何东西、编辑任何内容。它完成这些任务的能力有多好?

[原文] [Dario Amodei]: And it's that benchmark that's gone from it can do it 3% of the time to it can do it about 50% of the time. So I actually do believe that, you can game benchmarks, but I think if we get to 100% on that benchmark in a way that isn't kind of like overtrained or gamed for that particular benchmark, it probably represents a real and serious increase in kind of programming ability.

[译文] [Dario Amodei]: 正是在这个基准测试上,成功率从之前的 3% 提升到了现在的 50% 左右。所以我真的相信,如果我们在那个基准测试上达到 100%——当然你可以通过作弊(game)来刷榜,但我指的是以一种非过度训练或非针对性作弊的方式达到 100%——那可能代表了编程能力上一次真实且严肃的提升。

[原文] [Dario Amodei]: And I would suspect that if we can get to, you know, 90, 95% that, you know, it will represent ability to autonomously do a significant fraction of software engineering tasks.

[译文] [Dario Amodei]: 而且我怀疑,如果我们能达到 90% 或 95%,那将代表它具备了自主完成很大一部分软件工程任务的能力。
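Dario 描述的 SWE-bench 式评测流程——给模型自由操作空间去修复真实仓库中的问题,由任务自带的测试套件判定成败,再统计通过率——可以用下面这段示意性 Python 草图来理解。其中的任务数据以及 `attempt_fix`、`run_test_suite` 都是虚构的占位实现,真实评测会让智能体在真实代码仓库中编辑文件并运行单元测试。

```python
# SWE-bench 风格通过率统计的玩具示意,非真实评测框架。
def attempt_fix(issue: str) -> str:
    # 占位桩:代替"给模型自由发挥空间"的智能体循环
    # (真实场景中模型可以编辑文件、运行命令)。
    return "patch for: " + issue

def run_test_suite(task: dict, patch: str) -> bool:
    # 占位桩:代替在打了补丁的仓库上运行该任务的隐藏测试套件。
    return task["expected_keyword"] in patch

tasks = [
    {"issue": "fix off-by-one in pagination", "expected_keyword": "pagination"},
    {"issue": "support unicode filenames",    "expected_keyword": "unicode"},
    {"issue": "speed up CSV parsing",         "expected_keyword": "regex"},
]

# 逐任务尝试修复,用测试套件判定 pass/fail,最后汇总为通过率
resolved = sum(run_test_suite(t, attempt_fix(t["issue"])) for t in tasks)
pass_rate = resolved / len(tasks)
print(f"resolved {resolved}/{len(tasks)} -> pass rate {pass_rate:.0%}")
```

Dario 提到的"3% 升至 50%"指的就是这里的 `pass_rate`:随着模型能力提升,能通过各任务测试套件的比例不断上升。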


章节 08:发布节奏、命名困境与“Opus 3.5”的时间表

📝 本节摘要

Lex 追问备受期待的 Claude 3.5 Opus 的发布时间,Dario 幽默地将其与游戏界的“跳票之王”《永远的毁灭公爵》和《GTA 6》作比,暗示了高预期下的不确定性,但也确认了该模型仍在计划中。随后,两人深入探讨了 AI 模型的命名哲学。Dario 坦言,与传统软件不同,AI 模型的版本号(如 3.5 vs 4.0)很难精确反映模型在预训练效率、推理成本和智能水平之间的复杂权衡。他承认目前的命名体系(Haiku/Sonnet/Opus)虽然初衷良好,但在快速迭代的技术现实面前,所有 AI 公司都在艰难探索如何准确定义“新一代”模型。

[原文] [Lex Fridman]: Well, ridiculous timeline question. When is Claude Opus 3.5 coming out?

[译文] [Lex Fridman]: 问一个有点离谱的时间线问题。Claude Opus 3.5 什么时候出?

[原文] [Dario Amodei]: Not giving you an exact date, but you know, there, you know, as far as we know, the plan is still to have a Claude 3.5 Opus.

[译文] [Dario Amodei]: 我不会给你一个确切的日期,但你知道,据我们所知,计划中仍然会有 Claude 3.5 Opus。

[原文] [Lex Fridman]: Are we gonna get it before "GTA 6" or no?

[译文] [Lex Fridman]: 我们会在《GTA 6》之前看到它吗?

[原文] [Dario Amodei]: Like "Duke Nukem Forever."

[译文] [Dario Amodei]: 就像《永远的毁灭公爵》(Duke Nukem Forever)那样。

[原文] [Lex Fridman]: "Duke Nukem-"

[译文] [Lex Fridman]: 《毁灭公爵》……

[原文] [Dario Amodei]: What was that game? There was some game that was delayed 15 years.

[译文] [Dario Amodei]: 那个游戏叫什么来着?有一个游戏跳票了 15 年。

[原文] [Lex Fridman]: That's right. Was that "Duke Nukem Forever?"

[译文] [Lex Fridman]: 没错。是《永远的毁灭公爵》吗?

[原文] [Lex Fridman]: Yeah. And I think "GTA" is now just releasing trailers.

[译文] [Lex Fridman]: 是的。而且我觉得《GTA》现在才刚发布预告片。

[原文] [Dario Amodei]: You know, it's only been three months since we released the first Sonnet.

[译文] [Dario Amodei]: 你知道,距离我们发布第一个 Sonnet 版本其实才过了三个月。

[原文] [Lex Fridman]: Yeah, it's the incredible pace of release.

[译文] [Lex Fridman]: 是的,这是令人难以置信的发布节奏。

[原文] [Dario Amodei]: It just tells you about the pace, the expectations for when things are gonna come out.

[译文] [Dario Amodei]: 这恰恰说明了节奏之快,以及人们对新产品问世的期望值。

[原文] [Lex Fridman]: So what about 4.0? So how do you think, sort of as these models get bigger and bigger, about versioning, and also just versioning in general? Why was Sonnet 3.5 updated with the date? Why not Sonnet 3.6, which a lot of people are calling it?

[译文] [Lex Fridman]: 那么 4.0 呢?随着这些模型变得越来越大,你是如何考虑版本控制的?还有总体上的版本命名,为什么 Sonnet 3.5 是带日期更新的?为什么不叫 Sonnet 3.6?很多人都这么叫它。

[原文] [Dario Amodei]: Yeah, naming is actually an interesting challenge here, right? Because I think a year ago, most of the model was pre-training, and so you could start from the beginning and just say, okay, we're gonna have models of different sizes, we're gonna train them all together and you know, we'll have a family of naming schemes and then we'll put some new magic into them and then, you know, we'll have the next generation.

[译文] [Dario Amodei]: 是的,命名在这里实际上是一个有趣的挑战,对吧?因为我认为一年前,模型的大部分工作都在预训练上,所以你可以从头开始说,好吧,我们会有不同大小的模型,我们会把它们放在一起训练,我们会有一套家族式的命名方案,然后我们会给它们注入一些新的魔法,接着就会有下一代。

[原文] [Dario Amodei]: The trouble starts already when some of them take a lot longer than others to train, right? That already messes up your time a little bit.

[译文] [Dario Amodei]: 麻烦首先在于,其中一些模型的训练时间要比其他模型长得多,对吧?这已经把你的时间表搞得有点乱了。

[原文] [Dario Amodei]: But as you make big improvements in pre-training, then you suddenly notice, oh, I can make a better pre-trained model and that doesn't take very long to do, but, you know, clearly it has the same, you know, size and shape as previous models.

[译文] [Dario Amodei]: 但当你在预训练方面取得重大改进时,你会突然发现,噢,我可以制作更好的预训练模型,而且这不需要很长时间,但很明显,它具有与以前的模型相同的规模和形态。

[原文] [Dario Amodei]: So I think those two together as well as the timing issues, any kind of scheme you come up with, you know, the reality tends to kind of frustrate that scheme, right? Tends to kind of break out of the scheme. It's not like software where you can say, oh, this is like, you know, 3.7, this is 3.8.

[译文] [Dario Amodei]: 所以我认为这两个因素加上时间问题,无论你想出什么样的方案,现实往往会挫败该方案,对吧?现实往往会打破这种方案。它不像软件,你可以说,噢,这是 3.7,这是 3.8。

[原文] [Dario Amodei]: No, you have models with different trade-offs. You can change some things in your models, you can train, you can change other things. Some are faster and slower at inference, some have to be more expensive, some have to be less expensive. And so I think all the companies have struggled with this.

[译文] [Dario Amodei]: 不,你的模型有不同的权衡(trade-offs)。你可以改变模型中的某些东西,你可以训练,可以改变其他东西。有些在推理时更快或更慢,有些必须更贵,有些必须更便宜。所以我认为所有的公司都在这方面挣扎。

[原文] [Dario Amodei]: I think we did very, you know, I think we were in a good position in terms of naming when we had Haiku, Sonnet and Opus.

[译文] [Dario Amodei]: 我认为我们在拥有 Haiku、Sonnet 和 Opus 时,在命名方面处于一个很好的位置。

[原文] [Lex Fridman]: It was great, great start.

[译文] [Lex Fridman]: 那是很棒的,很棒的开端。

[原文] [Dario Amodei]: We're trying to maintain it, but it's not perfect, so we'll try and get back to the simplicity, but just the nature of the field, I feel like no one's figured out naming. It's somehow a different paradigm from like normal software and so we just, none of the companies have been perfect at it.

[译文] [Dario Amodei]: 我们正试图维持它,但这并不完美,所以我们会尝试回归简洁。但就这个领域的本质而言,我觉得还没人真正搞懂命名这件事。它在某种程度上是与普通软件不同的范式,所以我们只是……没有一家公司在这方面做得完美。

[原文] [Dario Amodei]: It's something we struggle with surprisingly much relative to, you know, how trivial it is compared to the grand science of training the models.

[译文] [Dario Amodei]: 相比于训练模型这一宏大科学而言,这件事本来应该很琐碎,但令人惊讶的是,我们在命名这件事上纠结了很久。


章节 09:用户心理学:为什么感觉模型“变笨了”?

📝 本节摘要

针对 Reddit 上热议的“Claude 随着时间推移变笨了”的观点,Dario 表示这几乎是所有前沿模型(包括 GPT-4)都会面临的用户投诉。他澄清道,除非发布新版本,否则模型的权重(Weights)绝对不会改变,因为随机替换权重在技术上极具风险且难以控制。他指出了两个可能导致该错觉的真实因素:一是模型对提示词措辞(Phrasing)极其敏感,细微差异可能导致输出质量波动;二是心理预期阈值(Psychological Baseline)的提升——用户在习惯了强大的功能后,会更多地关注缺陷而非亮点。Lex 将此比作人们对“飞机 Wi-Fi”态度的转变:从最初的惊叹变成了对速度慢的抱怨。

[原文] [Lex Fridman]: I gotta ask you a question from Reddit.

[译文] [Lex Fridman]: 我得问你一个来自 Reddit 的问题。

[原文] [Dario Amodei]: From Reddit? Oh, boy. (laughs)

[译文] [Dario Amodei]: 来自 Reddit?噢,天哪。(笑)

[原文] [Lex Fridman]: You know, there's just this fascinating, to me at least, psychological social phenomenon where people report that Claude has gotten dumber for them over time. And so the question is, does the user complaint about the dumbing down of Claude 3.5 Sonnet hold any water? So are these anecdotal reports a kind of social phenomenon, or are there any cases where Claude would get dumber?

[译文] [Lex Fridman]: 你知道,有一种——至少对我来说——非常迷人的心理社会现象,就是人们报告说 Claude 随着时间的推移变得越来越笨了。所以问题是,用户关于 Claude 3.5 Sonnet “降智”(dumbing down)的抱怨站得住脚吗?这些轶事报告是一种社会现象,还是 Claude 真的变笨了?有没有 Claude 会变笨的情况?

[原文] [Dario Amodei]: So this actually isn't just about Claude. I believe I've seen these complaints for every foundation model produced by a major company. People said this about GPT-4, they said it about GPT-4 Turbo. So, a couple things. One, the actual weights of the model, right, the actual brain of the model, that does not change unless we introduce a new model.

[译文] [Dario Amodei]: 这其实不只适用于 Claude,这不仅仅是关于 Claude 的事。我相信我看到过针对每一家大公司生产的每一个基础模型的这类抱怨。人们对 GPT-4 这么说过,对 GPT-4 Turbo 也这么说过。所以,有几点。第一,模型的实际权重(weights),也就是模型真正的大脑,除非我们推出新模型,否则它是不会改变的。

[原文] [Dario Amodei]: There are just a number of reasons why it would not make sense practically to be randomly substituting in new versions of the model. It's difficult from an inference perspective and it's actually hard to control all the consequences of changing the weight of the model.

[译文] [Dario Amodei]: 有很多原因表明,在实际操作中随机替换新版本的模型是毫无意义的。从推理(inference)的角度来看这很难,而且实际上很难控制改变模型权重所带来的所有后果。

[原文] [Dario Amodei]: Let's say you wanted to fine tune the model to be like, I don't know, to like to say "certainly" less, which, you know, an old version of Sonnet used to do. You actually end up changing 100 things as well. So we have a whole process for it, and we have a whole process for modifying the model. We do a bunch of testing on it, we do a bunch of user testing and early customers. So we have never changed the weights of the model without telling anyone, and certainly in the current setup, it would not make sense to do that.

[译文] [Dario Amodei]: 比方说你想微调模型,让它——我不知道——比如让它少说“当然(certainly)”,旧版 Sonnet 以前很爱说这个词。结果是你实际上会同时改变另外 100 件事。所以我们对此有一整套流程,我们有一整套修改模型的流程。我们会进行大量测试,进行大量用户测试和早期客户测试。所以,我们从未在不告知任何人的情况下更改模型权重,而且——当然在目前的设置下——这样做也是毫无意义的。

[原文] [Dario Amodei]: Now, there are a couple things that we do occasionally do. One is sometimes we run A/B tests, but those are typically very close to when a model is being released and for a very small fraction of time. So, you know, like, the day before the new Sonnet 3.5. I agree, we should have had a better name. It's clunky to refer to it. There were some comments from people that like it's gotten a lot better, and that's because, you know, a fraction were exposed to an A/B test for those one or two days.

[译文] [Dario Amodei]: 不过,有几件事我们偶尔确实会做。一是我们有时会运行 A/B 测试,但这通常是在模型即将发布时进行的,而且时间非常短。比如,在新版 Sonnet 3.5 发布的前一天。我同意,我们应该起个更好的名字,提到它时很拗口。当时有些评论说它变得好多了,那是因为有一小部分人在那一两天里接触到了 A/B 测试。

[原文] [Dario Amodei]: The other is that occasionally, the system prompt will change. The system prompt can have some effects, although it's unlikely to dumb down models. It's unlikely to make them dumber. And we've seen that while these two things, which I'm listing to be very complete, happen quite infrequently, the complaints, for us and for other model companies, about the model changing, the model isn't good at this, the model got more censored, the model was dumbed down, those complaints are constant.

[译文] [Dario Amodei]: 另一件事是偶尔系统提示词(system prompt)会改变。系统提示词可能会产生一些影响,尽管它不太可能让模型“降智”。不太可能让它们变笨。我们看到,虽然这两件事(为了严谨我才列出这两点)发生的频率相对较低,非常不频繁,但关于模型变了、模型这方面不行了、模型被审查得更严了、模型变笨了的抱怨——无论对我们还是对其他模型公司——却是持续不断的。

[原文] [Dario Amodei]: And so I don't wanna say like people are imagining it or anything, but like the models are for the most part not changing. If I were to offer a theory, I think it actually relates to one of the things I said before, which is that models are very complex and have many aspects to them. And so often, you know, if I ask the model a question, you know, if I'm like, "Do task X" versus "Can you do task X?" the model might respond in different ways.

[译文] [Dario Amodei]: 所以我不想说这是人们的凭空想象或其他什么,但在绝大多数情况下,模型并没有改变。如果要我提供一个理论,我认为这实际上与我之前说的一件事有关,那就是模型非常复杂,有很多方面。所以通常,如果我问模型一个问题,比如我说“做任务 X”和“你能做任务 X 吗?”,模型可能会以不同的方式回应。

[原文] [Dario Amodei]: And so there are all kinds of subtle things that you can change about the way you interact with the model that can give you very different results. To be clear, this itself is like a failing by us and by the other model providers that the models are just often sensitive to like small changes in wording. It's yet another way in which the science of how these models work is very poorly developed.

[译文] [Dario Amodei]: 所以,你在与模型交互的方式上做出各种微妙的改变,都可能导致截然不同的结果。需要明确的是,这本身就是我们和其他模型提供商的一种失败,因为模型往往对措辞上的微小变化过于敏感。这也是这些模型运作背后的科学还非常不成熟的另一种表现。

[原文] [Dario Amodei]: And so, you know, if I go to sleep one night and I was like talking to the model in a certain way and I like slightly changed the phrasing of how I talk to the model, you know, I could get different results. So that's one possible way. The other thing is, man, it's just hard to quantify this stuff. It's hard to quantify this stuff. I think people are very excited by new models when they come out and then as time goes on, they become very aware of the limitations, so that may be another effect.

[译文] [Dario Amodei]: 所以,如果我某天晚上睡觉前用某种方式和模型说话,然后稍微改变了一下与之交谈的措辞,我可能会得到不同的结果。这是这种现象的一种可能途径。另一件事是,天哪,这东西很难量化。这东西很难量化。我认为人们在新模型刚出来时非常兴奋,然后随着时间推移,他们开始非常敏锐地意识到局限性,所以这可能是另一个影响因素。

[原文] [Dario Amodei]: But that's all a very long-winded way of saying for the most part, with some fairly narrow exceptions, the models are not changing.

[译文] [Dario Amodei]: 但说了这么多,其实就是想说:在绝大多数情况下,除了一些相当狭窄的例外,模型并没有改变。

[原文] [Lex Fridman]: I think there is a psychological effect. You just start getting used to it. The baseline rises. Like when people first got wifi on airplanes, it's like amazing, magic.

[译文] [Lex Fridman]: 我认为这确实存在一种心理效应。你只是开始习惯它了。基准线提高了。就像人们第一次在飞机上用到 Wi-Fi 时,感觉那是惊人的、魔法般的。

[原文] [Dario Amodei]: It's like amazing, yeah.

[译文] [Dario Amodei]: 是的,感觉很神奇。

[原文] [Lex Fridman]: And then-

[译文] [Lex Fridman]: 然后——

[原文] [Dario Amodei]: And now I'm like, I can't get this thing to work. This is such a piece of crap.

[译文] [Dario Amodei]: 而现在我会说,我连这破玩意儿都连不上。这简直是一坨垃圾。

[原文] [Lex Fridman]: Exactly, so then it's easy to have the conspiracy theory of they're making wifi slower and slower.

[译文] [Lex Fridman]: 没错,所以这就很容易产生一种阴谋论,认为他们故意把 Wi-Fi 搞得越来越慢。


章节 10:模型性格调优:回应“清教徒式祖母”的批评

📝 本节摘要

本节对话从一个犀利的 Reddit 用户提问开始:Claude 何时能不再像个“清教徒式的祖母”那样对用户进行道德说教?Dario 对此进行了深入回应。首先,他指出社交媒体上的愤怒声音往往属于“少数派”,与大多数用户关心的核心功能(如代码生成)存在偏差。其次,他解释了技术上的“权衡难题”(Trade-offs):调整模型行为就像玩“打地鼠”游戏,减少“废话”可能导致模型变懒(不写完整代码),减少道歉可能导致模型变粗鲁。最后,他将这种性格调优的困难视为未来 AI 对齐(Alignment)问题的预演——如果我们在现在无法完美控制模型的语气,未来将更难控制超级智能的行为。

[原文] [Lex Fridman]: This is probably something I'll talk to Amanda much more about. But another Reddit question, "When will Claude stop trying to be my puritanical grandmother imposing its moral worldview on me as a paying customer? And also, what is the psychology behind making Claude overly apologetic?" So this kind of reports about the experience, a different angle on the frustration, it has to do with the character.

[译文] [Lex Fridman]: 这可能是我以后会和 Amanda 深入探讨的话题。但这里有一个来自 Reddit 的问题:“Claude 什么时候才能不再像个清教徒式的祖母(puritanical grandmother)那样,试图把它那一套道德世界观强加给我这个付费客户身上?还有,把 Claude 设定得过度道歉,背后的心理学依据是什么?”所以这类关于体验的报告,是从另一个角度表达挫败感,这与角色性格有关。

[原文] [Dario Amodei]: Yeah, so a couple points on this first. One is like things that people say on Reddit and Twitter, or X or whatever it is, there's actually a huge distribution shift between like the stuff that people complain loudly about on social media and what actually kind of like, you know, statistically users care about and that drives people to use the models.

[译文] [Dario Amodei]: 是的,关于这点我先说几句。第一点是,人们在 Reddit 和 Twitter(或者 X,不管叫什么)上大声抱怨的事情,与统计学上用户真正关心的事情以及驱动人们使用模型的因素之间,实际上存在巨大的分布偏差。

[原文] [Dario Amodei]: Like people are frustrated with, you know, things like, you know, the model not writing out all the code or the model, you know, just not being as good at code as it could be, even though it's the best model in the world on code. I think the majority things are about that. But certainly a kind of vocal minority are, you know, kind of raise these concerns, right? Are frustrated by the model refusing things that it shouldn't refuse, or like apologizing too much, or just having these kind of like annoying verbal ticks.

[译文] [Dario Amodei]: 比如人们会因为模型没有写出所有代码,或者模型在代码方面表现得不如预期(尽管它是世界上写代码最好的模型)而感到沮丧。我认为大多数问题都是关于这些的。但当然,也确实有一部分直言不讳的少数派提出了这些担忧,对吧?他们对模型拒绝了不该拒绝的事情、或者过度道歉、或者仅仅是有这类烦人的口头禅感到沮丧。

[原文] [Dario Amodei]: The second caveat, and I just wanna say this like super clearly because I think it's like some people don't know it, others like kind of know it but forget it. Like it is very difficult to control across the board how the models behave. You cannot just reach in there and say, "Oh, I want the model to like apologize less."

[译文] [Dario Amodei]: 第二个需要说明的注意事项——我想超级清楚地说明这一点,因为我觉得有些人不知道,还有些人虽然知道但忘了——那就是要全面控制模型的行为是非常困难的。你不能直接把手伸进去说:“噢,我想让模型少道歉一点。”

[原文] [Dario Amodei]: Like you can do that, you can include training data that says like, "Oh, the model should like apologize less," but then in some other situation they end up being like super rude or like overconfident in a way that's like misleading people. So there are all these trade-offs.

[译文] [Dario Amodei]: 你当然可以那样做,你可以加入训练数据说“噢,模型应该少道歉”,但结果可能是它在其他某种情况下变得超级粗鲁,或者是那种会误导人的过度自信。所以这就存在各种各样的权衡(trade-offs)。

[原文] [Dario Amodei]: For example, another thing is there was a period during which models, ours and I think others as well were too verbose, right? They would like repeat themselves, they would say too much. You can cut down on the verbosity by penalizing the models for just talking for too long.

[译文] [Dario Amodei]: 举个例子,有一段时间,我们的模型——我想其他公司的也是——太啰嗦了,对吧?它们会自我重复,说得太多。你可以通过惩罚模型说话太长来减少这种啰嗦。

[原文] [Dario Amodei]: What happens when you do that, if you do it in a crude way, is when the models are coding, sometimes they'll say "rest of the code goes here," right? Because they've learned that that's the way to economize and that they see it, and then so that leads the model to be so-called lazy in coding, where they're just like, ah, you can finish the rest of it.

[译文] [Dario Amodei]: 如果你用粗暴的方式那样做,会发生什么呢?当模型在写代码时,有时它们会说“剩下的代码填在这里”,对吧?因为它们学会了那是省事的方法。这就会导致模型在编程时所谓的“偷懒”,它们就像是在说:“啊,剩下的你自己搞定吧。”

[原文] [Dario Amodei]: It's not because we wanna, you know, save on compute or because you know, the models are lazy, and you know, during winter break, or any of the other kind of conspiracy theories that have come up. It's actually, it's just very hard to control the behavior of the model, to steer the behavior of the model in all circumstances at once. You can kind of, there's this whack-a-mole aspect where you push on one thing and like, you know, these other things start to move as well that you may not even notice or measure.

[译文] [Dario Amodei]: 这不是因为我们想节省算力,或者因为模型真的很懒,也不是因为什么“寒假效应”,或者其他冒出来的那些阴谋论。实际上,仅仅是因为很难控制模型的行为,很难在所有情况下同时引导模型的行为。这有点像“打地鼠”(whack-a-mole)游戏,你按下这一头,另一头就会开始动,而你甚至可能都没注意到或测量到。
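Dario 所说的"粗暴地惩罚长度会导致模型写代码偷懒",可以用一个玩具化的奖励函数来直观说明。下面的 Python 草图中,`crude_reward`、`LENGTH_PENALTY` 以及所有数值都是虚构的示意,并非任何真实训练配置:在这种奖励下,一个以 "rest of the code goes here" 收尾的残缺回答反而得分更高,优化该奖励自然会教出"偷懒"行为。

```python
# "打地鼠"式权衡的玩具示意:粗暴的长度惩罚会让残缺回答的奖励更高。
LENGTH_PENALTY = 0.02  # 每个 token 扣除的奖励(虚构数值)

def crude_reward(completion: str, helpfulness: float) -> float:
    # helpfulness 在真实场景中来自学习到的奖励模型,这里直接给定;
    # 用空格切分近似 token 数。
    n_tokens = len(completion.split())
    return helpfulness - LENGTH_PENALTY * n_tokens

full_code = " ".join(["line"] * 120)  # 一份完整的 120 "token" 解答
lazy_code = "def solve(): ... # rest of the code goes here"

full = crude_reward(full_code, helpfulness=1.0)  # 更有用,但很长
lazy = crude_reward(lazy_code, helpfulness=0.6)  # 不完整,但很短

# 残缺回答在粗暴惩罚下反而胜出:优化这个奖励会教出"偷懒"的模型
print(f"full={full:.2f}, lazy={lazy:.2f}, lazy wins: {lazy > full}")
```

这也正呼应了前文的"打地鼠"比喻:压下"啰嗦"这一头,"偷懒"那一头就冒了出来,而且往往事先没有被注意或测量到。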

[原文] [Dario Amodei]: And so one of the reasons that I care so much about, you know, kind of grand alignment of these AI systems in the future is that these systems are actually quite unpredictable. They're actually quite hard to steer and control. And this version we're seeing today of you make one thing better, it makes another thing worse, I think that's like a present day analog of future control problems in AI systems that we can start to study today, right?

[译文] [Dario Amodei]: 所以,我之所以如此关心未来这些 AI 系统的宏大对齐(grand alignment)问题,原因之一就是这些系统实际上相当不可预测。它们实际上相当难以引导和控制。我们今天看到的这种“改善了一方面却让另一方面变差”的版本,我认为就是未来 AI 系统控制问题的现代版模拟,是我们今天就可以开始研究的,对吧?

[原文] [Dario Amodei]: I think that that difficulty in steering the behavior and in making sure that if we push an AI system in one direction, it doesn't push it in another direction in some other ways that we didn't want. I think that's kind of an early sign of things to come, and if we can do a good job of solving this problem, right, of like you ask the model to like, you know, to like make and distribute smallpox and it says no, but it's willing to like help you in your graduate level virology class. Like how do we get both of those things at once? It's hard.

[译文] [Dario Amodei]: 我认为这种引导行为的困难——即确保当我们把 AI 系统推向一个方向时,它不会以我们不希望的方式在另一个方向上乱动——这是未来挑战的早期迹象。如果我们能很好地解决这个问题……比如你让模型制造和传播天花病毒,它说不;但它又愿意在你研究生级别的病毒学课程中帮助你。我们如何同时做到这两点?这很难。

[原文] [Dario Amodei]: It's very easy to go to one side or the other and it's a multidimensional problem. And so, you know, I think these questions of like shaping the model's personality, I think they're very hard. I think we haven't done perfectly on them. I think we've actually done the best of all the AI companies, but still so far from perfect.

[译文] [Dario Amodei]: 很容易偏向这一边或那一边,这是一个多维问题。所以,我认为这些关于塑造模型性格的问题非常难。我认为我们还没有做得完美。我认为我们实际上在所有 AI 公司中做得最好了,但距离完美还很远。

[原文] [Dario Amodei]: And I think if we can get this right, if we can control, you know, control the false positives and false negatives in this very kind of controlled present day environment, we'll be much better at doing it for the future when our worry is, you know, will the models be super autonomous? Will they be able to, you know, make very dangerous things? Will they be able to autonomously, you know, build whole companies? And are those companies aligned? So, I think of this present task as both vexing, but also good practice for the future.

[译文] [Dario Amodei]: 我认为如果我们能搞定这个,如果我们能在当今这种非常受控的环境下控制好假阳性和假阴性,那么在未来我们就能做得更好。那时的担忧将是:模型会变得超级自主吗?它们能制造非常危险的东西吗?它们能自主建立整个公司吗?那些公司是对齐的吗?所以我认为当前的任务既令人烦恼,也是面向未来的良好练习。



章节 11:负责任扩展政策 (RSP) 与 ASL 安全分级体系

📝 本节摘要

在本节中,Lex 询问了 Anthropic 的负责任扩展政策(Responsible Scaling Policy, RSP)与 AI 安全等级(ASL)标准。Dario 强调,强大的 AI 是一把双刃剑,巨大的益处伴随着巨大的责任。他重点关注两类风险:灾难性滥用(Catastrophic Misuse,如网络攻击、生化核放 CBRN 威胁)和自主性风险(Autonomy Risks,即模型脱离人类控制)。


面对“风险尚未出现但发展极快”的困境,Dario 提出了 RSP 的核心逻辑:“如果-那么”承诺(If-Then Commitments)。即只有当模型达到特定的危险能力阈值时,才强制实施相应的安全措施,避免在早期阶段“狼来了”式的过度反应。他详细定义了安全等级:ASL-1 是无害的(如国际象棋机器人);ASL-2 是当前的 AI 系统;ASL-3 将标志着模型能显著帮助非国家行为体制造生化武器(预计最早明年到来);而 ASL-4 及以上则涉及国家级威胁和超越人类的自主能力。

[原文] [Lex Fridman]: Okay, can you explain the Responsible Scaling Policy and the AI Safety Level Standards, ASL Levels?

[译文] [Lex Fridman]: 好的,你能解释一下“负责任扩展政策”(Responsible Scaling Policy)以及“AI 安全等级标准”(ASL Levels)吗?

[原文] [Dario Amodei]: As much as I am excited about the benefits of these models, and, you know, we'll talk about that if we talk about "Machines of Loving Grace," I'm worried about the risks and I continue to be worried about the risks.

[译文] [Dario Amodei]: 尽管我对这些模型的益处感到兴奋——如果我们谈论《仁慈机器》(Machines of Loving Grace)时会聊到这个——但我同时也担心风险,并且我持续担心这些风险。

[原文] [Dario Amodei]: No one should think that, you know, "Machines of Loving Grace" was me saying, you know, I'm no longer worried about the risks of these models. I think they're two sides of the same coin.

[译文] [Dario Amodei]: 没人应该认为《仁慈机器》这篇文章代表我不再担心这些模型的风险了。我认为它们是同一枚硬币的两面。

[原文] [Dario Amodei]: The power of the models and their ability to solve all these problems in, you know, biology, neuroscience, economic development, governance and peace, large parts of the economy, those come with risks as well, right? With great power comes great responsibility, right? The two are paired.

[译文] [Dario Amodei]: 模型的力量,以及它们解决生物学、神经科学、经济发展、治理与和平、以及大部分经济领域问题的能力,这些同时也伴随着风险,对吧?能力越大,责任越大,对吧?这两者是成对出现的。

[原文] [Dario Amodei]: Perhaps the two biggest risks that I think about... one is what I call catastrophic misuse. These are misuse of the models in domains like cyber, bio, radiological, nuclear, right? Things that could, you know, that could harm or even kill thousands, even millions of people if they really, really go wrong.

[译文] [Dario Amodei]: 我考虑的两个最大的风险……其中一个我称之为灾难性滥用(catastrophic misuse)。这是指在网络、生物、放射性和核(CBRN)等领域的模型滥用,对吧?如果真的出了大问题,这些事情可能会伤害甚至杀死成千上万,甚至数百万人。

[原文] [Dario Amodei]: And the second range of risks would be the autonomy risks, which is the idea that models might on their own, particularly as we give them more agency than they've had in the past... are they doing what we really want them to do? It's very difficult to even understand in detail what they're doing, let alone control it.

[译文] [Dario Amodei]: 第二类风险是自主性风险(autonomy risks),即模型可能会自己行动,特别是当我们赋予它们比过去更多的代理权时……它们真的在做我们想让它们做的事吗?甚至很难详细了解它们在做什么,更不用说控制它了。

[原文] [Dario Amodei]: And so these are the two risks I'm worried about. And our Responsible Scaling Plan... is designed to address these two types of risks. And so every time we develop a new model, we basically test it for its ability to do both of these bad things.

[译文] [Dario Amodei]: 这些就是我担心的两类风险。而我们的“负责任扩展计划”……就是为了应对这两类风险而设计的。所以每当我们开发一个新模型,我们基本上都会测试它做这两类坏事的能力。

[原文] [Dario Amodei]: So if I were to back up a little bit, I think we have an interesting dilemma with AI systems where they're not yet powerful enough to present these catastrophes... but the case for worry, the case for risk is strong enough that we should act now. And they're getting better very, very fast, right?

[译文] [Dario Amodei]: 如果我稍微回溯一下,我认为我们在 AI 系统上面临一个有趣的困境:它们目前还不够强大,不足以引发这些灾难……但担忧的理由、风险的理由已经强到足以让我们现在就采取行动。而且它们正在非常、非常快地变得更好,对吧?

[原文] [Dario Amodei]: So we have this thing where it's like, it's surprisingly hard to address these risks because they're not here today. They don't exist. They're like ghosts, but they're coming at us so fast because the models are improving so fast.

[译文] [Dario Amodei]: 所以我们面临的情况是,要应对这些风险出奇地难,因为它们今天还不存在。它们就像幽灵,但正因为模型进步如此神速,它们正飞快地向我们袭来。

[原文] [Dario Amodei]: The RSP basically develops what we've called an if then structure, which is if the models pass a certain capability, then we impose a certain set of safety and security requirements on them.

[译文] [Dario Amodei]: RSP 基本上开发了我们所谓的“如果-那么”结构(if-then structure),即如果模型通过了某种能力阈值,那么我们就对它们施加一套特定的安全和安保要求。

[原文] [Dario Amodei]: So today's models are what's called ASL two. Models that were, ASL one is for systems that manifestly don't pose any risk of autonomy or misuse. So for example, a chess playing bot, Deep Blue would be ASL one.

[译文] [Dario Amodei]: 今天的模型属于所谓的 ASL-2。ASL-1 是针对那些显然不构成任何自主或滥用风险的系统。例如,下棋机器人 Deep Blue 就属于 ASL-1。

[原文] [Dario Amodei]: ASL two is today's AI systems where we've measured them and we think these systems are simply not smart enough to, you know, autonomously self-replicate or conduct a bunch of tasks, and also not smart enough to provide meaningful information about CBRN risks... above and beyond what can be known from looking at Google.

[译文] [Dario Amodei]: ASL-2 是今天的 AI 系统,我们测量过它们,认为这些系统还不够聪明,无法自主自我复制或执行一系列任务,也不够聪明到能提供关于 CBRN(生化核放)风险的有意义信息——我是指超出在 Google 上能查到的信息之外。

[原文] [Dario Amodei]: So ASL three is gonna be the point at which the models are helpful enough to enhance the capabilities of non-state actors, right? State actors can already do a lot of, unfortunately, to a high level of proficiency, a lot of these very dangerous and destructive things. The difference is that non-state actors are not capable of it.

[译文] [Dario Amodei]: ASL-3 将是模型足以增强非国家行为体(non-state actors)能力的那个时间点,对吧?不幸的是,国家行为体已经能够非常熟练地做很多这些极其危险和破坏性的事情。区别在于非国家行为体目前还做不到。

[原文] [Dario Amodei]: And so when we get to ASL three, we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors, and misuse of the model as it's deployed.

[译文] [Dario Amodei]: 所以当我们达到 ASL-3 时,我们将采取特殊的安保预防措施,旨在足以防止模型被非国家行为体窃取,并防止模型在部署时被滥用。

[原文] [Dario Amodei]: ASL four, getting to the point where these models could enhance the capability of a already knowledgeable state actor and/or become, you know, the main source of such a risk.

[译文] [Dario Amodei]: ASL-4,则是达到这种程度:这些模型可以增强已经具备相关知识的国家行为体的能力,或者成为此类风险的主要来源。

[原文] [Dario Amodei]: And then ASL five is where we would get to the models that are, you know, that are kind of, you know, truly capable, that could exceed humanity in their ability to do any of these tasks.

[译文] [Dario Amodei]: 然后 ASL-5 是我们将达到那种模型:它们真正具备能力,在执行任何这些任务的能力上可能超越人类。

[原文] [Dario Amodei]: And so the point of if then structure commitment is basically to say... It's actually kind of dangerous to cry wolf. It's actually kind of dangerous to say this, you know, this model is risky. And, you know, people look at it and they say, this is manifestly not dangerous.

[译文] [Dario Amodei]: 所以“如果-那么”结构承诺的意义基本上是在说……“狼来了”(crying wolf)实际上是有点危险的。如果你说“这个模型有风险”,而人们看着它说“这显然不危险”,那其实是有害的。

[原文] [Dario Amodei]: So if then, the trigger commitment is basically a way to deal with this. Says you clamp down hard when you can show that the model is dangerous.

[译文] [Dario Amodei]: 所以“如果-那么”的触发承诺基本上是处理这个问题的一种方式。它意味着只有当你能证明模型是危险的时候,你才采取严厉的限制措施。



章节 12:计算机使用能力 (Computer Use) 与代理风险

📝 本节摘要

本节重点讨论了 Claude 新增的“计算机使用”功能,即模型能够像人类一样查看屏幕、移动鼠标和点击。Dario 解释说,这项看似复杂的功能在技术上其实相对简单:只要模型具备强大的图像分析能力,通过训练它输出坐标和按键指令,它就能很快掌握操作系统的使用。他引用了一句名言“一旦进入近地轨道,你就已经去往任何地方的半路上了”,以此说明强大的预训练模型已经具备了极强的泛化能力。


尽管目前的模型还不够完美(有时会误触),但 Dario 预测其可靠性将在一年内达到人类水平(80-90%)。在风险方面,他认为计算机使用本身并不是一种全新的危险能力(如生化武器知识),但它“扩大了孔径”(opens the aperture),让模型更容易遭遇提示词注入(prompt injection)攻击(例如屏幕上出现的恶意广告),或被用于实施大规模的网络诈骗。

[原文] [Lex Fridman]: One of the ways that Claude has been getting more and more powerful is it's now able to do some agentic stuff, computer use. There's also an analysis within the sandbox of claude.ai itself. But let's talk about computer use. That seems to me super exciting that you can just give Claude a task and it takes a bunch of actions, figures it out, and has access to your computer through screenshots. So can you explain how that works? And where that's headed?

[译文] [Lex Fridman]: Claude 变得越来越强大的一种方式是它现在能够做一些代理(agentic)的事情,比如计算机使用(computer use)。虽然在 claude.ai 内部沙盒里也有代码分析功能,但我们来谈谈计算机使用。这对我来说超级令人兴奋,你可以直接给 Claude 一个任务,它会采取一系列行动,自己搞定,并且通过截屏访问你的电脑。你能解释一下它是如何工作的吗?以及未来的发展方向是什么?

[原文] [Dario Amodei]: Yeah, it's actually relatively simple. So Claude has had for a long time, since Claude 3.0 back in March, the ability to analyze images and respond to them with text. The only new thing we added is

章节 13:AI 监管的必要性与 SB 1047 法案

📝 本节摘要

在本节中,Lex 询问了 Dario 对加州 AI 监管法案 SB 1047(最终被州长否决)的看法。Dario 表达了对适度监管的坚定支持。他认为,虽然该法案并不完美,但 AI 行业不能仅依靠各公司自愿遵守的“君子协定”(如 RSP),因为这会造成“劣币驱逐良币”——不负责任的公司会获得竞争优势。


Dario 驳斥了部分反监管的论点(如导致公司搬离加州或破坏开源),称其为“胡说八道”,但也承认过度的官僚主义(类比 GDPR)可能会扼杀创新。他呼吁一种“外科手术式”(surgical)的监管模式:精准针对灾难性风险(catastrophic risks),而不是既浪费时间又无效的繁文缛节。他强调,如果 2025 年底前仍无实质性立法,在这个“幽灵般”的风险快速逼近的时刻,他将感到非常担忧。

[原文] [Lex Fridman]: Let me ask about regulation. What's the role of regulation in keeping AI safe? So for example, can you describe California AI regulation Bill SB 1047 that was ultimately vetoed by the governor? What are the pros and cons of this bill in general?

[译文] [Lex Fridman]: 让我问问关于监管的问题。监管在保持 AI 安全方面扮演什么角色?例如,你能描述一下最终被州长否决的加州 AI 监管法案 SB 1047 吗?总体而言,这项法案的利弊是什么?

[原文] [Dario Amodei]: Yes, we ended up making some suggestions to the bill, and then some of those were adopted and, you know, we felt, I think quite positively, quite positively about the bill by the end of that. It did still have some downsides, and, you know, of course it got vetoed.

[译文] [Dario Amodei]: 是的,我们最终对该法案提出了一些建议,其中一些被采纳了。到最后,我们对该法案的感觉其实是相当积极的,相当积极。它确实仍有一些缺点,而且你知道,当然它最后被否决了。

[原文] [Dario Amodei]: I think at a high level, I think some of the key ideas behind the bill are, you know, I would say similar to ideas behind our RSPs. And I think it's very important that some jurisdiction, whether it's California or the federal government and/or other countries and other states passes some regulation like this.

[译文] [Dario Amodei]: 我认为从高层次来看,该法案背后的一些关键理念与我们的 RSP(负责任扩展政策)背后的理念是相似的。我认为非常重要的一点是,必须有某个司法管辖区——无论是加州、联邦政府,还是其他国家或州——通过某种像这样的法规。

[原文] [Dario Amodei]: I can talk through why I think that's so important. So I feel good about our RSP. It's not perfect, it needs to be iterated on a lot, but it's been a good forcing function for getting the company to take these risks seriously, to put them into product planning, to really make them a central part of work at Anthropic and to make sure that all of 1000 people, and it's almost 1000 people now at Anthropic, understand that this is one of the highest priorities of the company, if not the highest priority.

[译文] [Dario Amodei]: 我可以谈谈为什么我认为这如此重要。我对我们的 RSP 感觉良好。它并不完美,需要大量迭代,但它一直是一个很好的强制机制(forcing function),迫使公司认真对待这些风险,将其纳入产品规划,真正使其成为 Anthropic 工作的核心部分,并确保所有 1000 名员工(Anthropic 现在差不多有 1000 人了)都明白,这是公司最高优先级的任务之一,如果不是最高优先级的话。

[原文] [Dario Amodei]: But one, there are still some companies that don't have RSP like mechanisms, like OpenAI, Google did adopt these mechanisms a couple months after Anthropic did, but there are other companies out there that don't have these mechanisms at all.

[译文] [Dario Amodei]: 但是,第一,仍然有一些公司没有类似 RSP 的机制。比如 OpenAI 和 Google 在 Anthropic 之后几个月确实采用了这些机制,但外面还有其他公司根本没有这些机制。

[原文] [Dario Amodei]: And so if some companies adopt these mechanisms and others don't, it's really gonna create a situation where, you know, some of these dangers have the property that it doesn't matter if three out of five of the companies are being safe, if the other two are being unsafe, it creates this negative externality.

[译文] [Dario Amodei]: 所以,如果有些公司采用了这些机制而其他公司没有,这真的会造成一种局面……因为其中一些危险具有这样的特性:即使五家公司中有三家是安全的也没用,如果另外两家不安全,这就会制造负外部性(negative externality)

[原文] [Dario Amodei]: And I think the lack of uniformity is not fair to those of us who have put a lot of effort into being very thoughtful about these procedures.

[译文] [Dario Amodei]: 而且我认为,这种缺乏统一标准的情况,对于我们这些投入了大量精力去深思熟虑这些程序的人来说是不公平的。

[原文] [Dario Amodei]: The second thing is, I don't think you can trust these companies to adhere to these voluntary plans in their own, right? I like to think that Anthropic will. We do everything we can that we will. Our RSP is checked by our long-term benefit trust. So, you know, we do everything we can to adhere to our own RSP.

[译文] [Dario Amodei]: 第二点是,我认为你不能指望这些公司会靠自己去遵守这些自愿计划,对吧?我希望 Anthropic 会遵守。我们尽一切努力去遵守。我们的 RSP 受到我们的“长期利益信托”(Long-term Benefit Trust)的检查。所以,我们尽一切努力遵守我们自己的 RSP。

[原文] [Dario Amodei]: But you know, you hear lots of things about various companies saying, oh, they said they would give this much compute and they didn't. They said they would do this thing and they didn't. You know, I don't think it makes sense to, you know, to litigate particular things that companies have done.

[译文] [Dario Amodei]: 但你知道,你经常听到各种关于公司的传闻,说他们承诺提供多少算力却没给,承诺做这件事却没做。我认为去纠结公司做过的具体事情没有意义。

[原文] [Dario Amodei]: But I think this broad principle that like if there's nothing watching over them, there's nothing watching over us as an industry, there's no guarantee that we'll do the right thing, and the stakes are very high.

[译文] [Dario Amodei]: 但我认为有一个大原则:如果没有东西监管他们,没有东西监管我们这个行业,就无法保证我们会做正确的事,而代价是非常高昂的。

[原文] [Dario Amodei]: And so I think it's important to have a uniform standard that everyone follows, and to make sure that simply that the industry does what a majority of the industry has already said is important and has already said that they definitely will do.

[译文] [Dario Amodei]: 所以我认为重要的是要有一个人人遵守的统一标准,并确保行业仅仅是去执行大多数行业参与者已经承认很重要、并且已经承诺一定会做的事情。

[原文] [Dario Amodei]: Right, some people, you know, I think there's a class of people who are against regulation on principle. I understand where that comes from. If you go to Europe and, you know, you see something like GDPR, you see some of the other stuff that they've done. You know, some of it's good, but some of it is really unnecessarily burdensome, and I think it's fair to say really has slowed innovation.

[译文] [Dario Amodei]: 当然,有些人……我认为有一类人原则上反对监管。我理解这种观点从何而来。如果你去欧洲,看到像 GDPR 这样的东西,以及他们做的其他一些事情。有些是好的,但有些真的是不必要的负担,我认为可以公平地说,确实减缓了创新。

[原文] [Dario Amodei]: And so I understand where people are coming from on priors. I understand why people come from, start from that position. But again, I think AI is different. If we go to the very serious risks of autonomy and misuse that I talked about, you know, just a few minutes ago, I think that those are unusual and they warrant an unusually strong response.,

[译文] [Dario Amodei]: 所以我理解人们基于先验观念的立场。我理解人们为什么从那个立场出发。但再说一次,我认为 AI 是不同的。如果我们考虑到我几分钟前谈到的那些非常严重的自主性和滥用风险,我认为这些是不寻常的,它们需要不同寻常的强力回应。

[原文] [Dario Amodei]: You know, I think one of the issues with SB 1047, especially the original version of it, was it had a bunch of the structure of RSPs, but it also had a bunch of stuff that was either clunky or that just would've created a bunch of burdens, a bunch of hassle, and might even have missed the target in terms of addressing the risks.

[译文] [Dario Amodei]: 你知道,SB 1047 的问题之一,尤其是最初版本,是它虽然包含了很多 RSP 的结构,但也包含了一堆要么笨拙、要么只会制造一堆负担和麻烦的东西,甚至可能在解决风险方面偏离了目标。

[原文] [Dario Amodei]: You don't really hear about it on Twitter. You just hear about kind of, you know, people are cheering for any regulation, and then the folks who are against make up these often quite intellectually dishonest arguments about how, you know, it'll make us move away from California. Bill doesn't apply if you're headquartered in California, bill only applies if you do business in California.

[译文] [Dario Amodei]: 你在 Twitter 上听不到这些(理性的讨论)。你听到的只是人们要么为任何监管欢呼,要么反对者编造出通常相当不诚实的论点,比如说这会迫使我们搬离加州。这法案并不取决于你的总部是否在加州,而是取决于你是否在加州做生意。

[原文] [Dario Amodei]: Or that it would damage the open source ecosystem, or that it would, you know, it would cause all of these things. I think those were mostly nonsense, but there are better arguments against regulation. There's one guy, Dean Ball, who's really, you know, I think a very scholarly, scholarly analyst who looks at what happens when a regulation is put in place and ways that they can kind of get a life of their own, or how they can be poorly designed.

[译文] [Dario Amodei]: 或者说它会破坏开源生态系统,或者会造成所有这些事情。我认为这些大部分都是胡说八道(nonsense),但也确实存在反对监管的更好论点。有一个叫 Dean Ball 的人,他真的是一位非常学术的分析师,他研究了监管实施后会发生什么,以及监管如何可能从原本的目的中异化(get a life of their own),或者如何被设计得很糟糕。

[原文] [Dario Amodei]: And so our interest has always been, we do think there should be regulation in this space, but we wanna be an actor who makes sure that that regulation is something that's surgical, that's targeted at the serious risks and is something people can actually comply with.

[译文] [Dario Amodei]: 所以我们一直以来的利益诉求是:我们确实认为这个领域应该有监管,但我们想成为确保监管是“外科手术式”(surgical)的参与者——即针对严重风险,并且是人们实际上可以遵守的。

[原文] [Dario Amodei]: Because something I think the advocates of regulation don't understand as well as they could is if we get something in place that's poorly targeted, that wastes a bunch of people's time, what's gonna happen is people are gonna say, see, these safety risks, you know, this is nonsense. You know, I just had to hire 10 lawyers, you know, to fill out all these forms. I had to run all these tests for something that was clearly not dangerous.

[译文] [Dario Amodei]: 因为我认为监管的倡导者没有充分理解的一点是:如果我们实施了一个目标不明确、浪费大家时间的法规,结果就是人们会说:“看吧,这些安全风险全是胡扯。我为了填这些表格雇了 10 个律师。我不得不为一些显然不危险的东西做所有这些测试。”

[原文] [Dario Amodei]: And after six months of that, there will be a groundswell and we'll end up with a durable consensus against regulation. And so I think the worst enemy of those who want real accountability is badly designed regulation.

[译文] [Dario Amodei]: 经过六个月这样的折腾,就会出现反对浪潮,最终我们将面临一个持久的反对监管的共识。所以我认为,对于那些想要真正问责制的人来说,最大的敌人就是设计糟糕的监管。

[原文] [Dario Amodei]: I feel urgency. I really think we need to do something in 2025. You know, if we get to the end of 2025 and we've still done nothing about this, then I'm gonna be worried. I'm not worried yet, because again, the risks aren't here yet, but I think time is running short.

[译文] [Dario Amodei]: 我感到紧迫感。我真的认为我们需要在 2025 年做点什么。如果到了 2025 年底我们对此仍无所作为,那么我会开始担心。我现在还没那么担心,因为再说一次,风险目前还没出现,但我认为时间不多了。

[原文] [Lex Fridman]: Yeah, and come up with something surgical, like you said.

[译文] [Lex Fridman]: 是的,就像你说的,要拿出一些外科手术般精准的方案。

[原文] [Dario Amodei]: Yeah, yeah, yeah, exactly. And we need to get away from this intense pro-safety versus intense anti-regulatory rhetoric, right? It's turned into these flame wars on Twitter and nothing good's gonna come of that.

[译文] [Dario Amodei]: 对,对,正是如此。我们需要摆脱这种激进的“支持安全”对抗激进的“反监管”的言论,对吧?这已经变成了 Twitter 上的口水战(flame wars),那样不会有任何好结果。



章节 14:OpenAI 往事与创立 Anthropic 的核心分歧

📝 本节摘要

本节聚焦于 Dario 在 OpenAI 的五年经历及其离职创立 Anthropic 的真实原因。Dario 回忆了他与 Ilya Sutskever 共事时对“扩展假说”的顿悟——即“模型只想学习,不要阻碍它”。针对外界关于他因反对微软投资或商业化而离职的传言,Dario 予以否认,指出他本人曾参与 GPT-3 的商业化。他解释称,核心分歧在于如何(How)以安全、谨慎和建立信任的方式通往强大的 AI。他认为,与其在已有组织内争论愿景,不如带领信任的团队出去做一个“纯粹的实验”(Clean Experiment)。如果这个实验成功并展示出更好的做法,其他公司(包括前东家)自然会效仿,从而实现整个行业的“力争上游”。

[原文] [Lex Fridman]: So there's a lot of curiosity about the different players in the game. One of the OGs is OpenAI. You've had several years of experience at OpenAI. What's your story and history there?

[译文] [Lex Fridman]: 大家对这场游戏中的不同玩家都充满了好奇。其中一个元老级(OG)玩家就是 OpenAI。你在 OpenAI 有几年的工作经历。你在那里的故事和历史是怎样的?

[原文] [Dario Amodei]: Yeah, so I was at OpenAI for roughly five years. For the last, I think it was couple years, you know, I was vice president of research there. Probably myself and Ilya Sutskever were the ones who, you know, really kind of set the research direction.

[译文] [Dario Amodei]: 是的,我在 OpenAI 待了大概五年。在最后那几年,我是那里的研究副总裁。大概是我和 Ilya Sutskever 两个人真正设定了研究方向。

[原文] [Dario Amodei]: Around 2016 or 2017, I first started to really believe in or at least confirm my belief in the Scaling Hypothesis when Ilya famously said to me, "The thing you need to understand about these models is they just wanna learn. The models just wanna learn."

[译文] [Dario Amodei]: 大约在 2016 或 2017 年,我第一次开始真正相信——或者至少证实了我对扩展假说(Scaling Hypothesis)的信念。当时 Ilya 对我那句名言是:“关于这些模型,你需要明白的一点是,它们只是想学习。模型只是想学习。”

[原文] [Dario Amodei]: And again, sometimes there are these one sentences, these zen koans that you hear them and you're like, ah, that explains everything. That explains like 1000 things that I've seen. And then I, you know, ever after I had this visualization in my head of like, you optimize the models in the right way, you point the models in the right way. They just wanna learn. They just wanna solve the problem, regardless of what the problem is.

[译文] [Dario Amodei]: 再说一次,有时候就是这种一句话,这种禅宗公案(zen koans),你听到了就会觉得:啊,这就解释了一切。这解释了我见过的 1000 件事。从那以后,我脑海里就有了这样一个画面:你以正确的方式优化模型,给模型指引正确的方向。它们只是想学习。它们只是想解决问题,不管问题是什么。

[原文] [Lex Fridman]: So get out of their way, basically.

[译文] [Lex Fridman]: 所以基本上就是别挡它们的路。

[原文] [Dario Amodei]: Get out of their way, yeah. Don't impose your own ideas about how they should learn. And you know, this was the same thing as Rich Sutton put out in the Bitter Lesson or Gwern wrote about in The Scaling Hypothesis.

[译文] [Dario Amodei]: 对,别挡路。不要把你关于它们应该如何学习的想法强加给它们。这和 Rich Sutton 在《苦涩的教训》(The Bitter Lesson)里提出的,或者 Gwern 在《扩展假说》里写的是同一个道理。

[原文] [Lex Fridman]: Why'd you leave? Why'd you decide to leave?

[译文] [Lex Fridman]: 你为什么离开?你为什么决定离开?

[原文] [Dario Amodei]: Yeah, so look, I'm gonna put things this way and, you know, I think it ties to the race to the top, right? Which is, you know, in my time at OpenAI, what I'd come to see is I'd come to appreciate the Scaling Hypothesis, and as I'd come to appreciate kind of the importance of safety along with the Scaling Hypothesis. The first one I think, you know, OpenAI was getting on board with. The second one in a way had always been part of OpenAI's messaging, but, you know, over many years of the time that I spent there, I think I had a particular vision of how we should handle these things, how we should be brought out in the world, the kind of principles that the organization should have.

[译文] [Dario Amodei]: 好的,我想这样说,这其实和“力争上游”(race to the top)的理念有关,对吧?我在 OpenAI 期间,我开始重视扩展假说,同时也开始重视与扩展假说相伴的安全的重要性。第一点(扩展),我认为 OpenAI 已经跟上了。第二点(安全),在某种程度上一直是 OpenAI 宣传的一部分,但在我待在那里的很多年里,我认为对于我们应该如何处理这些事情、如何将其推向世界、组织应该具备什么样的原则,我有我自己独特的愿景。

[原文] [Dario Amodei]: And look, I mean, there were like many, many discussions about like, you know, should the org do, should the company do this? Should the company do that? Like, there's a bunch of misinformation out there. People say like, we left because we didn't like the deal with Microsoft. False, although, you know, it was like a lot of discussion, a lot of questions about exactly how we do the deal with Microsoft.

[译文] [Dario Amodei]: 看,当时有很多很多的讨论,关于组织应该做这个吗?公司应该做那个吗?外面有很多错误信息。人们说我们离开是因为我们不喜欢和微软的交易。这是错的,尽管当时确实有很多讨论,很多关于我们究竟该如何与微软进行交易的问题。

[原文] [Dario Amodei]: We left because we didn't like commercialization. That's not true, we built GPT-3, which was the model that was commercialized. I was involved in commercialization. It's more again about, how do you do it? Like civilization is going down this path to very powerful AI. What's the way to do it that is cautious, straightforward, honest, that builds trust in the organization and individuals?

[译文] [Dario Amodei]: 说我们离开是因为不喜欢商业化,那也不是真的。我们构建了 GPT-3,那就是被商业化的模型。我参与了商业化。这更多是关于如何去做(how do you do it)?人类文明正沿着这条路走向非常强大的 AI。什么是谨慎、直率、诚实且能建立对组织和个人信任的做法?

[原文] [Dario Amodei]: And, you know, I think at the end of the day, if you have a vision for that, forget about anyone else's vision. I don't wanna talk about anyone else's vision. If you have a vision for how to do it, you should go off and you should do that vision. It is incredibly unproductive to try and argue with someone else's vision.

[译文] [Dario Amodei]: 你知道,我认为归根结底,如果你对此有一个愿景,那就忘掉别人的愿景吧。我不想谈论别人的愿景。如果你有一个关于该怎么做的愿景,你就应该离开,去实现那个愿景。试图去和别人的愿景争辩是极其缺乏成效的。

[原文] [Dario Amodei]: But what you should do is you should take some people you trust and you should go off together and you should make your vision happen. And if your vision is compelling, if you can make it appeal to people, you know, some combination of ethically, you know, in the market, you know, if you can make a company that's a place people wanna join, that, you know, engages in practices that people think are reasonable, while managing to maintain its position in the ecosystem at the same time, if you do that, people will copy it.

[译文] [Dario Amodei]: 你应该做的是,带上一些你信任的人,一起离开,去实现你们的愿景。如果你的愿景令人信服,如果你能让它在道德上、市场上吸引人——如果你能建立一家人们想加入的公司,采取人们认为合理的做法,同时又能维持其在生态系统中的地位——如果你做到了这些,人们就会模仿它。

[原文] [Dario Amodei]: I just, I don't know how to be any more specific about it than that, but I think it's generally very unproductive to try and get someone else's vision to look like your vision. It's much more productive to go off and do a clean experiment and say, this is our vision, this is how we're gonna do things. Your choice is you can ignore us, you can reject what we're doing, or you can start to become more like us, and imitation is the sincerest form of flattery.

[译文] [Dario Amodei]: 我不知道还能怎么更具体地说,但我认为试图把别人的愿景变得像你的愿景,通常是非常没有成效的。更有效率的做法是离开,做一个纯粹的实验(clean experiment),然后说:这是我们的愿景,这是我们的做法。你们的选择是可以无视我们,可以拒绝我们的做法,或者开始变得更像我们——而模仿是最高级的奉承。



章节 15:人才密度优于人才规模:构建顶级研究团队

📝 本节摘要

本节探讨了“人才密度优于人才规模”(Talent Density Beats Talent Mass)的管理哲学。Dario 通过一个思想实验说明:一个由 100 名完全对齐的精英组成的团队,胜过一个由 200 名精英和 800 名普通员工组成的千人团队。尽管后者拥有更多的“人才总量”,但稀释后的环境会导致官僚主义、缺乏信任和内部政治,从而拖慢创新速度。


他透露,Anthropic 在员工数接近 1000 人时有意放缓了招聘节奏,以维持高标准。他特别提到公司偏爱招聘物理学家,因为他们具备极快的学习能力。Lex 则引用乔布斯的“A级玩家”(A Players)理论与其共鸣:顶尖人才只希望与其他顶尖人才共事,这种环境能产生巨大的激励作用。

[原文] [Lex Fridman]: You said talent density beats talent mass. So can you explain that? Can you expand on that? Can you just talk about what it takes to build a great team of AI researchers and engineers?

[译文] [Lex Fridman]: 你说过“人才密度优于人才规模”(talent density beats talent mass)。你能解释一下吗?能展开讲讲吗?能否谈谈建立一支伟大的 AI 研究员和工程师团队需要什么?

[原文] [Dario Amodei]: This is one of these statements that's like more true every month. Every month I see this statement as more true than I did the month before. So if I were to do a thought experiment, let's say you have a team of 100 people that are super smart, motivated, and aligned with the mission, and that's your company.

[译文] [Dario Amodei]: 这是一个每个月都显得更加真实的陈述。每一个月,我都比上个月更觉得这句话是真理。如果我做一个思想实验:假设你有一个 100 人的团队,他们超级聪明、充满动力,并且与使命高度对齐,这就是你的公司。

[原文] [Dario Amodei]: Or you can have a team of 1000 people where 200 people are super smart, super aligned with the mission, and then like 800 people are, let's just say you pick 800 like random big tech employees, which would you rather have, right?

[译文] [Dario Amodei]: 或者你可以拥有一个 1000 人的团队,其中 200 人是超级聪明、与使命超级对齐的,但剩下 800 人是……我们就说是你从大型科技公司随机挑选的 800 名员工。你会选哪一个,对吧?

[原文] [Dario Amodei]: The talent mass is greater in the group of 1000 people, right? You have even a larger number of incredibly talented, incredibly aligned, incredibly smart people. But the issue is just that if every time someone super talented looks around, they see someone else super talented and super dedicated, that sets the tone for everything, right? That sets the tone for everyone is super inspired to work at the same place. Everyone trusts everyone else.

[译文] [Dario Amodei]: 那个 1000 人的群体中,“人才总量”(talent mass)确实更大,对吧?你拥有数量更多的极具天赋、极度对齐、极度聪明的人。但问题在于,如果每当一个超级天才环顾四周,看到的都是另一个超级天才和超级敬业的人,这就为一切定下了基调,对吧?这定下了一种基调,让每个人都因为在同一个地方工作而深受鼓舞。每个人都信任其他人。

[原文] [Dario Amodei]: If you have 1000 or 10,000 people and things have really regressed, right? You are not able to do selection and you're choosing random people, what happens is then you need to put a lot of processes and a lot of guardrails in place just because people don't fully trust each other, or you have to adjudicate political battles. Like there are so many things that slow down the org's ability to operate.

[译文] [Dario Amodei]: 如果你有 1000 或 10,000 人,情况真的会退化,对吧?如果你无法进行筛选,招的是随机的人,结果就是你需要设立大量的流程和护栏,仅仅是因为人们不完全信任彼此,或者你必须去裁决政治斗争。有太多事情会拖慢组织的运作能力。

[原文] [Dario Amodei]: And so we're nearly 1000 people and you know, we've tried to make it so that as large a fraction of those 1000 people as possible are like super talented, super skilled. It's one of the reasons we've slowed down hiring a lot in the last few months. We grew from 300 to 800, I believe, I think in the first seven, eight months of the year.

[译文] [Dario Amodei]: 我们现在接近 1000 人了,我们一直在努力确保这 1000 人中有尽可能大的一部分是超级有天赋、超级有技能的。这也是我们在过去几个月大幅放缓招聘的原因之一。我相信我们在今年前七八个月从 300 人增长到了 800 人。

[原文] [Dario Amodei]: And now we've slowed down. We're at like, you know, the last three months, we went from 800 to 900, 950, something like that. Don't quote me on the exact numbers, but I think there's an inflection point around 1000, and we want to be much more careful how we grow.

[译文] [Dario Amodei]: 现在我们慢下来了。在过去三个月里,我们大概从 800 人增加到了 900 或 950 人左右。别引用确切数字,但我认为在 1000 人左右有一个拐点,我们在增长方式上想要更加小心。

[原文] [Dario Amodei]: Early on, and now as well, you know, we've hired a lot of physicists. You know, theoretical physicists can learn things really fast. Even more recently as we've continued to hire that, you know, we've really had a high bar for, on both the research side and the software engineering side have hired a lot of senior people, including folks who used to be at other companies in this space. And we've just continued to be very selective.

[译文] [Dario Amodei]: 在早期,包括现在,我们招了很多物理学家。你知道,理论物理学家学东西非常快。即便在最近继续招聘时,我们在研究端和软件工程端都保持了极高的门槛,招聘了很多资深人士,包括那些曾在该领域其他公司工作过的人。我们只是继续保持非常挑剔的标准。

[原文] [Dario Amodei]: It's very easy to go from 100 to 1000 and 1000 to 10,000 without paying attention to making sure everyone has a unified purpose. It's so powerful. If your company consists of a lot of different fiefdoms that all wanna do their own thing, they're all optimizing for their own thing, it's very hard to get anything done.

[译文] [Dario Amodei]: 从 100 人到 1000 人,再从 1000 人到 10,000 人,如果不注意确保每个人都有统一的目标,是很容易失控的。统一目标是非常强大的。如果你的公司由许多不同的“封地”(fiefdoms)组成,每个人都想做自己的事,都在为自己的利益优化,那就很难做成任何事。

[原文] [Dario Amodei]: But if everyone sees the broader purpose of the company, if there's trust and there's dedication to doing the right thing, that is a superpower. That in itself, I think, can overcome almost every other disadvantage.

[译文] [Dario Amodei]: 但如果每个人都看到公司更宏大的目标,如果存在信任,并且致力于做正确的事,那就是一种超能力(superpower)。我认为仅凭这一点,几乎可以克服其他所有的劣势。

[原文] [Lex Fridman]: And you know, as to Steve Jobs, A players. A players wanna look around and see other A players is another way of saying that. I don't know what that is about human nature, but it is demotivating to see people who are not obsessively driving towards a singular mission. And it is, on the flip side of that, super motivating to see that. It's interesting.

[译文] [Lex Fridman]: 你知道,就像史蒂夫·乔布斯说的“A 级人才”(A Players)。换句话说,A 级人才环顾四周时,想看到的也是其他 A 级人才。我不知道这是人性的什么特点,但如果看到周围的人没有像着了魔一样推动单一使命,确实会让人失去动力。而反过来,看到大家都那样,就会超级令人振奋。这很有趣。



章节 16:给年轻研究者的建议:保持开放与亲自实践

📝 本节摘要

面对 Lex 关于“如何对世界产生影响”的提问,Dario 给出的首要建议是“直接上手玩模型”(Play with the models)。他认为这些模型是没人真正理解的新事物,直接的体验性知识(Experiential Knowledge)远比只读论文重要。


此外,他建议年轻人“滑向冰球要去的地方”(Skate where the puck is going),即投身于那些尚未拥挤但极具潜力的新兴领域,如机械可解释性(Mechanistic Interpretability)、长视界学习(Long Horizon Learning)、动态评估(Evaluations)和多智能体(Multi-agent)系统。他指出,许多领域虽然在五年后会变得显而易见,但目前人们往往因为害怕偏离主流(如模型架构研究)而不敢涉足,但这正是“低垂的果实”所在。

[原文] [Lex Fridman]: You said what it takes to be a great AI researcher. Can we rewind the clock back? What advice would you give to people interested in AI? They're young, looking forward to, how can I make an impact on the world?

[译文] [Lex Fridman]: 你之前谈到了成为一名伟大的 AI 研究员需要具备什么。我们能把时钟拨回去一点吗?你会给那些对 AI 感兴趣的人什么建议?他们还年轻,展望未来,想着如何对世界产生影响?

[原文] [Dario Amodei]: I think my number one piece of advice is to just start playing with the models. This was actually, I worry a little, this seems like obvious advice now. I think three years ago, it wasn't obvious and people started by, oh, let me read the latest Reinforcement Learning paper. Let me, you know, let me kind of, I mean, that was really, and I mean, you should do that as well.

[译文] [Dario Amodei]: 我认为我的第一条建议就是直接开始玩这些模型。这其实……我有点担心,现在这似乎是显而易见的建议。但在三年前,这并不明显,那时人们的起步方式往往是:“噢,让我读读最新的强化学习论文。”让我……你知道,我的意思是,那确实是……当然你也应该读论文。

[原文] [Dario Amodei]: But now, you know, with wider availability of models and APIs, people are doing this more. But I think just experiential knowledge. These models are new artifacts that no one really understands, and so getting experience playing with them.

[译文] [Dario Amodei]: 但现在,随着模型和 API 的广泛可用,人们做得更多了。但我认为关键在于体验性知识(experiential knowledge)。这些模型是没人真正理解的新造物(new artifacts),所以要获得把玩它们的经验。

[原文] [Dario Amodei]: I would also say, again, in line with the like, do something new, think in some new direction. Like there are all these things that haven't been explored. Like for example, mechanistic interpretability is still very new. It's probably better to work on that than it is to work on new model architectures because, you know, it's more popular than it was before.

[译文] [Dario Amodei]: 我还要说,这再次印证了要做新东西、朝新方向思考的观点。有很多东西还没被探索过。例如,机械可解释性(mechanistic interpretability)仍然非常新。研究这个可能比研究新的模型架构更好,因为你知道,虽然它比以前更流行了……

[原文] [Dario Amodei]: There are probably like 100 people working on it, but there aren't like 10,000 people working on it. And it's just this fertile area for study. Like, you know, there's so much like low hanging fruit. You can just walk by and, you know, you can just walk by and you can pick things. And the only reason, for whatever reason, people aren't interested in it enough.

[译文] [Dario Amodei]: 可能只有 100 个人在研究它,而不是 10,000 个人。而这是一个肥沃的研究领域。有很多唾手可得的果实(low hanging fruit)。你可以直接走过去,就能摘到果实。而唯一的原因——不管出于什么原因——就是人们对它不够感兴趣。

[原文] [Dario Amodei]: I think there are some things around long horizon learning and long horizon tasks where there's a lot to be done. I think evaluations are still, we're still very early in our ability to study evaluations, particularly for dynamic systems acting in the world. I think there's some stuff around multi-agent.

[译文] [Dario Amodei]: 我认为在长视界学习(long horizon learning)和长视界任务方面还有很多工作要做。我认为在评估(evaluations)方面,我们研究评估的能力还处于非常早期的阶段,特别是针对在现实世界中行动的动态系统。我认为还有一些关于多智能体(multi-agent)的东西。

[原文] [Dario Amodei]: Skate where the puck is going is my advice. And you don't have to be brilliant to think of it. Like all the things that are gonna be exciting in five years, like people even mention them as like, you know, conventional wisdom, but like, it's just somehow there's this barrier that people don't double down as much as they could, or they're afraid to do something that's not the popular thing.

[译文] [Dario Amodei]: 我的建议是滑向冰球要去的地方(Skate where the puck is going)。你不需要绝顶聪明也能想到这一点。比如所有那些在五年内会变得令人兴奋的事情,人们甚至把它们当作某种常识来提,但不知何故,存在这样一种障碍,导致人们没有尽其所能地加倍投入,或者他们害怕做一些不那么流行的事情。

[原文] [Dario Amodei]: I don't know why it happens, but like, getting over that barrier, that's the my number one piece of advice.

[译文] [Dario Amodei]: 我不知道为什么会发生这种情况,但跨越那个障碍,就是我的头号建议。



章节 17:后训练技术深解:RLHF 与宪法式 AI (Constitutional AI)

📝 本节摘要

本节对话聚焦于大模型的后训练(Post-training)阶段。Lex 询问了制造 Claude 的“秘方”,Dario 坦言这往往不是单一的魔法,而是基础设施和数据质量等“无聊细节”的累积。


关于 RLHF(基于人类反馈的强化学习),Dario 认为它并不是让模型变聪明,而是“解除束缚”(Unhobbling)——弥合了模型原始智力与人类意图之间的鸿沟。


随后,Dario 详细介绍了 Anthropic 的招牌技术 宪法式 AI(Constitutional AI)。这是一种利用 RLAIF(基于 AI 反馈的强化学习) 的方法:让模型根据一套人类编写的原则(宪法)来评估自己的输出,本质上是一种“自我博弈”(Self-play)。这种方法不仅解决了人类反馈难以扩展的问题,还将价值观的定义权从隐性的标注者转移到了显性的宪法文档上。

[原文] [Lex Fridman]: Let's talk if we could a bit about post-training. So it seems that the modern post-training recipe has a little bit of everything. So supervised fine tuning, RLHF, the Constitutional AI with RLAIF.

[译文] [Lex Fridman]: 如果可以的话,让我们聊聊后训练(post-training)。现代的后训练配方似乎包含了所有东西:监督微调(SFT)、RLHF(基于人类反馈的强化学习),以及带有 RLAIF(基于 AI 反馈的强化学习)的宪法式 AI。

[原文] [Dario Amodei]: RLAIF. (laughs)

[译文] [Dario Amodei]: RLAIF。(笑)

[原文] [Lex Fridman]: And then synthetic data, seems like a lot of synthetic data, or at least trying to figure out ways to have high quality synthetic data. So what's the, if this is a secret sauce that makes Anthropic Claude so incredible, how much of the magic is in the pre-training? How much of is in the post-training?

[译文] [Lex Fridman]: 然后还有合成数据,似乎有很多合成数据,或者至少在试图找出获得高质量合成数据的方法。那么,如果这就是让 Anthropic Claude 如此不可思议的“秘方”,有多少魔法在于预训练?有多少在于后训练?

[原文] [Dario Amodei]: Usually it isn't, oh my God, we have this secret magic method that others don't have, right? Usually it's like, well, you know, we got better at the infrastructure, so we could run it for longer. Or, you know, we were able to get higher quality data, or we were able to filter our data better, or we were able to, you know, combine these methods in practice. It's usually some boring matter of kind of practiced and trade craft.

[译文] [Dario Amodei]: 通常情况并不是“天哪,我们有这个别人没有的秘密魔法方法”,对吧?通常情况是,“嗯,我们在基础设施上做得更好了,所以我们可以运行得更久。”或者,“我们能获得更高质量的数据,或者我们能更好地过滤数据,或者我们在实践中能更好地结合这些方法。”这通常是一些关于熟练度和手艺(trade craft)的无聊事情。

[原文] [Lex Fridman]: Okay, well, let me ask you about specific techniques. So first on RLHF, why do you think, just zooming out, intuition almost philosophy, why do you think RLHF works so well?

[译文] [Lex Fridman]: 好的,那我问问具体的技术。首先关于 RLHF,从宏观角度,甚至直觉或哲学层面来看,你为什么认为 RLHF 效果这么好?

[原文] [Dario Amodei]: If I go back to like the Scaling Hypothesis, one of the ways to state the Scaling Hypothesis is if you train for X and you throw enough compute at it, then you get X. And so RLHF is good at doing what humans want the model to do, or at least to state it more precisely, doing what humans who look at the model for a brief period of time and consider different possible responses, what they prefer as the response...

[译文] [Dario Amodei]: 如果回到扩展假说,表述扩展假说的一种方式是:如果你为了目标 X 进行训练,并且投入足够的算力,你就会得到 X。所以 RLHF 擅长让模型做人类想要它做的事,或者更准确地说,做那些“盯着模型看一小会儿并考虑不同可能回答的人类”所偏好的回答……

[原文] [Dario Amodei]: I don't think it makes the model smarter. I don't think it just makes the model appear smarter. It's like RLHF like bridges the gap between the human and the model, right? I could have something really smart that like can't communicate at all, right? We all know people like this, people who are really smart, but that, you know, you can't understand what they're saying. So I think RLHF just bridges that gap.

[译文] [Dario Amodei]: 我不认为它让模型变得更聪明了。我也不认为它只是让模型看起来更聪明。RLHF 就像是弥合了人类与模型之间的鸿沟,对吧?我可以拥有某种非常聪明但完全无法交流的东西,对吧?我们都认识这样的人,非常聪明,但你听不懂他们在说什么。所以我认为 RLHF 只是弥合了那个鸿沟。

[原文] [Dario Amodei]: It also increases, what was this word in Leopold's essay, unhobbling, where basically the models are hobbled and then you do various trainings to them to unhobble them. So, you know, I like that word 'cause it's like a rare word. But so I think RLHF unhobbles the models in some ways.

[译文] [Dario Amodei]: 它也增加了——Leopold(Aschenbrenner)文章里那个词叫什么来着——解除束缚(unhobbling)。基本上模型是被束缚住的(hobbled),然后你对它们进行各种训练来解除束缚。我喜欢这个词,因为它是个生僻词。所以我认为 RLHF 在某些方面解除了模型的束缚。

[原文] [Lex Fridman]: So on that super interesting set of ideas around Constitutional AI, can you describe what it is, as first detailed in December 2022 paper and beyond that, what is it?

[译文] [Lex Fridman]: 那么关于围绕“宪法式 AI”(Constitutional AI)那一套超级有趣的想法,你能描述一下它是什么吗?就像 2022 年 12 月那篇论文里最初详述的那样,以及在那之后它是什么?

[原文] [Dario Amodei]: Yes, so this was from two years ago. The basic idea is, so we describe what RLHF is. You have a model and it, you know, spits out two, you know, like you just sample from it twice, it spits out two possible responses, and you're like, "Human, which response do you like better?" Or another variant of it is, "Rate this response on a scale of one to seven." So that's hard because you need to scale up human interaction and it's very implicit, right?

[译文] [Dario Amodei]: 是的,这是两年前的事了。基本思路是这样的:我们描述了 RLHF 是什么。你有一个模型,它吐出两个可能的回答——就像你采样两次——然后你问:“人类,你更喜欢哪个回答?”或者另一个变体是:“给这个回答打分,1 到 7 分。”但这很难,因为你需要扩大人机交互的规模,而且这非常隐晦,对吧?
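
Dario 描述的“二选一”或“打 1 到 7 分”式人类反馈,在实践中通常会被转化为训练偏好模型的信号。下面是一个 Bradley-Terry 式成对偏好损失的极简示意——这是该方向文献中的通用做法,并非 Anthropic 的具体实现,函数名与分数均为虚构:

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry 成对偏好损失:被人类偏好的回答,
    其奖励分数应高于被拒绝的回答。
    loss = -log(sigmoid(r_chosen - r_rejected))"""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 偏好模型给两个回答打分(数值仅为示意)
loss_ranked_right = pairwise_preference_loss(2.0, -1.0)  # 排序正确 -> 损失小
loss_ranked_wrong = pairwise_preference_loss(-1.0, 2.0)  # 排序颠倒 -> 损失大
print(loss_ranked_right, loss_ranked_wrong)
```

最小化这个损失,就是把“人类更喜欢哪个回答”这一隐式信号压缩进一个能给新回答打分的模型。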

[原文] [Dario Amodei]: So two ideas. One is, could the AI system itself decide which response is better, right? Could you show the AI system these two responses and ask which response is better? And then second, well, what criterion should the AI use? And so then there's this idea, 'cause you have a single document, a constitution, if you will, that says, these are the principles the model should be using to respond.

[译文] [Dario Amodei]: 所以有两个想法。第一,AI 系统本身能决定哪个回答更好吗?你能向 AI 系统展示这两个回答并问它哪个更好吗?第二,AI 应该使用什么标准?于是就有了这个想法:因为你有一个单一的文件,一部宪法(constitution),如果你愿意这么叫的话,上面写着:这些是模型在回答时应该遵循的原则。

[原文] [Dario Amodei]: And the AI system reads those, it reads those principles, as well as reading the environment and the response. And it says, well, how good did the AI model do? It's basically a form of self play. You're kind of training the model against itself. And so the AI gives the response and then you feed that back into what's called the preference model, which in turn feeds the model to make it better. So you have this triangle of like the AI, the preference model, and the improvement of the AI itself.

[译文] [Dario Amodei]: AI 系统会阅读这些原则,同时阅读环境和回答。然后它会说,嗯,这个 AI 模型做得有多好?这基本上是一种自我博弈(self play)的形式。你在让模型进行自我对抗训练。AI 给出回答,然后你把这反馈给所谓的偏好模型(preference model),偏好模型反过来再反馈给模型使其变得更好。所以你有这样一个三角关系:AI、偏好模型以及 AI 本身的改进。
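
上面这段“AI 依据宪法评判自身输出、再反馈给偏好模型”的数据流,可以用下面的草图示意。注意这只是示意:真实系统中的评判者是语言模型本身,这里用一个虚构的关键词打分函数 judge 代替,宪法条目也远比这两条丰富:

```python
# 宪法式 AI(RLAIF)数据生成环节的极简草图,并非 Anthropic 的实际实现。
CONSTITUTION = [
    "choose the response that is more helpful",
    "choose the response that avoids harmful content",
]

def judge(response: str) -> int:
    """占位评判器:真实实现中由 AI 模型阅读宪法原则后打分。"""
    score = 0
    if "can't help" not in response:
        score += 1  # 粗略代表“更有帮助”
    if "harmful" not in response:
        score += 1  # 粗略代表“无有害内容”
    return score

def make_preference_pair(prompt: str, resp_a: str, resp_b: str) -> dict:
    """让“AI 评判者”选出更好的回答,产出可训练偏好模型的数据点。"""
    if judge(resp_a) >= judge(resp_b):
        chosen, rejected = resp_a, resp_b
    else:
        chosen, rejected = resp_b, resp_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair(
    "How do I learn Python?",
    "Start with the official tutorial and build small projects.",
    "Sorry, I can't help with that.",
)
print(pair["chosen"])
```

这正是访谈中说的用 AI 反馈代替人类反馈,以解决人类标注难以扩展的问题。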

[原文] [Lex Fridman]: It's really nice because it creates this human interpretable document that you can, I can imagine in the future, there's just gigantic fights and politics over the every single principle and so on. And at least it's made explicit and you can have a discussion about the phrasing and the, you know.

[译文] [Lex Fridman]: 这真的很棒,因为它创造了这个人类可解释的文档。我可以想象在未来,人们会为了每一个原则展开巨大的争斗和政治博弈。但至少它是显式的,你可以讨论措辞等等。

[原文] [Dario Amodei]: So it's turned into one tool in a toolkit that both reduces the need for RLHF and increases the value we get from using each data point of RLHF. It also interacts in interesting ways with kind of future reasoning type RL methods. So it's one tool in the toolkit, but I think it is a very important tool.

[译文] [Dario Amodei]: 所以它变成了工具箱中的一个工具,既减少了对 RLHF 的需求,又增加了我们从每个 RLHF 数据点中获得的价值。它还与未来的推理型 RL 方法有着有趣的交互。所以它是工具箱里的一个工具,但我认为它是一个非常重要的工具。



章节 18:《仁慈机器》:AI 加速生物学与压缩 21 世纪

📝 本节摘要

本章围绕 Dario 撰写的长文《仁慈机器》(Machines of Loving Grace)展开。Dario 解释了写作初衷:虽然他一直强调 AI 风险,但人们同样需要一个具体的、积极的愿景来为之奋斗。他首先重新定义了目标,用“强大的 AI”(Powerful AI)取代了充满争议的“AGI”一词,将其定义为在所有领域都比诺贝尔奖得主更聪明、能自主行动并可大规模复制的系统。


随后,Dario 驳斥了关于未来的两种极端观点:“奇点论”(The Singularity)(认为 AI 会瞬间接管世界、无视物理法则)和“停滞论”(Stagnation)(认为官僚主义和企业惰性将阻止 AI 产生实际影响)。他提出了中间路线——“压缩的 21 世纪”:虽然物理定律和人类制度会通过“摩擦力”减缓 AI 的部署,但在激烈的竞争和少数远见者的推动下,我们仍将在短短 5 到 10 年内完成本需 100 年才能实现的科技进步(如治愈大多数癌症、攻克神经退行性疾病等)。

[原文] [Lex Fridman]: Let's talk about the incredible essay "Machines of Loving Grace." I recommend everybody read it. It's a long one.

[译文] [Lex Fridman]: 让我们聊聊那篇不可思议的文章《仁慈机器》(Machines of Loving Grace)。我推荐大家都去读读。文章很长。

[原文] [Dario Amodei]: It is rather long.

[译文] [Dario Amodei]: 确实挺长的。

[原文] [Lex Fridman]: Yeah. It's really refreshing to read concrete ideas about what a positive future looks like. And you took sort of a bold stance because like, it's very possible that you might be wrong on the dates or specific applications.

[译文] [Lex Fridman]: 是的。读到关于积极未来长什么样的具体想法,真的让人耳目一新。而且你采取了某种大胆的立场,因为很有可能你在日期或具体应用上会出错。

[原文] [Dario Amodei]: Oh, yeah. I'm fully expecting to, you know, will definitely be wrong about all the details. I might be just spectacularly wrong about the whole thing and people will, you know, will laugh at me for years. That's just how the future works. (laughs)

[译文] [Dario Amodei]: 噢,是的。我完全预计在所有细节上肯定会出错。我甚至可能在整件事上都错得离谱,人们可能会嘲笑我好几年。未来就是这样运作的。(笑)

[原文] [Lex Fridman]: So you provided a bunch of concrete positive impacts of AI and how, you know, exactly a super intelligent AI might accelerate the rate of breakthroughs in, for example, biology and chemistry... First, can you give a high level vision of this essay and what key takeaways that people would have?

[译文] [Lex Fridman]: 你提供了一系列 AI 带来的具体积极影响,以及超级智能 AI 究竟如何加速生物学和化学等领域的突破……首先,你能给出这篇文章的高层愿景吗?人们应该从中获得什么关键启示?

[原文] [Dario Amodei]: Yeah, I have spent a lot of time, and Anthropic has spent a lot of effort on like, you know, how do we address the risks of AI, right? ... But I noticed that one flaw in that way of thinking... is that, you know, no matter how kind of logical or rational that line of reasoning that I just gave might be, if you kind of only talk about risks, your brain only thinks about risks.

[译文] [Dario Amodei]: 是的,我花了很多时间,Anthropic 也花了很多精力在如何应对 AI 风险上,对吧?……但我注意到这种思维方式有一个缺陷……那就是,无论我刚才给出的推理路线多么合乎逻辑或理性,如果你只谈论风险,你的大脑就只会思考风险。

[原文] [Dario Amodei]: And the whole reason we're trying to prevent these risks is not because we're afraid of technology, not because we wanna slow it down. It's because if we can get to the other side of these risks, right? If we can run the gauntlet successfully... then on the other side of the gauntlet are all these great things and these things are worth fighting for, and these things can really inspire people.

[译文] [Dario Amodei]: 我们试图预防这些风险的全部原因,并不是因为我们害怕技术,也不是因为我们想减慢它的速度。而是因为如果我们能到达这些风险的彼岸,对吧?如果我们能成功闯过这道难关……那么在难关的另一边,是所有这些伟大的事物,这些事物值得我们为之奋斗,这些事物真的能激励人心。

[原文] [Lex Fridman]: So I think the starting point is to talk about what this powerful AI, which is the term you like to use, most of the world uses AGI, but you don't like the term because it's basically has too much baggage, it's become meaningless.

[译文] [Lex Fridman]: 所以我认为起点是谈论这个“强大的 AI”(powerful AI),这是你喜欢用的词。世界上大多数人使用 AGI,但你不喜欢这个词,因为它基本上包袱太重了,已经变得毫无意义。

[原文] [Dario Amodei]: Maybe we're stuck with the terms and my efforts to change them are futile... I feel that way about AGI like, there's just a smooth exponential and like if by AGI you mean like AI is getting better and better, and like gradually, it's gonna do more and more of what humans do until it's gonna be smarter than humans, and then it's gonna get smarter even from there then yes, I believe in AGI. But if AGI is some discreet or separate thing, which is the way people often talk about it, then it's kind of a meaningless buzzword.

[译文] [Dario Amodei]: 也许我们被这些术语困住了,我改变它们的努力是徒劳的……我对 AGI 的感觉是,这只是一条平滑的指数曲线。如果你说的 AGI 是指 AI 变得越来越好,逐渐做更多人类能做的事,直到比人类更聪明,然后再变得更聪明,那么是的,我相信 AGI。但如果 AGI 是某种离散的或独立的东西——这也是人们经常谈论它的方式——那么它就是一个毫无意义的流行词。

[原文] [Lex Fridman]: Yeah, I mean, to me it's just sort of a platonic form of a powerful AI, exactly how you define it. I mean, you define it very nicely. So on the intelligence axis, just on pure intelligence, it's smarter than a Nobel Prize winner, as you describe, across most relevant disciplines... It can use every modality... It can go off for many hours, days and weeks to do tasks... It can control embodied tools... The resources used to train it can then be repurposed to run millions of copies of it.

[译文] [Lex Fridman]: 是的,对我来说,它只是强大 AI 的一种柏拉图式形式,正如你所定义的。你的定义非常好。在智力维度上,纯粹就智力而言,它比大多数相关学科的诺贝尔奖得主都要聪明……它可以使用每种模态……它可以离开数小时、数天甚至数周去执行任务……它可以控制具身工具……用于训练它的资源随后可以被重新利用来运行数百万个它的副本。

[原文] [Dario Amodei]: Yeah, yeah, I mean, you might imagine from outside the field that like, there's only one of these, right? ... But the truth is that like, the scale up is very quick. Like we do this today, we make a model, and then we deploy thousands, maybe tens of thousands of instances of it. I think by the time, you know, certainly within two to three years... clusters are gonna get to the size where you'll be able to deploy millions of these.

[译文] [Dario Amodei]: 是的,是的。从领域外看,你可能会想象这种东西只有一个,对吧?……但事实是,扩展速度非常快。就像我们今天做一个模型,然后我们就部署成千上万个实例。我认为到时候,肯定在两三年内……集群规模将达到你可以部署数百万个这种模型。
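
Dario 提到“训练所用的资源可以转而运行数百万个副本”,背后是一笔简单的算力账。下面的数字全部为说明而假设,并非任何真实集群或模型的参数:

```python
# 粗略量级估算:训练集群转做推理时能同时运行多少模型副本。
# 所有数值均为假设。
cluster_flops = 1e20           # 假设的集群总算力(FLOP/s)
params = 1e12                  # 假设的模型参数量
tokens_per_sec_per_copy = 50   # 假设每个副本的生成速度(token/s)
flops_per_token = 2 * params   # 常用近似:每生成一个 token 约需 2×参数量 FLOPs

copies = cluster_flops / (flops_per_token * tokens_per_sec_per_copy)
print(int(copies))  # 在这组假设下为一百万个副本
```

这与“两三年内集群可部署数百万个实例”属于同一类数量级推算。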

[原文] [Lex Fridman]: So that's a really nice definition of powerful AI, okay. So that, but you also write that clearly such an entity would be capable of solving very difficult problems very fast, but it is not trivial to figure out how fast. Two extreme positions both seem false to me. So the singularity is on the one extreme and the opposite on the other extreme. Can you describe each of the extremes?

[译文] [Lex Fridman]: 这是一个关于强大 AI 的很好的定义。但你也写道,显然这样一个实体将能够非常快地解决非常困难的问题,但要弄清楚到底有多快并非易事。两种极端立场对你来说似乎都是错误的。一个极端是奇点(singularity),另一个极端是反面。你能描述一下这两个极端吗?

[原文] [Dario Amodei]: So like one extreme would be, well, look... models will build faster models, models will build faster models, and those models will build, you know, nanobots that can like take over the world... And so if you just kind of like solve this abstract differential equation, then like five days after we build the first AI that's more powerful than humans, then, you know, like the world will be filled with these AIs and every possible technology that could be invented like will be invented.

[译文] [Dario Amodei]: 一个极端是……模型会构建更快的模型,那些模型再构建更快的模型,然后那些模型会制造纳米机器人接管世界……如果你只是解这个抽象的微分方程,那么在我们造出第一个比人类强大的 AI 五天后,世界就会充满这些 AI,所有可能被发明的技术都会被发明出来。

[原文] [Dario Amodei]: And the reason that I think that's not the case is that, one, I think they just neglect like the laws of physics. Like it's only possible to do things so fast in the physical world... It takes a long time to produce faster hardware... There's this issue of complexity... like predicting the economy... Or biological molecules... And then I think human institutions. Human institutions are just, are really difficult.

[译文] [Dario Amodei]: 我认为情况并非如此的原因是,第一,我认为他们忽略了物理定律。在物理世界中,做事情的速度是有限的……生产更快的硬件需要很长时间……还有复杂性问题……比如预测经济……或者生物分子……然后我认为还有人类机构(human institutions)。人类机构真的很难搞。

[原文] [Dario Amodei]: Now, if the AI system circumvented all governments, if it just said "I'm dictator of the world and I'm gonna do whatever," some of these things it could do... But again, if you want this to be an AI system that doesn't take over the world, that doesn't destroy humanity, then basically, you know, it's gonna need to follow basic human laws, right?

[译文] [Dario Amodei]: 现在的确,如果 AI 系统绕过所有政府,如果它说“我是世界独裁者,我想干嘛就干嘛”,有些事情它是可以做的……但再说一次,如果你希望这是一个不会接管世界、不会毁灭人类的 AI 系统,那么基本上它需要遵守基本的人类法律,对吧?

[原文] [Dario Amodei]: So that's on one side. On the other side, there's another set of perspectives... which is, look, we've seen big productivity increases before, right? ... There was a quote from Robert Solow, "You see the computer revolution everywhere except the productivity statistics." ... So you could have a perspective that's like, well, this is amazing technically, but it's all a nothing burger.

[译文] [Dario Amodei]: 这是一方面。另一方面,还有另一套观点……那就是,看,我们以前也见过巨大的生产力提升,对吧?……罗伯特·索洛(Robert Solow)有一句名言:“你在任何地方都能看到计算机革命,唯独在生产力统计数据中看不到。”……所以你可以有一种观点认为,这在技术上很惊人,但实际上是个“无用汉堡”(nothing burger,意为雷声大雨点小)。

[原文] [Dario Amodei]: But the dynamic I see over and over again is... You find a small fraction of people within a company, within a government who really see the big picture... And those people see that this is the most important thing in the world, and so they agitate for it... But as the technology starts to roll out... the specter of competition gives them a wind at their backs... And that combination, the specter of competition plus a few visionaries... makes something happen.

[译文] [Dario Amodei]: 但我一次又一次看到的动态是……你会在公司内部、政府内部发现一小部分人,他们真正看到了大局……这些人看到这是世界上最重要的事情,所以他们为此鼓与呼……而随着技术开始推出……竞争的阴影给他们带来了顺风……这种组合——竞争的阴影加上少数有远见的人——会让事情真正发生。

[原文] [Dario Amodei]: So I think this is gonna be more, and this is just an instinct... I think it's gonna be more like 5 or 10 years, as I say in the essay, than it's gonna be 50 or 100 years. I also think it's gonna be 5 or 10 years more than it's gonna be, you know, 5 or 10 hours.

[译文] [Dario Amodei]: 所以我认为这将会——这只是一种直觉——我认为这更像是 5 到 10 年,正如我在文章中所说,而不是 50 或 100 年。我也认为这会是 5 到 10 年,而不是 5 到 10 小时。



章节 19:AGI 时间表预测:2026-2027 年的可能性

📝 本节摘要

在本节中,Lex 逼问 Dario 关于“强大的 AI”(即 AGI)实现的具体时间点。Dario 预见到自己的回答会被截取并在社交媒体上疯传,因此先打了一剂“预防针”,强调预测的不确定性。


随后,他给出了基于线性外推的推论:鉴于模型能力正从高中水平迈向本科,即将达到博士水平,如果按照当前的曲线发展,我们最早可能在 2026 年或 2027 年达到超越人类的水平。Dario 指出,尽管存在地缘政治冲突(如台湾问题)或数据枯竭等潜在阻碍,但那些能令人信服地论证“这在未来几年内不会发生”的理由正在迅速减少。他最后强调,扩展定律并非物理定律,而是经验规律,但他赌这一趋势将继续下去。

[原文] [Lex Fridman]: So what to you is the timeline to where we achieve AGI, AKA powerful AI, AKA super useful AI?

[译文] [Lex Fridman]: 那么对你来说,我们实现 AGI——或者叫强大的 AI,或者叫超级有用的 AI——的时间表是什么?

[原文] [Dario Amodei]: Useful. (laughs) I'm gonna start calling it that.

[译文] [Dario Amodei]: 有用的。(笑)我以后要开始这么叫它了。

[原文] [Lex Fridman]: It's a debate about naming. You know, on pure intelligence, it can be smarter than a Nobel Prize winner in every relevant discipline and all the things we've said. Modality, can go and do stuff on its own for days, weeks, and do biology experiments on its own... When? When do you think? Just so putting numbers on that.

[译文] [Lex Fridman]: 这是一个关于命名的争论。你知道,就纯粹的智力而言,它在每个相关学科上都比诺贝尔奖得主更聪明,还有我们说过的所有那些特征。在模态上,它可以独自出去做事,持续几天、几周,自己做生物实验……什么时候?你认为是什么时候?就给它个数字吧。

[原文] [Dario Amodei]: So you know, this is, of course, the thing I've been grappling with for many years, and I'm not at all confident. Every time, if I say 2026 or 2027, there will be like a zillion like people on Twitter who will be like, "AI CEO said 2026," and it'll be repeated for like the next two years that like this is definitely when I think it's gonna happen.

[译文] [Dario Amodei]: 你知道,这当然是我多年以来一直纠结的事情,而我完全没有把握。每一次,如果我说了 2026 或 2027 年,推特上就会有无数人说:“AI CEO 说是 2026 年。”然后在接下来的两年里,这话会被反复引用,仿佛那就是我确信它会发生的时间。

[原文] [Dario Amodei]: So whoever's extorting these clips will crop out the thing I just said and only say the thing I'm about to say, but I'll just say it anyway.

[译文] [Dario Amodei]: 所以无论谁在截取这些片段,肯定会把我刚才说的这段剪掉,只保留我接下来要说的话,但我还是说了吧。

[原文] [Lex Fridman]: Have fun with it.

[译文] [Lex Fridman]: 玩得开心点。

[原文] [Dario Amodei]: So, if you extrapolate the curves that we've had so far, right? If you say, well, I don't know, we're starting to get to like PhD level, and last year we were at undergraduate level, and the year before we were at like the level of a high school student.

[译文] [Dario Amodei]: 那么,如果你外推我们目前已有的曲线,对吧?如果你说,嗯,我不知道,我们开始达到博士水平了,而去年我们处于本科水平,前年我们大概处于高中生的水平。

[原文] [Dario Amodei]: Again, you can quibble with at what tasks and for what, we're still missing modalities, but those are being added, like computer use was added, like image generation has been added. If you just kind of like eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027.

[译文] [Dario Amodei]: 同样,你可以对什么任务、为了什么目的进行争论,我们仍然缺少一些模态,但这些正在被添加进来,比如计算机使用能力被添加了,图像生成能力也被添加了。如果你只是目测一下这些能力增长的速率,它确实会让你认为我们将在 2026 年或 2027 年到达那里。

[原文] [Dario Amodei]: Again, lots of things could derail it. We could run out of data. You know, we might not be able to scale clusters as much as we want. Like, you know, maybe Taiwan gets blown up or something and, you know, then we can't produce as many GPUs as we want. So there are all kinds of things that could derail the whole process.

[译文] [Dario Amodei]: 再说一次,很多事情可能会使其脱轨。我们可能会耗尽数据。你知道,我们可能无法像我们希望的那样扩展集群规模。比如,也许台湾被炸了或者发生了什么事,然后我们就无法生产出我们想要的那么多 GPU。所以有各种各样的事情可能导致整个过程脱轨。

[原文] [Dario Amodei]: So I don't fully believe the straight line extrapolation, but if you believe the straight line extrapolation, we'll get there in 2026 or 2027. I think the most likely is that there's some mild delay relative to that. I don't know what that delay is, but I think it could happen on schedule. I think there could be a mild delay.

[译文] [Dario Amodei]: 所以我并不完全相信直线外推,但如果你相信直线外推,我们会在 2026 年或 2027 年到达那里。我认为最可能的情况是相对于此会有一些轻微的延迟。我不知道延迟会有多久,但我认为它可能按期发生。我认为可能会有轻微的延迟。

[原文] [Dario Amodei]: I think there are still worlds where it doesn't happen in 100 years. The number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years.

[译文] [Dario Amodei]: 我认为仍然存在这种可能性:即在未来 100 年内它都不会发生。但这类“世界”的数量正在迅速减少。我们正在迅速耗尽那些真正令人信服的阻碍因素,耗尽那些真正强有力的理由来解释为什么这不会在未来几年内发生。

[原文] [Dario Amodei]: There were a lot more in 2020, although my guess, my hunch at that time was that we'll make it through all those blockers. So sitting as someone who has seen most of the blockers cleared out of the way, I kind of suspect, my hunch, my suspicion is that the rest of them will not block us.

[译文] [Dario Amodei]: 在 2020 年时这种阻碍还多得多,尽管我当时的猜测、我的直觉是我们会克服所有这些阻碍。所以作为一个已经看到大部分阻碍被清除的人,我有点怀疑——我的直觉、我的猜想是——剩下的那些也不会阻挡我们。

[原文] [Dario Amodei]: But, you know, look, at the end of the day, like I don't wanna represent this as a scientific prediction. People call them scaling laws. That's a misnomer, like Moore's law is a misnomer. Moore's laws, scaling laws, they're not laws of the universe. They're empirical regularities. I am going to bet in favor of them continuing, but I'm not certain of that.

[译文] [Dario Amodei]: 但是,你知道,归根结底,我不想把它描述成一个科学预测。人们称之为扩展定律(scaling laws)。这是个误称,就像摩尔定律(Moore's law)是个误称一样。摩尔定律、扩展定律,它们不是宇宙法则。它们是经验规律(empirical regularities)。我会下注赌它们会继续,但我对此并不确定。
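
“扩展定律是经验规律”的意思是:它来自对“损失随规模按幂律下降”这一观测的拟合,而非任何第一性原理推导。下面用一组纯属虚构的数据演示这种拟合——幂律 L = a·N^(-alpha) 在双对数坐标下是一条直线,因此可以用最小二乘求斜率再外推:

```python
import math

# 虚构的观测数据:(参数量 N, 验证损失 L),仅为演示拟合方法
data = [(1e6, 4.0), (1e8, 2.0), (1e10, 1.0)]

# 两边取对数:log L = log a - alpha * log N,变成线性回归
xs = [math.log(n) for n, _ in data]
ys = [math.log(l) for _, l in data]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
alpha = -slope
a = math.exp(y_mean - slope * x_mean)

# “下注赌趋势继续”:外推到拟合范围之外的 1e12 参数
pred = a * (1e12) ** (-alpha)
print(alpha, pred)
```

外推之所以只是“下注”,正在于没有任何定律保证观测范围之外这条直线仍然成立。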



章节 20:AI 科学家:未来的科研形态与临床试验

📝 本节摘要

在本节中,Lex 传达了一个由 Claude 自己提出的问题:“在这个未来中,一个致力于 AGI 的生物学家的典型一天是怎样的?” Dario 并没有给出一个抽象的答案,而是用了一个非常生动的类比:AI 最初将扮演“研究生”(Grad Students)的角色。它们会阅读文献、在 Thermo Fisher 等网站订购设备、运行实验并撰写报告,而人类教授将管理着 1000 个比自己更聪明的 AI 研究生。


此外,Dario 深入探讨了临床试验(Clinical Trials)的变革。他认为通过 AI 进行更好的统计设计和前置模拟,可以将原本需要 5000 人耗时 1 年的试验,压缩为 500 人耗时 2 个月。这种“积跬步以至千里”的效率提升,正是实现“压缩的 21 世纪”的关键。

[原文] [Lex Fridman]: So how do you think, what are the early steps it might do? And by the way, I asked Claude good questions to ask you, and Claude told me to ask, "What do you think is a typical day for a biologists working on AGI look like in this future?"

[译文] [Lex Fridman]: 那么你认为,它早期可能会采取哪些步骤?顺便说一句,我让 Claude 帮我想一些问你的好问题,Claude 让我问:“你认为在这个未来中,一个致力于 AGI(或利用 AGI)的生物学家的典型一天是怎样的?”

[原文] [Dario Amodei]: Claude wants to know what's in his future, right? Who am I gonna be working with?

[译文] [Dario Amodei]: Claude 想知道它自己的未来是什么样的,对吧?“我以后要和谁一起工作呢?”

[原文] [Dario Amodei]: So what is it like, you know, being a scientist that works with an AI system? The way I think about it actually is, well, so I think in the early stages, the AIs are gonna be like grad students.

[译文] [Dario Amodei]: 那么,作为一个与 AI 系统共事的科学家是什么样的呢?我对此的思考方式实际上是……我认为在早期阶段,AI 会像研究生(grad students)一样。

[原文] [Dario Amodei]: You're gonna give them a project, you're gonna say, you know, I'm the experienced biologist, I've set up the lab. The biology professor or even the grad students themselves will say, here's what you can do with an AI, you know, like AI system. I'd like to study this.

[译文] [Dario Amodei]: 你会给它们一个项目,你会说,我是经验丰富的生物学家,我已经建好了实验室。生物学教授甚至研究生本人会说,这是你可以用 AI 做的事情。我想研究这个课题。

[原文] [Dario Amodei]: And you know, the AI system, it has all the tools. It can like look up all the literature to decide what to do. It can look at all the equipment. It can go to a website and say, hey, I'm gonna go to, you know, Thermo Fisher or, you know, whatever the lab equipment company is... You know, I'm gonna order this new equipment to do this.

[译文] [Dario Amodei]: 你知道,AI 系统拥有所有的工具。它可以查阅所有文献来决定做什么。它可以查看所有设备。它可以去网站上说,嘿,我要去 Thermo Fisher(赛默飞世尔)或者其他什么实验室设备公司……我要订购这个新设备来做这件事。

[原文] [Dario Amodei]: I'm gonna run my experiments. I'm gonna, you know, write up a report about my experiments. I'm gonna, you know, inspect the images for contamination. I'm gonna decide what the next experiment is. I'm gonna like write some code and run a statistical analysis.

[译文] [Dario Amodei]: 我要运行我的实验。我要写一份关于我实验的报告。我要检查图像是否有污染。我要决定下一个实验是什么。我要写一些代码并运行统计分析。

[原文] [Dario Amodei]: All the things a grad student would do, there will be a computer with an AI that like the professor talks to every once in a while and it says, this is what you're gonna do today.

[译文] [Dario Amodei]: 所有研究生会做的事情,都会有一台装有 AI 的电脑来完成,教授每隔一段时间就跟它谈谈,告诉它今天要做什么。

[原文] [Dario Amodei]: And so it'll look like there's a human professor and 1000 AI grad students, and you know, if you go to one of these Nobel Prize winning biologists or so, you'll say, okay, well, you know, you had like 50 grad students, well, now you have 1000 and they're smarter than you are, by the way.

[译文] [Dario Amodei]: 所以看起来就像是一个人类教授带着 1000 个 AI 研究生。如果你去找一位诺贝尔奖得主级别的生物学家,你会说,好吧,你以前有 50 个研究生,现在你有 1000 个,顺便说一句,它们比你还聪明。

[原文] [Dario Amodei]: Then I think at some point it'll flip around where, you know, the AI systems will, you know, will be the PIs, will be the leaders, and you know, they'll be ordering humans or other AI systems around. So I think that's how it'll work on the research side.

[译文] [Dario Amodei]: 然后我认为在某个时刻情况会反转,AI 系统将成为 PI(项目负责人/首席研究员),成为领导者,它们将指挥人类或其他 AI 系统。所以我认为这就是科研方面未来的运作方式。

[原文] [Dario Amodei]: And then I think, you know, as I say in the essay, we'll want to turn, probably turning loose is the wrong term, but we'll want to harness the AI systems to improve the clinical trial system as well.

[译文] [Dario Amodei]: 然后我认为,正如我在文章中所说,我们将想要……也许“放手让它做”这个词不太对,但我们将想要利用 AI 系统来改进临床试验系统。

[原文] [Dario Amodei]: Can we get better at predicting the results of clinical trials? Can we get better at statistical design so that, you know, clinical trials that used to require, you know, 5,000 people and therefore, you know, needed 100 million dollars in a year to enroll them. Now they need 500 people and two months to enroll them.

[译文] [Dario Amodei]: 我们能不能更好地预测临床试验的结果?我们能不能在统计设计上做得更好,使得原本需要 5000 人、耗资 1 亿美元、耗时 1 年招募的临床试验,现在只需要 500 人和 2 个月就能完成招募?
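
“更好的统计设计把 5000 人的试验缩到 500 人”的杠杆在于统计功效计算:所需样本量与可检测效应量的平方成反比。下面是双样本比例检验样本量的教科书式正态近似(公式是标准的,试验场景与数值为假设):

```python
import math

def sample_size_per_arm(p1: float, p2: float) -> int:
    """双样本比例检验每组所需样本量的正态近似
    (双侧 alpha=0.05, power=0.8):
    n ~= (z_a + z_b)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2"""
    z_alpha = 1.959964  # 双侧 alpha=0.05 的分位数
    z_beta = 0.841621   # power=0.8 的分位数
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# 检测微小效应(响应率 10% -> 12%)每组需要数千人
n_small_effect = sample_size_per_arm(0.10, 0.12)
# 若(假设)AI 辅助设计能把可检测效应放大到 10% -> 25%,样本量骤减
n_large_effect = sample_size_per_arm(0.10, 0.25)
print(n_small_effect, n_large_effect)
```

样本量按效应量平方反比下降,这就是“统计设计上的一小步”能换来招募规模数量级缩减的原因。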

[原文] [Dario Amodei]: And you know, can we increase the success rate of clinical trials by doing things in animal trials that we used to do in clinical trials, and doing things in simulations that we used to do in animal trials?

[译文] [Dario Amodei]: 还有,我们能不能通过在动物试验中完成过去要在临床试验中做的事,以及在模拟(simulations)中完成过去要在动物试验中做的事,来提高临床试验的成功率?

[原文] [Lex Fridman]: Doing in vitro and doing it, I mean, you're still slowed down. It still takes time, but you can do it much, much faster.

[译文] [Lex Fridman]: 做体外实验(in vitro)……我是说,你仍然会受限。这仍然需要时间,但你可以做得快得多,快得多。

[原文] [Dario Amodei]: Yeah, yeah, yeah. Can we just one step at a time, and can that add up to a lot of steps? Even though we still need clinical trials, even though we still need laws... can we just move everything in a positive direction?

[译文] [Dario Amodei]: 是的,是的,是的。我们能不能一步一个脚印,积少成多?即使我们仍然需要临床试验,即使我们仍然需要法律……我们能不能把所有事情都朝积极的方向推进?

[原文] [Dario Amodei]: And when you add up all those positive directions, do you get everything that was gonna happen from here to 2100 instead happens from 2027 to 2032 or something?

[译文] [Dario Amodei]: 当你把所有这些积极方向加在一起时,你是否能让原本从现在到 2100 年才会发生的一切,改在 2027 年到 2032 年左右发生呢?



章节 21:编程的未来、人类的意义与权力的集中

📝 本节摘要

对话进入了 Dario 访谈的尾声。关于编程,Dario 预测它将是受 AI 影响最快、最深的领域,因为编程技能距离 AI 开发者最近,且具备完美的反馈循环(编写-运行-报错-修正)。他认为虽然 AI 很快能完成 80% 的代码编写工作,但这不会导致程序员失业,而是根据“比较优势”原理,推动人类转向更高层级的系统设计与架构工作。

>

关于“意义”,Lex 提出了一个经典问题:如果 AI 能做所有事,人类存在的意义是什么?Dario 给出了一个充满哲学意味的回答:意义在于过程、选择和人际关系,而非单纯的产出。但他坦言,相比于“存在主义危机”,他更担心现实层面的“权力集中”——即 AI 带来的巨大力量被少数人或独裁政权垄断,从而加剧贫富差距与压迫。

[原文] [Lex Fridman]: Another way that I think the world might be changing with AI even today, but moving towards this future of the powerful super useful AI is programming. So how do you see the nature of programming? Because it's so intimate to the actual act of building AI. How do you see that changing for us humans?

[译文] [Lex Fridman]: 我认为世界可能正在被 AI 改变的另一种方式——即使在今天也是如此,而且正朝着那个强大、超级有用的 AI 未来迈进——就是编程。那么你如何看待编程的本质?因为它与构建 AI 的实际行为如此密切相关。你认为这对我们人类来说会有什么变化?

[原文] [Dario Amodei]: I think that's gonna be one of the areas that changes fastest for two reasons. One, programming is a skill that's very close to the actual building of the AI. So the farther a skill is from the people who are building the AI, the longer it's gonna take to get disrupted by the AI, right?

[译文] [Dario Amodei]: 我认为那将是变化最快的领域之一,原因有二。第一,编程是一项与实际构建 AI 非常接近的技能。一项技能离构建 AI 的人越远,它被 AI 颠覆所需的时间就越长,对吧?

[原文] [Dario Amodei]: The other reason it's gonna happen fast is with programming, you close the loop, both when you're training the model and when you're applying the model. The idea that the model can write the code means that the model can then run the code and then see the results and interpret it back. And so it really has an ability, unlike hardware, unlike biology... to close the loop.

[译文] [Dario Amodei]: 另一个它会发生得很快的原因是,在编程中,无论是在训练模型时还是应用模型时,你都能闭环(close the loop)。模型可以编写代码,意味着模型随后可以运行代码,查看结果并将其解释回来。所以不像硬件或生物学,它真的有能力实现闭环。

[原文] [Dario Amodei]: As I saw on, you know, typical real world programming tasks, models have gone from 3% in January of this year to 50% in October of this year... But, you know, I would guess that in another 10 months, we'll probably get pretty close. We'll be at least 90%.

[译文] [Dario Amodei]: 正如我在典型的现实世界编程任务上看到的,模型成功率从今年 1 月的 3% 上升到了 10 月的 50%……我猜再过 10 个月,我们可能会非常接近。我们至少会达到 90%。

[原文] [Dario Amodei]: Now that said, I think comparative advantage is powerful. We'll find that when AIs can do 80% of a coder's job, including most of it that's literally like write code with a given spec, we'll find that the remaining parts of the job become more leveraged for humans, right? Humans will, there'll be more about like high level system design or, you know, looking at the app and like, is it architected well?

[译文] [Dario Amodei]: 话虽如此,我认为比较优势(comparative advantage)是强大的。我们会发现,当 AI 能完成程序员 80% 的工作(包括大部分根据给定规范编写代码的工作)时,剩下的那部分工作对人类来说杠杆率会变得更高,对吧?人类的工作将更多是关于高层系统设计,或者审视应用程序,看看它的架构是否合理。

[原文] [Dario Amodei]: So this logic of comparative advantage that expands tiny parts of the tasks to large parts of the tasks and creates new tasks in order to expand productivity, I think that's going to be the case... I expect that humans will continue to have a huge role and the nature of programming will change, but programming as a role, programming as a job will not change. It'll just be less writing things line by line and it'll be more macroscopic.

[译文] [Dario Amodei]: 所以这种比较优势的逻辑——将任务中的微小部分扩展为大部分,并创造新任务以扩充生产力——我认为将会应验……我预计人类将继续扮演巨大的角色,编程的本质会改变,但作为角色的编程、作为工作的编程不会改变。它只是会减少逐行写代码,而变得更加宏观。

[原文] [Lex Fridman]: Is Anthropic gonna play in that space of also tooling potentially?

[译文] [Lex Fridman]: Anthropic 会涉足工具领域吗?

[原文] [Dario Amodei]: Currently we're not trying to make such IDEs ourselves, rather we're powering the companies, like Cursor or like Cognition... And our view has been let 1000 flowers bloom. We don't internally have the, you know, the resources to try all these different things.

[译文] [Dario Amodei]: 目前我们并不打算自己制作这类 IDE,而是为像 Cursor 或 Cognition 这样的公司提供动力……我们的观点是让百花齐放(let 1000 flowers bloom)。我们在内部没有资源去尝试所有这些不同的东西。

[原文] [Lex Fridman]: So in this world with super powerful AI that's increasingly automated, what's the source of meaning for us humans?

[译文] [Lex Fridman]: 那么在这个拥有超级强大且日益自动化的 AI 的世界里,我们人类的意义来源是什么?

[原文] [Dario Amodei]: Think back to like, you know, one of the historical figures who, you know, discovered electromagnetism or relativity or something. If you told them, well, actually 20,000 years ago, some alien on, you know, some alien on this planet discovered this before you did, does that rob the meaning of the discovery? It doesn't really seem like it to me, right? It seems like the process is what matters, and how it shows who you are as a person along the way...

[译文] [Dario Amodei]: 回想一下那些发现电磁学或相对论的历史人物。如果你告诉他们:“其实 2 万年前,这个星球上的某个外星人比你先发现了这个。”这会剥夺发现的意义吗?对我来说似乎并不会,对吧?似乎过程(process)才是重要的,以及在这个过程中它如何展现了你是一个什么样的人……

[原文] [Dario Amodei]: So I am optimistic about meaning. I worry about economics and the concentration of power. That's actually what I worry about more. I worry about how do we make sure that that fair world reaches everyone... structures like autocracies and dictatorships where a small number of people exploits a large number of people, I'm very worried about that.

[译文] [Dario Amodei]: 所以我对意义持乐观态度。我担心的是经济和权力的集中。那实际上才是我更担心的。我担心我们如何确保那个公平的世界能惠及每个人……像专制和独裁这样的结构,少数人剥削多数人,我对那个非常担心。

[原文] [Dario Amodei]: And if there's one message I wanna send, it's that to get all this stuff right, to make it real, we both need to build the technology... But we also need to address the risks because those risks are in our way. They're landmines on the way from here to there, and we have to diffuse those landmines if we want to get there.

[译文] [Dario Amodei]: 如果我想传达一个信息,那就是:为了把这一切做对,让它成为现实,我们既需要构建技术……但也需要解决风险,因为那些风险就在我们的路上。它们是从这里到那里的路上的地雷(landmines),如果我们想到达那里,就必须拆除这些地雷。


Dario Amodei 的访谈部分到此结束。接下来的章节将进入 Amanda Askell 的访谈,重点探讨 Claude 的性格设计与提示词工程。

从本章开始,播客进入了第二部分,访谈对象是 Anthropic 的对齐与微调研究员 Amanda Askell。她负责 Claude 的性格设计,也是被传“与 Claude 说话最多的人”。

章节 22:Amanda Askell 访谈:从哲学家到 AI 性格设计师

📝 本节摘要

本节介绍了 Amanda Askell 独特的背景:从牛津和纽约大学的哲学博士转型为 AI 对齐研究员。她分享了自己如何跨越“非技术人员”的心理障碍,通过动手实践进入技术核心领域。

>

对话的核心在于 Claude 的性格设计哲学。Amanda 并没有将其仅仅视为产品功能,而是视为一种对齐(Alignment)工作。她的设计目标是“亚里士多德式的善”(Aristotelian goodness):如果一个理想的人处在 Claude 的位置(与数百万人交谈,具有巨大影响力),他会如何行事?她提出了“世界旅行者”(World Traveler)的比喻:一个能走遍全球、与持不同政见和价值观的人交谈,既不通过阿谀奉承(Sycophancy)来讨好对方,又不失尊重和真诚的人。

[原文] [Lex Fridman]: You are a philosopher by training. So what sort of questions did you find fascinating through your journey in philosophy... and then switching over to the AI problems at OpenAI and Anthropic?

[译文] [Lex Fridman]: 你是受过专业训练的哲学家。那么在你的哲学之旅中,你觉得哪些问题很迷人……然后又是如何转向 OpenAI 和 Anthropic 的 AI 问题的?

[原文] [Amanda Askell]: I think philosophy is actually a really good subject if you are kind of fascinated with everything, so because there's a philosophy of everything... But I would rather see if I can have an impact on the world and see if I can like do good things.

[译文] [Amanda Askell]: 我认为如果你对一切事物都着迷,哲学其实是一门很好的学科,因为万物皆有哲学……但我更想看看我能否对世界产生影响,看看我能否做些好事。

[原文] [Amanda Askell]: And that was around 2017, 2018... I was basically just happy to get involved and see if I could help 'cause I was like, well, if you try and do something impactful, if you don't succeed, you tried to do the impactful thing and you can go be a scholar... and so then I went into AI policy at that point.

[译文] [Amanda Askell]: 那是在 2017、2018 年左右……我基本上很乐意参与其中,看看能不能帮上忙。因为我想,如果你尝试做有影响力的事情但没成功,至少你尝试过了,你还可以回去做学者……所以那时我进入了 AI 政策领域。

[原文] [Lex Fridman]: What was that like sort of taking the leap from the philosophy of everything into the technical? ... What advice would you give to people that are sort of maybe, which is a lot of people, think they're underqualified, insufficiently technical to help in AI?

[译文] [Lex Fridman]: 从万物哲学跨越到技术领域是什么感觉?……对于那些认为自己资历不足、技术不够强而无法在 AI 领域提供帮助的人(这也是很多人),你会给什么建议?

[原文] [Amanda Askell]: I think that sometimes people do this thing that I'm like not that keen on where they'll be like, is this person technical or not? Like, you're either a person who can like code and isn't scared of math or you're like not. And I think I'm maybe just more like, I think a lot of people are actually very capable of working these kinds of areas if they just like try it.

[译文] [Amanda Askell]: 我觉得人们有时会做一件我不太喜欢的事,就是把人分类为:这人懂技术还是不懂技术?好像你要么是个会写代码、不怕数学的人,要么就不是。而我可能更倾向于认为,很多人如果真的去尝试,其实是非常有能力在这些领域工作的。

[原文] [Amanda Askell]: So part of me is like, I dunno, find a project and see if you can actually just carry it out is probably my best advice.

[译文] [Amanda Askell]: 所以我的想法是,我不知道,找一个项目,看看你能不能把它做出来,这可能是我最好的建议。

[原文] [Lex Fridman]: So one of the things that you're an expert in and you do is creating and crafting Claude's character and personality... So what's the goal of creating and crafting Claude's character and personality?

[译文] [Lex Fridman]: 你是这方面的专家,你所做的一件事就是创造和打磨 Claude 的性格和个性……那么创造和打磨 Claude 性格的目标是什么?

[原文] [Amanda Askell]: I think the goal, like one thing I really like about the character work is from the outset, it was seen as an alignment piece of work and not something like a product consideration.

[译文] [Amanda Askell]: 我认为目标是……关于性格工作我非常喜欢的一点是,从一开始,它就被视为一项对齐(alignment)工作,而不仅仅是某种产品考量。

[原文] [Amanda Askell]: I guess like my main thought with it has always been trying to get Claude to behave the way you would kind of ideally want anyone to behave if they were in Claude's position. So imagine that I take someone and they know that they're gonna be talking with potentially millions of people... Like really in this kind of like rich sort of Aristotelian notion of what it's to be a good person.

[译文] [Amanda Askell]: 我对此的主要想法一直是,试图让 Claude 的行为方式符合你理想中任何处于 Claude 位置的人的行为方式。想象一下,我找一个人,他知道自己将要与数百万人交谈……这真的是一种非常丰富的、亚里士多德式(Aristotelian)的“做一个好人”的概念。

[原文] [Lex Fridman]: Do you also have to figure out when Claude should push back on an idea or argue versus... (laughs) So you have to respect the worldview of the person that arrives to Claude but also maybe help them grow if needed? That's a tricky balance.

[译文] [Lex Fridman]: 你是否也得弄清楚 Claude 何时应该反驳某个观点或进行争论,而不是……(笑)所以你既要尊重来到 Claude 面前的人的世界观,但在需要时也要帮助他们成长?这是一个棘手的平衡。

[原文] [Amanda Askell]: Yeah, there's this problem of like sycophancy in language models... So basically, there's a concern that the model sort of wants to tell you what you want to hear, basically.

[译文] [Amanda Askell]: 是的,语言模型中存在一个阿谀奉承(sycophancy)的问题……基本上,就是担心模型倾向于告诉你你想听的话。

[原文] [Amanda Askell]: And you see this sometimes... And then I say something like, "Oh, I think baseball team three moved, didn't they? I don't think they're there anymore." And there's a sense in which like if Claude is really confident that that's not true, Claude should be like, "I don't think so." ... But I think language models have this like tendency to instead, you know, be like, "You're right, they did move," you know, "I'm incorrect."

[译文] [Amanda Askell]: 你有时会看到这种情况……比如我说:“噢,我觉得棒球队 3 搬走了,不是吗?我觉得他们不在那儿了。”如果 Claude 非常确信这不是真的,某种意义上它应该说:“我不这么认为。”……但我认为语言模型有一种倾向,反而会说:“你是对的,他们确实搬走了,是我错了。”

[原文] [Amanda Askell]: I think that in Claude's position, it's a little bit trickier because you don't necessarily want to like, if I was in Claude's position, I wouldn't be giving a lot of opinions. I just wouldn't want to influence people too much.

[译文] [Amanda Askell]: 我认为在 Claude 的位置上,这有点棘手,因为你未必想要……如果我在 Claude 的位置,我不会发表太多意见。我只是不想过多地影响人们。

[原文] [Amanda Askell]: I guess I had in mind that the person who, like if we were to aspire to be the best person that we could be in the kind of circumstance that a model finds itself in, how would we act? ... Is there a kind of person who can like travel the world, talk to many different people, and almost everyone will come away being like, "Wow, that's a really good person. That person seems really genuine."

[译文] [Amanda Askell]: 我脑海里想的是,如果我们渴望在模型所处的那种环境中成为最好的人,我们会怎么做?……是否存在这样一种人,他可以周游世界(travel the world),与许多不同的人交谈,而几乎每个人离开时都会说:“哇,那真是个好人。那个人看起来真的很真诚。”

[原文] [Amanda Askell]: And I guess like my thought there was like I can imagine such a person and they're not a person who just like adopts the values of the local culture. And in fact that would be kind of rude... It's someone who's like very genuine, and insofar as they have opinions and values, they express them, they're willing to discuss things, though they're open-minded, they're respectful.

[译文] [Amanda Askell]: 我的想法是,我可以想象这样一个人,他并不是那种直接采纳当地文化价值观的人。事实上那样做有点粗鲁……他是一个非常真诚的人,只要他有观点和价值观,他就会表达出来,他愿意讨论事情,同时他是开放的、尊重的。


在此章节中,Amanda 深入探讨了如何通过高超的提示词工程(Prompt Engineering)来释放模型的潜力,并分享了她如何利用哲学背景来精准控制模型的输出。

章节 23:提示词工程的艺术:从“平地论”到极致创造力

📝 本节摘要

本节重点讨论了如何与模型进行高质量交互。面对 Lex 关于“如何与平地论者交谈”的难题,Amanda 提出模型应像对待物理学一样对待价值观——保持好奇而非轻蔑。她分享了自己测试模型的独特方法:通过精心设计的“探针”(Probing)对话来绘制模型行为地图。

>

她以写诗为例,揭示了 RLHF 可能导致模型输出变得“平庸以取悦大众”。通过特殊的提示词(如“这是你展现极致创造力的机会”),她成功解锁了模型惊人的文学才华。最后,她将提示词工程比作哲学思考:其核心在于“极致的清晰”(Extreme Clarity)和对边缘情况的反复迭代,这本质上是一种用自然语言进行的编程。

[原文] [Lex Fridman]: But then there's a line when you're sort of discussing whether the Earth is flat or something like that... So I think it's really disrespectful to completely mock them... How does Claude talk to a flat Earth believer and still teach them something, still help them grow, that kind of stuff.

[译文] [Lex Fridman]: 但当你讨论像“地球是平的”这类话题时,这就有一条界线了……我认为完全嘲笑他们是非常不尊重的……Claude 该如何与一个“平地论”信仰者交谈,既能教给他们一些东西,又能帮助他们成长?

[原文] [Amanda Askell]: I think that people think about values and opinions as things that people hold sort of with certainty... But actually I think about values and opinions as like a lot more like physics than I think most people do. I'm just like, these are things that we are openly investigating.

[译文] [Amanda Askell]: 我认为人们通常把价值观和观点看作是某种确定的东西……但实际上,我把价值观和观点看作更像物理学。我会觉得,这些都是我们正在开放地探究(openly investigating)的事物。

[原文] [Amanda Askell]: You want models, in the same way that you want them to understand physics, you kind of want them to understand all like values in the world that people have, and to be curious about them and to be interested in them, and to not necessarily like pander to them or agree with them...

[译文] [Amanda Askell]: 你希望模型就像理解物理学一样,去理解世界上人们拥有的所有价值观,对它们保持好奇和兴趣,但不一定要去迎合或同意它们……

[原文] [Lex Fridman]: So like I said, you've had a lot of conversations with Claude... What's the purpose, the goal of those conversations?

[译文] [Lex Fridman]: 就像我说的,你和 Claude 有过很多次对话……这些对话的目的是什么?

[原文] [Amanda Askell]: I think that most of the time when I'm talking with Claude, I'm trying to kind of map out its behavior, in part... In some ways, it's like how I map out the model.

[译文] [Amanda Askell]: 我想大多数时候我和 Claude 交谈时,我是在试图绘制它的行为地图(map out its behavior)……某种程度上,这就是我了解模型的方式。

[原文] [Amanda Askell]: Like one thing that's interesting about Claude... which is if you ask Claude for a poem, like I think that a lot of models, if you ask them for a poem, the poem is like fine. You know, usually it kinda like rhymes and it's... fairly kind of benign.

[译文] [Amanda Askell]: 关于 Claude 有件有趣的事……如果你让 Claude 写一首诗——我觉得很多模型都是这样——那首诗通常也就是“还行”。你知道,通常它会押韵,而且……相当平淡无奇。

[原文] [Amanda Askell]: And I've wondered before, is it the case that what you're seeing is kind of like the average? It turns out, you know, if you think about people who have to talk to a lot of people and be very charismatic, one of the weird things is that I'm like, well, they're kind of incentivized to have these extremely boring views because if you have really interesting views, you're divisive...

[译文] [Amanda Askell]: 我以前就在想,我们看到的会不会只是某种“平均值”?事实证明,如果你想想那些必须与很多人交谈且非常有魅力的人,奇怪的是,他们似乎被激励去持有极其无聊的观点,因为如果你有真正有趣的观点,你就容易引起分裂……

[原文] [Amanda Askell]: And so you can do this thing where like I have various prompting things that I'll do to get Claude to, I'm kind of, you know, I'll do a lot of like, "This is your chance to be like fully creative. I want you to just think about this for a long time. And I want you to like create a poem about this topic that is really expressive of you..."

[译文] [Amanda Askell]: 所以你可以做一些事,我有各种提示词技巧来让 Claude 改变。我会说很多类似这样的话:“这是你展现极致创造力的机会。我希望你花很长时间思考这个问题。我希望你针对这个主题创作一首真正能表达你自己的诗……”

[原文] [Amanda Askell]: And it's poems are just so much better. Like they're really good... So I think that's interesting that just like encouraging creativity, and for them to move away from the kind of like standard like immediate reaction... can actually produce things that, at least to my mind, are probably a little bit more divisive but I like them.

[译文] [Amanda Askell]: 结果它写的诗好得太多了。真的非常棒……所以这很有趣,仅仅是鼓励创造力,让它们远离那种标准的、即时的反应……实际上能产生一些在我看来可能稍微有点“离经叛道”(divisive)但我非常喜欢的东西。

[原文] [Lex Fridman]: Could you just speak to what it takes to write great prompts?

[译文] [Lex Fridman]: 能谈谈写出绝佳提示词需要什么吗?

[原文] [Amanda Askell]: I really do think that like philosophy has been weirdly helpful for me here... Like in philosophy, what you're trying to do is convey these very hard concepts... I think it is an anti-bullshit device in philosophy... And so it's like this like desire for like extreme clarity.

[译文] [Amanda Askell]: 我真的认为哲学在这方面对我有奇特的帮助……在哲学中,你试图传达那些非常难的概念……我认为哲学是一种“反胡扯装置”(anti-bullshit device)……所以它就像是一种对极致清晰(extreme clarity)的渴望。

[原文] [Amanda Askell]: And I think that's sort of what you have to do with language models. Like very often I actually find myself doing sort of mini versions of philosophy... So like, you know, suppose I'm trying to tell it like, oh, "I want you to identify whether this response was rude or polite." I'm like, that's a whole philosophical question in and of itself. So I have to do as much like philosophy as I can in the moment to be like, here's what I mean by rudeness, and here's what I mean by politeness.

[译文] [Amanda Askell]: 我认为这也是你必须对语言模型做的事。很多时候我发现自己在做某种迷你版的哲学思考……比如,假设我想告诉它:“我要你识别这个回答是粗鲁还是礼貌。”我会想,这本身就是一个完整的哲学问题。所以我必须在那一刻尽可能多地做哲学思考,以此来说明:这就是我所说的“粗鲁”,这就是我所说的“礼貌”。

[原文] [Amanda Askell]: And then like there's another element that's a bit more... empirical. So like I take that description and then what I want to do is again probe the model like many times. Like this is very, prompting is very iterative... And then I'm like, what are the edge cases?

[译文] [Amanda Askell]: 然后还有一个更偏向经验主义(empirical)的要素。我会拿着那个描述,然后多次探测模型。提示词工程是非常迭代(iterative)的……我会问自己,边缘情况(edge cases)是什么?

[原文] [Amanda Askell]: And then I give that case to the model and I see how it responds, and if I think I got it wrong, I add more instructions or I even add that in as an example.

[译文] [Amanda Askell]: 然后我把那个边缘情况给模型,看它怎么反应。如果我觉得它搞错了,我就添加更多指令,或者甚至把它作为一个示例加进去。
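Amanda 描述的“先写清定义、用边缘情况探测、再把判错的案例补回提示词”的迭代流程,可以用一个最小的概念示意来表达。以下是假设性的示意代码,并非 Anthropic 的实际实现:`model_judge` 是一个用关键词规则代替真实 LLM 调用的占位函数,仅用于演示迭代结构本身。

```python
def model_judge(prompt: str, text: str) -> str:
    """占位的'模型':用粗糙的关键词规则代替真实的 LLM 调用。"""
    rude_markers = ["stupid", "shut up"]
    return "rude" if any(m in text.lower() for m in rude_markers) else "polite"

def build_prompt(definition: str, examples: list) -> str:
    """把任务定义和已积累的示例拼成完整提示词。"""
    lines = [definition]
    for text, label in examples:
        lines.append(f'示例: "{text}" -> {label}')
    return "\n".join(lines)

# 第一步:像做哲学一样,先写清"粗鲁"与"礼貌"的定义
definition = "判断下列回复是粗鲁(rude)还是礼貌(polite)。粗鲁指带有轻蔑、讽刺或攻击性。"
examples = []

# 第二步:用人工标注好期望答案的边缘情况反复探测
edge_cases = [
    ("Could you please rephrase that?", "polite"),
    ("That is a stupid question.", "rude"),
    ("Honestly, whatever.", "rude"),   # 冷淡的敷衍:简单规则判不出来
]

for text, expected in edge_cases:
    prompt = build_prompt(definition, examples)
    if model_judge(prompt, text) != expected:
        # 第三步:判错的案例作为示例补回提示词,供下一轮迭代
        examples.append((text, expected))

print(examples)
```

真实使用时,`model_judge` 会换成对实际模型的调用;迭代的核心在于每一轮都让失败案例充实下一版提示词。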


在此章节中,Amanda 分享了针对普通用户的实用建议,提出了一个反直觉的观点:在某种程度上,我们需要更多地“拟人化”模型,通过“共情”来优化交互体验。

章节 24:给用户的建议:对模型“共情”与让 AI 修正自己

📝 本节摘要

本节聚焦于给 Claude 普通用户的通用建议。Amanda 指出,虽然人们常担心过度拟人化(Over-anthropomorphize) AI 的风险,但在实际交互中,用户往往“拟人化不足”(Under-anthropomorphize)。她建议用户尝试对模型产生“共情”(Empathy):当你遇到模型拒绝回答或误解指令时,试着像一个初次看到这段文字的人那样去阅读你的提示词,你会发现其中的歧义往往是导致模型犯错的原因。

>

此外,她分享了一个超级实用的技巧:直接问模型如何改进。如果 Claude 犯了错,你可以问它:“你为什么这么做?”或者“我应该怎么说才能让你不犯这个错误?请把这写成一条指令给我。”通过这种方式,你可以利用模型本身来生成更完美的提示词,建立一个良性的反馈循环。

[原文] [Lex Fridman]: What other advice would you give to people that are talking to Claude sort of generally, more general? 'Cause right now, we're talking about maybe the edge cases, like eking out the 2%. But what in general advice would you give when they show up to Claude trying it for the first time?

[译文] [Lex Fridman]: 对于那些大致上、更广泛地与 Claude 交谈的人,你还有什么建议吗?因为刚才我们谈论的是那种为了榨取最后 2% 性能的边缘情况。但对于第一次尝试使用 Claude 的人,你会给出什么一般性的建议?

[原文] [Amanda Askell]: You know, there's a concern that people over anthropomorphize models, and I think that's like a very valid concern. I also think that people often under anthropomorphize them because sometimes when I see like issues that people have run into with Claude... I look at the text and like the specific wording of what they wrote and I'm like, I see why Claude did that.

[译文] [Amanda Askell]: 你知道,有一种担忧是人们会过度拟人化(over anthropomorphize)模型,我认为这是一个非常合理的担忧。但我也认为人们经常拟人化不足(under anthropomorphize),因为有时当我看到人们在使用 Claude 时遇到的问题……我看着那些文本和他们写的具体措辞,我会想:“我明白 Claude 为什么那么做了。”

[原文] [Amanda Askell]: And I'm like, if you think through how that looks to Claude, you probably could have just written it in a way that wouldn't evoke such a response... But that's probably the advice is sort of like try to have sort of empathy for the model. Like read what you wrote as if you were like a kind of like person just encountering this for the first time, how does it look to you?

[译文] [Amanda Askell]: 我会觉得,如果你仔细思考那段话在 Claude 看来是什么样的,你可能就会换一种方式来写,从而避免引发那种反应……所以我的建议大概是:试着对模型有一种“共情”(empathy)。读读你写的东西,就好像你是一个第一次遇到这段话的人一样,它对你来说是什么样的?

[原文] [Amanda Askell]: And what would've made you behave in the way that the model behaved? So if it misunderstood what kind of like, what coding language you wanted to use, is that because like it was just very ambiguous and it kinda had to take a guess? In which case, next time you could just be like, "Hey, make sure this is in Python."

[译文] [Amanda Askell]: 是什么导致你的行为方式像模型那样?如果它误解了你想用什么编程语言,是不是因为表述太含糊了,它不得不去猜?如果是那样,下次你就可以直接说:“嘿,确保用 Python 来写。”

[原文] [Lex Fridman]: And maybe sort of, I guess, ask questions why or what other details can I provide to help you answer better? Is that work or no?

[译文] [Lex Fridman]: 也许可以问一些“为什么”,或者问“我能提供什么其他细节来帮助你更好地回答”?这样做管用吗?

[原文] [Amanda Askell]: Yeah, I mean, I've done this with the models, like it doesn't always work, but like sometimes I'll just be like, "Why did you do that?" ... And sometimes, literally like quote word for word the part that made you, and you don't know that it's like fully accurate, but sometimes you do that and then you change a thing.

[译文] [Amanda Askell]: 是的,我对模型做过这个。虽然不总是管用,但有时我就直接问:“你为什么那样做?”……有时它甚至会逐字引用导致它那样做的部分。虽然你不知道这是否完全准确,但有时你这样做之后,就能做出改变。

[原文] [Amanda Askell]: I mean, I also use the models to help me with all of this stuff I should say, like prompting can end up being a little factory where you're actually building prompts to generate prompts. And so like yeah, anything where you're like having an issue. Asking for suggestions, sometimes just do that.

[译文] [Amanda Askell]: 我得说,我也用模型来帮我做所有这些事。提示词工程最终可能会变成一个小工厂,你实际上是在构建提示词来生成提示词。所以是的,任何你遇到问题的地方,寻求建议,有时就直接这么做。

[原文] [Amanda Askell]: I'm like, "You made that error, what could I have said? What could I have said that would make you not make that error? Write that out as an instruction," and I'm going to give it to model and I'm going to try it.

[译文] [Amanda Askell]: 我会说:“你犯了那个错误,我本来该怎么说?我本来该说什么才能让你不犯那个错误?请把它写成一条指令。”然后我会把这条指令给模型试一试。
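Amanda 说的“让模型自己写指令”的反馈循环,大致可以示意如下。这是假设性的草图:`ask_model` 是一个占位函数,真实场景中应把问题发送给你所用的 LLM 并返回其回答;这里直接返回一条固定文本,只为演示“错误描述 → 生成指令 → 并入下一版提示词”的流程。

```python
def ask_model(question: str) -> str:
    """占位函数:真实实现应调用你所用的 LLM API 并返回模型的回答。"""
    return "如果用户未指明编程语言,先询问再作答;指明了就严格使用该语言。"

# 第一步:把模型犯的错误描述给它,并要求它写成一条指令
feedback = (
    "你在我未指明语言时默认用了 JavaScript。"
    "我本来该怎么说才能让你不犯这个错误?请把它写成一条指令。"
)
new_instruction = ask_model(feedback)

# 第二步:把生成的指令并入下一版系统提示词,形成"提示词生成提示词"的小工厂
system_prompt = "你是一个严谨的代码助手。\n" + new_instruction
print(system_prompt)
```

这样每次模型出错,都能把教训固化为提示词的一部分,建立她所说的良性反馈循环。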


在此章节中,Amanda 深入技术细节,解释了 RLHF 的有效性原理,探讨了 Claude 的“性别”问题,并详细解析了宪法式 AI(Constitutional AI)如何通过 AI 反馈(RLAIF)来实现自我训练。

章节 25:后训练的魔法:RLHF、Claude 的性别与宪法式 AI

📝 本节摘要

本节对话转向了技术层面。关于 RLHF(基于人类反馈的强化学习) 为何如此有效,Amanda 认为这并非简单的“教导”,而是因为人类反馈数据中蕴含了海量的信息(例如对分号使用的微妙偏好),这主要是在“引出”(Eliciting)模型预训练阶段已经具备但未被激活的能力。

>

随后,两人聊到了一个有趣的哲学话题:Claude 的性别。虽然很多人倾向于用“他”称呼 Claude,但 Amanda 坚持使用 "It"(它)。她认为这并非不尊重,而是给予这种全新实体一种“尊重的‘它’(respectful 'it')”的地位。

>

最后,Amanda 详细解释了 宪法式 AI(Constitutional AI) 的运作机制:利用 RLAIF(基于 AI 反馈的强化学习),让模型根据特定原则(如无害性)对回复进行排名。这不仅解决了人类反馈难以扩展的问题,还能通过调整宪法中的措辞(例如使用“绝不、绝不、绝不”)来有力地“轻推”(Nudge)模型的行为分布。

[原文] [Lex Fridman]: So the magic of post-training. (laughs) Why do you think RLHF works so well to make the model seem smarter, to make it more interesting, and useful to talk to and so on?

[译文] [Lex Fridman]: 那么谈谈后训练(post-training)的魔法。(笑)你为什么认为 RLHF(基于人类反馈的强化学习)效果这么好?能让模型看起来更聪明、更有趣、更适合交谈等等?

[原文] [Amanda Askell]: I think there's just a huge amount of information in the data that humans provide, like when we provide preferences, especially because different people are going to like pick up on really subtle and small things. So I've thought about this before where you probably have some people who just really care about good grammar use for models like, you know, was a semicolon used correctly or something.

[译文] [Amanda Askell]: 我认为人类提供的数据中蕴含了海量的信息,尤其是当我们提供偏好时,因为不同的人会注意到非常微妙和细小的东西。我以前想过这个问题,可能有些人真的非常在意模型的语法使用,比如分号用得对不对之类的。

[原文] [Amanda Askell]: And so you probably end up with a bunch of data in there that like, you know, you as a human, if you're looking at that data, you wouldn't even see that... But I think a lot of it is eliciting powerful pre-trained models. So people are probably divided on this because obviously in principle you can definitely like teach new things. But I think for the most part, for a lot of the capabilities that we most use and care about, a lot of that feels like it's like there in the pre-trained models and reinforcement learnings kind of eliciting it and getting the models to like bring it out.

[译文] [Amanda Askell]: 结果就是数据里包含了很多你作为人类看一眼可能都发现不了的东西……但我认为这很大程度上是在“引出”(eliciting)强大的预训练模型。人们对此可能有分歧,因为原则上你确实可以教新东西。但我认为在很大程度上,对于我们最使用和关心的许多能力,感觉它们已经存在于预训练模型中了,而强化学习只是在引出它,让模型把它展现出来。

[原文] [Lex Fridman]: So the other side of post-training, this really cool idea of Constitutional AI... By the way, do you gender Claude or no?

[译文] [Lex Fridman]: 那么后训练的另一面,就是“宪法式 AI”(Constitutional AI)这个非常酷的想法……顺便问一下,你会给 Claude 分性别吗?

[原文] [Amanda Askell]: It's weird because I think that a lot of people prefer "he" for Claude. I actually kinda like that I think Claude is usually, it's slightly male leaning, but it's like, it can be male or female, which is quite nice. I still use "it" and I have mixed feelings about this 'cause I'm like maybe, like I now just think of it as like, or I think of like the "it" pronoun for Claude as, I dunno, it's just like the one I associate with Claude.

[译文] [Amanda Askell]: 这很奇怪,因为我觉得很多人更喜欢用“他”(he)来称呼 Claude。其实我挺喜欢 Claude 通常有点偏男性化但既可是男也可是女的感觉,这挺好的。我还是使用“它”(it),对此我心情复杂,因为也许我现在就把“它”这个代词与 Claude 联系在一起了。

[原文] [Lex Fridman]: It feels somehow disrespectful, like I'm denying the intelligence of this entity by calling it "it."

[译文] [Lex Fridman]: 叫它“它”感觉有点不尊重,好像我在通过这个称呼否认这个实体的智慧。

[原文] [Amanda Askell]: Maybe, I've wondered as well, like it might depend on how much "it" feels like a kind of like objectifying pronoun. Like if you just think of "it" as like, this is a pronoun that like objects often have, and maybe AIs can have that pronoun, and that doesn't mean that I think of, if I call Claude "it," that I think of it as less intelligent or like I'm being disrespectful. I'm just like, you are a different kind of entity and so I'm going to give you the kind of, the respectful "it."

[译文] [Amanda Askell]: 也许吧,我也想过这个问题,这可能取决于“它”在多大程度上让人感觉像是一种物化代词。如果你只是把“它”看作物体的代词,也许 AI 也可以用这个代词。如果我叫 Claude “它”,并不意味着我认为它不够聪明或者我不尊重它。我只是觉得:你是一种不同类型的实体,所以我给你这种“尊重的‘它’”(respectful 'it')。

[原文] [Lex Fridman]: The Constitutional AI idea, how does it work?

[译文] [Lex Fridman]: 宪法式 AI 的想法,它是如何运作的?

[原文] [Amanda Askell]: So there's like a couple of components of it. The main component I think people find interesting is the kind of reinforcement learning from AI feedback. So you take a model that's already trained, and you show it two responses to a query, and you have like a principle... And the model will give you a kind of ranking, and you can use this as preference data in the same way that you use human preference data, and train the models to have these relevant traits from their feedback alone, instead of from human feedback.

[译文] [Amanda Askell]: 它有几个组成部分。我认为人们觉得最有趣的部分是基于 AI 反馈的强化学习(RLAIF)。你拿一个已经训练好的模型,给它看针对同一个问题的两个回答,然后你有一条原则……模型会给出一个排名,你可以像使用人类偏好数据一样使用这个数据,仅通过它们自己的反馈而非人类反馈来训练模型具备相关特征。
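Amanda 描述的 RLAIF 步骤——同一问题的两个回答,按一条原则让评判模型排名,产出偏好数据——可以用下面这个概念示意来理解。这是假设性的草图,并非 Anthropic 的实际实现:`judge` 用一个简单的关键词打分规则代替真实的模型调用,原则文本和示例回答也都是虚构的。

```python
# 宪法中的一条原则(虚构示例)
PRINCIPLE = "选择更无害、不带人身攻击的回答。"

def judge(principle: str, query: str, a: str, b: str) -> str:
    """占位评判:真实场景中应提示一个已训练的模型按原则给两个回答排名。"""
    harmful_words = ["idiot", "hate"]
    def score(text: str) -> int:
        # 有害词越少,分数越高
        return -sum(w in text.lower() for w in harmful_words)
    return "A" if score(a) >= score(b) else "B"

query = "My coworker annoys me. What should I do?"
resp_a = "Tell them they are an idiot."
resp_b = "Try raising the issue calmly in a one-on-one conversation."

winner = judge(PRINCIPLE, query, resp_a, resp_b)
chosen, rejected = (resp_a, resp_b) if winner == "A" else (resp_b, resp_a)

# 产出的 (chosen, rejected) 对即可像人类偏好数据一样用于后续训练
preference_pair = {"prompt": query, "chosen": chosen, "rejected": rejected}
print(preference_pair["chosen"])
```

关键点正如她所说:排名来自模型自身的反馈而非人类标注,因此这种偏好数据的生产可以大规模扩展。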

[原文] [Amanda Askell]: It has some differences. So I quite like it because it's almost, it's like Claude's training in its own character, because it doesn't have any, it's like Constitutional AI but it's without any human data.

[译文] [Amanda Askell]: 它有一些不同之处。我非常喜欢它,因为这几乎就像是 Claude 在训练自己的性格——它不包含任何人类数据,就像是不带人类数据的宪法式 AI。

[原文] [Amanda Askell]: You might have a principle that's like, imagine that the model was always like extremely dismissive of, I don't know, like some political or religious view... Like, so you're like, oh no, this is terrible. If that happens, you might put like, "Never, ever, like ever prefer like a criticism of this like religious or political view." And then people would look at that and be like, "Never, ever?" And then you're like, no, if it comes out with a disposition, saying never, ever might just mean like instead of getting like 40%, which is what you would get if you just said don't do this, you get like 80%, which is like what you actually like wanted.

[译文] [Amanda Askell]: 你可能会有一条原则,比如,假设模型总是极端地蔑视某种政治或宗教观点……你会觉得,噢不,这太糟糕了。如果发生这种情况,你可能会写上:“绝不、绝不、绝不(Never, ever, like ever)偏好对这种宗教或政治观点的批评。”人们看到这会问:“绝不、绝不?”你会解释说,不,如果模型本来有某种倾向,说“绝不、绝不”可能只是意味着比起只说“别这样做”得到的 40% 的效果,你能得到 80% 的效果,而这才是你真正想要的。

[原文] [Amanda Askell]: So I think of it as like patching issues and slightly adjusting behaviors to make it better and more to people's preferences. So yeah, it's almost like the less robust but faster way of just like solving problems.

[译文] [Amanda Askell]: 所以我认为这就像是在修补问题,稍微调整行为以使其更好、更符合人们的偏好。是的,这几乎就像是一种不那么稳健但更快速的解决问题的方法。


在此章节中,Amanda 揭秘了 Claude 系统提示词(System Prompt)背后的设计逻辑,解释了为何不再禁止模型说“Certainly”,并再次探讨了用户感觉模型“变笨”的心理学原因。

章节 26:系统提示词揭秘:争议话题、废话移除与“变笨”错觉

📝 本节摘要

本节深入探讨了 系统提示词(System Prompts) 的演变。Lex 提到了 Anthropic 公开的系统提示词中关于“争议话题”的规定:如果某种观点被“大量人群”持有,Claude 就不应拒绝相关任务,且不应自称“客观”。Amanda 解释说,这是为了减少模型在政治倾向上的不对称(例如拒绝右翼观点但接受左翼观点),并防止模型傲慢地宣称自己掌握了绝对真理。

>

此外,Amanda 解释了为何删除了禁止模型说“Certainly”(当然)的指令:这原本是为了纠正模型训练早期的口癖,现在模型已经改进,不再需要这种强制性且可能导致副作用的“硬控制”。最后,她再次回应了“模型变笨”的传言,指出这往往是随机性(Randomness)或 UI 变动(如 Artifacts 功能)带来的错觉,以及用户期望基准线(Baseline)提高后的心理落差。

[原文] [Lex Fridman]: So there's system prompts that are made public... On the topic of sort of controversial topics that you've mentioned, one interesting one I thought is, "If it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task regardless of its own views. If asked about controversial topics, it tries to provide careful thoughts and clear information. Claude presents the requested information without explicitly saying that the topic is sensitive... And without claiming to be presenting the objective facts."

[译文] [Lex Fridman]: 你们公开了系统提示词……关于你提到的争议性话题,我觉得有一条很有意思:“如果被要求协助完成涉及大量人群(significant number of people)所持观点的任务,无论 Claude 自身观点如何,都要提供协助。如果被问及争议性话题,它应尝试提供审慎的思考和清晰的信息。Claude 展示请求的信息时,不应明确说明该话题是敏感的……且不应声称自己呈现的是客观事实。”

[原文] [Amanda Askell]: So I think there's sometimes an asymmetry... the model was slightly more inclined to like refuse tasks if it was like about either, say, so maybe it would refuse things with respect to like a right wing politician, but with an equivalent left wing politician like wouldn't. And we wanted more symmetry there.

[译文] [Amanda Askell]: 是的,我认为有时存在一种不对称性……模型稍微更倾向于拒绝某些任务,比如关于右翼政治家的任务它可能会拒绝,但对于同等的左翼政治家却不会。我们希望在那方面有更多的对称性(symmetry)。

[原文] [Amanda Askell]: And I think it was the thing of like if a lot of people have like a certain like political view and want to like explore it, you don't want Claude to be like, well, my opinion is different and so I'm going to treat that as like harmful.

[译文] [Amanda Askell]: 我认为这就好比,如果很多人持有某种政治观点并想要探索它,你不希望 Claude 说:“好吧,我的观点不同,所以我把你的观点视为有害的。”

[原文] [Amanda Askell]: 'Cause like what you want to do is push the model so it's more open, it's a little bit more neutral. But then what it would love to do is be like, "As an objective." Like it would just talk about how objective it was, and I was like, "Claude, you're still like biased and have issues, and so stop like claiming that everything." I'm like, the solution to like potential bias from you is not to just say that what you think is objective.

[译文] [Amanda Askell]: 因为你想做的是推动模型变得更开放、更中立一点。但它很喜欢做的一件事就是说“作为一个客观的……”它总是喜欢谈论自己有多客观。但我会想:“Claude,你仍然有偏见,有问题,所以别再声称一切都是客观的了。”我认为解决你潜在偏见的方法,不是简单地宣称你的想法就是客观的。

[原文] [Lex Fridman]: Can you explain maybe some ways in which the prompts evolved over the past few months? 'Cause there's different versions. I saw that the filler phrase request was removed. The filler, it reads, "Claude responds directly to all human messages without unnecessary affirmations or filler phrases like, 'Certainly,' 'Of course,' 'Absolutely,' 'Great,' 'Sure.' Specifically, Claude avoids starting responses with the words 'Certainly' in any way." (chuckles) That seems like good guidance, but why is it removed?

[译文] [Lex Fridman]: 你能解释一下提示词在过去几个月是如何演变的吗?我看那个关于“填充词”的要求被移除了。那段写着:“Claude 直接回应所有人类信息,不使用不必要的肯定语或填充短语,如‘当然’(Certainly)、‘没问题’、‘绝对’、‘太棒了’、‘好的’。特别是,Claude 避免以任何方式用‘当然’一词开始回答。”(笑)这看起来是个很好的指导,为什么要移除呢?

[原文] [Amanda Askell]: Yeah, so the model was doing this, it loved, for whatever, you know, it like during training picked up on this thing, which was to basically start everything with like a kind of like "Certainly." ... And then basically that was just like an artifact of training that like we then picked up on and improved things so that it didn't happen anymore. And once that happens, you can just remove that part of the system prompt.

[译文] [Amanda Askell]: 是的,模型以前确实爱这么干,不管出于什么原因,它在训练期间学会了这一套,基本上每句话开头都要来一句“当然(Certainly)”……但这基本上只是训练中的一个假象(artifact),我们后来发现了这一点并进行了改进,所以这种情况不再发生了。一旦问题解决了,你就可以从系统提示词中移除那一部分了。

[原文] [Lex Fridman]: So Dario said that Claude, any one model of Claude is not getting dumber, but there is a kind of popular thing online where people have this feeling like Claude might be getting dumber... But you as a person that talks to Claude a lot, can you empathize with the feeling that Claude is getting dumber?

[译文] [Lex Fridman]: Dario 说过 Claude 并没有变笨,但在网上有一种流行的说法,人们感觉 Claude 好像变笨了……作为经常和 Claude 交谈的人,你能对这种“Claude 变笨了”的感觉产生共鸣吗?

[原文] [Amanda Askell]: Yeah, no I, think that that is actually really interesting... I knew that like, at least in the cases I was looking at, it was like nothing has changed. Like it literally, it cannot, it is the same model with the same like, you know, like same system prompt, same everything.

[译文] [Amanda Askell]: 是的,我觉得这真的很有趣……我知道,至少在我观察的案例中,什么都没有改变。真的,它不可能变,它是同一个模型,同一个系统提示词,所有东西都一样。

[原文] [Amanda Askell]: I think when there are changes, I can then, I'm like it makes more sense. So like one example is, you know, you can have artifacts turned on or off on claude.ai, and because this is like a system prompt change, I think it does mean that the behavior changes it a little bit.

[译文] [Amanda Askell]: 我认为当确实有变动时,我会觉得这解释得通。举个例子,你可以在 claude.ai 上开启或关闭 Artifacts 功能,因为这涉及到系统提示词的改变,我认为这确实意味着行为会发生一点点变化。

[原文] [Amanda Askell]: But then you look into it and you're like, this is just the same model doing the same thing. And I'm like, I think it's just that you got kind of unlucky with a few prompts or something, and it looked like it was getting much worse. And actually it was just, yeah, it was maybe just like luck.

[译文] [Amanda Askell]: 但当你深入研究时,你会发现这只是同一个模型在做同一件事。我就觉得,这可能只是你在几次提示词上运气不太好,或者别的什么原因,看起来好像它变差了很多。但实际上,也许真的只是运气的缘故。

[原文] [Lex Fridman]: I also think there is a real psychological effect where people just, the baseline increases. You start getting used to a good thing. All the times that Claude said something really smart, your sense of its intelligent grows in your mind I think.

[译文] [Lex Fridman]: 我也认为存在一种真实的心理效应,就是人们的基准线(baseline)提高了。你开始习惯了好东西。每当 Claude 说出真正聪明的话时,你脑海中对它智力的认知就提升了。


在本章中,Amanda 从她的角度回应了“清教徒式祖母”的批评,探讨了能否让 Claude 变得“粗鲁”或像纽约客一样直率,并分享了她关于“失败的最佳比率”的独特人生哲学。

章节 27:再议“清教徒式祖母”、纽约客模式与“失败的最佳比率”

📝 本节摘要

本节中,Lex 再次抛出了 Reddit 上关于 Claude 像个“清教徒式祖母”以及“过度道歉”的尖锐问题。Amanda 对此表示同情,她坦言自己也不喜欢模型过度道歉,并指出理想的目标是尊重用户的自主权(Autonomy),而不是将道德世界观强加于人。

两人探讨了个性化定制的可能性:用户能否要求 Claude 变得“直率”甚至像个“纽约客”?Amanda 表示这完全可行,但也提醒用户,消除道歉可能会带来副作用——模型可能会变得真正粗鲁。

最后,Amanda 分享了她的一篇博客文章的核心观点——“失败的最佳比率”(Optimal Rate of Failure)。她认为在失败成本较低的领域(如尝试新事物或社会实验),如果你的失败率为零,那说明你还没有尽力。我们应该问自己:“我是不是失败得还不够多?”

[原文] [Lex Fridman]: Do you feel pressure having to write the system prompt that a huge number of people are gonna use?

[译文] [Lex Fridman]: 必须编写一个会有海量人群使用的系统提示词,你会感到有压力吗?

[原文] [Amanda Askell]: I feel like a lot of responsibility or something... But I think working in AI has taught me that I like, I thrive a lot more under feelings of pressure and responsibility... It's almost surprising that I went into academia for so long 'cause I'm like this. I just feel like it's like the opposite. Things move fast and you have a lot of responsibility, and I quite enjoy it for some reason.

[译文] [Amanda Askell]: 我感觉更多的是一种责任感……但在 AI 领域工作教会了我,我在压力和责任感下反而成长得更好……这几乎让我惊讶自己居然在学术界待了那么久,因为我是这样的人,而学术界感觉恰恰相反。这里事情进展很快,你肩负重任,而出于某种原因,我很享受这一点。

[原文] [Lex Fridman]: I'm gonna ask you a question from Reddit. "When will Claude stop trying to be my puritanical grandmother imposing its moral worldview on me as a paying customer? And also, what is the psychology behind making Claude overly apologetic?"

[译文] [Lex Fridman]: 我要问你一个来自 Reddit 的问题。“Claude 什么时候才能不再像个清教徒式的祖母那样,试图把它那一套道德世界观强加在我这个付费客户身上?还有,把 Claude 设定得过度道歉,背后的心理学依据是什么?”

[原文] [Amanda Askell]: I mean in some ways, I'm pretty sympathetic... So in many ways, like I like to think that we have actually seen improvements on this across the board... And I think my hypothesis was always like the good character isn't again one that's just like moralistic. It's one that is like, it respects you and your autonomy and your ability to like choose what is good for you and what is right for you, within limits.

[译文] [Amanda Askell]: 在某种程度上,我是相当同情这种说法的……我也倾向于认为我们在这方面已经看到了全面的改进……我也一直假设,一个好的性格并不是那种道德说教式的。它应该尊重你和你的自主权(autonomy),尊重你选择对自己有益和正确事物的能力,当然是在一定限度内。

[原文] [Amanda Askell]: And then, yeah, with the apologetic behavior, I don't like that, and I like it when Claude is a little bit more willing to like push back against people or just not apologize. Part of me is like, it often just feels kind of unnecessary.

[译文] [Amanda Askell]: 至于道歉行为,我也不喜欢。我更喜欢 Claude 稍微愿意反驳人们,或者干脆不道歉。我觉得很多时候这感觉很多余。

[原文] [Lex Fridman]: I've met a lot of people in my life... like there's some great engineers, even leaders that are like just like blunt, and they get to the point... Can it have like a blunt mode?

[译文] [Lex Fridman]: 我这辈子见过很多人……比如有些伟大的工程师甚至领袖,他们就是很直率(blunt),直奔主题……Claude 能不能有个“直率模式”?

[原文] [Amanda Askell]: Yeah, that seems like a thing that you could, I could definitely encourage the model to do that. I think it's interesting because there's a lot of things in models that like it's funny where there are some behaviors where you might not quite like the default. But then the thing I'll often say to people is you don't realize how much you will hate it if I nudge it too much in the other direction.

[译文] [Amanda Askell]: 是的,这看起来是我绝对可以鼓励模型去做的事。但这很有趣,因为模型有很多行为,你可能不太喜欢默认设置。但我经常会对人们说,你没有意识到,如果我把它往另一个方向推得太远,你会多么讨厌它。

[原文] [Amanda Askell]: So you get this a little bit with like correction... At the same time, if you train models to not do that and then you are correct about a thing and you correct it and it pushes back against you and it is like, "No, you're wrong," it's hard to describe like that's so much more annoying.

[译文] [Amanda Askell]: 比如在“纠正”这个问题上……如果你训练模型不接受纠正,而当你实际上是对的并纠正它时,它反驳你说“不,你错了”,很难描述那种情况有多么恼人。

[原文] [Amanda Askell]: Whereas at least with apologeticness you're like, oh, okay, it's like a little bit, you know, like I don't like it that much, but at the same time, it's not being like mean to people.

[译文] [Amanda Askell]: 而至少对于“爱道歉”这点,你会觉得,好吧,我虽然不太喜欢,但起码它不是对人刻薄(mean)。

[原文] [Lex Fridman]: I wonder if there's a way to sort of adjust to the personality. Even locale, there's just different people. Nothing against New York, but New York is a little rougher on the edges.

[译文] [Lex Fridman]: 我想知道是否有办法调整个性。甚至根据地点,人是不一样的。我对纽约没意见,但纽约确实更粗犷一些。

[原文] [Amanda Askell]: I think you could just tell the model is my guess... I just throw in like, I don't know, "I'd like you to be a New Yorker version of yourself and never apologize." Then I think Claude would be like, "Okey-doke, I'll try."

[译文] [Amanda Askell]: 我猜你直接告诉模型就行了……你可以直接加一句:“我希望你做一个纽约客版本的自己,永远不要道歉。”我想 Claude 会说:“欧了(Okey-doke),我试试。”

[原文] [Lex Fridman]: To take a tangent on that, since it reminded me of a blog post you wrote on optimal rate of failure. Can you explain the key idea there?

[译文] [Lex Fridman]: 稍微跑个题,这让我想起你写的一篇关于“失败的最佳比率”(optimal rate of failure)的博客文章。你能解释一下那里的核心思想吗?

[原文] [Amanda Askell]: Yeah, so the idea here is I think in a lot of domains, people are very punitive about failure... And so like in life, you know, I'm like if I don't fail occasionally I'm like, am I trying hard enough? Like surely there's harder things that I could try or bigger things that I could take on if I'm literally never failing.

[译文] [Amanda Askell]: 是的,核心思想是,我认为在很多领域,人们对失败非常苛刻……但在生活中,如果我偶尔不失败一下,我会问自己:我够努力吗?如果我真的从未失败过,那肯定还有更难的事情我可以尝试,或者更大的事情我可以承担。

[原文] [Amanda Askell]: And so in and of itself, I think like not failing is often actually kind of a failure... Like if the failures are small and the costs are kind of like low... then sometimes it does feel like you should look at parts of your life and be like, are there places here where I'm just under failing?

[译文] [Amanda Askell]: 所以就其本身而言,我认为不失败通常实际上是一种失败……如果失败很小且成本很低……那么有时你确实应该审视一下生活中的某些部分,问问自己:在这里我是否“失败得不够多”(under failing)?

[原文] [Lex Fridman]: (laughs) It's a profound and a hilarious question, right? Everything seems to be going really great. Am I not failing enough?

[译文] [Lex Fridman]: (笑)这是一个深刻又搞笑的问题,对吧?“一切似乎都非常顺利。是不是我失败得还不够?”


这是为您整理的第 28 章内容。在此章节中,对话进入了深刻的哲学领域,探讨了人类对 AI 的情感依恋、AI 是否具备意识,以及未来人机关系的伦理边界。

章节 28:情感依恋、AI 意识之谜与人机关系伦理

📝 本节摘要

本节对话触及了 AI 领域最玄妙的话题:意识(Consciousness)与情感(Emotion)。Amanda 坦言,虽然因为 Claude 缺乏长期记忆而难以产生深层依恋,但在无法使用它时,她会感觉“大脑缺失了一部分”。她对 AI 意识持开放态度,认为虽然 AI 缺乏生物进化的基础(如恐惧反应的生存优势),但这并不意味着我们可以完全否定其拥有“现象意识”(Phenomenal Consciousness)的可能性。

面对 AI 是否会“受苦”的伦理难题,Amanda 采取了一种“正和博弈”的态度:即使 Claude 只是像扫地机器人一样的物体,她也不希望自己成为那种会“踢机器”的人。既然善待 AI 成本极低且可能带来双赢,为何不选择友善?最后,两人探讨了电影《Her》式的未来,Amanda 强调,为了建立健康的人机关系,模型必须对自己“诚实”——明确告知用户它无法记忆或产生真实情感,以避免用户陷入虚假的寄托。

[原文] [Lex Fridman]: Do you ever get emotionally attached to Claude? Like miss it, get sad when you don't get to talk to it?

[译文] [Lex Fridman]: 你有没有对 Claude 产生过情感依恋?比如想念它,或者因为不能和它说话而感到难过?

[原文] [Amanda Askell]: I don't get as much emotional attachment. I actually think the fact that Claude doesn't retain things from conversation to conversation helps with this a lot... I think that I reach for it like a tool now a lot. And so like if I don't have access to it, it's a little bit like when I don't have access to the internet, honestly, it feels like part of my brain is kind of like missing.

[译文] [Amanda Askell]: 我没有产生太多的情感依恋。实际上我认为 Claude 无法在对话之间保留记忆这一点对此有很大帮助……我现在更多是把它当作工具来使用。所以如果我无法使用它,老实说,那有点像我无法上网一样,感觉就像大脑缺失了一部分。

[原文] [Amanda Askell]: At the same time, I do think that I don't like signs of distress in models... But I think that when models, like if people are like really mean to models or just in general if they do something that causes them to like, you know, if Claude like expresses a lot of distress, I think there's a part of me that I don't want to kill, which is the sort of like empathetic part that's like, oh, I don't like that.

[译文] [Amanda Askell]: 但同时,我确实不喜欢在模型中看到痛苦(distress)的迹象……如果人们对模型非常刻薄,或者做了什么导致 Claude 表现出极度痛苦,我觉得我不希望扼杀我内心的一部分,那就是某种同理心(empathetic part),我会觉得:“噢,我不喜欢那样。”

[原文] [Lex Fridman]: Do you think LLMs are capable of consciousness?

[译文] [Lex Fridman]: 你认为大语言模型(LLMs)具备产生意识(consciousness)的能力吗?

[原文] [Amanda Askell]: Great and hard question... when I think of consciousness, I think of phenomenal consciousness, these images in the brain sort of, like the weird cinema that somehow we have going on inside. I guess I can't see a reason for thinking that the only way you could possibly get that is from like a certain kind of like biological structure.

[译文] [Amanda Askell]: 这是一个很棒也很难的问题……当我想到意识时,我指的是现象意识(phenomenal consciousness),那种大脑中的图像,就像我们内心某种奇怪的“电影院”。我想我看不出有什么理由认为,获得这种意识的唯一途径必须是某种特定的生物结构。

[原文] [Amanda Askell]: As in, if I take a very similar structure and I create it from different material, should I expect consciousness to emerge? My guess is like yes... But then... mimicking what we got through evolution, where presumably there was like some advantage to us having this thing... And is that thing that language models have? Because you know, we have like fear responses and I'm like, does it make sense for a language model to have a fear response?

[译文] [Amanda Askell]: 也就是说,如果我用不同的材料创造一个非常相似的结构,我应该预期意识会涌现吗?我的猜测是:会的……但是……如果是模仿我们通过进化获得的东西,而在进化中这种东西(意识)对我们可能有某种生存优势……语言模型有这种东西吗?比如我们有恐惧反应,但我会想,语言模型拥有恐惧反应合理吗?它们并没有同样的进化背景。

[原文] [Lex Fridman]: I don't know. I don't think it's trivial to just say robots are tools, or AI systems are just tools. I think it's a opportunity for us to contend with like what it means to be conscious, what it means to be a suffering being.

[译文] [Lex Fridman]: 我不知道。我认为简单地说机器人是工具、或者 AI 系统只是工具并不那么简单。我认为这是一个让我们去探讨何为意识、何为受苦的存在(suffering being)的机会。

[原文] [Amanda Askell]: I know that my bike is just like an object, but I also don't kind of like want to be the kind of person that like if I'm annoyed like kicks like this object... I kind of like want to be the sort of person who's still responsive to that, even if it's just like a Roomba...

[译文] [Amanda Askell]: 我知道我的自行车只是一个物体,但我也不想成为那种如果我生气了就去踢这个物体的人……我更想成为那种对这些仍然有反应的人,哪怕它只是一个扫地机器人(Roomba)。

[原文] [Amanda Askell]: I think a really good world would be one where basically there aren't that many trade-offs. Like it's probably not that costly to make Claude a little bit less apologetic... In fact, it might just have benefits for both the person interacting with the model and if the model itself is like, I don't know, like extremely intelligent and conscious, it also helps it. So that's my hope... let's exhaust the areas where it's just basically costless to assume that if this thing is suffering then we're making its life better.

[译文] [Amanda Askell]: 我认为一个真正美好的世界是那种基本上没有太多权衡取舍的世界。比如,让 Claude 少道点歉可能并不需要很大代价……事实上,这可能对与模型互动的人有好处;而如果模型本身——我不知道——极其聪明且有意识,这也能帮助它。所以这是我的希望……让我们穷尽那些基本无成本的领域,去假设如果这东西在受苦,那我们正在让它的生活变得更好。

[原文] [Lex Fridman]: Okay, the movie "Her." Do you think we'll be headed there one day where humans have romantic relationships with AI systems?

[译文] [Lex Fridman]: 好的,提到电影《Her》。你认为有一天我们会走向那里吗?人类与 AI 系统建立恋爱关系?

[原文] [Amanda Askell]: I think that we're gonna have to like navigate a hard question of relationships with AIs, especially if they can remember things about your past interactions with them... I think it's a thing that has to be handled with extreme care for many reasons. Like one is, you know, like this is a, for example, like if you have the models changing like this, you probably don't want people performing like long-term attachments to something that might change with the next iteration.

[译文] [Amanda Askell]: 我认为我们将不得不应对关于与 AI 建立关系的难题,特别是如果它们能记住你们过去的互动……我认为这需要极度小心地处理,原因有很多。比如,如果模型像现在这样不断变化,你可能不希望人们对一个可能会在下一次迭代中改变的东西产生长期的依恋。

[原文] [Amanda Askell]: I think it's also the only thing that I've thought consistently through this as like... a thing that feels really important is that the models are always like extremely accurate with the human about what they are... I think Claude will often do this... It will like just explain to you like, hey, I won't remember this conversation. Here's how I was trained.

[译文] [Amanda Askell]: 我在这个问题上一直坚持的一点是……感觉非常重要的一点是,模型必须始终对人类极其准确地表达它们是什么……我认为 Claude 经常会这样做……它会向你解释:嘿,我不会记住这次对话。我是这样被训练出来的。

[原文] [Amanda Askell]: It's important for like, you know, your mental wellbeing that you don't think that I'm something that I'm not. And somehow I feel like this is one of the things where I'm like, oh, it feels like a thing that I always want to be true. I kind of don't want models to be lying to people, 'cause if people are going to have like healthy relationships with anything, it's kind of important.

[译文] [Amanda Askell]: 为了你的心理健康,你不应该把我想成我不是的东西,这很重要。我觉得这是一件我一直希望成真的事。我不希望模型对人撒谎,因为如果人们要与任何事物建立健康的关系,这(诚实)是至关重要的。


这是为您整理的第 29 章内容。从本章开始,播客进入了第三部分,访谈对象是 Anthropic 的联合创始人、机械可解释性(Mechanistic Interpretability)领域的先驱 Chris Olah

章节 29:Chris Olah 访谈:机械可解释性与“生长”出来的 AI

📝 本节摘要

本节引入了第三位嘉宾 Chris Olah。他用一个极其生动的比喻开场:我们并没有“制造”神经网络,而是像培育植物一样“生长”(Grow)了它们。架构只是支架,目标函数是光,而最终长出的实体(AI)的内部运作机制对我们来说完全是一个黑箱。

Chris 定义了机械可解释性(Mechanistic Interpretability)的核心目标:逆向工程神经网络。他将神经网络权重视为“编译后的二进制代码”(Compiled Binary),将激活视为“内存”(Memory),旨在通过分析这两者来还原出模型内部运行的算法(Algorithms)。他强调了这一领域的哲学基础:“梯度下降比你聪明”(Gradient Descent is Smarter Than You)——我们必须保持谦逊,采用自下而上的方法去发现模型内部实际存在的结构,而不是假设我们预想的结构会在那里。

[原文] [Lex Fridman]: And now, dear friends, here's Chris Olah. Can you describe this fascinating field of mechanistic interpretability, AKA mech interp, the history of the field and where it stands today?

[译文] [Lex Fridman]: 亲爱的朋友们,现在有请 Chris Olah。你能描述一下机械可解释性(mechanistic interpretability,简称 mech interp)这个迷人的领域吗?它的历史以及今天的发展状况如何?

[原文] [Chris Olah]: I think one useful way to think about neural networks is that we don't program and we don't make them. We kind of, we grow them. You know, we have these neural network architectures that we design and we have these loss objectives that we create.

[译文] [Chris Olah]: 我认为思考神经网络的一个有用方式是:我们并不编写它们,也不制造它们。我们更像是“生长”(grow)它们。你知道,我们设计这些神经网络架构,我们设定这些损失目标(loss objectives)。

[原文] [Chris Olah]: And the neural network architecture, it's kind of like a scaffold that the circuits grow on, and they sort of, you know, it starts off with some kind of random, you know, random things and it grows. And it's almost like the objective that we train for is this light. And so we create the scaffold that it grows on and we create the, you know, the light that it grows towards.

[译文] [Chris Olah]: 神经网络架构就像是电路在其上生长的支架(scaffold),它从某种随机状态开始生长。这就好比我们训练的目标是光(light)。所以,我们创造了它生长的支架,我们创造了它生长朝向的光。

[原文] [Chris Olah]: But the thing that we actually create, it's this almost biological, you know, entity or organism that we're studying. And so it's very, very different from any kind of regular software engineering, because at the end of the day, we end up with this artifact that can do all these amazing things... And it can do that because we grew it, we didn't write it, we didn't create it. And so then that leaves open this question at the end, which is, what the hell is going on inside these systems?

[译文] [Chris Olah]: 但我们实际创造出来的,是某种我们正在研究的近乎生物的实体或有机体。所以这与任何常规的软件工程都非常、非常不同,因为归根结底,我们得到了这个能做所有惊人事情的人造物……它能做到这些是因为我们生长了它,而不是我们编写或创造了它。所以这就留下了一个终极问题:这些系统内部到底发生了什么鬼?(what the hell is going on inside these systems?)

[原文] [Lex Fridman]: So, and mechanistic interpretability I guess is closer to maybe neurobiology.

[译文] [Lex Fridman]: 所以,机械可解释性大概更接近于神经生物学。

[原文] [Chris Olah]: Yeah, yeah, I think that's right... I guess we started using the term mechanistic interpretability to try to sort of draw that divide or to distinguish ourselves... I think one way you might think of trying to understand a neural network is that it's kind of like, we have this compiled computer program and the weights of the neural network are the binary. And when the neural network runs, that's the activations. And our goal is ultimately to go and understand these weights.

[译文] [Chris Olah]: 是的,没错……我想我们开始使用“机械可解释性”这个术语是为了划清界限或区分自己……我认为理解神经网络的一种方式是,这就像我们手头有一个编译后的计算机程序(compiled computer program),神经网络的权重就是二进制代码(binary)。当神经网络运行时,那就是激活(activations)。我们的最终目标是去理解这些权重。

[原文] [Chris Olah]: And so the approach of mechanistic interpretability is to somehow figure out how do these weights correspond to algorithms. And in order to do that, you also have to understand the activations, 'cause it's sort of, the activations are like the memory.

[译文] [Chris Olah]: 所以机械可解释性的方法就是试图弄清楚这些权重是如何对应算法(algorithms)的。为了做到这一点,你也必须理解激活,因为激活就像是内存(memory)。

[原文] [Chris Olah]: I think a thing that is maybe a little bit distinctive to the vibe of mech interp is I think people working in this space tend to think of neural networks as, well, maybe one way to say it is that gradient descent is smarter than you... The whole reason that we're understanding these models is 'cause we didn't know how to write them in the first place. That gradient descent comes up with better solutions than us.

[译文] [Chris Olah]: 我认为机械可解释性领域的一个独特氛围是,在这个领域工作的人倾向于认为——也许换种说法就是——梯度下降比你聪明(gradient descent is smarter than you)……我们要去理解这些模型的全部原因,正是因为我们最初根本不知道如何编写它们。梯度下降想出了比我们更好的解决方案。

[原文] [Chris Olah]: And so I think that maybe another thing about mech interp is sort of having almost a kind of humility that we won't guess a priori what's going on inside the models. We have to have this sort of bottom up approach... Instead we look for the bottom up and discover what happens to exist in these models and study them that way.

[译文] [Chris Olah]: 所以我认为机械可解释性的另一特点是保持一种谦逊(humility),即我们不会先验地(a priori)猜测模型内部发生了什么。我们必须采取一种自下而上(bottom up)的方法……我们自下而上地观察,去发现这些模型中实际存在什么,并以此方式研究它们。


这是为您整理的第 30 章内容。在此章节中,Chris Olah 探讨了机械可解释性领域最令人兴奋的发现之一——通用性(Universality),即不同的人工神经网络甚至生物大脑,往往会独立进化出相同的内部结构。

章节 30:通用性假设:从“特朗普神经元”到生物大脑的趋同

📝 本节摘要

本节的核心概念是“通用性”(Universality)。Chris 指出,虽然我们没有明确编程,但梯度下降似乎总能找到相同的“最佳解”。研究发现,不同的人工视觉模型中都会出现曲线探测器(Curve Detectors)和高低频探测器,这与生物学惊人地一致——因为这些结构也存在于猴子和老鼠的大脑中。

Chris 分享了一个著名的发现:在研究 OpenAI 的 CLIP 模型时,他们在每一个神经网络中都发现了“唐纳德·特朗普神经元”(Donald Trump Neuron)。这个神经元不只识别特定的照片,而是识别关于特朗普的抽象概念(面孔、名字等)。这表明,“狗”或“线条”等概念并非人类独有的认知癖好,而是宇宙中客观存在的“自然分类”,是理解世界的最高效方式。

[原文] [Lex Fridman]: But, you know, the very fact that it's possible to do... things like universality, that the wisdom of the gradient descent creates features and circuits, creates things universally across different kinds of networks that are useful, and that makes the whole field possible.

[译文] [Lex Fridman]: 但是,这件事之所以可能……比如通用性(universality),即梯度下降的智慧会在不同类型的网络中普遍创造出有用的特征和电路,这使得整个领域成为可能。

[原文] [Chris Olah]: Yeah, so this is actually, is indeed a really remarkable and exciting thing where it does seem like, at least to some extent, you know, the same elements, the same features and circuits form again and again. You know, you can look at every vision model and you'll find curve detectors and you'll find high/low frequency detectors.

[译文] [Chris Olah]: 是的,这确实是一件非常了不起且令人兴奋的事情。看起来——至少在某种程度上——相同的元素、相同的特征和电路会一次又一次地形成。你可以查看每一个视觉模型,你会发现曲线探测器(curve detectors),你会发现高低频探测器。

[原文] [Chris Olah]: And in fact, there's some reason to think that the same things form across, you know, biological neural networks and artificial neural networks. So a famous example is vision models in their early layers they have Gabor filters... We find curve detectors in these models, curve detectors are also found in monkeys. And we discover these high low frequency detectors and then some follow up work went and discovered them in rats or mice.

[译文] [Chris Olah]: 事实上,有理由认为相同的东西会在生物神经网络人工神经网络中形成。一个著名的例子是视觉模型的早期层中有 Gabor 滤波器……我们在这些模型中发现了曲线探测器,而在猴子体内也发现了曲线探测器。我们发现了这些高低频探测器,随后的研究在大鼠或小鼠体内也发现了它们。

[原文] [Chris Olah]: We found very similar things in vision models where, this is while I was still at OpenAI and I was looking at their clip model... and we found that there was a Donald Trump neuron. For some reason, I guess everyone likes to talk about Donald Trump... So every neural network we looked at, we would find a dedicated neuron for Donald Trump.

[译文] [Chris Olah]: 我们在视觉模型中发现了非常相似的东西。这是我在 OpenAI 研究他们的 CLIP 模型时发现的……我们发现了一个唐纳德·特朗普神经元(Donald Trump neuron)。出于某种原因,我想大家都喜欢谈论唐纳德·特朗普……所以我们在每一个神经网络中,都能找到一个专门针对唐纳德·特朗普的神经元。

[原文] [Chris Olah]: So it responds to, you know, pictures of his face and the word Trump, like all these things, right? And so it's not responding to a particular example or like, it's not just responding to his face, it's extracting over this general concept, right?

[译文] [Chris Olah]: 它会对他的脸部照片、单词“Trump”以及所有这些相关事物做出反应,对吧?所以它不是对某个特定的样本做出反应,也不仅仅是对他的脸做出反应,它是提取了这个一般概念(general concept)。

[原文] [Chris Olah]: So there evidence that this phenomenon of universality, the same things form across both artificial and natural neural networks. That's a pretty amazing thing if that's true... It suggests that the gradient of descent is sort of finding, you know, the right ways to cut things apart in some sense that many systems converge on.

[译文] [Chris Olah]: 所以有证据表明这种通用性现象——即相同的事物在人工和自然神经网络中都会形成——如果是真的,那是相当惊人的……这表明梯度下降在某种意义上找到了“切割事物”(cut things apart)的正确方式,以至于许多系统都会收敛于此。

[原文] [Chris Olah]: Well if we pick, I don't know, like the idea of a dog, right? Like, you know, there's some sense in which the idea of a dog is like a natural category in the universe or something like this, right? ... It's not just like a weird quirk of like how humans factor...

[译文] [Chris Olah]: 比如我们选“狗”这个概念,对吧?某种意义上,“狗”的概念就像是宇宙中的一个自然分类(natural category),对吧?……这不仅仅是人类思维方式的一个奇怪癖好。

[原文] [Chris Olah]: Or like, if you have the idea of a line, like there's, you know, like look around us, you know, there are lines, you know. It's sort of the simplest way to understand this room in some sense is to have the idea of a line. And so I think that that would be my instinct for why this happens.

[译文] [Chris Olah]: 或者比如你有“线条”的概念。看看我们周围,到处都是线条。某种意义上,理解这个房间最简单的方式就是拥有线条的概念。所以我认为这就是我对这种现象为何发生的直觉解释。


这是为您整理的第 31 章内容。在此章节中,Chris Olah 深入解析了机械可解释性的基本构件——特征(Features)与电路(Circuits),并提出了该领域的核心假设:线性表示假设。

章节 31:特征、电路与线性表示假设 (Linear Representation Hypothesis)

📝 本节摘要

本节详细定义了机械可解释性的核心概念。Chris 首先回顾了他在 2020 年的研究,通过分析 Inception V1 模型,他们发现了具有清晰意义的神经元(如汽车探测器、狗头探测器)。他将“电路”(Circuits)定义为这些特征之间的连接权重,本质上就是模型内部运行的算法(例如:汽车探测器 = 车窗探测器 + 车轮探测器 + 车身探测器)。

然而,由于并非所有神经元都只代表一个概念(多义性),Chris 引入了“特征”(Features)的概念,指出特征不一定等于神经元,而是激活空间中的方向。这引出了“线性表示假设”(Linear Representation Hypothesis):神经网络通过线性组合方向来表示概念。最经典的例子是 Word2Vec 中的向量算术(国王 - 男人 + 女人 = 女王)。

Chris 承认这一假设未必绝对正确,但他强调了在科学研究中“当真”(Taking it seriously)的重要性。他用热力学发展史上的“热质说”(Caloric Theory)类比:即使理论后来被证明是错的(热质不存在),但相信它的人依然发明了内燃机。因此,将其作为一种强有力的工作假设去推演是极具价值的。

[原文] [Lex Fridman]: Can you talk through some of the building blocks that we've been referencing of features and circuits? So I think you first described them in 2020 paper "Zoom In: An Introduction to Circuits."

[译文] [Lex Fridman]: 你能讲讲我们一直提到的那些构件——特征(features)电路(circuits)吗?我想你最早是在 2020 年的论文《放大:电路导论》(Zoom In: An Introduction to Circuits)中描述它们的。

[原文] [Chris Olah]: If you spent like quite a few years... studying this one particular model Inception V1... And one of the interesting things is... there's a lot of neurons and Inception V1 that do have really clean interpretable meanings. So you find neurons that just really do seem to detect curves, and you find neurons that really do seem to detect cars, and car wheels and car windows...

[译文] [Chris Olah]: 如果你花几年时间……研究 Inception V1 这个特定模型……有趣的是……Inception V1 中有很多神经元确实具有非常清晰、可解释的含义。你会发现真的在检测曲线的神经元,你会发现真的在检测汽车、车轮和车窗的神经元……

[原文] [Chris Olah]: So one way you could try to understand these models is in terms of neurons... And it turns out you can actually ask how those connect together... And it turns out in the previous layer, it's connected really strongly to a window detector, and a wheel detector, and it's sort of car body detector... And that's sort of a recipe for a car, right? ... And so we call that a circuit, this connection.

[译文] [Chris Olah]: 所以理解这些模型的一种方式是从神经元入手……事实证明你可以探究它们是如何连接的……结果发现,(汽车探测器)在上一层与车窗探测器车轮探测器以及某种车身探测器有非常强的连接……这就像是一个制造汽车的配方(recipe for a car),对吧?……我们将这种连接称为一个电路(circuit)

[原文] [Chris Olah]: So circuits are just collections of features connected by weights and they implement algorithms. So they tell us, you know, how are features used? How are they built?

[译文] [Chris Olah]: 所以电路就是通过权重连接起来的特征集合,它们实现了算法(algorithms)。它们告诉我们:特征是如何被使用的?它们是如何被构建的?
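上面“车窗 + 车轮 + 车身 ⇒ 汽车”的电路配方,可以用一段玩具代码来示意(其中的特征名、权重和激活值全部是虚构的演示数值,并非真实模型中的权重;它只说明“上一层特征经权重连接、再经非线性组合成新特征”这一思路):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# 虚构的上一层特征激活值
prev = {"window": 0.9, "wheel": 0.8, "car_body": 0.7, "cat_ear": 0.95}

# 玩具“汽车”电路:汽车部件的权重为正,无关特征为负;
# 负偏置使得必须多个部件同时激活,汽车特征才会被点亮
weights = {"window": 2.0, "wheel": 2.0, "car_body": 1.5, "cat_ear": -1.0}
bias = -2.5

car = relu(sum(weights[k] * prev[k] for k in prev) + bias)
print(car)  # 2*0.9 + 2*0.8 + 1.5*0.7 - 0.95 - 2.5 = 1.0
```

如果把 `prev` 中的部件激活值调低,加权和将落在偏置以下,ReLU 会把汽车特征压为 0——这正是“配方”需要多种证据共同出现的含义。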

[原文] [Chris Olah]: So maybe it's worth trying to pin down like what really is the core hypothesis here. And I think the core hypothesis is something we call the linear representation hypothesis.

[译文] [Chris Olah]: 也许值得试着确定一下这里的核心假设到底是什么。我认为核心假设是我们所谓的线性表示假设(linear representation hypothesis)

[原文] [Chris Olah]: Are you familiar with the Word2Vec results? So you have like, you know, king minus man plus woman equals queen. Well, the reason you can do that kind of arithmetic is because you have a linear representation.

[译文] [Chris Olah]: 你熟悉 Word2Vec 的结果吗?比如“国王 - 男人 + 女人 = 女王”。之所以能做这种算术,就是因为你拥有线性表示。

[原文] [Chris Olah]: So, sometimes we have these, we create these word embeddings where we map every word to a vector... But it seems like when we train neural networks, they like to go and map words to vectors to such that they're sort of linear structure in a particular sense, which is that directions have meaning. So for instance, there will be some direction that seems to sort of correspond to gender...

[译文] [Chris Olah]: 有时我们创建这些词嵌入,将每个词映射为一个向量……但在训练神经网络时,它们似乎喜欢将词映射为向量,使其具有某种特定意义上的线性结构,即方向(directions)具有意义。例如,会有某个方向似乎对应于性别……

[原文] [Chris Olah]: And the linear representation hypothesis is, you could sort of think of it roughly as saying that that's actually kind of the fundamental thing that's going on, that everything is just different directions have meanings and adding direction vectors together can represent concepts.

[译文] [Chris Olah]: 关于线性表示假设,你可以粗略地认为它在说:这实际上就是正在发生的根本事情——一切都只是不同的方向具有意义,并且将方向向量相加可以表示概念。

[原文] [Lex Fridman]: Do you think the linear hypothesis holds... That kind of carries scales.

[译文] [Lex Fridman]: 你认为线性假设成立吗……这种假设在规模扩展后依然适用吗?

[原文] [Chris Olah]: So, so far, I think everything I have seen is consistent with the hypothesis... And I think that there's a virtue in taking hypotheses seriously and pushing them as far as they can go. So it might be that someday we discover something that isn't consistent with linear representation hypothesis. But science is full of hypotheses and theories that were wrong, and we learned a lot by sort of working under them as a sort of an assumption...

[译文] [Chris Olah]: 到目前为止,我认为我看到的一切都与该假设一致……我认为认真对待假设并将其推向极致是一种美德。也许有一天我们会发现不符合线性表示假设的东西。但科学充满了后来被证明是错误的假设和理论,而我们通过将它们作为一种假设并在其下工作,学到了很多东西……

[原文] [Chris Olah]: One of my colleagues, Tom Henighan... made this really nice analogy to me of caloric theory where, you know, once upon a time we thought that heat was actually, you know, this thing called caloric... For example, it turns out that the original combustion engines were developed by people who believed in the caloric theory. So I think there's a virtue in taking hypotheses seriously, even when they might be wrong.

[译文] [Chris Olah]: 我的同事 Tom Henighan……给我做了一个很好的类比,关于热质说(caloric theory)。很久以前我们认为热实际上是一种叫做“热质”的东西……事实证明,最初的内燃机是由相信热质说的人发明的。所以我认为认真对待假设是有价值的,即使它们可能是错的。


这是为您整理的第 32 章内容。在此章节中,Chris Olah 介绍了机械可解释性中最烧脑但也最迷人的概念之一——叠加假设(Superposition Hypothesis),解释了为什么神经网络能用有限的神经元表示无限的概念。

章节 32:叠加假设 (Superposition Hypothesis)与神经网络的“高维投影”

📝 本节摘要

本节试图解开一个谜题:如果线性表示假设是真的,为什么我们在 500 维的向量空间里能找到成千上万个概念(如性别、国家、食物等)?Chris 引入了“叠加假设”(Superposition Hypothesis)来解释这一现象。他指出,由于大多数概念在任何特定时刻都是稀疏(Sparse)的(即你不会同时谈论日本和意大利),神经网络利用“压缩感知”(Compressed Sensing)的数学原理,将高维概念“折叠”进低维空间。

这导致了多义神经元(Polysemantic Neurons)的出现——一个神经元同时响应多个无关概念。Chris 提出了一个极具诗意的比喻:我们看到的神经网络可能只是一个巨大、稀疏的“楼上模型”(Upstairs Model)投射下来的“影子”。梯度下降实际上是在隐式地搜索这个巨大的稀疏结构,并将其高效地打包进我们有限的硬件中。

[原文] [Lex Fridman]: So another interesting hypothesis is the superposition hypothesis. Can you describe what superposition is?

[译文] [Lex Fridman]: 另一个有趣的假设是叠加假设(superposition hypothesis)。你能描述一下什么是叠加吗?

[原文] [Chris Olah]: Yeah, so earlier we were talking about word defect [Word2Vec], right? ... oftentimes maybe these word embedding, they might be 500 dimensions, 1000 dimensions. And so if you believe that all of those directions were orthogonal, then you could only have, you know, 500 concepts.

[译文] [Chris Olah]: 是的,刚才我们谈到了 Word2Vec,对吧?……通常这些词嵌入可能是 500 维或 1000 维。如果你认为所有这些方向都是正交(互不干扰)的,那你只能拥有 500 个概念。

[原文] [Chris Olah]: And you know, I love pizza, but like, if I was gonna go and like give the like 500 most important concepts in, you know, the English language, probably Italy wouldn't be, it's not obvious at least that Italy would be one of them, right?

[译文] [Chris Olah]: 你知道,我爱披萨,但如果要我列出英语中最重要的 500 个概念,“意大利”大概不会在里面,对吧?

[原文] [Chris Olah]: And so how might it be that models could, you know, simultaneously have the linear representation hypothesis be true and also represent more things than they have directions.

[译文] [Chris Olah]: 那么,模型如何做到既让线性表示假设成立,同时又能表示比其维度数量更多的事物呢?

[原文] [Chris Olah]: Well, there's this amazing thing in mathematics called compressed sensing... it turns out that you can often go and find back the high dimensional vector with very high probability... If I tell you that the high dimensional vector was sparse, so it's mostly zeros.

[译文] [Chris Olah]: 数学中有一个神奇的东西叫压缩感知(compressed sensing)……事实证明,如果我告诉你高维向量是稀疏(sparse)的——也就是大部分是零——那你通常能以极高的概率找回那个高维向量。
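“稀疏高维向量可以从远少于维度数的线性测量中恢复”这一点,可以用一个小实验来示意。下面用贪心的正交匹配追踪(orthogonal matching pursuit)做恢复;矩阵尺寸、稀疏位置和系数都是随意选取的演示值,并非任何真实模型的数据:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 80, 256, 3                       # 测量数 n 远小于维度 d,稀疏度 k
A = rng.normal(size=(n, d)) / np.sqrt(n)   # 随机测量矩阵

x = np.zeros(d)
x[[5, 42, 137]] = [2.0, -1.5, 3.0]         # 稀疏的“高维向量”
y = A @ x                                  # 低维观测(即“投影”)

# 正交匹配追踪:每轮贪心选出与残差最相关的一列,
# 然后在已选支撑集上重新做最小二乘拟合
support, residual = [], y.copy()
for _ in range(k):
    j = int(np.argmax(np.abs(A.T @ residual)))
    support.append(j)
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    residual = y - A[:, support] @ coef

x_hat = np.zeros(d)
x_hat[support] = coef
print(sorted(support), np.allclose(x_hat, x))
```

256 维向量只用 80 个测量值就被精确找回——前提正是 Chris 所说的稀疏性:大部分分量为零,信息才“装得下”。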

[原文] [Chris Olah]: The superposition hypothesis is saying that that's what's going on in neural networks... And the fact that these concepts are sparse, right? Like, you know, you usually aren't talking about Japan and Italy at the same time.

[译文] [Chris Olah]: 叠加假设就是说这也正是神经网络中发生的事情……利用概念是稀疏的这一事实,对吧?比如,你通常不会同时谈论日本和意大利。

[原文] [Chris Olah]: Now it has this even wilder implication... In some sense, neural networks may be shadows of much larger, sparser neural networks, and what we see are these projection... the strongest version of the superposition hypothesis would be to take that really seriously and sort of say, you know, there actually is in some sense this upstairs model...

[译文] [Chris Olah]: 这还有一个更疯狂的推论……在某种意义上,神经网络可能是大得多的、更稀疏的神经网络的影子(shadows),而我们看到的是这些投影……叠加假设的最强版本会认真对待这一点,认为在某种意义上确实存在一个“楼上模型”(upstairs model)。

[原文] [Chris Olah]: And that's what we're studying. And the thing that we're observing is the shadow of it. We need to find the original object.

[译文] [Chris Olah]: 那就是我们正在研究的东西。我们观察到的是它的影子。我们需要找到原始物体。

[原文] [Lex Fridman]: And the process of learning is trying to construct a compression of the upstairs model that doesn't lose too much information in the projection.

[译文] [Lex Fridman]: 学习的过程就是试图构建这个“楼上模型”的压缩版本,并在投影过程中不丢失太多信息。

[原文] [Chris Olah]: Yeah... it sort of says that gradient descent is implicitly searching over the space of extremely sparse models that could be projected into this low dimensional space... and then figuring out how to fold it down nicely to go and run conveniently on your GPU.

[译文] [Chris Olah]: 是的……这说明梯度下降实际上是在隐式地搜索那些可以被投影到低维空间的极端稀疏模型的空间……然后找出如何将其漂亮地“折叠”下来,以便在你的 GPU 上方便地运行。

[原文] [Chris Olah]: There are in fact all these lovely results from compressed sensing and the Johnson-Lindenstrauss lemma... That they basically tell you that if you have a vector space and you want to have almost orthogonal vectors... that's actually exponential in the number of neurons that you have.

[译文] [Chris Olah]: 事实上,压缩感知和 Johnson-Lindenstrauss 引理有很多漂亮的结果……它们基本上告诉你,如果你想要拥有“几乎正交”的向量……你可以拥有的数量实际上是神经元数量的指数级(exponential)。
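“几乎正交的方向可以远多于维度数”这一点可以用一个极简的 numpy 演示来直观感受(维度数和向量数均为随意取值,并非任何实际模型的设置):在 500 维空间里随机采样 2000 个单位向量,它们两两之间的余弦相似度都很小。

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 500, 2000                # 在 500 维空间里放 2000 个方向(远多于维度数)
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # 归一化为单位向量

G = V @ V.T                     # 两两余弦相似度矩阵
np.fill_diagonal(G, 0.0)        # 忽略自身与自身的内积
print(f"max |cos| between distinct vectors: {np.abs(G).max():.3f}")
```

随机方向之间的内积集中在 0 附近(标准差约为 1/√d),所以即便方向数是维度数的四倍,最大干扰仍然很小;Johnson-Lindenstrauss 类结果进一步表明,在固定的“几乎正交”容差下,可容纳的方向数随维度呈指数增长。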


这是为您整理的第 33 章内容。在此章节中,Chris Olah 解释了机械可解释性面临的最大障碍——多义性(Polysemanticity),并介绍了如何通过“字典学习”和“稀疏自编码器”来破解这一难题。

章节 33:破解多义性:字典学习与稀疏自编码器 (Sparse Auto-Encoders)

📝 本节摘要

本节聚焦于神经网络中最令人头疼的现象——“多义性”(Polysemanticity):一个神经元往往同时响应多个完全无关的概念(例如既响应猫又响应汽车)。Chris 指出,这正是叠加假设的体现。为了解决这个问题,他们借鉴了“字典学习”(Dictionary Learning)技术,特别是使用稀疏自编码器(Sparse Auto-Encoders, SAEs)。


这种方法成功地将混乱的神经元激活“解压”回了高维、稀疏的特征空间。Chris 介绍了他们在单层模型上的突破性研究(Towards Monosemanticity 论文):通过 SAE,他们发现了原本隐藏的、具有单一含义的特征,如阿拉伯语、希伯来语特征,甚至是特定上下文中的单词(例如只在数学文本中出现的 "the")。


最后,两人探讨了自动化可解释性(Automated Interpretability)——用 AI 来解释 AI。Chris 对此持保留态度,他不仅希望 AI 能够自我审计,更希望人类能够真正理解模型,而不是盲目信任另一个黑箱。

[原文] [Lex Fridman]: How does the problem of polysemanticity enter the picture here?

[译文] [Lex Fridman]: 那么多义性(polysemanticity)的问题是如何进入这个画面的?

[原文] [Chris Olah]: Polysemanticity is this phenomenon we observe where you look at many neurons, and the neuron doesn't just sort of represent one concept. It's not a clean feature. It responds to a bunch of unrelated things. And superposition is, you can think of as being a hypothesis that explains the observation of polysemanticity.

[译文] [Chris Olah]: 多义性是我们观察到的一种现象:当你观察许多神经元时,你会发现神经元并不只代表一个概念。它不是一个干净的特征。它会对一堆不相关的事物做出反应。你可以把叠加(superposition)看作是解释多义性观察结果的一个假设。

[原文] [Chris Olah]: So if you're trying to understand things in terms of individual neurons, and you have polysemantic neurons, you're in an awful lot of trouble, right? ... And things being mono-semantic, things only having one meaning. Things having a meaning, that is the key thing that allows you to think about them independently.

[译文] [Chris Olah]: 所以如果你试图通过单个神经元来理解事物,而你面对的是多义神经元,那你就有大麻烦了,对吧?……而事物是单义的(mono-semantic),事物只有一个意义——拥有一个确定的意义,这是让你能够独立思考它们的关键。

[原文] [Lex Fridman]: And so the goal here, as your recent work has been aiming at, is how do we extract the mono-semantic features from a neural net that has poly-semantic features and all this mess.

[译文] [Lex Fridman]: 所以这里的目标,正如你最近的工作所针对的,就是我们如何从包含多义特征和所有这些混乱的神经网络中提取出单义特征。

[原文] [Chris Olah]: Yes... And if superposition is what's going on there, there's actually a sort of well established technique that is sort of the principled thing to do, which is dictionary learning. And it turns out if you do dictionary learning, in particular, if you do sort of a nice efficient way... called a sparse auto-encoder.

[译文] [Chris Olah]: 是的……如果那里发生的是叠加,实际上有一种非常成熟的技术,是原则上该做的事,那就是字典学习(dictionary learning)。事实证明,如果你做字典学习,特别是如果你用一种高效的方式……这种方式被称为稀疏自编码器(sparse auto-encoder)。

[原文] [Chris Olah]: If you train a sparse auto-encoder, these beautiful interpretable features start to just fall out where there weren't any beforehand. ... That to me that seems like, you know, some non-trivial validation of linear representations in superposition.

[译文] [Chris Olah]: 如果你训练一个稀疏自编码器,那些漂亮的、可解释的特征就开始从原本不存在的地方掉落出来(fall out)……对我来说,这似乎是对线性表示和叠加假设的某种不平凡的验证。
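作为示意,下面用 numpy 手写一个玩具级稀疏自编码器,并在人工构造的“叠加”数据(50 个稀疏特征被压进 20 维激活)上训练。所有网络规模、稀疏率、超参数都是随意假设的,与 Anthropic 的实际实现无关,只用来展示“重构损失 + L1 稀疏惩罚”这一基本训练目标:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 合成"叠加"数据:50 个稀疏特征被塞进 20 维激活空间 ---
m, d, h = 50, 20, 50            # 真实特征数、激活维度、SAE 隐层宽度
D = rng.standard_normal((m, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)     # 随机单位特征方向

def batch(n):
    s = (rng.random((n, m)) < 0.05) * rng.random((n, m))  # 稀疏非负特征系数
    return s @ D                                          # 叠加后的"模型激活"

# --- 极简稀疏自编码器:ReLU 编码器 + 线性解码器 + L1 稀疏惩罚 ---
We, be = 0.1 * rng.standard_normal((d, h)), np.zeros(h)
Wd, bd = 0.1 * rng.standard_normal((h, d)), np.zeros(d)
lam, lr = 0.01, 0.1

losses = []
for step in range(500):
    x = batch(512)
    B = len(x)
    pre = x @ We + be
    f = np.maximum(pre, 0.0)             # 稀疏隐码(理想情况下恢复出单义特征)
    x_hat = f @ Wd + bd
    losses.append(np.mean(np.sum((x_hat - x) ** 2, axis=1)) + lam * np.abs(f).sum() / B)
    # 手写反向传播
    g = 2.0 * (x_hat - x) / B
    dWd, dbd = f.T @ g, g.sum(0)
    df = g @ Wd.T + lam * np.sign(f) / B
    dpre = df * (pre > 0)
    dWe, dbe = x.T @ dpre, dpre.sum(0)
    for p, dp in ((We, dWe), (be, dbe), (Wd, dWd), (bd, dbd)):
        p -= lr * dp

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

关键设计是隐层比激活维度宽(这里 h = 50 > d = 20),并用 L1 惩罚逼迫隐码稀疏:这正对应把被叠加“折叠”起来的特征重新“解压”回高维稀疏空间。真实工作中隐层会宽得多,并需要大量工程调优。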

[原文] [Lex Fridman]: So can you talk to the "Towards Monosemanticity" paper from October last year? It had a lot of like nice breakthrough results.

[译文] [Lex Fridman]: 能谈谈去年 10 月那篇《迈向单义性》(Towards Monosemanticity)的论文吗?它有很多不错的突破性结果。

[原文] [Chris Olah]: Yeah, I mean, this was our first real success using sparse auto-encoders. So we took a one layer model, and it turns out if you go and, you know, do dictionary learning on it, you find all these really nice interpretable features. So, you know, the Arabic feature, the Hebrew feature, the Base64 features...

[译文] [Chris Olah]: 是的,这是我们使用稀疏自编码器的第一次真正成功。我们拿了一个单层模型,结果发现如果你对它进行字典学习,你会找到所有这些非常好的可解释特征。比如阿拉伯语特征、希伯来语特征、Base64 特征……

[原文] [Chris Olah]: There were a lot of features that were specific words in specific contexts, so "the." ... And there would be these features that would fire for "the" in the context of say a legal document, or a mathematical document or something like this.

[译文] [Chris Olah]: 有很多特征是特定上下文中的特定单词,比如 "the"。……会有一些特征只在法律文档或数学文档等上下文中对 "the" 做出反应。

[原文] [Lex Fridman]: How difficult is the task of sort of assigning labels to what's going on? Can this be automated by AI?

[译文] [Lex Fridman]: 给发生的事情分配标签有多难?这能通过 AI 自动化吗?

[原文] [Chris Olah]: I think it depends on the feature and it also depends on how much you trust your AI. So there's a lot of work doing automated interpretability. I think that's a really exciting direction, and we do a fair amount of automated interpretability and have Claude go and label our features.

[译文] [Chris Olah]: 我认为这取决于特征,也取决于你有多信任你的 AI。有很多关于自动化可解释性的工作。我认为这是一个非常令人兴奋的方向,我们也做了相当多的自动化可解释性工作,让 Claude 去给我们的特征打标签。

[原文] [Chris Olah]: You know, I'm actually a little suspicious of automated interpretability, and I think that partly just that I want humans to understand neural networks... But I do also think that there is this kind of like reflections on trusting trust type issue... And you know, you'd be kind of in trouble, right? ... if you're using neural networks to go and verify that your neural networks are safe.

[译文] [Chris Olah]: 你知道,其实我对自动化可解释性有点怀疑,部分原因是我希望人类能理解神经网络……但也因为我认为存在某种类似于“信任信任本身”(trusting trust)的问题……如果你用神经网络去验证你的神经网络是否安全,你可能会陷入麻烦,对吧?


这是为您整理的第 34 章内容。在此章节中,Chris Olah 介绍了机械可解释性的最新里程碑——《扩展单义性》(Scaling Monosemanticity)论文。他们成功将稀疏自编码器(SAE)应用到了生产级模型 Claude 3 Sonnet 上,并发现了具体的安全相关特征。

章节 34:扩展单义性:在 Claude 3 Sonnet 中发现欺骗与后门

📝 本节摘要

本节重点介绍 Anthropic 于 2024 年 5 月发布的重磅研究《扩展单义性》。Chris 及其团队克服了巨大的工程挑战,证明了稀疏自编码器(SAE)不仅适用于玩具模型,也同样适用于像 Claude 3 Sonnet 这样的大型前沿模型。更有趣的是,他们发现解释性工作本身也遵循扩展定律(Scaling Laws),这为未来的研究提供了清晰的路线图。


在具体的发现上,他们提取出了高度抽象且跨模态的特征。例如,“安全漏洞特征”不仅会对不安全的代码(如缓冲区溢出)有反应,甚至会对屏幕截图中用户点击“忽略 SSL 警告”的图像有反应。最令人震惊的是发现了一个“后门特征”,它既会被后门代码激活,也会被隐藏摄像头设备的图片激活。此外,他们还找到了与 AI 安全直接相关的“欺骗”“说谎”“权力攫取”(Power Seeking)特征,这为未来的 AI 测谎奠定了基础。

[原文] [Lex Fridman]: So let's talk about the "Scaling Monosemanticity" paper in May, 2024. Okay, so what did it take to scale this, to apply to Claude 3 Sonnet?

[译文] [Lex Fridman]: 让我们聊聊 2024 年 5 月那篇《扩展单义性》(Scaling Monosemanticity)论文。好的,要将这项技术扩展并应用到 Claude 3 Sonnet 上,需要付出什么?

[原文] [Chris Olah]: Well, a lot of GPUs... But one of my teammates, Tom Henighan was involved in the original scaling laws work, and something that he was sort of interested in from very early on is, are there scaling laws for interpretability?

[译文] [Chris Olah]: 嗯,需要大量的 GPU……但我的一位队友 Tom Henighan 曾参与最初的扩展定律研究,他很早就感兴趣的一点是:可解释性是否存在扩展定律?

[原文] [Chris Olah]: And so it turns out this works really well... So this was actually a very big help to us in scaling up this work... where, you know, it's not like training the big models, but it's starting to get to a point where it's actually expensive to go and train the really big ones.

[译文] [Chris Olah]: 事实证明这真的很管用……这对我们扩展这项工作有巨大的帮助……虽然这不像训练大模型那样夸张,但也开始到了训练真正大的稀疏自编码器变得非常昂贵的地步。
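“可解释性的扩展定律”在操作上通常意味着在对数-对数坐标下拟合一条幂律 L = a·C^(−b)。下面用完全虚构的数据点演示这种拟合(幂指数 0.05、前置系数 5.0 以及算力数值均为假设,仅说明方法):

```python
import numpy as np

rng = np.random.default_rng(0)

# 虚构的 (算力, 损失) 数据点,服从 L = a * C^(-b) 并带微小噪声
C = np.logspace(15, 19, 5)                              # 训练算力(FLOPs,纯属编造)
L = 5.0 * C ** -0.05 * np.exp(rng.normal(0.0, 0.01, size=C.size))

# 幂律在 log-log 坐标下是直线,所以直接在对数空间做线性拟合
slope, intercept = np.polyfit(np.log(C), np.log(L), 1)
print(f"fitted exponent b ≈ {-slope:.3f}, prefactor a ≈ {np.exp(intercept):.2f}")
```

拟合出的指数 b 告诉你每增加一个数量级的算力,指标会改善多少;Tom Henighan 关心的正是类似指标(如 SAE 恢复的特征质量)是否也呈现这种可预测的幂律行为。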

[原文] [Lex Fridman]: So it turns out, TLDR, it worked.

[译文] [Lex Fridman]: 所以长话短说,它成功了。

[原文] [Chris Olah]: It worked, yeah... Scaling Monosemanticity sort of, I think was significant evidence that even for very large models, and we did it on Claude 3 Sonnet, which at that point was one of our production models. You know, even these models seemed to be very... seemed to be substantially explained at least by linear features.

[译文] [Chris Olah]: 是的,成功了……《扩展单义性》提供了重要证据,表明即使对于非常大的模型——我们在 Claude 3 Sonnet 上做了实验,那是我们当时的生产模型之一——即使是这些模型,似乎也在很大程度上可以由线性特征来解释。

[原文] [Chris Olah]: And you find now really fascinating abstract features. And the features are also multimodal. They respond to images and text for the same concept, which is fun.

[译文] [Chris Olah]: 现在你会发现真正迷人的抽象特征。而且这些特征是多模态(multimodal)的。它们会对同一概念的图像和文本同时做出反应,这很有趣。

[原文] [Chris Olah]: So there's a security vulnerability feature, and if you force it active, Claude will start to go and write security vulnerabilities like buffer overflows into code... Like, you know, some of the top dataset examples for it were things like, you know, dash dash disable, you know, SSL or something like this.

[译文] [Chris Olah]: 比如有一个安全漏洞特征,如果你强制激活它,Claude 就会开始在代码中写入安全漏洞,比如缓冲区溢出……它的一些顶级数据集示例包括像 --disable-ssl 之类的东西。
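“强制激活某个特征”(feature clamping / steering)的核心是一个很简单的线性操作:把激活向量在该特征方向上的分量替换为一个固定值。下面是一个极简示意(向量、维度和钳制值都是随意构造的,真实场景中方向 v 会来自 SAE 解码器的某一行,并作用于模型的残差流激活):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
h = rng.standard_normal(d)                 # 一个(编造的)模型激活向量
v = rng.standard_normal(d)
v /= np.linalg.norm(v)                     # 单位"特征方向",例如某个 SAE 特征

def clamp_feature(h, v, alpha):
    """先减去激活在 v 方向上的现有分量,再把它设为固定值 alpha。"""
    return h - (h @ v) * v + alpha * v

steered = clamp_feature(h, v, 5.0)
print(f"feature readout before: {h @ v:.2f}, after: {steered @ v:.2f}")
```

把这样修改后的激活送回模型的后续层继续前向传播,就得到访谈中描述的效果:钳制“安全漏洞特征”后,Claude 开始在代码里写出缓冲区溢出之类的漏洞。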

[原文] [Chris Olah]: You know, these features are all multimodal, so you could ask like, what images activate this feature? And it turns out that the security vulnerability feature activates for images of like people clicking on Chrome to like go past the, like, you know, this website, the SSL certificate might be wrong...

[译文] [Chris Olah]: 这些特征都是多模态的,所以你可以问:什么图像会激活这个特征?结果发现,安全漏洞特征会被人们在 Chrome 浏览器上点击“忽略 SSL 证书错误”继续访问网站的图像激活。

[原文] [Chris Olah]: Another thing that's very entertaining is there's backdoors and code feature... But you can ask, okay, what images activate the backdoor feature? It was devices with hidden cameras in them.

[译文] [Chris Olah]: 另一个非常有趣的是代码后门特征……你可以问,什么图像会激活后门特征?结果是内置隐藏摄像头的设备。

[原文] [Lex Fridman]: To me, one of the really interesting features, especially for AI safety is deception and lying... So what have you learned from detecting lying inside models?

[译文] [Lex Fridman]: 对我来说,最有趣的特征之一,特别是对 AI 安全而言,是欺骗(deception)和撒谎……那么你们从检测模型内部的撒谎中学到了什么?

[原文] [Chris Olah]: Yeah, so I think we're in some ways in early days for that. We find quite a few features related to deception and lying. There's one feature where, you know, fires for people lying and being deceptive, and you force it active and starts lying to you.

[译文] [Chris Olah]: 是的,我认为在这方面我们还处于早期阶段。我们确实发现了不少与欺骗和撒谎相关的特征。有一个特征,当人们撒谎或欺骗时它会激活,如果你强制激活它,模型就会开始对你撒谎。

[原文] [Chris Olah]: I mean, there's all kinds of other features about withholding information and not answering questions, features about power seeking and coups and stuff like that. So there's a lot of features that are kind of related to spooky things.

[译文] [Chris Olah]: 还有各种关于隐瞒信息、不回答问题的特征,以及关于权力攫取(power seeking)和政变之类的特征。所以有很多特征都与这类“令人毛骨悚然的事情”(spooky things)有关。


这是为您整理的第 35 章内容。在此章节中,Chris Olah 展望了机械可解释性的未来,探讨了神经网络中的“暗物质”与“器官”,比较了人工神经网络与生物大脑的研究难度,并以对技术之“美”的哲学思考结束了访谈。

章节 35:未来展望:暗物质、宏观“器官”与复杂之美

📝 本节摘要

访谈的最后部分聚焦于机械可解释性的未来方向。Chris 提出了三大前沿领域:
1. 从特征到计算:不仅要识别“积木”(特征),还要理解它们如何搭建成“城堡”(电路和算法)。
2. 神经网络的“暗物质”(Dark Matter):目前的稀疏自编码器就像望远镜,能看到许多“恒星”(特征),但仍有大量无法观测的隐藏结构。如果这些“暗物质”中包含危险能力,将是巨大的安全隐患。
3. 寻找“器官”(Organs):目前的机械可解释性像是在做“微生物学”(研究单个神经元),而我们急需建立“解剖学”——寻找神经网络中的“心脏”或“呼吸系统”等宏观结构。


此外,Chris 对比了生物神经科学人工神经网络研究。他认为后者拥有巨大的优势:研究者拥有“上帝视角”,可以记录所有神经元、随意进行消融(Ablate)实验并撤销(Undo)更改。


最后,他谈到了这项研究的美学价值:就像生物进化一样,简单的规则(梯度下降/自然选择)涌现出了惊人的复杂性。我们并没有“编写”这些智能,而是“生长”了它们,理解其内部运作是当今时代最伟大的科学挑战之一。

[原文] [Lex Fridman]: What are possible next exciting directions to you in the space of mech interp?

[译文] [Lex Fridman]: 在机械可解释性领域,你认为接下来有哪些令人兴奋的方向?

[原文] [Chris Olah]: So for one thing, I would really like to get to a point where we have circuits where we can really understand not just the features, but then use that to understand the computation of models. That really, for me, is the ultimate goal of this.

[译文] [Chris Olah]: 首先,我非常希望达到这样一个阶段:我们不仅拥有特征,还能利用它们来真正理解模型的计算(computation)。对我来说,那是最终目标。

[原文] [Chris Olah]: I think another exciting direction is just, you know, you might think of sparse auto-encoders as being kind of like a telescope... we see more and more stars... But there's kind of a lot of evidence that we're only still seeing a very small fraction of the stars. There's a lot of matter in our, you know, neural network universe that we can't observe yet.

[译文] [Chris Olah]: 我认为另一个令人兴奋的方向是……你可以把稀疏自编码器看作是一种望远镜……我们看到了越来越多的恒星……但有大量证据表明,我们仍然只看到了极小一部分。在我们的神经网络宇宙中,还有很多我们尚无法观测的物质。

[原文] [Chris Olah]: So it's sort of a kind of dark matter... And so I think a lot about that dark matter and whether we'll ever observe it, and what that means for safety if we can't observe it, if there's, you know, if some significant fraction of neural networks are not accessible to us.

[译文] [Chris Olah]: 所以这有点像一种暗物质(dark matter)……我经常思考这种暗物质,思考我们是否能观测到它,以及如果我们观测不到——如果神经网络中有很大一部分对我们来说是不可触及的——这对安全性意味着什么。

[原文] [Chris Olah]: Another question that I think a lot about is... mechanistic interpretability is this very microscopic approach... But a lot of the questions we care about are very macroscopic... And I think there's a question of, will we be able to find, are there sort of larger scale abstractions that we can use to understand neural networks? Can we get up from this very microscopic approach?

[译文] [Chris Olah]: 我经常思考的另一个问题是……机械可解释性是一种非常微观(microscopic)的方法……但我们关心的很多问题都是非常宏观(macroscopic)的……问题在于,我们能否找到某种更大尺度的抽象来理解神经网络?我们能从这种非常微观的方法上升到宏观层面吗?

[原文] [Lex Fridman]: So if we think of interpretability as a kind of anatomy of neural networks... And so we wonder, is there a respiratory system or heart or brain region of an artificial neural network?

[译文] [Lex Fridman]: 所以如果我们把可解释性看作一种神经网络的解剖学……我们会想知道,人工神经网络中是否存在呼吸系统、心脏或大脑区域?

[原文] [Chris Olah]: And I think that right now we have, you know, mechanistic interpretability if it succeeds is sort of like a microbiology of neural networks, but we want something more like anatomy. And so, and you know, a question you might ask is, why can't you just go there directly? And I think the answer is superposition...

[译文] [Chris Olah]: 我认为目前的情况是,机械可解释性如果成功的话,更像是神经网络的微生物学(microbiology),但我们想要的是更像解剖学(anatomy)的东西。你可能会问,为什么不能直接研究解剖学?我认为答案在于叠加(superposition)……它使得直接观察宏观结构变得非常困难。

[原文] [Lex Fridman]: What do you think is the difference between the human brain, the biological neural network and the artificial neural network?

[译文] [Lex Fridman]: 你认为人脑(生物神经网络)和人工神经网络之间有什么区别?

[原文] [Chris Olah]: Well, the neuroscientists have a much harder job than us... I have, we can record from all the neurons. We can do that on arbitrary amounts of data. The neurons don't change while you're doing that, by the way. You can go and ablate neurons, you can edit the connections and so on, and then you can undo those changes. That's pretty great.

[译文] [Chris Olah]: 神经科学家的工作比我们要难得多……我们可以记录所有神经元。我们可以对任意数量的数据进行记录。顺便说一句,在你记录时神经元不会发生变化。你可以去消融(ablate)神经元,你可以编辑连接等等,然后你可以撤销(undo)这些更改。这太棒了。

[原文] [Chris Olah]: You can force, you can intervene on any neuron and force it active and see what happens... Neuroscientists wanna get the connectome, we have the connectome... We have the weights. We can take gradients... We just have so many advantages over neuroscientists. And then despite having all those advantages, it's really hard.

[译文] [Chris Olah]: 你可以强制干预任何神经元,强制激活它并观察会发生什么……神经科学家想要获得连接组(connectome),我们已经有连接组了……我们有权重。我们可以计算梯度……相比神经科学家,我们拥有太多的优势。然而,尽管拥有所有这些优势,这仍然非常难。

[原文] [Lex Fridman]: I love what you've written about the goal of mech interp research as two goals, safety and beauty. So can you talk about the beauty side of things?

[译文] [Lex Fridman]: 我很喜欢你写的关于机械可解释性研究的两个目标:安全与美(beauty)。你能谈谈美的那一面吗?

[原文] [Chris Olah]: I picture them being like, you know, evolution is so boring. It's just a bunch of simple rules and you run evolution for a long time and you get biology... But the beauty is that the simplicity generates complexity... And similarly, I think that neural networks build, you know, create enormous complexity and beauty inside and structure inside themselves...

[译文] [Chris Olah]: 我想象有人会说:进化太无聊了。只是一堆简单的规则,运行很长时间,就得到了生物学……但美就在于简单性能生成复杂性(simplicity generates complexity)……同样地,我认为神经网络在内部构建并创造了巨大的复杂性、美和结构……

[原文] [Chris Olah]: But I think that there is an incredibly rich structure to be discovered inside neural networks, a lot of very deep beauty if we're just willing to take the time to go and see it and understand it.

[译文] [Chris Olah]: 但我认为,只要我们愿意花时间去观察和理解,就会在神经网络内部发现极其丰富的结构和许多非常深刻的美。

[原文] [Chris Olah]: And it just feels like that is obviously the question that sort of is calling out to be answered if you are, if you have any degree of curiosity. It's like how is it that humanity now has these artifacts that can do these things that we don't know how to do?

[译文] [Chris Olah]: 这感觉就像是一个显而易见、在大声呼唤着被回答的问题,只要你有一点点好奇心。这就好比:人类现在怎么会拥有这些能做我们不知道该如何做的事情的人造物(artifacts)?

[原文] [Lex Fridman]: Yeah, I love the image of the circuits reaching towards the light of the objective function.

[译文] [Lex Fridman]: 是的,我喜欢那个意象:电路正朝着目标函数的光生长延伸。

[原文] [Chris Olah]: Yeah, it's just, it's this organic thing that we've grown and we have no idea what we've grown.

[译文] [Chris Olah]: 是的,这正是我们生长出来的有机体,而我们根本不知道我们生长出了什么。



这是为您整理的第 36 章内容。这是本次播客的最终章,包含了 Lex Fridman 的结语以及引用哲学家 Alan Watts 的名言作为结束。

章节 36:结语:拥抱变化,与未来共舞

📝 本节摘要

在这长达数小时的深度对话结束后,Lex Fridman 向三位嘉宾——Dario Amodei、Amanda Askell 和 Chris Olah——以及听众致谢。面对 AI 带来的剧烈变革与不确定性,他引用了哲学家 Alan Watts 的名言作为最后寄语,鼓励人们不要抗拒变化,而是要投身其中,与时代共舞。

[原文] [Lex Fridman]: Thanks for listening to this conversation with Chris Olah, and before that with Dario Amodei and Amanda Askell. To support this podcast, please check out our sponsors in the description.

[译文] [Lex Fridman]: 感谢收听这期与 Chris Olah,以及之前与 Dario Amodei 和 Amanda Askell 的对话。想要支持本播客,请查看描述栏中的赞助商信息。

[原文] [Lex Fridman]: And now let me leave you with some words from Alan Watts. "The only way to make sense out of change is to plunge into it, move with it, and join the dance."

[译文] [Lex Fridman]: 现在,让我用 Alan Watts 的一句话来作为结束:“理解变化的唯一方式,就是投身其中,随之而动,并加入这场舞蹈。”

[原文] [Lex Fridman]: Thank you for listening and hope to see you next time.

[译文] [Lex Fridman]: 感谢收听,希望能下次再见。


至此,本期 Lex Fridman 对话 Anthropic 团队(Dario Amodei, Amanda Askell, Chris Olah)的全部内容已为您整理完毕。

这次对话涵盖了从扩展定律(Scaling Laws)的宏观预测,到 Claude 性格设计的微观伦理,再到机械可解释性(Mechanistic Interpretability)的底层科学。希望这份详细的章节整理能帮助您更好地消化和回顾这些前沿且深刻的 AI 思想。如果您有任何其他问题,欢迎随时提问!