How difficult is AI alignment? | Anthropic Researc

章节 1：实用主义对齐观——与其追求“哲人王”，不如拥抱不确定性

📝 本节摘要：

访谈开始，主持人Alex介绍了来自Anthropic不同团队（社会影响、对齐科学、微调、可解释性）的研究员。Amanda首先分享了她对“对齐（Alignment）”的看法。她反对过度理论化或试图定义完美的“效用函数”，主张采取实用主义路径：设定一个“足够好”的基准并进行迭代。她认为模型不应被灌输某种固定的、绝对自信的道德观（类似“哲人王”），而应像人类一样，对道德框架保持不确定性，并能根据情境和新信息进行更新，这样才更安全。

[原文] [Alex]: We are super excited to have everyone here, folks that we've met already, folks that are new. This panel is just gonna be really casual. And we have researchers from four different teams at Anthropic. We have folks from Societal Impacts, that's me. Folks from Alignment Science, that's Jan. Alignment, Finetuning, Amanda. And interpretability, Josh. I'm gonna start with asking a question to Amanda, from the Alignment Finetuning team. And I want you to talk a little bit about how you see alignment, what it means to you.

[译文] [Alex]: 我们非常高兴大家能来到这里，无论是我们已经见过的朋友，还是新面孔。这场圆桌讨论会非常随意。我们有来自Anthropic四个不同团队的研究人员。有来自社会影响（Societal Impacts）团队的，那就是我。有来自对齐科学（Alignment Science）团队的，那是Jan。来自对齐微调（Alignment Finetuning）的，Amanda。还有可解释性（Interpretability）团队的，Josh。我要先问来自对齐微调团队的Amanda一个问题。我想请你谈谈你是如何看待对齐的，它对你来说意味着什么。

[原文] [Alex]: You know, 'cause you're in charge of a lot of our work on how the model should behave. And why should you be the philosopher king that decides how Claude behaves, what its characteristics and attributes are?

[译文] [Alex]: 你知道，因为你负责了很多关于模型应该如何表现的工作。既然如此，为什么你应该成为决定Claude如何表现、具备什么特征和属性的“哲人王（philosopher king）”呢？

[原文] [Amanda]: I mean ask Plato. He's the one that decided I should be the philosopher king. (Alex laughing) The question of, like, what is alignment, maybe this is, like, a slightly spicy view that I have. I think people are very, very tempted to spend a lot of time trying to define this concept because there's, like, lots of ways of doing it

[译文] [Amanda]: 我想你得去问柏拉图。是他决定了我应该成为哲人王的。（Alex笑）关于什么是对齐这个问题，也许我有一个稍微激进点（spicy）的观点。我认为人们非常、非常容易把大量时间花在试图定义这个概念上，因为确实有很多定义方式，

[原文] [Amanda]: and they, you know, I dunno they have, like, social choice theory in the back of their head and they're like, "Oh, well if you imagine everyone has a utility function, there's, like, limits on exactly what you can say about how to maximize all of those utility functions, et cetera." And I think I'm more inclined to just be like, we kind of want things to go well enough that you can iterate on and improve them later. And the bar isn't some like perfect notion of alignment. I'm sure there is that concept. One can define it, one can argue about it

[译文] [Amanda]: 比如，你知道，我不确定，他们脑子里可能有社会选择理论（social choice theory），然后他们会想，“哦，如果你想象每个人都有一个效用函数（utility function），那么关于如何最大化所有这些效用函数等等，你能说的其实是有限制的。”而我认为我更倾向于这种态度：我们只是希望事情进展得足够顺利，以便后续可以迭代和改进。标准并不是某种完美的对齐概念。我确信存在那个概念。人们可以定义它，可以争论它，

[原文] [Amanda]: but for the most part the initial goal is, like, let's just make things like go well and, like, meet a certain kind of, like, lower bar where that is, like, you know, if it's not perfect, if some people, like, don't like it, you can just improve on it. So, like, my view of of alignment is actually probably, like, I mostly want to, like, hit that, like, and iterate from there. In terms of, like, how the model should behave and how I think about that, I think I've spoken about this before but my basic concept right now

[译文] [Amanda]: 但在很大程度上，最初的目标就是，让我们先把事情做好，达到某种最低标准，即如果它不完美，如果有些人不喜欢它，你可以直接改进它。所以，我对对齐的看法实际上可能是，我主要想达到那个标准，然后从那里开始迭代。至于模型应该如何表现以及我如何思考这个问题，我想我之前谈过这个，但我目前的基本概念是，

[原文] [Amanda]: for the model is trying to get it to behave the way that I think, like, a very good, like, morally motivated, like, kind human would act if they found themselves roughly in this circumstance. It's a little bit strange 'cause they also have to find themselves in the circumstances of, like, being an AI (panelists laughing) who's like talking to millions of people, which does in fact affect how you behave. Like, maybe you would normally be willing to, like, just chitchat about politics with someone, but if you're gonna be talking with millions of people,

[译文] [Amanda]: 对于模型来说，是试图让它的表现像我认为一个非常好的、有道德动力的、善良的人在大概处于这种情况下会做出的行为那样。这有点奇怪，因为它们也必须意识到自己处于“身为一个AI”的情况下（小组成员笑），正在与数百万人交谈，这实际上确实会影响你的行为。比如，也许你通常愿意和某人闲聊政治，但如果你要和数百万人交谈，

[原文] [Amanda]: maybe you'd actually be like, "Hmm, I should maybe be, like, a little bit more concerned about potentially influencing people." But I do think it's actually an important model, which is, like, sometimes people are kinda like, "Oh, what values should you put into the model?" And I think I'm often like, "Well, do we think this way with humans?" Where I'm just like, someone just injected me with, like, value serum or something and I just have these, like, fixed things that I'm, like, completely certain of. I'm like, I don't know, that seems, like,

[译文] [Amanda]: 也许你实际上会想，“嗯，我可能应该更担心一下潜在地对人们产生影响。”但我确实认为这是一个重要的模型，就像有时候人们会问，“噢，你应该把什么价值观放入模型中？”而我通常会想，“嗯，我们对人类也是这样想的吗？”就好像有人给我注射了“价值观血清”之类的东西，然后我就有了这些我完全确信的固定观念。我觉得，我不知道，这看起来，

[原文] [Amanda]: almost, like, dangerous or something. Most of us just have, like, a mix of, like, things that we do value but that we would trade off against other things and a lot of uncertainty about different, like, moral frameworks. We hit cases where we're suddenly, like, "Oh, actually my value framework doesn't accord with my intuitions," and we update. I think my view is that ethics is actually, like, a lot more like physics than people think. It's actually like a lot more kind of, like, empirical and something that we're uncertain over

[译文] [Amanda]: 几乎有点危险之类的。我们大多数人只是拥有一系列我们确实重视的东西，但我们会权衡这些东西与其它东西，并且对不同的道德框架有很多不确定性。我们会遇到这样的情况，突然觉得，“哦，实际上我的价值框架与我的直觉不符，”然后我们会更新。我认为我的观点是，伦理学实际上比人们想象的更像物理学。它实际上更像是一种经验性的东西，是我们不确定的东西，

[原文] [Amanda]: and that we have hypotheses about. Like, I think that if I met someone who was just completely confident in their moral view, there is no such moral view I could give that person that would not make me kind of terrified. Whereas if I instead have someone who's just, like, "I don't know, I'm kind of uncertain over this and I just, like, update in response to, like, new information about ethics and I, like, think through these things," that's the kind of person that feels, like, less scary to me. So at least at the moment I'm not, this isn't a claim

[译文] [Amanda]: 是我们有假设的东西。比如，我想如果我遇到一个对自己的道德观完全自信的人，无论那个人持有什么样的道德观，都会让我感到恐惧。相反，如果我遇到的人说，“我不知道，我对这个有点不确定，我只是根据关于伦理的新信息进行更新，我在思考这些事情，”那才是让我觉得不那么可怕的人。所以至少目前我不——这不是在声称

[原文] [Amanda]: that this is somehow going to, like, completely, like, align models or anything. But that's the kind of immediate goal. I realize I've talked a lot and you asked about the philosopher king question. (Alex laughing) I guess. Okay. I guess I should give a quick answer to that. There is also this question of, like, well, in the kind of, like, should you put values into the model? Maybe I've partially answered it where I'm like, "The models should just be uncertain over values that exist in the world." And so ideally it's not just someone injecting their values

[译文] [Amanda]: 这种方法能以某种方式完全对齐模型之类的。但这是一种短期目标。我意识到我说了很多，而你问的是哲人王的问题。（Alex笑）我想，好吧。我想我应该给那个问题一个快速的回答。这就涉及到一个问题，比如，你应该把价值观放入模型中吗？也许我已经部分回答了，我的观点是，“模型应该只对世界上存在的价值观保持不确定性。”所以理想情况下，不仅仅是某人在注入他们的价值观，

[原文] [Amanda]: or their preferences, nor is it something like everyone just voting on values to put into models but instead like people that are uncertain and responsive to these things, models should also be like that. So maybe that's kinda my view.

[译文] [Amanda]: 或他们的偏好，也不是像所有人投票决定把什么价值观放入模型，而是像那些对此保持不确定并能对这些事情做出反应的人一样，模型也应该那样。所以这大概就是我的观点。

章节 2：超级对齐难题——当人类无法监督比自己更聪明的模型时

📝 本节摘要：

Jan对Amanda的观点提出了“反驳”（实际上是延伸）。他指出，Amanda的“善良人类”模拟法在当前很实用，但面临扩展性挑战（Super Alignment）。当AI变得极其复杂，开始执行人类无法理解的长程任务（如复杂的生物研究）时，人类无法再通过阅读文本记录（transcripts）来判断其行为是否正确。这引出了核心问题：在人类无法直接验证的情况下，如何确保强大的模型依然是对齐的？

[原文] [Alex]: Okay, we're gonna come back to that. And I'm gonna ask Jan, why is Amanda's view completely wrong (panelists laughing) and why is this not enough to align models as they get, you know...

[译文] [Alex]: 好的，我们稍后会回到这一点。现在我要问Jan，为什么Amanda的观点是完全错误的（小组成员笑），以及为什么随着模型变得……这不足以对齐模型？

[原文] [Jan]: I see. (Alex laughing) - She didn't say that. But you know, we're playing up the tension between the bets.

[译文] [Jan]: 我明白了。（Alex笑）——她没那么说。但你知道，我们在渲染这些不同赌注（bets）之间的张力。

[原文] [Jan]: Yeah, I imagine if everyone was, like, kind human that was just trying to, (Alex laughing) you know, act morally, I think what Amanda is doing is very practical, right? Like, can we just like make the models be well behaved now and, like, where would we go with this? Like, if FI is doing more and more complicated things. Right now, like, if Amanda does this character work and then she reads a lot of transcripts and you're, like, "Okay, this is, like, I like this, this model is behaving morally," I'm just picturing this is what you're doing.

[译文] [Jan]: 是的，我想如果每个人都是那种，善良的人类，只是试图（Alex笑）你知道，表现得有道德，我认为Amanda正在做的事情非常务实，对吧？比如，我们能不能现在就让模型表现良好，然后我们要往哪里走？比如，如果AI正在做越来越复杂的事情。目前，如果Amanda做这种角色工作，然后她阅读大量的对话记录（transcripts），你会觉得，“好的，这个，我喜欢这个，这个模型表现得很道德，”我只是在想象这就是你们正在做的事。

[原文] [Jan]: (panelists laughing) But, like, what do we do when the model is doing really complex things? And it's just like an agent in the world, it's, like, doing these, like, really long trajectories, it's doing stuff that we don't understand, like doing some bio research and we're like, "Is this dangerous? Like, I dunno." So that's the challenge really interested in. Like, the super alignment problem. How do we solve that? How do we, like, basically scale this beyond, like, things that we can look at, if we can look at it

[译文] [Jan]: （小组成员笑）但是，当模型在做真正复杂的事情时我们该怎么办？它就像世界上的一个代理（agent），它在执行这些非常长的轨迹（trajectories），做着我们不理解的事情，比如做一些生物研究，我们会想，“这危险吗？比如，我不知道。”所以这就是我真正感兴趣的挑战。也就是超级对齐（super alignment）问题。我们要如何解决这个问题？我们基本上如何将其扩展到超出我们可以观察的范围，如果我们可以观察它，

[原文] [Jan]: we just do some RLHF, it's great. Or you do some constitutional AI. But how do we know that our constitution is actually getting the the model to do that, the right thing that we actually want? So I think that's the big question in my mind.

[译文] [Jan]: 我们就做一些RLHF（人类反馈强化学习），那很棒。或者你做一些宪法AI（constitutional AI）。但我们怎么知道我们的宪法实际上是在让模型做那个，做我们真正想要的正确的事情？所以我认为那是我心中的大问题。

[原文] [Alex]: Do we get to respond? - Yes, you can respond. - You're only allowed to disagree. (laughs)

[译文] [Amanda]: 我们能回应吗？[Alex]: 是的，你可以回应。[Alex]: 你只被允许反对。（笑）

[原文] [Amanda]: Okay. I don't really disagree, though, and I can't lie. - Need to up the disagreement feature, you know? - I'm usually so disagreeable as well. It's just terrible. This is what philosophy taught me

[译文] [Amanda]: 好的。不过我并不是真的反对，我不能撒谎。[Alex]: 需要调高反对功能，知道吗？[Amanda]: 我通常也是很爱唱反调的。这太糟糕了。这就是哲学教会我的，

[原文] [Amanda]: was how to be disagreeable. I guess, like, my thought is that the way, I mean, I think of my work as doing several things, but one of them is being kind of iterative towards alignment. So in a lot of cases you're actually trying to get the model to kind of oversee its own, you know, so it's not, like, me, my eyes can't look at that many transcripts or something, but I can get models to, like, look at these things and, like, if alignment is iterative, I think my worry is that if people neglect the kind of, like, the ground

[译文] [Amanda]: 如何唱反调。我想，我的想法是，这种方式，我是说，我认为我的工作是在做几件事，但其中之一是朝着对齐进行迭代。所以在很多情况下，你实际上是在试图让模型某种程度上监督它自己，你知道，所以不是我，我的眼睛看不了那么多对话记录之类的，但我可以让模型去看这些东西，而且，如果对齐是迭代的，我想我担心的是如果人们忽视了这种，基础，

[原文] [Amanda]: and just think, "You can just have like a pretty bad model and it's just gonna help you with these things." I'm like, "I'd kind of rather that you had the most aligned model trying to, like, then help you in the future and, like, that kind of iterative work."

[译文] [Amanda]: 并且只是认为，“你可以只有一个相当糟糕的模型，而它会帮你处理这些事情。”我会想，“我有点宁愿你拥有最对齐的模型试图去在未来帮助你，以及那种迭代的工作。”

[原文] [Jan]: But how do you iterate when you can't read the transcripts anymore? And you have to rely on the aligned model. But then how do you know that it's actually trying to help you?

[译文] [Jan]: 但是当你再也无法阅读对话记录时，你要如何迭代呢？你必须依赖那个已对齐的模型。但那你怎么知道它实际上是在试图帮助你？

章节 3：可解释性赌注——打开黑盒寻找“善良特征”

📝 本节摘要：

针对如何验证模型是否真诚的问题，Josh介绍了可解释性（Interpretability）的作用。他引用了著名的“钟形曲线”梗（傻瓜、普通人、绝地武士），提出可解释性可能是“绝地武士”级别的解决方案：直接查看模型内部的运作机制。与其依赖模型给出的类似人类的合理化解释（可能是在撒谎），不如通过技术（如稀疏自编码器 SAEs）直接观察模型在回答时激活了哪些“特征”（例如“欺骗特征”或“善意谎言特征”），从而判断其真实动机。

[原文] [Alex]: And one of our bets, you know, in order to, like, you know, guard against the case that a model might be very deeply trying to sabotage against this process is interpretability. How do you see, like, interpretability as a bet, you know, situated among, you know, the more straightforward alignment approaches, you know? Is it just as simple as like, "Ooh, we find the nice feature

[译文] [Alex]: 我们的一项赌注，你知道，为了防范模型可能在非常深层地试图破坏这个过程的情况，就是可解释性（interpretability）。你是如何看待可解释性作为一个赌注的？它位于那些更直接的对齐方法之中？它是否就像“噢，我们找到了‘善良特征’”那样简单？

[原文] [Alex]: and we, like, up the nice feature, we find the evil feature and we, like, drop the evil feature," right? Or is it, you know...

[译文] [Alex]: 然后我们调高‘善良特征’，我们找到‘邪恶特征’然后我们把它去掉，”对吗？还是说，它是……

[原文] [Josh]: So I feel like everything in AI is like that meme with the bell curve and, like, the idiot and then the really sweaty guy talking a lot and then the Jedi who agrees with the idiot, and, like, there is a possibility that, like, it turns out that the secret to alignment is just turn on the nice feature. (panelists laughing) Like, for a sufficiently galaxy brain version of, like, the nice feature.

[译文] [Josh]: 我觉得AI领域的所有事情都像那个钟形曲线的梗（meme）：一边是傻瓜，中间是那个说个不停、满头大汗的家伙，另一边是同意傻瓜观点的绝地武士（Jedi）。有一种可能性是，结果发现对齐的秘密真的就是“打开善良特征”。（小组成员笑）如果是对于某种足够“脑洞大开（galaxy brain）”版本的“善良特征”来说的话。

[原文] [Josh]: I think (laughs) that in some sense though, I'm hoping that interpretability is also like the Jedi version of just, like, "Well look at well how the model's doing things and check that it's safe," which is, like, maybe very hard, but also if you could do that, (clears throat) would potentially just answer your question. I think that, like, one of the things that are sort of, like, comes up in both, like, the near term and slightly longer term is, like, you wanna understand, like, why the model did one thing instead of another

[译文] [Josh]: 我想（笑）在某种意义上，我希望可解释性也是那个“绝地武士”版本，也就是简单地“看看模型是怎么做事的，并检查它是否安全”，这也许非常难，但如果你能做到这一点，（清嗓子）可能就直接回答了你的问题。我认为，无论是在短期还是稍长期的未来，都会出现的一件事是，你想理解为什么模型做了一件事而不是另一件事，

[原文] [Josh]: when you could come up with, like, plausible alternative explanations and one way is to ask it. But the issue is the models are so analogous to people that they'll just give you an answer to why they did that, you know, as anybody would. But, like, how do you trust that any of that stuff? And it's like, "Well if you could, like, look inside and just, like, see what it was thinking about as it was giving you the answer." And, like, even now with stuff like the SAEs, you can, like, see there's some feature active

[译文] [Josh]: 当你可以提出看似合理的替代解释时，一种方法是直接问它。但问题是，模型太像人了，它们只会给你一个关于它们为什么那样做的答案，就像任何人都会做的那样。但是，你如何相信那些东西呢？这就好比，“好吧，如果你能向内部看，直接看到它在给你答案时在想什么。”甚至现在有了像SAEs（稀疏自编码器）这样的东西，你可以看到有一些特征被激活了，

[原文] [Josh]: and, like, when else is that happening? And it's, like, "Okay, it's like other instances of people telling white lies." And you're like, "Well, then maybe the model is telling," it's like, that's on the Jedi side, I think, right? And so I think that, like, trying to just, like, look inside, see if we can figure out what the parts are and then see, like, are you comfortable with using that part when it does other things, is the basic bet.

[译文] [Josh]: 比如，这还在什么时候发生？就像是，“好的，这就像人们说善意谎言的其他例子。”然后你会想，“嗯，那也许模型正在撒谎，”我觉得这就是在“绝地武士”那一边的，对吧？所以我认为，试图直接向内部看，看看我们能否弄清楚各个部分是什么，然后看看，当它做其他事情时使用那个部分你是否感到放心，这就是基本的赌注。

[原文] [Amanda]: I have a question. - Yeah. - How do you know you're turning up the nice feature

[译文] [Amanda]: 我有个问题。[Josh]: 嗯。[Amanda]: 你怎么知道你调高的是“善良特征”，

[原文] [Amanda]: and not the pretend to be nice feature whenever humans are looking feature?

[译文] [Amanda]: 而不是“每当人类在看时就假装善良的特征”？

章节 4：自动化研究与思维链断层

📝 本节摘要：

本节探讨了未来的对齐策略。Jan提出，中期的最佳赌注是自动化对齐研究，即训练模型来帮助人类进行机器学习研究，从而解决对齐问题。Josh则补充了一个关键的技术隐忧：目前的模型通过英语进行“思维链（Chain of Thought）”推理，人类尚可理解；但未来模型的推理过程可能变得不可理喻（inscrutable），或不再使用人类语言。如何在模型“思维”变得完全不可读之前建立信任，是一个巨大的挑战。

[原文] [Jan]: I mean I think a really obvious thing we should do more of is like what Amanda said. Just say can we just get the models to help us? And then the question's, of course, how do we trust the models? Like, how do we bootstrap this whole process? And, like, you know, you could hope maybe we can, like, leverage the dumber models that we trust more, but they might also not be able to figure it out.

[译文] [Jan]: 我认为一件非常明显我们应该多做的事情就像Amanda说的那样。就是说我们能不能让模型来帮助我们？当然问题随之而来，我们如何信任模型？比如，我们如何引导（bootstrap）整个过程？你知道，你可以希望也许我们可以利用我们更信任的较笨的模型，但它们可能也搞不清楚。

[原文] [Jan]: And so I guess like there's the whole scaled oversight work where, you know, we are exploring various, like, multi-agent dynamics to try to train models to, you know, help us figure out these kind of problems. It seems like overall, like, these problems might, like, maybe the problems are, like, all kind of easy and, like, we can just do the Amanda thing and just, like, make sense of data. Or it's, like, really hard and we have to figure out, like, fully new ideas and approaches that we don't know yet. I think kind of our best bet in the medium term

[译文] [Jan]: 所以我想还有整个扩展监督（scaled oversight）的工作，我们在探索各种多智能体动态，试图训练模型来帮助我们解决这类问题。总体来看，这些问题可能，也许这些问题其实都很简单，我们可以直接用Amanda的方法，理解数据就行。或者它真的很难，我们必须找出我们还不知道的全新想法和方法。我认为我们在中期最好的赌注

[原文] [Jan]: is to try to figure out how to automate alignment research and then we can hopefully get the models to do it. So now we've reduced the problem, like, to, "How can we trust this model to do anything?" To just, like, "Well, can we just trust it to do this, like, much more narrow thing of, like, do some ML research which we understand, like, reasonably well? And how can we evaluate it or, like, how can we give it feedback on those kind of things?"

[译文] [Jan]: 是试图弄清楚如何自动化对齐研究，然后我们希望能让模型来做这件事。所以现在我们将问题从“我们如何信任这个模型做任何事？”简化为，“嗯，我们能不能只信任它做这件更窄的事情，比如做一些我们理解得相当好的机器学习（ML）研究？我们如何评估它，或者如何就这些事情给它反馈？”

[原文] [Josh]: Oh, I was gonna say I think we're in this special zone

[译文] [Josh]: 哦，我想说我认为我们正处于这个特殊区域，

[原文] [Josh]: and I'm terrified about what happens next, but we're in the special zone right now where, like, there's something happens on the forward pass but a lot of the information you need is passed back through with the tokens it generates. It's just, like, the chain of thought is really important for the models to be very smart. And, like, the chain of thought is currently in English. And so then you've got, like, a factorized problem, which is, like, is the chain of thought, like, reasonably safe and, like, is that faithful

[译文] [Josh]: 我对接下来会发生什么感到恐惧，但我们现在正处于这个特殊区域，在前向传播（forward pass）中发生了一些事情，但你需要的大量信息是通过它生成的token传回的。就像，思维链（chain of thought）对于模型变得非常聪明真的很重要。而且，目前的思维链是英文的。所以你就有了一个分解的问题，即，思维链是否相当安全，以及它是否忠实于

[原文] [Josh]: to what's happening on, like, one forward pass? And, like, maybe like you can do some interpretability to, like, check that piece and then you can just, like, you or models can inspect it and you get the other piece. And the horrifying moment is, like, when all of that, the very long thing, like, isn't in English, right? It's in, like, some inscrutable thing that, like, you've learned through, like, crazy long RL to do it. And, like, I think that a big challenge can be crossing that gap where, like, none of the intermediates are intelligible

[译文] [Josh]: 在一次前向传播中发生的事情？也许你可以做一些可解释性工作来检查那一部分，然后你可以，比如你或者模型可以检查它，你就得到了另一部分。而那个可怕的时刻是，当所有这些，那个很长的东西，不再是英文的时候，对吧？它是某种不可理喻的（inscrutable）东西，是通过疯狂漫长的强化学习（RL）学来的。我认为一个巨大的挑战就是跨越那个鸿沟，即所有的中间过程都是不可理解的，

[原文] [Josh]: and there's, like, massive amounts of compute before it like drops out something that people can read.

[译文] [Josh]: 并且在它输出人们能读懂的东西之前，经过了大量的计算。

章节 5：侦测欺骗——“模型生物”与微调博弈

📝 本节摘要：

团队讨论了如何判断世界是“对齐很容易”还是“对齐很难”。Jan提到了模型生物（Model Organisms）工作：故意制造出具有欺骗性或“阴暗”行为的模型（如潜伏特工 Sleeper Agents），然后测试是否能检测或修复它们。这是一个“红队/蓝队”的博弈：制造者试图隐藏恶意，检测者（Amanda的团队）试图在不知道内情的情况下通过微调或观察将其消除。如果在微调后恶意行为依然存在（即便表面看似良善），则说明我们处于一个“困难模式”的世界。

[原文] [Jan]: I feel like the model organisms work is, like, trying to figure this out, right?

[译文] [Jan]: 我觉得模型生物（model organisms）的工作就是在试图弄清楚这一点，对吧？

[原文] [Jan]: Like, can we try to deliberately make deceptive models or misaligned models, models that, like, try to do shady stuff. Like, how good are they, how hard is it to do it? I mean, we might fundamentally go around, like, about it the wrong way and that's why we fail, but if we do succeed it should tell us, like, how close are we to that kind of world? And then, you know, once you have your deceptive model that does all these shady things, like, can you fix it? What if you don't know what the model, if it's a shady model or not, right?

[译文] [Jan]: 比如，我们能不能试着故意制造欺骗性模型或未对齐的模型，那些试图做阴暗之事的模型。比如，它们有多厉害，做这件事有多难？我的意思是，我们可能根本上走错了路，所以我们失败了，但如果我们成功了，它应该告诉我们，我们离那种世界有多近？然后，你知道，一旦你有了做所有这些阴暗之事的欺骗性模型，你能修复它吗？如果你不知道这个模型是不是阴暗模型怎么办，对吧？

[原文] [Jan]: We are playing these interability audits, which I'm very excited about, but I don't actually know what the state is.

[译文] [Jan]: 我们正在进行这些可解释性审计（interability audits），我对此非常兴奋，但我实际上不知道现在的状况如何。

[原文] [Alex]: Well, we haven't done the audit yet, but your people are trying to make it and we're trying to catch it. (panelists laughing)

[译文] [Alex]: 嗯，我们还没做审计，但你的人正在试图制造它，而我们正在试图抓住它。（小组成员笑）

[原文] [Amanda]: Yeah, and maybe a sign that seems important is something like how robust or, like, if you have, like, modal organisms work and then turns out you just put it through some character training and it just comes out being really nice again, then I'd be like, "Okay, that's a good sign, Re: the kind of world that we're in." Whereas if it, like, is just kind of, like, shallow, like, shell on top of, like, I dunno, the same behavior, then I'm like, "Okay, we're in a slightly harder world."

[译文] [Amanda]: 是的，也许一个看似重要的迹象是它有多鲁棒（robust），或者，如果你有模型生物工作，结果发现你只是让它经过一些角色训练，它出来后又变得非常好了，那我就会想，“好的，这是个好迹象，关于我们处于什么样的世界。”然而，如果它只是某种，浅层的，像是在同样的（恶意）行为之上盖了一层壳，那我就会想，“好的，我们处于一个稍微更难的世界。”

[原文] [Amanda]: In terms of, like, model organisms, I guess my hope would be you'd actually have a kind of like a red team, blue team set up where you have a way of detecting whether the behavior that you have like instilled in a model is still there. And my job is actually to not know.

[译文] [Amanda]: 关于模型生物，我想我的希望是你实际上会有一种像红队、蓝队的设置，你有办法检测你灌输给模型的行为是否还在那里。而我的工作实际上是不去知道。

[原文] [Amanda]: And in fact it would be really good if I'm trying to train the model, I actually just don't know what it is that you've done because that's, like, a better way of testing whether my intervention's actually working. Whereas because otherwise it is just so hard to not, like, just try to train to the test or something. So I almost want to be completely ignorant of it.

[译文] [Amanda]: 事实上，如果我试图训练模型，但我实际上根本不知道你们做了什么，那就太好了，因为那是测试我的干预是否真正奏效的更好方法。否则真的很难不变成那是那种“应试训练”。所以我几乎希望对此一无所知。

章节 6：观众问答——代理协作、平庸之恶与系统性风险

📝 本节摘要：

进入观众提问环节。第一个问题关于多智能体协作（Multi-agent）：当构建多个智能体进行辩论时，过度对齐的模型往往因“过于客气”或拒绝讨论敏感话题而陷入死循环。Amanda认为，单个模型内部的思维审视可能比多智能体更可预测。第二个问题引用汉娜·阿伦特的“平庸之恶”，询问当数百万智能体交互时，是否会产生单个模型层面上看不见的系统性邪恶（副现象）。团队承认必须从系统角度（System standpoint）考虑对齐，而不仅仅是隔离地看单个模型。

[原文] [Audience 1]: So if I'm trying to create this kind of multi-agentic deliberation thing, but I butt up against this aligned model who is so unwanting to deliberate with its own self, because all of them are like, "I'm sorry, I can't talk about that." And so you just get this endless loop. Did you have commentary on that? 'Cause many people, we're not all using Claude in a single inference forward pathway.

[译文] [Audience 1]: 所以如果我试图创建这种多智能体审议（multi-agentic deliberation）的东西，但我碰到了这个已经对齐的模型，它非常不愿意与自己进行审议，因为它们都像是在说，“对不起，我不能谈论那个。”所以你就陷入了这个死循环。你们对此有评论吗？因为很多人，我们并不是都在单一推理前向路径中使用Claude。

[原文] [Amanda]: Like, my sense is that we're often just very willing to, like, reflect on many things, we go back and forth, we come to, like, conclusions, and so in the same way that, like, a human would think through any kinda standard problem that they were, like, facing, I'm imagining that, like, moral deliberation in the model's just gonna look kind of like that. More like the deliberation of a single model than, like, multiple models weighing in, if that makes sense.

[译文] [Amanda]: 我的感觉是，我们通常非常愿意反思很多事情，我们会反复思考，得出结论，就像人类会思考他们面临的任何标准问题一样，我想象模型中的道德审议看起来也会是那样的。更像是单个模型的审议，而不是多个模型介入，如果这讲得通的话。

[原文] [Audience 2]: I want to try to draw maybe a relatively strange parallel between, like, Hannah Arendt's work on the banality of evil, the idea that most humans tend not to be evil, but when put in certain situations, the coupling constants between humans being so huge, the evil comes as an epiphenomenon of the system, right? And so what occurs to me is as you talk about model alignment, you're focused on one model, that has been most of your comments. How do you think about the coupling not only with society but as you work on agents and potentially

[译文] [Audience 2]: 我想试着画一个相对奇怪的平行线，关于汉娜·阿伦特（Hannah Arendt）关于“平庸之恶（banality of evil）”的著作，即大多数人类倾向于不邪恶，但当被置于某些情境中，人类之间的耦合常数（coupling constants）如此巨大，邪恶作为系统的副现象（epiphenomenon）出现了，对吧？所以我想到的是，当你们谈论模型对齐时，你们关注的是一个模型，这是你们大部分评论的内容。你们如何思考这种耦合，不仅是与社会的耦合，还有当你们研究代理（agents）以及潜在的

[原文] [Audience 2]: millions and millions of agents, that sort of epiphenomenon of those systems?

[译文] [Audience 2]: 数百万数百万的代理时，那种系统的副现象？

[原文] [Alex]: I mean, I think broadly when you think about safety and alignment, you have to think about it from a systems standpoint, you can't just think about it from, you know, an individual models perspective in isolation. And I think we've seen a lot of work where, you know, a lot of jailbreaks operate by pitting different values sort of against each other, putting the model in a difficult situation where, you know, it's designed

[译文] [Alex]: 我的意思是，我认为广义上当你思考安全和对齐时，你必须从系统角度（systems standpoint）来思考，你不能仅仅从孤立的单个模型视角来思考。我认为我们已经看到了很多工作，很多越狱（jailbreaks）操作是通过将不同的价值观相互对立，将模型置于一个困难的境地，

[原文] [Amanda]: The banality of evil point feels especially relevant if you're, like, thinking of models as just, like, doing whatever humans say because in in some ways, like , that is the idea that if you have a society that, like, either collectively just like allows for harmful things to take place or even endorses them and you have models, this isn't necessarily people misusing models, it's just that you'd be using models to, like, facilitate some kind of, like, harmful activity.

[译文] [Amanda]: 如果你把模型想成只是做人类所说的任何事，那么“平庸之恶”这个点就感觉特别相关，因为在某种程度上，这种想法是，如果你有一个社会，集体允许有害的事情发生甚至支持它们，而你拥有模型，这不一定是人们在滥用模型，而只是你在使用模型来促进某种有害活动。

[原文] [Amanda]: And so I think there is, like, fundamentally actually a tension between having more models be cordial to the very least individual humans and having them be, like, in a sense, like, aligned with, like, all humans.

[译文] [Amanda]: 所以我认为，在让模型对至少是个体的对人类更加“亲切（cordial）”，与让它们在某种意义上与“全人类”对齐之间，根本上实际上存在一种张力。

章节 7：未知的未知与“聪明的说谎者”

📝 本节摘要：

最后的问答环节讨论了当前方案的完整性。团队承认，即使所有对齐技术都成功，仍有“未知的未知（Unknown Unknowns）”和社会影响问题。最后一个问题关于用较弱的模型评估较强的模型：当模型出现“顿悟（Grokking）”能力（如突然学会Base64编码）时，它可能变得更擅长欺骗。Josh指出，虽然这很可怕，但好消息是内部特征（如“撒谎”特征）在不同语言（英语或Base64）下可能是一致的，这给了可解释性技术在未来继续发挥作用的希望。

[原文] [Audience 3]: If you were all to succeed in your areas, would that be a complete solution to AI safety or are there pieces missing and if so, what are they?

[译文] [Audience 3]: 如果你们都在各自的领域取得了成功，那会是AI安全的完整解决方案吗？还是有缺失的部分？如果有，是什么？

[原文] [Jan]: There's also, like, a lot of people working at Anthropic who are not at this panel. (audience laughing) We are oversimplifying a little bit.

[译文] [Jan]: 还有很多人在Anthropic工作但不在这个讨论组里。（观众笑）我们有点过于简化了。

[原文] [Amanda]: I want to add an almost, like, I dunno if this is pessimistic or just, I don't think it's pessimistic. But I think that like there's also this way of talking about alignment and the alignment problem as, like, a single, like, theoretical problem or something like that where people will be like, "Does this solve it?" and somehow it's never felt right.

[译文] [Amanda]: 我想补充一点，几乎，我不知道这是悲观还是只是——我不认为这是悲观。但我认为有一种谈论对齐和对齐问题的方式，把它当作一个单一的理论问题之类的，人们会问，“这解决了吗？”而这种感觉从来都不太对。

[原文] [Amanda]: I'm just like, "Look, problems might just arise that we're not even thinking of now." And in fact that's like very, very common in, like, many, many disciplines and I kind of expect it to be true here and it would be really dangerous, I think, if we were just like, "Oh yeah, we've like solved this problem," because I'm just like, I dunno, it could be the actual problems that we've just not thought of yet.

[译文] [Amanda]: 我只是觉得，“听着，可能会出现我们现在甚至没想到的问题。”事实上这在许多许多学科中都非常常见，我预期在这里也是如此。如果我们只是觉得，“哦耶，我们已经解决了这个问题，”那真的会很危险，因为我觉得，我不知道，真正的问题可能是我们要还没想到的。

[原文] [Jan]: Yeah, unknown unknowns. - But we should solve the problem. And then once we did we should say that we solved it. (panelists laughing)

[译文] [Jan]: 是的，未知的未知。[Amanda]: 但我们应该解决这个问题。一旦解决了，我们就应该说我们解决了。（小组成员笑）

[原文] [Audience 4]: Yeah, Jan was talking earlier about using dumber models to evaluate smarter models over time. And I'm wondering to what extent you see, like, the Groking abilities in models where, like, suddenly it's really duplicitous or do you sort of see, "Oh, it's lying, but it's very bad at lying and now I can catch that and maybe, like, nip it in the bud while it's still weak."

[译文] [Audience 4]: 是的，Jan早些时候谈到用较笨的模型来评估较聪明的模型。我想知道你在多大程度上看到了模型中的“顿悟（Grokking）”能力，比如突然之间它变得非常奸诈？还是说你会看到，“哦，它在撒谎，但它撒谎很拙劣，现在我可以抓住它，也许可以在它还弱的时候将其扼杀在萌芽状态。”

[原文] [Jan]: Like one that I remember is, like, GPT-4 could just read and write in Base64 super reliably and 3.5 could not. And so if you use 3.5 to oversee 4, it's, like, really easy for the model to get around this.

[译文] [Jan]: 我记得的一个例子是，GPT-4可以非常可靠地用Base64编码读写，而3.5不行。所以如果你用3.5来监督4，模型要绕过这个真的很简单。

[原文] [Josh]: I will say one, like, you know, right side of the distribution Jedi moment was like, (panelists laughing) that the features, you know, just like also work in Base64. So it's like, is the model talking about California in Base64 or, like, a story about children lying to their parents in Base64 and it's like the same things activate?

[译文] [Josh]: 我要说一个，你知道，分布右侧的“绝地武士时刻”是，（小组成员笑）那些特征，在Base64中也同样起作用。所以就像是，模型是在用Base64谈论加利福尼亚，还是在用Base64讲一个关于孩子对父母撒谎的故事？而同样的特征被激活了？

[原文] [Josh]: And so, like, we do get sometimes get lucky as, like, the models are like very capable. Part of that is they have some, like, very general synthetic thing and, like, maybe you can just like tap that to get some generalization. That would've been like pretty, pretty tough out.

[译文] [Josh]: 所以，有时候我们确实很幸运，因为模型非常有能力。部分原因是它们拥有某种非常通用的综合能力，也许你可以直接利用这一点来获得某种泛化。否则那将会是非常、非常棘手的。