Things every AI engineer should be using in agents.

章节 1:AI 工程师面临的成本与效率挑战

📝 本节摘要

本节指出了当前 AI 工程师面临的核心痛点:如何在用户与大模型(LLM)交互的过程中降低对话成本。讲者提到了大模型(如 Claude Opus)昂贵的 Token 费用,以及“中间迷失(Lost in the middle)”现象导致的性能问题。核心观点是,尽管有各种限制手段,工程师仍需处理高达 50-70% 的 Token 浪费问题。

[原文] [Speaker]: the real question for AI engineers nowadays is how do you bring the cost for a single conversation from this to this

[译文] [Speaker]: 如今对于 AI 工程师来说,真正的问题是如何将单次对话的成本从这个数降到这个数。

[原文] [Speaker]: Given that your product is an agent that's sitting between the user and some kind of a open API or a cloud API that's hosting the LLM you're going to be paying for each prompt that the user makes

[译文] [Speaker]: 鉴于你的产品是一个介于用户和某种托管 LLM(大语言模型)的 Open API 或云 API 之间的智能体(Agent),你将不得不为用户的每一个提示词(Prompt)付费。

[原文] [Speaker]: And obviously you want to reduce that payment because depending on what kind of an LM you're going to be using Obviously stronger LM have bigger contact sizes and they also cost more

[译文] [Speaker]: 显然你想减少这笔支出,因为这取决于你使用的是哪种 LLM,很明显,更强大的 LLM 拥有更大的上下文窗口(Context Sizes),成本也更高。

[原文] [Speaker]: For example clot oppus is going to cost this much for 1K tokens and as tokens are our currency of LLM This is way too much

[译文] [Speaker]: 例如 Claude Opus(原文误识别为 clot oppus)每 1000 个 Token 要花这么多钱,而 Token 是我们在 LLM 中的货币,这实在是太多了。

[原文] [Speaker]: And not to forget we're also have to pay for the output

[译文] [Speaker]: 别忘了我们还得为输出内容付费。

[原文] [Speaker]: Not to mention the fact that bigger context sizes are not even always ideal We have a problem called lost in the middle where the LLM actually disregards the user prompt if it gets too big and especially the middle section

[译文] [Speaker]: 更不用说更大的上下文窗口甚至并不总是理想的,我们面临一个被称为“中间迷失(Lost in the middle)”的问题,即如果内容太长,LLM 实际上会忽略用户的提示词,特别是位于中间部分的内容。

[原文] [Speaker]: So despite all the products that we have in our hand with guard railing rate limiting and batching engineers are the ones who have to deal with this waste Meaning 50 to 70% of our tokens are actually wasted and this has to be optimized

[译文] [Speaker]: 因此,尽管我们手头有各种具备护栏(Guard railing)、速率限制和批处理功能的产品,工程师们仍是必须处理这种浪费的人,这意味着我们 50% 到 70% 的 Token 实际上被浪费了,而这必须得到优化。


章节 2:策略一:提示词工程与动态模型路由

📝 本节摘要

讲者介绍了第一个优化策略:更好的提示词工程。与其微调模型,不如优化用户的输入。本节详细讲解了根据任务复杂度(简单 vs 复杂)动态选择模型(如 GPT-4o mini vs Claude Sonnet)的路由机制,以及“零样本(Zero-shot)”、“少样本(Few-shot)”和“批处理提示(Batch Prompting)”等具体技术。

[原文] [Speaker]: Tokens means money More tokens means lower performance because the LLM has to actually read all of those tokens And of course as we learned it's also less quality

[译文] [Speaker]: Token 意味着金钱,更多的 Token 意味着更低的性能,因为 LLM 必须实际阅读所有这些 Token,而且正如我们所知,这也意味着更低的质量。

[原文] [Speaker]: Now there are three major strategies that you as an AI engineer can use and that's why you should watch the video till the very end And we're going to start with the first one which is simply better prompt engineering And here are the examples

[译文] [Speaker]: 现在,作为 AI 工程师,你可以使用三种主要策略,这也是你应该把视频看到最后的原因。我们将从第一个策略开始,它很简单,就是更好的提示词工程(Prompt Engineering),以下是示例。

[原文] [Speaker]: But what is prompt engineering anyway Well it's the fact that we don't have to fine-tune the LLM anymore We're simply not touching it What we need to work on is actually the user's prompt

[译文] [Speaker]: 但到底什么是提示词工程呢?其实就是我们不再需要微调(Fine-tune)LLM 了,我们完全不去动它,我们需要处理的实际上是用户的提示词。

[原文] [Speaker]: For example you can let your agent decide which LM it wants to use based on the user's input For example for simple task we're going to be using this one For complex ones we're going to be using clot sonet

[译文] [Speaker]: 例如,你可以让你的智能体根据用户的输入决定使用哪个 LLM,比如对于简单的任务,我们将使用这个,对于复杂的任务,我们将使用 Claude Sonnet(原文误识别为 clot sonet)。

[原文] [Speaker]: So every time the user gives us a prompt given that it wants something it's a task We're going to classify it either by simple or complex task

[译文] [Speaker]: 所以每次用户给我们一个提示词,既然用户想要完成某件事,这就是一个任务,我们会把它归类为简单任务或复杂任务。

[原文] [Speaker]: As soon as we have it and by the way for classification I'm also going to be using the very simple LLM As soon as we have the classification we simply have an if statement that decides if it's a simple one then use the open API client which is the GP240 or if it's a complex task then we're going to be using clots from Entropic

[译文] [Speaker]: 一旦我们有了它——顺便说一句,对于分类,我也将使用非常简单的 LLM——一旦我们有了分类结果,我们只需要一个 if 语句来决定:如果是简单的,就使用 OpenAI 客户端(原文口误为 open API client),即 GPT-4o(原文误识别为 GP240);如果是复杂任务,我们将使用 Anthropic 的 Claude(原文误识别为 clots from Entropic)。
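讲者描述的这个“分类 + if 语句”路由机制,可以用下面这个极简的 Python 草图来示意(假设性代码:`classify_task`、`call_cheap_model`、`call_strong_model` 均为占位函数,并非讲者给出的实现;实际使用时应替换为廉价分类 LLM 以及 OpenAI / Anthropic 的真实客户端调用):

```python
def classify_task(prompt: str) -> str:
    """把任务分为 simple / complex。这里用关键词启发式占位,
    讲者的做法是调用一个非常便宜的 LLM 来做分类。"""
    complex_markers = ["分析", "证明", "重构", "analyze", "refactor", "prove"]
    return "complex" if any(m in prompt.lower() for m in complex_markers) else "simple"

def call_cheap_model(prompt: str) -> str:
    # 占位:此处应调用便宜模型(如 GPT-4o mini)的客户端
    return f"[gpt-4o-mini] {prompt}"

def call_strong_model(prompt: str) -> str:
    # 占位:此处应调用更强的模型(如 Claude Sonnet)的客户端
    return f"[claude-sonnet] {prompt}"

def route(prompt: str) -> str:
    # 讲者所说的 if 语句:简单任务走便宜模型,复杂任务走强模型
    if classify_task(prompt) == "simple":
        return call_cheap_model(prompt)
    return call_strong_model(prompt)
```

这样,大量简单请求只消耗便宜模型的 Token,只有真正复杂的任务才会产生强模型的费用。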

[原文] [Speaker]: Concise prompts should be a no-brainer But still a lot of engineers are making this mistake by wasting a lot of tokens Instead you can opt in for something smaller very clear so that the LLM knows what to do just by reading this one sentence

[译文] [Speaker]: 简洁的提示词应该是显而易见的,但仍有很多工程师在犯这个错误,浪费了大量的 Token;相反,你可以选择更短小、非常清晰的内容,这样 LLM 仅通过阅读这一句话就知道该做什么。

[原文] [Speaker]: Next there's a thing called zeroot prompt and a few shot prompt Zeroot prompt simply gives a prompt to the LLM and for example says classify this as either positive or negative

[译文] [Speaker]: 接下来有一个叫做零样本提示(Zero-shot prompt,原文误识别为 zeroot)和少样本提示(Few-shot prompt)的东西。零样本提示只是给 LLM 一个提示词,例如说“把这个分类为正面或负面”。

[原文] [Speaker]: While a few prompt actually gives some examples before asking something and why do we need that Well few short prompts actually reduce the chances of the LLM misinterpreting the instructions Although with a oneshot prompt is costing less

[译文] [Speaker]: 而少样本提示实际上是在提问之前给出一些例子,为什么我们需要这个?嗯,少样本提示实际上降低了 LLM 曲解指令的几率,尽管单样本提示(One-shot prompt)成本更低。
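零样本与少样本的区别可以用两个构造提示词的小函数来示意(假设性代码,示例任务沿用讲者提到的正面/负面情感分类;`zero_shot`、`few_shot` 均为说明用的占位函数):

```python
def zero_shot(text: str) -> str:
    # 零样本:只给指令,不给示例,Token 最省
    return f"把下面的评论分类为 positive 或 negative:\n{text}"

def few_shot(text: str, examples: list[tuple[str, str]]) -> str:
    # 少样本:先给几个标注好的示例,降低 LLM 曲解指令的概率,
    # 代价是每次请求都要多付这些示例的 Token
    shots = "\n".join(f"评论:{c}\n标签:{l}" for c, l in examples)
    return f"{shots}\n评论:{text}\n标签:"
```

取舍正如讲者所说:少样本更稳,零样本更便宜。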

[原文] [Speaker]: And last but not least you can opt in for a batch prompting Instead of giving one system prompt and three different user prompts you can simply merge all of that into one prompt and give it to an LLM And this has been studied and proven that it actually does not reduce the quality as long as you do batch prompting for similar requests

[译文] [Speaker]: 最后但同样重要的是,你可以选择批处理提示(Batch Prompting)。与其给出一个系统提示词和三个不同的用户提示词,你可以简单地把所有这些合并成一个提示词给 LLM,这已经被研究并证明,只要你是对类似的请求进行批处理,它实际上并不会降低质量。
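“一个系统提示词 + 多个用户请求合并为一次调用”的批处理提示,可以用下面的草图示意(假设性代码:`batch_prompt` 是说明用的占位函数;按讲者所述,这种做法只适用于同类请求):

```python
def batch_prompt(system: str, requests: list[str]) -> str:
    # 批处理提示:多个同类请求共享同一个系统提示词,一次调用全部处理,
    # 避免为每个请求重复支付系统提示词的 Token
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(requests))
    return f"{system}\n请逐条回答以下请求,按编号输出:\n{numbered}"
```

例如三条情感分类请求合并后,系统提示词只出现一次,而不是三次。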


章节 3:工具推荐:UWare 与“氛围编码”

📝 本节摘要

讲者分享了在黑客马拉松中发现的工具 UWare。该工具支持“Vibe coding”(氛围编码/直觉式编码),即通过对话生成全栈 Web 应用。本节重点介绍了其退款机制(Credit care)、内置后端服务(UBase)以及实时屏幕共享修复代码(Code View)的功能。

[原文] [Speaker]: By the way I was on hackathon last weekend and I had to build a prototype within one day And while I was searching for a perfect tool that's going to let me vibe code and build a website I stumbled upon youware

[译文] [Speaker]: 顺便说一下,上周末我参加了一个黑客马拉松,我必须在一天内构建一个原型,当我在寻找一个完美的工具让我可以进行“氛围编码(Vibe code)”并构建网站时,我偶然发现了 UWare(原文误识别为 youware)。

[原文] [Speaker]: And I was lucky to learn that UWare who is the sponsor of today's video lets you do exactly that by simply vibe coding Just imagine turning an idea into working web app not by writing code but simply by having a conversation

[译文] [Speaker]: 我很幸运地了解到,作为今天视频赞助商的 UWare,让你正是可以通过简单的“氛围编码”做到这一点,想象一下,不是通过写代码,而是仅仅通过对话,就把一个点子变成一个可工作的 Web 应用。

[原文] [Speaker]: That's exactly what this platform calls vibe coding You describe what you want and UWare generates the project from scratch live in your browser in minutes

[译文] [Speaker]: 这正是该平台所称的“氛围编码”,你描述你想要什么,UWare 就会在几分钟内于你的浏览器中从头开始实时生成项目。

[原文] [Speaker]: One thing I love about it is credit care If the AI generates something you don't like or you try something experimental that doesn't work out UWare literally refineses your credits automatically So you're never afraid to explore or test ideas

[译文] [Speaker]: 我喜欢它的一点是信用分关怀(Credit care),如果 AI 生成了你不喜欢的东西,或者你尝试了一些不成功的实验性内容,UWare 实际上会自动退还(原文误识别为 refineses,应为 refunds)你的信用分,所以你永远不用害怕探索或测试想法。

[原文] [Speaker]: And it doesn't stop at just design with UBase U gives you a real back end right out of the box You get database storage user login and authentication file and media storage and even secure secrets management for things like API keys all without setting up servers or writing backing code yourself

[译文] [Speaker]: 而且它不仅止步于设计,通过 UBase,它开箱即给你一个真正的后端,你将获得数据库存储、用户登录和认证、文件和媒体存储,甚至是对 API 密钥等内容的安全机密管理,所有这些都不需要你自己设置服务器或编写后端代码。

[原文] [Speaker]: That means your app isn't just a static prototype It's a full stack production ready application

[译文] [Speaker]: 这意味着你的应用不仅仅是一个静态原型,它是一个全栈的、可用于生产环境的应用程序。

[原文] [Speaker]: But the feature that really blew me away while building was code view Instead of trying to describe UI issues or bugs with text you can share your screen and talk to the AI in real time The AI sees exactly what you're doing including hover states transitions and context then fixes or adjusts the design for you

[译文] [Speaker]: 但在构建过程中真正让我震惊的功能是代码视图(Code view),与其试图用文字描述 UI 问题或 Bug,你可以共享你的屏幕并实时与 AI 交谈,AI 能确切地看到你在做什么,包括悬停状态、过渡动画和上下文,然后为你修复或调整设计。

[原文] [Speaker]: It feels like you're sitting next to a human engineer guiding your project I used the wear on my hackathon to go from an idea to a live preview of my application And if you want to experiment with your ideas as a viewer of my channel you're going to get discount by using the link below Go check it out and let me know what you think

[译文] [Speaker]: 这感觉就像你坐在一位人类工程师旁边指导你的项目,我在黑客马拉松上使用了 UWare(原文误识别为 the wear),从一个点子变成了我应用的实时预览,如果你作为我频道的观众想要实验你的想法,使用下面的链接将获得折扣,去看看吧,并让我知道你的想法。


章节 4:策略二:优化的响应缓存与语义缓存

📝 本节摘要

本节介绍了第二个策略:优化缓存。除了基础的“响应缓存”(即完全相同的问题直接返回存储的答案),讲者重点推荐了“语义缓存(Semantic Caching)”。通过向量化处理,系统可以识别出语义相似的问题(例如“什么是圆周率”和“圆周率的含义”),并直接从数据库(如向量数据库)返回结果,从而避免重复调用 LLM。

[原文] [Speaker]: Now the next point we're going to talk about is optimized caching The first type is going to be response caching Meaning we have a fast API in between and if the user makes a prompt the prompt goes to an LLM LLM response and the user gets an answer

[译文] [Speaker]: 现在我们要谈的下一点是优化缓存(Optimized Caching)。第一种类型是响应缓存,意味着我们中间有一个 FastAPI 服务(原文为 fast API,应指 FastAPI 框架):用户发出提示词,提示词传给 LLM,LLM 做出响应,用户得到答案。

[原文] [Speaker]: What we want to do though is eliminate this round trip to an LLM and back For this if the user asks for example what is a PI number And the next day the user asks the same we actually want to retrieve this from a cache and not from an LLM

[译文] [Speaker]: 但我们想要做的是消除这一趟去往 LLM 再返回的行程,为此,如果用户问例如“什么是 PI(圆周率)数值”,而第二天用户问了同样的问题,我们实际上希望从缓存中检索这个答案,而不是从 LLM。

[原文] [Speaker]: Meaning if the LLM returns the answer the first time we're actually going to be using some kind of a database basically a key value store where we can store this question Meaning the next day if the question comes as what is a pi number and our question is already stored in the simple text and the answer as well we're going to be giving this value back to the user instead of contacting the LLM

[译文] [Speaker]: 意思是,如果 LLM 第一次返回了答案,我们就使用某种数据库,基本上是一个键值存储(Key Value Store)来保存这个问题。这样第二天如果又有人问“什么是 PI 数值”,而这个问题和答案都已经以纯文本形式存储,我们就把存储的答案直接返回给用户,而不再联系 LLM。
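这种“以问题原文为键”的响应缓存,可以用一个极简的内存键值存储来示意(假设性代码:`ResponseCache` 是说明用的草图,`llm` 参数为占位的可调用对象;生产环境中键值存储通常会换成 Redis 之类的服务):

```python
class ResponseCache:
    """最简单的响应缓存:以问题原文为键的内存键值存储。"""

    def __init__(self, llm):
        self.llm = llm                      # 占位:任何“问题 -> 答案”的可调用对象
        self.store: dict[str, str] = {}
        self.hits = 0

    def ask(self, question: str) -> str:
        if question in self.store:          # 命中缓存:不再联系 LLM,省下这次调用的 Token
            self.hits += 1
            return self.store[question]
        answer = self.llm(question)         # 未命中:调用 LLM 并写回缓存
        self.store[question] = answer
        return answer
```

它的局限正是下文要说的:键必须一字不差地相同,换个问法就会缓存失效。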

[原文] [Speaker]: Now a cool thing here is though that we can also vectorize this because now it's going to be semantic caching and not just response caching

[译文] [Speaker]: 现在这里有个很酷的事情是,我们也可以将其向量化(Vectorize),因为这样它就变成了语义缓存(Semantic Caching),而不仅仅是响应缓存。

[原文] [Speaker]: Semantic caching means that the question does not have to be what is a pi number It can also be what does a pi number mean or what's the meaning of pi number Meaning these sentences are semantically similar

[译文] [Speaker]: 语义缓存意味着问题不一定非得是“什么是 PI 数值”,它也可以是“PI 数值意味着什么”或“PI 数值的含义是什么”,意思是这些句子在语义上是相似的。

[原文] [Speaker]: And by the way I have a video on vector databases Go check it out if you want to learn more about this

[译文] [Speaker]: 顺便说一句,我有一个关于向量数据库的视频,如果你想了解更多,去看看吧。

[原文] [Speaker]: Semantic caching in langraph would look something like this We're importing our vector store database and it can also be MongoDB or radius by the way And the first thing we're going to do is do a similarity search on the database

[译文] [Speaker]: LangGraph(原文误识别为 langraph)中的语义缓存看起来大概是这样:我们导入向量存储数据库,顺便说一句,它也可以是 MongoDB 或 Redis(原文误识别为 radius)。我们要做的第一件事是在数据库上进行相似性搜索。

[原文] [Speaker]: And if the results are found then we're giving the response If not only then we're invoking the our LLM with the question of the user and then of course storing it in the database again

[译文] [Speaker]: 如果找到了结果,我们就给出响应;如果没有,只有在那时我们才用用户的问题调用我们的 LLM,然后当然要再次把它存储在数据库中。
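讲者描述的“先做相似性搜索,未命中才调用 LLM,再写回数据库”的流程,可以用下面的草图示意(假设性代码:这里用玩具级的词袋向量和余弦相似度代替真实的嵌入模型与向量数据库,`SemanticCache`、`embed` 等名称均为说明用的占位,阈值 0.7 也只是示例取值):

```python
import math

def embed(text: str) -> dict[str, float]:
    # 玩具级“向量化”:词袋计数。真实系统应使用嵌入模型 + 向量数据库
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a.get(k, 0.0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, llm, threshold: float = 0.7):
        self.llm, self.threshold = llm, threshold
        self.entries: list[tuple[dict[str, float], str]] = []  # (问题向量, 答案)

    def ask(self, question: str) -> str:
        qv = embed(question)
        # 相似性搜索:找到语义上足够接近的已缓存问题就直接返回其答案
        for ev, answer in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return answer
        answer = self.llm(question)          # 未命中才调用 LLM
        self.entries.append((qv, answer))    # 再写回“数据库”
        return answer
```

这样“what is a pi number”和“what is the pi number”这类不同措辞的同义问题就能命中同一条缓存。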


章节 5:策略三:上下文管理与记忆优化

📝 本节摘要

最后一节强调了“上下文管理(Context Management)”的重要性。AI 工程师不仅要关注提示词,还要管理整个上下文堆栈(系统提示、记忆、RAG 工具等)。讲者建议通过总结模式、修剪旧消息和清除冗余状态来优化记忆。最后提到了 LangChain 的 ConversationSummaryMemory 类,它可以自动总结长对话,防止上下文溢出。

[原文] [Speaker]: And last but very important one is context management Context management or context engineering means we don't only care about the system prompt or the user prompt but we care about the whole stack of the things that can go into the context window

[译文] [Speaker]: 最后但非常重要的一点是上下文管理(Context Management),上下文管理或上下文工程意味着我们不仅关心系统提示词或用户提示词,我们关心的是可以进入上下文窗口的整个堆栈的内容。

[原文] [Speaker]: Meaning the system prompt user prompt short-term and long-term memory the rack tools structured output and all the guardrail So all of these things have to be managed before the prompt is gone to an LLM

[译文] [Speaker]: 意思是系统提示词、用户提示词、短期和长期记忆、RAG 工具(原文误识别为 rack tools)、结构化输出以及所有的护栏,所以所有这些东西都必须在提示词发送给 LLM 之前进行管理。
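“管理进入上下文窗口的整个堆栈”可以用一个小函数示意(假设性代码:`build_context` 是说明用的占位函数,参数顺序只是示例;要点在于每一类内容都被显式拼装,因而可以单独做修剪、摘要或清除):

```python
def build_context(system_prompt: str, user_prompt: str,
                  short_term: list[str], long_term: list[str],
                  rag_chunks: list[str], guardrails: list[str]) -> str:
    # 在发送给 LLM 之前,显式拼装进入上下文窗口的每一部分:
    # 系统提示词、护栏、长短期记忆、RAG 检索结果,最后才是用户提示词
    parts = [system_prompt, *guardrails, *long_term, *short_term,
             *rag_chunks, user_prompt]
    return "\n\n".join(p for p in parts if p)   # 跳过为空的部分,不浪费 Token
```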

[原文] [Speaker]: And by the way I'm going to be making more videos on prompt engineering as well as context engineering So make sure to subscribe so that you're a welle equipped AI engineer And don't miss any future videos

[译文] [Speaker]: 顺便说一句,我将制作更多关于提示词工程以及上下文工程的视频,所以确保订阅,这样你就能成为一名装备精良的 AI 工程师,并且不错过任何未来的视频。

[原文] [Speaker]: The gist of context engineering would be summarizing the patterns trimming the long old messages from the context and of course purging any redundant state that's coming from the memory

[译文] [Speaker]: 上下文工程的要点在于总结模式、从上下文中修剪长而旧的消息,当然还有清除来自记忆的任何冗余状态。

[原文] [Speaker]: For example if the user had this long conversation with the our system where they're asking about some questions and then figuring out that never never mind we can forget that What you can do is actually summarize all of that conversation and then store it in memory and delete the old sentences from the memory

[译文] [Speaker]: 例如,如果用户与我们的系统进行了长对话,他们问了一些问题,然后说“算了,当我没问”,你可以做的是把这整段对话总结一下,把摘要存入记忆,并从记忆中删除旧的句子。

[原文] [Speaker]: Or if the user supplies a very long error stack into our system with some type error which we can do again to analyze this or first to detect this and then make a summary out of that as well so that we're making sure that we don't save really long error stacks in our database

[译文] [Speaker]: 或者,如果用户向我们的系统粘贴了一个非常长的错误堆栈,里面有某种类型错误,我们可以先检测到它,再对其进行分析并生成摘要,以确保我们不会把非常长的错误堆栈原样保存在数据库中。

[原文] [Speaker]: Lucky for us tools like longchain and long graph actually have pre-made or predefined classes for exactly that One of them is conversation summary memory

[译文] [Speaker]: 幸运的是,像 LangChain(原文误识别为 longchain)和 LangGraph(原文误识别为 long graph)这样的工具实际上有针对此的确切的预制或预定义类,其中之一是对话总结记忆(Conversation Summary Memory)。

[原文] [Speaker]: What it's going to do is it's going to have a rolling memory that automatically summarizes the conversations Meaning if the user asks multiple things it's not just going to blatantly save all of that in the database but instead make a summary as soon as it decides that the window or the memory is getting too long

[译文] [Speaker]: 它会维护一个自动总结对话的滚动记忆。意思是如果用户问了多件事,它不会不加处理地把所有内容都保存到数据库中,而是在判定窗口或记忆变得过长时立即生成摘要。
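这种“超过阈值就把旧消息压缩成摘要”的滚动记忆机制,可以用下面的草图示意(假设性代码:LangChain 的 ConversationSummaryMemory 用 LLM 生成摘要,这里用占位的 `summarize` 函数代替,`SummaryMemory`、`max_chars` 等名称均为说明用的假设,并非该库的真实 API):

```python
class SummaryMemory:
    """滚动记忆草图:当消息总长度超过阈值时,把旧消息压缩成一条摘要并删除原文。"""

    def __init__(self, summarize, max_chars: int = 200):
        self.summarize = summarize      # 占位:接收(旧摘要, 消息列表)并返回新摘要
        self.max_chars = max_chars
        self.summary = ""
        self.messages: list[str] = []

    def add(self, message: str) -> None:
        self.messages.append(message)
        if sum(len(m) for m in self.messages) > self.max_chars:
            # 记忆过长:生成摘要并清空原始消息,防止上下文溢出
            self.summary = self.summarize(self.summary, self.messages)
            self.messages = []

    def context(self) -> str:
        # 交给 LLM 的记忆 = 一条摘要 + 尚未被压缩的近期消息
        head = [f"摘要:{self.summary}"] if self.summary else []
        return "\n".join(head + self.messages)
```

交给 LLM 的永远是“一条摘要 + 近期消息”,而不是整段历史对话。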

[原文] [Speaker]: If you guys learned something new make sure to like the video and subscribe And I'm going to see you guys in the next one Goodbye

[译文] [Speaker]: 如果你们学到了一些新东西,请确保给视频点赞并订阅,我们下个视频见,再见。