What is a Vector Database? Powering Semantic Search & AI Applications

Chapter 1: The Limits of Traditional Databases and the "Semantic Gap"

📝 Section Summary

Using a picture of a mountain sunset as an example, this section explores the limitations of traditional relational databases when handling unstructured data. Although an image can be managed by storing its binary file, basic metadata, and manually added tags, this approach cannot capture the image's "semantic context" (such as its color palette or specific landscape features). The speaker points out that there is a disconnect between how computers store data and how humans understand it, known as the "semantic gap," which prevents traditional structured queries from effectively retrieving unstructured data with multi-dimensional characteristics.

[Speaker]: What is a vector database? Well, they say a picture is worth a thousand words, so let's start with one.

[Speaker]: In case you can't tell, this is a picture of a sunset on a mountain vista. Beautiful.

[Speaker]: Now let's say this is a digital image and we want to store it. We want to put it into a database, and we're going to use a traditional database here called a relational database.

[Speaker]: So what can we store about this picture in that relational database? Well, to start with, we can put the actual binary data of the picture into our database,

[Speaker]: so that's the actual image file. But we can also store some other information, like some basic metadata about the picture: things like the file format, the date it was created, and so on.

[Speaker]: And we can add some manually created tags as well. So we could have tags for "sunset," "landscape," and "orange,"

[Speaker]: and that gives us a basic way to retrieve this image, but it largely misses the image's overall semantic context.

[Speaker]: For example, how would you use this information to query for images with similar color palettes, or for images with mountain landscapes in the background?

[Speaker]: Those concepts aren't represented very well in these structured fields, and that disconnect between how computers store data and how humans understand it has a name: the semantic gap.

[Speaker]: A traditional database query like SELECT * WHERE color = 'orange' falls short because it can't capture the nuanced, multi-dimensional nature of unstructured data.
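To make the semantic gap concrete, here is a minimal Python sketch of the tag-based approach the speaker describes (the records and tags are illustrative, not from the original): an exact tag match works, but a semantic query like "warm palette" finds nothing unless someone happened to add that exact tag.

```python
# A toy "relational" store: each image row holds its file reference,
# basic metadata, and manually added tags.
images = [
    {"file": "mountain_sunset.jpg", "format": "JPEG",
     "tags": {"sunset", "landscape", "orange"}},
    {"file": "beach_sunset.jpg", "format": "JPEG",
     "tags": {"sunset", "beach"}},
    {"file": "city_night.jpg", "format": "PNG",
     "tags": {"city", "night"}},
]

def query_by_tag(tag):
    """Exact-match tag lookup -- the only kind of retrieval tags support."""
    return [img["file"] for img in images if tag in img["tags"]]

print(query_by_tag("sunset"))        # exact tag match works
print(query_by_tag("warm palette"))  # semantic query: no such tag, nothing found
```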


Chapter 2: Defining Vector Databases and Embeddings

📝 Section Summary

This section formally introduces the concept of a vector database. The speaker explains how it solves the semantic-understanding problem by converting unstructured data (such as images, text, and audio) into vector embeddings, which are essentially arrays of numbers. The core mechanism is that semantically similar items are positioned close together in vector space, while dissimilar items sit far apart. This lets a computer perform similarity search as a mathematical operation and find semantically related content.

[Speaker]: Well, that's where vector databases come in, by representing data as mathematical vector embeddings,

[Speaker]: and what vector embeddings are is essentially arrays of numbers.

[Speaker]: Now, these vectors capture the semantic essence of the data: similar items are positioned close together in vector space, and dissimilar items are positioned far apart,

[Speaker]: and with vector databases we can perform similarity searches as mathematical operations, looking for vector embeddings that are close to each other,

[Speaker]: which translates to finding semantically similar content.

[Speaker]: Now, we can represent all sorts of unstructured data in a vector database.

[Speaker]: What could we put in here?

[Speaker]: Well, image files, of course, like our mountain sunset.

[Speaker]: We could put in text files, or we could even store audio files in here as well.

[Speaker]: This is all unstructured data, and these complex objects are transformed into vector embeddings, which are then stored in the vector database.

[Speaker]: So what do these vector embeddings look like?

[Speaker]: Well, as I said, they are arrays of numbers, and in those arrays each position represents some kind of learned feature.
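"Close together" and "far apart" are literal geometric statements. As a minimal sketch (the three-dimensional vectors below are made up for illustration), Euclidean distance is one way to measure it:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings: two sunset images and one city-at-night image.
sunset_a = [0.9, 0.1, 0.8]
sunset_b = [0.8, 0.2, 0.9]
city     = [0.1, 0.9, 0.2]

# Semantically similar items end up close; dissimilar ones end up far apart.
print(euclidean_distance(sunset_a, sunset_b))  # small
print(euclidean_distance(sunset_a, city))      # much larger
```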


Chapter 3: Reading Vector Dimensions: Mountain vs. Beach

📝 Section Summary

Through a simplified case, this section shows concretely how vector embeddings work. The speaker compares the vector values of a "mountain sunset" picture and a "beach sunset" picture. In this hypothetical example, particular dimension values correspond to particular features (elevation change, density of urban elements, strength of warm colors). The two images have similar values in the "warm colors" dimension (both are sunsets) but very different values in the "elevation change" dimension (high for the mountain, low for the beach). The speaker closes by stressing that real machine learning systems typically use thousands of dimensions, and that individual dimensions are rarely as directly interpretable as in this example.

[Speaker]: So let's take a simplified example. Remember our mountain picture here?

[Speaker]: Yep, we can represent that as a vector embedding.

[Speaker]: Now, let's say the vector embedding for the mountain has a first dimension of, say, 0.91, the next one is 0.15, and then there's a third dimension of 0.83, and so forth.

[Speaker]: What does all that mean?

[Speaker]: Well, the 0.91 in the first dimension indicates significant elevation changes because, hey, this is the mountains.

[Speaker]: The 0.15 in the second dimension shows few urban elements; we don't see many buildings here, so that score is quite low.

[Speaker]: The 0.83 in the third dimension represents strong warm colors, like a sunset, and so on.

[Speaker]: All sorts of other dimensions can be added as well.

[Speaker]: Now we could compare that to a different picture. What about this one, a sunset at the beach?

[Speaker]: So let's have a look at the vector embedding for the beach example.

[Speaker]: This would also have a series of dimensions. Let's say the first one is 0.12, then we have 0.08, and finally we have 0.89, with more dimensions to follow.

[Speaker]: Now, notice how there are some similarities here.

[Speaker]: The third dimension, 0.83 versus 0.89, is pretty similar. That's because they both have warm colors; they're both pictures of sunsets,

[Speaker]: but the first dimension differs quite a lot here, because a beach has minimal elevation changes compared to the mountains.

[Speaker]: Now, this is a very simplified example.

[Speaker]: In real machine learning systems, vector embeddings typically contain hundreds or even thousands of dimensions,

[Speaker]: and I should also say that individual dimensions like these rarely correspond to such clearly interpretable features, but you get the idea.
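The comparison above can be done numerically. Here is a minimal sketch using cosine similarity, one common similarity measure for embeddings (the third "city at night" vector is made up for contrast and is not from the original):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The three dimensions from the example: [elevation, urban, warm colors].
mountain_sunset = [0.91, 0.15, 0.83]
beach_sunset    = [0.12, 0.08, 0.89]
city_night      = [0.05, 0.97, 0.10]  # hypothetical: urban, few warm colors

print(cosine_similarity(mountain_sunset, beach_sunset))  # fairly high: both sunsets
print(cosine_similarity(mountain_sunset, city_night))    # low: little in common
```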


Chapter 4: How Embedding Models Work

📝 Section Summary

This section looks more closely at how vector embeddings are produced. The speaker explains that they are created by "embedding models" trained on massive datasets, with each type of data using its own specialized model (for example, CLIP for images, GloVe for text, and wav2vec for audio). The core mechanism is layered extraction: as data passes through the model's layers, early layers pick out basic features (such as image edges or individual words) while deeper layers capture more abstract concepts (such as whole objects or contextual meaning). The result is a high-dimensional vector with hundreds or even thousands of dimensions, which enables the similarity searches that traditional databases cannot perform.

[Speaker]: And this all brings up the question of how these vector embeddings are actually created.

[Speaker]: Well, the answer is through embedding models that have been trained on massive datasets.

[Speaker]: Each type of data has its own specialized type of embedding model that we can use.

[Speaker]: So I'm going to give you some examples of those.

[Speaker]: For example, CLIP. You might use CLIP for images.

[Speaker]: If you're working with text, you might use GloVe, and if you're working with audio, you might use wav2vec.

[Speaker]: These processes are all pretty similar.

[Speaker]: Basically, you have data that passes through multiple layers,

[Speaker]: and as it goes through the layers of the embedding model, each layer extracts progressively more abstract features.

[Speaker]: So for images, the early layers might detect some pretty basic stuff, like, say, edges,

[Speaker]: and then as we get to deeper layers, we'd recognize more complex stuff, like maybe entire objects.

[Speaker]: For text, perhaps these early layers would figure out the individual words we're looking at,

[Speaker]: but then later, deeper layers would be able to figure out context and meaning,

[Speaker]: and how this essentially works is that we take the high-dimensional vectors from one of these deeper layers,

[Speaker]: and those high-dimensional vectors often have hundreds or maybe even thousands of dimensions that capture the essential characteristics of the input.
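The layered-extraction idea can be sketched with a toy text example. This is purely illustrative: real embedding models learn their features from massive datasets, whereas the "features" and word lists below are hand-written. An early stage produces low-level features (the individual words), and a later stage turns them into a small "embedding" of more abstract features.

```python
# Hand-written feature vocabularies -- a real model would learn these.
WARM_WORDS = {"sunset", "orange", "golden", "glow"}
NATURE_WORDS = {"mountain", "beach", "ocean", "sky", "vista"}

def early_layer(text):
    """Early layer: basic features only -- the individual words."""
    return text.lower().split()

def deeper_layer(words):
    """Deeper layer: abstract features -- a tiny 'embedding' of the text."""
    return [
        sum(w in WARM_WORDS for w in words) / len(words),    # warm-color content
        sum(w in NATURE_WORDS for w in words) / len(words),  # nature content
        len(set(words)) / len(words),                        # vocabulary variety
    ]

def embed(text):
    """Pass the input through both stages, as data flows through model layers."""
    return deeper_layer(early_layer(text))

print(embed("golden sunset over a mountain vista"))
```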

[Speaker]: Now that we have vector embeddings created, we can perform all sorts of powerful operations that just weren't possible with those traditional relational databases,

[Speaker]: things like similarity search, where we can find items that are similar to a query item by finding the closest vectors in the space.
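A minimal sketch of similarity search over a small in-memory "database" of embeddings (the vectors and labels are made up; cosine similarity is one common choice of closeness measure):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Tiny "vector database": label -> embedding.
database = {
    "mountain_sunset": [0.91, 0.15, 0.83],
    "beach_sunset":    [0.12, 0.08, 0.89],
    "city_night":      [0.05, 0.97, 0.10],
}

def similarity_search(query, k=2):
    """Brute force: score every stored vector, return the k closest."""
    scored = sorted(database.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [label for label, _ in scored[:k]]

# A hypothetical query embedding for a warm seaside evening scene.
print(similarity_search([0.20, 0.10, 0.90]))  # → ['beach_sunset', 'mountain_sunset']
```

Note that this scans the entire database for every query, which is exactly the cost that the indexing methods in the next chapter exist to avoid.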


Chapter 5: Vector Indexing: Balancing Speed and Accuracy at Scale

📝 Section Summary

When a database grows to millions of vectors, comparing the query vector against every stored vector one by one becomes extremely slow and inefficient. This section introduces the solution: vector indexing. Using approximate nearest neighbor (ANN) algorithms, the system no longer insists on the exact closest match but instead quickly finds vectors that are very likely to be among the closest. The speaker names two concrete indexing methods, HNSW (which builds multi-layered graphs) and IVF (which divides the space into clusters); the core logic of both is to trade a small amount of accuracy for a large gain in search speed.

[Speaker]: But when you have millions of vectors in your database, and those vectors are made up of hundreds or maybe even thousands of dimensions,

[Speaker]: you can't effectively and efficiently compare your query vector to every single vector in the database.

[Speaker]: It would just be too slow.

[Speaker]: So there's a process for this, and it's called vector indexing.

[Speaker]: Vector indexing uses something called approximate nearest neighbor, or ANN, algorithms,

[Speaker]: and instead of finding the exact closest match, these algorithms quickly find vectors that are very likely to be among the closest matches.

[Speaker]: Now, there are a bunch of approaches for this.

[Speaker]: For example, HNSW, that is, Hierarchical Navigable Small World, which creates multi-layered graphs connecting similar vectors,

[Speaker]: and there's also IVF, that's Inverted File Index, which divides the vector space into clusters and only searches the most relevant of those clusters.

[Speaker]: These indexing methods are basically trading a small amount of accuracy for pretty big improvements in search speed.
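The IVF idea can be sketched in a few lines. This is a deliberately tiny toy: two hand-picked cluster centroids instead of learned ones, plain Euclidean distance, and a single probed cluster, whereas real IVF implementations train the centroids (typically with k-means) and usually probe several clusters.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked cluster centroids (real IVF learns these from the data).
centroids = [[0.9, 0.1], [0.1, 0.9]]

# Build the inverted lists: each vector is filed under its nearest centroid.
vectors = {"a": [0.95, 0.15], "b": [0.85, 0.05],
           "c": [0.12, 0.88], "d": [0.05, 0.95]}
clusters = {i: [] for i in range(len(centroids))}
for label, vec in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: distance(vec, centroids[i]))
    clusters[nearest].append(label)

def ivf_search(query):
    """Probe only the cluster whose centroid is nearest to the query."""
    probe = min(range(len(centroids)), key=lambda i: distance(query, centroids[i]))
    candidates = clusters[probe]  # only this subset is scanned, not the whole DB
    return min(candidates, key=lambda label: distance(query, vectors[label]))

print(ivf_search([0.9, 0.2]))  # scans cluster 0 only → 'a'
```

The speed/accuracy trade-off is visible here: if the true nearest neighbor happened to sit in the unprobed cluster, this search would miss it.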


Chapter 6: A Core Application: RAG, and a Recap

📝 Section Summary

This section introduces a core application of vector databases: retrieval-augmented generation (RAG). In a RAG architecture, the vector database stores chunks of documents, articles, and knowledge bases as embeddings. When a user asks a question, the system finds the relevant text chunks by comparing vector similarity and feeds them to a large language model (LLM) to generate a response. Finally, the speaker sums up the dual value of vector databases: they are both a place to store unstructured data and the key to retrieving it quickly and semantically.

[Speaker]: Now, vector databases are a core component of something called RAG, retrieval-augmented generation,

[Speaker]: where vector databases store chunks of documents, articles, and knowledge bases as embeddings,

[Speaker]: and then, when a user asks a question, the system finds the relevant text chunks by comparing vector similarity

[Speaker]: and feeds those to a large language model to generate a response using the retrieved information.

[Speaker]: So that's vector databases.

[Speaker]: They are both a place to store unstructured data and a place to retrieve it quickly and semantically.
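The RAG retrieval step can be sketched end to end. Everything here is illustrative: the toy `embed` function counts occurrences of a few fixed vocabulary words instead of calling a real embedding model, and the "LLM call" is just a formatted prompt string rather than a real model invocation.

```python
# Knowledge base, pre-split into chunks (as a real RAG pipeline would do).
chunks = [
    "Vector embeddings are arrays of numbers capturing semantic meaning.",
    "HNSW builds multi-layered graphs connecting similar vectors.",
    "Relational databases store rows with structured fields.",
]

VOCAB = ["vector", "embeddings", "graphs", "relational", "semantic"]

def embed(text):
    """Toy embedding: one dimension per vocabulary word.
    A real system would call an embedding model here."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    words = cleaned.split()
    return [float(words.count(w)) for w in VOCAB]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The "vector database": each chunk stored alongside its embedding.
index = [(chunk, embed(chunk)) for chunk in chunks]

def rag_prompt(question, k=1):
    """Retrieve the k most similar chunks and build a prompt for an LLM."""
    q = embed(question)
    retrieved = sorted(index, key=lambda item: dot(q, item[1]), reverse=True)[:k]
    context = "\n".join(chunk for chunk, _ in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

print(rag_prompt("What are vector embeddings?"))
```

In a full pipeline the returned prompt would be sent to an LLM, which generates its answer grounded in the retrieved context rather than in its training data alone.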