
PostgresML • 2024-07-19
Improving Vector Search - Reranking with PostgresML and LlamaIndex
Search and reranking: improving result relevance
Search systems typically rely on two main approaches: keyword search and semantic search. Keyword search matches exact query terms against indexed database content, while semantic search uses natural language processing (NLP) and machine learning to understand the context and intent behind a query. Many effective systems combine both approaches for the best results.
After the initial retrieval, reranking can further improve the relevance of the results. Traditional reranking relies on historical user-interaction data, but that approach struggles with new content and needs large amounts of data to train effectively. A more advanced alternative is to use cross-encoders, which directly compare query-result pairs and score their similarity.
A cross-encoder directly compares two pieces of text and computes a similarity score. Unlike traditional semantic search methods, we cannot precompute embeddings for a cross-encoder and reuse them later. Instead, we have to run the cross-encoder on every pair of texts we want to compare, which makes the approach computationally expensive and unsuitable for large-scale search. It is very effective, however, for reranking a subset of a dataset, because it excels at evaluating new, unseen data without requiring extensive user-interaction data for fine-tuning.
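To make this concrete, here is a minimal sketch of cross-encoder scoring using the sentence-transformers library (our choice for illustration; later in this post PostgresML runs the cross-encoder for us inside the database). The model scores each query-passage pair jointly, so nothing is precomputed or reused:

# Illustrative sketch only: scoring query-passage pairs with a cross-encoder.
# Assumes the sentence-transformers package is installed (pip install sentence-transformers).
from sentence_transformers import CrossEncoder

model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v1")

query = "What did the author do as a child?"
passages = [
    "Before college the two main things I worked on, outside of school, were writing and programming.",
    "The best thing about New York for me was the presence of Idelle and Julian Weber.",
]

# Each pair is scored together in a single forward pass; higher means more relevant.
scores = model.predict([(query, passage) for passage in passages])
for passage, score in sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {passage}")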

Cross-encoders complement and enhance traditional reranking systems by addressing their limitations in deep text analysis, especially for novel or highly specific content. They don't rely on extensive user-interaction data for training (though such data still helps), and they excel at handling new, previously unseen data. This makes cross-encoders an excellent choice for improving the relevance of search results in reranking scenarios.
Implementing reranking
We'll implement a simple reranking example using LlamaIndex and a PostgresML managed index. For more information on PostgresML managed indexes, check out our joint announcement with LlamaIndex: Simplify your RAG application architecture with LlamaIndex + PostgresML.
Install the required dependencies to get started:
pip install llama_index llama-index-indices-managed-postgresml
We'll use the Paul Graham dataset, which can be downloaded with curl:
mkdir data
curl -o data/paul_graham_essay.txt https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
The PostgresML managed index handles storing, splitting, embedding, and querying our documents. All we need is a database connection string. If you don't already have one, create your PostgresML account; you'll receive $100 in free credits after completing your profile.
Set the PGML_DATABASE_URL environment variable:
export PGML_DATABASE_URL="{YOUR_CONNECTION_STRING}"
Let's create our index:
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.indices.managed.postgresml import PostgresMLIndex
documents = SimpleDirectoryReader("data").load_data()
index = PostgresMLIndex.from_documents(
    documents, collection_name="llama-index-rerank-example"
)
Note that the collection_name is used to uniquely identify the index you are working with.
Here we load our documents with the SimpleDirectoryReader and then build a PostgresMLIndex from them.
No document preprocessing is required in this workflow. Instead, the documents are sent directly to PostgresML, where they are stored, split, and embedded according to the pipeline specification. This is one of the unique aspects of working with a PostgresML managed index.
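For intuition, a pipeline specification on the PostgresML side looks roughly like the sketch below, written with the pgml Python SDK. The collection name matches the index created above, but the splitter and embedding model shown here are illustrative assumptions rather than the managed index's actual defaults:

# Rough sketch of a PostgresML pipeline spec; the splitter and model names are assumptions.
import asyncio
from pgml import Collection, Pipeline

async def main():
    collection = Collection("llama-index-rerank-example")
    pipeline = Pipeline(
        "v1",
        {
            "text": {
                # Split each document's text field into chunks...
                "splitter": {"model": "recursive_character"},
                # ...then embed each chunk for semantic search.
                "semantic_search": {"model": "intfloat/e5-small-v2"},
            }
        },
    )
    await collection.add_pipeline(pipeline)

asyncio.run(main())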
Now let's search! We can perform a semantic search and get the top two results by creating a retriever from our index.
retriever = index.as_retriever(limit=2)
docs = retriever.retrieve("What did the author do as a child?")
for doc in docs:
    print("---------")
    print(f"Id: {doc.id_}")
    print(f"Score: {doc.score}")
    print(f"Text: {doc.text}")
Doing this, we get:
---------
Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e
Score: 0.7793415653313153
Text: Wow, I thought, there's an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then. In the print era there was a narrow channel to readers, guarded by fierce monsters known as editors. The only way to get an audience for anything you wrote was to get it published as a book, or in a newspaper or magazine. Now anyone could publish anything.
This had been possible in principle since 1993, but not many people had realized it yet. I had been intimately involved with building the infrastructure of the web for most of that time, and a writer as well, and it had taken me 8 years to realize it. Even then it took me several years to understand the implications. It meant there would be a whole new generation of essays. [11]
In the print era, the channel for publishing essays had been vanishingly small. Except for a few officially anointed thinkers who went to the right parties in New York, the only people allowed to publish essays were specialists writing about their specialties. There were so many essays that had never been written, because there had been no way to publish them. Now they could be, and I was going to write them. [12]
I've worked on several different things, but to the extent there was a turning point where I figured out what to work on, it was when I started publishing essays online. From then on I knew that whatever else I did, I'd always write essays too.
---------
Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e
Score: 0.7770352826735559
Text: Asterix comics begin by zooming in on a tiny corner of Roman Gaul that turns out not to be controlled by the Romans. You can do something similar on a map of New York City: if you zoom in on the Upper East Side, there's a tiny corner that's not rich, or at least wasn't in 1993. It's called Yorkville, and that was my new home. Now I was a New York artist — in the strictly technical sense of making paintings and living in New York.
I was nervous about money, because I could sense that Interleaf was on the way down. Freelance Lisp hacking work was very rare, and I didn't want to have to program in another language, which in those days would have meant C++ if I was lucky. So with my unerring nose for financial opportunity, I decided to write another book on Lisp. This would be a popular book, the sort of book that could be used as a textbook. I imagined myself living frugally off the royalties and spending all my time painting. (The painting on the cover of this book, ANSI Common Lisp, is one that I painted around this time.)
The best thing about New York for me was the presence of Idelle and Julian Weber. Idelle Weber was a painter, one of the early photorealists, and I'd taken her painting class at Harvard. I've never known a teacher more beloved by her students. Large numbers of former students kept in touch with her, including me. After I moved to New York I became her de facto studio assistant.
These results aren't bad, but they aren't perfect. Let's try reranking with a cross-encoder.
retriever = index.as_retriever(
    limit=2,
    rerank={
        "model": "mixedbread-ai/mxbai-rerank-base-v1",
        "num_documents_to_rerank": 100
    }
)
docs = retriever.retrieve("What did the author do as a child?")
for doc in docs:
    print("---------")
    print(f"Id: {doc.id_}")
    print(f"Score: {doc.score}")
    print(f"Text: {doc.text}")
Here we configure the retriever to return the top two documents, but this time we add a rerank parameter specifying the mixedbread-ai/mxbai-rerank-base-v1 model. This means the initial semantic search returns 100 results, those results are then reranked by the mixedbread-ai/mxbai-rerank-base-v1 model, and only the top two are returned.
Running this code outputs:
Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e
Score: 0.17803585529327393
Text: What I Worked On
February 2021
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.
---------
Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e
Score: 0.1057136133313179
Text: I wanted not just to build things, but to build things that would last.
In this dissatisfied state I went in 1988 to visit Rich Draves at CMU, where he was in grad school. One day I went to visit the Carnegie Institute, where I'd spent a lot of time as a kid. While looking at a painting there I realized something that might seem obvious, but was a big surprise to me. There, right on the wall, was something you could make that would last. Paintings didn't become obsolete. Some of the best ones were hundreds of years old.
And moreover this was something you could make a living doing. Not as easily as you could by writing software, of course, but I thought if you were really industrious and lived really cheaply, it had to be possible to make enough to survive. And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding.
I had always liked looking at paintings. Could I make them? I had no idea. I'd never imagined it was even possible. I knew intellectually that people made art — that it didn't just appear spontaneously — but it was as if the people who made it were a different species. They either lived long ago or were mysterious geniuses doing strange things in profiles in Life magazine. The idea of actually being able to make art, to put that verb before that noun, seemed almost miraculous.
These results are much better! The top-ranked document now contains the answer to the user's question. Note that we didn't have to call out to any third-party API for reranking; once again, PostgresML handles the reranking with a cross-encoder directly in the database.
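Conceptually, the rerank step works like this: a fast vector search oversamples candidates (100 here), and the cross-encoder then reorders only that small set. A rough client-side sketch of the same two-stage pattern, assuming the sentence-transformers package and a toy list of chunks, could look like this:

# Illustrative two-stage retrieval: cheap embedding search, then cross-encoder rerank.
# PostgresML performs the equivalent steps inside the database when rerank is set.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model for the sketch
reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v1")

query = "What did the author do as a child?"
chunks = [
    "Before college the two main things I worked on were writing and programming.",
    "Now I was a New York artist, in the strictly technical sense.",
    "If I write something and put it on the web, anyone can read it.",
]

# Stage 1: embed everything once and keep the top candidates by cosine similarity.
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)
query_embedding = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=100)[0]

# Stage 2: score only those candidates with the cross-encoder and keep the top 2.
candidates = [chunks[hit["corpus_id"]] for hit in hits]
scores = reranker.predict([(query, candidate) for candidate in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:2]
for text, score in reranked:
    print(f"{score:.3f}  {text}")

The expensive cross-encoder only ever sees the oversampled candidates, which keeps reranking affordable even over large collections.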
We can use reranking directly in RAG:
query_engine = index.as_query_engine(
    streaming=True,
    vector_search_limit=2,
    vector_search_rerank={
        "model": "mixedbread-ai/mxbai-rerank-base-v1",
        "num_documents_to_rerank": 100,
    },
)
results = query_engine.query("What did the author do as a child?")
for text in results.response_gen:
    print(text, end="", flush=True)
Running this code outputs:
Based on the context information, as a child, the author worked on writing (writing short stories) and programming (on the IBM 1401 using Fortran) outside of school.
Exactly the answer we were looking for!
Reranking for better results
Search can be complicated. Reranking with cross-encoders improves it by comparing text pairs directly and handling new data gracefully. Implementing reranking with LlamaIndex and PostgresML improves search results and yields more precise answers in retrieval-augmented generation (RAG) applications.
To get started with PostgresML and LlamaIndex, follow the PostgresML getting-started guide to set up your account, and try the examples above with your own data.