bstadt • 2024-01-30

Building a Fully Open Source Retriever with Nomic Embed and LlamaIndex

What is a Retriever?

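Recently, retrieval augmented generation (RAG) has enabled language models to reduce hallucinations, improve response quality, and maintain up-to-date knowledge of the world without retraining the model itself. This is done by equipping a language model with a retriever and a database. At inference time, a RAG system uses the retriever to select relevant documents from the database and passes them into the language model's context window.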

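Today, the most popular type of retriever is based on an embedding model. The embedding model converts every document in the database into a vector representation. Then, at inference time, it converts the query into a vector representation and retrieves the documents whose vectors are most similar to the query vector.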
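
To make the retrieval step concrete, here is a minimal sketch of similarity search over embeddings. It assumes cosine similarity as the similarity measure (a common choice, not one this post specifies), and the three-dimensional vectors, document strings, and the `retrieve` helper are hypothetical stand-ins for what a real embedding model and vector store would provide.

```python
import numpy as np


def cosine_similarity(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_units @ query_unit


def retrieve(query_vec, doc_matrix, documents, top_k=1):
    """Return the top_k (document, score) pairs, most similar first."""
    scores = cosine_similarity(query_vec, doc_matrix)
    top_indices = np.argsort(-scores)[:top_k]
    return [(documents[i], float(scores[i])) for i in top_indices]


# Hypothetical toy data: a real system would get these vectors from an embedding model.
documents = ["essay about programming", "recipe for bread", "notes on painting"]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 0.9],
])
query_vector = np.array([1.0, 0.2, 0.0])

results = retrieve(query_vector, doc_vectors, documents, top_k=1)
print(results[0][0])  # essay about programming
```

In practice the document vectors are computed once and stored, and only the query is embedded at inference time.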

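In this post, we show you how to build a fully open source retriever using LlamaIndex and Nomic Embed, the first fully open source embedding model to outperform OpenAI Ada on both short and long context benchmarks.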

Why Open Source?

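As AI is increasingly deployed in high-impact domains such as defense, medicine, and finance, end-to-end auditability of the entire system becomes a critical component of safe AI deployment. Unfortunately, the closed source embedding models used in most RAG systems today deliberately obfuscate their training protocols and cannot be audited.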

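Further, as the organizations adopting AI mature, dependence on closed source embedding models will lead to vendor lock-in and limit the ability to modify the embedding model to meet business needs.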

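Luckily, fully open source embedding models like Nomic Embed offer end-to-end auditability of the training process, as well as a strong basis for further improvement and modification of the model.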

How-To Guide

To build an open source retriever with LlamaIndex and Nomic Embed, we start by importing the relevant libraries:

from llama_index.embeddings import NomicEmbedding
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
)

Next, we need to download some data for our database. For this example, we use an essay by Paul Graham, which we download and place in a directory named ./data/paul_graham.

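Now it's time to compute vectors for the documents in our database. To do this, we use LlamaIndex's SimpleDirectoryReader and Nomic's hosted inference service. You will need to replace <NOMIC_API_KEY> with your Nomic API key, which you can obtain after signing up for Nomic Atlas.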

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
nomic_api_key = "<NOMIC_API_KEY>"
embed_model = NomicEmbedding(
    api_key=nomic_api_key,
    model_name="nomic-embed-text-v1",
    task_type="search_document"
)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model, chunk_size=1024,
)
index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context, show_progress=True
)

Note that we set task_type to search_document in NomicEmbedding. Nomic Embed supports several different task types, and search_document is optimized for building document representations for a RAG database.

Once our database is set up, we are ready to build our retriever. With LlamaIndex, this takes only a few lines of Python:

embed_model = NomicEmbedding(
    api_key=nomic_api_key,
    model_name="nomic-embed-text-v1",
    task_type="search_query"
)

service_context = ServiceContext.from_defaults(
    embed_model=embed_model
)

search_query_retriever = index.as_retriever(service_context=service_context, similarity_top_k=1)

Again, note that we are using a new NomicEmbedding model, this time with task_type set to search_query. This task type is optimized for embedding queries that search a retrieval database.
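
One way to picture what the two task types do is as a short text prefix that tells the model which role an input plays. The sketch below rests on an assumption: to our understanding, the openly released nomic-embed-text-v1 weights expect inputs prefixed this way, while the hosted service used in this post takes the task_type argument instead. The `with_task_prefix` helper is hypothetical and is not part of LlamaIndex or the Nomic client.

```python
# Only the two task types used in this post are listed here.
TASK_TYPES = ("search_document", "search_query")


def with_task_prefix(text: str, task_type: str) -> str:
    """Hypothetical helper: prepend the task type as a plain-text prefix.

    Assumption: the model separates document and query inputs by a
    "<task_type>: " prefix on the raw text.
    """
    if task_type not in TASK_TYPES:
        raise ValueError(f"unsupported task_type: {task_type!r}")
    return f"{task_type}: {text}"


document_input = with_task_prefix("What I Worked On", "search_document")
query_input = with_task_prefix("What software did Paul write?", "search_query")

print(document_input)  # search_document: What I Worked On
print(query_input)     # search_query: What software did Paul write?
```

Whatever the mechanism, the practical rule is the same as in the code above: embed stored documents with search_document and incoming queries with search_query, so that both sides of the similarity comparison are produced the way the model was trained to expect.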

Finally, we can use our retriever to find relevant documents for a user query! For example, running

retrieved_nodes_nomic = search_query_retriever.retrieve(
    "What software did Paul write?"
)

returns a document describing Paul's first programs:

Node ID: 380fbb0e-6fc1-41de-a4f6-3f22cd508df3
Similarity: 0.6087318771843091
Text: What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.

With microcomputers, everything changed.

Conclusion & Next Steps

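In this post, we showed you how to build a fully open source retriever using Nomic Embed and LlamaIndex. If you want to dig deeper, the source code for Nomic Embed is publicly available. You can also use Nomic Atlas to visualize your retrieval database, and use LlamaIndex to connect it to a generative model for full RAG.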