Jerry Liu • 2023-06-22

LlamaIndex and Weaviate

Co-authors

Jerry Liu (Co-founder/CEO of LlamaIndex)
Erika Cardenas (Developer Advocate at Weaviate)

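While large language models (LLMs) like GPT-4 are impressive at generation and reasoning, they are limited in their ability to access and retrieve specific facts, figures, or contextually relevant information. A popular solution to this problem is to set up a retrieval-augmented generation (RAG) system: combine the language model with an external storage provider, and build an overall software system that orchestrates the interactions with and between these components to create a “chat with your data” experience.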

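Together, Weaviate and LlamaIndex provide the key components needed to set up a robust and reliable RAG stack, so that you can deliver LLM-enabled experiences over your data, such as search engines, chatbots, and more. First, we can use Weaviate as the vector database that acts as the external storage provider. Next, we can use a data framework such as LlamaIndex to handle data management and orchestration around Weaviate when building the LLM application.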

In this blog post, we give an overview of LlamaIndex and some of its core data management and query modules. We then walk through an initial demo notebook.

We are kicking off a new series that guides you through using LlamaIndex and Weaviate in your LLM applications.

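Introduction to LlamaIndex

LlamaIndex is a data framework for building LLM applications. It provides a comprehensive toolkit for ingesting, managing, and querying your external data so that you can use it in your LLM application.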

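Data Ingestion

For data ingestion, LlamaIndex offers connectors to more than 100 data sources, ranging from file formats (.pdf, .docx, .pptx) to APIs (Notion, Slack, Discord, etc.) to web scrapers (Beautiful Soup, Readability, etc.). These data connectors are primarily hosted on LlamaHub, which makes it easy to integrate data from your existing files and applications.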

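Data Indexing

Once the data is loaded, LlamaIndex can index it using a variety of data structures and storage integrations, including Weaviate. LlamaIndex supports indexing unstructured, semi-structured, and structured data. A standard way to index unstructured data is to split the source documents into text “chunks”, embed each chunk, and store each chunk together with its embedding in a vector database.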
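
The split, embed, and store steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how LlamaIndex or Weaviate implement indexing: `split_into_chunks`, `toy_embed`, and `build_index` are hypothetical names, the "embedding" is a hashed bag of words standing in for a real embedding model, and the "vector database" is an in-memory list.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    embedding: List[float]


def split_into_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text: str, dim: int = 16) -> List[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(document: str) -> List[Chunk]:
    """Chunk the document and pair every chunk with its embedding (the in-memory 'store')."""
    return [Chunk(text=c, embedding=toy_embed(c)) for c in split_into_chunks(document)]


store = build_index(
    "LlamaIndex splits documents into chunks embeds each chunk "
    "and stores the vectors in Weaviate"
)
for chunk in store:
    print(chunk.text)
```

In a real stack, the embedding call would go to an embedding model and the list would be replaced by a vector database such as Weaviate; the chunk size and overlap here are arbitrary values chosen to keep the example small.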

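Data Querying

Once your data is ingested and stored, LlamaIndex provides the tools to define an advanced retrieval/query “engine” over it. Our retriever constructs let you retrieve data from your knowledge base given an input prompt. A query engine construct lets you define an interface that takes an input prompt and outputs a knowledge-augmented response; internally, it can use retrieval and synthesis (LLM) modules.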
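
To make the division of labour between a retriever and a query engine concrete, here is a minimal sketch in plain Python. It is not the LlamaIndex API: `KeywordRetriever`, `ToyQueryEngine`, and `echo_synthesizer` are hypothetical names, retrieval is done by simple word overlap, and the synthesis step is a string template standing in for an LLM call.

```python
from typing import Callable, List


class KeywordRetriever:
    """Toy retriever: ranks passages by how many words they share with the prompt."""

    def __init__(self, passages: List[str]) -> None:
        self.passages = passages

    def retrieve(self, prompt: str, top_k: int = 2) -> List[str]:
        prompt_words = set(prompt.lower().split())
        scored = [
            (len(prompt_words & set(passage.lower().split())), passage)
            for passage in self.passages
        ]
        # sorted() is stable, so passages with equal scores keep their original order
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [passage for score, passage in ranked[:top_k] if score > 0]


class ToyQueryEngine:
    """Takes a prompt, retrieves context, and hands both to a synthesis function."""

    def __init__(
        self,
        retriever: KeywordRetriever,
        synthesize: Callable[[str, List[str]], str],
    ) -> None:
        self.retriever = retriever
        self.synthesize = synthesize

    def query(self, prompt: str) -> str:
        context = self.retriever.retrieve(prompt)
        return self.synthesize(prompt, context)


def echo_synthesizer(prompt: str, context: List[str]) -> str:
    # Hypothetical stand-in for an LLM call: it only formats its inputs
    return f"Q: {prompt} | context: {' / '.join(context)}"


retriever = KeywordRetriever(
    [
        "Weaviate is a vector database",
        "LlamaIndex is a data framework",
        "Bananas are yellow",
    ]
)
engine = ToyQueryEngine(retriever, echo_synthesizer)
answer = engine.query("What is Weaviate")
print(answer)
```

The point of the split is that the retriever can be swapped (keyword, vector, hybrid) without touching the synthesis step, and vice versa.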

Some examples of query engine “tasks” are given below, roughly ordered from easy to advanced:

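Semantic search: Retrieve the top-k items from the knowledge corpus that are most similar to the query by embedding similarity, and synthesize a response over these contexts.
Structured analytics: Convert natural language into an executable SQL query.
Query decomposition over documents: Break a query into sub-questions, each over a subset of the underlying documents. Each sub-question can be executed against its own query engine.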
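
As an illustration of the first task, top-k retrieval by embedding similarity can be sketched as a cosine-similarity ranking. This is a simplified sketch, not the search algorithm Weaviate uses internally: the three-dimensional vectors are made up for the example, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float], corpus: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), key) for key, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(key, score) for score, key in scored[:k]]


# Made-up embeddings, purely for illustration
corpus = {
    "vector search": [1.0, 0.0, 0.0],
    "hybrid search": [0.8, 0.6, 0.0],
    "cooking": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], corpus, k=2)
for key, score in hits:
    print(f"{key}: {score:.3f}")
```

A production vector database replaces this exhaustive scan with an approximate nearest-neighbour index, but the ranking criterion is the same idea.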

Demo Notebook Walkthrough

Let’s walk through a simple example of how to use LlamaIndex with Weaviate to build a simple question-answering (QA) system over the Weaviate blogs!

The full code can be found in the Weaviate recipes repo.

The first step is to set up your Weaviate client. In this example, we connect to a local Weaviate instance at http://localhost:8080.

import weaviate
# connect to your weaviate instance
client = weaviate.Client("http://localhost:8080")

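The next step is to ingest the Weaviate blog posts and parse them into chunks. You could use one of our many web page readers to scrape any website yourself, but luckily the downloaded files are already available in the recipes repo.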

from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
# load the blogs in using the reader
blogs = SimpleDirectoryReader('./data').load_data()
# chunk up the blog posts into nodes
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(blogs)

Here, we use the SimpleDirectoryReader to load all documents from a given directory. We then use our SimpleNodeParser to chunk the source documents into Node objects (text chunks).

The next step is to 1) define a WeaviateVectorStore, and 2) use LlamaIndex to build a vector index over this vector store.

from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores import WeaviateVectorStore
# construct vector store
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="BlogPost", text_key="content")
# setting up the storage for the embeddings
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# set up the index
index = VectorStoreIndex(nodes, storage_context=storage_context)

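Our WeaviateVectorStore abstraction creates a central interface between our data abstractions and the Weaviate service. Note that the VectorStoreIndex is initialized from both the nodes and the storage context object that contains the Weaviate vector store. During initialization, the nodes are loaded into the vector store.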

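Finally, we can define a query engine on top of our index. This query engine performs semantic search and response synthesis, and outputs an answer.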

query_engine = index.as_query_engine()
response = query_engine.query("What is the intersection between LLMs and search?")
print(response)

You should get an answer like the following:

The intersection between LLMs and search is the ability to use LLMs to improve search capabilities, such as retrieval-augmented generation, query understanding, index construction, LLMs in re-ranking, and search result compression. LLMs can also be used to manage document updates, rank search results, and compress search results. LLMs can be used to prompt the language model to extract or formulate a question based on the prompt and then send that question to the search engine, or to prompt the model with a description of the search engine tool and how to use it with a special `[SEARCH]` token. LLMs can also be used to prompt the language model to rank search results according to their relevance with the query, and to classify the most likely answer span given a question and text passage as input.