Kate Silverstein • 2024-05-14

使用 LlamaIndex 和 llamafile 构建本地私有研究助手

llamafile

这是一篇来自我们在 Mozilla 的朋友关于 Llamafile 的客座文章

llamafile 是 Mozilla 的一个开源项目，它是在您的笔记本电脑上运行大型语言模型 (LLM) 的最简单方法之一。您只需从 HuggingFace 下载一个 llamafile，然后运行该文件即可。就是这样简单。在大多数计算机上，您无需安装任何东西。

您可能希望在笔记本电脑上运行 LLM 的几个原因包括

1. 隐私：在本地运行意味着您无需与第三方共享您的数据。

2. 高可用性：无需互联网连接即可运行基于 LLM 的应用程序。

3. 自带模型：您可以轻松测试许多不同的开源 LLM（HuggingFace 上可用的任何模型），并查看哪个最适合您的任务。

4. 免费调试/测试：本地 LLM 允许您测试基于 LLM 的系统的许多部分，而无需支付 API 调用费用。

在这篇博客文章中，我们将展示如何设置 llamafile 并使用它在您的计算机上运行本地 LLM。然后，我们将展示如何将 LlamaIndex 与您的 llamafile 作为本地 RAG 研究助手的 LLM 和嵌入后端一起使用。您无需注册任何云服务或将您的数据发送给任何第三方——一切都将在您的笔记本电脑上运行。

注意：您还可以从我们的 GitHub 仓库获取以下所有示例代码的 Jupyter notebook。

下载并运行 llamafile

首先，什么是 llamafile？llamafile 是一个可在您自己的计算机上运行的可执行 LLM。它包含给定开源 LLM 的权重，以及实际在您的计算机上运行该模型所需的一切。无需安装或配置（有一些注意事项，在此处讨论）。

每个 llamafile 捆绑了 1) gguf 格式的模型权重和元数据 + 2) 使用 [Cosmopolitan Libc](https://github.com/jart/cosmopolitan) 特别编译的 `llama.cpp` 副本。这使得模型可以在大多数计算机上运行，无需额外安装。llamafile 还提供一个类似 ChatGPT 的浏览器界面、一个 CLI 以及用于聊天模型的 OpenAI 兼容 REST API。

设置 llamafile 只需要 2 个步骤

1. 下载 llamafile

2. 使 llamafile 可执行

下面我们将详细介绍每个步骤。

步骤 1：下载 llamafile

HuggingFace 模型中心上有很多 llamafile 可用（只需搜索 'llamafile'），但为了本次演练的目的，我们将使用 TinyLlama-1.1B（0.67 GB，模型信息）。要下载模型，您可以点击此下载链接：TinyLlama-1.1B，或者打开终端并使用类似 `wget` 的命令。下载时间取决于您的互联网连接质量，大约需要 5-10 分钟。

wget https://hugging-face.cn/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile/resolve/main/TinyLlama-1.1B-Chat-v1.0.F16.llamafile

该模型体积小，在实际回答问题方面表现不会很好，但由于其下载速度相对较快，并且推理速度可以在几分钟内为您索引向量存储，因此对于下面的示例来说足够了。对于更高质量的 LLM，您可能希望使用更大的模型，例如 Mistral-7B-Instruct（5.15 GB，模型信息）。

步骤 2：使 llamafile 可执行

如果您不是从命令行下载的 llamafile，请查明您的浏览器将下载的 llamafile 存储在何处。

现在，打开您的计算机终端，如有必要，转到存储 llamafile 的目录：`cd path/to/downloaded/llamafile`

如果您使用的是 macOS、Linux 或 BSD，您需要授予计算机执行此新文件的权限。（只需执行一次。）

如果您使用的是 Windows，则只需通过在文件名末尾添加“.exe”来重命名文件即可，例如将 `TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile` 重命名为 `TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile.exe`

chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

初步测试

现在，您的 llamafile 应该可以使用了。首先，您可以检查用于构建您下载的 llamafile 二进制文件的 llamafile 库版本

./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --version

llamafile v0.7.0

这篇文章是使用基于 `llamafile v0.7.0` 构建的模型撰写的。如果您的 llamafile 显示的版本不同，并且下面的某些步骤未按预期工作，请在 llamafile 问题跟踪器上提交问题。

使用 llamafile 最简单的方法是通过其内置的聊天界面。在终端中运行

./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

您的浏览器应该会自动打开并显示一个聊天界面。（如果未打开，只需打开浏览器并访问 http://localhost:8080）。聊天完成后，返回终端并按下 `Control-C` 关闭 llamafile。如果您在 notebook 中运行这些命令，只需中断 notebook 内核即可停止 llamafile。

在本演练的其余部分，我们将使用 llamafile 的内置推理服务器，而不是浏览器界面。llamafile 的服务器提供一个 REST API，用于通过 HTTP 与 TinyLlama LLM 交互。完整的服务器 API 文档可在此处获取。要以服务器模式启动 llamafile，请运行

./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser --embedding

总结：下载并运行 llamafile

# 1. Download the llamafile-ized model
wget https://hugging-face.cn/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile/resolve/main/TinyLlama-1.1B-Chat-v1.0.F16.llamafile

# 2. Make it executable (you only need to do this once)
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# 3. Run in server mode
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser --embedding

使用 LlamaIndex 和 llamafile 构建研究助手

现在，我们将展示如何将 LlamaIndex 与您的 llamafile 结合使用，构建一个研究助手，帮助您了解感兴趣的主题——对于这篇文章，我们选择了信鸽。我们将展示如何准备数据、将其索引到向量存储中，然后进行查询。

在本地运行 LLM 的一个好处是隐私。您可以混合使用维基百科页面等“公共数据”和“私有数据”，而无需担心与第三方共享您的数据。私有数据可能包括您关于某个主题的私人笔记或机密内容的 PDF。只要您使用本地 LLM（和本地向量存储），您就无需担心数据泄露。下面，我们将展示如何结合这两种类型的数据。

我们的向量存储将包括维基百科页面、一份关于信鸽饲养的陆军手册，以及我们在阅读此主题时记录的一些简要笔记。要开始，请下载我们的示例数据

mkdir data

# Download 'The Homing Pigeon' manual from Project Gutenberg
wget https://www.gutenberg.org/cache/epub/55084/pg55084.txt -O data/The_Homing_Pigeon.txt

# Download some notes on homing pigeons
wget https://gist.githubusercontent.com/k8si/edf5a7ca2cc3bef7dd3d3e2ca42812de/raw/24955ee9df819e21975b1dd817938c1bfe955634/homing_pigeon_notes.md -O data/homing_pigeon_notes.md

接下来，我们需要安装 LlamaIndex 及其一些集成

# Install llama-index
pip install llama-index-core
# Install llamafile integrations and SimpleWebPageReader
pip install llama-index-embeddings-llamafile llama-index-llms-llamafile llama-index-readers-web

启动 llamafile 服务器并配置 LlamaIndex

在此示例中，我们将使用同一个 llamafile 生成将索引到我们的向量存储中的嵌入，并将其用作稍后回答查询的 LLM。（但是，您完全可以将一个 llamafile 用于嵌入，另一个 llamafile 用于 LLM 功能——您只需在不同的端口上启动 llamafile 服务器即可。）

要启动 llamafile 服务器，请打开终端并运行

./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser --embedding --port 8080

现在，我们将配置 LlamaIndex 使用此 llamafile

# Configure LlamaIndex
from llama_index.core import Settings
from llama_index.embeddings.llamafile import LlamafileEmbedding
from llama_index.llms.llamafile import Llamafile
from llama_index.core.node_parser import SentenceSplitter

Settings.embed_model = LlamafileEmbedding(base_url="http://localhost:8080")

Settings.llm = Llamafile(
	base_url="http://localhost:8080",
	temperature=0,
	seed=0
)

# Also set up a sentence splitter to ensure texts are broken into semantically-meaningful chunks (sentences) that don't take up the model's entire
# context window (2048 tokens). Since these chunks will be added to LLM prompts as part of the RAG process, we want to leave plenty of space for both
# the system prompt and the user's actual question.
Settings.transformations = [
	SentenceSplitter(
    	chunk_size=256,
    	chunk_overlap=5
	)
]

准备数据并构建向量存储

现在，我们将加载我们的数据并对其进行索引。

# Load local data
from llama_index.core import SimpleDirectoryReader
local_doc_reader = SimpleDirectoryReader(input_dir='./data')
docs = local_doc_reader.load_data(show_progress=True)

# We'll load some Wikipedia pages as well
from llama_index.readers.web import SimpleWebPageReader
urls = [
	'https://en.wikipedia.org/wiki/Homing_pigeon',
	'https://en.wikipedia.org/wiki/Magnetoreception',
]
web_reader = SimpleWebPageReader(html_to_text=True)
docs.extend(web_reader.load_data(urls))

# Build the index
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
	docs,
	show_progress=True,
)

# Save the index
index.storage_context.persist(persist_dir="./storage")

查询您的研究助手

最后，我们准备好提出一些关于信鸽的问题了。

query_engine = index.as_query_engine()
print(query_engine.query("What were homing pigeons used for?"))

	Homing pigeons were used for a variety of purposes, including military reconnaissance, communication, and transportation. They were also used for scientific research, such as studying the behavior of birds in flight and their migration patterns. In addition, they were used for religious ceremonies and as a symbol of devotion and loyalty. Overall, homing pigeons played an important role in the history of aviation and were a symbol of the human desire for communication and connection.

print(query_engine.query("When were homing pigeons first used?"))

The context information provided in the given context is that homing pigeons were first used in the 19th century. However, prior knowledge would suggest that homing pigeons have been used for navigation and communication for centuries.

结论

在这篇文章中，我们展示了如何下载和设置通过 llamafile 在本地运行的 LLM。然后，我们展示了如何将此 LLM 与 LlamaIndex 结合使用，构建一个简单的基于 RAG 的研究助手来学习信鸽。您的助手是 100% 本地运行的：您无需支付 API 调用费用或将数据发送给第三方。

作为下一步，您可以尝试使用更好的模型，例如 Mistral-7B-Instruct 来运行上面的示例。您还可以尝试为不同的主题构建研究助手，例如“半导体”或“如何烘焙面包”。

要了解有关 llamafile 的更多信息，请访问 GitHub 上的项目，阅读这篇关于使用 LLM 的 bash 单行命令的博客文章，或在 Discord 上向社区问好。