LlamaIndex • 2023-11-15

Announcing LlamaIndex 0.9

Our hard-working team is delighted to announce our latest major release, LlamaIndex 0.9! You can get it right now:

pip install --upgrade llama_index

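In LlamaIndex v0.9, we took the time to refine several key aspects of the user experience, including token counting, text splitting, and more!

As part of this, there are some new features and minor changes to current usage that developers should be aware of:

New IngestionPipeline concept for ingesting and transforming data
Data ingestion and transformations are now automatically cached
Updated interface for node parsing/text splitting/metadata extraction modules
Changes to the default tokenizer, as well as support for customizing the tokenizer
Packaging/installation changes on PyPI (reduced bloat, new install options)
More predictable and consistent import paths
Plus, in beta: multi-modal RAG modules for handling text and images!

Have questions or concerns? You can report an issue on GitHub or ask a question on our Discord!

Read on for more details about our new features and changes.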

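IngestionPipeline: a new abstraction purely for ingesting data

Sometimes, all you want is to ingest and embed nodes from data sources, for instance when your application allows users to upload new data. New in LlamaIndex v0.9 is the concept of an IngestionPipeline.

An IngestionPipeline uses a new concept, Transformations, which are applied to the input data.

What is a Transformation, though? It could be a:

text splitter
node parser
metadata extractor
embeddings model

Here's a quick example of the basic usage pattern: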

from llama_index import Document
from llama_index.embeddings import OpenAIEmbedding
from llama_index.text_splitter import SentenceSplitter
from llama_index.extractors import TitleExtractor
from llama_index.ingestion import IngestionPipeline, IngestionCache

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)
nodes = pipeline.run(documents=[Document.example()])

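Transformation caching

Each time you run the same IngestionPipeline object, it caches the output of every transformation in the pipeline, keyed by a hash of the input nodes plus that transformation.

In subsequent runs, if there is a cache hit, that transformation is skipped and the cached result is used instead. This greatly speeds up repeated runs and can shorten iteration time when you are deciding which transformations to use.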
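
To make the caching idea concrete, here is a minimal sketch in plain Python. It does not use LlamaIndex and is not how the library implements its cache: plain strings stand in for nodes, and `cache_key`, `run_pipeline`, and the two toy steps are hypothetical names invented for illustration. It only shows the general pattern of keying each step's output on a hash of its input plus the step's identity.

```python
import hashlib
import json


def cache_key(nodes, transformation_name):
    """Hash the input nodes together with the transformation's identity."""
    payload = json.dumps({"nodes": nodes, "transformation": transformation_name})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pipeline(nodes, transformations, cache):
    """Apply each (name, function) step, reusing cached outputs when possible."""
    for name, fn in transformations:
        key = cache_key(nodes, name)
        if key not in cache:
            # cache miss: run the step and remember its output
            cache[key] = fn(nodes)
        # cache hit (or freshly stored): continue with the stored output
        nodes = cache[key]
    return nodes


calls = []  # records which steps actually executed


def upper(nodes):
    calls.append("upper")
    return [n.upper() for n in nodes]


def exclaim(nodes):
    calls.append("exclaim")
    return [n + "!" for n in nodes]


cache = {}
steps = [("upper", upper), ("exclaim", exclaim)]
first = run_pipeline(["hello", "world"], steps, cache)
second = run_pipeline(["hello", "world"], steps, cache)  # served from the cache
```

After both runs, `calls` contains each step only once, because the second run finds every key already in `cache`. Changing either the input or a step's name produces a different key, so that step and everything after it run again.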

Here's an example of saving and loading a local cache:

from llama_index import Document
from llama_index.embeddings import OpenAIEmbedding
from llama_index.text_splitter import SentenceSplitter
from llama_index.extractors import TitleExtractor
from llama_index.ingestion import IngestionPipeline, IngestionCache

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)
# will only execute full pipeline once
nodes = pipeline.run(documents=[Document.example()])
nodes = pipeline.run(documents=[Document.example()])
# save and load
pipeline.cache.persist("./test_cache.json")
new_cache = IngestionCache.from_persist_path("./test_cache.json")
new_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
    ],
    cache=new_cache,
)
# will run instantly due to the cache
nodes = new_pipeline.run(documents=[Document.example()])

Here's another example, using Redis as the cache and Qdrant as the vector store. Running this inserts the nodes directly into your vector store and caches each transformation step in Redis.

from llama_index import Document
from llama_index.embeddings import OpenAIEmbedding
from llama_index.text_splitter import SentenceSplitter
from llama_index.extractors import TitleExtractor
from llama_index.ingestion import IngestionPipeline, IngestionCache
from llama_index.ingestion.cache import RedisCache
from llama_index.vector_stores.qdrant import QdrantVectorStore

import qdrant_client
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    cache=IngestionCache(cache=RedisCache(), collection="test_cache"),
    vector_store=vector_store,
)
# Ingest directly into a vector db
pipeline.run(documents=[Document.example()])
# Create your index
from llama_index import VectorStoreIndex
index = VectorStoreIndex.from_vector_store(vector_store)

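Custom transformations

Implementing custom transformations is easy! Let's add a transform that removes special characters from the text before the embedding model is called.

The only real requirement of a transformation is that it must take a list of nodes and return a list of nodes.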

import re
from llama_index import Document
from llama_index.embeddings import OpenAIEmbedding
from llama_index.text_splitter import SentenceSplitter
from llama_index.ingestion import IngestionPipeline
from llama_index.schema import TransformComponent

class TextCleaner(TransformComponent):
    def __call__(self, nodes, **kwargs):
        # strip everything except ASCII letters, digits, and spaces
        for node in nodes:
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)
        return nodes


pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TextCleaner(),
        OpenAIEmbedding(),
    ],
)
nodes = pipeline.run(documents=[Document.example()])
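
To see what the regular expression in TextCleaner does, it helps to try it on a plain string. The sketch below uses only the standard library; `clean_text` is a hypothetical helper name, not part of LlamaIndex.

```python
import re


def clean_text(text):
    # Keep only ASCII letters, digits, and spaces; drop everything else.
    return re.sub(r"[^0-9A-Za-z ]", "", text)


cleaned = clean_text("Hello, world! (v0.9)")  # "Hello world v09"
joined = clean_text("line one\nline two")     # the newline is removed too
```

Note that this pattern is fairly aggressive: it also strips newlines, tabs, and any non-ASCII letters, so it may need adjusting for multilingual text.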

Node parsing/text splitting: a flattened and simplified interface

We've made the interface for parsing and splitting text much cleaner.

Before

from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    TitleExtractor,
)
from llama_index.text_splitter import SentenceSplitter

node_parser = SimpleNodeParser(
    text_splitter=SentenceSplitter(chunk_size=512),
    metadata_extractor=MetadataExtractor(
        extractors=[TitleExtractor()]
    ),
)
nodes = node_parser.get_nodes_from_documents(documents)

After

from llama_index.text_splitter import SentenceSplitter
from llama_index.extractors import TitleExtractor 

node_parser = SentenceSplitter(chunk_size=512)
extractor = TitleExtractor()

# use transforms directly
nodes = node_parser(documents)
nodes = extractor(nodes)

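Previously, the NodeParser object in LlamaIndex had become extremely bloated, holding both text splitters and metadata extractors. This made it painful for users to change those components, and painful for us to maintain and develop them.

In v0.9, we have flattened the entire interface into a single TransformComponent abstraction, so that these transformations are easier to set up, use, and customize.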
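
The benefit of a flat, callable interface can be sketched without LlamaIndex. In the illustration below, plain strings stand in for nodes, and `SplitOnPeriods`, `AddLengthSuffix`, and `apply_all` are hypothetical names made up for this example. The point is only that when every component takes a list and returns a list, components can be chained, reordered, or swapped freely.

```python
class SplitOnPeriods:
    """Hypothetical splitter: turns each text into one node per sentence."""

    def __call__(self, nodes):
        out = []
        for text in nodes:
            out.extend(part.strip() for part in text.split(".") if part.strip())
        return out


class AddLengthSuffix:
    """Hypothetical extractor: annotates each node with its character count."""

    def __call__(self, nodes):
        return [f"{text} [{len(text)} chars]" for text in nodes]


def apply_all(nodes, transformations):
    # Every step has the same shape (list in, list out), so a loop is enough.
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


result = apply_all(
    ["One idea. Another idea."],
    [SplitOnPeriods(), AddLengthSuffix()],
)
```

Here `result` holds two annotated strings, one per sentence. No wrapper object is needed to combine the splitter and the extractor.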

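We've done our best to minimize the impact on users, but the main thing to note is that SimpleNodeParser has been removed. The other node parsers and text splitters have been elevated to offer the same features, just with different parsing and splitting techniques.

Any old imports of SimpleNodeParser will redirect to the most equivalent module, SentenceSplitter.

Furthermore, the wrapper object MetadataExtractor has been removed in favor of using extractors directly.

Full documentation for all of this is available in our docs.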

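Tokenization and token counting: improved defaults and customization

A big pain point in LlamaIndex previously was tokenization. Many components used a non-configurable gpt2 tokenizer for token counting, which caused headaches for users of non-OpenAI models, and even for users of some OpenAI models, who had to resort to hacky workarounds.

In LlamaIndex v0.9, this global tokenizer is now configurable, and it defaults to the CL100K tokenizer to match our default GPT-3.5 LLM.

The only requirement for a tokenizer is that it is a callable that takes a string and returns a list.

Some examples of configuring this are below: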

from llama_index import set_global_tokenizer

# tiktoken
import tiktoken
set_global_tokenizer(
  tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)
# huggingface
from transformers import AutoTokenizer
set_global_tokenizer(
  AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta").encode
)
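
The tokenizer contract described above (a callable that takes a string and returns a list) can be illustrated with a toy function. This is a sketch using only the standard library; `simple_tokenizer` and `count_tokens` are hypothetical names, and the resulting counts will not match any real model's tokenizer.

```python
import re


def simple_tokenizer(text):
    """Hypothetical tokenizer: split text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def count_tokens(text, tokenizer):
    # Token counting only needs the length of whatever list comes back.
    return len(tokenizer(text))


tokens = simple_tokenizer("Hello, LlamaIndex 0.9!")
n_tokens = count_tokens("Hello, LlamaIndex 0.9!", simple_tokenizer)
```

Because the function satisfies the string-in, list-out requirement, it has the shape that set_global_tokenizer expects. In practice you would pass the tokenizer that matches your model, as in the tiktoken and Hugging Face examples.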

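Furthermore, the TokenCountingHandler has been upgraded with better token counting, and it uses token counts from API responses directly when they are available.

Packaging: reduced bloat

In an effort to modernize the packaging of LlamaIndex, v0.9 also comes with changes to installation.

The biggest change here is that LangChain is now an optional package and will not be installed by default.

To install LangChain as part of your llama-index installation, you can follow the example below. There are also other installation options depending on your needs, and we welcome further contributions to the extras in the future.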

# installs langchain
pip install llama-index[langchain]
 
# installs tools needed for running local models
pip install llama-index[local_models]

# installs tools needed for postgres
pip install llama-index[postgres]

# combinations!
pip install llama-index[local_models,postgres]

If you were previously importing langchain modules in your code, please update your project's packaging requirements accordingly.
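
If your own code depends on an optional package such as langchain, one common pattern is to check for it up front and fail with a helpful message. The sketch below uses only the standard library. `is_installed` and `require_optional` are hypothetical helpers rather than LlamaIndex functions, and the suggested install command simply mirrors the extras syntax shown above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_optional(module_name, extra):
    """Raise a descriptive ImportError when an optional dependency is missing."""
    if not is_installed(module_name):
        raise ImportError(
            f"{module_name} is not installed; "
            f"try: pip install 'llama-index[{extra}]'"
        )


# Example: guard a code path that needs the optional langchain package.
has_langchain = is_installed("langchain")
```

Calling `require_optional("langchain", "langchain")` at the top of a module that needs LangChain turns a confusing failure deep in the code into an immediate, actionable error.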

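Import paths: more consistent and predictable

We've made two changes to import paths:

We removed uncommonly used imports from the root level to make importing llama_index faster.
We now have a consistent policy of making "user-facing" concepts importable from level-1 modules.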

from llama_index.llms import OpenAI, ...
from llama_index.embeddings import OpenAIEmbedding, ...
from llama_index.prompts import PromptTemplate, ...
from llama_index.readers import SimpleDirectoryReader, ...
from llama_index.text_splitter import SentenceSplitter, ...
from llama_index.extractors import TitleExtractor, ...
from llama_index.vector_stores import SimpleVectorStore, ...

We still expose some of the most commonly used modules at the root level.

from llama_index import SimpleDirectoryReader, VectorStoreIndex, ...

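Multi-modal RAG

Given the recent release of the GPT-4V API, multi-modal use cases are more accessible than ever.

To help users take advantage of these capabilities, we've started introducing a number of new modules to support multi-modal RAG use cases:

Multi-modal LLMs (GPT-4V, LLaVA, Fuyu, etc.)
Multi-modal embeddings (e.g. CLIP) for joint image-text embedding/retrieval
Multi-modal RAG, combining indexes and query engines

Our documentation has a full guide to multi-modal retrieval.

Thanks for all your support!

As an open-source project, we couldn't exist without our hundreds of contributors. We are so grateful to them, and for the support of the hundreds of thousands of LlamaIndex users around the world. See you on Discord!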
