Jerry Liu • 2023-12-13

LlamaIndex + Gemini

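(co-authored by Jerry Liu, Haotian Zhang, Logan Markewich, and Laurie Voss @ LlamaIndex)

Today Google publicly released Gemini, its latest AI model. We're honored to be a launch partner for Gemini, with support available in LlamaIndex today!

As of version 0.9.15, LlamaIndex fully supports all currently released and upcoming Gemini models (Gemini Pro, Gemini Ultra). We support both the "text-only" Gemini variant, with a text-in/text-out format, and the multimodal variant, which takes text and images as input and outputs text. We've made some fundamental changes to our multimodal abstractions to support the Gemini multimodal interface, which lets users pass in multiple images along with text. Our Gemini integrations are also feature-complete: they support (non-streaming, streaming), (sync, async), and (text completion, chat message) formats, for 8 combinations in total.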
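
To make the 8 combinations concrete, here is a small sketch that enumerates them and derives the corresponding method name for each. The method names follow the LlamaIndex LLM interface (`complete`, `stream_chat`, `acomplete`, and so on); the helper `method_name` is purely illustrative and not part of the library.

```python
from itertools import product


def method_name(streaming: bool, is_async: bool, kind: str) -> str:
    """Build the method name for one (streaming, async, format) combination."""
    prefix = ("a" if is_async else "") + ("stream_" if streaming else "")
    return prefix + kind


# 2 streaming modes x 2 sync/async modes x 2 formats = 8 entry points
combos = [
    method_name(streaming, is_async, kind)
    for streaming, is_async, kind in product(
        (False, True), (False, True), ("complete", "chat")
    )
]

for name in combos:
    print(name)
```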

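In addition, we support the brand-new Semantic Retriever API, which bundles storage, an embedding model, retrieval, and an LLM into a single RAG pipeline. We show how it can be used on its own, or decomposed and combined with LlamaIndex components to create advanced RAG pipelines.

Huge thanks to the Google Labs and Semantic Retriever teams for helping us get early access.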

Google Labs: Mark McDonald, Josh Gordon, Arthur Soroken
Semantic Retriever: Lawrence Tsang, Cher Hu

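The sections below walk through the new Gemini and Semantic Retriever abstractions in LlamaIndex. If you don't want to read this now, be sure to bookmark our detailed notebook guides below!

Gemini Release and Support

There has been a lot of press around Gemini, which posts strong results on a variety of benchmarks. The Ultra variant (not yet publicly available) outperforms GPT-4 on benchmarks ranging from MMLU to Big-Bench Hard to math and coding tasks. Its multimodal demos showcase joint image/text understanding across domains ranging from scientific paper understanding to literature review.

Let's walk through examples of using Gemini in LlamaIndex. We cover both the text model (from llama_index.llms import Gemini) and the multimodal model (from llama_index.multi_modal_llms.gemini import GeminiMultiModal).

Text Model

The full notebook guide is here.

We start with the text model. The code snippet below shows a range of configurations, from completion to chat to streaming to async.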

from llama_index.llms import ChatMessage, Gemini

# completion
resp = Gemini().complete("Write a poem about a magic backpack")
# chat
messages = [
    ChatMessage(role="user", content="Hello friend!"),
    ChatMessage(role="assistant", content="Yarr what is shakin' matey?"),
    ChatMessage(
        role="user", content="Help me decide what to have for dinner."
    ),
]
resp = Gemini().chat(messages)
# streaming (completion)
llm = Gemini()
resp = llm.stream_complete(
    "The story of Sourcrust, the bread creature, is really interesting. It all started when..."
)
# streaming (chat)
llm = Gemini()
messages = [
    ChatMessage(role="user", content="Hello friend!"),
    ChatMessage(role="assistant", content="Yarr what is shakin' matey?"),
    ChatMessage(
        role="user", content="Help me decide what to have for dinner."
    ),
]
resp = llm.stream_chat(messages)
# async completion
resp = await llm.acomplete("Llamas are famous for ")
print(resp)
# async streaming (completion)
resp = await llm.astream_complete("Llamas are famous for ")
async for chunk in resp:
    print(chunk.text, end="")
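
The `await` calls above run as written in a notebook, where an event loop is already active. In a plain script they need to live inside a coroutine started with `asyncio.run`. Here is a minimal sketch of that wiring. `StubLLM` and `Chunk` are hypothetical stand-ins (not LlamaIndex classes) that only mimic the call shape used above, so the example runs without an API key.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Chunk:
    """Hypothetical stand-in for a response object with a .text field."""

    text: str


class StubLLM:
    """Hypothetical stand-in that mimics the async call shape shown above."""

    async def acomplete(self, prompt: str) -> Chunk:
        return Chunk(text=prompt + "their wool.")

    async def astream_complete(self, prompt: str) -> AsyncIterator[Chunk]:
        async def gen() -> AsyncIterator[Chunk]:
            for piece in ("their ", "wool."):
                yield Chunk(text=piece)

        return gen()


async def main() -> str:
    llm = StubLLM()

    # async completion
    resp = await llm.acomplete("Llamas are famous for ")
    print(resp.text)

    # async streaming: await the call, then iterate over the chunks
    pieces = []
    stream = await llm.astream_complete("Llamas are famous for ")
    async for chunk in stream:
        pieces.append(chunk.text)
    return "".join(pieces)


streamed = asyncio.run(main())
print(streamed)
```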

The Gemini class of course has parameters you can set. These include model_name, temperature, max_tokens, and generate_kwargs.

For example, you can do:

llm = Gemini(model_name="models/gemini-ultra")

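Multimodal Model

The full notebook guide is here.

In this notebook, we test the gemini-pro-vision variant, which accepts multimodal inputs. It has the following features:

Supports both complete and chat capabilities
Supports streaming and async
Supports feeding in multiple images in addition to text in the completion endpoint
Future work: multi-turn chat interleaving text and images is supported within our abstraction, but is not yet enabled for gemini-pro-vision.

Let's walk through a concrete example. Say we're given a picture of the following scene:

We can then initialize our Gemini Vision model and ask it a question: "Identify the city where this photo was taken."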

from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.multi_modal_llms.generic_utils import (
    load_image_urls,
)

image_urls = [
    "https://storage.googleapis.com/generativeai-downloads/data/scene.jpg",
    # Add yours here!
]
image_documents = load_image_urls(image_urls)
gemini_pro = GeminiMultiModal(model_name="models/gemini-pro-vision")
complete_response = gemini_pro.complete(
    prompt="Identify the city where this photo was taken.",
    image_documents=image_documents,
)

We get the following response:

New York City

We can also pass in multiple images. Here's an example with images of Messi and the Colosseum.

image_urls = [
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
]
image_documents_1 = load_image_urls(image_urls)
response_multi = gemini_pro.complete(
    prompt="is there any relationship between those images?",
    image_documents=image_documents_1,
)
print(response_multi)

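Multimodal Use Cases (Structured Outputs, RAG)

The full notebook guide is here.

We've created extensive resources on different multimodal use cases, from structured output extraction to RAG.

Thanks to Haotian Zhang, we have examples of both of these use cases with Gemini. Please see our detailed notebook guides for more information. In the meantime, here are the final results!

Structured Data Extraction with Gemini Pro Vision

Output: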

('restaurant', 'La Mar by Gaston Acurio')
('food', 'South American')
('location', '500 Brickell Key Dr, Miami, FL 33131')
('category', 'Restaurant')
('hours', 'Open ⋅ Closes 11 PM')
('price', 4.0)
('rating', 4)
('review', '4.4 (2,104)')
('description', 'Chic waterfront find offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticucho.')
('nearby_tourist_places', 'Brickell Key Park')
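
The output above reads as a list of (field, value) pairs. As a rough sketch of what such a schema could look like, the block below mirrors those ten fields in a standard-library dataclass and prints the pairs in the same shape. The class name `Restaurant` and the field types are assumptions inferred from the printed values; the notebook defines its own schema, which may differ.

```python
from dataclasses import asdict, dataclass


@dataclass
class Restaurant:
    """Hypothetical schema mirroring the fields in the output above."""

    restaurant: str
    food: str
    location: str
    category: str
    hours: str
    price: float
    rating: int
    review: str
    description: str
    nearby_tourist_places: str


# Values copied from the extraction output shown above
record = Restaurant(
    restaurant="La Mar by Gaston Acurio",
    food="South American",
    location="500 Brickell Key Dr, Miami, FL 33131",
    category="Restaurant",
    hours="Open ⋅ Closes 11 PM",
    price=4.0,
    rating=4,
    review="4.4 (2,104)",
    description=(
        "Chic waterfront find offering Peruvian & fusion fare, "
        "plus bars for cocktails, ceviche & anticucho."
    ),
    nearby_tourist_places="Brickell Key Park",
)

# asdict preserves field declaration order, so pairs match the listing above
pairs = list(asdict(record).items())
for pair in pairs:
    print(pair)
```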

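Multimodal RAG

We run our structured output extractor on multiple restaurant images, index these nodes, and then ask the question "Recommend an Orlando restaurant for me and its nearby tourist places."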

I recommend Mythos Restaurant in Orlando. It is an American restaurant located at 6000 Universal Blvd, Orlando, FL 32819, United States. It has a rating of 4 and a review score of 4.3 based on 2,115 reviews. The restaurant offers a mythic underwater-themed dining experience with a view of Universal Studios' Inland Sea. It is located near popular tourist places such as Universal's Islands of Adventure, Skull Island: Reign of Kong, The Wizarding World of Harry Potter, Jurassic Park River Adventure, Hollywood Rip Ride Rockit, and Universal Studios Florida.

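Semantic Retriever

The Generative Language Semantic Retriever offers specialized embedding models for high-quality retrieval, and a tuned LLM for producing grounded output with safety settings.

It can be used out of the box (with our GoogleIndex) or decomposed into separate components (GoogleVectorStore and GoogleTextSynthesizer) and combined with LlamaIndex abstractions!

Our full Semantic Retriever notebook guide is here.

Out-of-the-Box Configuration

You can use it out of the box with just a few lines of code. Simply define the index, insert nodes, and then get a query engine: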

from llama_index.indices.managed.google.generativeai import GoogleIndex

index = GoogleIndex.from_corpus(corpus_id="<corpus_id>")
index.insert_documents(nodes)
query_engine = index.as_query_engine(...)
response = query_engine.query("<query>")

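A nice feature here is that Google's query engine supports different answer styles as well as safety settings.

Answer Styles

Abstractive (succinct but abstract)
Extractive (brief and extractive)
Verbose (extra details)

Safety Settings

You can specify safety settings in the query engine, which let you define guardrails on whether an answer may be explicit in different settings. See the generative-ai-python library for more information.

Decomposing into Different Components

GoogleIndex is built on two components: a vector store (GoogleVectorStore) and a response synthesizer (GoogleTextSynthesizer). You can use these as modular components together with LlamaIndex abstractions to create advanced RAG.

The notebook guide highlights three advanced RAG use cases:

Google Retriever + Reranking: use the Semantic Retriever to return relevant results, then use our reranking modules to process and filter those results before feeding them to response synthesis.
Multi-Query + Google Retriever: use our multi-query capabilities, such as our MultiStepQueryEngine, to break a complex question into multiple steps, and execute each step against the Semantic Retriever.
HyDE + Google Retriever: HyDE is a popular query transformation technique that hallucinates a hypothetical answer from the query and uses that hypothetical answer for the embedding lookup. Use it as a step before the retrieval step from the Semantic Retriever.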
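
To illustrate the HyDE flow described above, here is a toy, standard-library-only sketch: draft a hypothetical answer, embed that answer instead of the raw query, and rank documents by similarity. Everything here is an assumption made for illustration. `fake_llm`, `embed`, and `hyde_retrieve` are hypothetical names, the bag-of-words "embedding" stands in for a real embedding model, and the document list stands in for the Semantic Retriever corpus.

```python
import math
from collections import Counter

# Toy corpus standing in for documents stored in a retriever
DOCS = [
    "Llamas are domesticated South American camelids bred for wool and transport.",
    "Gemini is a family of multimodal models released by Google.",
    "Sourdough bread is leavened with a fermented starter.",
]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(query: str) -> str:
    """Hypothetical stand-in for the LLM that drafts a hypothetical answer."""
    return "Llamas are South American camelids kept for wool."


def hyde_retrieve(query: str, top_k: int = 1) -> list:
    # Step 1: generate a hypothetical answer from the query
    hypothetical = fake_llm(query)
    # Step 2: embed the hypothetical answer rather than the query itself
    query_vec = embed(hypothetical)
    # Step 3: rank documents by similarity to that embedding
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


results = hyde_retrieve("What are those woolly animals from the Andes?")
print(results)
```

The hypothetical answer shares far more vocabulary with the target document than the original question does, which is the intuition behind using it for the lookup.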