NVIDIA • 2024-11-08

Building an Agent-Enhanced AI Query Engine with NVIDIA NIM and LlamaIndex

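Agents in generative AI applications offer a powerful technique for orchestrating AI systems and augmenting them with additional capabilities. NVIDIA NIM™ is a set of microservices for deploying AI models, and it supports building agents with models that have been trained for agentic behavior. NIM microservices integrate with many frameworks, including LlamaIndex and LangChain, so they can be used in agentic applications. This post shows how to use them with LlamaIndex.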

NVIDIA NIM for agents

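An agent is an entity that uses the capabilities of an LLM to perform a sequence of actions, or to call tools, within the state of a system in order to reach a desired outcome. Agents typically involve reasoning, planning, execution, and responding to changes in their environment.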

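NIM microservices offer a path to deploying production-ready, performance-optimized agents anywhere. NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inference across clouds, data centers, and workstations through industry-standard APIs.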

NIM microservices are free to try from the NVIDIA API catalog for building agents with LlamaIndex.

Why use agents?

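Agents are powered by LLMs and use data and reasoning to guide their decisions while carrying out complex tasks. With agents, a human sets the goal, and the agent reasons, makes decisions, and chooses the actions needed to reach that goal. Agents excel in complex systems where subtasks may need to be delegated to other models or tools.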

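Consider an example use case. Retail chatbots assist customers with their product experience. A retail chatbot equipped with an agent can offer more insight and value than one without. For example, the agent can recognize when to use a tool that searches customer reviews to answer a user query such as "How does this dress fit?". The product description may say nothing about fit, but the reviews might. A chatbot without an agent is limited to reading the description, whereas a chatbot with an agent can decide to consult the reviews to find the best answer to the customer's question. With agents, systems such as retail chatbots can make better decisions when assisting shoppers.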

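Agents can also handle complex user queries by breaking them into smaller questions and routing each one to the appropriate query engine. This is often called a "routing agent", and it can yield more accurate responses. Suppose you are building a chatbot over 10-K reports (the annual financial disclosures filed by US public companies), with NVIDIA as the example. When a user asks "How do NVIDIA's earnings for the third quarter of 2023 compare to now?", the agent can be configured to break the query into sub-queries such as: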

What are NVIDIA's earnings?
What are NVIDIA's most recent earnings?
What were NVIDIA's earnings in the third quarter of 2023?

Each sub-question can then be answered and, where applicable, the agent uses the retrieved answers to form a complete response.

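Agents can also be given a set of tools that help answer some of the sub-questions. For example, there could be a set of tools in which each tool is responsible for answering questions about NVIDIA's financials for a specific year. Using tools this way ensures that the agent routes a query about a particular quarter to the information for the corresponding fiscal year.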
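As a rough illustration of year-specific tools, here is a minimal sketch in plain Python. Everything in it is hypothetical: `make_tool`, `TOOLS`, and `route` are stand-ins invented for this example, and the regular expression replaces the decision that an LLM-driven agent would normally make from tool descriptions. For simplicity, it also treats the year named in the question as the fiscal year.

```python
import re
from typing import Callable, Dict, Optional


def make_tool(year: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a query engine over one fiscal year's filing."""
    def tool(question: str) -> str:
        return f"[FY{year} filing] answer to: {question}"
    return tool


# One tool per fiscal year (the years here are arbitrary examples).
TOOLS: Dict[int, Callable[[str], str]] = {
    year: make_tool(year) for year in (2022, 2023, 2024)
}


def route(question: str) -> Optional[Callable[[str], str]]:
    """Pick the tool whose year appears in the question, if any."""
    match = re.search(r"\b(20\d{2})\b", question)
    if match is None:
        return None
    return TOOLS.get(int(match.group(1)))


tool = route("What were NVIDIA's earnings in the third quarter of 2023?")
if tool is not None:
    print(tool("What were NVIDIA's earnings in the third quarter of 2023?"))
```

A question with no year, or with a year that has no tool, returns `None`, which is where a real agent would fall back to another tool or to the LLM alone.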

In this post, we show how to use LlamaIndex and NIM to build a query router that answers questions about San Francisco city budget data.

Building a query-routing agent with LlamaIndex

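The full code for this post is available in the accompanying notebook, and some of the snippets are explained here. In this example, we use LlamaIndex to build an agent-enhanced query engine. The query engine first breaks a complex query about the San Francisco budget into sub-queries. Because the sub-queries are best answered by different source documents, the agent is equipped with tools that reference the correct source document. Once all the relevant sub-queries have been answered, the agent consolidates the information into a comprehensive response.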

Let's dive into the technical details of how this works.

Getting started

This example uses two NVIDIA NIM microservices:

The NV-EmbedQA-E5-v5 NIM microservice converts document chunks and user queries into embeddings through LlamaIndex.
The Llama 3.1 8B NIM microservice powers the agent through LlamaIndex.

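To get started quickly, access the NIM microservices as hosted APIs at https://build.nvidia.com. Alternatively, you can download them through https://build.nvidia.com and follow the documentation to host them yourself.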

Using NIM microservices with LlamaIndex

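NIM microservices are flexible and convenient to deploy, and they are simple to integrate with LlamaIndex. Once they are deployed, install the NVIDIA LlamaIndex packages to use them with LlamaIndex.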

pip install llama-index-llms-nvidia llama-index-embeddings-nvidia

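Next, declare the NIM microservices you will use. If you are using NVIDIA-hosted NIM microservices, you need an API key. The code below assumes that your NVIDIA API key is stored as an environment variable in your operating system. For more on using NIM with LlamaIndex, see the LlamaIndex documentation for the NVIDIA integration packages.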

import os

from llama_index.core import Settings
from llama_index.embeddings.nvidia import NVIDIAEmbedding
from llama_index.llms.nvidia import NVIDIA

# Embedding model for document chunks and user queries
Settings.embed_model = NVIDIAEmbedding(model="nvidia/nv-embedqa-e5-v5", api_key=os.environ["NVIDIA_API_KEY"], truncate="END")

# LLM that powers the agent
Settings.llm = NVIDIA(model="meta/llama-3.1-8b-instruct", api_key=os.environ["NVIDIA_API_KEY"])
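The snippet above reads the key with `os.environ["NVIDIA_API_KEY"]`, which raises a bare `KeyError` if the variable is not set. As an optional convenience, a small helper can fail with a clearer message. This is only a sketch: `require_api_key` is a hypothetical name, not part of the NVIDIA or LlamaIndex packages.

```python
import os
from typing import Mapping


def require_api_key(
    env: Mapping[str, str] = os.environ, name: str = "NVIDIA_API_KEY"
) -> str:
    """Return the API key from the environment or fail with a readable error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before creating the NIM clients."
        )
    return key
```

The returned value can then be passed as `api_key=` to both `NVIDIAEmbedding` and `NVIDIA` in place of the direct `os.environ` lookup.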

Sub-question query engine

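A query engine is a LlamaIndex concept that takes natural language input and returns a rich response. In this example, we create a sub-question query engine. It takes a single complex question and breaks it into multiple sub-questions, each of which can be answered by a different tool.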

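This query engine is declared as a Workflow, another LlamaIndex concept. Workflows are event-driven and chain multiple events together through steps. A workflow defines its steps by applying the @step decorator to individual functions. Each step handles a specific event type and emits new events. The decorator also infers each step's input and output types.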
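To make the event-driven idea concrete, here is a toy dispatcher in plain Python. It is not the LlamaIndex implementation or API: `MiniWorkflow`, `Start`, `Upper`, and `Stop` are hypothetical names, and the event type is passed to the decorator explicitly, whereas LlamaIndex's `@step` infers it from type annotations. The sketch only shows the core loop: each step consumes one event type and returns the next event until a stop event appears.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Optional, Type


class Event:
    """Base class for all events in this toy workflow."""


@dataclass
class Start(Event):
    text: str


@dataclass
class Upper(Event):
    text: str


@dataclass
class Stop(Event):
    result: str


class MiniWorkflow:
    def __init__(self) -> None:
        # Maps an event type to the step function that handles it.
        self._steps: Dict[Type[Event], Callable[[Event], Optional[Event]]] = {}

    def step(self, event_type: Type[Event]):
        """Register the decorated function as the handler for event_type."""
        def register(fn):
            self._steps[event_type] = fn
            return fn
        return register

    def run(self, first: Event) -> str:
        queue: Deque[Event] = deque([first])
        while queue:
            event = queue.popleft()
            if isinstance(event, Stop):
                return event.result
            produced = self._steps[type(event)](event)
            if produced is not None:
                queue.append(produced)
        raise RuntimeError("workflow ended without a Stop event")


flow = MiniWorkflow()


@flow.step(Start)
def shout(ev: Start) -> Upper:
    return Upper(text=ev.text.upper())


@flow.step(Upper)
def finish(ev: Upper) -> Stop:
    return Stop(result=ev.text + "!")


print(flow.run(Start(text="hello")))  # prints HELLO!
```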

The query engine has three steps, connected to one another by a stream of events. Let's break them down.

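Step 1: Decompose the original query

query: takes the original query and breaks it into sub-questions.

Step 2: Answer the sub-questions

sub_question: for a given sub-question, generates a response from a ReAct agent equipped with tools.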

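As you may recall from earlier, tools give the agent additional, distinct ways to extract information from the data. ReAct (reasoning and acting) is a common agent implementation that uses prompting to guide an LLM to create, maintain, and adjust a plan dynamically by interleaving explained reasoning with actions. If you want to see what the prompt looks like in this example, you can find it in the LlamaIndex source code.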

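An answer is generated for each individual sub-question.

Step 3: Combine the answers

combine_answers: merges the individual answers into a complete response.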
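The combine_answers step can only run once every sub-question has been answered, so answers have to be buffered until the expected number has arrived. The following plain-Python sketch shows that fan-in pattern under simplifying assumptions (single-threaded, one batch at a time). `Collector` is a hypothetical class for illustration, not how LlamaIndex implements event collection.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class Collector(Generic[T]):
    """Buffer items until `expected` of them have arrived."""

    def __init__(self, expected: int) -> None:
        self.expected = expected
        self._buffer: List[T] = []

    def collect(self, item: T) -> Optional[List[T]]:
        """Store the item; return the full batch once complete, else None."""
        self._buffer.append(item)
        if len(self._buffer) < self.expected:
            return None
        ready, self._buffer = self._buffer, []
        return ready


collector: Collector[str] = Collector(expected=2)
print(collector.collect("answer to sub-question 1"))  # prints None
print(collector.collect("answer to sub-question 2"))
```

Returning `None` until the batch is complete mirrors the early return in the combine step: the step is invoked once per answer but only does its work on the last one.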

The code for the sub-question query engine is shown below.

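import json

from llama_index.core.agent import ReActAgent
from llama_index.core.workflow import (
    Context,
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)


class QueryEvent(Event):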
    question: str


class AnswerEvent(Event):
    question: str
    answer: str


class SubQuestionQueryEngine(Workflow):
    @step
    async def query(self, ctx: Context, ev: StartEvent) -> QueryEvent:
        if hasattr(ev, "query"):
            await ctx.set("original_query", ev.query)
            print(f"Query is {await ctx.get('original_query')}")

        if hasattr(ev, "llm"):
            await ctx.set("llm", ev.llm)

        if hasattr(ev, "tools"):
            await ctx.set("tools", ev.tools)

        response = (await ctx.get("llm")).complete(
            f"""
            Given a user question, and a list of tools, output a list of
            relevant sub-questions, such that the answers to all the
            sub-questions put together will answer the question. Respond
            in pure JSON without any markdown, like this:
            {{
                "sub_questions": [
                    "What is the population of San Francisco?",
                    "What is the budget of San Francisco?",
                    "What is the GDP of San Francisco?"
                ]
            }}
            Here is the user question: {await ctx.get('original_query')}

            And here is the list of tools: {await ctx.get('tools')}
            """
        )

        print(f"Sub-questions are {response}")

        response_obj = json.loads(str(response))
        sub_questions = response_obj["sub_questions"]

        await ctx.set("sub_question_count", len(sub_questions))

        for question in sub_questions:
            # Emit one QueryEvent per sub-question via the workflow context
            ctx.send_event(QueryEvent(question=question))

        return None

    @step
    async def sub_question(self, ctx: Context, ev: QueryEvent) -> AnswerEvent:
        print(f"Sub-question is {ev.question}")

        agent = ReActAgent.from_tools(
            await ctx.get("tools"), llm=await ctx.get("llm"), verbose=True
        )
        response = agent.chat(ev.question)

        return AnswerEvent(question=ev.question, answer=str(response))

    @step
    async def combine_answers(
        self, ctx: Context, ev: AnswerEvent
    ) -> StopEvent | None:
        ready = ctx.collect_events(
            ev, [AnswerEvent] * await ctx.get("sub_question_count")
        )
        if ready is None:
            return None

        answers = "\n\n".join(
            [
                f"Question: {event.question}: \n Answer: {event.answer}"
                for event in ready
            ]
        )

        prompt = f"""
            You are given an overall question that has been split into sub-questions,
            each of which has been answered. Combine the answers to all the sub-questions
            into a single answer to the original question.

            Original question: {await ctx.get('original_query')}

            Sub-questions and answers:
            {answers}
        """

        print(f"Final prompt is {prompt}")

        response = (await ctx.get("llm")).complete(prompt)

        print("Final response is", response)

        return StopEvent(result=str(response))
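The query step above passes the model's reply straight to `json.loads`, which fails if the model wraps the JSON in extra text or a markdown fence despite the prompt. A more forgiving parser is one possible hardening. This is a sketch only, and `parse_sub_questions` is a hypothetical helper that is not part of the notebook.

```python
import json
from typing import List


def parse_sub_questions(raw: str) -> List[str]:
    """Extract the "sub_questions" list from a possibly noisy LLM reply."""
    # Keep only the outermost {...} span, dropping any surrounding text.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    obj = json.loads(raw[start : end + 1])  # JSONDecodeError is a ValueError
    questions = obj.get("sub_questions")
    if not isinstance(questions, list) or not all(
        isinstance(q, str) for q in questions
    ):
        raise ValueError('"sub_questions" must be a list of strings')
    # Drop empty entries and surrounding whitespace.
    return [q.strip() for q in questions if q.strip()]


reply = 'Sure! {"sub_questions": ["What was the budget in 2016?", "What was the budget in 2023?"]}'
print(parse_sub_questions(reply))
```

In the workflow, the call would replace the `json.loads(str(response))` line and the lookup that follows it.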

Running the agent-enhanced sub-question query engine

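We skipped over the creation of the query engine tools, but you can refer to the full notebook for that part. In short, each tool is a separate query engine built on a single (but very long, 300+ page) San Francisco budget document.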

engine = SubQuestionQueryEngine(timeout=120, verbose=True)
result = await engine.run(
    llm=Settings.llm,
    tools=query_engine_tools,
    query="How has the total amount of San Francisco's budget changed from 2016 to 2023?",
)

print(result)

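In the results, you can see the sub-questions being generated and the ReAct pattern being applied to each of them. The output below is truncated for readability; run the notebook to see the full answer.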

Sub-question is What is the budget of San Francisco in 2016?
> Running step 543e99b5-0b95-40a1-969c-f2ccecbcf405. Step input: What is the budget of San Francisco in 2016?
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: budget_2016
Action Input: {'input': 'What is the budget of San Francisco in 2016?'}
Observation: According to the provided information, the budget of San Francisco in 2016-17 is $51,569,787.
> Running step fc16dc8e-de61-4221-9cd9-0e2831a20067. Step input: None
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: The budget of San Francisco in 2016 is $51,569,787.
Step sub_question produced event AnswerEvent
Running step sub_question
Sub-question is What is the budget of San Francisco in 2023?
> Running step 6eca7b14-7748-4daf-a186-8338995605ef. Step input: What is the budget of San Francisco in 2023?
Observation: Error: Could not parse output. Please follow the thought-action-input format. Try again.
> Running step d8fed16d-6654-4b15-bc4d-3749e0d600de. Step input: None
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: budget_2023
Action Input: {'input': 'What is the budget of San Francisco in 2023?'}
Observation: The budget of San Francisco in 2023 is $14.6 billion.
> Running step 90f90257-aca3-4458-a7f6-f47ea3e10b48. Step input: None
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: The budget of San Francisco in 2023 is $14.6 billion.
...
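The trace above follows a Thought / Action / Action Input format, and it also shows a retry after the agent's output could not be parsed. To illustrate what parsing that format involves, here is a minimal sketch. It is not LlamaIndex's parser: `ReActStep` and `parse_react` are hypothetical names, and the sketch assumes each field sits on its own line and that Action Input is a Python-style dict literal, as in the trace.

```python
import ast
import re
from typing import Any, Dict, NamedTuple, Optional


class ReActStep(NamedTuple):
    thought: str
    action: Optional[str]
    action_input: Optional[Dict[str, Any]]
    answer: Optional[str]


def parse_react(text: str) -> ReActStep:
    """Parse one Thought/Action/Action Input (or Answer) block."""
    def field(label: str) -> Optional[str]:
        match = re.search(rf"^{label}:\s*(.+)$", text, flags=re.MULTILINE)
        return match.group(1).strip() if match else None

    thought = field("Thought")
    if thought is None:
        # Corresponds to the "Could not parse output" case in the trace.
        raise ValueError("Could not parse output: missing Thought")
    raw_input = field("Action Input")
    action_input = ast.literal_eval(raw_input) if raw_input else None
    return ReActStep(thought, field("Action"), action_input, field("Answer"))


sample = """Thought: I need to use a tool to help me answer the question.
Action: budget_2016
Action Input: {'input': 'What is the budget of San Francisco in 2016?'}"""

parsed = parse_react(sample)
print(parsed.action)        # prints budget_2016
print(parsed.action_input)
```

When parsing fails, an agent loop can feed the error back to the model as an observation and ask it to try again, which is what the second sub-question in the trace shows.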

The final response is shown below.

The budget of San Francisco in 2016 was $51,569,787, and the budget in 2023 is $14.6 billion. Therefore, the total amount of San Francisco's budget has increased significantly from 2016 to 2023, with a change of approximately $14.6 billion - $51,569,787 = $14,548,213,213.
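The final response above is model output, and LLMs are known to slip on multi-digit arithmetic, so it is worth checking the subtraction. The short check below takes the two figures exactly as they appear in the response (it does not verify the figures themselves, only the arithmetic) and compares the result with the difference the model reported.

```python
# Figures copied from the model's final response, not independently verified.
budget_2016 = 51_569_787
budget_2023 = 14_600_000_000  # "$14.6 billion"

difference = budget_2023 - budget_2016
print(f"{difference:,}")  # prints 14,548,430,213

# The difference stated in the model's response.
reported = 14_548_213_213
print(f"{difference - reported:,}")  # prints 217,000
```

The model's stated difference is off by $217,000, which suggests that numeric steps like this are better delegated to a tool or computed in code than left to the LLM.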