Jerry Liu • 2023-05-23

较“笨”的 LLM Agent 需要更多限制和更好的工具

摘要

在本文中，我们比较了不同复杂程度的 LLM 驱动 Agent 在实际数据任务（金融分析）中的表现。我们将具有更复杂、无约束交互行为（ReAct）的 Agent 与包含更简单、更受限交互（路由）的 Agent 的性能进行比较。我们特别分析了 Agent 层与工具层可以分别增加多少复杂性。

我们发现，语言模型的选择非常重要。由“较笨”模型（开玩笑地说，我们将任何非 GPT-4 模型都称为“笨”）驱动的 ReAct Agent 在处理数据时难以返回相关结果。我们发现，限制 Agent 的交互行为，并让它们能够访问更多可以更明确地执行复杂操作的工具，有助于提高这些不太复杂的 LLM 的查询性能。相比之下，更复杂的模型（GPT-4）可以更可靠地利用 ReAct 循环来执行各种复杂的数据查询。

这篇博文非常详细；我们在下面提供了大量实验和结果。最棒的是，您可以使用我们的示例 notebook 亲自运行所有这些内容！

Agent 概述

在过去几个月里，构建 LLM 驱动的 Agent 变得越来越流行。LangChain 等框架使得根据一组通用抽象来创建这些 Agent 变得容易得多。

从高层次来看，“Agent”本质上是一个自动化决策引擎，可用于与外部环境进行交互。核心 Agent 循环大致如下所示

Agent 可以访问一组“工具”，这些工具是它可以执行的通用函数。通过一些附加元数据，Agent 可以了解每个工具，并可以调用每个工具（作为函数调用或结构化 API）。
用户向 Agent 输入自然语言。
接收到输入后，Agent 以某种方式与工具集交互并返回响应。

执行Agent-工具交互的方式多种多样。

最流行的可能是ReAct：Agent 推理下一步行动，构建行动命令，执行行动。它在一个迭代循环中重复这些步骤，直到任务完成。
也有其他交互模式。最近有一篇关于计划与解决提示的论文，它事先生成一个计划（将复杂任务分解为更简单的任务）。在 ReAct 之前，也有关于Self-Ask 和思维链提示的相关技术。

“复杂”与“简单”的 Agent 交互技术

我们将 ReAct 等技术归类为更复杂且无约束：这是因为它们执行迭代推理，并将输入分解为更小的步骤。复杂的 Agent 交互循环允许更大的行为自由，并增加对所用 LLM 的负担。复杂交互框架的优点是它们更通用，可以处理针对简单工具的更广泛类型的查询。缺点是如果 LLM 不够出色，这些框架容易出错；无约束的行为可能导致意外结果。

在另一端，您可以想象一种简单且受限的 Agent 交互机制，其中 Agent 对底层工具进行一步选择，并返回工具的响应。Agent 本质上只是充当从查询到工具的路由器。没有将问题分解为更小步骤的环节，也没有迭代的思维链循环。优点是模型出错的可能性较小。缺点是这种交互技术允许的自由度较小，对行为施加了更多限制。

研究用于数据查询的 Agent 交互技术

LlamaIndex 对 Agent 如何协助增强数据任务很感兴趣。更具体地说，我们关注 Agent 如何帮助对各种数据源执行复杂的用户查询。这不仅包括对单个文档提问，还包括能够跨多个文档合成洞见并将其返回给用户。

LlamaIndex 查询引擎可用作 Agent 结构中的工具来查询您的数据（我们提供与 LangChain 的无缝集成）。这些工具的复杂性各不相同。例如，一个简单工具可以是我们的向量存储查询引擎，它可以从向量存储中执行 Top-k 嵌入检索。一个更高级的工具可以是基于我们的图数据结构的查询引擎，它可以设置为明确提供对任何文档子集的比较/对比功能。工具本身可以在底层包含“Agent 式”的决策能力。LlamaIndex 提供了围绕路由、查询分解和多部分查询规划的各种模块。

在这篇博文中，我们有兴趣比较以下设计 Agent 和工具的方法，以了解哪种方法能够以稳健的方式为不同的用户查询提供良好的答案

对一组简单工具使用更复杂且无约束的 Agent 交互 (ReAct)
使用更复杂工具进行更简单且受限的 Agent 交互（简单路由）

本质上，我们关注的是有多少复杂性可以推到 Agent 交互层，而不是留在工具层。我们探讨以下具体示例：假设用户查询是比较/对比两个不同的文档（一个相对复杂的查询）。如果工具集都只是不同文档上的向量索引，Agent 交互循环能否可靠地对向量索引执行该查询？另一方面，如果我们把复杂性推到工具层，那么我们可以明确地拥有一个可以在您的文档上执行“比较/对比”的工具。这样，Agent 的负担就只是调用这个工具，而不是以更复杂的方式与一组其他工具交互。

高层次发现

高层次的发现是，不太复杂的 Agent 需要更多限制。更具体地说，我们发现在 ReAct 循环中使用由 GPT-3 驱动的 Agent 在处理复杂查询时没有提供良好的结果；它无法找到针对提供的工具集执行适当的交互模式来呈现结果。相反，通过增加 Agent 行为的限制并在工具本身中提供更多的复杂性，我们能够让 GPT-3 Agent 产生更好的结果。

更智能的 Agent 需要更少限制。我们确实发现，在使用一组针对数据的简单工具时，带有 ReAct 的 GPT-4 Agent 能够提供比 GPT-3 Agent 更好的查询结果。这表明更强大的 Agent 可能不需要那么多工具来“明确”执行任务，因为大部分逻辑可以在 Agent 交互循环中处理。

设置

我们的数据包含 Uber 在 2022 年的三份 10-Q 文件（季度财务报告）：3 月、6 月和 9 月。我们希望对此数据执行不同的查询；这些查询主要围绕比较这些文档中不同的信息片段。

march_2022 = SimpleDirectoryReader(input_files=["../data/10q/uber_10q_march_2022.pdf"]).load_data()
june_2022 = SimpleDirectoryReader(input_files=["../data/10q/uber_10q_june_2022.pdf"]).load_data()
sept_2022 = SimpleDirectoryReader(input_files=["../data/10q/uber_10q_sept_2022.pdf"]).load_data()

我们使用 LlamaIndex 为每个文档定义一个向量索引，该索引仅将文档块 + 嵌入存储在向量存储中。然后，我们可以使用简单的QueryEngine 查询每个向量索引。我们为每个QueryEngine 对象创建一个工具。

# define indices
march_index = GPTVectorStoreIndex.from_documents(march_2022)
june_index = GPTVectorStoreIndex.from_documents(june_2022)
sept_index = GPTVectorStoreIndex.from_documents(sept_2022)

# define query engine
march_engine = march_index.as_query_engine(similarity_top_k=3)
june_engine = june_index.as_query_engine(similarity_top_k=3)
sept_engine = sept_index.as_query_engine(similarity_top_k=3)

我们还在这三个文档上定义了一个ComposableGraph。这个可组合图大致遵循此处描述的指南。这个图被明确设置为对这三个文档执行比较/对比查询。

graph = ComposableGraph.from_indices(
    GPTListIndex,
    children_indices=[march_index, june_index, sept_index],
    index_summaries=[
        "Provides information about Uber quarterly financials ending March 2022",
        "Provides information about Uber quarterly financials ending June 2022",
        "Provides information about Uber quarterly financials ending September 2022"
    ]
)

可以使用ComposableGraphQueryEngine 查询该图

# define decompose_transform
decompose_transform = DecomposeQueryTransform(verbose=True)

# define custom query engines
custom_query_engines = {}
for index in [march_index, june_index, sept_index]:
    query_engine = index.as_query_engine(service_context=service_context)
    query_engine = TransformQueryEngine(
        query_engine,
        query_transform=decompose_transform,
        transform_extra_info={'index_summary': index.index_struct.summary},
    )
    custom_query_engines[index.index_id] = query_engine

custom_query_engines[graph.root_id] = graph.root_index.as_query_engine(
    service_context=service_context,
    streaming=True,
)

# define graph
g_engine = graph.as_query_engine(
    custom_query_engines=custom_query_engines
)

我们尝试以下 Agent 设置

GPT-3 ReAct Agent：一个零样本 GPT-3 ReAct Agent，配备三个工具：每个工具对应于一份 10-Q 文件的向量索引。
GPT-4 ReAct Agent：与上述相同，但使用 GPT-4 代替。
简单路由 Agent：一个简单的路由“Agent”，配备四个工具：上述三个工具 + 明确设置为执行比较/对比查询的ComposableGraphQueryEngine。

初始化这些 Agent 的代码片段如下所示。对于简单路由 Agent，我们使用 LlamaIndex 内置的RouterQueryEngine，尽管您也应该能够在 LangChain 中通过零样本 Agent（调整设置后）或路由链来实现类似的结果。

GPT-3/GPT-4 ReAct Agent 设置

# initializing zero-shot ReAct agent

uber_config_sept = IndexToolConfig(
    query_engine=sept_engine, 
    name=f"Uber 10Q September 2022",
    description=f"Provides information about Uber quarterly financials ending September 2022",
    tool_kwargs={"return_direct": False}
)
uber_config_june = IndexToolConfig(
    query_engine=june_engine, 
    name=f"Uber 10Q June 2022",
    description=f"Provides information about Uber quarterly financials ending June 2022",
    tool_kwargs={"return_direct": False}
)
uber_config_march = IndexToolConfig(
    query_engine=march_engine, 
    name=f"Uber 10Q March 2022",
    description=f"Provides information about Uber quarterly financials ending March 2022",
    tool_kwargs={"return_direct": False}
)

toolkit = LlamaToolkit(
    index_configs=[uber_config_sept, uber_config_june, uber_config_march],
)

# this is a light wrapper around `initialize_agent` in langchain (which defaults to zero-shot)
agent_chain = create_llama_agent(
    toolkit,
    llm, # can be GPT-3 or GPT-4 
    verbose=True
)

简单路由 Agent 设置


query_tool_sept = QueryEngineTool.from_defaults(
    query_engine=sept_engine,
    description=f"Provides information about Uber quarterly financials ending September 2022",
)
query_tool_june = QueryEngineTool.from_defaults(
    query_engine=june_engine,
    description=f"Provides information about Uber quarterly financials ending June 2022",
)
query_tool_march = QueryEngineTool.from_defaults(
    query_engine=march_engine,
    description=f"Provides information about Uber quarterly financials ending March 2022",
)
query_tool_graph = QueryEngineTool.from_defaults(
    query_engine=g_engine,
    description=f"Provides comparisons between Uber financials across quarters in 2022. Can be used to answer "
                 "any questions that require analysis across multiple quarters.",
)

# our "router" query engine is effectively a simple agent that can only perform routing
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        query_tool_sept,
        query_tool_june,
        query_tool_march,
        query_tool_graph
    ]
)

现在我们已经描述了设置，让我们来看看下面的结果吧！

发现与实验

从高层次来看，我们发现在 ReAct Agent 中使用 GPT-3 在这些查询上产生了次优结果。它们倾向于表现出以下特征

所选工具集不可预测：即使问题在语义上相似，选择的工具集也可能不同，导致响应存在差异性。
所选工具集覆盖不全：通常我们期望一个给定的问题能够利用所有三份 10-Q 报表，但往往只选取了其中的一部分。
错误的思维链处理：有时 Agent 在整个 CoT 过程中使用了与问题无关的工具。

相比之下，我们发现 GPT-4 ReAct Agent 提供的答案更相关、更可预测，并且在中间结果中出现的错误更少。

最后，我们发现使用一个更简单的、仅路由的 GPT-3 Agent，并使其能够访问明确的“比较/对比”工具，可以使 Agent 表现更好。

提醒一下，完整结果都在 notebook 中：https://colab.research.google.com/drive/1uP38k4nr8OPmXbY4dLoKKQW0F29WtNuY?usp=sharing

GPT-3 ReAct Agent 结果

查询 1

agent_chain.run(input="Analyze Uber revenue growth over the last few quarters")

响应

我们看到只选择了 9 月份的 10-Q 文件来回答这个问题。尽管 9 月份的 10-Q 确实包含一些关于营收增长与 2021 年同期比较的信息，但这并没有明确回答关于过去几个季度营收增长的问题。

查询 2

agent_chain.run(input="Analyze changes in risk factors for Uber")

响应

选择了 9 月份和 6 月份的 10-Q 文件，但未选择 3 月份。此外，答案很模糊，没有提供关于 Uber 具体风险因素的详细信息（并且还提到风险因素“在过去三个季度发生了变化”，尽管它只使用了两个工具）。

查询 3

在这个查询中，我们更明确地展示了提示中的微小变化如何通过不同的工具诱导不同的思维链路径，并最终产生不同的答案。

# Prompt variation 1 
agent_chain.run(input="Analyze Uber revenue growth and risk factors over time")

响应

# Prompt variation 2
agent_chain.run(input="Analyze Uber revenue growth and risk factors over quarters")

这两个查询的主要区别在于“随着时间”和“按季度”。正如我们所见，两种变体不仅选择的工具不同，输入也不同——第一个是“财务状况”，第二个是“营收增长和风险因素”。

由于第一个变体中的工具输入与问题无关，因此答案同样模糊：“可以通过比较财务状况来分析 Uber 的营收增长和风险因素…”

查询 4

在这里，我们不再问比较/对比的问题，而是只问一个关于特定陈述的问题。

agent_chain.run(input="How much cash did Uber have in sept 2022?")

我们看到 Agent 犯了两个错误：1) 它无法向每个工具提供行动输入，以及 2) 它最终查看了与问题无关的 6 月和 3 月文件。

GPT-4 ReAct Agent 结果

GPT-4 ReAct Agent 的表现比 GPT-3 Agent 好得多。它们全面遍历可用的工具集，并提供更详细的观察提取和响应合成。

我们不会详细介绍所有这些示例，但您可以在示例 notebook 中找到它们！

查询 1

agent_chain_gpt4.run(input="Analyze Uber revenue growth over the last few quarters")

响应

与 GPT-3 Agent 不同，这里的 GPT-4 Agent 至少会遍历所有文件并合成结果。

查询 2

agent_chain_gpt4.run(input="Analyze changes in risk factors for Uber")

响应

在这里，GPT-4 Agent 仍然只查看了 9 月和 6 月（跳过了 3 月），但响应详细得多，并引用了报告中的具体事实。

GPT-3 路由 Agent 结果

提醒：路由 Agent 不执行任何思维链 (CoT)，并且额外访问我们的 ComposableGraph 查询引擎，该引擎可以明确执行比较/对比查询。让我们看看这个 Agent 的表现如何。

查询 1

response = query_engine.query("Analyze Uber revenue growth over the last few quarters")

响应（中间步骤）

Selecting query engine 3: Provides comparisons between Uber financials across quarters in 2022. Can be used to answer any questions that require analysis across multiple quarters..
> Current query: Analyze Uber revenue growth over the last few quarters
> New query:  What was Uber's revenue growth from the last quarter ending March 2022 compared to the previous quarter?
> Current query: Analyze Uber revenue growth over the last few quarters
> New query:  What was Uber's revenue growth from the last quarter ending June 2022 compared to the previous quarter?
> Current query: Analyze Uber revenue growth over the last few quarters
> New query:  What was Uber's revenue growth from the last few quarters ending September 2022?

最终响应

Based on the limited context information provided, we can only analyze Uber's revenue growth for two specific quarters:

1. For the quarter ending March 2022, Uber's revenue growth increased compared to the first quarter of 2021. However, the exact percentage of growth is not provided.

2. For the quarter ending September 2022, Uber's revenue growth was 72%.

Unfortunately, we do not have enough data to analyze the revenue growth for the quarters ending June 2022 or to compare it to the previous quarter. To provide a more comprehensive analysis, additional data on Uber's revenue growth for other quarters would be needed.

公平地说，这个响应没有涵盖 6 月份。但这是工具本身的实现问题，而不是 Agent 交互步骤的问题。Agent 的唯一工作是选择正确的工具来解决当前任务，它正确地完成了这项工作；它决定选择对应于我们的ComposableGraphQueryEngine抽象（可以执行比较/对比查询）的工具。

查询 2

response = query_engine.query("Analyze changes in risk factors for Uber")

响应（中间步骤）

INFO:llama_index.query_engine.router_query_engine:Selecting query engine 3: Provides comparisons between Uber financials across quarters in 2022. Can be used to answer any questions that require analysis across multiple quarters..
Selecting query engine 3: Provides comparisons between Uber financials across quarters in 2022. Can be used to answer any questions that require analysis across multiple quarters..
&gt; Current query: Analyze changes in risk factors for Uber
&gt; New query:  What are the risk factors for Uber in the quarter ending March 2022?
&gt; Current query: Analyze changes in risk factors for Uber
&gt; New query:  What are the risk factors for Uber in the quarter ending June 2022?
&gt; Current query: Analyze changes in risk factors for Uber
&gt; New query:  What are the risk factors for Uber in the quarter ending September 2022?

最终响应

From the context information provided, we can analyze the changes in risk factors for Uber across the three quarters as follows:

1. COVID-19 pandemic: The ongoing pandemic remains a constant risk factor across all quarters, with its impact on driver and courier supply constraints, consumer demand, and government restrictions. However, the specific concerns related to the pandemic may evolve, such as the emergence of new virus variants or changes in vaccination rates.

2. Driver and courier supply constraints: This risk factor is present in the quarter ending March 2022 but is not explicitly mentioned in the subsequent quarters. However, it may still be an underlying concern, especially if the pandemic continues to impact driver availability.

3. Changes in consumer demand: This risk factor is mentioned in the quarter ending March 2022, with an imbalance between Mobility and Delivery services. It is not explicitly mentioned in the subsequent quarters, but fluctuations in revenue due to changes in Gross Bookings and Mobility Trip volumes in the quarter ending June 2022 could be related to changes in consumer demand.

4. Government restrictions and regulations: This risk factor is present in the quarter ending March 2022, with the impact of restrictions on Mobility offerings. In the quarter ending June 2022, the focus shifts to the potential impact of regulatory changes and legal claims, such as driver classification in the UK. In the quarter ending September 2022, the risk factor of driver classification as employees, workers, or quasi-employees becomes more prominent.

5. Uncertainty around future developments: This risk factor is mentioned in the quarter ending March 2022 but is not explicitly mentioned in the subsequent quarters. However, it may still be an underlying concern, especially with the ongoing impact of COVID-19 and potential permanent changes to end-users' behavior.

6. Accounting pronouncements: This risk factor is introduced in the quarter ending June 2022, with the adoption and impact of recently issued accounting pronouncements potentially affecting Uber's consolidated financial statements.

7. Integration and performance of acquired businesses: This risk factor is introduced in the quarter ending June 2022, with the acquisition of Transplace contributing to the increase in Freight revenue.

8. Competition: This risk factor is introduced in the quarter ending September 2022, highlighting the highly competitive nature of the mobility, delivery, and logistics industries, and the need to lower fares, offer incentives, and promotions to remain competitive.

9. Significant losses and increasing operating expenses: This risk factor is introduced in the quarter ending September 2022, emphasizing the financial challenges Uber faces in achieving profitability.

Agent 的工作再次变得简单：它选择优化用于执行“比较/对比”查询的图查询引擎。结果比从 GPT-3 ReAct Agent 获得的结果详细得多（并且与 GPT-4 ReAct Agent 的结果相当）。它包含所有风险因素的详细要点分析，并提供所有要点随时间的比较。

查询 3

response = query_engine.query("Analyze Uber revenue growth and risk factors over the last few quarters")

响应（中间步骤）

> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query:  What is Uber's revenue growth and risk factors for the quarter ending March 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query:  What is Uber's revenue growth and risk factors for the quarter ending March 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query:  What is Uber's revenue growth and risk factors for the quarter ending June 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query:  What is Uber's revenue growth and risk factors for the quarter ending June 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query:  What were Uber's total revenues for the quarter ending September 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query:  What were Uber's total revenues for the quarter ending September 2022?

最终响应

Uber's revenue growth has been impressive over the past few quarters, with the quarter ending March 2022 showing a 136% year-over-year growth and the quarter ending June 2022 showing a 105% year-over-year growth. The revenue for the quarter ending June 2022 was $8,343 million. 

The risk factors for Uber's business have been largely related to the ongoing impacts of the COVID-19 pandemic, including reduced global demand for Mobility rides, supply constraints, and potential permanent changes to end-user behavior. These risk factors have been present in both the quarter ending March 2022 and the quarter ending June 2022, though the specific risks have varied slightly.

总结思考

基于 ReAct 的 Agent 提供了一个强大、通用的推理循环，并有潜力解决您的数据上的复杂任务。但它们往往只能与更强大的语言模型（如 GPT-4）可靠地协同工作。不太复杂的模型（例如 GPT-3）将做出更不可预测和错误的决策，导致您的数据源上的查询性能欠佳。

使用“较笨”模型实现的 Agent 需要更多交互限制，才能做出更可靠、更少错误的决策。我们发现，如果我们明确限制 Agent 接口并将复杂性推到工具层，我们仍然可以创建在您的数据上提供良好性能的 Agent。

当然，这只是初步分析，并且存在一些警告/局限性

您可能能够通过“提示调整”默认的 ReAct 循环来获得更一致的结果，我们没有尝试这样做。
我们仅在三份财务文档集上进行了测试。如果要在数千份文档上进行测试，还需要做大量工作。
我们只比较了 GPT-3 和 GPT-4，还有更多模型需要比较/基准测试，例如 ChatGPT、任何开源模型、Anthropic Claude 等。
除了 ReAct 之外，我们没有测试其他 Agent 交互模式：“计划与解决”Agent（尽管我们在 LlamaIndex 中有类似的表述）、AutoGPT 类似的任务管理等。

无论您是否有类似发现或不同意我们的分析，请告诉我们！我们非常乐意在我们的Discord上促成这场讨论。