宣布 LlamaCloud 全面可用(以及我们的 1900 万美元 A 轮融资)!
LlamaIndex

Jerry Liu 2023-05-12

在 SEC 10-K 文件上测试 Anthropic Claude 的 10 万 token 窗口

Anthropic 昨日发布的10 万 token 上下文窗口扩展席卷了人工智能社区。10 万 token 的限制大约等于 7.5 万词(约为 GPT4–32k 上下文窗口的 3 倍,GPT-3.5/ChatGPT 的 25 倍);这意味着您现在可以在单次推理调用中放入 300 页文本。

Anthropic 博客中强调的核心用例之一是分析 SEC 10-K 文件;该模型能够摄取整个报告,并对不同问题产生答案。

巧合的是,几个月前我们发布了一篇教程,展示了 LlamaIndex + Unstructured + GPT3 如何帮助您对 UBER SEC 10-k 文件执行不同的查询。LlamaIndex 提供了一套全面的工具包,可帮助您在上下文窗口有限的任何 LLM 之上管理外部数据,我们展示了我们可以执行各种查询,从单个文档上的问题到跨文档部分的比较。

Anthropic 的 10 万模型在 UBER SEC 10-k 文件上的表现如何?此外,没有 LlamaIndex 更高级数据结构的帮助下,它的表现又如何?在这篇博客中,我们使用最简单的数据结构:列表索引,展示了 Anthropic 模型在不同查询上的性能。

高层发现

Anthropic 10 万模型的优势

  • 对数据的整体理解(经过一些提示调整后,有点像):Anthropic 的模型确实展示了令人印象深刻的能力,能够综合整个上下文窗口中的见解来回答当前问题(假设我们设置了 response_mode="tree_summarize",详见下文)。但它也可能会遗漏细节;详见下文!
  • 延迟:这一点让我们感到惊讶。Anthropic 的模型能够在约 60-90 秒内处理完整个 UBER 10-k 文件,这看起来时间不短,但比重复调用 GPT-3 API 要快得多(累加起来可能需要几分钟)。

Anthropic 10 万模型的不足

  • 成本:这一点很明显。我们运行的每个查询都处理了数十万 token。根据Claude-v1 每百万 token 11 美元的定价,这相当于每次查询 1 美元,费用很快就会累积起来。
  • 对更复杂提示的推理:Anthropic 模型在理解我们用于“创建和细化”响应合成的细化提示方面出人意料地表现不足,返回了错误/不相关的结果。我们最终转而使用“树状摘要”。结果见下文。

方法概述

我们想测试 Anthropic 10 万模型在 2019-2022 年 UBER 10-k 文件上的能力。我们也希望在尽可能少使用检索/合成结构的情况下进行测试。这意味着不使用 embedding,也不使用花哨的检索机制。

理想情况下,我们可以直接将整个 10-k 文件(甚至所有四个 10-k 文件)插入到提示中。然而,我们发现单个 UBER 10-k 文件实际上包含约 16 万 token,这超出了 10 万的上下文窗口。这意味着我们仍然必须对每个文件进行分块!

我们最终使用了我们的列表索引数据结构 — 我们将每个文本分割成巨大的约 10 万 token 的块,并使用我们的响应合成策略在多个块之间合成答案。

我们对每个文件以及多个文件运行了一些查询,这与我们最初的博客文章类似。结果报告如下。

教程设置

我们的数据摄取与 LlamaIndex + Unstructured 博客文章相同。我们使用 Unstructured 的 HTML 解析器将 HTML DOM 解析成格式良好的文本。然后为每个 SEC 文件创建一个 Document 对象。

您可以在 LlamaHub 上访问 Unstructured 数据加载器。

from llama_index import download_loader
from pathlib import Path

UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

loader = UnstructuredReader()
doc_set = {}
all_docs = []
years = [2022, 2021, 2020, 2019]
for year in years:
    year_doc = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)[0]
    # insert year metadata into each year
    year_doc.extra_info = {"year": year}
    doc_set[year] = year_doc
    all_docs.append(year_doc)

接下来,我们想设置 Anthropic LLM。我们默认使用 claude-v1。我们还想在 PromptHelper 对象中手动定义新的 10 万 token 输入大小;这将有助于我们在响应合成过程中了解如何将上下文“压缩”到输入提示空间中。

我们将 max_input_size 设置为 10 万,将输出长度设置为 2048。我们还将上下文块大小设置为一个较高的值(9.5 万,为提示的其余部分留出一些缓冲空间)。只有当 token 数量超过此限制时,上下文才会被分块。

from llama_index import PromptHelper, LLMPredictor, ServiceContext
from langchain.llms import Anthropic

# define prompt helper
# set maximum input size
max_input_size = 100000
# set number of output tokens
num_output = 2048
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

llm_predictor = LLMPredictor(llm=Anthropic(model="claude-v1.3-100k", temperature=0, max_tokens_to_sample=num_output))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper,
    chunk_size_limit=95000
)

分析单个文档

我们首先分析对单个文档的查询。我们在 2019 年 UBER 10-K 上构建一个列表索引

list_index = GPTListIndex.from_documents([doc_set[2019]], service_context=service_context)
print(len(list_index.index_struct.nodes))

如前所述,该 10-K 文件超过了 10 万 token 的限制,因此列表索引中有两个节点。

然后我们提出一个查询:“2019 年最大的一些风险因素是什么?”

回顾一下,LlamaIndex 中有两种跨多个节点(总上下文超出上下文窗口)进行响应合成的方法:“创建和细化”策略和“树状摘要”策略。

  • 创建和细化:按顺序遍历每个检索到的 Node。对第一个节点使用问答提示,对后续节点使用细化提示。每个节点进行一次单独的 LLM 调用。
  • 树状摘要:给定一组 Node 对象和查询,使用问答提示递归构建树,并将根节点作为响应返回。适用于摘要目的。

Claude-v1 10 万模型在“创建和细化”策略(这是默认策略)下表现不佳。

# NOTE: the default create/refine approach does not give good answers
query = "What were some of the biggest risk factors in 2019?"
query_engine = list_index.as_query_engine(service_context=service_context)
response = query_engine.query(query)

以下是检索到的答案

Original answer:

Some of the biggest risk factors for Uber in 2019 included:

• Regulatory challenges and uncertainty. Uber faced regulatory challenges and uncertainty in many markets, including restrictions on its products and services, caps on pricing, and licensing requirements. For example, California's AB5 law and other similar laws increased the risk of Drivers being classified as employees. Uber also faced regulatory scrutiny and bans in London, Barcelona, and other markets.

• Competition. The markets in which Uber operates are highly competitive, and Uber faced significant competition from well-established and low-cost alternatives in 2019. Competitors also aggressively competed for Drivers and consumers by offering significant incentives and discounts. 

• Safety and security. There were risks related to the safety and security of Uber's platform, including risks from vehicle or scooter accidents, assaults, and other incidents. Uber released a safety report in 2019 detailing reports of sexual assaults on its platform which brought additional scrutiny.

• Financial performance and profitability. Uber continued to experience significant losses in 2019 as it invested in growth. There was no guarantee that Uber would achieve or sustain profitability in the future.  

• Reliance on third parties. Uber relied on third parties for key elements of its business like cloud computing services, payment processing, and background checks. If these third parties failed to provide services or increased costs, it could adversely impact Uber's business.

• Macroeconomic conditions. Uber's business was sensitive to economic conditions and changes that impacted discretionary consumer spending. A decline in the economy could reduce demand for Uber's products and services.  

• Reliance on Drivers and Restaurants. Uber's success depended on attracting and retaining enough Drivers and Restaurants to its platform. If Uber was unable to attract or retain enough Drivers and Restaurants, it could negatively impact its operations.

• Intellectual property. If Uber was unable to protect its intellectual property, or if third parties claimed Uber was infringing on their intellectual property, it could harm Uber's business. Uber relied on a combination of copyright, trademark, patent, and trade secret laws to establish its intellectual property rights.

Refined answer:

Some of the biggest risk factors for Uber in 2019 included:

• Regulatory challenges and uncertainty. Uber faced significant regulatory challenges in 2019, including AB5 in California which increased the risk of Drivers being classified as employees and regulatory scrutiny and temporary bans in London and other markets. These regulatory issues created uncertainty and posed risks to Uber's business model and financial results.

• Competition. The ridesharing and meal delivery markets are highly competitive, and competitors offered significant incentives and discounts to take market share from Uber in 2019. This competition could negatively impact Uber's growth and profitability.

• Safety and security. There were risks related to the safety and security of Uber's platform, including risks from vehicle or scooter accidents, assaults, and other incidents. Uber's safety report detailing sexual assault reports brought additional scrutiny and risks. Any failure to ensure safety could significantly damage Uber's reputation and business.

• Financial performance and profitability. Uber continued to experience substantial losses in 2019 due to investments in growth, and there was no guarantee of future profitability. Uber's path to profitability was uncertain, and failure to become profitable could threaten its business model and access to capital.  

• Reliance on third parties. Uber relied on third parties for services like cloud computing, payment processing, and background checks. If these third parties failed to provide services or increased costs, it could negatively impact Uber's offerings, growth, and profitability.  

• Macroeconomic conditions. Uber's business was sensitive to the economy and consumer discretionary spending. An economic downturn could reduce demand for Uber's offerings and threaten its growth and financial performance.

• Reliance on Drivers and Restaurants. Uber's success depended on attracting and retaining enough Drivers and Restaurants to support growth and meet consumer demand. Failure to attract or retain Drivers and Restaurants could significantly impact Uber's business.  

• Intellectual property. Failure to protect intellectual property or infringement of others' intellectual property could damage Uber's competitive position and business. Uber relied on laws and policies to protect intellectual property, but monitoring infringement and unauthorized use was difficult.

如图所示,答案中包含了提示本身;细化提示的目的是只生成细化后的答案,而不是同时生成原始答案和细化后的答案。

不知何故,使用树状摘要策略效果好得多。

# NOTE: tree_summarize gives better answers
query = "What were some of the biggest risk factors in 2019?"
query_engine = list_index.as_query_engine(service_context=service_context, response_mode="tree_summarize")
response = query_engine.query(query)

返回的响应

• Regulatory challenges and uncertainty: Uber faced significant regulatory challenges and uncertainty in 2019, including AB5 in California which codified a new test for determining whether workers should be classified as employees or independent contractors. Uber also faced regulatory scrutiny and bans in other markets like London, UK. These regulatory issues created uncertainty and risk around Uber's business model and operations.

• Safety and security: Uber received negative publicity around safety incidents on its platform which could damage its brand and reputation. Uber released a safety report in 2019 on sexual assaults and other incidents which led to additional scrutiny. Safety and security risks remain an ongoing issue for Uber's business.

• Competition: The markets in which Uber competes are intensely competitive, and Uber faces competition from new and existing companies in the various segments it operates in like ridesharing, food delivery, and logistics. Increased competition could make it difficult for Uber to attract and retain platform users.

• Financial losses: Uber continued to experience significant losses from operations in 2019 as it invested in growth. Uber may not achieve or sustain profitability in the future which could negatively impact its business and stock price.

• Reliance on third-parties: Uber relies on third-party data centers, cloud computing services, and software which exposes it to risks if these third-parties experience any disruptions or interference. Any disruption of third-party services could negatively impact Uber's business.

• Macroeconomic conditions: Uber's business is sensitive to macroeconomic conditions, and any downturn could reduce discretionary consumer spending which would harm Uber's business. Issues like a recession, increased unemployment, or reduced consumer confidence could all potentially impact Uber's business.

• Reliance on Drivers and Restaurants: Uber's business depends on attracting and retaining Drivers and Restaurants to its platform. If Uber is unable to attract or maintain a critical mass of Drivers and Restaurants, its platform will become less appealing to platform users and its financial results would be adversely impacted.

答案好多了。它详细介绍了中美贸易战、经济增长放缓、英国脱欧等等(请记住 2019 年是疫情前)。

Token 用量和延迟

文档包含约 17 万 token。不知为何,这个数字没有反映在 Anthropic 使用日志中(“Prompt Tokens”部分似乎上限为 10240)。但提示长度(以字符为单位)以及模型延迟被记录下来了。

根据定价,约 17 万 token 相当于 1.5-2 美元。

对一个 Uber SEC-10K 进行一次查询大约需要 150 ,包括所有 LLM 调用。这实际上比重复调用 ChatGPT/davinci 稍微快一些。每次 ChatGPT/davinci 调用(将 4K token 窗口最大化)根据经验需要 6-10 秒才能完成 → 125-250 秒(或更长)。

分析多个文档

在我们之前的博客文章中,一个热门例子展示了如何使用 LlamaIndex 图结构比较/对比不同的文档。

我们也在这里测试是否可以做到这一点,方法是将多个 SEC 报告输入到 Claude-v1 10 万模型中。

注意:考虑到单个 UBER SEC-10K 文件甚至无法完全放入上下文窗口,我们当然还需要实现响应合成策略来处理摄取多个 10K 文件的情况。

我们在所有 4 个 10K 文件上构建一个列表索引:2019 年、2020 年、2021 年和 2022 年。

list_index = GPTListIndex.from_documents(all_docs, service_context=service_context)
print(len(list_index.index_struct.nodes))

然后我们使用树状摘要响应模式提出问题。

query = "How are the risk factors changing across years? Compare/contrast the risk factors across the SEC filings."
query_engine = list_index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query(query)

完整答案如下

The risk factors disclosed in Uber's SEC filings have evolved over time based on Uber's business and industry changes. Some of the key differences in risk factors across the filings are:

2017 10-K:
- Focused heavily on risks related to negative publicity, competition, dependence on independent contractors, and regulatory challenges as Uber was still facing backlash from various PR crises and regulatory pushback. 
- Also highlighted risks from intellectual property litigation given various IP disputes at the time.

2018 10-K:
- Added more risks related to autonomous vehicles as Uber ramped up its self-driving car efforts. Specifically called out risks from accidents, technical challenges, and competition in the AV space.
- Removed some risks related to negative publicity and PR crises as those issues had subsided. But added risks related to corporate culture and workplace environment given the Fowler scandal.

2019 10-K: 
- Further expanded AV risks to include risks from partnerships and third party relationships as Uber relied more on partnerships with other companies like Toyota and Volvo. 
- Added risks related to environmental regulations and climate change as those issues received more mainstream attention. 
- Removed risks related to initial scaling challenges as Uber had matured as a company. But added risks related to growth into new offerings and markets as Uber expanded into freight shipping, scooters, etc.

In summary, Uber's risk factors have evolved to reflect the key issues and challenges facing the company at any point in time. The risks highlight both external factors like regulations as well as internal factors related to Uber's business, technology, and operations. The changes over time show how an innovative company's risks can shift quite quickly.

Some of the key trends in how the risk factors are changing include:

1. Risks related to negative publicity and PR crises decreased over time as those issues subsided, while risks related to other issues like AV technology, workplace culture, and new initiatives increased. This shows how Uber's risks evolved as the company matured.

2. Risks tend to increase in areas where Uber is actively investing or expanding into. For example, risks related to AVs, partnerships, and new mobility offerings increased as Uber ramped up efforts in those areas. This highlights how risk profiles change with a company's strategic priorities.

3. External risks like regulations and climate change were added as those issues gained more mainstream attention and importance. This shows how companies have to adapt their risk factors to account for changes in the overall business environment. 

4. Certain foundational risks around competition, growth, and reliance on independent contractors persisted over time. But the specifics and details provided for those risks evolved based on Uber's current challenges and priorities. So while the themes remained, the risks were actively updated.

In summary, Uber's risk factors changed over time to provide more details on the company's priorities, challenges, partnerships, and external environment at any given point. But certain core risks inherent to Uber's business model have also remained consistently highlighted, demonstrating how those foundational risks are long-term in nature. The changes in risks over time provide a glimpse into how an innovative company's risk profile is constantly evolving.

这个响应只包含对 2019 年 10-K 的风险因素分析(其中又包含了 2017 年和 2018 年的风险因素)。它不包含 2020 年及以后的年份。这部分原因可能是由于我们的树状摘要响应合成策略。尽管如此,这表明试图天真地将文档“塞进”巨大的 10 万 token 块,并使用简单的响应合成策略,仍然无法产生最优的答案。

Token 用量和延迟

正如预期,将所有四个文档输入 Anthropic 需要更多链式 LLM 调用,这会消耗更多 token 并花费更长时间(大约 9-10 分钟)。

结论

总的来说,新的 10 万 token 上下文窗口令人振奋,为开发者提供了将数据馈送给 LLM 以执行不同任务/查询的新模式。它提供了连贯的分析,且边际 token 成本远低于 GPT-4。

话虽如此,在每次推理调用中试图最大化利用这个上下文窗口,确实在延迟和成本方面带来了一些权衡。

我们期待在 Claude 之上进行更多实验/比较/思考文章!请告诉我们您的反馈。

资源

您可以在此处查看我们的完整 Colab 笔记本