Wangda Tan • 2023-12-06

LlamaIndex + Waii: Combining Structured Data from Your Database with PDFs for Enhanced Data Analysis

Introduction

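In many enterprises, data lives primarily in databases. When trying to generate actionable insights, it is often hard to combine database data with other forms of data, such as PDFs.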

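We envision an agent that lets anyone draw on all of these data sources to make informed decisions. Imagine an agent that creates documents by merging data from different sources, such as JIRA and your database, and further enriches them with the latest information from the internet.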

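At waii.ai, we are committed to delivering an enterprise text-to-SQL API with the most complete and accurate translation of natural language to SQL. Waii lets companies build text-to-SQL directly into their products and enables no-code analytics for their internal data and business teams. Waii works out of the box and can be self-hosted or deployed on-premises.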

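LlamaIndex introduces a remarkable RAG framework that connects various customer data sources, such as PDFs, Notion, and internal knowledge bases, with large language models (LLMs). This advancement simplifies the creation of data-augmented chatbots and analysis agents.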

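This opens up a great opportunity to build an enterprise agent that can access multiple data sources, including your preferred database. We explore this further in the rest of this blog.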

Why a New Text-to-SQL LlamaIndex Plugin?

For a LlamaIndex agent to use a text-to-SQL API, a plugin is essential. LlamaIndex already has a built-in text-to-SQL plugin, so why did we decide to create a new LlamaHub plugin?

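The existing text-to-SQL plugin in LlamaIndex works well for simple databases (fewer than 10 tables and 100 columns) with simple SQL queries. Managing medium to large databases, which can contain hundreds of tables and thousands of columns, is a harder problem. LLM context windows are limited, and even models with large windows, such as GPT-4-turbo with 128K tokens, can suffer from inaccurate and degraded retrieval when overloaded with content. This issue is discussed in a LlamaIndex study.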
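
A rough back-of-the-envelope estimate illustrates why schema size matters. The sketch below is only an illustration: the per-table and per-column token costs are assumed values rather than measurements, and the table and column counts are made-up examples of a "small" and a "large" database.

```python
# Illustrative estimate only: the token costs below are assumptions,
# not measured values for any particular model or tokenizer.
TOKENS_PER_TABLE = 20    # assumed: table name plus a short description
TOKENS_PER_COLUMN = 12   # assumed: column name, type, and a short comment
CONTEXT_WINDOW = 128_000  # the GPT-4-turbo window size mentioned above


def estimate_schema_tokens(num_tables: int, num_columns: int) -> int:
    """Estimate the prompt tokens needed to describe an entire schema."""
    return num_tables * TOKENS_PER_TABLE + num_columns * TOKENS_PER_COLUMN


small = estimate_schema_tokens(num_tables=10, num_columns=100)
large = estimate_schema_tokens(num_tables=500, num_columns=10_000)

print(small, large, large > CONTEXT_WINDOW)
```

Under these assumptions the small database fits comfortably, while the large one would not fit in the window at all, before counting the question, instructions, or query history.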

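By contrast, Waii focuses on making query generation more efficient. We developed a built-in compiler that handles compilation errors in LLM output and supports multiple dialects. Our internal knowledge graph, built from database metadata, constraints, and query history, helps with table and schema selection. Users can also apply semantic rules to schemas, tables, and columns, or integrate with their data catalog services, so that generated queries are semantically correct as well as syntactically correct.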

To use our service, users simply connect their database to Waii and copy a Waii API key to create a LlamaIndex agent.

LlamaIndex + Waii Agent

We are excited to showcase the integration of Waii with LlamaIndex to create an agent that can perform various text-to-SQL tasks and validate the data against a PDF.

We will analyze the categories customers buy most during Christmas and compare them with Deloitte's holiday retail survey report.

LlamaIndex + Waii Architecture

Before diving into the code example, let's look at the architecture.

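The LlamaIndex agent runs on the client side and comes with a set of tools. Each tool provides function specifications, and a function is selected based on the context and the user's input to chat("…"). For example, if the question indicates that information needs to be retrieved from the "internet", the Google search tool is chosen. Internally, the agent uses an LLM, which returns the selected function and its arguments for the given context.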
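
The selection step can be pictured with a small offline sketch. This is not how LlamaIndex implements it: a real agent sends the tool specifications and the chat context to an LLM, whereas the hypothetical `select_tool` below uses plain keyword counting as a stand-in so the example runs without any API key. The `ToolSpec` class, its `keywords` field, and the tool bodies are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolSpec:
    """Hypothetical, simplified stand-in for a tool's function specification."""
    name: str
    description: str
    keywords: List[str]        # assumption: replaces the LLM's reasoning
    run: Callable[[str], str]  # the function the agent would call


def select_tool(tools: List[ToolSpec], user_input: str) -> ToolSpec:
    """Pick the tool whose keywords best match the user's input.

    A real agent asks an LLM to choose the function and its arguments;
    keyword counting here only imitates that decision.
    """
    text = user_input.lower()
    return max(tools, key=lambda t: sum(k in text for k in t.keywords))


tools = [
    ToolSpec("google_search", "Search the internet",
             ["internet", "web", "latest"], lambda q: f"search({q!r})"),
    ToolSpec("get_answer", "Answer a question from the database",
             ["database", "table", "sold"], lambda q: f"sql_for({q!r})"),
]

chosen = select_tool(tools, "Find the latest retail news on the internet")
print(chosen.name)
```

The point of the sketch is the shape of the decision: the agent sees only names, descriptions, and arguments, and routes each request to one function.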

When the Waii tool is selected, whether to describe a dataset, generate a query, or run a query, it sends an API request to the Waii service.

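The Waii service can be deployed as hosted SaaS or as a Docker container running in your on-premises environment. The components of the Waii service include:

Query Generator: coordinates the entire query generation workflow and communicates with the LLM to do so.
Knowledge Graph / Metadata Management: connects to the database and extracts metadata and query history into a knowledge graph that helps the Query Generator choose the right tables and schemas.
Semantic Rules: help the Query Generator produce semantically correct queries.
Waii Compiler: after the LLM generates a query, the Waii Compiler patches the issues it identifies in the query. If a compilation issue cannot be fixed, it regenerates the query with a clear error message.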
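
The regenerate-on-error part of the compiler step can be sketched as a simple loop. This is an illustration under stated assumptions, not Waii's implementation: it skips the patching stage entirely, uses SQLite's `EXPLAIN` as a stand-in compiler, and replaces the LLM with a hypothetical `fake_llm` function that fixes its typo once it receives an error message.

```python
import sqlite3
from typing import Callable, Optional


def compile_error(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if SQLite can plan the query, otherwise the error text."""
    try:
        conn.execute(f"EXPLAIN {sql}")
        return None
    except sqlite3.Error as exc:
        return str(exc)


def generate_with_retries(
    ask: str,
    generate: Callable[[str, Optional[str]], str],
    conn: sqlite3.Connection,
    max_attempts: int = 3,
) -> str:
    """Generate a query, feeding compile errors back until it compiles."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate(ask, feedback)
        feedback = compile_error(conn, sql)
        if feedback is None:
            return sql
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {feedback}")


def fake_llm(ask: str, feedback: Optional[str]) -> str:
    """Hypothetical LLM stand-in: the first draft has a typo, the retry fixes it."""
    return "SELECT COUNT(*) FROM store" if feedback else "SELEC COUNT(*) FROM store"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (s_store_sk INTEGER)")
sql = generate_with_retries("How many stores are there?", fake_llm, conn)
print(sql)
```

Passing the error text back into the generator is the key design choice: the second attempt is informed by what went wrong rather than being a blind retry.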

Creating a LlamaIndex Agent with Waii + PDF Loader

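First, we create two LlamaHub tools: Waii and PDF Loader. A LlamaHub tool carries a specification that identifies the available functions and their arguments. The agent selects and executes a function based on the available functions and the context.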

Let's start by creating an agent that includes the Waii tool:

from llama_hub.tools.google_search import GoogleSearchToolSpec
from llama_hub.tools.waii import WaiiToolSpec
from llama_index.agent import OpenAIAgent
from llama_index.llms import OpenAI

waii_tool = WaiiToolSpec(
    api_key='waii_api_key',
    # Connection key of WAII connected database, see 
    # https://github.com/waii-ai/waii-sdk-py#get-connections
    database_key='database_to_use',
    verbose=True
)

Then create a PDF tool:

from pathlib import Path
from llama_index import VectorStoreIndex, download_loader
from llama_index.tools import QueryEngineTool, ToolMetadata

# Load the Deloitte survey PDF and index it for retrieval
PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path('DI_2023-Deloitte-holiday-retail-survey.pdf'))
index = VectorStoreIndex.from_documents(documents)
engine = index.as_query_engine(similarity_top_k=5)

# Wrap the query engine as a tool the agent can call
deloitte_retail_survey_tool = QueryEngineTool(
    query_engine=engine,
    metadata=ToolMetadata(
        name="deloitte_retail_survey",
        description=(
            "Provides retail survey report for holiday sales based on Deloitte's data. "
            "Use a detailed plain text question as input to the tool, and output using plain text based on pdf data"
        ),
    ),
)

Finally, create an agent that combines the Waii and PDF tools:

agent = OpenAIAgent.from_tools(
  [deloitte_retail_survey_tool] + waii_tool.to_tool_list(), 
  llm=OpenAI(model='gpt-4-1106-preview', temperature=0), 
  verbose=True)

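[deloitte_retail_survey_tool] + waii_tool.to_tool_list() means the agent can use all the functions provided by Waii and the PDF search tool (for example, getting an answer from the database, generating a query, executing a query, and describing datasets).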

Understanding Your Dataset

The first step in data analysis is to understand your dataset better.

You can start by asking your agent a question:

agent.chat("Describe my database")

The output is:

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: describe_dataset with args: 
  {"ask":"Can you describe the whole database?"}
...

========================

STARTING TURN 2
---------------

The database I have access to consists of several schemas, each with its 
own set of tables and domains:

1. **CRUNCHBASE_2016**: This schema includes tables related to acquisitions, 
   companies, investments, and funding rounds in the Crunchbase database for
   the year 2016. It's useful for analyzing acquisition trends, company 
   information, investment activity, and funding trends.

2. **RETAIL_DATA**: This schema contains tables related to retail operations, 
   such as call centers, customers, addresses, demographics, inventory, 
   items, promotions, stores, returns, sales, and warehouses. It can be used 
   to analyze call center performance, customer demographics, inventory 
   management, sales performance, and other retail operations.

3. ...

As you can see, the agent understood the request, called the describe_dataset function provided by Waii, and generated a summary of the dataset.

I am interested in the RETAIL_DATA schema, so let me ask more about it:

agent.chat("What can i do with the retail_data schema")

And I get this:

The RETAIL_DATA schema in the TWEAKIT_PLAYGROUND database is designed 
to support a wide range of analyses related to retail operations. Here 
are some of the capabilities and types of analyses you can perform with 
this schema:

1. **Call Center Analysis**: Evaluate the performance of call centers, 
   understand call volumes, and assess customer service efficiency.

2. **Customer Demographics**: Analyze customer profiles, including 
   demographics, purchasing behaviors, and preferences. This can help 
   in targeted marketing and customer segmentation.

Specific questions that can be addressed using the RETAIL_DATA schema include:
- What is the total number of call centers?
- How many customers have a preferred customer flag?
- What is the average price of items?

Let me do some more data analysis.

Generating and Running a SQL Query

Let's generate a SQL query that asks for the top 10 item categories sold during Christmas:

agent.chat("Top 10 item category sold during christmas time across all years")

Now it calls the get_answer function of the Waii tool:

=== Calling Function ===
Calling function: get_answer with args: 
  {"ask":"What are the top 10 item categories sold during 
          Christmas time across all years?"}

and gets a SQL query like the following:

WITH christmas_sales AS (
    SELECT
        ss_item_sk,
        SUM(ss_quantity) AS total_quantity
    FROM tweakit_playground.retail_data.store_sales AS ss
    INNER JOIN tweakit_playground.retail_data.date_dim AS dd
        ON ss_sold_date_sk = d_date_sk
    WHERE
        d_holiday = 'Y' AND d_date ILIKE '%-12-25'
    GROUP BY
        ss_item_sk
),
ranked_categories AS (
    SELECT
        i_category,
        SUM(total_quantity) AS category_quantity
    FROM christmas_sales AS cs
    INNER JOIN tweakit_playground.retail_data.item AS i
        ON ss_item_sk = i_item_sk
    GROUP BY
        i_category
    ORDER BY
        category_quantity DESC
    LIMIT 10
)
SELECT
    i_category AS category,
    category_quantity
FROM ranked_categories

Let me ask it to run the query:

agent.chat("Run it")

I get a DataFrame and its summary:

The top 10 item categories sold during Christmas time across all years, 
based on the total quantity of items, are:

1. Women: 1,487,891
2. Sports: 1,486,644
3. Children: 1,483,531
4. Electronics: 1,478,463
5. Music: 1,476,045
6. Books: 1,472,911
7. Home: 1,471,348
8. Jewelry: 1,459,025
9. Shoes: 1,456,296
10. Men: 1,451,285
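
The shape of the generated query (filter to December 25, total the quantity per item, then roll up by category) can be sanity-checked on a toy dataset. The sketch below is an assumption-laden simplification rather than the query Waii produced: it runs on an in-memory SQLite database, so the schema qualifiers are dropped, ILIKE is replaced with LIKE because SQLite has no ILIKE, and the handful of rows are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Tiny made-up tables that reuse the column names from the generated query.
conn.executescript("""
CREATE TABLE date_dim (d_date_sk INTEGER, d_date TEXT, d_holiday TEXT);
CREATE TABLE item (i_item_sk INTEGER, i_category TEXT);
CREATE TABLE store_sales (ss_sold_date_sk INTEGER, ss_item_sk INTEGER, ss_quantity INTEGER);

INSERT INTO date_dim VALUES
    (1, '2001-12-25', 'Y'), (2, '2002-12-25', 'Y'), (3, '2002-07-04', 'Y');
INSERT INTO item VALUES (10, 'Books'), (11, 'Music'), (12, 'Books');
INSERT INTO store_sales VALUES (1, 10, 5), (2, 11, 7), (2, 12, 4), (3, 11, 100);
""")

rows = conn.execute("""
WITH christmas_sales AS (
    SELECT ss_item_sk, SUM(ss_quantity) AS total_quantity
    FROM store_sales
    INNER JOIN date_dim ON ss_sold_date_sk = d_date_sk
    WHERE d_holiday = 'Y' AND d_date LIKE '%-12-25'
    GROUP BY ss_item_sk
)
SELECT i_category, SUM(total_quantity) AS category_quantity
FROM christmas_sales
INNER JOIN item ON ss_item_sk = i_item_sk
GROUP BY i_category
ORDER BY category_quantity DESC
LIMIT 10
""").fetchall()

print(rows)  # [('Books', 9), ('Music', 7)]
```

The July 4 sale of 100 Music items is a holiday row but not a Christmas one, so it is excluded, which confirms that the date filter does the work the question asked for.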

Using It with the PDF Report

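Let's try to cross-check the result with Deloitte's holiday retail report (in PDF format): are Deloitte's top gift-purchase categories the same as what we see in the database?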

One page in the PDF is of interest to us, so let's see if we can retrieve it:

agent.chat("""Give me top holiday gift categories based on Deloitte's 
holiday retail survey, and how it is different from the top 
10 item category sold during christmas time across all years 
from my database""")

The output looks like the following, showing that the agent uses the deloitte_retail_survey tool to get insights from the PDF.

STARTING TURN 1
---------------
=== Calling Function ===
Calling function: deloitte_retail_survey with args: {"input": "What are the top holiday gift categories based on Deloitte's holiday retail survey?"}
...

It gives the following summary:

Based on Deloitte's holiday retail survey, the top holiday gift 
categories are:

1. Clothing & Accessories
2. Gift Cards & Other
3. Food & Beverage
...

From your database, the top 10 item categories sold during Christmas 
time across all years are:

1. Women
2. Sports
3. Children
...

Comparing the two lists, we can see some differences and similarities:

- "Clothing & Accessories" from Deloitte's survey could correspond to 
  "Women," "Men," and possibly "Children" from your database.
- "Electronics & Accessories" is a common category in both lists.
- "Gift Cards & Other" and "Food & Beverage" from Deloitte's survey do 
  not have a direct match in the top categories from your database.
...

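Bingo! Now we can compare the results from the database with the PDF. I am happy to see that the agent can correlate the two lists and tell me that "Gift Cards & Other" and "Food & Beverage" have no direct match among my store's top categories!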

You can find the code at the Colab notebook link.

Summary

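Integrating Waii's text-to-SQL API with LlamaIndex's RAG framework marks a significant step forward for enterprise data analysis. The combination lets companies merge and analyze data from a variety of sources, including databases, PDFs, and the internet. We demonstrated the agent's ability to generate SQL queries, understand complex datasets, and correlate findings with external reports. This not only simplifies data analysis but also opens new avenues for informed decision-making.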

To learn more about Waii, contact us here: https://www.waii.ai/#request-demo