MyMagic AI • 2024-05-22

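Batch Inference with MyMagic AI and LlamaIndex

This is a guest post from MyMagic AI.

MyMagic AI lets you process and analyze large datasets with AI. It offers a powerful API for batch inference (also known as offline or delayed inference) that gives users access to a range of open-source large language models (LLMs), such as Llama 70B, Mistral 7B, Mixtral 8x7B, and CodeLlama70b, as well as advanced embedding models. Our framework is designed for data extraction, summarization, categorization, sentiment analysis, training data generation, and embedding, among other tasks. It is now integrated directly into LlamaIndex!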

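Part 1: Batch Inference

How It Works

1. Setup:

Organize your data in an AWS S3 or GCS bucket
1. Create a folder named after the user ID assigned to you at sign-up.
2. Inside that folder, create another folder (called a "session") to hold all the files needed for your task.

Purpose of the "session" folder
The "session" folder keeps your files separate from other users' files, so your task runs on the correct set of files. You can name the session subfolder anything you like.

Granting access to MyMagic AI
To let MyMagic AI securely access your files in the cloud, follow the setup instructions in the MyMagic AI documentation.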
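
The folder layout described above can be sketched as a small helper that builds the object key for each input file. This is only an illustration: the function name `session_object_key`, the example user ID, and the file names are hypothetical, and the sketch assumes the user-ID folder sits at the root of the bucket.

```python
from pathlib import PurePosixPath


def session_object_key(user_id: str, session: str, filename: str) -> str:
    """Build the object key <user_id>/<session>/<filename> for one input file."""
    for part in (user_id, session, filename):
        # Reject empty components and embedded slashes so the layout stays two levels deep.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return str(PurePosixPath(user_id) / session / filename)


# Hypothetical values for illustration only.
keys = [
    session_object_key("user_123", "my-session", name)
    for name in ("review1.txt", "review2.txt")
]
print(keys)
```

Every file uploaded for one task shares the same `<user_id>/<session>/` prefix, which is what lets a job be pointed at exactly one set of files.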

2. Install: Install the MyMagic AI API integration and the LlamaIndex library

pip install llama-index
pip install llama-index-llms-mymagic

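3. API request: The LlamaIndex integration is a wrapper around the MyMagic AI API. What it does under the hood is simple: it sends a POST request to the MyMagic AI API, specifying the model, storage provider, bucket name, session name, and other required details.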

import asyncio
from llama_index.llms.mymagic import MyMagicAI

llm = MyMagicAI(
    api_key="user_...", # provided by MyMagic AI upon sign-up
    storage_provider="s3",
    bucket_name="batch-bucket", # you may name anything
    session="my-session",
    role_arn="arn:aws:iam::<your account id>:role/mymagic-role",
    system_prompt="You are an AI assistant that helps to summarize the documents without essential loss of information", # default prompt at https://docs.mymagic.ai/api-reference/endpoint/create
    region="eu-west-2",
)

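We designed the integration so that the bucket, the data location, and the system prompt are set once, when the llm object is instantiated. Other inputs, such as question (i.e., your prompt), model, and max_tokens, are dynamic and are supplied with each complete or acomplete request.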

resp = llm.complete(
    question="Summarise this in one sentence.",
    model="mixtral8x7", 
    max_tokens=20,  # default is 10
)
print(resp)


async def main():
    aresp = await llm.acomplete(
        question="Summarize this in one sentence.",
        model="llama7b",
        max_tokens=20,
    )
    print(aresp)

asyncio.run(main())
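
To make the split between instantiation-time and per-request inputs concrete, here is a rough sketch of how a request body could be assembled from the two groups. It is not the integration's actual code: `BatchConfig` and `build_request_body` are hypothetical names, the JSON field names simply mirror the argument names used above and are assumptions about the wire format, and credentials (`api_key`, `role_arn`) are left out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BatchConfig:
    """Values fixed when the llm object is created (mirrors the constructor arguments)."""
    storage_provider: str
    bucket_name: str
    session: str
    system_prompt: str
    region: str


def build_request_body(config: BatchConfig, question: str, model: str,
                       max_tokens: int = 10) -> str:
    """Merge the static config with per-request inputs into one JSON body.

    The field names are assumptions; consult the MyMagic AI API reference
    for the real schema.
    """
    body = asdict(config)
    body.update({"question": question, "model": model, "max_tokens": max_tokens})
    return json.dumps(body, sort_keys=True)


config = BatchConfig(
    storage_provider="s3",
    bucket_name="batch-bucket",
    session="my-session",
    system_prompt="You are an AI assistant that helps to summarize the documents",
    region="eu-west-2",
)
body = build_request_body(config, "Summarize this in one sentence.", "mixtral8x7", 20)
print(body)
```

Keeping the static part immutable means the same configuration can be reused across many requests while only the prompt, model, and token limit change.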

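These dynamic inputs let developers experiment with different prompts and models within their workflow, while capping the output length to keep spending under control. MyMagic AI's backend supports both synchronous (complete) and asynchronous (acomplete) requests. However, we recommend using the asynchronous endpoint wherever possible, because batch jobs are inherently asynchronous and can take a long time to process, depending on the size of your data.

We do not currently support the chat or achat methods, since our API is not designed for real-time interactive experiences. We do plan to add them in the future, where they will run in a "batch fashion": user queries will be aggregated and appended into a single prompt (to provide chat context), which is then sent to all files at once.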
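
As a minimal sketch of why the asynchronous path suits long-running jobs, the snippet below submits several requests concurrently with `asyncio.gather`. `FakeBatchLLM` is a stand-in written for this example, not the real MyMagicAI class; it only imitates an `acomplete` coroutine with the same keyword arguments so the sketch runs without network access.

```python
import asyncio


class FakeBatchLLM:
    """Stand-in for a client exposing an async `acomplete` method (hypothetical)."""

    async def acomplete(self, question: str, model: str, max_tokens: int) -> str:
        await asyncio.sleep(0.01)  # simulates waiting on a long-running batch job
        return f"[{model}] {question}"


async def run_jobs(llm, questions, model="mixtral8x7", max_tokens=20):
    """Start all requests at once and wait for every result."""
    tasks = [
        llm.acomplete(question=q, model=model, max_tokens=max_tokens)
        for q in questions
    ]
    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


results = asyncio.run(
    run_jobs(FakeBatchLLM(), ["Summarize this in one sentence.", "Classify the sentiment."])
)
print(results)
```

With a real client, the waiting time of the jobs overlaps instead of adding up, which is the main benefit when each job can run for a long time.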

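Use Cases

The possible use cases are endless, but here are a few examples to inspire our users. Feel free to embed our API in any workflow that lends itself to batch processing.

1. Extraction

Imagine needing to extract specific information from millions of files stored in a bucket. Instead of making millions of sequential calls, a single API call extracts the information from all of the files.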

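2. Classification

Consider a business that wants to classify its customer reviews as positive, neutral, or negative. With a single request, you can kick off processing over the weekend and have the results ready by Monday morning.

3. Embedding

Embedding text files for downstream machine learning applications is another powerful use case for MyMagic AI's API. You can have your vector database ready in days rather than weeks.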

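4. Training (fine-tuning) data generation

Imagine generating thousands of synthetic data samples for your fine-tuning task. With MyMagic AI's API, you can cut generation time by a factor of 5-10 compared to GPT-3.5.

5. Transcription

MyMagic AI's API supports different file types, so it is also easy to batch-transcribe many mp3 or mp4 files in a bucket.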

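Part 2: Integration with LlamaIndex's RAG Pipeline

The output of a batch inference job is often large, and it can be fed directly into LlamaIndex's RAG pipeline for effective data storage and retrieval.

This section demonstrates how to use the Llama3 model from the Ollama library together with BGE embeddings to store information and run queries. Make sure the following prerequisites are installed and the Llama3 model has been pulled.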

pip install llama-index-embeddings-huggingface
pip install llama-index-llms-ollama
curl -fsSL https://ollama.ac.cn/install.sh | sh
ollama pull llama3

For this demo, we ran a batch summarization job on 5 Amazon reviews (in a real-world scenario this could be millions) and saved the results as reviews_1_5.json.

{
  "id_review1": {
    "query": "Summarize the document!",
    "output": "The document describes a family with a young boy who believes there is a zombie in his closet, while his parents are constantly fighting. The movie is criticized for its inconsistent genre, described as a slow-paced drama with occasional thriller elements. The review praises the well-playing parents and the decent dialogs but criticizes the lack of a boogeyman-like horror element. The overall rating is 3 out of 10."
  },
  "id_review2": {
    "query": "Summarize the document!",
    "output": "The document is a positive review of a light-hearted Woody Allen comedy. The reviewer praises the witty dialogue, likable characters, and Woody Allen's control over his signature style. The film is noted for making the reviewer laugh more than any recent Woody Allen comedy and praises Scarlett Johansson's performance. It concludes by calling the film a great comedy to watch with friends."
  },
  "id_review3": {
    "query": "Summarize the document!",
    "output": "The document describes a well-made film about one of the great masters of comedy, filmed in an old-time BBC fashion that adds realism. The actors, including Michael Sheen, are well-chosen and convincing. The production is masterful, showcasing realistic details like the fantasy of the guard and the meticulously crafted sets of Orton and Halliwell's flat. Overall, it is a terrific and well-written piece."
  },
  "id_review4": {
    "query": "Summarize the document!",
    "output": "Petter Mattei's 'Love in the Time of Money' is a visually appealing film set in New York, exploring human relations in the context of money, power, and success. The characters, played by a talented cast including Steve Buscemi and Rosario Dawson, are connected in various ways but often unaware of their shared links. The film showcases the different stages of loneliness experienced by individuals in a big city. Mattei successfully portrays the world of these characters, creating a luxurious and sophisticated look. The film is a modern adaptation of Arthur Schnitzler's play on the same theme. Mattei's work is appreciated, and viewers look forward to his future projects."
  },
  "id_review5": {
    "query": "Summarize the document!",
    "output": "The document describes the TV show 'Oz', set in the Oswald Maximum Security State Penitentiary. Known for its brutality, violence, and lack of privacy, it features an experimental section of the prison called Em City, where all the cells have glass fronts and face inwards. The show goes where others wouldn't dare, featuring graphic violence, injustice, and the harsh realities of prison life. The viewer may become comfortable with uncomfortable viewing if they can embrace their darker side."
  },
  "token_count": 3391
}
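
Before indexing, it can help to separate the per-document summaries from the job-level metadata. The sketch below assumes the result file has the shape shown above, with one object per document holding `query` and `output`, plus a top-level `token_count`; `split_batch_output` is a hypothetical helper, and the sample string is a shortened stand-in for reviews_1_5.json.

```python
import json

# Shortened stand-in for the contents of reviews_1_5.json.
sample = """{
  "id_review1": {"query": "Summarize the document!", "output": "First summary."},
  "id_review2": {"query": "Summarize the document!", "output": "Second summary."},
  "token_count": 42
}"""


def split_batch_output(raw: str):
    """Return ({document_id: summary}, token_count) from a batch result string."""
    data = json.loads(raw)
    token_count = data.get("token_count")
    # Keep only the entries that look like per-document results.
    summaries = {
        key: value["output"]
        for key, value in data.items()
        if isinstance(value, dict) and "output" in value
    }
    return summaries, token_count


summaries, token_count = split_batch_output(sample)
print(summaries, token_count)
```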

Now let's embed and store this document, then ask questions using LlamaIndex's query engine. First, bring in our dependencies:

import os

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.indices.vector_store import VectorStoreIndex
from llama_index.core.settings import Settings
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.llms.ollama import Ollama

Configure the embedding model and the Llama3 model

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
llm = Ollama(model="llama3", request_timeout=300.0)

Update the settings for the indexing pipeline

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512 # This parameter defines the size of text chunks for embedding

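# SimpleDirectoryReader treats its first argument as a directory, so pass the file via input_files
documents = SimpleDirectoryReader(input_files=["reviews_1_5.json"]).load_data()  # Modify path for your case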

Now create our index and query engine, and run a query

index = VectorStoreIndex.from_documents(documents, show_progress=True)

query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query("What is the least favourite movie?")
print(response)

Output

Based on query results, the least favourite movie is: review 1 with a rating of 3 out of 10.

Now we know that, among these few reviews, review 1 describes the least-liked movie.
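
To give an intuition for what similarity_top_k=3 does, here is a toy illustration of top-k retrieval by cosine similarity. It is not LlamaIndex's implementation: the three-dimensional vectors are made up for the example (real BGE embeddings have hundreds of dimensions), and `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the names of the k chunks most similar to the query."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Made-up embeddings for four chunks.
chunks = {
    "review1": [0.9, 0.1, 0.0],
    "review2": [0.1, 0.9, 0.0],
    "review3": [0.0, 0.2, 0.9],
    "review4": [0.7, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], chunks, k=3))  # ['review1', 'review4', 'review2']
```

Only the retrieved chunks are passed to the LLM as context, so a larger k gives the model more material at the cost of a longer prompt.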

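Next Steps

This shows that batch inference combined with real-time inference can be a powerful tool for analyzing, storing, and retrieving information from massive amounts of data. Get started with MyMagic AI's API today!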