Harshad Suryawanshi • 2023-11-27

多模态 RAG：使用 ResNet、Cohere 和 Llamaindex 构建受 Pokémon Go 启发的应用“AInimal Go！”

在当前 GPT-4 Vision (GPT-4V) 用例随处可见的背景下，我想探索一种替代方法：将深度学习视觉模型与大型语言模型 (LLM) 配对。我的最新项目“AInimal Go！”正是一种尝试，旨在展示像 ResNet18 这样的专业视觉模型如何能够与 LLM 无缝集成，使用 LlamaIndex 作为编排层，并以维基百科文章作为知识库。

项目概述

“AInimal Go！”是一个交互式应用，允许用户捕捉或上传动物图像。上传图像后，ResNet18 模型会迅速对动物进行分类。随后，由 LlamaIndex 巧妙编排的 Cohere LLM API 接手。它会扮演识别出的动物角色，使用户能够就该动物与其进行独特的对话。对话内容由近 200 篇维基百科文章组成的知识库提供信息和丰富，为用户查询提供准确相关的回应。

为什么不使用 GPT-4V？

在 GPT-4 Vision 用例激增的同时，我想探索一种高效但功能强大的替代方案。选择合适的工具来完成任务非常重要——对每种多模态任务都使用 GPT-4V 可能会大材小用，就像用大锤砸坚果一样。我的方法是利用 ResNet18 的灵活性和精度来进行动物识别。这种方法不仅能削减成本，还凸显了专业模型在多模态领域的适应性。

工具与技术

用于动物检测的 ResNet： 一个速度极快的实现，利用 ImageNet 分类方案识别图像中的动物。
Cohere LLM： 根据识别出的动物生成引人入胜、信息丰富的对话。
LlamaIndex： 无缝编排工作流程，管理从预先索引的关于动物的维基百科文章中检索信息。
Streamlit 用于 UI

深入了解 `app.py`

“AInimal Go！”的核心在于 app.py 脚本，ResNet、Cohere LLM 和 LlamaIndex 在其中无缝地结合在一起。现在，让我们深入了解代码的关键方面

1. 图像捕捉/上传

在“AInimal Go！”中，流程始于用户上传或使用其设备相机捕捉图像。这是至关重要的一步，它为后续与识别出的动物的互动奠定了基础。

以下代码片段展示了如何使用 Streamlit 创建用于图像上传和捕捉的 UI。它提供了两种选择：用于选择图像文件的文件上传器和用于实时捕捉的相机输入。无论通过哪种方法提供图像，都会将其转换为字节流（BytesIO）进行处理。这种简化确保了无缝的用户体验，无论图像是从图库上传还是现场捕捉。

# Image upload section.
    image_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"], key="uploaded_image", on_change=on_image_upload)
    
    col1, col2, col3 = st.columns([1, 2, 1])
    with col2:  # Camera input will be in the middle column
        camera_image = st.camera_input("Take a picture", on_change=on_image_upload)
        
    
    # Determine the source of the image (upload or camera)
    if image_file is not None:
        image_data = BytesIO(image_file.getvalue())
    elif camera_image is not None:
        image_data = BytesIO(camera_image.getvalue())
    else:
        image_data = None
    
    if image_data:
        # Display the uploaded image at a standard width.
        st.session_state['assistant_avatar'] = image_data
        st.image(image_data, caption='Uploaded Image.', width=200)

2. 初始化 ResNet 进行图像分类

用户上传或捕捉图像后，下一个关键步骤是识别图像中的动物。这就是 ResNet18——一个强大的深度学习图像分类模型——发挥作用的地方。

函数 load_model_and_labels 执行两个关键任务

加载动物标签： 它首先加载 ImageNet 标签中特定于动物的子集。这些标签存储在一个字典中，将类别 ID 映射到其对应的动物名称。这种映射对于解释 ResNet 模型的输出至关重要。
初始化 ResNet18： 然后，该函数初始化特征提取器和 ResNet18 模型。特征提取器将图像预处理为 ResNet18 所需的格式，而模型本身负责实际的分类任务。

def load_model_and_labels():
    # Load animal labels as a dictionary
    animal_labels_dict = {}
    with open('imagenet_animal_labels_subset.txt', 'r') as file:
        for line in file:
            parts = line.strip().split(':')
            class_id = int(parts[0].strip())
            label_name = parts[1].strip().strip("'")
            animal_labels_dict[class_id] = label_name

    # Initialize feature extractor and model
    feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-18")
    model = ResNetForImageClassification.from_pretrained("microsoft/resnet-18")

    return feature_extractor, model, animal_labels_dict

feature_extractor, model, animal_labels_dict = load_model_and_labels()

通过以这种方式集成 ResNet18，“AInimal Go！”利用其速度和准确性完成了识别用户图像中动物的关键任务。这为随后进行的引人入胜且信息丰富的对话奠定了基础。

3. 使用 ResNet18 进行动物检测

初始化 ResNet18 后，下一步是使用它来检测上传图像中的动物。函数 get_image_caption 处理此任务。

图像预处理： 首先打开上传的图像，然后使用先前初始化的特征提取器进行预处理。此预处理将图像调整为 ResNet18 所需的格式。
动物检测： 预处理后的图像随后馈入 ResNet18，由其预测图像的类别。Logits（模型的原始输出）经过处理，找到概率最高的类别，该类别对应于预测的动物。
检索动物名称： 使用先前创建的标签字典将预测的类别 ID 映射到相应的动物名称。然后将此名称显示给用户。

def get_image_caption(image_data):
    image = Image.open(image_data)
    inputs = feature_extractor(images=image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_label_id = logits.argmax(-1).item()
    predicted_label_name = model.config.id2label[predicted_label_id]
    st.write(predicted_label_name)
    # Return the predicted animal name
    return predicted_label_name, predicted_label_id

4. 验证图像中是否存在动物

为了确保“AInimal Go！”中的对话相关且引人入胜，验证上传的图像确实描绘了动物至关重要。此验证由 is_animal 函数处理。

def is_animal(predicted_label_id):
    # Check if the predicted label ID is within the animal classes range
    return 0 &lt;= predicted_label_id &lt;= 398

该函数检查 ResNet18 的预测标签 ID 是否落在动物类别范围（ImageNet 分类中的 0 到 398）内。这个简单而有效的检查对于保持应用专注于动物互动至关重要。

在脚本的后面，此函数用于验证检测到的对象

if not (is_animal(label_id)):
    st.error("Please upload image of an animal!")
    st.stop()

如果上传的图像未描绘动物，应用会提示用户上传适当的图像，确保对话保持在正轨上。

5. 初始化 LLM

init_llm 函数初始化 Cohere LLM 以及存储和服务所需的上下文（指定 llm 和 embed_model）。它还加载了约 200 篇动物主题的预索引维基百科文章。该函数设置了 LLM 运行的环境，为生成响应做准备。

def init_llm(api_key):
    llm = Cohere(model="command", api_key=st.secrets['COHERE_API_TOKEN'])

    service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
    storage_context = StorageContext.from_defaults(persist_dir="storage")
    index = load_index_from_storage(storage_context, index_id="index", service_context=service_context)
    
    return llm, service_context, storage_context, index

此函数对于设置 LLM 至关重要，确保所有必要组件就位以实现聊天功能。

6. 创建聊天引擎

create_chat_engine 函数接收动物描述并利用它创建查询引擎。此引擎负责处理用户查询并根据识别出的动物生成响应。

def create_chat_engine(img_desc, api_key):
    doc = Document(text=img_desc)
    
    query_engine = CitationQueryEngine.from_args(
        index,
        similarity_top_k=3,
        citation_chunk_size=512,
        verbose=True
    )
    
    return query_engine

system_prompt=f"""
              You are a chatbot, able to have normal interactions. Do not make up information.
              You always answer in great detail and are polite. Your job is to roleplay as an {img_desc}. 
              Remember to make {img_desc} sounds while talking but dont overdo it.
              """
                    
response = chat_engine.query(f"{system_prompt}. {user_input}")

通过创建特定于识别出的动物的查询引擎，此函数确保应用中的对话相关、信息丰富且引人入胜。我使用了 CitationQueryEngine，以便将来能够展示来源，从而使对话不仅引人入胜，而且通过可信的参考资料信息丰富。

7. 将所有功能整合

将所有技术组件就位后，“AInimal Go！”将所有内容整合到一个用户友好的聊天界面中。在这里，用户可以直接与 AI 互动，提问并接收关于识别出的动物的回复。这个由 Streamlit 精心管理的最终互动循环，完美地展示了视觉和语言模型在应用中的无缝集成。

总结

“AInimal Go！”代表了视觉模型、语言模型和维基百科的激动人心的融合，LlamaIndex 作为编排器，将 ResNet 用于动物识别，Cohere 的 LLM 用于引人入胜的对话，实现了无缝集成。这款应用是迈向更具创新性的视觉-语言应用的垫脚石。可能性无限，您的见解可以塑造其未来。我鼓励您探索演示，尝试代码，并与我一起推动 AI 在多模态互动领域所能达到的边界。