Harshad Suryawanshi • 2023-11-08

使用 PaLM、KOSMOS-2 和 LlamaIndex 构建我自己的 ChatGPT 视觉能力

在不断发展的 AI 领域，OpenAI 具备视觉能力的 ChatGPT 开启了新的篇章。对于开发者和创造者来说，这是一个令人兴奋的时代，因为我们正在探索视觉理解与对话式 AI 的融合。受此创新的启发，我着手构建了自己的多模态原型，这不仅仅是一个复制品，更是面向更先进和定制化视觉-语言应用的启动平台。

我们掌握的工具非同寻常。KOSMOS-2 是一个真正的强大引擎，能够从简单的像素中描绘出生动的叙事，让图像字幕生成看起来像魔法一样。然后是 Google PaLM API，它带来了真正理解并做出相关回应的对话深度。当然，还有 LlamaIndex —— 这个操作的核心大脑，它以如此精妙的方式编排这些元素，使得交互流畅自然，就像老朋友之间的对话一样。

功能概览

我好奇心和编程的成果是一个 Streamlit 应用——一个原型，它既是对 ChatGPT 视觉能力的致敬，也是一个替代方案。它具备以下功能：

实时图像交互：上传您的图片，立即开始关于它们的对话。
使用 KOSMOS-2 自动生成字幕：微软的 AI 模型为对话提供了描述性基础。
使用 PaLM 实现对话深度：谷歌的语言模型确保对话像图片本身一样丰富和细致。
用户友好界面：Streamlit 提供了直观简洁的用户界面，让任何人都能轻松导航和交互。

技术栈深度解析

该项目是各种技术的交响乐，每种技术都扮演着至关重要的角色

微软的 AI 模型 KOSMOS-2 通过 Replicate 为图像赋予叙事能力，使其焕发生机。
Google PaLM API 增加了语言智能层，使得关于图像的对话富有洞察力且引人入胜。
LlamaIndex 扮演着指挥家的角色，协调各模型协同工作。

揭示 `app.py`：应用的核心

app.py 脚本是应用的核心，我们将 KOSMOS-2 和 PaLM 与 Llamaindex 结合在一起，创造出无缝的多模态体验。让我们从头到尾详细介绍一下。

1. 初始设置

我们首先导入必要的库并设置我们的 Streamlit 页面。在这里，我们为图像处理和对话管理奠定基础。

import streamlit as st
import extra_streamlit_components as stx
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from io import BytesIO
import replicate
from llama_index.llms.palm import PaLM
from llama_index import ServiceContext, VectorStoreIndex, Document
from llama_index.memory import ChatMemoryBuffer
import os
import datetime

st.set_page_config(layout="wide")
st.write("My version of ChatGPT vision. You can upload an image and start chatting with the LLM about the image")

2. 用户界面

接下来，我们构建侧边栏和主区域，确保用户知道谁创建了这个应用，并可以访问其他项目，从而增强可信度和参与度。

# Sidebar
st.sidebar.markdown('## Created By')
st.sidebar.markdown("[Harshad Suryawanshi](https://www.linkedin.com/in/harshadsuryawanshi/)")
st.sidebar.markdown('## Other Projects')
# ...sidebar content continues

3. 图像上传和处理

上传图像后，应用不仅显示图像，还会调用 get_image_caption 函数生成相关的字幕。这个函数使用 @st.cache 装饰器进行缓存，通过 Replicate 使用 KOSMOS-2 模型为上传的图像提供简要描述。然后，该描述将作为与用户进行初始对话的基础。

@st.cache
def get_image_caption(image_data):
    input_data = {
        "image": image_data,
        "description_type": "Brief"
    }
    output = replicate.run(
        "lucataco/kosmos-2:3e7b211c29c092f4bcc8853922cc986baa52efe255876b80cac2c2fbb4aff805",
        input=input_data
    )
    # Split the output string on the newline character and take the first item
    text_description = output.split('\n\n')[0]
    return text_description

4. 使用 PaLM 和 Llamaindex 进行对话流程

获得图像字幕后，调用 create_chat_engine 函数来设置聊天引擎。此函数至关重要，因为它为对话建立了上下文，并初始化 PaLM API 进行交互。

@st.cache_resource
def create_chat_engine(img_desc, api_key):
    llm = PaLM(api_key=api_key)
    service_context = ServiceContext.from_defaults(llm=llm)
    doc = Document(text=img_desc)
    index = VectorStoreIndex.from_documents([doc], service_context=service_context)
    chatmemory = ChatMemoryBuffer.from_defaults(token_limit=1500)
    
    chat_engine = index.as_chat_engine(
        chat_mode="context",
        system_prompt=(
            f"You are a chatbot, able to have normal interactions, as well as talk. "
            "You always answer in great detail and are polite. Your responses always descriptive. "
            "Your job is to talk about an image the user has uploaded. Image description: {img_desc}."
        ),
        verbose=True,
        memory=chatmemory
    )
    return chat_engine

create_chat_engine 函数构建了我们应用对话能力的基础设施。它首先使用提供的 API 密钥实例化一个 PaLM 对象，设置服务上下文，并创建一个包含图像描述的文档。然后对该文档进行索引，以便为 Llamaindex 的上下文聊天引擎做好准备。最后，通过一个提示来配置聊天引擎，该提示指导 AI 如何参与对话，引用图像描述并定义聊天机器人的行为。

5. 用户交互和消息处理

该应用的演示版本通过将会话中的消息数量限制为 20 条，确保了引人入胜且受控的用户体验。如果达到此限制，它会友好地通知用户并禁用进一步的输入，以有效管理资源。

if message_count &gt;= 20:
    st.error("Notice: The maximum message limit for this demo version has been reached.")
    # Disabling the uploader and input by not displaying them
    image_uploader_placeholder = st.empty()  # Placeholder for the uploader
    chat_input_placeholder = st.empty()      # Placeholder for the chat input

然而，当消息数量未达到限制时，应用会提供清晰的聊天选项并处理图像上传过程。上传后，它立即处理图像以获取字幕，设置聊天引擎，并更新用户界面以反映成功上传。

else:
    # Add a clear chat button
    if st.button("Clear Chat"):
        clear_chat()

    # Image upload section
    image_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"], key="uploaded_image", on_change=on_image_upload)
    # ...code for image upload and display

对于用户的每一次输入，消息都会添加到聊天记录中，并向聊天引擎查询响应。该应用确保每条消息——无论是来自用户还是助手——都显示在聊天界面中，保持流畅的对话流程。

# ...code for handling user input and displaying chat history

# Call the chat engine to get the response if an image has been uploaded
if image_file and user_input:
    try:
        with st.spinner('Waiting for the chat engine to respond...'):
            # Get the response from your chat engine
            response = chat_engine.chat(user_input)
        # ...code for appending and displaying the assistant's response
    except Exception as e:
        st.error(f'An error occurred.')
        # ...exception handling code

总结

此应用是基础，是更复杂视觉-语言应用的跳板。潜力无限，您的见解可以塑造它的未来。我邀请您深入了解演示，修改代码，并与我一起突破 AI 能力的界限。

GitHub 仓库链接

在 LinkedIn 上与我联系

LinkedIn 帖子:

LlamaIndex 新闻通讯 2024–02–27
2024-02-27
弥合危机咨询差距：推出 Counselor Copilot
2024-02-24
介绍 LlamaCloud 和 LlamaParse
2024-02-20
LlamaIndex 新闻通讯 2024–02–20：介绍 LlamaCloud
2024-02-20

使用 PaLM、KOSMOS-2 和 LlamaIndex 构建我自己的 ChatGPT 视觉能力

功能概览

技术栈深度解析

揭示 app.py：应用的核心

总结

相关文章

揭示 `app.py`：应用的核心