Vishwas Gowda • 2023-10-17

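How I built FinSight, the winning app of the Streamlit LLM Hackathon, with LlamaIndex.

In this article, we take a deep dive into LLM app development and look closely at my journey of building FinSight (Financial Insights at Your Fingertips), the app that won the Streamlit LLM Hackathon. The article covers the whole process from ideation to execution, along with code snippets and snapshots.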

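Introduction

A use case for LLMs in finance

One compelling use case for LLMs in finance is applying them to company annual reports (Form 10-K). These reports are publicly available, and nearly every portfolio manager, financial analyst, and shareholder uses them regularly to make informed decisions.

However, reading, understanding, and evaluating these reports, especially across multiple companies, can be tedious and time-consuming. Using LLMs to extract insights and summaries from annual reports would therefore solve many problems and save valuable time.

When the Streamlit LLM Hackathon was announced, I thought it was the perfect time to explore this idea. That is how FinSight came into being.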

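How does FinSight work?

FinSight has two main features, Annual Report Analyzer and Financial Metric Review, but this blog post focuses on the former.

Annual Report Analyzer is a RAG (Retrieval-Augmented Generation) based feature, meaning the LLM generates insights based on the information in a knowledge base (in this case, a company's annual report). Here is how it works behind the scenes:

While this is a basic representation of the architecture, we will dig into the importance of each component and how it works.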
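
The retrieve-then-generate flow described above can be illustrated with a toy sketch. This is not FinSight's code: the word-overlap scoring stands in for embedding similarity, the report snippets are made up for illustration, and the names `retrieve` and `build_prompt` are hypothetical. The final prompt is what would be sent to the LLM.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, chunk):
    """Toy relevance score: occurrences in the chunk of each distinct query token."""
    chunk_tokens = Counter(tokenize(chunk))
    return sum(chunk_tokens[token] for token in set(tokenize(query)))


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks with the highest toy relevance score."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Place the retrieved context in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Made-up snippets standing in for chunks of an annual report
report_chunks = [
    "Revenue grew 12 percent year over year, driven by cloud services.",
    "The board approved a new share repurchase program.",
    "Key risks include currency fluctuations and supply chain delays.",
]

question = "What are the key risks?"
top_chunks = retrieve(question, report_chunks, top_k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In the real app, LlamaIndex handles chunking, embedding, and retrieval, so none of this needs to be written by hand.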

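Setup

If you want to refer to the app code: Repo

We will use LlamaIndex to build the knowledge base and to query it with an LLM (gpt-4 works best). LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models.

For the frontend, Streamlit is the most convenient tool for building and sharing web apps.

1. Clone the repository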

git clone https://github.com/vishwasg217/finsight.git
cd finsight

2. Set up a virtual environment

# For macOS and Linux:
python3 -m venv venv

# For Windows:
python -m venv venv

3. Activate the virtual environment

# For macOS and Linux:
source venv/bin/activate
# For Windows:
.\venv\Scripts\activate

4. Install the required dependencies

pip install -r requirements.txt

5. Set up environment variables

# create directory
mkdir .streamlit
# create toml file
touch .streamlit/secrets.toml

You can get your API keys here: AlphaVantage, OpenAI

# Add the following API keys
av_api_key = "ALPHA_VANTAGE API KEY"
openai_api_key = "OPEN AI API KEY"

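Document loading, indexing, and storage

Although LlamaIndex has its own set of data connectors for reading PDFs, we still need to write a small function, `process_pdf()`, to load the PDF, because the file is uploaded through Streamlit.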

from pypdf import PdfReader
from llama_index.schema import Document

def process_pdf(pdf):
    file = PdfReader(pdf)
    text = ""
    for page in file.pages:
        text += str(page.extract_text())
        
    doc = Document(text=text)
    return [doc]

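The next step is to ingest, index, and store this document in a vector database. In this case we use FAISS, because we need an in-memory vector database, and FAISS is also very convenient to use. So we write a function called `get_vector_index()` to do this.

If you are interested in other vector database options, you can read more here.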

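import faiss
import streamlit as st
from llama_index.llms import OpenAI
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import FaissVectorStore

# Assumption: the key is read from the .streamlit/secrets.toml file created above
OPENAI_API_KEY = st.secrets["openai_api_key"]

# Assumption: 1536 is the embedding size of OpenAI's text-embedding-ada-002;
# change it if you use a different embedding model
d = 1536


def get_vector_index(documents):
    # Pass the key by keyword; the first positional argument is the model name
    llm = OpenAI(api_key=OPENAI_API_KEY)
    faiss_index = faiss.IndexFlatL2(d)
    vector_store = FaissVectorStore(faiss_index=faiss_index)

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    service_context = ServiceContext.from_defaults(llm=llm)

    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
        storage_context=storage_context,
    )

    return index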

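`ServiceContext` configures the LLM used by the index, and `StorageContext` configures where the vectors are stored. With `VectorStoreIndex.from_documents()`, we ingest and index the documents and store them as vector embeddings in the FAISS index.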
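
To make the role of a flat L2 index more concrete, here is a minimal pure-Python sketch of exact (brute-force) nearest-neighbour search. The class `FlatL2Index` is a hypothetical stand-in written for illustration, not the FAISS API, and the three-dimensional vectors are made up. FAISS reports squared L2 distances, while this sketch uses plain Euclidean distance, which produces the same ranking.

```python
import math


class FlatL2Index:
    """Toy exact-search index: stores vectors and compares the query to all of them."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vector):
        """Store one vector, rejecting vectors of the wrong dimension."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(list(vector))

    def search(self, query, k):
        """Return (distance, position) pairs for the k closest stored vectors."""
        distances = [
            (math.dist(query, vector), position)
            for position, vector in enumerate(self.vectors)
        ]
        return sorted(distances)[:k]


index = FlatL2Index(dim=3)
index.add([1.0, 0.0, 0.0])
index.add([0.0, 1.0, 0.0])
index.add([0.9, 0.1, 0.0])

# The query matches the first vector exactly and is close to the third
nearest = index.search([1.0, 0.0, 0.0], k=2)
print(nearest)
```

This also shows why the dimension passed to the index has to match the embedding model: every stored vector and every query must have the same length.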

# Calling the functions through the Streamlit frontend

import streamlit as st

if "index" not in st.session_state:
    st.session_state.index = None

if "process_doc" not in st.session_state:
    st.session_state.process_doc = False

# Assumption: the report is uploaded through a sidebar file uploader
pdfs = st.sidebar.file_uploader("Upload the annual report in PDF format", type="pdf")

if st.sidebar.button("Process Document"):
    with st.spinner("Processing Document..."):
        documents = process_pdf(pdfs)
        st.session_state.index = get_vector_index(documents)
        st.session_state.process_doc = True

    st.toast("Document Processed!")

Query tools and engine

Now that our knowledge base is ready, it is time to build a mechanism to query it.

index = get_vector_index(documents)
engine = index.as_query_engine()
query = "How has Microsoft performed in this fiscal year?"
response = engine.query(query)

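Ideally, the code above should be enough to query the vector database and synthesize a response from the information in it. However, the response may not be comprehensive or detailed enough, especially for open-ended questions like this one. We need a better mechanism that lets us break the query down into more detailed questions and retrieve context from multiple parts of the vector database.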

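from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine


def get_query_engine(engine):
    # Wrap the base engine in a tool so the LLM knows what it can answer
    query_engine_tools = [
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(
                name="Annual Report",
                description="Provides information about the company from its annual report.",
            ),
        ),
    ]
    s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
    return s_engine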

index = get_vector_index(documents)
engine = index.as_query_engine()
s_engine = get_query_engine(engine)

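Let's break down the function above. The `QueryEngineTool` module wraps the `engine` and helps provide context and metadata to it. This is especially useful when you have several engines and want to give the LLM context about which engine to use for a given query.

This is what that would look like: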


# example for multiple query engine tools
query_engine_tools = [
    QueryEngineTool(
        query_engine=sept_engine,
        metadata=ToolMetadata(
            name="sept_22",
            description="Provides information about Uber quarterly financials ending September 2022",
        ),
    ),
    QueryEngineTool(
        query_engine=june_engine,
        metadata=ToolMetadata(
            name="june_22",
            description="Provides information about Uber quarterly financials ending June 2022",
        ),
    )
]

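You can read more about the tools available in LlamaIndex here.

For now, however, we use only one `QueryEngineTool`.

The `SubQuestionQueryEngine` module breaks a complex query down into several sub-questions, each with a target query engine to execute it. After all sub-questions have been executed, the responses are collected and sent to a response synthesizer to produce the final response. Using this module is essential, because generating insights from annual reports requires complex queries that need to retrieve information from multiple nodes in the vector database.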
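
The decompose, route, and synthesize steps described above can be sketched in plain Python. This is only an illustration of the control flow, not LlamaIndex's implementation: in the real engine the LLM generates the sub-questions, whereas here they are hard-coded, and `SubQuestion`, `answer_with_sub_questions`, the stub engine, and `join_answers` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SubQuestion:
    text: str       # the narrower question to ask
    tool_name: str  # which query engine should answer it


def answer_with_sub_questions(sub_questions, tools, synthesize):
    """Route each sub-question to its engine, then combine the answers."""
    answers = []
    for sub_question in sub_questions:
        engine = tools[sub_question.tool_name]
        answers.append((sub_question.text, engine(sub_question.text)))
    return synthesize(answers)


def annual_report_engine(question):
    """Stub engine: a real one would retrieve context and call the LLM."""
    return f"[answer from annual report for: {question}]"


def join_answers(pairs):
    """Stub synthesizer: a real one would ask the LLM to merge the answers."""
    return "\n".join(f"{question} -> {answer}" for question, answer in pairs)


# Hard-coded decomposition of "How has the company performed this fiscal year?"
sub_questions = [
    SubQuestion("What was the revenue growth?", "annual_report"),
    SubQuestion("What were the major acquisitions?", "annual_report"),
]

final = answer_with_sub_questions(
    sub_questions,
    tools={"annual_report": annual_report_engine},
    synthesize=join_answers,
)
print(final)
```

With a single tool, every sub-question is routed to the same engine; the `tool_name` lookup only starts to matter once several engines are registered.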

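Prompt engineering

Prompt engineering is essential to the whole process, mainly for two reasons:

Writing precise and relevant queries gives the agent clear information about what it needs to retrieve from the vector database.
Providing the structure and a description of the output to be generated then controls the quality of the output produced from the retrieved context.

Both points can be handled with the `PromptTemplate` and `PydanticOutputParser` modules from `langchain`.

With `PydanticOutputParser`, we write descriptions for the different sections of the insights to be generated. After a few conversations with finance experts, I decided to generate insights for these 4 sections: Fiscal Year Highlights, Strategy Outlook and Future Direction, Risk Management, and Innovation and R&D. Now let's write the `pydantic` classes for these sections: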

from pydantic import BaseModel, Field

class FiscalYearHighlights(BaseModel):
    performance_highlights: str = Field(..., description="Key performance metrics and financial stats over the fiscal year.")
    major_events: str = Field(..., description="Highlight of significant events, acquisitions, or strategic shifts that occurred during the year.")
    challenges_encountered: str = Field(..., description="Challenges the company faced during the year and, if and how they managed or overcame them.")

class StrategyOutlookFutureDirection(BaseModel):
    strategic_initiatives: str = Field(..., description="The company's primary objectives and growth strategies for the upcoming years.")
    market_outlook: str = Field(..., description="Insights into the broader market, competitive landscape, and industry trends the company anticipates.")

class RiskManagement(BaseModel):
    risk_factors: str = Field(..., description="Primary risks the company acknowledges.")
    risk_mitigation: str = Field(..., description="Strategies for managing these risks.")

class InnovationRnD(BaseModel):
    r_and_d_activities: str = Field(..., description="Overview of the company's focus on research and development, major achievements, or breakthroughs.")
    innovation_focus: str = Field(..., description="Mention of new technologies, patents, or areas of research the company is diving into.")

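Note: These sections and their descriptions are for generic use cases. They can be changed to suit your particular needs.

These pydantic classes provide the prompt with the format and description of each section. So let's write a function that lets us plug any pydantic class into the prompt: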

from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser


prompt_template = """
You are given the task of generating insights for {section} from the annual report of the company. 

Given below is the output format, which has the subsections.
Must use bullet points.
Always use $ symbol for money values, and round it off to millions or billions accordingly

In case you don't have enough info you can just write: No information available
---
{output_format}
---
"""

def report_insights(engine, section, pydantic_model):
    parser = PydanticOutputParser(pydantic_object=pydantic_model)

    # Use a distinct local name: reassigning `prompt_template` inside the function
    # would shadow the module-level template string and raise UnboundLocalError
    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["section"],
        partial_variables={"output_format": parser.get_format_instructions()}
    )

    formatted_input = prompt.format(section=section)
    response = engine.query(formatted_input)
    parsed_response = parser.parse(response.response)

    return parsed_response

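`PromptTemplate` inserts all the values, such as `section` and `output_format`, into the prompt template. `PydanticOutputParser` converts the pydantic class into a format the LLM can read. The generated response is a string, so we use `parser.parse()` to parse it into structured output.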
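
As a rough illustration of what the parsing step involves, the sketch below pulls a JSON object out of a raw LLM response and checks that the expected fields are present. It uses only the standard library and is not how `PydanticOutputParser` is implemented; `RiskManagementSketch`, `parse_structured`, and the sample response are made up for this example.

```python
import json
import re
from dataclasses import dataclass, fields


@dataclass
class RiskManagementSketch:
    """Standard-library stand-in for the RiskManagement pydantic class."""
    risk_factors: str
    risk_mitigation: str


def parse_structured(raw_response, model):
    """Extract the JSON object from the response and build the model from it."""
    match = re.search(r"\{.*\}", raw_response, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the response")

    data = json.loads(match.group(0))
    expected = [field.name for field in fields(model)]
    missing = [name for name in expected if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    return model(**{name: str(data[name]) for name in expected})


# Made-up response: some chatter followed by the JSON the prompt asked for
raw = (
    "Here is the result:\n"
    '{"risk_factors": "- Currency exposure", "risk_mitigation": "- Hedging program"}'
)

parsed = parse_structured(raw, RiskManagementSketch)
print(parsed.risk_factors)
print(parsed.risk_mitigation)
```

Failing loudly on a missing field is a deliberate choice here: it is easier to retry the query than to display a half-empty section.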

# calling the function in streamlit frontend

if st.session_state.process_doc:

    if st.button("Analyze Report"):

        engine = get_query_engine(st.session_state.index.as_query_engine(similarity_top_k=3))

        with st.status("**Analyzing Report...**"):

            st.write("Fiscal Year Highlights...")
            st.session_state.fiscal_year_highlights = report_insights(engine, "Fiscal Year Highlights", FiscalYearHighlights)

            st.write("Strategy Outlook and Future Direction...")
            st.session_state.strategy_outlook_future_direction = report_insights(engine, "Strategy Outlook and Future Direction", StrategyOutlookFutureDirection)

            st.write("Risk Management...")
            st.session_state.risk_management = report_insights(engine, "Risk Management", RiskManagement)
            
            st.write("Innovation and R&D...")
            st.session_state.innovation_and_rd = report_insights(engine, "Innovation and R&D", InnovationRnD)


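    # displaying the generated insights

    # Assumption: the four tabs are created here; the labels are illustrative
    tab1, tab2, tab3, tab4 = st.tabs(["Fiscal Year Highlights", "Strategy Outlook and Future Direction", "Risk Management", "Innovation and R&D"])

    # Make sure each insight key exists so the checks below do not fail before the first analysis
    for key in ("fiscal_year_highlights", "strategy_outlook_future_direction", "risk_management", "innovation_and_rd"):
        if key not in st.session_state:
            st.session_state[key] = None

    if st.session_state.fiscal_year_highlights:

        with tab1:
            st.write("## Fiscal Year Highlights")
            st.write("### Performance Highlights")
            st.write(st.session_state.fiscal_year_highlights.performance_highlights)
            st.write("### Major Events")
            st.write(st.session_state.fiscal_year_highlights.major_events)
            st.write("### Challenges Encountered")
            st.write(st.session_state.fiscal_year_highlights.challenges_encountered)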

    if st.session_state.strategy_outlook_future_direction:
        with tab2:
            st.write("## Strategy Outlook and Future Direction")
            st.write("### Strategic Initiatives")
            st.write(st.session_state.strategy_outlook_future_direction.strategic_initiatives)
            st.write("### Market Outlook")
            st.write(st.session_state.strategy_outlook_future_direction.market_outlook)

    if st.session_state.risk_management:
        with tab3:
            st.write("## Risk Management")
            st.write("### Risk Factors")
            st.write(st.session_state.risk_management.risk_factors)
            st.write("### Risk Mitigation")
            st.write(st.session_state.risk_management.risk_mitigation)

    if st.session_state.innovation_and_rd:
        with tab4:
            st.write("## Innovation and R&D")
            st.write("### R&D Activities")
            st.write(st.session_state.innovation_and_rd.r_and_d_activities)
            st.write("### Innovation Focus")
            st.write(st.session_state.innovation_and_rd.innovation_focus)

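You can find the complete Annual Report Analyzer code here.

Upcoming features

Select and store insights: I have been working on a feature that lets users select any insights they need and save them to their account.
Add more industry-specific insights: Currently, the insights are generic. However, different industries use annual reports differently, so naturally I need to create different sets of insights depending on the user's use case.
`PandasQueryEngine` module for querying financial statements: With this module, the LLM will be able to extract better insights from financial statements, which are typically in a structured format.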

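Conclusion

In summary, FinSight's Annual Report Analyzer makes financial analysis easier and more insightful by harnessing the power of LLMs. It is a valuable tool for portfolio managers, financial analysts, and shareholders that saves time and improves decision-making. While the core pipeline stays the same, note that the deployed app code may keep evolving to include upgrades and enhancements.

Many thanks to LlamaIndex for helping me bring FinSight to life. No other framework is as advanced when it comes to building RAG-based tools.

If you liked what you read, please leave a clap and show FinSight some support. You can check out the GitHub repo here.

You can connect with me on LinkedIn and Twitter/X.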