Jerry Liu • 2023-06-23

Build and Evaluate LLM Apps with LlamaIndex and TruLens

Authors: Anupam Datta, Shayak Sen, Jerry Liu, Simon Suo

Original post: https://truera.com/build-and-evaluate-llm-apps-with-llamaindex-and-trulens/

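LlamaIndex is a popular open-source framework for building LLM apps. TruLens is an open-source library for evaluating, tracking, and iterating on LLM apps to improve their quality. The LlamaIndex and TruLens teams are actively collaborating so that LLM app developers can build, evaluate, and iterate on their apps quickly.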

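The latest release of TruLens introduces tracing for LlamaIndex-based LLM apps, so you can evaluate and track your experiments with just a few lines of code. This lets you automatically evaluate several components of the application stack, including:

App inputs and outputs
LLM calls
Context chunks retrieved from the index
Latency
Cost and token counts (coming soon!)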

Check out this notebook to get started, and read along here for a step-by-step view.

How do I actually use it?

Build a LlamaIndex app

LlamaIndex lets you connect your data to LLMs and quickly build apps for many different use cases.

from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('llama_index/data').load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

Once the app is built, you can easily query your data:

response = query_engine.query("What did the author do growing up?")
print(response)

and get a corresponding response:

Growing up, the author wrote short stories, programmed on an IBM 1401, and nagged his father to buy him a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college, but switched to AI after becoming bored with it. He then took art classes at Harvard and applied to art schools, eventually attending RISD.

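Wrap a LlamaIndex app with TruLens

With TruLens, you can wrap a LlamaIndex query engine with the TruLlama wrapper. The wrapper preserves all LlamaIndex behavior, but traces every intermediate step so that each one can be evaluated individually.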

from trulens_eval import TruLlama
l = TruLlama(query_engine)

The wrapped app can now be queried in exactly the same way:

response = l.query("What did the author do growing up?")
print(response)

The difference is that the details of each query are now logged by TruLens.
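
To make the wrapping idea concrete, here is a minimal sketch of a wrapper that forwards queries unchanged while logging each call. It is not the TruLlama implementation: `RecordingWrapper`, `Record`, and `fake_query_engine` are hypothetical names chosen for illustration, and the sketch records only the input, the output, and the latency.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Record:
    """One logged call: what went in, what came out, how long it took."""
    prompt: str
    response: Any
    latency_s: float


@dataclass
class RecordingWrapper:
    """Hypothetical tracing wrapper; an illustration, not TruLlama itself."""
    query_fn: Callable[[str], Any]
    records: List[Record] = field(default_factory=list)

    def query(self, prompt: str) -> Any:
        start = time.perf_counter()
        response = self.query_fn(prompt)  # delegate to the wrapped app unchanged
        elapsed = time.perf_counter() - start
        self.records.append(Record(prompt, response, elapsed))
        return response


def fake_query_engine(prompt: str) -> str:
    # Placeholder for a real query engine; returns a canned answer.
    return f"echo: {prompt}"


wrapped = RecordingWrapper(fake_query_engine)
answer = wrapped.query("What did the author do growing up?")
print(answer)
print(len(wrapped.records), wrapped.records[0].prompt)
```

The caller sees the same return value as before, while the log accumulates on the side. That separation is what allows evaluation to be added without changing application code.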

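Add feedback functions

Now, to evaluate the app's behavior, you can add feedback functions to your wrapped app. As a developer, you only need a few lines of code to start using feedback functions in your app. You can also easily add functions tailored to the needs of your application.

Our goal with feedback functions is to check app quality metrics programmatically.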
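
As a rough illustration of what checking a quality metric programmatically can look like, the sketch below treats a feedback function as a plain function from a prompt and a response to a score between 0 and 1. The `response_length_feedback` function and its `max_words` parameter are hypothetical and separate from the feedback functions used in this post. Registering a custom function with TruLens is not shown here.

```python
def response_length_feedback(prompt: str, response: str, max_words: int = 100) -> float:
    """Toy metric: 0.0 for an empty response, 1.0 for a response within
    max_words, and a proportionally lower score for longer responses."""
    words = response.split()
    if not words:
        return 0.0
    if len(words) <= max_words:
        return 1.0
    return max_words / len(words)


# Score a few hypothetical (prompt, response) pairs.
examples = [
    ("What did the author do growing up?", "He wrote short stories."),
    ("What did the author do growing up?", ""),
]
scores = [response_length_feedback(p, r) for p, r in examples]
print(scores)
```

The point is only the shape: a feedback function takes text from a recorded call and returns a number that can be tracked across experiments.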

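The first feedback function in the code below checks for a language match between the prompt and the response. This is a useful check because users naturally expect the response to be in the same language as the prompt. It is implemented with a call to a HuggingFace API that checks the language match programmatically.
The next feedback function checks how relevant the answer is to the question, using an OpenAI LLM that is prompted to produce a relevance score.
Finally, the third feedback function checks how relevant each individual chunk retrieved from the vector database is to the question, again using an OpenAI LLM in a similar manner. This is useful because the retrieval step can return chunks that are not relevant to the question, and the final response is better if those chunks are filtered out before it is generated.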

import numpy as np  # needed for the np.min aggregation below
from trulens_eval import TruLlama, Tru, Query, Feedback, feedback

# Initialize the HuggingFace- and OpenAI-based feedback function collection classes:
hugs = feedback.Huggingface()
openai = feedback.OpenAI()
# Define a language match feedback function using HuggingFace.
f_lang_match = Feedback(hugs.language_match).on_input_output()
# By default this will check language match on the main app input and main app
# output.

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(openai.relevance).on_input_output()

# Question/statement relevance between question and each context chunk.
f_qs_relevance = Feedback(openai.qs_relevance).on_input().on(
    TruLlama.select_source_nodes().node.text
).aggregate(np.min)


feedbacks = [f_lang_match, f_qa_relevance, f_qs_relevance]

l = TruLlama(app=query_engine, feedbacks=feedbacks)
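
The chunk-level check above scores every retrieved chunk against the question and then aggregates with a minimum, so a single irrelevant chunk pulls the overall score down. The sketch below illustrates that idea, and the filtering step mentioned earlier, in plain Python. It makes several assumptions: `word_overlap` is a crude hypothetical stand-in for the LLM-based relevance score, the `threshold` value is arbitrary, and the built-in `min` plays the role of `np.min`.

```python
from typing import Callable, List


def word_overlap(question: str, chunk: str) -> float:
    """Toy relevance score: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & c_words) / len(q_words)


def score_chunks(question: str, chunks: List[str],
                 scorer: Callable[[str, str], float]) -> List[float]:
    """Score each retrieved chunk against the question."""
    return [scorer(question, chunk) for chunk in chunks]


def filter_chunks(chunks: List[str], scores: List[float],
                  threshold: float) -> List[str]:
    """Keep only chunks whose score reaches the threshold."""
    return [c for c, s in zip(chunks, scores) if s >= threshold]


question = "author childhood programming"
chunks = [
    "the author wrote programming exercises in childhood",
    "stock prices fell sharply on tuesday",
]
scores = score_chunks(question, chunks, word_overlap)
worst = min(scores)  # aggregate: the least relevant chunk sets the score
kept = filter_chunks(chunks, scores, threshold=0.5)
print(scores, worst, kept)
```

Aggregating with a minimum is a strict choice: it flags a retrieval step as soon as any one chunk is off-topic, which is a useful signal when deciding whether to filter chunks before generation.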