Ravi Theja • 2023-08-12

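LlamaIndex: Harnessing the Power of Text2SQL and RAG to Analyze Product Reviews

Introduction

E-commerce platforms such as Amazon and Walmart list vast numbers of products that attract a flood of reviews every day. These reviews are a key touchpoint reflecting how consumers feel about a product. But how can a business sift meaningful insights out of such a massive review database?

The answer lies in combining SQL with RAG (Retrieval-Augmented Generation) through LlamaIndex.

Let's dive in!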

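Sample Dataset of Product Reviews

For this demonstration, we used GPT-4 to generate a sample dataset of reviews for three products: the iPhone13, a Samsung TV, and an ergonomic chair. Here is a sneak peek:

iPhone13: "Amazing battery life and camera quality. The best iPhone yet."
SamsungTV: "Impressive picture clarity and vibrant colors. A top-notch TV."
Ergonomic Chair: "Feels really comfortable even after long hours."

Here is the sample dataset.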

rows = [
    # iPhone13 Reviews
    {"category": "Phone", "product_name": "Iphone13", "review": "The iPhone13 is a stellar leap forward. From its sleek design to the crystal-clear display, it screams luxury and functionality. Coupled with the enhanced battery life and an A15 chip, it's clear Apple has once again raised the bar in the smartphone industry."},
    {"category": "Phone", "product_name": "Iphone13", "review": "This model brings the brilliance of the ProMotion display, changing the dynamics of screen interaction. The rich colors, smooth transitions, and lag-free experience make daily tasks and gaming absolutely delightful."},
    {"category": "Phone", "product_name": "Iphone13", "review": "The 5G capabilities are the true game-changer. Streaming, downloading, or even regular browsing feels like a breeze. It's remarkable how seamless the integration feels, and it's obvious that Apple has invested a lot in refining the experience."},

    # SamsungTV Reviews
    {"category": "TV", "product_name": "SamsungTV", "review": "Samsung's display technology has always been at the forefront, but with this TV, they've outdone themselves. Every visual is crisp, the colors are vibrant, and the depth of the blacks is simply mesmerizing. The smart features only add to the luxurious viewing experience."},
    {"category": "TV", "product_name": "SamsungTV", "review": "This isn't just a TV; it's a centerpiece for the living room. The ultra-slim bezels and the sleek design make it a visual treat even when it's turned off. And when it's on, the 4K resolution delivers a cinematic experience right at home."},
    {"category": "TV", "product_name": "SamsungTV", "review": "The sound quality, often an oversight in many TVs, matches the visual prowess. It creates an enveloping atmosphere that's hard to get without an external sound system. Combined with its user-friendly interface, it's the TV I've always dreamt of."},

    # Ergonomic Chair Reviews
    {"category": "Furniture", "product_name": "Ergonomic Chair", "review": "Shifting to this ergonomic chair was a decision I wish I'd made earlier. Not only does it look sophisticated in its design, but the level of comfort is unparalleled. Long hours at the desk now feel less daunting, and my back is definitely grateful."},
    {"category": "Furniture", "product_name": "Ergonomic Chair", "review": "The meticulous craftsmanship of this chair is evident. Every component, from the armrests to the wheels, feels premium. The adjustability features mean I can tailor it to my needs, ensuring optimal posture and comfort throughout the day."},
    {"category": "Furniture", "product_name": "Ergonomic Chair", "review": "I was initially drawn to its aesthetic appeal, but the functional benefits have been profound. The breathable material ensures no discomfort even after prolonged use, and the robust build gives me confidence that it's a chair built to last."},
]

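Setting Up an In-Memory Database

To hold our data, we use an in-memory SQLite database. SQLAlchemy provides an efficient way to model, create, and interact with it. Here is the structure of our product_reviews table:

id (Integer, primary key)
category (String)
product_name (String)
review (String, not null)

With the table structure defined, we populate it with the sample dataset.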

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, insert

# Import path from the legacy llama_index (0.8.x) package layout used in this post.
from llama_index import SQLDatabase

engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

# Create the product reviews SQL table.
table_name = "product_reviews"
product_reviews_table = Table(
    table_name,
    metadata_obj,
    Column("id", Integer, primary_key=True),  # sole primary key, assigned by SQLite
    Column("category", String(16)),
    Column("product_name", String(32)),
    Column("review", String, nullable=False),
)
metadata_obj.create_all(engine)

sql_database = SQLDatabase(engine, include_tables=[table_name])

# Insert all sample rows in one transaction; engine.begin() commits on exit.
with engine.begin() as connection:
    connection.execute(insert(product_reviews_table), rows)
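
As a quick cross-check of the table structure described above, the same schema can be written directly against Python's built-in sqlite3 module. This is only a sketch, not part of the LlamaIndex pipeline: `sample_rows` is a hypothetical, shortened stand-in for the full `rows` list, and the column types are the SQLite equivalents of the four columns listed earlier.

```python
import sqlite3

# Hypothetical stand-in for `rows`: same keys, shortened review text.
sample_rows = [
    {"category": "Phone", "product_name": "Iphone13", "review": "A stellar leap forward."},
    {"category": "TV", "product_name": "SamsungTV", "review": "Crisp, vibrant display."},
    {"category": "Furniture", "product_name": "Ergonomic Chair", "review": "Unparalleled comfort."},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_reviews (
        id INTEGER PRIMARY KEY,      -- SQLite assigns ids automatically
        category TEXT,
        product_name TEXT,
        review TEXT NOT NULL
    )
    """
)
# Named placeholders map directly onto the dictionary keys.
conn.executemany(
    "INSERT INTO product_reviews (category, product_name, review) "
    "VALUES (:category, :product_name, :review)",
    sample_rows,
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM product_reviews").fetchone()[0]
ids = [r[0] for r in conn.execute("SELECT id FROM product_reviews ORDER BY id")]
print(row_count, ids)
```

Because `id` is the only primary key, it can be omitted from each inserted row and SQLite fills it in.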

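Analyzing Product Reviews: Text2SQL + RAG

Extracting insights from data often requires asking intricate questions.

SQL + RAG in LlamaIndex simplifies this by breaking the process into three steps:

1. Question decomposition:

Primary query construction: phrase the main question in natural language to pull the initial data from the SQL table.
Secondary query construction: phrase a follow-up question that refines or interprets the results of the primary query.

2. Data retrieval: run the primary query with LlamaIndex's Text2SQL module to get an initial result set.

3. Final answer generation: use a List Index to refine the results based on the secondary question, producing the final answer.

Let's go through it step by step.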

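Decomposing the User Query into Two Stages

When working with a relational database, it often helps to break a user query into more manageable parts. This makes it easier to retrieve accurate data from the database and then process or interpret that data to meet the user's need. We designed an approach that decomposes a query into two distinct questions by giving the gpt-3.5-turbo model a single example to follow.

Applying this to the query "Get the summary of reviews of Iphone13", our system generates:

Database query: "Retrieve reviews related to iPhone13 from the table."
Interpretation query: "Summarize the retrieved reviews."

This approach addresses both data retrieval and data interpretation, which leads to more accurate and tailored responses to user queries.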

from typing import List

# Import path from the legacy llama_index (0.8.x) package layout. `llm` is assumed
# to be an already configured chat model (this post uses gpt-3.5-turbo).
from llama_index.llms import ChatMessage


def generate_questions(user_query: str) -> List[str]:
  system_message = '''
  You are given with Postgres table with the following columns.

  city_name, population, country, reviews.

  Your task is to decompose the given question into the following two questions.

  1. Question in natural language that needs to be asked to retrieve results from the table.
  2. Question that needs to be asked on the top of the result from the first question to provide the final answer.

  Example:

  Input:
  How is the culture of countries whose population is more than 5000000

  Output:
  1. Get the reviews of countries whose population is more than 5000000
  2. Provide the culture of countries
  '''

  messages = [
      ChatMessage(role="system", content=system_message),
      ChatMessage(role="user", content=user_query),
  ]
  response_text = llm.chat(messages).message.content
  # Keep only non-empty lines so stray blank lines do not break the two-way unpacking.
  generated_questions = [line.strip() for line in response_text.splitlines() if line.strip()]

  return generated_questions

user_query = "Get the summary of reviews of Iphone13"

text_to_sql_query, rag_query = generate_questions(user_query)
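
The model's reply is free text, so unpacking it into exactly two questions can fail if the reply carries blank lines or extra commentary. Below is a minimal sketch of a stricter parser, assuming the reply follows the numbered format shown in the prompt's example. The helper `split_decomposition` and the string `sample_output` are hypothetical and not part of LlamaIndex.

```python
import re
from typing import List, Tuple


def split_decomposition(llm_output: str) -> Tuple[str, str]:
    """Return (database question, interpretation question) from a numbered reply."""
    questions: List[str] = []
    for line in llm_output.splitlines():
        # Drop list markers such as "1." or "2)" plus surrounding whitespace.
        cleaned = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    if len(questions) != 2:
        raise ValueError(f"expected 2 questions, got {len(questions)}: {questions!r}")
    return questions[0], questions[1]


# Hypothetical model reply, including a stray blank line between the items.
sample_output = """
1. Get the reviews of Iphone13

2. Provide the summary of the reviews
"""

db_question, interpretation_question = split_decomposition(sample_output)
print(db_question)
print(interpretation_question)
```

Raising an error when the reply does not contain exactly two questions makes a malformed decomposition visible early instead of passing a wrong question to the SQL step.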

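Data Retrieval: Executing the Primary Query

Once the user's question is decomposed into its parts, the first step is to convert the natural-language database query into an actual SQL query that can run against our database. In this section we use LlamaIndex's NLSQLTableQueryEngine to handle both the conversion and the execution of that SQL query.

Setting Up the NLSQLTableQueryEngine

NLSQLTableQueryEngine is a powerful tool that takes a natural language query and converts it into a SQL query. We initialize it with the following:

sql_database: the SQLDatabase object that wraps our SQL database connection.
tables: the table or tables the query will run against. Here we target the product_reviews table.
synthesize_response: when set to False, we receive the raw SQL response without any additional synthesis.
service_context: an optional parameter for supplying service-specific settings, such as the LLM to use.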

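# Import path from the legacy llama_index (0.8.x) package layout.
# `service_context` is assumed to have been created earlier with the chosen LLM.
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["product_reviews"],
    synthesize_response=False,
    service_context=service_context
)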

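Executing the Natural Language Query

With the engine set up, the next step is to run our natural language query against it using the engine's query() method.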

sql_response = sql_query_engine.query(text_to_sql_query)
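
For intuition, the database question "retrieve reviews related to iPhone13" corresponds to a simple filtered SELECT. The sketch below runs such a statement by hand with Python's built-in sqlite3 module. The SQL here is an assumption about the kind of query a Text2SQL step might produce, not output captured from NLSQLTableQueryEngine, and the table is rebuilt with a few shortened, illustrative reviews.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_reviews ("
    "id INTEGER PRIMARY KEY, category TEXT, product_name TEXT, review TEXT NOT NULL)"
)
# Shortened, illustrative rows (not the full dataset).
conn.executemany(
    "INSERT INTO product_reviews (category, product_name, review) VALUES (?, ?, ?)",
    [
        ("Phone", "Iphone13", "Enhanced battery life and an A15 chip."),
        ("Phone", "Iphone13", "The 5G capabilities are the true game-changer."),
        ("TV", "SamsungTV", "Every visual is crisp."),
    ],
)

# Hand-written stand-in for the SQL a Text2SQL step might generate.
illustrative_sql = (
    "SELECT review FROM product_reviews WHERE product_name = ? ORDER BY id"
)
iphone_reviews = conn.execute(illustrative_sql, ("Iphone13",)).fetchall()
print(iphone_reviews)
```

Each result row is a one-element tuple holding a review string, which is the row shape the next processing step has to flatten.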

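Processing the SQL Response

The result of the SQL query arrives as a string containing a list of rows, with each row holding the selected column values (here, review text). To make it readable and usable for the summarization in the third step, we parse it and join it into a single string.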

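import ast

# Parse the stringified list of row tuples; str() guards against non-text columns.
sql_response_list = ast.literal_eval(sql_response.response)
text = [' '.join(str(value) for value in row) for row in sql_response_list]
text = ' '.join(text)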

You can inspect the generated SQL query in sql_response.metadata["sql_query"].
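
To make the parsing step concrete, here is a small standalone sketch. The string `raw_response` is a made-up example of a stringified list of one-column rows, which is the shape the parsing code above expects. It is not real engine output.

```python
import ast

# Hypothetical raw response: the string form of a list of one-column rows.
raw_response = "[('Great battery life.',), ('The display is superb.',)]"

# literal_eval safely turns the string back into Python objects (no code execution).
rows_from_sql = ast.literal_eval(raw_response)

# Flatten: join the values within each row, then join the rows.
# str() keeps this working if a row also contains non-text values such as an id.
review_text = " ".join(" ".join(str(value) for value in row) for row in rows_from_sql)
print(review_text)
```

Using `ast.literal_eval` rather than `eval` restricts parsing to Python literals, which matters because the string originates from database content.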

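By following this process, we integrate natural language processing with SQL query execution. Let's move on to the last step and produce the review summary.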

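Refining and Interpreting the Reviews with ListIndex

Once the SQL query has returned the primary result set, it often needs further refinement or interpretation. This is where ListIndex in LlamaIndex plays its key role: it lets us run the secondary question over the retrieved text to obtain a refined answer.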

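# Import path from the legacy llama_index (0.8.x) package layout.
from llama_index import Document, ListIndex

listindex = ListIndex([Document(text=text)])
list_query_engine = listindex.as_query_engine()

response = list_query_engine.query(rag_query)

print(response.response)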

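Now let's wrap everything in a function and try a few interesting examples.

def sql_rag(user_query: str) -> str:
  """Function to perform SQL+RAG."""
  # Step 1: decompose the user query into a database question and a follow-up.
  text_to_sql_query, rag_query = generate_questions(user_query)

  # Step 2: run the database question through the Text2SQL engine.
  sql_response = sql_query_engine.query(text_to_sql_query)

  sql_response_list = ast.literal_eval(sql_response.response)

  text = [' '.join(str(value) for value in row) for row in sql_response_list]
  text = ' '.join(text)

  # Step 3: answer the follow-up question over the retrieved text.
  listindex = ListIndex([Document(text=text)])
  list_query_engine = listindex.as_query_engine()

  summary = list_query_engine.query(rag_query)

  return summary.response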

Examples

sql_rag("How is the sentiment of SamsungTV product?")

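The sentiment of the reviews for the Samsung TV product is generally positive. Users express satisfaction with the picture clarity, vibrant colors, and stunning picture quality. They appreciate the smart features, user-friendly interface, and easy connectivity options. The sleek design and wall-mounting capability are also praised. The ambient mode, gaming mode, and HDR content are mentioned as standout features. Users find the remote control with voice command convenient and appreciate the regular software updates. However, some users mention that the sound quality could be better and suggest using an external audio system. Overall, the reviews indicate that the Samsung TV is considered a solid investment for quality viewing.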

sql_rag("Are people happy with Ergonomic Chair?")

The overall satisfaction of people with the Ergonomic Chair is high.

You can try this approach and the dataset in the accompanying Google Colab notebook.

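Conclusion

In the era of e-commerce, where user reviews can make or break a product, the ability to quickly analyze and interpret large volumes of text is essential. By integrating SQL and RAG, LlamaIndex gives businesses a powerful tool for extracting actionable insights from such datasets. By combining structured SQL queries with the abstraction of natural language processing, we have shown a streamlined way to turn vague user queries into precise, informative answers.

With this approach, businesses can efficiently sift through large numbers of reviews, extract the essence of user sentiment, and make informed decisions. Whether the goal is gauging overall sentiment for a product, understanding feedback on a specific feature, or tracking how reviews change over time, the Text2SQL+RAG methodology in LlamaIndex offers a practical path forward for data analysis.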
