Evaluating Summarization with TruLens
In this notebook, we will evaluate a summarization application built on the DialogSum dataset using a broad set of metrics available from TruLens. These metrics break down into three categories.
- Ground truth agreement: For this set of metrics, we will measure how similar the generated summary is to a human-created ground truth summary. We will use four different measures: BERT score, BLEU, ROUGE, and a measure where an LLM is prompted to produce a similarity score.
- Groundedness: Estimate whether the generated summary can be traced back to parts of the original transcript, using both LLM and NLI methods.
- Comprehensiveness: Estimate whether the generated summary contains all of the key points from the source text.
Dependencies
Let's first install the packages that this notebook depends on. Uncomment the line below to run it.
# !pip install trulens trulens-providers-openai trulens-providers-huggingface bert_score evaluate absl-py rouge-score pandas tenacity
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["HUGGINGFACE_API_KEY"] = "hf_..."
Download and load data
Now we will download a portion of the DialogSum dataset from GitHub.
import pandas as pd
!wget -O dialogsum.dev.jsonl https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.dev.jsonl
file_path_dev = "dialogsum.dev.jsonl"
dev_df = pd.read_json(path_or_buf=file_path_dev, lines=True)
Let's preview the data to make sure that it was loaded properly.
dev_df.head(10)
Create a simple summarization app and instrument it
We will create a simple summarization app based on the OpenAI ChatGPT model and instrument it for use with TruLens.
from trulens.apps.custom import TruCustomApp
from trulens.apps.custom import instrument
import openai
class DialogSummaryApp:
    @instrument
    def summarize(self, dialog):
        client = openai.OpenAI()
        summary = (
            client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[
                    {
                        "role": "system",
                        "content": """Summarize the given dialog into 1-2 sentences based on the following criteria:
                        1. Convey only the most salient information;
                        2. Be brief;
                        3. Preserve important named entities within the conversation;
                        4. Be written from an observer perspective;
                        5. Be written in formal language. """,
                    },
                    {"role": "user", "content": dialog},
                ],
            )
            .choices[0]
            .message.content
        )
        return summary
Initialize database and view dashboard
from trulens.core import TruSession
from trulens.dashboard import run_dashboard
session = TruSession()
session.reset_database()
# If you have a database you can connect to, use a URL. For example:
# session = TruSession(database_url="postgresql://hostname/database?user=username&password=password")
run_dashboard(session, force=True)
Write feedback functions
We will now create the feedback functions that will evaluate the app. Remember that the criteria we are evaluating against are:
- Ground truth agreement: For this set of metrics, we will measure how similar the generated summary is to a human-created ground truth summary. We will use four different measures: BERT score, BLEU, ROUGE, and a measure where an LLM is prompted to produce a similarity score.
- Groundedness: For this measure, we will estimate whether the generated summary can be traced back to parts of the original transcript.
- Comprehensiveness: For this measure, we will estimate whether the generated summary contains all of the key points from the source text.
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
We construct the golden set from the dataset we downloaded, using each dialogue as the query and its human-written summary as the expected response.
golden_set = (
dev_df[["dialogue", "summary"]]
.rename(columns={"dialogue": "query", "summary": "response"})
.to_dict("records")
)
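The dev split contains a few hundred dialogues, so a full evaluation run makes many LLM calls. If you only want to try the pipeline end to end, you can optionally evaluate on a slice first; the choice of 25 below is arbitrary.

# Optional: work with a subset to keep API usage small (25 is an arbitrary choice).
# golden_set = golden_set[:25]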
from trulens.core import Select
from trulens.providers.huggingface import Huggingface
from trulens.providers.openai import OpenAI
provider = OpenAI(model_engine="gpt-4o")
hug_provider = Huggingface()
ground_truth_collection = GroundTruthAgreement(golden_set, provider=provider)
f_groundtruth = Feedback(
ground_truth_collection.agreement_measure, name="Similarity (LLM)"
).on_input_output()
f_bert_score = Feedback(ground_truth_collection.bert_score).on_input_output()
f_bleu = Feedback(ground_truth_collection.bleu).on_input_output()
f_rouge = Feedback(ground_truth_collection.rouge).on_input_output()
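As a quick sanity check, you can call one of these measures directly on a pair from the golden set. This assumes the measures accept the same (input, output) pair that on_input_output() supplies; scoring a dialogue against its own reference summary should land near the top of the range.

# Sanity check: score a reference summary against itself via the golden set lookup.
sample = golden_set[0]
ground_truth_collection.rouge(sample["query"], sample["response"])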
# Groundedness between the source dialog (record input) and the generated summary (record output).
f_groundedness_llm = (
Feedback(
provider.groundedness_measure_with_cot_reasons,
name="Groundedness - LLM Judge",
)
.on(Select.RecordInput)
.on(Select.RecordOutput)
)
f_groundedness_nli = (
Feedback(
hug_provider.groundedness_measure_with_nli,
name="Groundedness - NLI Judge",
)
.on(Select.RecordInput)
.on(Select.RecordOutput)
)
f_comprehensiveness = (
Feedback(
provider.comprehensiveness_with_cot_reasons, name="Comprehensiveness"
)
.on(Select.RecordInput)
.on(Select.RecordOutput)
)
provider.comprehensiveness_with_cot_reasons(
"the white house is white. obama is the president",
"the white house is white. obama is the president",
)
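You can spot-check the groundedness measures the same way. The argument order here is assumed to be (source, statement), mirroring how the selectors above pass the record input and output:

provider.groundedness_measure_with_cot_reasons(
    "the white house is white. obama is the president",
    "the white house is white.",
)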
Create the app and wrap it
Now we are ready to wrap our summarization app with TruLens as a TruCustomApp. Each time it is called, TruLens will log inputs, outputs, and any instrumented intermediate steps, and evaluate them with the feedback functions we created.
app = DialogSummaryApp()
print(app.summarize(dev_df.dialogue[498]))
tru_recorder = TruCustomApp(
app,
app_name="Summarize",
app_version="v1",
feedbacks=[
f_groundtruth,
f_groundedness_llm,
f_groundedness_nli,
f_comprehensiveness,
f_bert_score,
f_bleu,
f_rouge,
],
)
We can test a single run of the app like so. This should show up on the dashboard.
with tru_recorder:
    app.summarize(dialog=dev_df.dialogue[498])
We'll make a lot of queries in a short amount of time, so we use tenacity to retry with exponential backoff and make sure that most of our requests eventually go through.
from tenacity import retry
from tenacity import stop_after_attempt
from tenacity import wait_random_exponential
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def run_with_backoff(doc):
    return tru_recorder.with_record(app.summarize, dialog=doc)

for pair in golden_set:
    llm_response = run_with_backoff(pair["query"])
    print(llm_response)
And that's it! This might take a few minutes to run; at the end of it, you can explore the dashboard to see how well your app does.
from trulens.dashboard import run_dashboard
run_dashboard(session)
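If you prefer to inspect results programmatically rather than in the dashboard, TruSession exposes helpers for that as well. The method names below come from the TruLens session API, though return shapes may differ slightly across versions.

# Aggregate feedback scores per app version.
session.get_leaderboard()

# Full records plus feedback results as a DataFrame.
records_df, feedback_names = session.get_records_and_feedback()
records_df.head()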