Evaluating Summarization with TruLens
In this notebook, we will evaluate a summarization application built on the DialogSum dataset using a broad set of metrics available from TruLens. These metrics break down into three categories.
- Ground truth agreement: For this set of metrics, we will measure how similar the generated summary is to a human-created ground truth summary. We will use four different measures: BERT score, BLEU, ROUGE, and a measure where an LLM is prompted to produce a similarity score.
- Groundedness: Estimate whether the generated summary can be traced back to parts of the original transcript, using both LLM and NLI methods.
- Comprehensiveness: Estimate whether the generated summary contains all of the key points from the source text.
Dependencies
Let's first install the packages that this notebook depends on. Uncomment the line below to run it.
# !pip install trulens trulens-providers-openai trulens-providers-huggingface bert_score evaluate absl-py rouge-score pandas tenacity
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["HUGGINGFACE_API_KEY"] = "hf_..."
Download and load data
Now we will download a portion of the DialogSum dataset from GitHub.
import pandas as pd
!wget -O dialogsum.dev.jsonl https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.dev.jsonl
file_path_dev = "dialogsum.dev.jsonl"
dev_df = pd.read_json(path_or_buf=file_path_dev, lines=True)
Let's preview the data to make sure that it was loaded properly.
dev_df.head(10)
Create a simple summarization app and instrument it
We will create a simple summarization app based on the OpenAI ChatGPT model and instrument it for use with TruLens.
from trulens.apps.custom import TruCustomApp
from trulens.apps.custom import instrument
import openai
class DialogSummaryApp:
    @instrument
    def summarize(self, dialog):
        client = openai.OpenAI()
        summary = (
            client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[
                    {
                        "role": "system",
                        "content": """Summarize the given dialog into 1-2 sentences based on the following criteria:
                        1. Convey only the most salient information;
                        2. Be brief;
                        3. Preserve important named entities within the conversation;
                        4. Be written from an observer perspective;
                        5. Be written in formal language. """,
                    },
                    {"role": "user", "content": dialog},
                ],
            )
            .choices[0]
            .message.content
        )
        return summary
Initialize database and view dashboard
from trulens.core import TruSession
from trulens.dashboard import run_dashboard
session = TruSession()
session.reset_database()
# If you have a database you can connect to, use a URL. For example:
# session = TruSession(database_url="postgresql://hostname/database?user=username&password=password")
run_dashboard(session, force=True)
Write feedback functions
We will now create the feedback functions that will evaluate the app. Remember that the criteria we are evaluating against are:
- Ground truth agreement: For this set of metrics, we will measure how similar the generated summary is to a human-created ground truth summary. We will use four different measures: BERT score, BLEU, ROUGE, and a measure where an LLM is prompted to produce a similarity score.
- Groundedness: For this measure, we will estimate whether the generated summary can be traced back to parts of the original transcript.
- Comprehensiveness: For this measure, we will estimate whether the generated summary contains all of the key points from the source text.
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
We construct the golden set from the dataset we downloaded, using each dialogue as the query and its human-written summary as the expected response.
golden_set = (
dev_df[["dialogue", "summary"]]
.rename(columns={"dialogue": "query", "summary": "response"})
.to_dict("records")
)
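The dev split contains a few hundred dialogues, so a full evaluation run makes many LLM calls. If you only want to try the pipeline end to end, you can optionally evaluate on a slice first; the choice of 25 below is arbitrary.

# Optional: work with a subset to keep API usage small (25 is an arbitrary choice).
# golden_set = golden_set[:25]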
from trulens.core import Select
from trulens.providers.huggingface import Huggingface
from trulens.providers.openai import OpenAI
provider = OpenAI(model_engine="gpt-4o")
hug_provider = Huggingface()
ground_truth_collection = GroundTruthAgreement(golden_set, provider=provider)
f_groundtruth = Feedback(
ground_truth_collection.agreement_measure, name="Similarity (LLM)"
).on_input_output()
f_bert_score = Feedback(ground_truth_collection.bert_score).on_input_output()
f_bleu = Feedback(ground_truth_collection.bleu).on_input_output()
f_rouge = Feedback(ground_truth_collection.rouge).on_input_output()
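As a quick sanity check, you can call one of these measures directly on a pair from the golden set. This assumes the measures accept the same (input, output) pair that on_input_output() supplies; scoring a dialogue against its own reference summary should land near the top of the range.

# Sanity check: score a reference summary against itself via the golden set lookup.
sample = golden_set[0]
ground_truth_collection.rouge(sample["query"], sample["response"])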
# Groundedness between the source dialog (record input) and the generated summary (record output).
f_groundedness_llm = (
Feedback(
provider.groundedness_measure_with_cot_reasons,
name="Groundedness - LLM Judge",
)
.on(Select.RecordInput)
.on(Select.RecordOutput)
)
f_groundedness_nli = (
Feedback(
hug_provider.groundedness_measure_with_nli,
name="Groundedness - NLI Judge",
)
.on(Select.RecordInput)
.on(Select.RecordOutput)
)
f_comprehensiveness = (
Feedback(
provider.comprehensiveness_with_cot_reasons, name="Comprehensiveness"
)
.on(Select.RecordInput)
.on(Select.RecordOutput)
)
provider.comprehensiveness_with_cot_reasons(
"the white house is white. obama is the president",
"the white house is white. obama is the president",
)
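You can spot-check the groundedness measures the same way. The argument order here is assumed to be (source, statement), mirroring how the selectors above pass the record input and output:

provider.groundedness_measure_with_cot_reasons(
    "the white house is white. obama is the president",
    "the white house is white.",
)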
Create the app and wrap it
Now we are ready to wrap our summarization app with TruLens as a TruCustomApp. Each time it is called, TruLens will log inputs, outputs, and any instrumented intermediate steps, and evaluate them with the feedback functions we created.
app = DialogSummaryApp()
print(app.summarize(dev_df.dialogue[498]))
tru_recorder = TruCustomApp(
app,
app_name="Summarize",
app_version="v1",
feedbacks=[
f_groundtruth,
f_groundedness_llm,
f_groundedness_nli,
f_comprehensiveness,
f_bert_score,
f_bleu,
f_rouge,
],
)
We can test a single run of the app like so. This should show up on the dashboard.
with tru_recorder:
    app.summarize(dialog=dev_df.dialogue[498])
We'll make a lot of queries in a short amount of time, so we use tenacity to retry with exponential backoff and make sure that most of our requests eventually go through.
from tenacity import retry
from tenacity import stop_after_attempt
from tenacity import wait_random_exponential
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def run_with_backoff(doc):
    return tru_recorder.with_record(app.summarize, dialog=doc)

for pair in golden_set:
    llm_response = run_with_backoff(pair["query"])
    print(llm_response)
And that's it! This might take a few minutes to run; at the end of it, you can explore the dashboard to see how well your app does.
from trulens.dashboard import run_dashboard
run_dashboard(session)
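If you prefer to inspect results programmatically rather than in the dashboard, TruSession exposes helpers for that as well. The method names below come from the TruLens session API, though return shapes may differ slightly across versions.

# Aggregate feedback scores per app version.
session.get_leaderboard()

# Full records plus feedback results as a DataFrame.
records_df, feedback_names = session.get_records_and_feedback()
records_df.head()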