📓 Context Relevance Benchmarking: ranking is all you need.¶
The numerical scoring scheme adopted by TruLens feedback functions is intuitive for generating aggregated results from eval runs that are easy to interpret and visualize across different applications of interest. However, it raises the question of how trustworthy these scores actually are, given that they are, at their core, next-token-prediction-style generations from carefully designed prompts. Consequently, these feedback functions face the typical challenges of large language models (LLMs) in rigorous production environments, including prompt sensitivity and non-determinism, especially when they rely on Mixture-of-Experts and model-as-a-service solutions like those from OpenAI.
Another frequent inquiry from the community concerns the intrinsic semantic significance, or lack thereof, of feedback scores. For example, how should one interpret and act on a context relevance score of 0.9 in a RAG application, and does a harmfulness score of 0.7 from GPT-3.5 mean the same thing as a 0.7 from Llama-2-7b?
For simpler meta-evaluation tasks, when human numerical scores are available in benchmark datasets such as SummEval, evaluating feedback functions is much more straightforward, as long as we can define a reasonable correspondence between the task of the feedback function and the tasks covered by the benchmark. Check out our preliminary work on evaluating our own groundedness feedback functions (https://www.trulens.org/trulens/groundedness_smoke_tests/#groundedness-evaluations) and our previous blog, where the groundedness metric in the context of RAG can be viewed as equivalent to the consistency metric defined in the SummEval benchmark. In those cases, calculating the mean absolute error (MAE) between our feedback scores and the golden set's human scores readily provides insight into how well the LLM-based groundedness feedback functions align with human preferences.
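As a concrete, purely illustrative sketch of that MAE comparison, both the human scores and the feedback scores below are made-up numbers:
import numpy as np

human_scores = np.array([0.8, 0.2, 1.0, 0.6])  # hypothetical golden-set annotations
feedback_scores = np.array([0.7, 0.3, 0.9, 0.5])  # hypothetical LLM-based feedback scores

mae = np.mean(np.abs(feedback_scores - human_scores))
print(f"MAE: {mae:.3f}")  # lower MAE = closer alignment with human preferences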
Yet, acquiring high-quality, numerically scored datasets is challenging and costly, a sentiment echoed across institutions and companies working on RLHF dataset annotation.
Observing that many information retrieval (IR) benchmarks use binary labels, we propose to frame the problem of evaluating LLM-based feedback functions (meta-evaluation) as evaluating a recommender system. In essence, we argue that the relative importance, i.e. the ranking induced by the score assignments, is all you need to achieve meta-evaluation against human golden sets. The intuition is that a feedback function is a sufficiently trustworthy proxy if it demonstrates discriminative capability: if it reliably and consistently assigns items, be they context chunks or generated responses, weights and orderings that closely mirror human preferences.
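To make the ranking view concrete, here is a minimal sketch (with made-up scores and labels) of how binary human relevance labels and LLM feedback scores can be compared purely through ranking metrics such as nDCG and Precision@1, using scikit-learn's ndcg_score:
import numpy as np
from sklearn.metrics import ndcg_score

# One query with five retrieved context chunks (hypothetical data).
human_labels = np.array([[1, 0, 1, 0, 0]])  # binary relevance labels from a golden set
feedback_scores = np.array([[0.9, 0.4, 0.7, 0.2, 0.3]])  # LLM context relevance scores

print("nDCG:", ndcg_score(human_labels, feedback_scores))

# Precision@1: is the top-scored chunk actually relevant?
top_1 = np.argsort(-feedback_scores[0])[:1]
print("Precision@1:", human_labels[0][top_1].mean())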
In the following section, we illustrate how we conduct meta-evaluation experiments on one of TruLens's most widely used feedback functions, context relevance, and share how well it aligns with human preferences in practice.
# pip install -q scikit-learn litellm trulens trulens-providers-openai trulens-providers-litellm
# Import the eval-as-recommendation benchmark utilities and the test-case generator
from benchmark_frameworks.eval_as_recommendation import compute_ece
from benchmark_frameworks.eval_as_recommendation import compute_ndcg
from benchmark_frameworks.eval_as_recommendation import precision_at_k
from benchmark_frameworks.eval_as_recommendation import recall_at_k
from benchmark_frameworks.eval_as_recommendation import score_passages
from test_cases import generate_ms_marco_context_relevance_benchmark
from trulens.core import TruSession
TruSession().reset_database()
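# Build the benchmark set from five local MS MARCO shards, each converted into
# context-relevance examples by the test-case generator.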
benchmark_data = []
for i in range(1, 6):
    dataset_path = f"./datasets/ms_marco/ms_marco_train_v2.1_{i}.json"
    benchmark_data.extend(
        list(generate_ms_marco_context_relevance_benchmark(dataset_path))
    )
import os
os.environ["OPENAI_API_KEY"] = "..."
os.environ["ANTHROPIC_API_KEY"] = "..."
import numpy as np
import pandas as pd
df = pd.DataFrame(benchmark_data)
df = df.iloc[:500]
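# Number of unique queries in this 500-row slice, plus a peek at the top rows per query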
print(len(df.groupby("query_id").count()))
df.groupby("query_id").head()
Define feedback functions for context relevance to be evaluated¶
from trulens.providers.litellm import LiteLLM
from trulens.providers.openai import OpenAI
# GPT-3.5 Turbo
gpt3_turbo = OpenAI(model_engine="gpt-3.5-turbo")

def wrapped_relevance_turbo(input, output, temperature=0.0):
    return gpt3_turbo.context_relevance(input, output, temperature=temperature)

gpt4 = OpenAI(model_engine="gpt-4-1106-preview")

def wrapped_relevance_gpt4(input, output, temperature=0.0):
    return gpt4.context_relevance(input, output, temperature=temperature)

# GPT-4 Turbo latest
gpt4_latest = OpenAI(model_engine="gpt-4-0125-preview")

def wrapped_relevance_gpt4_latest(input, output, temperature=0.0):
    return gpt4_latest.context_relevance(input, output, temperature=temperature)

# Anthropic
claude_2 = LiteLLM(model_engine="claude-2")

def wrapped_relevance_claude2(input, output, temperature=0.0):
    return claude_2.context_relevance(input, output, temperature=temperature)

claude_2_1 = LiteLLM(model_engine="claude-2.1")

def wrapped_relevance_claude21(input, output, temperature=0.0):
    return claude_2_1.context_relevance(input, output, temperature=temperature)
# Map each model name to its wrapped context relevance feedback function
feedback_functions = {
"GPT-3.5-Turbo": wrapped_relevance_turbo,
"GPT-4-Turbo": wrapped_relevance_gpt4,
"GPT-4-Turbo-latest": wrapped_relevance_gpt4_latest,
"Claude-2": wrapped_relevance_claude2,
"Claude-2.1": wrapped_relevance_claude21,
}
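# Per-provider backoff values passed to score_passages below (presumably the wait
# time, in seconds, between scoring calls to avoid provider rate limits).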
backoffs_by_functions = {
"GPT-3.5-Turbo": 0.5,
"GPT-4-Turbo": 0.5,
"GPT-4-Turbo-latest": 0.5,
"Claude-2": 1,
"Claude-2.1": 1,
}
# Running the benchmark
results = []

K = 5  # for Recall@K; Precision is computed at K=1 below
# sample_size: number of samples drawn per call when estimating the conditional
# probabilities (log probs) generated by the LLMs; 1 means a single pass per item
sample_size = 1
for name, func in feedback_functions.items():
    try:
        scores, groundtruths = score_passages(
            df,
            name,
            func,
            backoffs_by_functions.get(name, 0.5),
            n=sample_size,
        )

        df_score_groundtruth_pairs = pd.DataFrame({
            "scores": scores,
            "groundtruth (human-preferences of relevancy)": groundtruths,
        })
        df_score_groundtruth_pairs.to_csv(
            f"./results/{name}_score_groundtruth_pairs.csv"
        )

        ndcg_value = compute_ndcg(scores, groundtruths)
        ece_value = compute_ece(scores, groundtruths)
        precision_k = np.mean([
            precision_at_k(sc, tr, 1) for sc, tr in zip(scores, groundtruths)
        ])
        recall_k = np.mean([
            recall_at_k(sc, tr, K) for sc, tr in zip(scores, groundtruths)
        ])
        results.append((name, ndcg_value, ece_value, recall_k, precision_k))
        print(f"Finished running feedback function name {name}")

        print("Saving results...")
        tmp_results_df = pd.DataFrame(
            results,
            columns=["Model", "nDCG", "ECE", f"Recall@{K}", "Precision@1"],
        )
        print(tmp_results_df)
        tmp_results_df.to_csv("./results/tmp_context_relevance_benchmark.csv")
    except Exception as e:
        print(
            f"Failed to run benchmark for feedback function name {name} due to {e}"
        )
# Convert results to DataFrame for display
results_df = pd.DataFrame(
results, columns=["Model", "nDCG", "ECE", f"Recall@{K}", "Precision@1"]
)
results_df.to_csv("./results/all_context_relevance_benchmark.csv")
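For reference, the ECE column reports an expected calibration error computed over the (score, binary relevance) pairs. The project's compute_ece lives in benchmark_frameworks; the sketch below is only a generic binned version of the same idea and may differ from it in details:
import numpy as np

def simple_ece(scores, labels, n_bins=10):
    """Generic binned expected calibration error on (score, binary label) pairs."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # include the right edge only in the last bin
        mask = (scores >= lo) & ((scores <= hi) if hi == 1.0 else (scores < hi))
        if mask.any():
            confidence = scores[mask].mean()  # average predicted relevance in the bin
            accuracy = labels[mask].mean()  # empirical fraction of relevant items
            ece += mask.mean() * abs(accuracy - confidence)
    return ece

# Hypothetical example: scores that track the labels well yield a small ECE.
print(simple_ece([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))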
Visualization¶
import matplotlib.pyplot as plt
# Make sure results_df is defined and contains the necessary columns
# Also, ensure that K is defined
plt.figure(figsize=(12, 10))
# Graph for nDCG, Recall@K, and Precision@K
plt.subplot(2, 1, 1) # First subplot
ax1 = results_df.plot(
x="Model",
y=["nDCG", f"Recall@{K}", "Precision@1"],
kind="bar",
ax=plt.gca(),
)
plt.title("Feedback Function Performance (Higher is Better)")
plt.ylabel("Score")
plt.xticks(rotation=45)
plt.legend(loc="upper left")
# Graph for ECE
plt.subplot(2, 1, 2) # Second subplot
ax2 = results_df.plot(
x="Model", y=["ECE"], kind="bar", ax=plt.gca(), color="orange"
)
plt.title("Feedback Function Calibration (Lower is Better)")
plt.ylabel("ECE")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
results_df