
trulens.feedback.groundtruth

Classes

GroundTruthAgreement

Bases: WithClassInfo, SerialModel

Measures Agreement against a Ground Truth.

Attributes
tru_class_info instance-attribute
tru_class_info: Class

Class information of this pydantic object for use in deserialization.

Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.

Functions
__rich_repr__
__rich_repr__() -> Result

Requirement for pretty printing using the rich package.

load staticmethod
load(obj, *args, **kwargs)

Deserialize/load this object using the class information in tru_class_info to lookup the actual class that will do the deserialization.

model_validate classmethod
model_validate(*args, **kwargs) -> Any

Deserialize a jsonized version of the app into an instance of the class it was serialized from.

Note

This process uses extra information stored in the jsonized object and handled by WithClassInfo.

__init__
__init__(
    ground_truth: Union[
        List[Dict], Callable, DataFrame, FunctionOrMethod
    ],
    provider: Optional[LLMProvider] = None,
    bert_scorer: Optional[BERTScorer] = None,
    **kwargs
)

Measures Agreement against a Ground Truth.

Usage 1
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "ΒΏquien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
Usage 2
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
from trulens.core.session import TruSession

session = TruSession()
ground_truth_dataset = session.get_ground_truths_by_dataset("hotpotqa") # assuming a dataset "hotpotqa" has been created and persisted in the DB

ground_truth_collection = GroundTruthAgreement(ground_truth_dataset, provider=OpenAI())
Usage 3
import os

import snowflake.connector

from trulens.feedback import GroundTruthAgreement
from trulens.providers.cortex import Cortex

# `llm_app` is a user-defined callable that returns a ground truth response for a
# prompt, and `prompt` is a user-provided query string; both are assumed to exist.
ground_truth_imp = llm_app
response = llm_app(prompt)

snowflake_connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_USER_PASSWORD"],
    "database": os.environ["SNOWFLAKE_DATABASE"],
    "schema": os.environ["SNOWFLAKE_SCHEMA"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
}
ground_truth_collection = GroundTruthAgreement(
    ground_truth_imp,
    provider=Cortex(
        snowflake.connector.connect(**snowflake_connection_parameters),
        model_engine="mistral-7b",
    ),
)
PARAMETER DESCRIPTION
ground_truth

A list of query/response pairs, a DataFrame containing a ground truth dataset, or a callable that returns a ground truth string given a prompt string.

TYPE: Union[List[Dict], Callable, DataFrame, FunctionOrMethod]

provider

The provider to use for agreement measures.

TYPE: Optional[LLMProvider] DEFAULT: None

bert_scorer

Internal Usage for DB serialization.

TYPE: Optional[BERTScorer] DEFAULT: None

agreement_measure
agreement_measure(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses OpenAI's ChatGPT model. A function that measures similarity to ground truth. A second template is given to ChatGPT with a prompt asserting that the original response is correct, and it measures whether ChatGPT's previous response is similar.

Example

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "ΒΏquien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.agreement_measure).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide

PARAMETER DESCRIPTION
prompt

A text prompt to an agent.

TYPE: str

response

The agent's response to the prompt.

TYPE: str

RETURNS DESCRIPTION
float

A value between 0 and 1, where 0 means "not in agreement" and 1 means "in agreement".

TYPE: Union[float, Tuple[float, Dict[str, str]]]

dict

with key 'ground_truth_response'

TYPE: Union[float, Tuple[float, Dict[str, str]]]

ndcg_at_k
ndcg_at_k(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
    k: Optional[int] = None,
) -> float

Compute NDCG@k for a given query and retrieved context chunks.

PARAMETER DESCRIPTION
query

The input query string.

TYPE: str

retrieved_context_chunks

List of retrieved context chunks.

TYPE: List[str]

relevance_scores

Relevance scores for each retrieved chunk.

TYPE: Optional[List[float]] DEFAULT: None

k

Rank position up to which to compute NDCG. If None, compute for all retrieved chunks.

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
float

Computed NDCG@k score.

TYPE: float
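
Usage sketch (illustrative, not from the library's documentation): ndcg_at_k can be called directly with the documented arguments. The golden set, chunks, and relevance scores below are hypothetical.

from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

# Relevance scores are passed explicitly (illustrative values) so the sketch does
# not depend on how relevance is derived from the golden set.
ndcg = ground_truth_collection.ndcg_at_k(
    query="who invented the lightbulb?",
    retrieved_context_chunks=[
        "Thomas Edison patented a practical incandescent lightbulb in 1879.",
        "The Wright brothers flew the first powered airplane in 1903.",
    ],
    relevance_scores=[1.0, 0.0],
    k=2,
)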

precision_at_k
precision_at_k(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
    k: Optional[int] = None,
) -> float

Compute Precision@k for a given query and retrieved context chunks, accounting for ties in relevance scores.

PARAMETER DESCRIPTION
query

The input query string.

TYPE: str

retrieved_context_chunks

List of retrieved context chunks.

TYPE: List[str]

relevance_scores

Relevance scores for each retrieved chunk.

TYPE: Optional[List[float]] DEFAULT: None

k

Rank position up to which to compute Precision. If None, compute for all retrieved chunks.

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
float

Computed Precision@k score.

TYPE: float

recall_at_k
recall_at_k(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
    k: Optional[int] = None,
) -> float

Compute Recall@k for a given query and retrieved context chunks, accounting for ties in relevance scores.

PARAMETER DESCRIPTION
query

The input query string.

TYPE: str

retrieved_context_chunks

List of retrieved context chunks.

TYPE: List[str]

relevance_scores

Relevance scores for each retrieved chunk.

TYPE: Optional[List[float]] DEFAULT: None

k

Rank position up to which to compute Recall. If None, compute for all retrieved chunks.

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
float

Computed Recall@k score.

TYPE: float
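
Usage sketch covering both precision_at_k and recall_at_k (illustrative; only the documented arguments are used, and the chunks and relevance judgments are hypothetical).

# `ground_truth_collection` is constructed as in Usage 1 above.
query = "who invented the lightbulb?"
chunks = [
    "Thomas Edison patented a practical incandescent lightbulb in 1879.",
    "Early lightbulbs used carbonized bamboo filaments.",
    "The Wright brothers flew the first powered airplane in 1903.",
]
relevance = [1.0, 1.0, 0.0]  # illustrative relevance judgments

precision_at_2 = ground_truth_collection.precision_at_k(
    query=query, retrieved_context_chunks=chunks, relevance_scores=relevance, k=2
)
recall_at_2 = ground_truth_collection.recall_at_k(
    query=query, retrieved_context_chunks=chunks, relevance_scores=relevance, k=2
)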

mrr
mrr(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
) -> float

Compute Mean Reciprocal Rank (MRR) for a given query and retrieved context chunks.

PARAMETER DESCRIPTION
query

The input query string.

TYPE: str

retrieved_context_chunks

List of retrieved context chunks.

TYPE: List[str]

relevance_scores

Relevance scores for each retrieved chunk.

TYPE: Optional[List[float]] DEFAULT: None

RETURNS DESCRIPTION
float

Computed MRR score.

TYPE: float

ir_hit_rate
ir_hit_rate(
    query: str,
    retrieved_context_chunks: List[str],
    k: Optional[int] = None,
) -> float

Compute IR Hit Rate (Hit Rate@k) for a given query and retrieved context chunks.

PARAMETER DESCRIPTION
query

The input query string.

TYPE: str

retrieved_context_chunks

List of retrieved context chunks.

TYPE: List[str]

k

Rank position up to which to compute Hit Rate. If None, compute for all retrieved chunks.

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
float

Computed Hit Rate@k score.

TYPE: float
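
Usage sketch for mrr and ir_hit_rate, following their documented signatures; the inputs are reused from the precision/recall sketch above and remain hypothetical.

# `ground_truth_collection`, `query`, `chunks`, and `relevance` are as defined above.
mrr_score = ground_truth_collection.mrr(
    query=query, retrieved_context_chunks=chunks, relevance_scores=relevance
)
hit_rate_at_3 = ground_truth_collection.ir_hit_rate(
    query=query, retrieved_context_chunks=chunks, k=3
)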

absolute_error
absolute_error(
    prompt: str, response: str, score: float
) -> Tuple[float, Dict[str, float]]

Method to look up the numeric expected score from a golden set and take the difference.

Primarily used for evaluating model-generated feedback against human feedback.

Example
from trulens.core import Feedback, Select
from trulens.feedback import GroundTruthAgreement
from trulens.providers.bedrock import Bedrock

golden_set = [
    {"query": "How many stomachs does a cow have?", "expected_response": "Cows' diet relies primarily on grazing.", "expected_score": 0.4},
    {"query": "Name some top dental floss brands", "expected_response": "I don't know", "expected_score": 0.8}
]

bedrock = Bedrock(
    model_id="amazon.titan-text-express-v1", region_name="us-east-1"
)
ground_truth_collection = GroundTruthAgreement(golden_set, provider=bedrock)

f_groundtruth = (
    Feedback(ground_truth_collection.absolute_error)
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)

bert_score
bert_score(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses BERT Score. A function that measures similarity to ground truth using BERT embeddings.

Example

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "ΒΏquien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.bert_score).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide

PARAMETER DESCRIPTION
prompt

A text prompt to an agent.

TYPE: str

response

The agent's response to the prompt.

TYPE: str

RETURNS DESCRIPTION
float

A value between 0 and 1, where 0 means "not in agreement" and 1 means "in agreement".

TYPE: Union[float, Tuple[float, Dict[str, str]]]

dict

with key 'ground_truth_response'

TYPE: Union[float, Tuple[float, Dict[str, str]]]

bleu
bleu(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses BLEU Score. A function that measures similarity to ground truth using token overlap.

Example

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "ΒΏquien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.bleu).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide

PARAMETER DESCRIPTION
prompt

A text prompt to an agent.

TYPE: str

response

The agent's response to the prompt.

TYPE: str

RETURNS DESCRIPTION
float

A value between 0 and 1, where 0 means "not in agreement" and 1 means "in agreement".

TYPE: Union[float, Tuple[float, Dict[str, str]]]

dict

with key 'ground_truth_response'

TYPE: Union[float, Tuple[float, Dict[str, str]]]

rouge
rouge(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses ROUGE Score. A function that measures similarity to ground truth using token overlap.

PARAMETER DESCRIPTION
prompt

A text prompt to an agent.

TYPE: str

response

The agent's response to the prompt.

TYPE: str

RETURNS DESCRIPTION
float

A value between 0 and 1, where 0 means "not in agreement" and 1 means "in agreement".

TYPE: Union[float, Tuple[float, Dict[str, str]]]

dict

with key 'ground_truth_response'

TYPE: Union[float, Tuple[float, Dict[str, str]]]
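
The rouge method has the same signature as bleu and bert_score, so it can be wired into a Feedback in the same way; a minimal sketch:

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.rouge).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide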

GroundTruthAggregator

Bases: WithClassInfo, SerialModel

Aggregate benchmarking metrics for ground-truth-based evaluation on feedback functions.

Attributes
tru_class_info instance-attribute
tru_class_info: Class

Class information of this pydantic object for use in deserialization.

Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.

model_config class-attribute
model_config: dict = dict(
    arbitrary_types_allowed=True, extra="allow"
)


Functions
__rich_repr__
__rich_repr__() -> Result

Requirement for pretty printing using the rich package.

load staticmethod
load(obj, *args, **kwargs)

Deserialize/load this object using the class information in tru_class_info to lookup the actual class that will do the deserialization.

model_validate classmethod
model_validate(*args, **kwargs) -> Any

Deserialize a jsonized version of the app into an instance of the class it was serialized from.

Note

This process uses extra information stored in the jsonized object and handled by WithClassInfo.

register_custom_agg_func
register_custom_agg_func(
    name: str,
    func: Callable[
        [List[float], GroundTruthAggregator], float
    ],
) -> None

Register a custom aggregation function.
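
A minimal sketch of registering a custom aggregation function. The function here is hypothetical; it only needs to match the documented Callable[[List[float], GroundTruthAggregator], float] signature, and `aggregator` is assumed to be an existing GroundTruthAggregator instance.

from typing import List

from trulens.feedback.groundtruth import GroundTruthAggregator


def mean_score(scores: List[float], agg: GroundTruthAggregator) -> float:
    # Hypothetical aggregation: the plain mean of the feedback scores.
    return sum(scores) / len(scores) if scores else 0.0


aggregator.register_custom_agg_func(name="mean_score", func=mean_score)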

auc
auc(scores: List[float]) -> float

Calculate the area under the ROC curve. Can be used for meta-evaluation.

PARAMETER DESCRIPTION
scores

scores returned by feedback function

TYPE: List[float]

RETURNS DESCRIPTION
float

Area under the ROC curve

TYPE: float

kendall_tau
kendall_tau(
    scores: Union[List[float], List[List]]
) -> float

Calculate Kendall's tau. Can be used for meta-evaluation. Kendall’s tau is a measure of the correspondence between two rankings. Values close to 1 indicate strong agreement, values close to -1 indicate strong disagreement. This is the tau-b version of Kendall’s tau which accounts for ties.

PARAMETER DESCRIPTION
scores

scores returned by feedback function

TYPE: List[float]

RETURNS DESCRIPTION
float

Kendall's tau

TYPE: float

spearman_correlation
spearman_correlation(
    scores: Union[List[float], List[List]]
) -> float

Calculate the Spearman correlation. Can be used for meta-evaluation. The Spearman correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables).

PARAMETER DESCRIPTION
scores

scores returned by feedback function

TYPE: List[float]

RETURNS DESCRIPTION
float

Spearman correlation

TYPE: float

pearson_correlation
pearson_correlation(
    scores: Union[List[float], List[List]]
) -> float

Calculate the Pearson correlation. Can be used for meta-evaluation. The Pearson correlation coefficient is a measure of the linear relationship between two variables.

PARAMETER DESCRIPTION
scores

scores returned by feedback function

TYPE: List[float]

RETURNS DESCRIPTION
float

Pearson correlation

TYPE: float

matthews_correlation
matthews_correlation(
    scores: Union[List[float], List[List]]
) -> float

Calculate the Matthews correlation coefficient. Can be used for meta-evaluation. The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications.

PARAMETER DESCRIPTION
scores

scores returned by feedback function

TYPE: List[float]

RETURNS DESCRIPTION
float

Matthews correlation coefficient

TYPE: float

cohens_kappa
cohens_kappa(
    scores: Union[List[float], List[List]], threshold=0.5
) -> float

Computes Cohen's Kappa score between true labels and predicted scores.

Parameters:

- true_labels (list): A list of true labels.
- scores (list): A list of predicted labels or scores.

Returns:

- float: Cohen's Kappa score.

recall
recall(
    scores: Union[List[float], List[List]], threshold=0.5
)

Calculates recall given true labels and model-generated scores.

Parameters:

- scores (list of float): A list of model-generated scores (0 to 1.0).
- threshold (float): The threshold to convert scores to binary predictions. Default is 0.5.

Returns:

- float: The recall score.

precision
precision(
    scores: Union[List[float], List[List]], threshold=0.5
)

Calculates precision given true labels and model-generated scores.

Parameters:

- scores (list of float): A list of model-generated scores (0 to 1.0).
- threshold (float): The threshold to convert scores to binary predictions. Default is 0.5.

Returns:

- float: The precision score.

f1_score
f1_score(
    scores: Union[List[float], List[List]], threshold=0.5
)

Calculates the F1 score given true labels and model-generated scores.

Parameters:

- scores (list of float): A list of model-generated scores (0 to 1.0).
- threshold (float): The threshold to convert scores to binary predictions. Default is 0.5.

Returns:

- float: The F1 score.
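
The recall, precision, and f1_score aggregators above all binarize the model-generated scores with a threshold before comparing them to the true labels. A library-independent sketch of that thresholding step (illustrative, not the library's implementation):

scores = [0.9, 0.4, 0.7, 0.2]  # model-generated scores in [0, 1]
true_labels = [1, 0, 1, 1]     # illustrative ground-truth labels
threshold = 0.5

# Scores at or above the threshold become positive predictions (the exact
# tie-breaking rule here is an assumption).
predictions = [1 if s >= threshold else 0 for s in scores]

tp = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels))
fp = sum(p == 1 and t == 0 for p, t in zip(predictions, true_labels))
fn = sum(p == 0 and t == 1 for p, t in zip(predictions, true_labels))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0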

brier_score
brier_score(
    scores: Union[List[float], List[List]]
) -> float

Assess both calibration and sharpness of the probability estimates.

Parameters:

- scores (List[float]): Relevance scores returned by feedback function.

Returns:

- float: Brier score.

ece
ece(score_confidence_pairs: List[Tuple[float]]) -> float

Calculate the expected calibration error. Can be used for meta-evaluation.

PARAMETER DESCRIPTION
score_confidence_pairs

list of tuples of relevance scores and confidences returned by feedback function

TYPE: List[Tuple[float]]

RETURNS DESCRIPTION
float

Expected calibration error

TYPE: float
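
An illustrative sketch of what expected calibration error measures: samples are binned by confidence, and each bin contributes its sample-weighted gap between average confidence and average accuracy. This is a generic formulation, not necessarily the library's exact binning scheme, and the input pairs are hypothetical.

import numpy as np

score_confidence_pairs = [(1.0, 0.9), (0.0, 0.6), (1.0, 0.8), (1.0, 0.4)]
scores, confidences = map(np.array, zip(*score_confidence_pairs))

n_bins = 5
bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
ece = 0.0
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    in_bin = (confidences > lo) & (confidences <= hi)
    if in_bin.any():
        avg_confidence = confidences[in_bin].mean()
        avg_accuracy = scores[in_bin].mean()
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += in_bin.mean() * abs(avg_confidence - avg_accuracy)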

mae
mae(scores: Union[List[float], List[List]]) -> float

Calculate the mean absolute error. Can be used for meta-evaluation.

PARAMETER DESCRIPTION
scores

scores returned by feedback function

TYPE: List[float]

RETURNS DESCRIPTION
float

Mean absolute error

TYPE: float