trulens.benchmark.benchmark_frameworks.tru_benchmark_experiment

Classes

TruBenchmarkExperiment

Example

import os

import snowflake.connector

# Cortex, GroundTruthAggregator, BenchmarkParams, and the TruLens session are
# assumed to be imported or created elsewhere before running this example.
snowflake_connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_USER_PASSWORD"],
    "database": os.environ["SNOWFLAKE_DATABASE"],
    "schema": os.environ["SNOWFLAKE_SCHEMA"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
}
cortex = Cortex(
    snowflake.connector.connect(**snowflake_connection_parameters),
    model_engine="snowflake-arctic",
)

def context_relevance_ff_to_score(input, output, temperature=0):
    return cortex.context_relevance(question=input, context=output, temperature=temperature)

true_labels = [1, 0, 0, ...]  # ground truth labels collected from ground truth data collection
mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae

tru_benchmark_arctic = session.BenchmarkExperiment(
    app_name="MAE",
    feedback_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
    benchmark_params=BenchmarkParams(temperature=0.5),
)
Functions
__init__
__init__(
    feedback_fn: Callable,
    agg_funcs: List[AggCallable],
    benchmark_params: BenchmarkParams,
)

Create a benchmark experiment that defines a custom feedback function and aggregators to evaluate the feedback function on a ground truth dataset. A construction sketch follows the parameter descriptions below.

PARAMETER DESCRIPTION
feedback_fn

function that takes in a row of ground truth data and returns a score, typically produced by an LLM-as-judge

TYPE: Callable

agg_funcs

list of aggregation functions to compute metrics on the feedback scores

TYPE: List[AggCallable]

benchmark_params

benchmark configuration parameters

TYPE: BenchmarkParams
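
A minimal construction sketch. The import path below is inferred from this module's name, and context_relevance_ff_to_score and mae_agg_func are assumed to be defined as in the example above; verify the exact imports against your installed TruLens version.

from trulens.benchmark.benchmark_frameworks.tru_benchmark_experiment import (
    BenchmarkParams,
    TruBenchmarkExperiment,
)

# Bundle the judge function, the MAE aggregator, and the benchmark
# configuration into a single experiment object.
benchmark_experiment = TruBenchmarkExperiment(
    feedback_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
    benchmark_params=BenchmarkParams(temperature=0),
)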

run_score_generation_on_single_row
run_score_generation_on_single_row(
    feedback_fn: Callable, feedback_args: List[Any]
) -> Union[float, Tuple[float, float]]

Generate a score with the feedback function. A call sketch follows the return description below.

PARAMETER DESCRIPTION
feedback_fn

The function used to generate feedback scores.

TYPE: Callable

feedback_args

Arguments extracted from a single row of the ground truth dataset, passed to the feedback function.

TYPE: List[Any]

RETURNS DESCRIPTION
Union[float, Tuple[float, float]]

Union[float, Tuple[float, float]]: Feedback score (with metadata) after running the benchmark on a single entry in ground truth data.
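
A hedged call sketch, assuming benchmark_experiment and context_relevance_ff_to_score are defined as in the sketches above; the entries in feedback_args are hypothetical and must match the positional inputs your feedback function expects.

score = benchmark_experiment.run_score_generation_on_single_row(
    feedback_fn=context_relevance_ff_to_score,
    feedback_args=[
        "What is Snowflake Arctic?",         # question passed to the judge
        "Snowflake Arctic is an open LLM.",  # retrieved context to grade
    ],
)
# score is either a float or a (score, metadata) tuple, per the return type above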

__call__
__call__(
    ground_truth: DataFrame,
) -> Union[
    List[float],
    List[Tuple[float]],
    Tuple[List[float], List[float]],
]

Collect the list of generated feedback scores as input to the benchmark aggregation functions. Note that the order of generated scores must be preserved to match the order of the true labels. An invocation sketch follows the return description below.

PARAMETER DESCRIPTION
ground_truth

ground truth dataset / collection to evaluate the feedback function on

TYPE: DataFrame

RETURNS DESCRIPTION
Union[List[float], List[Tuple[float]], Tuple[List[float], List[float]]]

List[float]: feedback scores (or score tuples) after running the benchmark on all entries in the ground truth data, returned in the same order as the true labels
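
A sketch of invoking the experiment on a small ground truth DataFrame. The column names here are hypothetical; align them with the ground truth schema your feedback function reads from each row.

import pandas as pd

ground_truth_df = pd.DataFrame(
    {
        # hypothetical column names: adjust to your ground truth schema
        "query": ["What is Snowflake Arctic?", "Who founded Snowflake?"],
        "expected_response": [
            "Snowflake Arctic is an open LLM.",
            "Snowflake was founded in 2012.",
        ],
    }
)

# Scores come back in row order so they line up with true_labels for aggregation.
scores = benchmark_experiment(ground_truth_df)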

Functions

create_benchmark_experiment_app

create_benchmark_experiment_app(
    app_name: str,
    app_version: str,
    benchmark_experiment: TruBenchmarkExperiment,
    **kwargs
) -> TruCustomApp

Create a custom app for the special use case of benchmarking feedback functions. A usage sketch follows the return description below.

PARAMETER DESCRIPTION
app_name

user-defined name of the experiment run.

TYPE: str

app_version

user-defined version of the experiment run.

TYPE: str

benchmark_experiment

the TruBenchmarkExperiment instance defining the feedback function, aggregation functions, and benchmark parameters to evaluate.

TYPE: TruBenchmarkExperiment

RETURNS DESCRIPTION
TruCustomApp

Custom app wrapper for benchmarking feedback functions.
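
A hedged end-to-end sketch. It assumes a TruSession named session has already been created, that benchmark_experiment and ground_truth_df are defined as in the sketches above, and that the returned custom app follows the usual TruLens recorder pattern; the context-manager usage and the .app attribute are assumptions to verify against your installed version.

from trulens.benchmark.benchmark_frameworks.tru_benchmark_experiment import (
    create_benchmark_experiment_app,
)

tru_benchmark = create_benchmark_experiment_app(
    app_name="MAE",
    app_version="v1",
    benchmark_experiment=benchmark_experiment,
)

# Run the benchmark inside a recording context so the generated feedback
# scores and the aggregated metric (MAE) are logged for this app version.
with tru_benchmark as recording:
    feedback_results = tru_benchmark.app(ground_truth_df)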