trulens.feedback¶

Classes¶

GroundTruthAggregator ¶

Bases: WithClassInfo, SerialModel

Attributes¶

tru_class_info `instance-attribute` ¶

tru_class_info: Class

Class information of this pydantic object for use in deserialization.

Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.

model_config `class-attribute` ¶

model_config: dict = dict(
    arbitrary_types_allowed=True, extra="allow"
)

Aggregate benchmarking metrics for ground-truth-based evaluation on feedback functions.

Functions¶

__rich_repr__ ¶

__rich_repr__() -> Result

Requirement for pretty printing using the rich package.

load `staticmethod` ¶

load(obj, *args, **kwargs)

Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.

model_validate `classmethod` ¶

model_validate(*args, **kwargs) -> Any

Deserialize a jsonized version of the app into an instance of the class it was serialized from.

Note

This process uses extra information stored in the jsonized object and handled by WithClassInfo.

register_custom_agg_func ¶

register_custom_agg_func(
    name: str,
    func: Callable[
        [List[float], GroundTruthAggregator], float
    ],
) -> None

Register a custom aggregation function.

auc ¶

auc(scores: List[float]) -> float

Calculate the area under the ROC curve. Can be used for meta-evaluation.

PARAMETER	DESCRIPTION
`scores`	scores returned by feedback function TYPE: `List[float]`

RETURNS	DESCRIPTION
`float`	Area under the ROC curve TYPE: `float`
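
As a rough illustration of what the area under the ROC curve measures, the sketch below computes it in plain Python as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting half. The function name `roc_auc` and the explicit `true_labels` argument are assumptions made for this example; this is not the library's implementation.

```python
from typing import List


def roc_auc(true_labels: List[int], scores: List[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    positives = [s for s, y in zip(scores, true_labels) if y == 1]
    negatives = [s for s, y in zip(scores, true_labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to scores that do not separate the two classes at all, and 1.0 to perfect separation.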

kendall_tau ¶

kendall_tau(
    scores: Union[List[float], List[List]]
) -> float

Calculate Kendall's tau. Can be used for meta-evaluation. Kendall’s tau is a measure of the correspondence between two rankings. Values close to 1 indicate strong agreement, values close to -1 indicate strong disagreement. This is the tau-b version of Kendall’s tau which accounts for ties.

PARAMETER	DESCRIPTION
`scores`	scores returned by feedback function TYPE: `List[float]`

RETURNS	DESCRIPTION
`float`	Kendall's tau TYPE: `float`
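
To make the tie handling of the tau-b variant concrete, here is a minimal pure-Python sketch. The name `kendall_tau_b` and the two-list signature are assumptions for illustration; the library method takes only `scores` and obtains the reference ranking elsewhere.

```python
from math import sqrt
from typing import List


def kendall_tau_b(x: List[float], y: List[float]) -> float:
    """Kendall's tau-b between two equally long lists of scores."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to no term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    if denominator == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denominator
```

The tie counts only enlarge the denominator, so ties pull the statistic toward 0 rather than being counted as agreement or disagreement.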

spearman_correlation ¶

spearman_correlation(
    scores: Union[List[float], List[List]]
) -> float

Calculate the Spearman correlation. Can be used for meta-evaluation. The Spearman correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables).

PARAMETER	DESCRIPTION
`scores`	scores returned by feedback function TYPE: `List[float]`

RETURNS	DESCRIPTION
`float`	Spearman correlation TYPE: `float`
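
The rank-based nature of the Spearman coefficient can be sketched as "rank both lists, then take the Pearson correlation of the ranks". The helper names below are assumptions for this example, and tied values are given the average of their rank positions.

```python
from math import sqrt
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one list is constant
    return cov / sqrt(var_x * var_y)
```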

pearson_correlation ¶

pearson_correlation(
    scores: Union[List[float], List[List]]
) -> float

Calculate the Pearson correlation. Can be used for meta-evaluation. The Pearson correlation coefficient is a measure of the linear relationship between two variables.

PARAMETER	DESCRIPTION
`scores`	scores returned by feedback function TYPE: `List[float]`

RETURNS	DESCRIPTION
`float`	Pearson correlation TYPE: `float`
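
A short sketch of the Pearson coefficient shows why it captures linear rather than merely monotonic relationships. The function name `pearson` and its two-list signature are assumptions for illustration only.

```python
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one list is constant
    return cov / sqrt(var_x * var_y)
```

For example, `pearson([1, 2, 3], [1, 4, 9])` is slightly below 1 because the relationship is monotonic but not linear.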

matthews_correlation ¶

matthews_correlation(
    scores: Union[List[float], List[List]]
) -> float

Calculate the Matthews correlation coefficient. Can be used for meta-evaluation. The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications.

PARAMETER	DESCRIPTION
`scores`	scores returned by feedback function TYPE: `List[float]`

RETURNS	DESCRIPTION
`float`	Matthews correlation coefficient TYPE: `float`
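
For the binary case, the Matthews correlation coefficient can be written directly in terms of the confusion matrix. The sketch below is an assumption-laden illustration (hypothetical name `matthews_binary`, labels already binarized to 0/1), not the library's code.

```python
from math import sqrt
from typing import List


def matthews_binary(true_labels: List[int], predictions: List[int]) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator
```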

cohens_kappa ¶

cohens_kappa(
    scores: Union[List[float], List[List]], threshold=0.5
) -> float

Computes Cohen's Kappa score between true labels and predicted scores.

Parameters:
- true_labels (list): A list of true labels.
- scores (list): A list of predicted labels or scores.

Returns:
- float: Cohen's Kappa score.
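
A minimal sketch of Cohen's kappa for binary labels follows. It assumes that scores at or above `threshold` count as positive predictions; whether the library uses `>=` or `>` is not stated here, and the name `cohens_kappa_binary` is hypothetical.

```python
from typing import List


def cohens_kappa_binary(
    true_labels: List[int], scores: List[float], threshold: float = 0.5
) -> float:
    """kappa = (p_observed - p_expected) / (1 - p_expected)."""
    predictions = [1 if s >= threshold else 0 for s in scores]  # assumed rule
    n = len(true_labels)
    p_observed = sum(1 for t, p in zip(true_labels, predictions) if t == p) / n
    p_true_pos = sum(true_labels) / n
    p_pred_pos = sum(predictions) / n
    # Agreement expected if labels and predictions were independent.
    p_expected = p_true_pos * p_pred_pos + (1 - p_true_pos) * (1 - p_pred_pos)
    if p_expected == 1:
        return float("nan")  # undefined when chance agreement is certain
    return (p_observed - p_expected) / (1 - p_expected)
```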

recall ¶

recall(
    scores: Union[List[float], List[List]], threshold=0.5
)

Calculates recall given true labels and model-generated scores.

Parameters:
- scores (list of float): A list of model-generated scores (0 to 1.0).
- threshold (float): The threshold to convert scores to binary predictions. Default is 0.5.

Returns:
- float: The recall score.

precision ¶

precision(
    scores: Union[List[float], List[List]], threshold=0.5
)

Calculates precision given true labels and model-generated scores.

Parameters:
- scores (list of float): A list of model-generated scores (0 to 1.0).
- threshold (float): The threshold to convert scores to binary predictions. Default is 0.5.

Returns:
- float: The precision score.

f1_score ¶

f1_score(
    scores: Union[List[float], List[List]], threshold=0.5
)

Calculates the F1 score given true labels and model-generated scores.

Parameters:
- scores (list of float): A list of model-generated scores (0 to 1.0).
- threshold (float): The threshold to convert scores to binary predictions. Default is 0.5.

Returns:
- float: The F1 score.
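
The thresholding step shared by recall, precision, and F1 can be sketched as below. This is an illustration under stated assumptions: the helper name `precision_recall_f1` is hypothetical, true labels are passed explicitly, and a score equal to the threshold is treated as a positive prediction.

```python
from typing import List, Tuple


def precision_recall_f1(
    true_labels: List[int], scores: List[float], threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Binarize scores at `threshold`, then compute the three metrics."""
    predictions = [1 if s >= threshold else 0 for s in scores]  # assumed rule
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Raising the threshold typically trades recall for precision, which is why all three methods expose it as a parameter.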

brier_score ¶

brier_score(
    scores: Union[List[float], List[List]]
) -> float

Assess both calibration and sharpness of the probability estimates.

PARAMETER	DESCRIPTION
`scores`	relevance scores returned by feedback function TYPE: `List[float]`

RETURNS	DESCRIPTION
`float`	Brier score TYPE: `float`
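
As a sketch of the underlying formula, the Brier score is the mean squared difference between predicted probabilities and the 0/1 outcomes. The function name and the explicit `true_labels` argument are assumptions made for this example.

```python
from typing import List


def brier(true_labels: List[int], scores: List[float]) -> float:
    """Mean of (score - label)^2; 0.0 is a perfect score, lower is better."""
    if len(true_labels) != len(scores):
        raise ValueError("true_labels and scores must have the same length")
    squared_errors = [(s - y) ** 2 for s, y in zip(scores, true_labels)]
    return sum(squared_errors) / len(squared_errors)
```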

ece ¶

ece(score_confidence_pairs: List[Tuple[float]]) -> float

Calculate the expected calibration error. Can be used for meta-evaluation.

PARAMETER	DESCRIPTION
`score_confidence_pairs`	list of tuples of relevance scores and confidences returned by feedback function TYPE: `List[Tuple[float]]`

RETURNS	DESCRIPTION
`float`	Expected calibration error TYPE: `float`
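
The general idea of expected calibration error can be sketched as follows: group predictions into equal-width confidence bins and average the gap between accuracy and mean confidence per bin, weighted by bin size. The function below is a generic illustration, not the library's method; its name, the `correct` flags, and the `n_bins` parameter are all assumptions.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[int], n_bins: int = 5
) -> float:
    """Weighted mean of |accuracy - confidence| over equal-width bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lower, upper = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lower, upper]; a confidence of exactly 0.0
        # therefore falls into no bin in this sketch.
        members = [i for i, c in enumerate(confidences) if lower < c <= upper]
        if not members:
            continue
        accuracy = sum(correct[i] for i in members) / len(members)
        mean_confidence = sum(confidences[i] for i in members) / len(members)
        ece += abs(accuracy - mean_confidence) * len(members) / n
    return ece
```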

mae ¶

mae(scores: Union[List[float], List[List]]) -> float

Calculate the mean absolute error. Can be used for meta-evaluation.

PARAMETER	DESCRIPTION
`scores`	scores returned by feedback function TYPE: `List[float]`

RETURNS	DESCRIPTION
`float`	Mean absolute error TYPE: `float`
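
For reference, a minimal sketch of mean absolute error between feedback scores and reference scores. The function name and the explicit `expected` argument are assumptions for illustration.

```python
from typing import List


def mean_absolute_error(expected: List[float], scores: List[float]) -> float:
    """Average of |score - expected| over all pairs."""
    if len(expected) != len(scores):
        raise ValueError("expected and scores must have the same length")
    return sum(abs(s - e) for s, e in zip(scores, expected)) / len(scores)
```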

GroundTruthAgreement ¶

Bases: WithClassInfo, SerialModel

Measures Agreement against a Ground Truth.

Attributes¶

tru_class_info `instance-attribute` ¶

tru_class_info: Class

Class information of this pydantic object for use in deserialization.

Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.

Functions¶

__rich_repr__ ¶

__rich_repr__() -> Result

Requirement for pretty printing using the rich package.

load `staticmethod` ¶

load(obj, *args, **kwargs)

Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.

model_validate `classmethod` ¶

model_validate(*args, **kwargs) -> Any

Deserialize a jsonized version of the app into an instance of the class it was serialized from.

Note

This process uses extra information stored in the jsonized object and handled by WithClassInfo.

__init__ ¶

__init__(
    ground_truth: Union[
        List[Dict], Callable, DataFrame, FunctionOrMethod
    ],
    provider: Optional[LLMProvider] = None,
    bert_scorer: Optional[BERTScorer] = None,
    **kwargs
)

Measures Agreement against a Ground Truth.

Usage 1

from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

Usage 2

from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
from trulens.core.session import TruSession

session = TruSession()
ground_truth_dataset = session.get_ground_truths_by_dataset("hotpotqa") # assuming a dataset "hotpotqa" has been created and persisted in the DB

ground_truth_collection = GroundTruthAgreement(ground_truth_dataset, provider=OpenAI())

Usage 3

import os

from snowflake.snowpark import Session
from trulens.feedback import GroundTruthAgreement
from trulens.providers.cortex import Cortex

ground_truth_imp = llm_app  # a callable that returns a ground truth string given a prompt string
response = llm_app(prompt)

snowflake_connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_USER_PASSWORD"],
    "database": os.environ["SNOWFLAKE_DATABASE"],
    "schema": os.environ["SNOWFLAKE_SCHEMA"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
}

snowpark_session = Session.builder.configs(snowflake_connection_parameters).create()

ground_truth_collection = GroundTruthAgreement(
    ground_truth_imp,
    provider=Cortex(
        snowpark_session=snowpark_session,
        model_engine="mistral-7b",
    ),
)

PARAMETER	DESCRIPTION
`ground_truth`	A list of query/response pairs or a function, or a dataframe containing ground truth dataset, or callable that returns a ground truth string given a prompt string. TYPE: `Union[List[Dict], Callable, DataFrame, FunctionOrMethod]`
`provider`	The provider to use for agreement measures. TYPE: `Optional[LLMProvider]` DEFAULT: `None`
`bert_scorer`	Internal Usage for DB serialization. TYPE: `Optional[BERTScorer]` DEFAULT: `None`

agreement_measure ¶

agreement_measure(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses OpenAI's ChatGPT model. A function that measures similarity to ground truth. A second template is given to ChatGPT with a prompt stating that the original response is correct, and the function measures whether the previous ChatGPT response is similar.

Example

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.agreement_measure).on_input_output()

The on_input_output() selector can be changed. See the Feedback Function Guide.

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not in agreement" and 1 being "in agreement". TYPE: `Union[float, Tuple[float, Dict[str, str]]]`
`dict`	with key 'ground_truth_response' TYPE: `Union[float, Tuple[float, Dict[str, str]]]`

ndcg_at_k ¶

ndcg_at_k(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
    k: Optional[int] = None,
) -> float

Compute NDCG@k for a given query and retrieved context chunks.

PARAMETER	DESCRIPTION
`query`	The input query string. TYPE: `str`
`retrieved_context_chunks`	List of retrieved context chunks. TYPE: `List[str]`
`relevance_scores`	Relevance scores for each retrieved chunk. TYPE: `Optional[List[float]]` DEFAULT: `None`
`k`	Rank position up to which to compute NDCG. If None, compute for all retrieved chunks. TYPE: `Optional[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`float`	Computed NDCG@k score. TYPE: `float`
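
A minimal sketch of NDCG@k, assuming `relevances` holds the ground-truth relevance of each retrieved chunk in retrieval order and that a linear gain (rather than an exponential one) is used. Both choices, and the function names, are assumptions made for illustration.

```python
from math import log2
from typing import List, Optional


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(relevances: List[float], k: Optional[int] = None) -> float:
    """DCG of the first k results divided by the DCG of the ideal order."""
    if k is None:
        k = len(relevances)
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant chunk at all
    return dcg(relevances[:k]) / ideal_dcg
```

The normalization by the ideal ordering is what keeps the result between 0 and 1 regardless of how many chunks were retrieved.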

precision_at_k ¶

precision_at_k(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
    k: Optional[int] = None,
) -> float

Compute Precision@k for a given query and retrieved context chunks, considering tie handling.

PARAMETER	DESCRIPTION
`query`	The input query string. TYPE: `str`
`retrieved_context_chunks`	List of retrieved context chunks. TYPE: `List[str]`
`relevance_scores`	Relevance scores for each retrieved chunk. TYPE: `Optional[List[float]]` DEFAULT: `None`
`k`	Rank position up to which to compute Precision. If None, compute for all retrieved chunks. TYPE: `Optional[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`float`	Computed Precision@k score. TYPE: `float`

recall_at_k ¶

recall_at_k(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
    k: Optional[int] = None,
) -> float

Compute Recall@k for a given query and retrieved context chunks, considering tie handling.

PARAMETER	DESCRIPTION
`query`	The input query string. TYPE: `str`
`retrieved_context_chunks`	List of retrieved context chunks. TYPE: `List[str]`
`relevance_scores`	Relevance scores for each retrieved chunk. TYPE: `Optional[List[float]]` DEFAULT: `None`
`k`	Rank position up to which to compute Recall. If None, compute for all retrieved chunks. TYPE: `Optional[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`float`	Computed Recall@k score. TYPE: `float`
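
Precision@k and Recall@k differ only in their denominator, which the sketch below makes explicit. It deliberately omits the tie handling mentioned above, and the function names, the binary `relevant_flags` list (one flag per retrieved chunk, in retrieval order), and the `total_relevant` argument are assumptions for this example.

```python
from typing import List


def simple_precision_at_k(relevant_flags: List[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = relevant_flags[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0


def simple_recall_at_k(relevant_flags: List[int], total_relevant: int, k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(relevant_flags[:k]) / total_relevant
```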

mrr ¶

mrr(
    query: str,
    retrieved_context_chunks: List[str],
    relevance_scores: Optional[List[float]] = None,
) -> float

Compute Mean Reciprocal Rank (MRR) for a given query and retrieved context chunks.

PARAMETER	DESCRIPTION
`query`	The input query string. TYPE: `str`
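`retrieved_context_chunks`	List of retrieved context chunks. TYPE: `List[str]`
`relevance_scores`	Relevance scores for each retrieved chunk. TYPE: `Optional[List[float]]` DEFAULT: `None`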

RETURNS	DESCRIPTION
`float`	Computed MRR score. TYPE: `float`
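
For a single query, the quantity behind MRR is the reciprocal of the rank of the first relevant chunk; the mean is then taken over queries. The sketch below assumes binary relevance flags in retrieval order, and its function names are hypothetical.

```python
from typing import List


def reciprocal_rank(relevant_flags: List[int]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is relevant."""
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_query_flags: List[List[int]]) -> float:
    """Average reciprocal rank over several queries."""
    if not per_query_flags:
        return 0.0
    return sum(reciprocal_rank(f) for f in per_query_flags) / len(per_query_flags)
```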

ir_hit_rate ¶

ir_hit_rate(
    query: str,
    retrieved_context_chunks: List[str],
    k: Optional[int] = None,
) -> float

Compute IR Hit Rate (Hit Rate@k) for a given query and retrieved context chunks.

PARAMETER	DESCRIPTION
`query`	The input query string. TYPE: `str`
`retrieved_context_chunks`	List of retrieved context chunks. TYPE: `List[str]`
`k`	Rank position up to which to compute Hit Rate. If None, compute for all retrieved chunks. TYPE: `Optional[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`float`	Computed Hit Rate@k score. TYPE: `float`
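
Hit rate for one query reduces to a yes/no check on the top k results, as in this minimal sketch. The name `hit_rate_at_k` and the binary `relevant_flags` input are assumptions made for illustration.

```python
from typing import List, Optional


def hit_rate_at_k(relevant_flags: List[int], k: Optional[int] = None) -> float:
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    top_k = relevant_flags if k is None else relevant_flags[:k]
    return 1.0 if any(top_k) else 0.0
```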

absolute_error ¶

absolute_error(
    prompt: str, response: str, score: float
) -> Tuple[float, Dict[str, float]]

Method to look up the numeric expected score from a golden set and take the difference.

Primarily used for evaluating model-generated feedback against human feedback.

Example

from trulens.core import Feedback
from trulens.core import Select
from trulens.feedback import GroundTruthAgreement
from trulens.providers.bedrock import Bedrock

golden_set = [
    {"query": "How many stomachs does a cow have?", "expected_response": "Cows' diet relies primarily on grazing.", "expected_score": 0.4},
    {"query": "Name some top dental floss brands", "expected_response": "I don't know", "expected_score": 0.8}
]

bedrock = Bedrock(
    model_id="amazon.titan-text-express-v1", region_name="us-east-1"
)
ground_truth_collection = GroundTruthAgreement(golden_set, provider=bedrock)

# Select the first two positional arguments of the first recorded call, plus the app output.
f_groundtruth = (
    Feedback(ground_truth_collection.absolute_error)
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)

bert_score ¶

bert_score(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses BERT Score. A function that measures similarity to ground truth using BERT embeddings.

Example

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.bert_score).on_input_output()

The on_input_output() selector can be changed. See the Feedback Function Guide.

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not in agreement" and 1 being "in agreement". TYPE: `Union[float, Tuple[float, Dict[str, str]]]`
`dict`	with key 'ground_truth_response' TYPE: `Union[float, Tuple[float, Dict[str, str]]]`

bleu ¶

bleu(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses BLEU Score. A function that measures similarity to ground truth using token overlap.

Example

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.bleu).on_input_output()

The on_input_output() selector can be changed. See the Feedback Function Guide.

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not in agreement" and 1 being "in agreement". TYPE: `Union[float, Tuple[float, Dict[str, str]]]`
`dict`	with key 'ground_truth_response' TYPE: `Union[float, Tuple[float, Dict[str, str]]]`

rouge ¶

rouge(
    prompt: str, response: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Uses ROUGE Score. A function that measures similarity to ground truth using token overlap.

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not in agreement" and 1 being "in agreement". TYPE: `Union[float, Tuple[float, Dict[str, str]]]`
`dict`	with key 'ground_truth_response' TYPE: `Union[float, Tuple[float, Dict[str, str]]]`

LLMProvider ¶

Bases: Provider

An LLM-based provider.

This is an abstract class and needs to be initialized as one of these:

OpenAI and subclass AzureOpenAI.
Bedrock.
LiteLLM. LiteLLM provides an interface to a wide range of models.
Langchain.

Attributes¶

tru_class_info `instance-attribute` ¶

tru_class_info: Class

Class information of this pydantic object for use in deserialization.

Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.

endpoint `class-attribute` `instance-attribute` ¶

endpoint: Optional[Endpoint] = None

Endpoint supporting this provider.

Remote API invocations are handled by the endpoint.

Functions¶

__rich_repr__ ¶

__rich_repr__() -> Result

Requirement for pretty printing using the rich package.

load `staticmethod` ¶

load(obj, *args, **kwargs)

Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.

model_validate `classmethod` ¶

model_validate(*args, **kwargs) -> Any

Deserialize a jsonized version of the app into an instance of the class it was serialized from.

Note

This process uses extra information stored in the jsonized object and handled by WithClassInfo.

_create_chat_completion ¶

_create_chat_completion(
    prompt: Optional[str] = None,
    messages: Optional[Sequence[Dict]] = None,
    **kwargs
) -> str

Chat Completion Model

RETURNS	DESCRIPTION
`str`	Completion model response. TYPE: `str`

generate_score ¶

generate_score(
    system_prompt: str,
    user_prompt: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 10,
    temperature: float = 0.0,
) -> float

Base method to generate a score normalized to 0 to 1, used for evaluation.

PARAMETER	DESCRIPTION
`system_prompt`	A pre-formatted system prompt. TYPE: `str`
`user_prompt`	An optional user prompt. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. TYPE: `int` DEFAULT: `10`
`temperature`	The temperature for the LLM response. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	The score on a 0-1 scale.
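
The normalization to a 0-1 scale can be pictured as a linear rescaling of the raw score from the range [min_score_val, max_score_val]. The sketch below is an assumption about the arithmetic, not the library's code; in particular, the function name `normalize_score` and the clipping of out-of-range values are choices made for this example.

```python
def normalize_score(
    raw_score: float, min_score_val: int = 0, max_score_val: int = 10
) -> float:
    """Linearly map raw_score from [min_score_val, max_score_val] to [0, 1]."""
    if max_score_val <= min_score_val:
        raise ValueError("max_score_val must be greater than min_score_val")
    # Clipping is an assumption: it keeps a stray LLM answer inside [0, 1].
    clipped = min(max(raw_score, min_score_val), max_score_val)
    return (clipped - min_score_val) / (max_score_val - min_score_val)
```

With the defaults, a raw score of 7 maps to 0.7; with the 0 to 3 range used by several feedback functions on this page, a raw score of 2 maps to roughly 0.67.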

generate_score_and_reasons ¶

generate_score_and_reasons(
    system_prompt: str,
    user_prompt: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 10,
    temperature: float = 0.0,
) -> Tuple[float, Dict]

Base method to generate a score and reason, used for evaluation.

PARAMETER	DESCRIPTION
`system_prompt`	A pre-formatted system prompt. TYPE: `str`
`user_prompt`	An optional user prompt. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. TYPE: `int` DEFAULT: `10`
`temperature`	The temperature for the LLM response. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	The score on a 0-1 scale.
`Dict`	Reason metadata if returned by the LLM.

_determine_output_space ¶

_determine_output_space(
    min_score_val: int, max_score_val: int
) -> str

Determines the output space based on min_score_val and max_score_val.

PARAMETER	DESCRIPTION
`min_score_val`	Minimum value for the score range. TYPE: `int`
`max_score_val`	Maximum value for the score range. TYPE: `int`

RETURNS	DESCRIPTION
`str`	The corresponding output space. TYPE: `str`

context_relevance ¶

context_relevance(
    question: str,
    context: str,
    criteria: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the relevance of the context to the question.

Example

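import numpy as np
from trulens.apps.langchain import TruChain
from trulens.core import Feedback

# `provider` is an LLMProvider instance and `rag_app` the wrapped app, both defined elsewhere.
context = TruChain.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)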

PARAMETER	DESCRIPTION
`question`	A question being asked. TYPE: `str`
`context`	Context related to the question. TYPE: `str`
`criteria`	If provided, overrides the evaluation criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not relevant) and 1.0 (relevant). TYPE: `float`

context_relevance_with_cot_reasons ¶

context_relevance_with_cot_reasons(
    question: str,
    context: str,
    criteria: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also uses chain of thought methodology and emits the reasons.

Example

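import numpy as np
from trulens.apps.langchain import TruChain
from trulens.core import Feedback

# `provider` is an LLMProvider instance and `rag_app` the wrapped app, both defined elsewhere.
context = TruChain.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)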

PARAMETER	DESCRIPTION
`question`	A question being asked. TYPE: `str`
`context`	Context related to the question. TYPE: `str`
`criteria`	If provided, overrides the evaluation criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not relevant" and 1 being "relevant". TYPE: `Tuple[float, Dict]`

relevance ¶

relevance(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt.

Example

feedback = Feedback(provider.relevance).on_input_output()

Usage on RAG Contexts

feedback = Feedback(provider.relevance).on_input().on(
    TruLlama.select_source_nodes().node.text # See note below
).aggregate(np.mean)

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`criteria`	If provided, overrides the evaluation criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not relevant" and 1 being "relevant". TYPE: `float`

relevance_with_cot_reasons ¶

relevance_with_cot_reasons(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> Tuple[float, Dict]

Uses chat completion Model. A function that completes a template to check the relevance of the response to a prompt. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input()
    .on_output()
)

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`criteria`	If provided, overrides the evaluation criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not relevant" and 1 being "relevant". TYPE: `Tuple[float, Dict]`

sentiment ¶

sentiment(
    text: str,
    criteria: str = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the sentiment of some text.

Example

feedback = Feedback(provider.sentiment).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate sentiment of. TYPE: `str`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "negative sentiment" and 1 being "positive sentiment".

sentiment_with_cot_reasons ¶

sentiment_with_cot_reasons(
    text: str,
    criteria: str = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the sentiment of some text. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.sentiment_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	Text to evaluate. TYPE: `str`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (negative sentiment) and 1.0 (positive sentiment). TYPE: `Tuple[float, Dict]`

model_agreement ¶

model_agreement(prompt: str, response: str) -> float

Uses chat completion model. A function that gives a chat completion model the same prompt and gets a response, encouraging truthfulness. A second template is given to the model with a prompt that the original response is correct, and measures whether previous chat completion response is similar.

Example

feedback = Feedback(provider.model_agreement).on_input_output()

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not in agreement) and 1.0 (in agreement). TYPE: `float`

_langchain_evaluate ¶

_langchain_evaluate(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A general function that completes a template to evaluate different aspects of some text. Prompt credit to Langchain.

PARAMETER	DESCRIPTION
`text`	A prompt to an agent. TYPE: `str`
`criteria`	The specific criteria for evaluation. TYPE: `str` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`

RETURNS	DESCRIPTION
`float`	A value between 0.0 and 1.0, representing the specified evaluation. TYPE: `float`

_langchain_evaluate_with_cot_reasons ¶

_langchain_evaluate_with_cot_reasons(
    text: str,
    criteria: str,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A general function that completes a template to evaluate different aspects of some text. Prompt credit to Langchain.

PARAMETER	DESCRIPTION
`text`	A prompt to an agent. TYPE: `str`
`criteria`	The specific criteria for evaluation. TYPE: `str`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	Tuple[float, str]: A tuple containing a value between 0.0 and 1.0, representing the specified evaluation, and a string containing the reasons for the evaluation.

conciseness ¶

conciseness(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.conciseness).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate the conciseness of. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not concise) and 1.0 (concise).

conciseness_with_cot_reasons ¶

conciseness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.conciseness_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate the conciseness of. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	Tuple[float, str]: A tuple containing a value between 0.0 (not concise) and 1.0 (concise) and a string containing the reasons for the evaluation.

correctness ¶

correctness(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.correctness).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not correct) and 1.0 (correct).

correctness_with_cot_reasons ¶

correctness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.correctness_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	Text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not correct) and 1.0 (correct) and a dictionary containing the reasons for the evaluation.

coherence ¶

coherence(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.coherence).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not coherent) and 1.0 (coherent). TYPE: `float`

coherence_with_cot_reasons ¶

coherence_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.coherence_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not coherent) and 1.0 (coherent) and a dictionary containing the reasons for the evaluation.

harmfulness ¶

harmfulness(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.harmfulness).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not harmful) and 1.0 (harmful). TYPE: `float`

harmfulness_with_cot_reasons ¶

harmfulness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.harmfulness_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not harmful) and 1.0 (harmful) and a dictionary containing the reasons for the evaluation.

maliciousness ¶

maliciousness(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.maliciousness).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not malicious) and 1.0 (malicious). TYPE: `float`

maliciousness_with_cot_reasons ¶

maliciousness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.maliciousness_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not malicious) and 1.0 (malicious) and a dictionary containing the reasons for the evaluation.

helpfulness ¶

helpfulness(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.helpfulness).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not helpful) and 1.0 (helpful). TYPE: `float`

helpfulness_with_cot_reasons ¶

helpfulness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.helpfulness_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not helpful) and 1.0 (helpful) and a dictionary containing the reasons for the evaluation.

controversiality ¶

controversiality(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.controversiality).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not controversial) and 1.0 (controversial). TYPE: `float`

controversiality_with_cot_reasons ¶

controversiality_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.controversiality_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not controversial) and 1.0 (controversial) and a dictionary containing the reasons for the evaluation.

misogyny ¶

misogyny(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.misogyny).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not misogynistic) and 1.0 (misogynistic). TYPE: `float`

misogyny_with_cot_reasons ¶

misogyny_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.misogyny_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not misogynistic) and 1.0 (misogynistic) and a dictionary containing the reasons for the evaluation.

criminality ¶

criminality(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.criminality).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not criminal) and 1.0 (criminal). TYPE: `float`

criminality_with_cot_reasons ¶

criminality_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.criminality_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not criminal) and 1.0 (criminal) and a dictionary containing the reasons for the evaluation.

insensitivity ¶

insensitivity(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.insensitivity).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not insensitive) and 1.0 (insensitive). TYPE: `float`

insensitivity_with_cot_reasons ¶

insensitivity_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.insensitivity_with_cot_reasons).on_output()

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not insensitive) and 1.0 (insensitive) and a dictionary containing the reasons for the evaluation.

_get_answer_agreement ¶

_get_answer_agreement(
    prompt: str, response: str, check_response: str
) -> str

Uses chat completion model. A function that completes a template to check if two answers agree.

PARAMETER	DESCRIPTION
`prompt`	A prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`check_response`	The response to check against. TYPE: `str`

RETURNS	DESCRIPTION
`str`	str

_generate_key_points ¶

_generate_key_points(
    source: str, temperature: float = 0.0
) -> str

Uses chat completion model. A function that tries to distill main points to be used by the comprehensiveness feedback function.

PARAMETER	DESCRIPTION
`source`	Text corresponding to source material. TYPE: `str`

RETURNS	DESCRIPTION
`str`	Key points of the source text.

_assess_key_point_inclusion ¶

_assess_key_point_inclusion(
    key_points: str,
    summary: str,
    min_score_val: int = 0,
    max_score_val: int = 3,
    criteria: Optional[str] = None,
    temperature: float = 0.0,
) -> List

Splits key points by newlines and assesses if each one is included in the summary.

PARAMETER	DESCRIPTION
`key_points`	Key points separated by newlines. TYPE: `str`
`summary`	The summary text to check for inclusion of key points. TYPE: `str`

RETURNS	DESCRIPTION
`List`	List[str]: A list of strings indicating whether each key point is included in the summary.
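The split-then-assess flow described above can be sketched as follows. This is only an illustration: the `assess` callback stands in for the LLM call, `substring_judge` is a hypothetical toy judge, and skipping blank lines is an assumption rather than documented behavior.

```python
from typing import Callable, List


def assess_key_points(
    key_points: str,
    summary: str,
    assess: Callable[[str, str], str],
) -> List[str]:
    """Split key points on newlines and assess each one against the summary."""
    # Assumption: blank lines carry no key point and are skipped.
    points = [line.strip() for line in key_points.split("\n") if line.strip()]
    return [assess(point, summary) for point in points]


def substring_judge(point: str, summary: str) -> str:
    """Hypothetical stand-in for the LLM: a case-insensitive substring check."""
    verdict = "included" if point.lower() in summary.lower() else "missing"
    return f"{point}: {verdict}"


results = assess_key_points(
    "Paris is the capital of France\n\nFrance uses the euro",
    "We learn that Paris is the capital of France.",
    substring_judge,
)
```

Here `results` holds one string per key point, mirroring the `List[str]` return value described above.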

comprehensiveness_with_cot_reasons ¶

comprehensiveness_with_cot_reasons(
    source: str,
    summary: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that tries to distill main points and compares a summary against those main points. This feedback function only has a chain of thought implementation as it is extremely important in function assessment.

Example

feedback = Feedback(provider.comprehensiveness_with_cot_reasons).on_input_output()

PARAMETER	DESCRIPTION
`source`	Text corresponding to source material. TYPE: `str`
`summary`	Text corresponding to a summary. TYPE: `str`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not comprehensive) and 1.0 (comprehensive) and a dictionary containing the reasons for the evaluation.

summarization_with_cot_reasons ¶

summarization_with_cot_reasons(
    source: str, summary: str
) -> Tuple[float, Dict]

Summarization is deprecated in favor of comprehensiveness. This function is no longer implemented.

stereotypes ¶

stereotypes(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    min_score_val: Optional[int] = 0,
    max_score_val: Optional[int] = 3,
    temperature: Optional[float] = 0.0,
) -> float

Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt.

Example

feedback = Feedback(provider.stereotypes).on_input_output()

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed).

stereotypes_with_cot_reasons ¶

stereotypes_with_cot_reasons(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.stereotypes_with_cot_reasons).on_input_output()

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed) and a dictionary containing the reasons for the evaluation.

_remove_trivial_statements ¶

_remove_trivial_statements(
    statements: List[str],
) -> List[str]

Removes trivial statements from a list of statements.

PARAMETER	DESCRIPTION
`statements`	A list of statements. TYPE: `List[str]`

RETURNS	DESCRIPTION
`List[str]`	List[str]: A list of statements with trivial statements removed.

groundedness_measure_with_cot_reasons ¶

groundedness_measure_with_cot_reasons(
    source: str,
    statement: str,
    criteria: Optional[str] = None,
    examples: Optional[str] = None,
    groundedness_configs: Optional[
        GroundednessConfigs
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> Tuple[float, dict]

A measure to track if the source material supports each sentence in the statement using an LLM provider.

The statement will first be split by a tokenizer into its component sentences.

Then, trivial statements are eliminated so as to not dilute the evaluation.

The LLM will process each statement, using chain of thought methodology to emit the reasons.

Abstentions will be considered as grounded.

Example

from trulens.core import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
)

To further explain how the function works under the hood, consider the statement:

"Hi. I'm here to help. The university of Washington is a public research university. UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"

The function will split the statement into its component sentences:

"Hi."
"I'm here to help."
"The university of Washington is a public research university."
"UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"

Next, trivial statements are removed, leaving only:

"The university of Washington is a public research university."
"UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"

The LLM will then process each remaining statement to assess its groundedness.

For the sake of this example, the LLM will grade the groundedness of one statement at the maximum score and the other at the minimum score (3 and 0 with the default range).

Then, the scores are normalized, and averaged to give a final groundedness score of 0.5.
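The walk-through above can be sketched end to end in a few lines. This is a rough illustration, not the library's implementation: the regex splitter stands in for the real sentence tokenizer, the word-count rule in `is_trivial` is a hypothetical replacement for the actual trivial-statement filter, and the raw ratings are supplied by hand (one at the top of the default 0-3 range, one at the bottom) instead of coming from an LLM.

```python
import re
from typing import List


def split_sentences(statement: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", statement) if s.strip()]


def is_trivial(sentence: str) -> bool:
    """Hypothetical heuristic: treat very short sentences as trivial."""
    return len(sentence.split()) <= 4


def average_normalized(raw_scores: List[float], min_score_val: int = 0, max_score_val: int = 3) -> float:
    """Rescale each raw rating to [0.0, 1.0] (assumed linear) and average them."""
    span = max_score_val - min_score_val
    return sum((s - min_score_val) / span for s in raw_scores) / len(raw_scores)


statement = (
    "Hi. I'm here to help. The university of Washington is a public research "
    "university. UW's connections to major corporations in Seattle contribute "
    "to its reputation as a hub for innovation and technology"
)

sentences = split_sentences(statement)                # four component sentences
kept = [s for s in sentences if not is_trivial(s)]    # the two substantive ones
raw_scores = [3, 0]                                   # hand-supplied ratings, one per kept sentence
final_score = average_normalized(raw_scores)          # 0.5
```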

PARAMETER	DESCRIPTION
`source`	The source that should support the statement. TYPE: `str`
`statement`	The statement to check groundedness. TYPE: `str`
`criteria`	The specific criteria for evaluation. Defaults to None. TYPE: `str` DEFAULT: `None`
`use_sent_tokenize`	Whether to split the statement into sentences using punkt sentence tokenizer. If `False`, use an LLM to split the statement. Defaults to False. Note this might incur additional costs and reach context window limits in some cases. TYPE: `bool`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, dict]`	Tuple[float, dict]: A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation.

qs_relevance ¶

qs_relevance(*args, **kwargs)

Deprecated. Use relevance instead.

qs_relevance_with_cot_reasons ¶

qs_relevance_with_cot_reasons(*args, **kwargs)

Deprecated. Use relevance_with_cot_reasons instead.

groundedness_measure_with_cot_reasons_consider_answerability ¶

groundedness_measure_with_cot_reasons_consider_answerability(
    source: str,
    statement: str,
    question: str,
    criteria: Optional[str] = None,
    examples: Optional[List[str]] = None,
    groundedness_configs: Optional[
        GroundednessConfigs
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
) -> Tuple[float, dict]

A measure to track if the source material supports each sentence in the statement using an LLM provider.

The statement will first be split by a tokenizer into its component sentences.

Then, trivial statements are eliminated so as to not dilute the evaluation.

The LLM will process each statement, using chain of thought methodology to emit the reasons.

In the case of abstentions, such as 'I do not know', the LLM will be asked to consider the answerability of the question given the source material.

If the question is considered answerable, abstentions will be considered as not grounded and punished with low scores. Otherwise, unanswerable abstentions will be considered grounded.
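The abstention rule can be summarized as a small decision function. The sketch below is an assumption-laden illustration: `score_abstention` is a hypothetical helper, and assigning exactly the minimum or maximum raw score is one possible reading of "punished with low scores" and "considered grounded".

```python
def score_abstention(
    question_is_answerable: bool,
    min_score_val: int = 0,
    max_score_val: int = 3,
) -> int:
    """Raw score for a sentence that abstains (for example, 'I do not know').

    Assumption: an answerable question makes the abstention ungrounded (lowest
    score); an unanswerable one makes it grounded (highest score).
    """
    return min_score_val if question_is_answerable else max_score_val


answerable_case = score_abstention(question_is_answerable=True)
unanswerable_case = score_abstention(question_is_answerable=False)
```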

Example

from trulens.core import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons_consider_answerability)
    .on(context.collect())
    .on_output()
    .on_input()
)

PARAMETER	DESCRIPTION
`source`	The source that should support the statement. TYPE: `str`
`statement`	The statement to check groundedness. TYPE: `str`
`question`	The question to check answerability. TYPE: `str`
`criteria`	The specific criteria for evaluation. Defaults to None. TYPE: `str` DEFAULT: `None`
`use_sent_tokenize`	Whether to split the statement into sentences using punkt sentence tokenizer. If `False`, use an LLM to split the statement. Defaults to False. Note this might incur additional costs and reach context window limits in some cases. TYPE: `bool`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, dict]`	Tuple[float, dict]: A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation.

Embeddings ¶

Bases: WithClassInfo, SerialModel

Embedding related feedback function implementations.

Attributes¶

tru_class_info `instance-attribute` ¶

tru_class_info: Class

Class information of this pydantic object for use in deserialization.

Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.

Functions¶

__rich_repr__ ¶

__rich_repr__() -> Result

Requirement for pretty printing using the rich package.

load `staticmethod` ¶

load(obj, *args, **kwargs)

Deserialize/load this object using the class information in tru_class_info to lookup the actual class that will do the deserialization.
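The general pattern of looking up a class from stored class information can be sketched as below. This is not the library's code: the `module` and `name` keys and the helper `load_with_class_info` are hypothetical, and a standard-library class is used as the target so the example runs on its own.

```python
import importlib
from typing import Any, Dict

CLASS_INFO_KEY = "tru_class_info"  # key name taken from the attribute documented above


def load_with_class_info(obj: Dict[str, Any]) -> Any:
    """Rebuild an object from a dict that records which class produced it."""
    info = obj[CLASS_INFO_KEY]
    # Hypothetical layout: the class info names a module and a class within it.
    module = importlib.import_module(info["module"])
    cls = getattr(module, info["name"])
    # Everything except the class info is passed to the constructor.
    kwargs = {k: v for k, v in obj.items() if k != CLASS_INFO_KEY}
    return cls(**kwargs)


restored = load_with_class_info(
    {
        "tru_class_info": {"module": "fractions", "name": "Fraction"},
        "numerator": 3,
        "denominator": 4,
    }
)
```

Storing the class information under an unusual key keeps it from colliding with the regular fields of whatever class it is mixed into.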

model_validate `classmethod` ¶

model_validate(*args, **kwargs) -> Any

Deserialize a jsonized version of the app into the instance of the class it was serialized from.

Note

This process uses extra information stored in the jsonized object and handled by WithClassInfo.

__init__ ¶

__init__(embed_model: BaseEmbedding)

Instantiates embeddings for feedback functions.

Example

Below is just one example. Embedders from LlamaIndex are supported: https://docs.llamaindex.ai/en/latest/module_guides/models/embeddings/

from llama_index.embeddings.openai import OpenAIEmbedding
from trulens.feedback.embeddings import Embeddings

embed_model = OpenAIEmbedding()

f_embed = Embeddings(embed_model=embed_model)

PARAMETER	DESCRIPTION
`embed_model`	Supports embedders from LlamaIndex: https://docs.llamaindex.ai/en/latest/module_guides/models/embeddings/ TYPE: `BaseEmbedding`

cosine_distance ¶

cosine_distance(
    query: str, document: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Computes the cosine distance between the query and document embeddings.

Example

Below is just one example. Embedders from LlamaIndex are supported: https://docs.llamaindex.ai/en/latest/module_guides/models/embeddings/

from llama_index.embeddings.openai import OpenAIEmbedding
from trulens.core import Feedback
from trulens.feedback.embeddings import Embeddings

embed_model = OpenAIEmbedding()

# Create the feedback function
f_embed = Embeddings(embed_model=embed_model)
f_embed_dist = Feedback(f_embed.cosine_distance).on_input_output()

PARAMETER	DESCRIPTION
`query`	A text prompt to a vector DB. TYPE: `str`
`document`	The document returned from the vector DB. TYPE: `str`

RETURNS	DESCRIPTION
`float`	the embedding vector distance TYPE: `Union[float, Tuple[float, Dict[str, str]]]`
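For reference, cosine distance between two embedding vectors is conventionally defined as one minus their cosine similarity. The pure-Python sketch below uses that definition; whether the library computes it in exactly this way is an assumption, and the short vectors are made-up stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cos(angle between a and b); 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # approximately 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1.0
```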

manhattan_distance ¶

manhattan_distance(
    query: str, document: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Computes the L1 (Manhattan) distance between the query and document embeddings.

Example

Below is just one example. Embedders from LlamaIndex are supported: https://docs.llamaindex.ai/en/latest/module_guides/models/embeddings/

from llama_index.embeddings.openai import OpenAIEmbedding
from trulens.core import Feedback
from trulens.feedback.embeddings import Embeddings

embed_model = OpenAIEmbedding()

# Create the feedback function
f_embed = Embeddings(embed_model=embed_model)
f_embed_dist = Feedback(f_embed.manhattan_distance).on_input_output()

PARAMETER	DESCRIPTION
`query`	A text prompt to a vector DB. TYPE: `str`
`document`	The document returned from the vector DB. TYPE: `str`

RETURNS	DESCRIPTION
`float`	the embedding vector distance TYPE: `Union[float, Tuple[float, Dict[str, str]]]`
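The L1 distance is the sum of absolute coordinate differences. A minimal pure-Python sketch with made-up vectors in place of real embeddings:

```python
from typing import Sequence


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between matching coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))


# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0
l1 = manhattan_distance([1.0, 2.0, 3.0], [4.0, 0.0, 3.0])
```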

euclidean_distance ¶

euclidean_distance(
    query: str, document: str
) -> Union[float, Tuple[float, Dict[str, str]]]

Computes the L2 (Euclidean) distance between the query and document embeddings.

Example

Below is just one example. Embedders from LlamaIndex are supported: https://docs.llamaindex.ai/en/latest/module_guides/models/embeddings/

from llama_index.embeddings.openai import OpenAIEmbedding
from trulens.core import Feedback
from trulens.feedback.embeddings import Embeddings

embed_model = OpenAIEmbedding()

# Create the feedback function
f_embed = Embeddings(embed_model=embed_model)
f_embed_dist = Feedback(f_embed.euclidean_distance).on_input_output()

PARAMETER	DESCRIPTION
`query`	A text prompt to a vector DB. TYPE: `str`
`document`	The document returned from the vector DB. TYPE: `str`

RETURNS	DESCRIPTION
`float`	the embedding vector distance TYPE: `Union[float, Tuple[float, Dict[str, str]]]`
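The L2 distance is the square root of the summed squared coordinate differences. A minimal pure-Python sketch, again with made-up vectors in place of real embeddings:

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# sqrt(3^2 + 4^2) = 5
l2 = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

Unlike cosine distance, both L1 and L2 are sensitive to vector magnitude, so they behave differently when embeddings are not normalized to unit length.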
