Custom Feedback Functions
Feedback functions are an extensible framework for evaluating LLMs.
The primary motivations for customizing feedback functions are either to improve alignment of an existing feedback function, or to evaluate on a new axis not addressed by an out-of-the-box feedback function.
Improving feedback function alignment through customization
Feedback functions can be customized through a number of parameter changes that influence score generation. For example, you can run feedback functions with or without chain-of-thought reasoning, customize the output scale, or provide "few-shot" examples to guide alignment. All of these decisions affect score generation and should be carefully tested and benchmarked.
Chain-of-thought Reasoning
Feedback functions can be run with chain-of-thought reasoning using their "cot" variant. Doing so provides a view into how the grading is performed and also improves alignment, because the auto-regressive nature of LLMs forces the score to follow from the previously generated reasons.
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
provider = OpenAI(model_engine="gpt-4o")
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
provider.relevance_with_cot_reasons(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
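When called directly, the chain-of-thought variant also returns the reasons alongside the score. A minimal sketch of consuming that output, assuming a (score, reasons) return structure:
# Sketch: the "_with_cot_reasons" variants return the model's reasoning
# together with the score; a (score, reasons) pair is assumed here.
score, reasons = provider.relevance_with_cot_reasons(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
print(score)  # normalized score between 0 and 1
print(reasons)  # chain-of-thought explanation behind the score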
Output space
The output space is another important variable to consider: it lets you trade off between a score's accuracy and its granularity. The larger the output space, the greater the granularity but the lower the accuracy.
The output space can be modulated via the min_score_val and max_score_val keyword arguments.
The output space currently allows three selections:
- 0 or 1 (binary)
- 0 to 3 (default)
- 0 to 10
While the output you see is always on a scale from 0 to 1, changing the output space changes the score range that the LLM judge is prompted to use. The score produced by the judge is then scaled down to 0-1.
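Under the hood, the judge's raw score is mapped back onto 0-1. A minimal sketch of that mapping, assuming simple min-max normalization (the exact implementation inside TruLens may differ):
# Sketch of min-max normalization; assumes the judge returns a raw numeric score.
def normalize_score(raw_score: float, min_score_val: int, max_score_val: int) -> float:
    """Map a raw judge score onto the 0-1 range reported to the user."""
    return (raw_score - min_score_val) / (max_score_val - min_score_val)

normalize_score(7, min_score_val=0, max_score_val=10)  # -> 0.7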
For example, we can modulate the output space to 0-10.
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
min_score_val=0,
max_score_val=10,
)
Or to binary scoring.
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
min_score_val=0,
max_score_val=1,
)
Temperature
When using LLMs, temperature is another parameter to be mindful of. Feedback functions default to a temperature of 0, but in some cases it can be useful to use higher temperatures, or even to ensemble feedback functions run at different temperatures.
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
temperature=0.9,
)
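For instance, one simple way to ensemble across temperatures is to run the same feedback function several times and average the scores. A minimal sketch, assuming this averaging strategy (the choice of temperatures and aggregation is up to you):
# Sketch: average the same relevance feedback across several temperatures.
temperatures = [0.0, 0.5, 0.9]  # assumed choice of temperatures
scores = [
    provider.relevance(
        "What are the key considerations when starting a small business?",
        "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
        temperature=t,
    )
    for t in temperatures
]
ensembled_score = sum(scores) / len(scores)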
Groundedness configurations
Groundedness has its own specific configurations, which can be set with the GroundednessConfigs class.
from trulens.core.feedback import feedback
groundedness_configs = feedback.GroundednessConfigs(
use_sent_tokenize=False, filter_trivial_statements=False
)
provider.groundedness_measure_with_cot_reasons(
"The First AFLโNFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
"Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
)
provider.groundedness_measure_with_cot_reasons(
"The First AFLโNFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
"Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
groundedness_configs=groundedness_configs,
)
Custom Criteria
To customize the LLM-judge prompting, you can override standard criteria with your own custom criteria.
This can be useful to tailor LLM-judge prompting to your domain and improve alignment with human evaluations.
custom_criteria = """
A relevant response should provide a clear and concise answer to the question.
"""
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
criteria=custom_criteria,
min_score_val=0,
max_score_val=1,
)
custom_criteria = """
A positive sentiment should be expressed with an extremely encouraging and enthusiastic tone.
"""
provider.sentiment(
"When you're ready to start your business, you'll be amazed at how much you can achieve!",
criteria=custom_criteria,
)
Few-shot examples
You can also provide examples to customize feedback scoring to your domain.
This is currently available only for the RAG triad feedback functions (answer relevance, context relevance and groundedness).
from trulens.feedback.v2 import feedback
fewshot_relevance_examples_list = [
(
{
"query": "What are the key considerations when starting a small business?",
"response": "You should focus on building relationships with mentors and industry leaders. Networking can provide insights, open doors to opportunities, and help you avoid common pitfalls.",
},
3,
),
]
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
examples=fewshot_relevance_examples_list,
)
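You can include multiple examples spanning the score range to calibrate the judge at both ends. A sketch extending the list above; the example text and score here are purely illustrative:
# Illustrative only: add a low-scoring example to anchor the bottom of the scale.
fewshot_relevance_examples_list.append(
    (
        {
            "query": "What are the key considerations when starting a small business?",
            "response": "The weather in Paris is usually mild in the spring.",
        },
        0,
    )
)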
Usage Options for Customized Feedback Functions
Feedback customizations are available both directly (shown above) and through the Feedback class.
Below is an example of applying these customizations when instantiating a feedback function that will run with typical TruLens recording.
from trulens.core import Select
from trulens.providers.openai import OpenAI
provider = OpenAI(model_engine="gpt-4o")
# Question/answer relevance between overall question and answer.
f_answer_relevance = (
Feedback(
provider.relevance_with_cot_reasons,
name="Answer Relevance",
examples=fewshot_relevance_examples_list,
criteria=custom_criteria,
min_score_val=0,
max_score_val=1,
temperature=0.9,
)
.on_input()
.on_output()
)
f_answer_relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
from trulens.core.feedback import feedback
groundedness_configs = feedback.GroundednessConfigs(
use_sent_tokenize=False, filter_trivial_statements=False
)
# Groundedness of the response based on the provided source statements.
f_groundedness = (
Feedback(
provider.groundedness_measure_with_cot_reasons,
name="Groundedness",
examples=fewshot_relevance_examples_list,
min_score_val=0,
max_score_val=1,
temperature=0.9,
groundedness_configs=groundedness_configs,
)
.on_input()
.on_output()
)
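Once defined, these customized feedback functions can be attached to a TruLens recorder so that they run on each recorded invocation. A minimal sketch, assuming a LangChain app wrapped with TruChain; the rag_chain object and app names are placeholders:
from trulens.apps.langchain import TruChain

# Hypothetical: `rag_chain` is your LangChain app; app_name/app_version are placeholders.
tru_recorder = TruChain(
    rag_chain,
    app_name="RAG",
    app_version="custom_feedback_v1",
    feedbacks=[f_answer_relevance, f_groundedness],
)

with tru_recorder as recording:
    # The call is recorded and the attached feedbacks are evaluated.
    rag_chain.invoke("What are the key considerations when starting a small business?")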
Creating new custom feedback functions
You can add your own feedback functions to evaluate the qualities required by your application by simply creating a new provider class and feedback function in your notebook. If your contributions would be useful for others, we encourage you to contribute to TruLens!
Feedback functions are organized by model provider into Provider classes.
The process for adding new feedback functions is:
- Create a new Provider class or locate an existing one that applies to your feedback function. If your feedback function does not rely on a model provider, you can create a standalone class. Add the new feedback function method to your selected class. Your new method can either take a single text (str) as a parameter or both prompt (str) and response (str). It should return a float between 0 (worst) and 1 (best).
from trulens.core import Feedback
from trulens.core import Provider
class StandAlone(Provider):
    def custom_feedback(self, my_text_field: str) -> float:
        """
        A dummy function of text inputs to float outputs.

        Parameters:
            my_text_field (str): Text to evaluate.

        Returns:
            float: A value between 0 and 1 that decreases with the squared length of the text.
        """
        return 1.0 / (1.0 + len(my_text_field) * len(my_text_field))
- Instantiate your provider and feedback functions. The feedback function is wrapped by the Feedback class, which helps specify what will get sent to your function's parameters (for example, Select.RecordInput or Select.RecordOutput).
standalone = StandAlone()
f_custom_function = Feedback(standalone.custom_feedback).on(
my_text_field=Select.RecordOutput
)
- Your feedback function is now ready to use just like the out-of-the-box feedback functions. Below is an example of it being used.
f_custom_function("Hello, World!")
Extending existing providers
In addition to calling your own methods, you can also extend stock feedback providers (such as OpenAI, AzureOpenAI, or Bedrock) with custom feedback implementations. This can be especially useful for tweaking stock feedback functions, or for running custom feedback prompts while letting TruLens handle the backend LLM provider.
This is done by subclassing the provider you wish to extend and using the generate_score method, which runs the provided prompt with your specified provider and extracts a float score from 0 to 1. Your prompt should ask the LLM to respond on a scale from 0 to 10; the generate_score method then normalizes the result to 0-1.
See below for example usage:
from trulens.providers.openai import AzureOpenAI
class CustomAzureOpenAI(AzureOpenAI):
    def style_check_professional(self, response: str) -> float:
        """
        Custom feedback function to grade the professional style of the response, extending AzureOpenAI provider.

        Args:
            response (str): text to be graded for professional style.

        Returns:
            float: A value between 0 and 1. 0 being "not professional" and 1 being "professional".
        """
        professional_prompt = str.format(
            "Please rate the professionalism of the following text on a scale from 0 to 10, where 0 is not at all professional and 10 is extremely professional: \n\n{}",
            response,
        )
        return self.generate_score(system_prompt=professional_prompt)
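A brief usage sketch follows; the deployment name is a placeholder, and your Azure OpenAI credentials are assumed to be configured in the environment:
# Hypothetical deployment name; Azure credentials assumed to be set via environment variables.
custom_azure_provider = CustomAzureOpenAI(deployment_name="your-deployment-name")

f_style = Feedback(
    custom_azure_provider.style_check_professional, name="Professional Style"
).on_output()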
Running "chain of thought evaluations" is another use case for extending providers. Doing so follows a similar process as above, where the base provider (such as AzureOpenAI
) is subclassed.
For this case, the method generate_score_and_reasons
can be used to extract both the score and chain of thought reasons from the LLM response.
To use this method, the prompt used should include the COT_REASONS_TEMPLATE
available from the TruLens prompts library (trulens.feedback.prompts
).
See below for example usage:
from typing import Dict, Tuple
from trulens.feedback import prompts
class CustomAzureOpenAIReasoning(AzureOpenAI):
    def context_relevance_with_cot_reasons_extreme(
        self, question: str, context: str
    ) -> Tuple[float, Dict]:
        """
        Tweaked version of context relevance, extending AzureOpenAI provider.

        A function that completes a template to check the relevance of the statement to the question.
        Scoring guidelines for scores 5-8 are removed to push the LLM to more extreme scores.
        Also uses chain-of-thought methodology and emits the reasons.

        Args:
            question (str): A question being asked.
            context (str): A context statement whose relevance to the question is checked.

        Returns:
            float: A value between 0 and 1. 0 being "not relevant" and 1 being "relevant".
        """
        # Remove scoring guidelines around middle scores.
        system_prompt = prompts.CONTEXT_RELEVANCE_SYSTEM.replace(
            "- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n",
            "",
        )

        user_prompt = str.format(
            prompts.CONTEXT_RELEVANCE_USER, question=question, context=context
        )
        user_prompt = user_prompt.replace(
            "RELEVANCE:", prompts.COT_REASONS_TEMPLATE
        )

        return self.generate_score_and_reasons(system_prompt, user_prompt)
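As before, a short usage sketch; the deployment name is a placeholder, and the assumed return structure is a (score, reasons) pair:
# Hypothetical deployment name; Azure credentials assumed to be configured.
reasoning_provider = CustomAzureOpenAIReasoning(deployment_name="your-deployment-name")

score, reasons = reasoning_provider.context_relevance_with_cot_reasons_extreme(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)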