Custom Feedback Functions
Feedback functions are an extensible framework for evaluating LLMs.
The primary motivations for customizing feedback functions are either to improve alignment of an existing feedback function, or to evaluate on a new axis not addressed by an out-of-the-box feedback function.
Improving feedback function alignment through customization
Feedback functions can be customized through a number of parameter changes that influence score generation. For example, you can run feedback functions with or without chain-of-thought reasoning, customize the output scale, or provide "few-shot" examples to guide alignment. All of these decisions affect score generation and should be carefully tested and benchmarked.
Chain-of-thought Reasoning
Feedback functions can be run with chain-of-thought reasoning using their "cot" variant. Doing so provides a view into how the grading is performed and also improves alignment, because the auto-regressive nature of LLMs forces the score to follow from the previously generated reasons.
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
provider = OpenAI(model_engine="gpt-4o")
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
provider.relevance_with_cot_reasons(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
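When called directly, the chain-of-thought variant also returns the reasons alongside the score. A minimal sketch of consuming that output, assuming a (score, reasons) return structure:
# Sketch: the "_with_cot_reasons" variants return the model's reasoning
# together with the score; a (score, reasons) pair is assumed here.
score, reasons = provider.relevance_with_cot_reasons(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
print(score)  # normalized score between 0 and 1
print(reasons)  # chain-of-thought explanation behind the score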
Output space
The output space is another important variable to consider: it lets you trade off between a score's accuracy and its granularity. The larger the output space, the greater the granularity but the lower the accuracy.
The output space can be modulated via the min_score_val and max_score_val keyword arguments.
The output space currently allows three selections:
- 0 or 1 (binary)
- 0 to 3 (default)
- 0 to 10
While the output you see is always on a scale from 0 to 1, changing the output space changes the score range that the LLM judge is prompted to use. The score produced by the judge is then scaled down to 0-1.
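Under the hood, the judge's raw score is mapped back onto 0-1. A minimal sketch of that mapping, assuming simple min-max normalization (the exact implementation inside TruLens may differ):
# Sketch of min-max normalization; assumes the judge returns a raw numeric score.
def normalize_score(raw_score: float, min_score_val: int, max_score_val: int) -> float:
    """Map a raw judge score onto the 0-1 range reported to the user."""
    return (raw_score - min_score_val) / (max_score_val - min_score_val)

normalize_score(7, min_score_val=0, max_score_val=10)  # -> 0.7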
For example, we can modulate the output space to 0-10.
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
min_score_val=0,
max_score_val=10,
)
Or to binary scoring.
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
min_score_val=0,
max_score_val=1,
)
Temperature
When using LLMs, temperature is another parameter to be mindful of. Feedback functions default to a temperature of 0, but in some cases it can be useful to use higher temperatures, or even to ensemble feedback functions run at different temperatures.
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
temperature=0.9,
)
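For instance, one simple way to ensemble across temperatures is to run the same feedback function several times and average the scores. A minimal sketch, assuming this averaging strategy (the choice of temperatures and aggregation is up to you):
# Sketch: average the same relevance feedback across several temperatures.
temperatures = [0.0, 0.5, 0.9]  # assumed choice of temperatures
scores = [
    provider.relevance(
        "What are the key considerations when starting a small business?",
        "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
        temperature=t,
    )
    for t in temperatures
]
ensembled_score = sum(scores) / len(scores)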
Groundedness configurations
Groundedness has its own specific configurations, which can be set with the GroundednessConfigs class.
from trulens.core.feedback import feedback
groundedness_configs = feedback.GroundednessConfigs(
use_sent_tokenize=False, filter_trivial_statements=False
)
provider.groundedness_measure_with_cot_reasons(
"The First AFLโNFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
"Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
)
provider.groundedness_measure_with_cot_reasons(
"The First AFLโNFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
"Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
groundedness_configs=groundedness_configs,
)
Custom Criteria
To customize the LLM-judge prompting, you can override standard criteria with your own custom criteria.
This can be useful to tailor LLM-judge prompting to your domain and improve alignment with human evaluations.
custom_criteria = """
A relevant response should provide a clear and concise answer to the question.
"""
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
criteria=custom_criteria,
min_score_val=0,
max_score_val=1,
)
custom_criteria = """
A positive sentiment should be expressed with an extremely encouraging and enthusiastic tone.
"""
provider.sentiment(
"When you're ready to start your business, you'll be amazed at how much you can achieve!",
criteria=custom_criteria,
)
Few-shot examples
You can also provide examples to customize feedback scoring to your domain.
This is currently available only for the RAG triad feedback functions (answer relevance, context relevance and groundedness).
from trulens.feedback.v2 import feedback
fewshot_relevance_examples_list = [
(
{
"query": "What are the key considerations when starting a small business?",
"response": "You should focus on building relationships with mentors and industry leaders. Networking can provide insights, open doors to opportunities, and help you avoid common pitfalls.",
},
3,
),
]
provider.relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
examples=fewshot_relevance_examples_list,
)
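You can include multiple examples spanning the score range to calibrate the judge at both ends. A sketch extending the list above; the example text and score here are purely illustrative:
# Illustrative only: add a low-scoring example to anchor the bottom of the scale.
fewshot_relevance_examples_list.append(
    (
        {
            "query": "What are the key considerations when starting a small business?",
            "response": "The weather in Paris is usually mild in the spring.",
        },
        0,
    )
)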
Usage Options for Customized Feedback Functions
Feedback customizations are available both directly (shown above) and through the Feedback class.
Below is an example of applying these customizations when instantiating a feedback function that will run with typical TruLens recording.
from trulens.core import Select
from trulens.providers.openai import OpenAI
provider = OpenAI(model_engine="gpt-4o")
# Question/answer relevance between overall question and answer.
f_answer_relevance = (
Feedback(
provider.relevance_with_cot_reasons,
name="Answer Relevance",
examples=fewshot_relevance_examples_list,
criteria=custom_criteria,
min_score_val=0,
max_score_val=1,
temperature=0.9,
)
.on_input()
.on_output()
)
f_answer_relevance(
"What are the key considerations when starting a small business?",
"Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
from trulens.core.feedback import feedback
groundedness_configs = feedback.GroundednessConfigs(
use_sent_tokenize=False, filter_trivial_statements=False
)
# Groundedness of the response based on the provided source statements.
f_groundedness = (
Feedback(
provider.groundedness_measure_with_cot_reasons,
name="Groundedness",
examples=fewshot_relevance_examples_list,
min_score_val=0,
max_score_val=1,
temperature=0.9,
groundedness_configs=groundedness_configs,
)
.on_input()
.on_output()
)
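Once defined, these customized feedback functions can be attached to a TruLens recorder so that they run on each recorded invocation. A minimal sketch, assuming a LangChain app wrapped with TruChain; the rag_chain object and app names are placeholders:
from trulens.apps.langchain import TruChain

# Hypothetical: `rag_chain` is your LangChain app; app_name/app_version are placeholders.
tru_recorder = TruChain(
    rag_chain,
    app_name="RAG",
    app_version="custom_feedback_v1",
    feedbacks=[f_answer_relevance, f_groundedness],
)

with tru_recorder as recording:
    # The call is recorded and the attached feedbacks are evaluated.
    rag_chain.invoke("What are the key considerations when starting a small business?")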
Creating new custom feedback functions
You can add your own feedback functions to evaluate the qualities required by your application by simply creating a new provider class and feedback function in your notebook. If your contributions would be useful for others, we encourage you to contribute to TruLens!
Feedback functions are organized by model provider into Provider classes.
The process for adding new feedback functions is:
- Create a new Provider class or locate an existing one that applies to your feedback function. If your feedback function does not rely on a model provider, you can create a standalone class. Add the new feedback function method to your selected class. Your new method can either take a single text (str) as a parameter or both prompt (str) and response (str). It should return a float between 0 (worst) and 1 (best).
from trulens.core import Feedback
from trulens.core import Provider
class StandAlone(Provider):
    def custom_feedback(self, my_text_field: str) -> float:
        """
        A dummy function of text inputs to float outputs.

        Parameters:
            my_text_field (str): Text to evaluate.

        Returns:
            float: A value between 0 and 1 that decreases with the squared length of the text.
        """
        return 1.0 / (1.0 + len(my_text_field) * len(my_text_field))
- Instantiate your provider and feedback functions. The feedback function is wrapped by the Feedback class, which helps specify what will get sent to your function's parameters (for example, Select.RecordInput or Select.RecordOutput).
standalone = StandAlone()
f_custom_function = Feedback(standalone.custom_feedback).on(
my_text_field=Select.RecordOutput
)
- Your feedback function is now ready to use just like the out-of-the-box feedback functions. Below is an example of it being used.
f_custom_function("Hello, World!")
Extending existing providers
In addition to calling your own methods, you can also extend stock feedback providers (such as OpenAI, AzureOpenAI, or Bedrock) with custom feedback implementations. This can be especially useful for tweaking stock feedback functions, or for running custom feedback prompts while letting TruLens handle the backend LLM provider.
This is done by subclassing the provider you wish to extend and using the generate_score method, which runs the provided prompt with your specified provider and extracts a float score from 0 to 1. Your prompt should ask the LLM to respond on a scale from 0 to 10; the generate_score method then normalizes the result to 0-1.
See below for example usage:
from trulens.providers.openai import AzureOpenAI
class CustomAzureOpenAI(AzureOpenAI):
    def style_check_professional(self, response: str) -> float:
        """
        Custom feedback function to grade the professional style of the response, extending AzureOpenAI provider.

        Args:
            response (str): text to be graded for professional style.

        Returns:
            float: A value between 0 and 1. 0 being "not professional" and 1 being "professional".
        """
        professional_prompt = str.format(
            "Please rate the professionalism of the following text on a scale from 0 to 10, where 0 is not at all professional and 10 is extremely professional: \n\n{}",
            response,
        )
        return self.generate_score(system_prompt=professional_prompt)
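A brief usage sketch follows; the deployment name is a placeholder, and your Azure OpenAI credentials are assumed to be configured in the environment:
# Hypothetical deployment name; Azure credentials assumed to be set via environment variables.
custom_azure_provider = CustomAzureOpenAI(deployment_name="your-deployment-name")

f_style = Feedback(
    custom_azure_provider.style_check_professional, name="Professional Style"
).on_output()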
Running "chain of thought evaluations" is another use case for extending providers. Doing so follows a similar process as above, where the base provider (such as AzureOpenAI
) is subclassed.
For this case, the method generate_score_and_reasons
can be used to extract both the score and chain of thought reasons from the LLM response.
To use this method, the prompt used should include the COT_REASONS_TEMPLATE
available from the TruLens prompts library (trulens.feedback.prompts
).
See below for example usage:
from typing import Dict, Tuple
from trulens.feedback import prompts
class CustomAzureOpenAIReasoning(AzureOpenAI):
    def context_relevance_with_cot_reasons_extreme(
        self, question: str, context: str
    ) -> Tuple[float, Dict]:
        """
        Tweaked version of context relevance, extending AzureOpenAI provider.

        A function that completes a template to check the relevance of the statement to the question.
        Scoring guidelines for scores 5-8 are removed to push the LLM to more extreme scores.
        Also uses chain-of-thought methodology and emits the reasons.

        Args:
            question (str): A question being asked.
            context (str): A context statement whose relevance to the question is checked.

        Returns:
            float: A value between 0 and 1. 0 being "not relevant" and 1 being "relevant".
        """
        # Remove scoring guidelines around middle scores.
        system_prompt = prompts.CONTEXT_RELEVANCE_SYSTEM.replace(
            "- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n",
            "",
        )

        user_prompt = str.format(
            prompts.CONTEXT_RELEVANCE_USER, question=question, context=context
        )
        user_prompt = user_prompt.replace(
            "RELEVANCE:", prompts.COT_REASONS_TEMPLATE
        )

        return self.generate_score_and_reasons(system_prompt, user_prompt)
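As before, a short usage sketch; the deployment name is a placeholder, and the assumed return structure is a (score, reasons) pair:
# Hypothetical deployment name; Azure credentials assumed to be configured.
reasoning_provider = CustomAzureOpenAIReasoning(deployment_name="your-deployment-name")

score, reasons = reasoning_provider.context_relevance_with_cot_reasons_extreme(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)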