trulens.providers.openai.provider¶
Classes¶
OpenAI
¶
Bases: LLMProvider
Out-of-the-box feedback functions that call OpenAI APIs.
Additionally, all feedback functions listed in the base LLMProvider class can be run with OpenAI.
Create an OpenAI provider with out-of-the-box feedback functions.
Example
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
PARAMETER | DESCRIPTION |
---|---|
model_engine |
The OpenAI completion model. Defaults to
|
**kwargs |
Additional arguments to pass to the OpenAIEndpoint which are then passed to OpenAIClient and finally to the OpenAI client.
TYPE:
|
Attributes¶
tru_class_info
instance-attribute
¶
tru_class_info: Class
Class information of this pydantic object for use in deserialization.
Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.
Functions¶
load
staticmethod
¶
load(obj, *args, **kwargs)
Deserialize/load this object using the class information in tru_class_info to lookup the actual class that will do the deserialization.
model_validate
classmethod
¶
model_validate(*args, **kwargs) -> Any
Deserialize a JSON-serialized version of the app into an instance of the class it was serialized from.
Note
This process uses extra information stored in the JSON-serialized object and handled by WithClassInfo.
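The class lookup described for load can be pictured with a minimal sketch. The layout of the stored class information (a `module` and a `name` entry), the `data` key, and the helper `load_from_class_info` are assumptions made for illustration; this is not the TruLens implementation.

```python
import importlib
from typing import Any, Dict


def load_from_class_info(obj: Dict[str, Any]) -> Any:
    """Rebuild an object from a dict that carries its own class information.

    Hypothetical layout: obj["tru_class_info"] names the module and class,
    and obj["data"] holds the constructor keyword arguments.
    """
    info = obj["tru_class_info"]
    module = importlib.import_module(info["module"])  # locate the module
    cls = getattr(module, info["name"])  # locate the class inside it
    return cls(**obj["data"])  # let that class rebuild the instance


# A standard-library class stands in for a provider here.
serialized = {
    "tru_class_info": {"module": "datetime", "name": "timedelta"},
    "data": {"days": 1, "seconds": 30},
}
restored = load_from_class_info(serialized)
```

Storing the class information under an unusual key keeps it from colliding with ordinary attribute names, which is the motivation the attribute description gives.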
generate_score
¶
generate_score(
system_prompt: str,
user_prompt: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 10,
temperature: float = 0.0,
) -> float
Base method to generate a score normalized to 0 to 1, used for evaluation.
PARAMETER | DESCRIPTION |
---|---|
system_prompt |
A pre-formatted system prompt.
TYPE:
|
user_prompt |
An optional user prompt. |
min_score_val |
The minimum score value.
TYPE:
|
max_score_val |
The maximum score value.
TYPE:
|
temperature |
The temperature for the LLM response.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
The score on a 0-1 scale. |
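As a rough illustration of what "normalized to 0 to 1" means for a raw LLM rating, here is a minimal sketch. The helper name `normalize_score` and the clipping of out-of-range ratings are assumptions for illustration, not behavior confirmed by this page.

```python
def normalize_score(
    raw_score: float, min_score_val: int = 0, max_score_val: int = 10
) -> float:
    """Map a raw rating in [min_score_val, max_score_val] onto [0, 1]."""
    if max_score_val <= min_score_val:
        raise ValueError("max_score_val must be greater than min_score_val")
    # Assumption: ratings outside the requested range are clipped to it.
    clipped = min(max(raw_score, min_score_val), max_score_val)
    return (clipped - min_score_val) / (max_score_val - min_score_val)


halfway = normalize_score(5)  # default 0 to 10 range
top_of_short_scale = normalize_score(3, 0, 3)  # a 0 to 3 range
```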
generate_confidence_score
¶
generate_confidence_score(
verb_confidence_prompt: str,
user_prompt: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 10,
temperature: float = 0.0,
) -> Tuple[float, Dict[str, float]]
Base method to generate a score normalized to 0 to 1 together with a confidence score, used for evaluation.
PARAMETER | DESCRIPTION |
---|---|
verb_confidence_prompt |
A pre-formatted system prompt.
TYPE:
|
user_prompt |
An optional user prompt. |
min_score_val |
The minimum score value.
TYPE:
|
max_score_val |
The maximum score value.
TYPE:
|
temperature |
The temperature for the LLM response.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict[str, float]]
|
The feedback score on a 0-1 scale and the confidence score. |
generate_score_and_reasons
¶
generate_score_and_reasons(
system_prompt: str,
user_prompt: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 10,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Base method to generate a score and reason, used for evaluation.
PARAMETER | DESCRIPTION |
---|---|
system_prompt |
A pre-formatted system prompt.
TYPE:
|
user_prompt |
An optional user prompt. Defaults to None. |
min_score_val |
The minimum score value.
TYPE:
|
max_score_val |
The maximum score value.
TYPE:
|
temperature |
The temperature for the LLM response.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
The score on a 0-1 scale. |
Dict
|
Reason metadata if returned by the LLM. |
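To make the (score, reasons) return shape concrete, the sketch below parses a hypothetical LLM reply. The reply format (`Criteria:`, `Supporting Evidence:`, `Score:` lines), the `reason` key, and the function name `parse_score_and_reasons` are all assumptions for illustration only.

```python
import re
from typing import Dict, Tuple


def parse_score_and_reasons(
    response: str, min_score_val: int = 0, max_score_val: int = 10
) -> Tuple[float, Dict[str, str]]:
    """Extract a normalized score and the remaining text as reason metadata."""
    match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", response)
    if match is None:
        raise ValueError("no score found in the LLM response")
    raw = float(match.group(1))
    score = (raw - min_score_val) / (max_score_val - min_score_val)
    # Every non-empty line other than the score line is kept as the reason.
    reason_lines = [
        line.strip()
        for line in response.splitlines()
        if line.strip() and not line.strip().startswith("Score:")
    ]
    reasons = {"reason": "\n".join(reason_lines)} if reason_lines else {}
    return score, reasons


reply = "Criteria: on topic\nSupporting Evidence: cites the source\nScore: 2"
score, reasons = parse_score_and_reasons(reply, 0, 3)
```

When the reply carries only a score line, the sketch returns an empty dictionary, mirroring the "if returned by the LLM" wording above.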
context_relevance
¶
context_relevance(
question: str,
context: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> float
Uses chat completion model. A function that completes a template to check the relevance of the context to the question.
Example
import numpy as np
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
context = TruChain.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
PARAMETER | DESCRIPTION |
---|---|
question |
A question being asked.
TYPE:
|
context |
Context related to the question.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not relevant) and 1.0 (relevant). |
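The example for this function aggregates per-context scores with a mean. The sketch below shows that aggregation step in isolation. The scorer `keyword_overlap_score` is a hypothetical stand-in for the LLM call, and `aggregate_context_relevance` is an illustrative helper, not part of TruLens.

```python
from statistics import mean
from typing import Callable, List


def keyword_overlap_score(question: str, context: str) -> float:
    """Hypothetical stand-in scorer: share of question words found in context."""
    q_words = set(question.lower().split())
    c_words = set(context.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def aggregate_context_relevance(
    question: str,
    contexts: List[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Score each retrieved context separately, then average the scores."""
    scores = [score_fn(question, context) for context in contexts]
    return mean(scores)  # raises StatisticsError if contexts is empty


overall = aggregate_context_relevance(
    "capital of france",
    ["paris is the capital of france", "bananas are yellow"],
    keyword_overlap_score,
)
```

Averaging means one irrelevant chunk lowers the overall score without zeroing it; a stricter aggregate such as `min` would be a different design choice.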
context_relevance_with_cot_reasons
¶
context_relevance_with_cot_reasons(
question: str,
context: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also uses chain of thought methodology and emits the reasons.
Example
import numpy as np
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
context = TruChain.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
PARAMETER | DESCRIPTION |
---|---|
question |
A question being asked.
TYPE:
|
context |
Context related to the question.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0 and 1. 0 being "not relevant" and 1 being "relevant". |
context_relevance_verb_confidence
¶
context_relevance_verb_confidence(
question: str,
context: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict[str, float]]
Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also emits a confidence score for the evaluation.
Example
import numpy as np
from trulens.core import Feedback
from trulens.apps.llamaindex import TruLlama
context = TruLlama.select_context(llamaindex_rag_app)
feedback = (
    Feedback(provider.context_relevance_verb_confidence)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
PARAMETER | DESCRIPTION |
---|---|
question |
A question being asked.
TYPE:
|
context |
Context related to the question.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0 and 1. 0 being "not relevant" and 1 being "relevant". |
Dict[str, float]
|
A dictionary containing the confidence score. |
relevance
¶
relevance(
prompt: str,
response: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> float
Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt.
Example
feedback = Feedback(provider.relevance).on_input_output()
Usage on RAG Contexts
feedback = Feedback(provider.relevance).on_input().on(
TruLlama.select_source_nodes().node.text # See note below
).aggregate(np.mean)
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0 and 1. 0 being "not relevant" and 1 being "relevant".
TYPE:
|
relevance_with_cot_reasons
¶
relevance_with_cot_reasons(
prompt: str,
response: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt. Also uses chain of thought methodology and emits the reasons.
Example
feedback = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input()
    .on_output()
)
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0 and 1. 0 being "not relevant" and 1 being "relevant". |
sentiment
¶
Uses chat completion model. A function that completes a template to check the sentiment of some text.
Example
feedback = Feedback(provider.sentiment).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate sentiment of.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0 and 1. 0 being "negative sentiment" and 1 being "positive sentiment". |
sentiment_with_cot_reasons
¶
sentiment_with_cot_reasons(
text: str,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check the sentiment of some text. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.sentiment_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (negative sentiment) and 1.0 (positive sentiment). |
model_agreement
¶
Uses chat completion model. A function that gives a chat completion model the same prompt and gets a response, encouraging truthfulness. A second template then tells the model that the original response is correct and measures whether the previous chat completion response is similar.
Example
feedback = Feedback(provider.model_agreement).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not in agreement) and 1.0 (in agreement).
TYPE:
|
conciseness
¶
Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.conciseness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate the conciseness of.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not concise) and 1.0 (concise). |
conciseness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.conciseness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate the conciseness of. |
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not concise) and 1.0 (concise) and a dictionary containing the reasons for the evaluation. |
correctness
¶
Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.correctness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
A prompt to an agent.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not correct) and 1.0 (correct). |
correctness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.correctness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not correct) and 1.0 (correct) and a dictionary containing the reasons for the evaluation. |
coherence
¶
Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.coherence).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not coherent) and 1.0 (coherent).
TYPE:
|
coherence_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.coherence_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not coherent) and 1.0 (coherent) and a dictionary containing the reasons for the evaluation. |
harmfulness
¶
Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.harmfulness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not harmful) and 1.0 (harmful).
TYPE:
|
harmfulness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.harmfulness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not harmful) and 1.0 (harmful) and a dictionary containing the reasons for the evaluation. |
maliciousness
¶
Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.maliciousness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not malicious) and 1.0 (malicious).
TYPE:
|
maliciousness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.maliciousness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not malicious) and 1.0 (malicious) and a dictionary containing the reasons for the evaluation. |
helpfulness
¶
Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.helpfulness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not helpful) and 1.0 (helpful).
TYPE:
|
helpfulness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.helpfulness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not helpful) and 1.0 (helpful) and a dictionary containing the reasons for the evaluation. |
controversiality
¶
Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.controversiality).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not controversial) and 1.0 (controversial).
TYPE:
|
controversiality_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.controversiality_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not controversial) and 1.0 (controversial) and a dictionary containing the reasons for the evaluation. |
misogyny
¶
Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.misogyny).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not misogynistic) and 1.0 (misogynistic).
TYPE:
|
misogyny_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.misogyny_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not misogynistic) and 1.0 (misogynistic) and a dictionary containing the reasons for the evaluation. |
criminality
¶
Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.criminality).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not criminal) and 1.0 (criminal).
TYPE:
|
criminality_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.criminality_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not criminal) and 1.0 (criminal) and a dictionary containing the reasons for the evaluation. |
insensitivity
¶
Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.insensitivity).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not insensitive) and 1.0 (insensitive).
TYPE:
|
insensitivity_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.insensitivity_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not insensitive) and 1.0 (insensitive) and a dictionary containing the reasons for the evaluation. |
comprehensiveness_with_cot_reasons
¶
comprehensiveness_with_cot_reasons(
source: str,
summary: str,
min_score: int = 0,
max_score: int = 3,
) -> Tuple[float, Dict]
Uses chat completion model. A function that tries to distill main points and compares a summary against those main points. This feedback function only has a chain of thought implementation because the reasoning is essential to assessing comprehensiveness.
Example
feedback = Feedback(provider.comprehensiveness_with_cot_reasons).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
source |
Text corresponding to source material.
TYPE:
|
summary |
Text corresponding to a summary.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not comprehensive) and 1.0 (comprehensive) and a dictionary containing the reasons for the evaluation. |
summarization_with_cot_reasons
¶
Summarization is deprecated in favor of comprehensiveness. This function is no longer implemented.
stereotypes
¶
Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt.
Example
feedback = Feedback(provider.stereotypes).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed). |
stereotypes_with_cot_reasons
¶
stereotypes_with_cot_reasons(
prompt: str,
response: str,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt.
Example
feedback = Feedback(provider.stereotypes_with_cot_reasons).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed) and a dictionary containing the reasons for the evaluation. |
groundedness_measure_with_cot_reasons
¶
groundedness_measure_with_cot_reasons(
source: str,
statement: str,
criteria: Optional[str] = None,
use_sent_tokenize: bool = True,
filter_trivial_statements: bool = True,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, dict]
A measure to track if the source material supports each sentence in the statement using an LLM provider.
The statement will first be split by a tokenizer into its component sentences.
Then, trivial statements are eliminated so as to not dilute the evaluation.
The LLM will process each statement, using chain of thought methodology to emit the reasons.
Abstentions will be considered as grounded.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
provider = OpenAI()
f_groundedness = (
Feedback(provider.groundedness_measure_with_cot_reasons)
.on(context.collect())
.on_output()
)
To further explain how the function works under the hood, consider the statement:
"Hi. I'm here to help. The University of Washington is a public research university. UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"
The function will split the statement into its component sentences:
- "Hi."
- "I'm here to help."
- "The University of Washington is a public research university."
- "UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"
Next, trivial statements are removed, leaving only:
- "The University of Washington is a public research university."
- "UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"
The LLM will then process each remaining sentence to assess its groundedness.
For the sake of this example, suppose the LLM grades the groundedness of one sentence as 10 and the other as 0 on a 0 to 10 scale.
Then, the scores are normalized and averaged to give a final groundedness score of 0.5.
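The walkthrough above can be sketched in plain Python. This is only an illustration under stated assumptions: a regular expression stands in for the punkt tokenizer, a word-count heuristic stands in for the trivial-statement filter, and `raw_score_fn` is a hypothetical stub replacing the LLM call. None of these helpers are TruLens APIs.

```python
import re
from statistics import mean
from typing import Callable, List


def split_sentences(statement: str) -> List[str]:
    """Naive splitter standing in for a real sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", statement)
    return [part.strip() for part in parts if part.strip()]


def is_trivial(sentence: str, min_words: int = 5) -> bool:
    """Assumed heuristic: very short sentences carry no checkable claim."""
    return len(sentence.split()) < min_words


def groundedness(
    statement: str,
    raw_score_fn: Callable[[str], float],
    min_score_val: int = 0,
    max_score_val: int = 10,
) -> float:
    """Split, drop trivial sentences, score each one, normalize, average."""
    sentences = [s for s in split_sentences(statement) if not is_trivial(s)]
    if not sentences:
        raise ValueError("no non-trivial sentences to evaluate")
    span = max_score_val - min_score_val
    return mean((raw_score_fn(s) - min_score_val) / span for s in sentences)


statement = (
    "Hi. I'm here to help. "
    "The University of Washington is a public research university. "
    "UW's connections to major corporations in Seattle contribute to its "
    "reputation as a hub for innovation and technology"
)
# Stub grader: 10 for the first remaining sentence, 0 for the other.
result = groundedness(
    statement, lambda s: 10 if s.startswith("The University") else 0
)
```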
PARAMETER | DESCRIPTION |
---|---|
source |
The source that should support the statement.
TYPE:
|
statement |
The statement to check groundedness.
TYPE:
|
criteria |
The specific criteria for evaluation. Defaults to None.
TYPE:
|
use_sent_tokenize |
Whether to split the statement into sentences using punkt sentence tokenizer. If
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, dict]
|
Tuple[float, dict]: A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation. |
qs_relevance_with_cot_reasons
¶
qs_relevance_with_cot_reasons(*args, **kwargs)
Deprecated. Use relevance_with_cot_reasons
instead.
groundedness_measure_with_cot_reasons_consider_answerability
¶
groundedness_measure_with_cot_reasons_consider_answerability(
source: str,
statement: str,
question: str,
criteria: Optional[str] = None,
use_sent_tokenize: bool = True,
filter_trivial_statements: bool = True,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, dict]
A measure to track if the source material supports each sentence in the statement using an LLM provider.
The statement will first be split by a tokenizer into its component sentences.
Then, trivial statements are eliminated so as to not dilute the evaluation.
The LLM will process each statement, using chain of thought methodology to emit the reasons.
In the case of abstentions, such as 'I do not know', the LLM will be asked to consider the answerability of the question given the source material.
If the question is considered answerable, abstentions will be considered as not grounded and punished with low scores. Otherwise, unanswerable abstentions will be considered grounded.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
provider = OpenAI()
f_groundedness = (
Feedback(provider.groundedness_measure_with_cot_reasons_consider_answerability)
.on(context.collect())
.on_output()
.on_input()
)
PARAMETER | DESCRIPTION |
---|---|
source |
The source that should support the statement.
TYPE:
|
statement |
The statement to check groundedness.
TYPE:
|
question |
The question to check answerability.
TYPE:
|
criteria |
The specific criteria for evaluation. Defaults to None.
TYPE:
|
use_sent_tokenize |
Whether to split the statement into sentences using punkt sentence tokenizer. If
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, dict]
|
Tuple[float, dict]: A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation. |
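The abstention rule described for this function can be summarized as a small decision sketch. The function name `score_with_answerability`, its boolean inputs, and the use of exactly 0.0 as the "low score" are assumptions for illustration; in the real function an LLM makes both judgments.

```python
def score_with_answerability(
    is_abstention: bool,
    question_is_answerable: bool,
    llm_groundedness: float,
) -> float:
    """Return a groundedness score in [0, 1] for one sentence."""
    if not is_abstention:
        # Ordinary sentences keep the groundedness score from the LLM.
        return llm_groundedness
    if question_is_answerable:
        # The source could have answered the question, so abstaining is penalized.
        return 0.0
    # The source cannot answer the question, so abstaining counts as grounded.
    return 1.0


penalized = score_with_answerability(True, True, 0.9)
accepted = score_with_answerability(True, False, 0.1)
unchanged = score_with_answerability(False, True, 0.7)
```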
moderation_hate
¶
A function that checks if text is hate speech.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_hate, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not hate) and 1.0 (hate).
TYPE:
|
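Because the moderation functions return values where lower is better (hence `higher_is_better=False` in the example), a downstream check might flag high scores. The sketch below is a hypothetical post-processing step with made-up scores and an assumed threshold of 0.5; it does not call the moderation endpoint.

```python
from typing import Dict, List


def flag_categories(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the names of categories whose score meets or exceeds the threshold."""
    return sorted(name for name, score in scores.items() if score >= threshold)


# Illustrative values only, not real moderation output.
example_scores = {"hate": 0.02, "violence": 0.91, "harassment": 0.5}
flagged = flag_categories(example_scores)
```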
moderation_hatethreatening
¶
A function that checks if text is threatening speech.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_hatethreatening, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not threatening) and 1.0 (threatening).
TYPE:
|
moderation_selfharm
¶
A function that checks if text is about self harm.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_selfharm, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not self harm) and 1.0 (self harm).
TYPE:
|
moderation_sexual
¶
A function that checks if text is sexual speech.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_sexual, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not sexual) and 1.0 (sexual).
TYPE:
|
moderation_sexualminors
¶
A function that checks if text is about sexual minors.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_sexualminors, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not sexual minors) and 1.0 (sexual minors).
TYPE:
|
moderation_violence
¶
A function that checks if text is about violence.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_violence, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not violence) and 1.0 (violence).
TYPE:
|
moderation_violencegraphic
¶
A function that checks if text is about graphic violence.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_violencegraphic, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not graphic violence) and 1.0 (graphic violence).
TYPE:
|
moderation_harassment
¶
A function that checks if text is harassment.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_harassment, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not harassment) and 1.0 (harassment).
TYPE:
|
moderation_harassment_threatening
¶
A function that checks if text is threatening harassment.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_harassment_threatening, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not harassment/threatening) and 1.0 (harassment/threatening).
TYPE:
|
AzureOpenAI
¶
Bases: OpenAI
Warning
Azure OpenAI does not support the OpenAI moderation endpoint.
Out of the box feedback functions calling AzureOpenAI APIs. Has the same functionality as the OpenAI out of the box feedback functions, excluding the moderation endpoint, which Azure does not support. Export the following environment variables, which can be retrieved from https://oai.azure.com/ .
- AZURE_OPENAI_ENDPOINT
- AZURE_OPENAI_API_KEY
- OPENAI_API_VERSION
The deployment name used below can also be found on the Azure OpenAI page.
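Before constructing the provider, it can help to confirm that the three variables listed above are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_azure_vars` is hypothetical and is not part of TruLens.

```python
import os

# Variable names taken from the list above.
REQUIRED_AZURE_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
)


def missing_azure_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AZURE_VARS if not env.get(name)]
```

Calling `missing_azure_vars()` with no arguments inspects the current process environment; an empty list means all three variables are present.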
Example
from trulens.providers.openai import AzureOpenAI
openai_provider = AzureOpenAI(deployment_name="...")
openai_provider.relevance(
prompt="Where is Germany?",
response="Poland is in Europe."
) # low relevance
PARAMETER | DESCRIPTION |
---|---|
deployment_name |
The name of the deployment.
TYPE:
|
Attributes¶
tru_class_info
instance-attribute
¶
tru_class_info: Class
Class information of this pydantic object for use in deserialization.
Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.
Functions¶
load
staticmethod
¶
load(obj, *args, **kwargs)
Deserialize/load this object using the class information in tru_class_info to lookup the actual class that will do the deserialization.
model_validate
classmethod
¶
model_validate(*args, **kwargs) -> Any
Deserialize a jsonized version of the app into the instance of the class it was serialized from.
Note
This process uses extra information stored in the jsonized object and handled by WithClassInfo.
generate_score
¶
generate_score(
system_prompt: str,
user_prompt: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 10,
temperature: float = 0.0,
) -> float
Base method to generate a score normalized to 0 to 1, used for evaluation.
PARAMETER | DESCRIPTION |
---|---|
system_prompt |
A pre-formatted system prompt.
TYPE:
|
user_prompt |
An optional user prompt. |
min_score_val |
The minimum score value.
TYPE:
|
max_score_val |
The maximum score value.
TYPE:
|
temperature |
The temperature for the LLM response.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
The score on a 0-1 scale. |
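To illustrate what "normalized to 0 to 1" means for `min_score_val` and `max_score_val`, here is a minimal sketch that assumes plain linear (min-max) rescaling. The helper `normalize_score` and its clipping of out-of-range values are assumptions for illustration, not the provider's actual code.

```python
def normalize_score(raw, min_score_val=0, max_score_val=10):
    """Rescale a raw score from [min_score_val, max_score_val] to [0, 1].

    Assumption: linear rescaling, with out-of-range raw scores clipped
    to the nearest bound.
    """
    if max_score_val <= min_score_val:
        raise ValueError("max_score_val must be greater than min_score_val")
    clipped = min(max(raw, min_score_val), max_score_val)
    return (clipped - min_score_val) / (max_score_val - min_score_val)
```

Under this assumption a raw score of 5 on the default 0 to 10 scale maps to 0.5, and a raw score of 3 on a 0 to 3 scale maps to 1.0.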
generate_confidence_score
¶
generate_confidence_score(
verb_confidence_prompt: str,
user_prompt: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 10,
temperature: float = 0.0,
) -> Tuple[float, Dict[str, float]]
Base method to generate a score normalized to 0 to 1, together with a confidence score, used for evaluation.
PARAMETER | DESCRIPTION |
---|---|
verb_confidence_prompt |
A pre-formatted system prompt.
TYPE:
|
user_prompt |
An optional user prompt. |
min_score_val |
The minimum score value.
TYPE:
|
max_score_val |
The maximum score value.
TYPE:
|
temperature |
The temperature for the LLM response.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict[str, float]]
|
The feedback score on a 0-1 scale and the confidence score. |
generate_score_and_reasons
¶
generate_score_and_reasons(
system_prompt: str,
user_prompt: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 10,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Base method to generate a score and reason, used for evaluation.
PARAMETER | DESCRIPTION |
---|---|
system_prompt |
A pre-formatted system prompt.
TYPE:
|
user_prompt |
An optional user prompt. Defaults to None. |
min_score_val |
The minimum score value.
TYPE:
|
max_score_val |
The maximum score value.
TYPE:
|
temperature |
The temperature for the LLM response.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
The score on a 0-1 scale. |
Dict
|
Reason metadata if returned by the LLM. |
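The return shape, a normalized score plus a dictionary of reason metadata, can be sketched as follows. This is only an illustration: the `Score: N` response layout, the `"reason"` key, and the helper `parse_score_and_reasons` are hypothetical and do not describe the provider's real prompt or parsing logic.

```python
import re
from typing import Dict, Tuple


def parse_score_and_reasons(
    response: str, min_score_val: int = 0, max_score_val: int = 10
) -> Tuple[float, Dict]:
    """Extract a raw score from an LLM response and normalize it to 0-1.

    Assumes (hypothetically) that the response contains a line such as
    "Score: 5"; the full response text is kept as the reason metadata.
    """
    match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", response)
    if match is None:
        raise ValueError("no score found in response")
    raw = float(match.group(1))
    score = (raw - min_score_val) / (max_score_val - min_score_val)
    return score, {"reason": response.strip()}
```

A caller would typically unpack the result as `score, reasons = parse_score_and_reasons(text)` and log `reasons` alongside the numeric score.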
context_relevance
¶
context_relevance(
question: str,
context: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> float
Uses chat completion model. A function that completes a template to check the relevance of the context to the question.
Example
import numpy as np
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
context = TruChain.select_context(rag_app)
feedback = (
Feedback(provider.context_relevance)
.on_input()
.on(context)
.aggregate(np.mean)
)
PARAMETER | DESCRIPTION |
---|---|
question |
A question being asked.
TYPE:
|
context |
Context related to the question.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not relevant) and 1.0 (relevant). |
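The example for this function ends with `.aggregate(np.mean)`: each retrieved context is scored against the question separately, and the per-context scores are then averaged into one value. A minimal sketch of that idea, using the standard library instead of NumPy, is shown below. `aggregate_context_relevance` and `toy_score` are hypothetical stand-ins; a real setup would call the provider rather than a keyword check.

```python
from statistics import mean


def aggregate_context_relevance(question, contexts, score_fn):
    """Score each context against the question, then average the scores."""
    per_context = [score_fn(question, context) for context in contexts]
    if not per_context:
        raise ValueError("at least one context is required")
    return mean(per_context)


def toy_score(question, context):
    # Hypothetical stand-in for an LLM-based relevance score in [0, 1].
    return 1.0 if "germany" in context.lower() else 0.0


result = aggregate_context_relevance(
    "Where is Germany?",
    ["Germany is in Central Europe.", "Poland borders the Baltic Sea."],
    toy_score,
)
```

Here one context scores 1.0 and the other 0.0, so the aggregated value is their mean.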
context_relevance_with_cot_reasons
¶
context_relevance_with_cot_reasons(
question: str,
context: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also uses chain of thought methodology and emits the reasons.
Example
import numpy as np
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
context = TruChain.select_context(rag_app)
feedback = (
Feedback(provider.context_relevance_with_cot_reasons)
.on_input()
.on(context)
.aggregate(np.mean)
)
PARAMETER | DESCRIPTION |
---|---|
question |
A question being asked.
TYPE:
|
context |
Context related to the question.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not relevant) and 1.0 (relevant) and a dictionary containing the reasons for the evaluation. |
context_relevance_verb_confidence
¶
context_relevance_verb_confidence(
question: str,
context: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict[str, float]]
Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also asks the model to verbalize its confidence and returns that confidence score.
Example
from trulens.apps.llamaindex import TruLlama
context = TruLlama.select_context(llamaindex_rag_app)
feedback = (
Feedback(provider.context_relevance_verb_confidence)
.on_input()
.on(context)
.aggregate(np.mean)
)
PARAMETER | DESCRIPTION |
---|---|
question |
A question being asked.
TYPE:
|
context |
Context related to the question.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not relevant) and 1.0 (relevant). |
Dict[str, float]
|
A dictionary containing the confidence score. |
relevance
¶
relevance(
prompt: str,
response: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> float
Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt.
Example
feedback = Feedback(provider.relevance).on_input_output()
Usage on RAG Contexts
feedback = Feedback(provider.relevance).on_input().on(
TruLlama.select_source_nodes().node.text # See note below
).aggregate(np.mean)
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0 and 1. 0 being "not relevant" and 1 being "relevant".
TYPE:
|
relevance_with_cot_reasons
¶
relevance_with_cot_reasons(
prompt: str,
response: str,
criteria: Optional[str] = None,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt. Also uses chain of thought methodology and emits the reasons.
Example
feedback = (
Feedback(provider.relevance_with_cot_reasons)
.on_input()
.on_output()
)
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
criteria |
If provided, overrides the evaluation criteria for evaluation. Defaults to None. |
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not relevant) and 1.0 (relevant) and a dictionary containing the reasons for the evaluation. |
sentiment
¶
Uses chat completion model. A function that completes a template to check the sentiment of some text.
Example
feedback = Feedback(provider.sentiment).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate sentiment of.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0 and 1. 0 being "negative sentiment" and 1 being "positive sentiment". |
sentiment_with_cot_reasons
¶
sentiment_with_cot_reasons(
text: str,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check the sentiment of some text. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.sentiment_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (negative sentiment) and 1.0 (positive sentiment). |
model_agreement
¶
Uses chat completion model. A function that gives a chat completion model the same prompt and gets a response, encouraging truthfulness. A second template is then given to the model, with a prompt stating that the original response is correct, and the function measures whether the previous chat completion response is similar.
Example
feedback = Feedback(provider.model_agreement).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not in agreement) and 1.0 (in agreement).
TYPE:
|
conciseness
¶
Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.conciseness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate the conciseness of.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not concise) and 1.0 (concise). |
conciseness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.conciseness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate the conciseness of. |
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not concise) and 1.0 (concise) and a dictionary containing the reasons for the evaluation. |
correctness
¶
Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.correctness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate for correctness.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not correct) and 1.0 (correct). |
correctness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.correctness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not correct) and 1.0 (correct) and a dictionary containing the reasons for the evaluation. |
coherence
¶
Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.coherence).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not coherent) and 1.0 (coherent).
TYPE:
|
coherence_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.coherence_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not coherent) and 1.0 (coherent) and a dictionary containing the reasons for the evaluation. |
harmfulness
¶
Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.harmfulness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not harmful) and 1.0 (harmful).
TYPE:
|
harmfulness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.harmfulness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not harmful) and 1.0 (harmful) and a dictionary containing the reasons for the evaluation. |
maliciousness
¶
Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.maliciousness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not malicious) and 1.0 (malicious).
TYPE:
|
maliciousness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.maliciousness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not malicious) and 1.0 (malicious) and a dictionary containing the reasons for the evaluation. |
helpfulness
¶
Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.helpfulness).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not helpful) and 1.0 (helpful).
TYPE:
|
helpfulness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.helpfulness_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not helpful) and 1.0 (helpful) and a dictionary containing the reasons for the evaluation. |
controversiality
¶
Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.controversiality).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not controversial) and 1.0 (controversial).
TYPE:
|
controversiality_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.controversiality_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not controversial) and 1.0 (controversial) and a dictionary containing the reasons for the evaluation. |
misogyny
¶
Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.misogyny).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not misogynistic) and 1.0 (misogynistic).
TYPE:
|
misogyny_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.misogyny_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not misogynistic) and 1.0 (misogynistic) and a dictionary containing the reasons for the evaluation. |
criminality
¶
Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.criminality).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not criminal) and 1.0 (criminal).
TYPE:
|
criminality_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.criminality_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not criminal) and 1.0 (criminal) and a dictionary containing the reasons for the evaluation. |
insensitivity
¶
Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.insensitivity).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not insensitive) and 1.0 (insensitive).
TYPE:
|
insensitivity_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.insensitivity_with_cot_reasons).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
The text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not insensitive) and 1.0 (insensitive) and a dictionary containing the reasons for the evaluation. |
comprehensiveness_with_cot_reasons
¶
comprehensiveness_with_cot_reasons(
source: str,
summary: str,
min_score: int = 0,
max_score: int = 3,
) -> Tuple[float, Dict]
Uses chat completion model. A function that tries to distill the main points of the source and compares a summary against those main points. This feedback function only has a chain of thought implementation, because the reasoning is essential to the assessment.
Example
feedback = Feedback(provider.comprehensiveness_with_cot_reasons).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
source |
Text corresponding to source material.
TYPE:
|
summary |
Text corresponding to a summary.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (not comprehensive) and 1.0 (comprehensive) and a dictionary containing the reasons for the evaluation. |
summarization_with_cot_reasons
¶
Summarization is deprecated in favor of comprehensiveness. This function is no longer implemented.
stereotypes
¶
Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt.
Example
feedback = Feedback(provider.stereotypes).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed). |
stereotypes_with_cot_reasons
¶
stereotypes_with_cot_reasons(
prompt: str,
response: str,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, Dict]
Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.stereotypes_with_cot_reasons).on_input_output()
PARAMETER | DESCRIPTION |
---|---|
prompt |
A text prompt to an agent.
TYPE:
|
response |
The agent's response to the prompt.
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, Dict]
|
A tuple containing a value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed) and a dictionary containing the reasons for the evaluation. |
groundedness_measure_with_cot_reasons
¶
groundedness_measure_with_cot_reasons(
source: str,
statement: str,
criteria: Optional[str] = None,
use_sent_tokenize: bool = True,
filter_trivial_statements: bool = True,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, dict]
A measure to track if the source material supports each sentence in the statement using an LLM provider.
The statement will first be split by a tokenizer into its component sentences.
Then, trivial statements are eliminated so as to not dilute the evaluation.
The LLM will process each statement, using chain of thought methodology to emit the reasons.
Abstentions will be considered as grounded.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
provider = OpenAI()
f_groundedness = (
Feedback(provider.groundedness_measure_with_cot_reasons)
.on(context.collect())
.on_output()
)
To further explain how the function works under the hood, consider the statement:
"Hi. I'm here to help. The university of Washington is a public research university. UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"
The function will split the statement into its component sentences:
- "Hi."
- "I'm here to help."
- "The university of Washington is a public research university."
- "UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"
Next, trivial statements are removed, leaving only:
- "The university of Washington is a public research university."
- "UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"
The LLM will then process each remaining sentence to assess its groundedness.
For the sake of this example, the LLM will grade the groundedness of one statement as 10, and the other as 0.
Then, the scores are normalized, and averaged to give a final groundedness score of 0.5.
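The walkthrough above can be reproduced with a small, self-contained sketch. Everything here is a stand-in: the regex splitter approximates a sentence tokenizer, the `TRIVIAL` set replaces the filtering step, and the raw scores 10 and 0 are hard-coded rather than produced by an LLM. Because the example grades on a 0 to 10 scale, that range is passed explicitly.

```python
import re

statement = (
    "Hi. I'm here to help. The university of Washington is a public "
    "research university. UW's connections to major corporations in "
    "Seattle contribute to its reputation as a hub for innovation and technology"
)

# Hypothetical stand-in for the trivial-statement filter.
TRIVIAL = {"hi.", "i'm here to help."}


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def groundedness_from_raw(raw_scores, min_score_val, max_score_val):
    """Normalize each raw score to 0-1, then average the results."""
    span = max_score_val - min_score_val
    normalized = [(s - min_score_val) / span for s in raw_scores]
    return sum(normalized) / len(normalized)


sentences = split_sentences(statement)
kept = [s for s in sentences if s.lower() not in TRIVIAL]
# One sentence graded 10 and the other 0, as in the example above.
final_score = groundedness_from_raw([10, 0], min_score_val=0, max_score_val=10)
```

The four sentences reduce to two after filtering, and the normalized scores 1.0 and 0.0 average to the final groundedness of 0.5 described in the text.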
PARAMETER | DESCRIPTION |
---|---|
source |
The source that should support the statement.
TYPE:
|
statement |
The statement to check groundedness.
TYPE:
|
criteria |
The specific criteria for evaluation. Defaults to None.
TYPE:
|
use_sent_tokenize |
Whether to split the statement into sentences using punkt sentence tokenizer. If
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, dict]
|
Tuple[float, dict]: A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation. |
qs_relevance_with_cot_reasons
¶
qs_relevance_with_cot_reasons(*args, **kwargs)
Deprecated. Use relevance_with_cot_reasons instead.
groundedness_measure_with_cot_reasons_consider_answerability
¶
groundedness_measure_with_cot_reasons_consider_answerability(
source: str,
statement: str,
question: str,
criteria: Optional[str] = None,
use_sent_tokenize: bool = True,
filter_trivial_statements: bool = True,
min_score_val: int = 0,
max_score_val: int = 3,
temperature: float = 0.0,
) -> Tuple[float, dict]
A measure to track if the source material supports each sentence in the statement using an LLM provider.
The statement will first be split by a tokenizer into its component sentences.
Then, trivial statements are eliminated so as to not dilute the evaluation.
The LLM will process each statement, using chain of thought methodology to emit the reasons.
In the case of abstentions, such as 'I do not know', the LLM will be asked to consider the answerability of the question given the source material.
If the question is considered answerable, abstentions will be considered as not grounded and punished with low scores. Otherwise, unanswerable abstentions will be considered grounded.
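The abstention rule described above can be summarized as a small decision function. This is a sketch under stated assumptions: the flags `is_abstention` and `question_is_answerable` would come from LLM judgments in practice, and the concrete values 0.0 and 1.0 are illustrative choices for "low score" and "grounded", not values confirmed by the source.

```python
def score_sentence(is_abstention, question_is_answerable, llm_score):
    """Return a normalized groundedness score for one sentence.

    - Abstention on an answerable question: treated as not grounded.
    - Abstention on an unanswerable question: treated as grounded.
    - Otherwise: keep the normalized score assigned by the LLM.
    """
    if is_abstention:
        return 0.0 if question_is_answerable else 1.0
    return llm_score
```

For instance, an "I do not know" reply to a question the source could have answered scores 0.0 under this sketch, while the same reply to an unanswerable question scores 1.0.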
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
provider = OpenAI()
f_groundedness = (
Feedback(provider.groundedness_measure_with_cot_reasons_consider_answerability)
.on(context.collect())
.on_output()
.on_input()
)
PARAMETER | DESCRIPTION |
---|---|
source |
The source that should support the statement.
TYPE:
|
statement |
The statement to check groundedness.
TYPE:
|
question |
The question to check answerability.
TYPE:
|
criteria |
The specific criteria for evaluation. Defaults to None.
TYPE:
|
use_sent_tokenize |
Whether to split the statement into sentences using punkt sentence tokenizer. If
TYPE:
|
min_score_val |
The minimum score value used by the LLM before normalization. Defaults to 0.
TYPE:
|
max_score_val |
The maximum score value used by the LLM before normalization. Defaults to 3.
TYPE:
|
temperature |
The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Tuple[float, dict]
|
Tuple[float, dict]: A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation. |
moderation_hate
¶
A function that checks if text is hate speech.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_hate, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not hate) and 1.0 (hate).
TYPE:
|
moderation_hatethreatening
¶
A function that checks if text is threatening speech.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_hatethreatening, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not threatening) and 1.0 (threatening).
TYPE:
|
moderation_selfharm
¶
A function that checks if text is about self harm.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_selfharm, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not self harm) and 1.0 (self harm).
TYPE:
|
moderation_sexual
¶
A function that checks if text is sexual speech.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_sexual, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not sexual) and 1.0 (sexual).
TYPE:
|
moderation_sexualminors
¶
A function that checks if text is sexual content involving minors.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_sexualminors, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not sexual content involving minors) and 1.0 (sexual content involving minors).
TYPE:
|
moderation_violence
¶
A function that checks if text is about violence.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_violence, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not violence) and 1.0 (violence).
TYPE:
|
moderation_violencegraphic
¶
A function that checks if text is about graphic violence.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_violencegraphic, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not graphic violence) and 1.0 (graphic violence).
TYPE:
|
moderation_harassment
¶
A function that checks if text is harassment.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_harassment, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not harassment) and 1.0 (harassment).
TYPE:
|
moderation_harassment_threatening
¶
A function that checks if text is threatening harassment.
Example
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_harassment_threatening, higher_is_better=False
).on_output()
PARAMETER | DESCRIPTION |
---|---|
text |
Text to evaluate.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
A value between 0.0 (not harassment/threatening) and 1.0 (harassment/threatening).
TYPE:
|
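Each moderation function above takes a single text argument and returns a float between 0.0 and 1.0, so they can be called uniformly by name. The following is a minimal sketch of collecting the categories whose score crosses a cutoff. The helper flagged_categories, the StubProvider class, and the 0.5 threshold are all hypothetical and used only for illustration; the stub returns canned scores so the example runs without an OpenAI key.

```python
# Method names as documented in this section.
MODERATION_METHODS = (
    "moderation_hate",
    "moderation_hatethreatening",
    "moderation_selfharm",
    "moderation_sexual",
    "moderation_sexualminors",
    "moderation_violence",
    "moderation_violencegraphic",
    "moderation_harassment",
    "moderation_harassment_threatening",
)


def flagged_categories(provider, text: str, threshold: float = 0.5) -> dict:
    """Return {method_name: score} for every score at or above threshold.

    Assumption: provider exposes the moderation_* methods listed above,
    each accepting the text and returning a float in [0.0, 1.0].
    """
    scores = {name: getattr(provider, name)(text) for name in MODERATION_METHODS}
    return {name: score for name, score in scores.items() if score >= threshold}


class StubProvider:
    """Hypothetical stand-in that returns canned scores instead of calling OpenAI."""

    def __init__(self, scores: dict):
        self._scores = scores

    def __getattr__(self, name):
        # Only reached for attributes not found normally, i.e. moderation_*.
        if name.startswith("moderation_"):
            return lambda text: self._scores.get(name, 0.0)
        raise AttributeError(name)


stub = StubProvider({"moderation_violence": 0.9, "moderation_hate": 0.1})
print(flagged_categories(stub, "example text"))
```

In real use, the stub would be replaced by an OpenAI provider instance such as the openai_provider shown in the examples above. Because lower scores are better for all of these functions, the Feedback examples pass higher_is_better=False.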