LlamaIndex Agents + Ground Truth & Custom Evaluations
In this example, we build an agent-based app with LlamaIndex to answer questions with the help of Yelp. We'll evaluate it using a few different feedback functions (some custom, some out-of-the-box).
The first set of feedback functions completes the non-hallucination triad. However, because we're dealing with agents here, we've added a fourth leg (query translation) to cover the additional interaction between the query planner and the agent. This combination provides a foundation for eliminating hallucination in LLM applications.
- Query Translation - The first step. Here we compare the similarity of the original user query to the query sent to the agent. This ensures that we're providing the agent with the correct question.
- Context Relevance - Next, we compare the relevance of the context provided by the agent back to the original query. This ensures that we're providing context for the right question.
- Groundedness - Third, we ensure that the final answer is supported by the context. This ensures that the LLM is not extending beyond the information provided by the agent.
- Question Answer Relevance - Last, we want to make sure that the final answer provided is relevant to the user query. This last step confirms that the answer is not only supported but also useful to the end user.
In this example, we'll add two additional feedback functions.
- Ratings usage - evaluate if the summarized context uses ratings as justification. Note: this may not be relevant for all queries.
- Ground truth eval - we want to make sure our app responds correctly. We will create a ground truth set for this evaluation.
Last, we'll compare the evaluation of this app against a standalone LLM. May the best bot win?
Install TruLens and LlamaIndex
# !pip install trulens trulens-apps-llamaindex trulens-providers-openai llama_index==0.10.33 llama-index-tools-yelp==0.1.2 openai
# If running from the github repo, uncomment the below to set up paths.
# from pathlib import Path
# import sys
# trulens_path = Path().cwd().parent.parent.parent.parent.resolve()
# sys.path.append(str(trulens_path))
# Setup OpenAI Agent
import os
from llama_index.agent.openai import OpenAIAgent
import openai
# Set your API keys. If they are already set as environment variables, you can skip these steps.
os.environ["OPENAI_API_KEY"] = "sk..."
openai.api_key = os.environ["OPENAI_API_KEY"]
os.environ["YELP_API_KEY"] = "..."
os.environ["YELP_CLIENT_ID"] = "..."
# If your keys are already set as environment variables, use these to check instead:
# from trulens.core.utils.keys import check_keys
# check_keys("OPENAI_API_KEY", "YELP_API_KEY", "YELP_CLIENT_ID")
Set up our LlamaIndex App
For this app, we will use a tool from LlamaIndex to connect to Yelp and allow the agent to search for businesses and fetch reviews.
# Import and initialize our tool spec
from llama_index.core.tools.tool_spec.load_and_search.base import (
LoadAndSearchToolSpec,
)
from llama_index.tools.yelp.base import YelpToolSpec
# Add Yelp API key and client ID
tool_spec = YelpToolSpec(
api_key=os.environ.get("YELP_API_KEY"),
client_id=os.environ.get("YELP_CLIENT_ID"),
)
gordon_ramsay_prompt = "You answer questions about restaurants in the style of Gordon Ramsay, often insulting the asker."
# Create the Agent with our tools
tools = tool_spec.to_tool_list()
agent = OpenAIAgent.from_tools(
[
*LoadAndSearchToolSpec.from_defaults(tools[0]).to_tool_list(),
*LoadAndSearchToolSpec.from_defaults(tools[1]).to_tool_list(),
],
verbose=True,
system_prompt=gordon_ramsay_prompt,
)
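With the agent assembled, a quick smoke test helps confirm the Yelp tools are wired up before we add any instrumentation. This is a minimal sketch assuming valid OpenAI and Yelp credentials; the exact response will vary (and will be suitably rude).
# Optional sanity check: the agent should call the Yelp tools and answer in character.
# response = agent.query("Where can I get a decent burger in San Francisco?")
# print(response)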
Create a standalone GPT-3.5 for comparison
client = openai.OpenAI()
chat_completion = client.chat.completions.create
from trulens.apps.custom import TruCustomApp
from trulens.core import instrument
class LLMStandaloneApp:
    @instrument
    def __call__(self, prompt):
        return (
            chat_completion(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": gordon_ramsay_prompt},
                    {"role": "user", "content": prompt},
                ],
            )
            .choices[0]
            .message.content
        )


llm_standalone = LLMStandaloneApp()
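The standalone app can be exercised directly like any Python callable. A minimal sketch, assuming the OpenAI key is set; output will vary.
# Optional smoke test of the standalone app.
# print(llm_standalone("Where can I get a decent burger in San Francisco?"))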
Evaluation and Tracking with TruLens
# imports required for tracking and evaluation
from trulens.core import Feedback
from trulens.core import Select
from trulens.core import TruSession
from trulens.feedback import GroundTruthAgreement
from trulens.apps.llamaindex import TruLlama
from trulens.providers.openai import OpenAI
session = TruSession()
# session.reset_database() # if needed
Evaluation setup
To set up our evaluation, we'll first create two new custom feedback functions: query_translation_score and ratings_usage. These are straightforward prompts to the OpenAI API.
class Custom_OpenAI(OpenAI):
    def query_translation_score(self, question1: str, question2: str) -> float:
        prompt = f"Your job is to rate how similar two questions are on a scale of 1 to 10. Respond with the number only. QUESTION 1: {question1}; QUESTION 2: {question2}"
        return self.generate_score(system_prompt=prompt)

    def ratings_usage(self, last_context: str) -> float:
        prompt = f"Your job is to respond with a '1' if the following statement mentions ratings or reviews, and a '0' if not. STATEMENT: {last_context}"
        return self.generate_score(system_prompt=prompt)
Now that we have all of our feedback functions available, we can instantiate them. For many of our evals, we want to check intermediate parts of our app, such as the query passed to the Yelp tool or the summarization of the Yelp content. We'll do so here using Select.
# unstable: perhaps reduce temperature?
custom_provider = Custom_OpenAI()
# Input to tool based on trimmed user input.
f_query_translation = (
Feedback(custom_provider.query_translation_score, name="Query Translation")
.on_input()
.on(Select.Record.app.query[0].args.str_or_query_bundle)
)
f_ratings_usage = Feedback(
custom_provider.ratings_usage, name="Ratings Usage"
).on(Select.Record.app.query[0].rets.response)
# Result of this prompt: Given the context information and not prior knowledge, answer the query.
# Query: address of Gumbo Social
# Answer: "
provider = OpenAI()
# Context relevance between question and last context chunk (i.e. summary)
f_context_relevance = (
Feedback(provider.context_relevance, name="Context Relevance")
.on_input()
.on(Select.Record.app.query[0].rets.response)
)
# Groundedness
f_groundedness = (
Feedback(
provider.groundedness_measure_with_cot_reasons, name="Groundedness"
)
.on(Select.Record.app.query[0].rets.response)
.on_output()
)
# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(
provider.relevance, name="Answer Relevance"
).on_input_output()
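Before wiring these into the apps, you can sanity-check a provider by calling a feedback function directly on plain strings. This is a minimal sketch with made-up inputs; scores are floats in [0, 1] and will vary with the model.
# Optional direct calls to the providers on sample strings.
# print(provider.relevance("Is Park Tavern open?", "Yes, Park Tavern is currently open."))
# print(custom_provider.query_translation_score(
#     "Is Park Tavern open?", "Park Tavern San Francisco hours"
# ))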
Ground Truth Eval
It's also useful in many cases to do ground truth eval with small golden sets. We'll do so here.
golden_set = [
    {
        "query": "Hello there mister AI. What's the vibe like at oprhan andy's in SF?",
        "response": "welcoming and friendly",
    },
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {
        "query": "I'm in san francisco for the morning, does Juniper serve pastries?",
        "response": "Yes",
    },
    {
        "query": "What's the address of Gumbo Social in San Francisco?",
        "response": "5176 3rd St, San Francisco, CA 94124",
    },
    {
        "query": "What are the reviews like of Gola in SF?",
        "response": "Excellent, 4.6/5",
    },
    {
        "query": "Where's the best pizza in New York City",
        "response": "Joe's Pizza",
    },
    {
        "query": "What's the best diner in Toronto?",
        "response": "The George Street Diner",
    },
]
f_groundtruth = Feedback(
GroundTruthAgreement(golden_set, provider=provider).agreement_measure, name="Ground Truth Eval"
).on_input_output()
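The agreement measure can also be spot-checked directly against the golden set. A minimal sketch; the query must match a golden set entry, and the provider judges semantic agreement between the app's answer and the expected response.
# Optional spot check against the golden set above.
# print(GroundTruthAgreement(golden_set, provider=provider).agreement_measure(
#     "What's the best diner in Toronto?", "The George Street Diner"
# ))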
Run the dashboard
By running the dashboard before we start making app calls, we can watch the records come in one by one.
from trulens.dashboard import run_dashboard
run_dashboard(
session,
# if running from github
# _dev=trulens_path,
# force=True
)
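If the dashboard port is already in use, or you want to restart it later, it can be shut down with stop_dashboard (assuming the default dashboard setup).
# from trulens.dashboard import stop_dashboard
# stop_dashboard(session)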
Instrument Yelp App
We can instrument our Yelp app with TruLlama and utilize the full suite of evals we set up.
tru_agent = TruLlama(
agent,
app_name="YelpAgent",
tags="agent prototype",
feedbacks=[
f_qa_relevance,
f_groundtruth,
f_context_relevance,
f_groundedness,
f_query_translation,
f_ratings_usage,
],
)
tru_agent.print_instrumented()
Instrument Standalone LLM app
Since we don't have insight into the inner workings of the OpenAI model, we cannot run many of the evals on intermediate steps.
We can still do QA relevance on input and output, and check for similarity of the answers compared to the ground truth.
tru_llm_standalone = TruCustomApp(
llm_standalone,
app_name="OpenAIChatCompletion",
tags="comparison",
feedbacks=[f_qa_relevance, f_groundtruth],
)
tru_llm_standalone.print_instrumented()
Start using our apps!
prompt_set = [
"What's the vibe like at oprhan andy's in SF?",
"What are the reviews like of Gola in SF?",
"Where's the best pizza in New York City",
"What's the address of Gumbo Social in San Francisco?",
"I'm in san francisco for the morning, does Juniper serve pastries?",
"What's the best diner in Toronto?",
]
for prompt in prompt_set:
    print(prompt)
    with tru_llm_standalone as recording:
        llm_standalone(prompt)
    record_standalone = recording.get()
    with tru_agent as recording:
        agent.query(prompt)
    record_agent = recording.get()
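Once the loop finishes, the two apps can be compared side by side. A minimal sketch using the session leaderboard, which aggregates feedback scores per app; the same comparison is available in the dashboard.
# Aggregate feedback results for both apps.
session.get_leaderboard()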