Advanced Evaluation Methods¶
In this notebook, we will level up our evaluation using chain-of-thought reasoning. Reasoning through intermediate steps improves an LLM's ability to perform complex tasks, and that includes evaluations. Even better, this reasoning is useful for us as humans to identify and understand new failure modes such as irrelevant retrieval or hallucination.
Second, in this example we will leverage deferred evaluations. Deferred evaluations are especially useful for cases such as sub-question queries, where the structure of the serialized record can vary. By defining context evaluation in two different ways, we can let deferred evaluation try both and keep the one that matches the structure of the serialized record. Deferred evaluations can also be run later, for example during off-peak hours for your app.
# !pip install trulens trulens-apps-llamaindex trulens-providers-openai llama_index==0.10.11 sentence-transformers transformers pypdf gdown
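To make the chain-of-thought idea concrete, feedback methods ending in _with_cot_reasons return a (score, reasons) pair rather than a bare score. Below is a minimal, optional preview with made-up example strings; it assumes OPENAI_API_KEY is already set in your environment and uses the same OpenAI provider class that we configure later in this notebook.
from trulens.providers.openai import OpenAI as fOpenAI

preview_provider = fOpenAI()
# The reasons text is what lets us spot failure modes such as irrelevant
# retrieval or hallucination.
score, reasons = preview_provider.relevance_with_cot_reasons(
    "What drives ocean acidification?",  # made-up example question
    "Ocean acidification is driven primarily by the uptake of atmospheric CO2.",
)
print(score)
print(reasons)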
Query Engine Construction¶
import os
import openai
from trulens.core import Feedback
from trulens.core import FeedbackMode
from trulens.core import Select
from trulens.core import TruSession
from trulens.apps.llamaindex import TruLlama
from trulens.providers.openai import OpenAI as fOpenAI
session = TruSession()
session.reset_database()
os.environ["OPENAI_API_KEY"] = "..."
openai.api_key = os.environ["OPENAI_API_KEY"]
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
# sentence-window index
!gdown "https://drive.google.com/uc?id=16pH4NETEs43dwJUvYnJ9Z-bsR9_krkrP"
!tar -xzf sentence_index.tar.gz
# Merge into a single large document rather than one document per page
from llama_index.core import Document
document = Document(text="\n\n".join([doc.text for doc in documents]))
from llama_index.core import ServiceContext
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.llms.openai import OpenAI
# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=3,
window_metadata_key="window",
original_text_metadata_key="original_text",
)
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
sentence_context = ServiceContext.from_defaults(
llm=llm,
embed_model="local:BAAI/bge-small-en-v1.5",
node_parser=node_parser,
)
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core import load_index_from_storage
if not os.path.exists("./sentence_index"):
sentence_index = VectorStoreIndex.from_documents(
[document], service_context=sentence_context
)
sentence_index.storage_context.persist(persist_dir="./sentence_index")
else:
sentence_index = load_index_from_storage(
StorageContext.from_defaults(persist_dir="./sentence_index"),
service_context=sentence_context,
)
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.postprocessor import SentenceTransformerRerank
sentence_window_engine = sentence_index.as_query_engine(
similarity_top_k=6,
# the target key defaults to `window` to match the node_parser's default
node_postprocessors=[
MetadataReplacementPostProcessor(target_metadata_key="window"),
SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base"),
],
)
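Before adding more machinery, we can optionally sanity-check the sentence window engine with a single query. This makes a live OpenAI call; the question below is just an example.
sample_response = sentence_window_engine.query(
    "How do ocean ecosystems respond to warming?"
)
print(sample_response)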
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import ToolMetadata
sentence_sub_engine = SubQuestionQueryEngine.from_defaults(
[
QueryEngineTool(
query_engine=sentence_window_engine,
metadata=ToolMetadata(
name="climate_report", description="Climate Report on Oceans."
),
)
],
service_context=sentence_context,
verbose=False,
)
import nest_asyncio
nest_asyncio.apply()
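Optionally, to see how a question gets decomposed into sub-questions, we can build a verbose copy of the sub-question engine and run one test query. This also makes live OpenAI calls, and the question below is made up.
# Same construction as above, but with verbose=True so the generated
# sub-questions and their intermediate answers are printed.
verbose_sub_engine = SubQuestionQueryEngine.from_defaults(
    [
        QueryEngineTool(
            query_engine=sentence_window_engine,
            metadata=ToolMetadata(
                name="climate_report", description="Climate Report on Oceans."
            ),
        )
    ],
    service_context=sentence_context,
    verbose=True,
)
verbose_sub_engine.query("How does ocean warming affect coral reefs?")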
import numpy as np
# Initialize OpenAI provider
provider = fOpenAI()
# Helpfulness
f_helpfulness = Feedback(provider.helpfulness).on_output()
# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(provider.relevance_with_cot_reasons).on_input_output()
# Context relevance between the question and each retrieved context chunk, with chain-of-thought reasoning.
# The context lives in a different place in the record for sub-questions, so we define that feedback separately.
f_context_relevance_subquestions = (
Feedback(provider.context_relevance_with_cot_reasons)
.on_input()
.on(Select.Record.calls[0].rets.source_nodes[:].node.text)
.aggregate(np.mean)
)
f_context_relevance = (
Feedback(provider.context_relevance_with_cot_reasons)
.on_input()
.on(Select.Record.calls[0].args.prompt_args.context_str)
.aggregate(np.mean)
)
# Groundedness, with chain-of-thought reasoning
# As with context relevance, we define it twice: once for the sub-questions and once for the overall question.
f_groundedness_subquestions = (
Feedback(provider.groundedness_measure_with_cot_reasons)
.on(Select.Record.calls[0].rets.source_nodes[:].node.text.collect())
.on_output()
)
f_groundedness = (
Feedback(provider.groundedness_measure_with_cot_reasons)
.on(Select.Record.calls[0].args.prompt_args.context_str)
.on_output()
)
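As a quick spot check of the groundedness feedback itself, we can call it directly on a pair of made-up strings and inspect the chain-of-thought reasons. This makes an OpenAI call; the strings are illustrative only.
spot_score, spot_reasons = provider.groundedness_measure_with_cot_reasons(
    "Kelp forests provide habitat and nursery grounds for many fish species.",  # source text
    "Kelp forests support fish habitat.",  # statement to check against the source
)
print(spot_score)
print(spot_reasons)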
# We'll use the recorder in deferred mode so we can log all of the sub-questions before starting evaluation.
# This approach gives us smoother handling of the evals and more consistent logging at high volume.
# In addition, for our two context relevance (and groundedness) definitions, deferred mode will simply keep whichever one matches the structure of each record.
tru_recorder = TruLlama(
sentence_sub_engine,
app_name="App",
feedbacks=[
f_qa_relevance,
f_context_relevance,
f_context_relevance_subquestions,
f_groundedness,
f_groundedness_subquestions,
f_helpfulness,
],
feedback_mode=FeedbackMode.DEFERRED,
)
questions = [
"Based on the provided text, discuss the impact of human activities on the natural carbon dynamics of estuaries, shelf seas, and other intertidal and shallow-water habitats. Provide examples from the text to support your answer.",
"Analyze the combined effects of exploitation and multi-decadal climate fluctuations on global fisheries yields. How do these factors make it difficult to assess the impacts of global climate change on fisheries yields? Use specific examples from the text to support your analysis.",
"Based on the study by Gutiérrez-Rodríguez, A.G., et al., 2018, what potential benefits do seaweeds have in the field of medicine, specifically in relation to cancer treatment?",
"According to the research conducted by Haasnoot, M., et al., 2020, how does the uncertainty in Antarctic mass-loss impact the coastal adaptation strategy of the Netherlands?",
"Based on the context, explain how the decline in warm water coral reefs is projected to impact the services they provide to society, particularly in terms of coastal protection.",
"Tell me something about the intricacies of tying a tie.",
]
for question in questions:
with tru_recorder as recording:
sentence_sub_engine.query(question)
from trulens.dashboard import run_dashboard
run_dashboard(session)
Before we start the evaluator, note that we've logged all of the records, including the sub-questions. However, we haven't computed any feedback results yet.
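If you'd like to verify this, you can pull the logged records from the session; the feedback columns will still be empty until the evaluator runs. The method name below is taken from the TruSession API as we understand it; adjust if your version differs.
records_df, feedback_cols = session.get_records_and_feedback()
print(len(records_df), "records logged")
print("feedback columns so far:", feedback_cols)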
Start the evaluator to generate the feedback results.
session.start_evaluator()