Evaluating Multi-Modal RAG¶
In this notebook guide, we’ll demonstrate how to evaluate a LlamaIndex Multi-Modal RAG system with TruLens.
# !pip install trulens trulens-apps-llamaindex trulens-providers-openai llama_index==0.10.11 ftfy regex tqdm git+https://github.com/openai/CLIP.git torch torchvision matplotlib scikit-image qdrant_client
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
Use Case: Spelling In ASL¶
In this demonstration, we will build a RAG application for teaching how to sign the American Sign Language (ASL) alphabet.
QUERY_STR_TEMPLATE = "How can I sign a {symbol}?"
Images¶
The images were taken from the ASL-Alphabet Kaggle dataset. Note that they were modified to simply include a label of the associated letter on the hand-gesture image. These altered images are what we use as context for the user queries, and they can be fetched with the download cell below (set download_notebook_data = True to download the dataset directly from this notebook).
Text Context¶
For text context, we use descriptions of each of the hand gestures sourced from https://www.deafblind.com/asl.html. We have conveniently stored these in a JSON file called asl_text_descriptions.json, which is included in the asl_data.zip download.
download_notebook_data = True
if download_notebook_data:
    !wget "https://www.dropbox.com/scl/fo/tpesl5m8ye21fqza6wq6j/h?rlkey=zknd9pf91w30m23ebfxiva9xn&dl=1" -O asl_data.zip -q
    !unzip asl_data.zip
import json
from llama_index.core import Document
from llama_index.core import SimpleDirectoryReader
# context images
image_path = "./asl_data/images"
image_documents = SimpleDirectoryReader(image_path).load_data()
# context text
with open("asl_data/asl_text_descriptions.json") as json_file:
    asl_text_descriptions = json.load(json_file)

text_format_str = "To sign {letter} in ASL: {desc}."
text_documents = [
    Document(text=text_format_str.format(letter=k, desc=v))
    for k, v in asl_text_descriptions.items()
]
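Before building an index, it can help to confirm that both modalities loaded as expected. A quick, optional check (just counts and a sample; nothing here is required for the rest of the notebook):
# optional sanity check on the loaded context documents
print(f"Loaded {len(image_documents)} image documents and {len(text_documents)} text documents.")
print(text_documents[0].text)  # e.g. "To sign A in ASL: ..."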
With our documents in hand, we can create our MultiModalVectorStoreIndex. To do so, we parse our Documents into nodes and then simply pass these nodes to the MultiModalVectorStoreIndex constructor.
from llama_index.core.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
node_parser = SentenceSplitter.from_defaults()
image_nodes = node_parser.get_nodes_from_documents(image_documents)
text_nodes = node_parser.get_nodes_from_documents(text_documents)
asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)
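If you would like to sanity-check retrieval before attaching a generator, the index can also be used as a retriever directly. A minimal sketch, assuming the standard as_retriever/retrieve interface with default top-k settings:
# optional: inspect what the multi-modal index retrieves for a sample query
retriever = asl_index.as_retriever()
for node_with_score in retriever.retrieve(QUERY_STR_TEMPLATE.format(symbol="A")):
    print(type(node_with_score.node).__name__, node_with_score.score)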
#######################################################################
## Set load_previously_generated_text_descriptions to True if you ##
## would rather use previously generated gpt-4v text descriptions ##
## that are included in the .zip download ##
#######################################################################
load_previously_generated_text_descriptions = False
from llama_index.core.schema import ImageDocument
from llama_index.legacy.multi_modal_llms.openai import OpenAIMultiModal
import tqdm
if not load_previously_generated_text_descriptions:
    # define our lmm
    openai_mm_llm = OpenAIMultiModal(
        model="gpt-4-vision-preview", max_new_tokens=300
    )

    # make a new copy since we want to store text in its attribute
    image_with_text_documents = SimpleDirectoryReader(image_path).load_data()

    # get text desc and save to text attr
    for img_doc in tqdm.tqdm(image_with_text_documents):
        response = openai_mm_llm.complete(
            prompt="Describe the images as an alternative text",
            image_documents=[img_doc],
        )
        img_doc.text = response.text

    # save so we don't have to incur expensive gpt-4v calls again
    desc_jsonl = [
        json.loads(img_doc.to_json()) for img_doc in image_with_text_documents
    ]
    with open("image_descriptions.json", "w") as f:
        json.dump(desc_jsonl, f)
else:
    # load up previously saved image descriptions and documents
    with open("asl_data/image_descriptions.json") as f:
        image_descriptions = json.load(f)

    image_with_text_documents = [
        ImageDocument.from_dict(el) for el in image_descriptions
    ]

# parse into nodes
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)
A keen reader will notice that we stored the text descriptions within the text field of an ImageDocument. As we did before, to create a MultiModalVectorStoreIndex, we'll need to parse the ImageDocuments into ImageNodes and then pass the nodes to the constructor.
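To confirm the descriptions landed where we expect, you can peek at one of them (an optional check; the exact wording will depend on the GPT-4V output or the pre-generated file):
# optional: preview one generated (or loaded) image description
print(image_with_text_documents[0].text[:300])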
Note that when ImageNodes with populated text fields are used to build a MultiModalVectorStoreIndex, we can choose to build the embeddings used for retrieval from this text. To do so, we simply set the class attribute is_image_to_text to True.
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

asl_text_desc_index = MultiModalVectorStoreIndex(
    nodes=image_with_text_nodes + text_nodes, is_image_to_text=True
)
Build Our Multi-Modal RAG Systems¶
As in the text-only case, we need to "attach" a generator to our index (which can act as a retriever) to finally assemble our RAG systems. In the multi-modal case, however, our generators are Multi-Modal LLMs, often referred to as Large Multi-Modal Models (LMMs). In this notebook, to compare the two RAG variants on equal footing, we will use GPT-4V as the generator for both. We can "attach" a generator and get a queryable interface for RAG by invoking the as_query_engine method of our indexes.
from llama_index.core.prompts import PromptTemplate
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
# define our QA prompt template
qa_tmpl_str = (
    "Images of hand gestures for ASL are provided.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "If the images provided cannot help in answering the query\n"
    "then respond that you are unable to answer the query. Otherwise,\n"
    "using only the context provided, and not prior knowledge,\n"
    "provide an answer to the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)
# define our lmms
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    max_new_tokens=300,
)
# define our RAG query engines
rag_engines = {
    "mm_clip_gpt4v": asl_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_text_desc_gpt4v": asl_text_desc_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
}
Test drive our Multi-Modal RAG¶
Let's take one of these systems for a test drive. To display the response nicely, we make use of the notebook utility function display_query_and_multimodal_response.
letter = "R"
query = QUERY_STR_TEMPLATE.format(symbol=letter)
response = rag_engines["mm_text_desc_gpt4v"].query(query)
from llama_index.core.response.notebook_utils import (
    display_query_and_multimodal_response,
)
display_query_and_multimodal_response(query, response)
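You can also inspect which context nodes were retrieved for this query. This relies on the standard Response.source_nodes attribute; depending on your LlamaIndex version, the multi-modal query engine may instead expose the retrieved text and image nodes through response.metadata.
# optional: list the retrieved nodes and their similarity scores
for node_with_score in response.source_nodes:
    print(type(node_with_score.node).__name__, node_with_score.score)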
from trulens.core import TruSession
from trulens.dashboard import run_dashboard
session = TruSession()
session.reset_database()
run_dashboard(session)
Define the RAG Triad for evaluations¶
First we need to define the feedback functions to use: answer relevance, context relevance and groundedness.
import numpy as np
# Initialize provider class
from openai import OpenAI
from trulens.core import Feedback
from trulens.apps.llamaindex import TruLlama
from trulens.providers.openai import OpenAI as fOpenAI
openai_client = OpenAI()
provider = fOpenAI(client=openai_client)
# Define a groundedness feedback function
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(TruLlama.select_source_nodes().node.text.collect())
    .on_output()
)
# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()
# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)
feedbacks = [f_groundedness, f_qa_relevance, f_context_relevance]
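Before wiring these into a recorder, you can optionally spot-check a feedback function by calling it directly. The sketch below assumes the provider's relevance_with_cot_reasons accepts a (prompt, response) pair and returns a score along with its reasoning; adjust if your TruLens version differs.
# optional: call a feedback function directly on a hand-written example
score, reasons = provider.relevance_with_cot_reasons(
    "How can I sign a C?",
    "To sign C in ASL, curve your fingers and thumb into the shape of a C.",
)
print(score)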
Set up TruLlama to log and evaluate the RAG engines¶
tru_text_desc_gpt4v = TruLlama(
    rag_engines["mm_text_desc_gpt4v"],
    app_name="text-desc-gpt4v",
    feedbacks=feedbacks,
)
tru_mm_clip_gpt4v = TruLlama(
    rag_engines["mm_clip_gpt4v"], app_name="mm_clip_gpt4v", feedbacks=feedbacks
)
Evaluate the performance of the RAG on each letter¶
letters = [
    "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
    "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
]
with tru_text_desc_gpt4v as recording:
    for letter in letters:
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        response = rag_engines["mm_text_desc_gpt4v"].query(query)

with tru_mm_clip_gpt4v as recording:
    for letter in letters:
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        response = rag_engines["mm_clip_gpt4v"].query(query)
See results¶
session.get_leaderboard(app_ids=["text-desc-gpt4v", "mm_clip_gpt4v"])
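For a closer look than the leaderboard aggregates, you can pull the individual records and their feedback results into a dataframe. A minimal sketch, assuming TruSession.get_records_and_feedback returns a records dataframe plus the list of feedback column names:
# optional: inspect per-record feedback scores across the recorded apps
records_df, feedback_cols = session.get_records_and_feedback()
records_df[feedback_cols].head()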
from trulens.dashboard import run_dashboard
run_dashboard(session)