Deploy, Fine-tune Foundation Models with AWS SageMaker, Iterate and Monitor with TruEra¶
SageMaker JumpStart provides a variety of pretrained open-source and proprietary models, such as Llama 2, Anthropic’s Claude, and Cohere Command, that can be quickly deployed in the SageMaker environment. In many cases, however, these foundation models are not sufficient on their own for production use cases and need to be adapted to a particular style or to new tasks. One way to surface this need is to evaluate the model against a curated ground-truth dataset. Once the need to adapt the foundation model is clear, several techniques can be used to carry that out; a popular approach is to fine-tune the model on a dataset tailored to the use case.
One challenge with this approach is that curated ground-truth datasets are expensive to create. In this blog post, we address this challenge by augmenting the workflow with a framework for extensible, automated evaluations. We start with a baseline foundation model from SageMaker JumpStart and evaluate it with TruLens, an open-source library for evaluating and tracking LLM apps. Once we identify the need for adaptation, we fine-tune the model with SageMaker JumpStart and confirm the improvement with TruLens.
TruLens evaluations are built on the abstraction of feedback functions. These functions can be implemented in several ways, including with BERT-style models, appropriately prompted large language models, and more. TruLens’ integration with Amazon Bedrock lets you easily run evaluations using LLMs available from Bedrock. The reliability of Bedrock’s infrastructure is particularly valuable for running evaluations across development and production.
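As a quick illustration of the abstraction, a feedback function is simply a scoring function wired to the inputs and outputs of your app. The minimal sketch below assumes Bedrock access in us-east-1 and uses the same Bedrock model we configure later in this notebook; the full evaluation setup appears further down.
# Minimal sketch of a feedback function (assumes Bedrock access in us-east-1)
from trulens.core import Feedback
from trulens.providers.bedrock import Bedrock

provider = Bedrock(
    model_id="amazon.titan-text-express-v1", region_name="us-east-1"
)

# Score how relevant the app's output is to its input, on a 0-1 scale
f_relevance = Feedback(provider.relevance).on_input_output()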
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy a pre-trained Llama 2 model and to fine-tune it on your dataset in the domain adaptation or instruction tuning format. We also use TruLens to identify performance issues with the base model and to validate the improvement of the fine-tuned model.
# !pip install trulens trulens-providers-bedrock sagemaker datasets boto3
Deploy Pre-trained Model¶
First, we deploy the Llama 2 model as a SageMaker endpoint. To train/deploy the 13B or 70B model, change model_id to "meta-textgeneration-llama-2-13b" or "meta-textgeneration-llama-2-70b" respectively.
model_id, model_version = "meta-textgeneration-llama-2-7b", "*"
from sagemaker.jumpstart.model import JumpStartModel
pretrained_model = JumpStartModel(model_id=model_id, model_version=model_version)
pretrained_predictor = pretrained_model.deploy(accept_eula=True)
Invoke the endpoint¶
Next, we invoke the endpoint with some sample queries. Later in this notebook, we will fine-tune this model on a custom dataset and run inference with the fine-tuned model. We will also compare the results obtained from the pre-trained and fine-tuned models.
def print_response(payload, response):
print(payload["inputs"])
print(f"> {response[0]['generated_text']}")
print("\n==================================\n")
payload = {
"inputs": "I believe the meaning of life is",
"parameters": {
"max_new_tokens": 64,
"top_p": 0.9,
"temperature": 0.6,
"return_full_text": False,
},
}
try:
response = pretrained_predictor.predict(
payload, custom_attributes="accept_eula=true"
)
print_response(payload, response)
except Exception as e:
print(e)
To learn about additional use cases of the pre-trained model, please check out the notebook Text completion: Run Llama 2 models in SageMaker JumpStart.
Dataset preparation for fine-tuning¶
You can fine-tune on a dataset in either the domain adaptation format or the instruction tuning format. Please find more details in the section Dataset instruction. In this demo, we use a subset of the Dolly dataset in the instruction tuning format. The Dolly dataset contains roughly 15,000 instruction-following records for various categories such as question answering, summarization, and information extraction, and is available under the Apache 2.0 license. We select the summarization examples for fine-tuning.
Training data is formatted as JSON lines (.jsonl), where each line is a dictionary representing a single data sample. All training data must be in a single folder, but it can be split across multiple .jsonl files. The training folder can also contain a template.json file describing the input and output formats.
To train your model on a collection of unstructured documents (text files), please see the section Example fine-tuning with Domain-Adaptation dataset format in the Appendix.
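For reference, after the filtering and column removal below, each line of train.jsonl is a JSON object with instruction, context, and response fields. An illustrative, made-up record looks like this:
{"instruction": "Summarize the following paragraph.", "context": "Amazon SageMaker JumpStart provides pretrained foundation models that can be deployed with a few lines of code.", "response": "SageMaker JumpStart offers pretrained models that are quick to deploy."}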
from datasets import load_dataset
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# To train for question answering or information extraction, replace the filter condition in the next line with example["category"] == "closed_qa" or example["category"] == "information_extraction".
summarization_dataset = dolly_dataset.filter(
lambda example: example["category"] == "summarization"
)
summarization_dataset = summarization_dataset.remove_columns("category")
# Split the dataset into train and test; the test split is used for evaluation at the end.
train_and_test_dataset = summarization_dataset.train_test_split(test_size=0.1)
# Dumping the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")
train_and_test_dataset["train"][0]
Next, we create a prompt template for formatting the data in an instruction/input format for the training job (since we are instruction fine-tuning the model in this example), and also for querying the deployed endpoint at inference time.
import json
template = {
"prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
"completion": " {response}",
}
with open("template.json", "w") as f:
json.dump(template, f)
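To make the template concrete, here is how a made-up record would be rendered into a prompt. At inference time only the prompt part is sent to the endpoint; during training, JumpStart appends the completion field after it.
# Render the prompt template with an illustrative (made-up) record
example_prompt = template["prompt"].format(
    instruction="Summarize the following paragraph.",
    context="Amazon SageMaker JumpStart provides pretrained foundation models.",
)
print(example_prompt)
The rendered prompt consists of the fixed preamble followed by the ### Instruction: and ### Input: sections populated from the record.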
import sagemaker
from sagemaker.s3 import S3Uploader
output_bucket = sagemaker.Session().default_bucket()
local_data_file = "train.jsonl"
train_data_location = f"s3://{output_bucket}/dolly_dataset"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)
print(f"Training data: {train_data_location}")
Train the model¶
Next, we fine-tune the Llama 2 7B model on the summarization subset of Dolly. The fine-tuning scripts are based on the scripts provided in this repo. To learn more about the fine-tuning scripts, please check out section 5. Few notes about the fine-tuning method. For a list of supported hyperparameters and their default values, please see section 3. Supported Hyper-parameters for fine-tuning.
from sagemaker.jumpstart.estimator import JumpStartEstimator
estimator = JumpStartEstimator(
model_id=model_id,
environment={"accept_eula": "true"},
disable_output_compression=True, # For Llama-2-70b, add instance_type = "ml.g5.48xlarge"
)
# By default, instruction tuning is set to false. To use an instruction tuning dataset, set instruction_tuned to "True".
estimator.set_hyperparameters(
instruction_tuned="True", epoch="5", max_input_length="1024"
)
estimator.fit({"training": train_data_location})
Studio kernel dying issue: if your Studio kernel dies and you lose the reference to the estimator object, please see section 6. Studio Kernel Dead/Creating JumpStart Model from the training Job for how to deploy an endpoint using the training job name and the model id.
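As a minimal sketch of that recovery path (assuming the training job above has completed; the training job name below is a placeholder you would replace with your own), re-creating the estimator could look like this:
# Illustrative sketch: re-create the estimator from a finished training job
# if the kernel died and the `estimator` object was lost.
from sagemaker.jumpstart.estimator import JumpStartEstimator

training_job_name = "<name-of-your-completed-training-job>"  # placeholder
attached_estimator = JumpStartEstimator.attach(training_job_name, model_id=model_id)
If the kernel did not die, the estimator object trained above can be deployed directly in the next step instead.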
Deploy the fine-tuned model¶
Next, we deploy the fine-tuned model and compare its performance with that of the pre-trained model.
# Deploy an endpoint for the fine-tuned model
finetuned_predictor = attached_estimator.deploy()
Evaluate the pre-trained and fine-tuned model¶
Next, we use TruLens to evaluate the performance of the fine-tuned model and compare it with the pre-trained model.
from IPython.display import HTML
from IPython.display import display
import pandas as pd
test_dataset = train_and_test_dataset["test"]
(
inputs,
ground_truth_responses,
responses_before_finetuning,
responses_after_finetuning,
) = (
[],
[],
[],
[],
)
def predict_and_print(datapoint):
# For instruction fine-tuning, we insert a special key between input and output
input_output_demarkation_key = "\n\n### Response:\n"
payload = {
"inputs": template["prompt"].format(
instruction=datapoint["instruction"], context=datapoint["context"]
)
+ input_output_demarkation_key,
"parameters": {"max_new_tokens": 100},
}
inputs.append(payload["inputs"])
ground_truth_responses.append(datapoint["response"])
# accept_eula must be set to true in custom_attributes to use Llama 2 models
pretrained_response = pretrained_predictor.predict(
payload, custom_attributes="accept_eula=true"
)
responses_before_finetuning.append(pretrained_response[0]["generated_text"])
# accept_eula must be set to true in custom_attributes to use Llama 2 models
finetuned_response = finetuned_predictor.predict(
payload, custom_attributes="accept_eula=true"
)
responses_after_finetuning.append(finetuned_response[0]["generated_text"])
try:
for i, datapoint in enumerate(test_dataset.select(range(5))):
predict_and_print(datapoint)
df = pd.DataFrame(
{
"Inputs": inputs,
"Ground Truth": ground_truth_responses,
"Response from non-finetuned model": responses_before_finetuning,
"Response from fine-tuned model": responses_after_finetuning,
}
)
display(HTML(df.to_html()))
except Exception as e:
print(e)
Set up as text-to-text LLM apps¶
def base_llm(instruction, context):
# For instruction fine-tuning, we insert a special key between input and output
input_output_demarkation_key = "\n\n### Response:\n"
payload = {
"inputs": template["prompt"].format(
instruction=instruction, context=context
)
+ input_output_demarkation_key,
"parameters": {"max_new_tokens": 200},
}
return pretrained_predictor.predict(
payload, custom_attributes="accept_eula=true"
)[0]["generated_text"]
def finetuned_llm(instruction, context):
# For instruction fine-tuning, we insert a special key between input and output
input_output_demarkation_key = "\n\n### Response:\n"
payload = {
"inputs": template["prompt"].format(
instruction=instruction, context=context
)
+ input_output_demarkation_key,
"parameters": {"max_new_tokens": 200},
}
return finetuned_predictor.predict(
payload, custom_attributes="accept_eula=true"
)[0]["generated_text"]
base_llm(test_dataset["instruction"][0], test_dataset["context"][0])
finetuned_llm(test_dataset["instruction"][0], test_dataset["context"][0])
Use TruLens for automated evaluation and tracking¶
from trulens.core import Feedback
from trulens.core import Select
from trulens.core import TruSession
from trulens.apps.basic import TruBasicApp
from trulens.feedback import GroundTruthAgreement
# Convert the test split to a DataFrame and rename the instruction column to "query"
test_dataset = pd.DataFrame(test_dataset)
test_dataset.rename(columns={"instruction": "query"}, inplace=True)
# Convert DataFrame to a list of dictionaries
golden_set = test_dataset[["query", "response"]].to_dict(orient="records")
# Instantiate Bedrock
from trulens.providers.bedrock import Bedrock
# Initialize Bedrock as feedback function provider
bedrock = Bedrock(
model_id="amazon.titan-text-express-v1", region_name="us-east-1"
)
# Create a Feedback object for ground truth similarity
ground_truth = GroundTruthAgreement(golden_set, provider=bedrock)
# Call the agreement measure on the instruction and output
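# Select.Record.calls[0].args.args[0] selects the first positional argument of the
# wrapped text-to-text app (the instruction/query); args.args[1] selects the context.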
f_groundtruth = (
Feedback(ground_truth.agreement_measure, name="Ground Truth Agreement")
.on(Select.Record.calls[0].args.args[0])
.on_output()
)
# Answer Relevance
f_answer_relevance = (
Feedback(bedrock.relevance_with_cot_reasons, name="Answer Relevance")
.on(Select.Record.calls[0].args.args[0])
.on_output()
)
# Context Relevance
f_context_relevance = (
Feedback(
bedrock.context_relevance_with_cot_reasons, name="Context Relevance"
)
.on(Select.Record.calls[0].args.args[0])
.on(Select.Record.calls[0].args.args[1])
)
# Groundedness
f_groundedness = (
Feedback(bedrock.groundedness_measure_with_cot_reasons, name="Groundedness")
.on(Select.Record.calls[0].args.args[1])
.on_output()
)
base_recorder = TruBasicApp(
base_llm,
app_name="LLM",
app_version="base",
feedbacks=[
f_groundtruth,
f_answer_relevance,
f_context_relevance,
f_groundedness,
],
)
finetuned_recorder = TruBasicApp(
finetuned_llm,
app_name="LLM",
app_version="finetuned",
feedbacks=[
f_groundtruth,
f_answer_relevance,
f_context_relevance,
f_groundedness,
],
)
for i in range(len(test_dataset)):
with base_recorder as recording:
base_recorder.app(test_dataset["query"][i], test_dataset["context"][i])
with finetuned_recorder as recording:
finetuned_recorder.app(
test_dataset["query"][i], test_dataset["context"][i]
)
# Ignore minor errors in the stack trace
records, feedback = TruSession().get_records_and_feedback()
TruSession().get_leaderboard()
TruSession().run_dashboard()
Clean up resources¶
# Delete the SageMaker models and endpoints to avoid incurring ongoing charges
pretrained_predictor.delete_model()
pretrained_predictor.delete_endpoint()
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()