SageMaker V3 Model Optimization Example
This notebook demonstrates how to use SageMaker V3 ModelBuilder to optimize a JumpStart model for improved inference performance.
Prerequisites
Note: Ensure you have sagemaker and ipywidgets installed in your environment. The ipywidgets package is required to monitor endpoint deployment progress in Jupyter notebooks.
# Import required libraries
import json
import uuid
import time
import boto3
from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.core.resources import EndpointConfig
from sagemaker.core.helper.session_helper import Session
Step 1: Configure Model and Session
We’ll optimize the Llama 3 8B Instruct model from JumpStart using AWQ (activation-aware weight quantization).
# Configuration
MODEL_ID = "meta-textgeneration-llama-3-8b-instruct"
MODEL_NAME_PREFIX = "jumpstart-optimize-example"
ENDPOINT_NAME_PREFIX = "jumpstart-optimize-example-endpoint"
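# account_id() and boto_region_name are instance members, so create a default session first
default_session = Session()
AWS_ACCOUNT_ID = default_session.account_id()
AWS_REGION = default_session.boto_region_name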
# Generate unique identifiers
unique_id = str(uuid.uuid4())[:8]
model_name = f"{MODEL_NAME_PREFIX}-{unique_id}"
endpoint_name = f"{ENDPOINT_NAME_PREFIX}-{unique_id}"
job_name = f"js-optimize-{int(time.time())}"
print(f"Model name: {model_name}")
print(f"Endpoint name: {endpoint_name}")
print(f"Optimization job name: {job_name}")
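SageMaker rejects resource names that are too long or contain unsupported characters, so it can help to validate generated names before starting a long-running job. The following is a minimal sketch, assuming the commonly documented constraint of at most 63 characters made up of alphanumerics and hyphens. The helper `make_resource_name` is hypothetical and not part of the SageMaker SDK.

```python
import re
import uuid

# Assumed constraint: at most 63 characters, alphanumerics optionally separated
# by hyphens, starting and ending with an alphanumeric character.
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
_MAX_NAME_LENGTH = 63


def make_resource_name(prefix: str, suffix_length: int = 8) -> str:
    """Append a short random hex suffix to a prefix and validate the result.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    suffix = uuid.uuid4().hex[:suffix_length]
    name = f"{prefix}-{suffix}"
    if len(name) > _MAX_NAME_LENGTH:
        raise ValueError(
            f"{name!r} is {len(name)} characters; the assumed limit is {_MAX_NAME_LENGTH}"
        )
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"{name!r} contains characters other than letters, digits, and hyphens")
    return name


print(make_resource_name("jumpstart-optimize-example"))
```

Failing fast here is cheaper than discovering an invalid name after the optimization job has been submitted.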
Step 2: Create Schema Builder
Define the input/output schema for the text generation model.
# Create schema builder for text generation
sample_input = {"inputs": "What are falcons?", "parameters": {"max_new_tokens": 32}}
sample_output = [{"generated_text": "Falcons are small to medium-sized birds of prey."}]
schema_builder = SchemaBuilder(sample_input, sample_output)
print("Schema builder created successfully!")
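Because the schema is derived from sample data, the samples should look like the requests you will actually send. The sketch below is plain Python and does not use `SchemaBuilder` itself. It compares a request against the sample after a JSON round trip, which is how the payload travels to the endpoint. The function `matches_sample_shape` is a hypothetical helper, and its strict key-for-key comparison is an assumption you may want to relax (for example, to allow extra generation parameters).

```python
import json

sample_input = {"inputs": "What are falcons?", "parameters": {"max_new_tokens": 32}}


def matches_sample_shape(payload: dict, sample: dict) -> bool:
    """Return True if payload has the same keys and value types as sample, recursively.

    Hypothetical helper for illustration; this is a strict comparison.
    """
    if payload.keys() != sample.keys():
        return False
    for key, sample_value in sample.items():
        value = payload[key]
        if isinstance(sample_value, dict):
            if not isinstance(value, dict) or not matches_sample_shape(value, sample_value):
                return False
        elif type(value) is not type(sample_value):
            return False
    return True


request = {
    "inputs": "What are the benefits of machine learning?",
    "parameters": {"max_new_tokens": 50},
}
# Round-trip through JSON to mimic what the endpoint receives.
round_tripped = json.loads(json.dumps(request))
print(matches_sample_shape(round_tripped, sample_input))
```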
Step 3: Initialize SageMaker Session
Create a SageMaker session with the specified AWS region.
# Create SageMaker session
boto_session = boto3.Session(region_name=AWS_REGION)
sagemaker_session = Session(boto_session=boto_session)
print(f"SageMaker session created for region: {AWS_REGION}")
Step 4: Create ModelBuilder
Initialize the ModelBuilder with the JumpStart model ID and schema.
# Initialize ModelBuilder
model_builder = ModelBuilder(
    model=MODEL_ID,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
)
print("ModelBuilder created successfully!")
Step 5: Optimize the Model
Optimize the model using AWQ quantization for improved inference performance. This step may take up to 30 minutes to complete!
# Optimize the model with AWQ quantization
print("Optimizing JumpStart model...")
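optimized_model = model_builder.optimize(
    instance_type="ml.g5.2xlarge",
    # DJL LMI container image; the registry must be in the same region as the job
    image_uri=f"763104351884.dkr.ecr.{AWS_REGION}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124",
    # Bucket named sagemaker-<region>-<account>; replace it with a bucket you own if needed
    output_path=f"s3://sagemaker-{AWS_REGION}-{AWS_ACCOUNT_ID}/optimize-output/jumpstart-{unique_id}/",
    quantization_config={"OverrideEnvironment": {"OPTION_QUANTIZE": "awq"}},
    accept_eula=True,  # Required for gated models such as Llama 3
    job_name=job_name,
    model_name=model_name,
)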
print(f"Model Successfully Optimized: {optimized_model.model_name}")
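The optimization output location depends on your own account and region, so it is usually safer to build the S3 path from those values than to paste a literal bucket name. The sketch below assumes the `sagemaker-<region>-<account-id>` bucket naming pattern. Both helper functions and the placeholder account ID `123456789012` are hypothetical.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Build a bucket name following the sagemaker-<region>-<account-id> pattern."""
    return f"sagemaker-{region}-{account_id}"


def optimization_output_path(
    region: str,
    account_id: str,
    run_id: str,
    prefix: str = "optimize-output",
) -> str:
    """Return an S3 URI for one optimization run (hypothetical helper)."""
    bucket = default_bucket_name(region, account_id)
    return f"s3://{bucket}/{prefix}/jumpstart-{run_id}/"


# Placeholder values for illustration only.
print(optimization_output_path("us-east-2", "123456789012", "ab12cd34"))
```

In the notebook, the region, account ID, and run ID would come from `AWS_REGION`, `AWS_ACCOUNT_ID`, and `unique_id`. The bucket must exist and be writable by the execution role.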
Step 6: Deploy the Optimized Model
Deploy the optimized model to a SageMaker endpoint for real-time inference.
# Deploy the optimized model to an endpoint
print("Deploying optimized model to endpoint...")
core_endpoint = model_builder.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
)
print(f"Endpoint Successfully Created: {core_endpoint.endpoint_name}")
Step 7: Test the Optimized Endpoint
Send a test request to verify the optimized model is working correctly.
# Test optimized model invocation
test_data = {
    "inputs": "What are the benefits of machine learning?",
    "parameters": {"max_new_tokens": 50},
}
result = core_endpoint.invoke(
    body=json.dumps(test_data),
    content_type="application/json",
)
response_body = result.body.read().decode('utf-8')
prediction = json.loads(response_body)
print(f"Result of invoking optimized endpoint: {prediction}")
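The exact response shape depends on the serving container: some return a list of objects and others a single object. The following sketch pulls out the generated text in either case, assuming the `generated_text` key used in the sample output earlier. The helper `extract_generated_text` is hypothetical.

```python
import json
from typing import List


def extract_generated_text(response_body: str) -> List[str]:
    """Return every "generated_text" value found in a JSON response body.

    Hypothetical helper; accepts either a single object or a list of objects.
    """
    payload = json.loads(response_body)
    items = payload if isinstance(payload, list) else [payload]
    texts = []
    for item in items:
        if isinstance(item, dict) and "generated_text" in item:
            texts.append(item["generated_text"])
    return texts


# Example bodies shaped like the sample output used for the schema builder.
print(extract_generated_text('[{"generated_text": "Falcons are birds of prey."}]'))
print(extract_generated_text('{"generated_text": "Falcons are birds of prey."}'))
```

An empty list signals that the response did not contain the expected key, which is a useful cue to inspect the raw body.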
Step 8: Clean Up Resources
Clean up the created resources to avoid ongoing charges.
# Clean up resources
core_endpoint_config = EndpointConfig.get(endpoint_config_name=core_endpoint.endpoint_name)
# Delete the model, then the endpoint, then the endpoint config fetched above
optimized_model.delete()
core_endpoint.delete()
core_endpoint_config.delete()
print("Optimized model and endpoint successfully deleted!")
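If one delete call fails, the calls after it never run and billable resources can be left behind. A minimal sketch of a more defensive cleanup follows, assuming only that each resource exposes a `delete()` method, as the objects above do. The helper `delete_all` is hypothetical.

```python
from typing import Any, Iterable, List, Tuple


def delete_all(resources: Iterable[Tuple[str, Any]]) -> List[Tuple[str, Exception]]:
    """Call delete() on each (label, resource) pair and collect failures.

    Hypothetical helper: it keeps going after an error so that one failed
    deletion does not prevent the remaining resources from being removed.
    """
    failures = []
    for label, resource in resources:
        try:
            resource.delete()
        except Exception as exc:  # Broad on purpose: report every failure to the caller
            failures.append((label, exc))
    return failures


# Usage in this notebook (commented out because it needs live resources):
# failures = delete_all([
#     ("model", optimized_model),
#     ("endpoint", core_endpoint),
#     ("endpoint config", core_endpoint_config),
# ])
# for label, exc in failures:
#     print(f"Failed to delete {label}: {exc}")
```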
Summary
This notebook demonstrated:
Creating a ModelBuilder with a JumpStart model
Optimizing the model using AWQ quantization
Deploying the optimized model to a SageMaker endpoint
Making inference requests to the optimized endpoint
Cleaning up resources
The V3 ModelBuilder’s optimize() method makes it easy to improve model performance with quantization and other optimization techniques!