SageMaker V3 Model Optimization Example#

This notebook demonstrates how to use SageMaker V3 ModelBuilder to optimize a JumpStart model for improved inference performance.

Prerequisites#

Note: Ensure you have sagemaker and ipywidgets installed in your environment. The ipywidgets package is required to monitor endpoint deployment progress in Jupyter notebooks.

# Import required libraries
import json
import uuid
import time
import boto3

from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.core.resources import EndpointConfig
from sagemaker.core.helper.session_helper import Session

Step 1: Configure Model and Session#

We’ll optimize a Llama 3 model from JumpStart using AWQ quantization.

# Configuration
MODEL_ID = "meta-textgeneration-llama-3-8b-instruct"
MODEL_NAME_PREFIX = "jumpstart-optimize-example"
ENDPOINT_NAME_PREFIX = "jumpstart-optimize-example-endpoint"
AWS_REGION = boto3.Session().region_name
AWS_ACCOUNT_ID = boto3.client("sts", region_name=AWS_REGION).get_caller_identity()["Account"]

# Generate unique identifiers
unique_id = str(uuid.uuid4())[:8]
model_name = f"{MODEL_NAME_PREFIX}-{unique_id}"
endpoint_name = f"{ENDPOINT_NAME_PREFIX}-{unique_id}"
job_name = f"js-optimize-{int(time.time())}"

print(f"Model name: {model_name}")
print(f"Endpoint name: {endpoint_name}")
print(f"Optimization job name: {job_name}")
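SageMaker resource names must consist of alphanumerics and hyphens, must not begin or end with a hyphen, and are limited to 63 characters, so the prefix plus suffix must stay within that budget. A quick sanity check (the `validate_name` helper below is our own illustration, not part of the SageMaker SDK):

```python
import re

# SageMaker resource names: alphanumerics and hyphens, at most 63 chars,
# starting and ending with an alphanumeric character.
_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")

def validate_name(name: str) -> bool:
    """Return True if name satisfies SageMaker's naming constraints."""
    return len(name) <= 63 and bool(_NAME_PATTERN.match(name))

assert validate_name("jumpstart-optimize-example-1a2b3c4d")
assert not validate_name("-leading-hyphen-is-invalid")
```

Running this check up front fails fast, before any AWS API call rejects the name.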

Step 2: Create Schema Builder#

Define the input/output schema for the text generation model.

# Create schema builder for text generation
sample_input = {"inputs": "What are falcons?", "parameters": {"max_new_tokens": 32}}
sample_output = [{"generated_text": "Falcons are small to medium-sized birds of prey."}]

schema_builder = SchemaBuilder(sample_input, sample_output)
print("Schema builder created successfully!")

Step 3: Initialize SageMaker Session#

Create a SageMaker session with the specified AWS region.

# Create SageMaker session
boto_session = boto3.Session(region_name=AWS_REGION)
sagemaker_session = Session(boto_session=boto_session)
print(f"SageMaker session created for region: {AWS_REGION}")

Step 4: Create ModelBuilder#

Initialize the ModelBuilder with the JumpStart model ID and schema.

# Initialize ModelBuilder
model_builder = ModelBuilder(
    model=MODEL_ID,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
)
print("ModelBuilder created successfully!")

Step 5: Optimize the Model#

Optimize the model using AWQ quantization for improved inference performance. This step may take up to 30 minutes to complete!

# Optimize the model with AWQ quantization
# Note: the DLC image URI below is region-specific (us-east-2); adjust the
# region in the URI to match your AWS_REGION.
print("Optimizing JumpStart model...")
optimized_model = model_builder.optimize(
    instance_type="ml.g5.2xlarge",
    image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124",
    output_path=f"s3://{sagemaker_session.default_bucket()}/optimize-output/jumpstart-{unique_id}/",
    quantization_config={"OverrideEnvironment": {"OPTION_QUANTIZE": "awq"}},
    accept_eula=True,
    job_name=job_name,
    model_name=model_name,
)
print(f"Model Successfully Optimized: {optimized_model.model_name}")

Step 6: Deploy the Optimized Model#

Deploy the optimized model to a SageMaker endpoint for real-time inference.

# Deploy the optimized model to an endpoint
print("Deploying optimized model to endpoint...")
core_endpoint = model_builder.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1
)
print(f"Endpoint Successfully Created: {core_endpoint.endpoint_name}")

Step 7: Test the Optimized Endpoint#

Send a test request to verify the optimized model is working correctly.

# Test optimized model invocation
test_data = {
    "inputs": "What are the benefits of machine learning?",
    "parameters": {"max_new_tokens": 50}
}

result = core_endpoint.invoke(
    body=json.dumps(test_data),
    content_type="application/json"
)

response_body = result.body.read().decode('utf-8')
prediction = json.loads(response_body)
print(f"Result of invoking optimized endpoint: {prediction}")
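Depending on the serving container and model version, text-generation responses may arrive as a single JSON object or as a list of objects (the list shape matches the sample_output we gave the SchemaBuilder). A small hedged helper to normalize both shapes; the `extract_generated_text` name is our own, not an SDK function:

```python
import json

def extract_generated_text(response_body: str) -> str:
    """Return the generated_text field from a dict or list-of-dicts payload."""
    payload = json.loads(response_body)
    if isinstance(payload, list):
        payload = payload[0]
    return payload["generated_text"]

# Both payload shapes normalize to the same string:
assert extract_generated_text('{"generated_text": "hi"}') == "hi"
assert extract_generated_text('[{"generated_text": "hi"}]') == "hi"
```

This keeps downstream code indifferent to which response shape the endpoint happens to return.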

Step 8: Clean Up Resources#

Clean up the created resources to avoid ongoing charges.

# Clean up resources
# ModelBuilder.deploy() creates an endpoint config with the same name as the
# endpoint, so we can look it up by the endpoint name.
core_endpoint_config = EndpointConfig.get(endpoint_config_name=core_endpoint.endpoint_name)

# Delete in the correct order: model, then endpoint, then endpoint config
optimized_model.delete()
core_endpoint.delete()
core_endpoint_config.delete()

print("Optimized model and endpoint successfully deleted!")

Summary#

This notebook demonstrated:

  1. Creating a ModelBuilder with a JumpStart model

  2. Optimizing the model using AWQ quantization

  3. Deploying the optimized model to a SageMaker endpoint

  4. Making inference requests to the optimized endpoint

  5. Cleaning up resources

The V3 ModelBuilder’s optimize() method makes it easy to improve model performance with quantization and other optimization techniques!
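As a pointer toward those other techniques: alongside quantization_config, optimize() also accepts configs that follow the same OverrideEnvironment shape, such as a compilation config. The sketch below is illustrative only; confirm that the parameter and the LMI environment variable are supported for your model and SDK version before relying on them:

```python
# Sketch only: a compilation-style config in the same OverrideEnvironment
# shape used by quantization_config above. The specific OPTION_* variable
# and its value are assumptions to adapt to your model.
compilation_config = {
    "OverrideEnvironment": {
        "OPTION_TENSOR_PARALLEL_DEGREE": "1",
    }
}

# optimized_model = model_builder.optimize(
#     instance_type="ml.g5.2xlarge",
#     compilation_config=compilation_config,
#     accept_eula=True,
# )

print(compilation_config["OverrideEnvironment"])
```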