SageMaker V3 Model Optimization Example#

This notebook demonstrates how to use SageMaker V3 ModelBuilder to optimize a JumpStart model for improved inference performance.

Prerequisites#

Note: Ensure you have sagemaker and ipywidgets installed in your environment. The ipywidgets package is required to monitor endpoint deployment progress in Jupyter notebooks.

# Import required libraries
import json
import uuid
import time
import boto3

from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.core.resources import EndpointConfig
from sagemaker.core.helper.session_helper import Session

Step 1: Configure Model and Session#

We’ll optimize a Llama 3 model from JumpStart using AWQ quantization.

# Configuration
MODEL_ID = "meta-textgeneration-llama-3-8b-instruct"
MODEL_NAME_PREFIX = "jumpstart-optimize-example"
ENDPOINT_NAME_PREFIX = "jumpstart-optimize-example-endpoint"
AWS_REGION = boto3.Session().region_name
AWS_ACCOUNT_ID = boto3.client("sts", region_name=AWS_REGION).get_caller_identity()["Account"]

# Generate unique identifiers
unique_id = str(uuid.uuid4())[:8]
model_name = f"{MODEL_NAME_PREFIX}-{unique_id}"
endpoint_name = f"{ENDPOINT_NAME_PREFIX}-{unique_id}"
job_name = f"js-optimize-{int(time.time())}"

print(f"Model name: {model_name}")
print(f"Endpoint name: {endpoint_name}")
print(f"Optimization job name: {job_name}")
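SageMaker resource names must consist of alphanumerics and hyphens, must not begin or end with a hyphen, and are limited to 63 characters, so the prefix plus suffix must stay within that budget. A quick sanity check (the `validate_name` helper below is our own illustration, not part of the SageMaker SDK):

```python
import re

# SageMaker resource names: alphanumerics and hyphens, at most 63 chars,
# starting and ending with an alphanumeric character.
_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")

def validate_name(name: str) -> bool:
    """Return True if name satisfies SageMaker's naming constraints."""
    return len(name) <= 63 and bool(_NAME_PATTERN.match(name))

assert validate_name("jumpstart-optimize-example-1a2b3c4d")
assert not validate_name("-leading-hyphen-is-invalid")
```

Running this check up front fails fast, before any AWS API call rejects the name.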

Step 2: Create Schema Builder#

Define the input/output schema for the text generation model.

# Create schema builder for text generation
sample_input = {"inputs": "What are falcons?", "parameters": {"max_new_tokens": 32}}
sample_output = [{"generated_text": "Falcons are small to medium-sized birds of prey."}]

schema_builder = SchemaBuilder(sample_input, sample_output)
print("Schema builder created successfully!")

Step 3: Initialize SageMaker Session#

Create a SageMaker session with the specified AWS region.

# Create SageMaker session
boto_session = boto3.Session(region_name=AWS_REGION)
sagemaker_session = Session(boto_session=boto_session)
print(f"SageMaker session created for region: {AWS_REGION}")

Step 4: Create ModelBuilder#

Initialize the ModelBuilder with the JumpStart model ID and schema.

# Initialize ModelBuilder
model_builder = ModelBuilder(
    model=MODEL_ID,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
)
print("ModelBuilder created successfully!")

Step 5: Optimize the Model#

Optimize the model using AWQ quantization for improved inference performance. This step may take up to 30 minutes to complete!

# Optimize the model with AWQ quantization
# Note: the DLC image URI below is region-specific (us-east-2); adjust the
# region in the URI to match your AWS_REGION.
print("Optimizing JumpStart model...")
optimized_model = model_builder.optimize(
    instance_type="ml.g5.2xlarge",
    image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124",
    output_path=f"s3://{sagemaker_session.default_bucket()}/optimize-output/jumpstart-{unique_id}/",
    quantization_config={"OverrideEnvironment": {"OPTION_QUANTIZE": "awq"}},
    accept_eula=True,
    job_name=job_name,
    model_name=model_name,
)
print(f"Model Successfully Optimized: {optimized_model.model_name}")

Step 6: Deploy the Optimized Model#

Deploy the optimized model to a SageMaker endpoint for real-time inference.

# Deploy the optimized model to an endpoint
print("Deploying optimized model to endpoint...")
core_endpoint = model_builder.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1
)
print(f"Endpoint Successfully Created: {core_endpoint.endpoint_name}")

Step 7: Test the Optimized Endpoint#

Send a test request to verify the optimized model is working correctly.

# Test optimized model invocation
test_data = {
    "inputs": "What are the benefits of machine learning?",
    "parameters": {"max_new_tokens": 50}
}

result = core_endpoint.invoke(
    body=json.dumps(test_data),
    content_type="application/json"
)

response_body = result.body.read().decode('utf-8')
prediction = json.loads(response_body)
print(f"Result of invoking optimized endpoint: {prediction}")
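Depending on the serving container and model version, text-generation responses may arrive as a single JSON object or as a list of objects (the list shape matches the sample_output we gave the SchemaBuilder). A small hedged helper to normalize both shapes; the `extract_generated_text` name is our own, not an SDK function:

```python
import json

def extract_generated_text(response_body: str) -> str:
    """Return the generated_text field from a dict or list-of-dicts payload."""
    payload = json.loads(response_body)
    if isinstance(payload, list):
        payload = payload[0]
    return payload["generated_text"]

# Both payload shapes normalize to the same string:
assert extract_generated_text('{"generated_text": "hi"}') == "hi"
assert extract_generated_text('[{"generated_text": "hi"}]') == "hi"
```

This keeps downstream code indifferent to which response shape the endpoint happens to return.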

Step 8: Clean Up Resources#

Clean up the created resources to avoid ongoing charges.

# Clean up resources
# ModelBuilder.deploy() creates an endpoint config with the same name as the
# endpoint, so we can look it up by the endpoint name.
core_endpoint_config = EndpointConfig.get(endpoint_config_name=core_endpoint.endpoint_name)

# Delete in the correct order: model, then endpoint, then endpoint config
optimized_model.delete()
core_endpoint.delete()
core_endpoint_config.delete()

print("Optimized model and endpoint successfully deleted!")

Summary#

This notebook demonstrated:

  1. Creating a ModelBuilder with a JumpStart model

  2. Optimizing the model using AWQ quantization

  3. Deploying the optimized model to a SageMaker endpoint

  4. Making inference requests to the optimized endpoint

  5. Cleaning up resources

The V3 ModelBuilder’s optimize() method makes it easy to improve model performance with quantization and other optimization techniques!
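As a pointer toward those other techniques: alongside quantization_config, optimize() also accepts configs that follow the same OverrideEnvironment shape, such as a compilation config. The sketch below is illustrative only; confirm that the parameter and the LMI environment variable are supported for your model and SDK version before relying on them:

```python
# Sketch only: a compilation-style config in the same OverrideEnvironment
# shape used by quantization_config above. The specific OPTION_* variable
# and its value are assumptions to adapt to your model.
compilation_config = {
    "OverrideEnvironment": {
        "OPTION_TENSOR_PARALLEL_DEGREE": "1",
    }
}

# optimized_model = model_builder.optimize(
#     instance_type="ml.g5.2xlarge",
#     compilation_config=compilation_config,
#     accept_eula=True,
# )

print(compilation_config["OverrideEnvironment"])
```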