sagemaker.serve.inference_recommendation_mixin

Inference Recommender mixin for SageMaker model optimization.

This module provides the _InferenceRecommenderMixin class that enables SageMaker models to use Inference Recommender for right-sizing and optimization recommendations.

Key Features:

- Automatic instance type and configuration recommendations
- Load testing with custom traffic patterns
- Performance optimization based on latency and throughput requirements
- Support for both Default and Advanced recommendation jobs

Example

Basic usage with a ModelBuilder:

model_builder = ModelBuilder(model="my-model")
model = model_builder.build()

# Get right-sizing recommendations
model.right_size(
    sample_payload_url="s3://my-bucket/sample-payload.json",
    supported_content_types=["application/json"],
    supported_instance_types=["ml.m5.large", "ml.m5.xlarge"]
)

# Deploy with recommendations
predictor = model.deploy()

Classes

ModelLatencyThreshold(percentile, ...)

Latency threshold configuration for Advanced Inference Recommendations.

Phase(duration_in_seconds, ...)

Traffic pattern phase configuration for Advanced Inference Recommendations.

class sagemaker.serve.inference_recommendation_mixin.ModelLatencyThreshold(percentile: str, value_in_milliseconds: int)

Bases: object

Latency threshold configuration for Advanced Inference Recommendations.

Defines acceptable response latency limits for model inference. Used to filter recommendations based on performance requirements.

Parameters:
  • percentile – Latency percentile to measure (e.g., “P95”, “P99”)

  • value_in_milliseconds – Maximum acceptable latency in milliseconds

Example

Set P95 latency threshold:

from sagemaker.serve.inference_recommendation_mixin import ModelLatencyThreshold

threshold = ModelLatencyThreshold(
    percentile="P95",
    value_in_milliseconds=100  # 100ms max P95 latency
)
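
To make the filtering idea concrete, the following is a minimal sketch of how a percentile threshold could be checked against measured latencies. It does not use the SDK: `LatencyThreshold`, `percentile_latency`, and `meets_threshold` are hypothetical names introduced only for illustration, the latency samples are made up, and the nearest-rank percentile method is an assumption rather than a description of how Inference Recommender computes percentiles.

```python
import math
from dataclasses import dataclass


@dataclass
class LatencyThreshold:
    """Hypothetical stand-in with the same two fields as ModelLatencyThreshold."""
    percentile: str
    value_in_milliseconds: int


def percentile_latency(samples_ms, percentile):
    """Return the nearest-rank percentile of samples_ms for a label such as "P95"."""
    rank_percent = float(percentile.lstrip("Pp"))
    ordered = sorted(samples_ms)
    # Nearest-rank method (an assumption): pick the smallest sample that has
    # at least rank_percent percent of all samples at or below it.
    index = max(math.ceil(rank_percent / 100 * len(ordered)) - 1, 0)
    return ordered[index]


def meets_threshold(samples_ms, threshold):
    """True when the measured percentile latency is within the threshold."""
    measured = percentile_latency(samples_ms, threshold.percentile)
    return measured <= threshold.value_in_milliseconds


threshold = LatencyThreshold(percentile="P95", value_in_milliseconds=100)

# Made-up latency samples (milliseconds) for two candidate configurations.
fast_instance = [40, 45, 50, 55, 60, 62, 65, 70, 80, 95]
slow_instance = [60, 70, 80, 90, 95, 98, 105, 110, 120, 150]

print(meets_threshold(fast_instance, threshold))
print(meets_threshold(slow_instance, threshold))
```

In this sketch a configuration is kept only if its measured percentile latency stays at or below `value_in_milliseconds`, which is the sense in which a threshold filters recommendations.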

class sagemaker.serve.inference_recommendation_mixin.Phase(duration_in_seconds: int, initial_number_of_users: int, spawn_rate: int)

Bases: object

Traffic pattern phase configuration for Advanced Inference Recommendations.

Defines a phase of load testing with specific duration, user count, and spawn rate. Multiple phases can be combined to create complex traffic patterns.

Parameters:
  • duration_in_seconds – How long this phase should run

  • initial_number_of_users – Number of concurrent users at start of phase

  • spawn_rate – Rate at which new users are added (users per second)

Example

Create a ramp-up phase:

from sagemaker.serve.inference_recommendation_mixin import Phase

phase = Phase(
    duration_in_seconds=300,  # 5 minutes
    initial_number_of_users=1,
    spawn_rate=2  # Add 2 users per second
)
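
As a rough illustration of how several phases might combine into one traffic pattern, the sketch below lays three phases end to end and estimates the user count at the end of each. It runs without the SDK: `TrafficPhase`, `users_at_end`, and `summarize` are hypothetical names, and the linear ramp with no upper cap is an assumption that takes the parameter descriptions above at face value. The actual load generator may behave differently.

```python
from dataclasses import dataclass


@dataclass
class TrafficPhase:
    """Hypothetical stand-in with the same three fields as Phase."""
    duration_in_seconds: int
    initial_number_of_users: int
    spawn_rate: int


def users_at_end(phase):
    """Estimate users at the end of a phase, assuming a linear, uncapped ramp."""
    return phase.initial_number_of_users + phase.spawn_rate * phase.duration_in_seconds


def summarize(phases):
    """Return (start_second, end_second, users_at_end) for each phase in order."""
    timeline = []
    start = 0
    for phase in phases:
        end = start + phase.duration_in_seconds
        timeline.append((start, end, users_at_end(phase)))
        start = end
    return timeline


phases = [
    TrafficPhase(duration_in_seconds=60, initial_number_of_users=1, spawn_rate=0),     # warm-up
    TrafficPhase(duration_in_seconds=300, initial_number_of_users=1, spawn_rate=2),    # ramp-up
    TrafficPhase(duration_in_seconds=120, initial_number_of_users=600, spawn_rate=0),  # steady load
]

for start, end, users in summarize(phases):
    print(f"{start:>4}s to {end:>4}s: {users} users at end of phase")
```

Sketching the timeline this way before submitting a job can help confirm that the total test duration and peak user count are what you intend.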