sagemaker.serve.inference_recommendation_mixin
Inference Recommender mixin for SageMaker model optimization.
This module provides the _InferenceRecommenderMixin class that enables SageMaker models to use Inference Recommender for right-sizing and optimization recommendations.
Key Features:
- Automatic instance type and configuration recommendations
- Load testing with custom traffic patterns
- Performance optimization based on latency and throughput requirements
- Support for both Default and Advanced recommendation jobs
Example
Basic usage with a ModelBuilder:
model_builder = ModelBuilder(model="my-model")
model = model_builder.build()
# Get right-sizing recommendations
model.right_size(
    sample_payload_url="s3://my-bucket/sample-payload.json",
    supported_content_types=["application/json"],
    supported_instance_types=["ml.m5.large", "ml.m5.xlarge"],
)
# Deploy with recommendations
predictor = model.deploy()
Classes
| ModelLatencyThreshold | Latency threshold configuration for Advanced Inference Recommendations. |
| Phase | Traffic pattern phase configuration for Advanced Inference Recommendations. |
- class sagemaker.serve.inference_recommendation_mixin.ModelLatencyThreshold(percentile: str, value_in_milliseconds: int)
Bases: object
Latency threshold configuration for Advanced Inference Recommendations.
Defines acceptable response latency limits for model inference. Used to filter recommendations based on performance requirements.
- Parameters:
percentile – Latency percentile to measure (e.g., “P95”, “P99”)
value_in_milliseconds – Maximum acceptable latency in milliseconds
Example
Set P95 latency threshold:
threshold = ModelLatencyThreshold(
    percentile="P95",
    value_in_milliseconds=100,  # 100ms max P95 latency
)
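To illustrate how a threshold of this shape might be used to filter candidates, here is a minimal sketch. It does not use the SDK: `LatencyThresholdSketch` is a local stand-in that only mirrors the two constructor arguments documented above, `meets_threshold` is a hypothetical helper, and the latency figures are made-up illustrative values, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyThresholdSketch:
    """Local stand-in mirroring ModelLatencyThreshold's documented arguments."""
    percentile: str
    value_in_milliseconds: int


def meets_threshold(measured_ms, threshold):
    """Hypothetical helper: True if the latency observed at the threshold's
    percentile exists and does not exceed the configured limit."""
    observed = measured_ms.get(threshold.percentile)
    return observed is not None and observed <= threshold.value_in_milliseconds


threshold = LatencyThresholdSketch(percentile="P95", value_in_milliseconds=100)

# Illustrative (invented) latencies in milliseconds, keyed by instance type.
candidates = {
    "ml.m5.large": {"P95": 140, "P99": 210},
    "ml.m5.xlarge": {"P95": 85, "P99": 130},
}

# Keep only the instance types whose P95 latency is within the limit.
acceptable = [
    name for name, measured in candidates.items()
    if meets_threshold(measured, threshold)
]
print(acceptable)
```

How Inference Recommender itself applies the threshold is not described in this section; the sketch only shows the general idea of filtering on a percentile limit.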
- class sagemaker.serve.inference_recommendation_mixin.Phase(duration_in_seconds: int, initial_number_of_users: int, spawn_rate: int)
Bases: object
Traffic pattern phase configuration for Advanced Inference Recommendations.
Defines a phase of load testing with specific duration, user count, and spawn rate. Multiple phases can be combined to create complex traffic patterns.
- Parameters:
duration_in_seconds – How long this phase should run
initial_number_of_users – Number of concurrent users at start of phase
spawn_rate – Rate at which new users are added (users per second)
Example
Create a ramp-up phase:
phase = Phase(
    duration_in_seconds=300,  # 5 minutes
    initial_number_of_users=1,
    spawn_rate=2,  # Add 2 users per second
)
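Since multiple phases can be combined into one traffic pattern, the following sketch shows one way to describe a two-phase pattern as an ordered list and check its total duration. `PhaseSketch` is a local stand-in that mirrors only the documented constructor arguments, so the snippet runs without the SDK; the phase values are arbitrary examples, and how the list is passed to a recommendation job is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseSketch:
    """Local stand-in mirroring Phase's documented constructor arguments."""
    duration_in_seconds: int
    initial_number_of_users: int
    spawn_rate: int


# An ordered traffic pattern: a short warm-up followed by a longer ramp-up.
traffic_pattern = [
    PhaseSketch(duration_in_seconds=120, initial_number_of_users=1, spawn_rate=1),
    PhaseSketch(duration_in_seconds=300, initial_number_of_users=1, spawn_rate=2),
]

# Total load-test time is the sum of the phase durations.
total_seconds = sum(p.duration_in_seconds for p in traffic_pattern)
print(f"{len(traffic_pattern)} phases, {total_seconds} seconds in total")
```

Summing the durations up front is a cheap sanity check on how long a load test will run before submitting it.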