sagemaker.train.evaluate.execution#
SageMaker Evaluation Execution Module.
This module provides classes for managing evaluation executions.
Classes
- BenchmarkEvaluationExecution – Benchmark evaluation execution subclass with type-specific show_results().
- EvaluationPipelineExecution – Manages SageMaker pipeline-based evaluation execution lifecycle.
- LLMAJEvaluationExecution – LLM As Judge evaluation execution subclass with type-specific show_results().
- PipelineExecutionStatus – Combined pipeline execution status with step details and failure reason.
- StepDetail – Pipeline step details for tracking execution progress.
- class sagemaker.train.evaluate.execution.BenchmarkEvaluationExecution(*, arn: str | None = None, name: str, status: ~sagemaker.train.evaluate.execution.PipelineExecutionStatus = <factory>, last_modified_time: ~datetime.datetime | None = None, eval_type: ~sagemaker.train.evaluate.constants.EvalType | None = None, s3_output_path: str | None = None, steps: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>)[source]#
Bases: EvaluationPipelineExecution
Benchmark evaluation execution subclass with type-specific show_results().
Provides benchmark-specific result display functionality for comparing custom model performance against a base model.
- arn: str | None#
- last_modified_time: datetime | None#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- name: str#
- s3_output_path: str | None#
- show_results() None[source]#
Display benchmark evaluation results comparing custom vs base model.
Shows aggregate metrics with detailed S3 artifact locations.
- Raises:
ValueError – If execution hasn’t succeeded.
Example
execution = evaluator.evaluate()
execution.wait()
execution.show_results()
- status: PipelineExecutionStatus#
- steps: List[Dict[str, Any]]#
- class sagemaker.train.evaluate.execution.EvaluationPipelineExecution(*, arn: str | None = None, name: str, status: ~sagemaker.train.evaluate.execution.PipelineExecutionStatus = <factory>, last_modified_time: ~datetime.datetime | None = None, eval_type: ~sagemaker.train.evaluate.constants.EvalType | None = None, s3_output_path: str | None = None, steps: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>)[source]#
Bases: BaseModel
Manages SageMaker pipeline-based evaluation execution lifecycle.
This class wraps SageMaker Pipeline execution to provide a simplified interface for running, monitoring, and managing evaluation jobs. Users typically don’t instantiate this class directly, but receive instances from evaluator classes.
Example
from sagemaker.train.evaluate import BenchmarkEvaluator
from sagemaker.train.evaluate.execution import EvaluationPipelineExecution

# Start evaluation through evaluator
evaluator = BenchmarkEvaluator(...)
execution = evaluator.evaluate()

# Monitor execution
print(f"Status: {execution.status.overall_status}")
print(f"Steps: {len(execution.status.step_details)}")

# Wait for completion
execution.wait()

# Display results
execution.show_results()

# Retrieve past executions
all_executions = list(EvaluationPipelineExecution.get_all())
specific_execution = EvaluationPipelineExecution.get(arn="arn:...")
- Parameters:
arn (Optional[str]) – ARN of the pipeline execution.
name (str) – Name of the evaluation execution.
status (PipelineExecutionStatus) – Combined status with step details and failure reason.
last_modified_time (Optional[datetime]) – Last modification timestamp.
eval_type (Optional[EvalType]) – Type of evaluation (BENCHMARK, CUSTOM_SCORER, LLM_AS_JUDGE).
s3_output_path (Optional[str]) – S3 location where evaluation results are stored.
steps (List[Dict[str, Any]]) – Raw step information from SageMaker.
- arn: str | None#
- classmethod get(arn: str, session: Session | None = None, region: str | None = None) EvaluationPipelineExecution[source]#
Get a SageMaker pipeline execution instance by ARN.
- Parameters:
arn (str) – ARN of the pipeline execution.
session (Optional[Session]) – Boto3 session. Will be inferred if not provided.
region (Optional[str]) – AWS region. Will be inferred if not provided.
- Returns:
Retrieved pipeline execution instance.
- Return type:
EvaluationPipelineExecution
- Raises:
ClientError – If AWS service call fails.
Example
# Get execution by ARN
arn = "arn:aws:sagemaker:us-west-2:123456789012:pipeline/eval-pipeline/execution/abc123"
execution = EvaluationPipelineExecution.get(arn=arn)
print(execution.status.overall_status)
- classmethod get_all(eval_type: EvalType | None = None, session: Session | None = None, region: str | None = None)[source]#
Get all pipeline executions, optionally filtered by evaluation type.
Searches for existing pipelines using prefix and tag validation, then retrieves executions from those pipelines.
- Parameters:
eval_type (Optional[EvalType]) – If provided, only executions of this evaluation type are returned.
session (Optional[Session]) – Boto3 session. Will be inferred if not provided.
region (Optional[str]) – AWS region. Will be inferred if not provided.
- Yields:
EvaluationPipelineExecution – Pipeline execution instances.
Example
# Get all evaluation executions as iterator
executions_iter = EvaluationPipelineExecution.get_all()
all_executions = list(executions_iter)

# Get only benchmark evaluations
executions_iter = EvaluationPipelineExecution.get_all(eval_type=EvalType.BENCHMARK)
benchmark_executions = list(executions_iter)
- last_modified_time: datetime | None#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- name: str#
- s3_output_path: str | None#
- classmethod start(eval_type: EvalType, name: str, pipeline_definition: str, role_arn: str, s3_output_path: str | None = None, session: Session | None = None, region: str | None = None, tags: List[Dict[str, str | PipelineVariable]] | None = []) EvaluationPipelineExecution[source]#
Create a SageMaker pipeline execution, optionally creating the pipeline itself first.
- Parameters:
eval_type (EvalType) – Type of evaluation (BENCHMARK, CUSTOM_SCORER, LLM_AS_JUDGE).
name (str) – Name for the evaluation execution.
pipeline_definition (str) – Complete rendered pipeline definition as JSON string.
role_arn (str) – IAM role ARN for pipeline execution.
s3_output_path (Optional[str]) – S3 location where evaluation results are stored.
session (Optional[Session]) – Boto3 session for API calls.
region (Optional[str]) – AWS region for the pipeline.
tags (Optional[List[TagsDict]]) – List of tags to attach to the pipeline.
- Returns:
Started pipeline execution instance.
- Return type:
EvaluationPipelineExecution
- Raises:
ValueError – If pipeline_definition is not valid JSON.
ClientError – If AWS service call fails.
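The ValueError above comes from JSON validation of pipeline_definition. A minimal sketch of that check (hypothetical helper name; the SDK performs an equivalent parse internally):

```python
import json


def validate_pipeline_definition(pipeline_definition: str) -> dict:
    """Parse the rendered pipeline definition, raising ValueError on invalid JSON."""
    try:
        return json.loads(pipeline_definition)
    except json.JSONDecodeError as exc:
        raise ValueError(f"pipeline_definition is not valid JSON: {exc}") from exc
```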
- status: PipelineExecutionStatus#
- steps: List[Dict[str, Any]]#
- wait(target_status: Literal['Executing', 'Stopping', 'Stopped', 'Failed', 'Succeeded'] = 'Succeeded', poll: int = 5, timeout: int | None = None) None[source]#
Wait for the pipeline execution to reach a target status.
This method provides a hybrid implementation that works in both Jupyter notebooks and terminal environments, with appropriate visual feedback for each.
- Parameters:
target_status – The status to wait for. Defaults to 'Succeeded'.
poll – Number of seconds to wait between polls. Defaults to 5.
timeout – Maximum number of seconds to wait before timing out. Defaults to None (wait indefinitely).
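The polling behavior described above can be sketched in plain Python (hypothetical helper names; the real implementation lives in the SDK and adds notebook/terminal progress display):

```python
import time

# Statuses from which the execution cannot progress further
TERMINAL_STATUSES = {"Stopped", "Failed", "Succeeded"}


def wait_for_status(get_status, target_status="Succeeded", poll=5, timeout=None):
    """Poll get_status() until it returns target_status, hits a terminal
    status, or the timeout elapses."""
    start = time.monotonic()
    while True:
        status = get_status()
        if status == target_status:
            return status
        if status in TERMINAL_STATUSES:
            raise RuntimeError(
                f"Execution reached terminal status {status!r} "
                f"instead of {target_status!r}"
            )
        if timeout is not None and time.monotonic() - start > timeout:
            raise TimeoutError(f"Timed out waiting for {target_status!r}")
        time.sleep(poll)
```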
- class sagemaker.train.evaluate.execution.LLMAJEvaluationExecution(*, arn: str | None = None, name: str, status: ~sagemaker.train.evaluate.execution.PipelineExecutionStatus = <factory>, last_modified_time: ~datetime.datetime | None = None, eval_type: ~sagemaker.train.evaluate.constants.EvalType | None = None, s3_output_path: str | None = None, steps: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>)[source]#
Bases: EvaluationPipelineExecution
LLM As Judge evaluation execution subclass with type-specific show_results().
Provides LLM-as-Judge-specific result display functionality with pagination and detailed judge explanations.
- arn: str | None#
- last_modified_time: datetime | None#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- name: str#
- s3_output_path: str | None#
- show_results(limit: int = 5, offset: int = 0, show_explanations: bool = False) None[source]#
Display LLM As Judge evaluation results with pagination.
Shows per-evaluation results with prompt, response, and scores.
- Parameters:
limit (int) – Number of evaluation prompts to display. Set to None for all. Defaults to 5.
offset (int) – Starting index for pagination. Defaults to 0.
show_explanations (bool) – Whether to show judge explanations. Defaults to False.
- Raises:
ValueError – If execution hasn’t succeeded.
Example
execution = evaluator.evaluate()
execution.wait()

# Show first 5 evaluations
execution.show_results()

# Show next 5
execution.show_results(limit=5, offset=5)

# Show all with explanations
execution.show_results(limit=None, show_explanations=True)
- status: PipelineExecutionStatus#
- steps: List[Dict[str, Any]]#
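The limit/offset semantics documented for show_results() amount to a list slice over the per-prompt results. A sketch of that pagination rule (hypothetical helper; not part of the SDK):

```python
from typing import List, Optional


def paginate(results: List[dict], limit: Optional[int] = 5, offset: int = 0) -> List[dict]:
    """Return the page of results starting at offset; limit=None means all remaining."""
    return results[offset:] if limit is None else results[offset:offset + limit]
```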
- class sagemaker.train.evaluate.execution.PipelineExecutionStatus(*, overall_status: str, step_details: ~typing.List[~sagemaker.train.evaluate.execution.StepDetail] = <factory>, failure_reason: str | None = None)[source]#
Bases: BaseModel
Combined pipeline execution status with step details and failure reason.
Aggregates the overall execution status along with detailed information about individual pipeline steps and any failure reasons.
- Parameters:
overall_status (str) – Overall execution status (Starting, Executing, Completed, Failed, etc.).
step_details (List[StepDetail]) – List of individual pipeline step details.
failure_reason (Optional[str]) – Detailed reason if the execution failed.
- failure_reason: str | None#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- overall_status: str#
- step_details: List[StepDetail]#
- class sagemaker.train.evaluate.execution.StepDetail(*, name: str, status: str, start_time: str | None = None, end_time: str | None = None, display_name: str | None = None, failure_reason: str | None = None, job_arn: str | None = None)[source]#
Bases: BaseModel
Pipeline step details for tracking execution progress.
Represents the status and timing information for a single step in a SageMaker pipeline execution.
- Parameters:
name (str) – Name of the pipeline step.
status (str) – Status of the step (Completed, Executing, Waiting, Failed).
start_time (Optional[str]) – ISO format timestamp when step started.
end_time (Optional[str]) – ISO format timestamp when step ended.
display_name (Optional[str]) – Human-readable display name for the step.
failure_reason (Optional[str]) – Detailed reason if the step failed.
job_arn (Optional[str]) – ARN of the underlying job backing the step, if any.
- display_name: str | None#
- end_time: str | None#
- failure_reason: str | None#
- job_arn: str | None#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- name: str#
- start_time: str | None#
- status: str#
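The two models above compose naturally: one overall status plus per-step details. A dataclass sketch mirroring the documented fields (stand-ins for illustration, not the SDK's pydantic models; the failed_steps helper is hypothetical) shows how a failed step might be surfaced:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepDetailSketch:
    """Mirrors the documented StepDetail fields."""
    name: str
    status: str
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    display_name: Optional[str] = None
    failure_reason: Optional[str] = None
    job_arn: Optional[str] = None


@dataclass
class PipelineExecutionStatusSketch:
    """Mirrors the documented PipelineExecutionStatus fields."""
    overall_status: str
    step_details: List[StepDetailSketch] = field(default_factory=list)
    failure_reason: Optional[str] = None

    def failed_steps(self) -> List[StepDetailSketch]:
        # Collect steps whose status is Failed, for quick diagnosis
        return [s for s in self.step_details if s.status == "Failed"]
```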