sagemaker.train.evaluate.execution#

SageMaker Evaluation Execution Module.

This module provides classes for managing evaluation executions.

Classes

BenchmarkEvaluationExecution(*, arn, name, ...)

Benchmark evaluation execution subclass with type-specific show_results().

EvaluationPipelineExecution(*, arn, name, ...)

Manages SageMaker pipeline-based evaluation execution lifecycle.

LLMAJEvaluationExecution(*, arn, name, ...)

LLM As Judge evaluation execution subclass with type-specific show_results().

PipelineExecutionStatus(*, overall_status, ...)

Combined pipeline execution status with step details and failure reason.

StepDetail(*, name, status[, start_time, ...])

Pipeline step details for tracking execution progress.

class sagemaker.train.evaluate.execution.BenchmarkEvaluationExecution(*, arn: str | None = None, name: str, status: ~sagemaker.train.evaluate.execution.PipelineExecutionStatus = <factory>, last_modified_time: ~datetime.datetime | None = None, eval_type: ~sagemaker.train.evaluate.constants.EvalType | None = None, s3_output_path: str | None = None, steps: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>)[source]#

Bases: EvaluationPipelineExecution

Benchmark evaluation execution subclass with type-specific show_results().

Provides benchmark-specific result display functionality for comparing custom model performance against a base model.

arn: str | None#
eval_type: EvalType | None#
last_modified_time: datetime | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str#
s3_output_path: str | None#
show_results() None[source]#

Display benchmark evaluation results comparing custom vs base model.

Shows aggregate metrics with detailed S3 artifact locations.

Raises:

ValueError – If execution hasn’t succeeded.

Example

execution = evaluator.evaluate()
execution.wait()
execution.show_results()
status: PipelineExecutionStatus#
steps: List[Dict[str, Any]]#
class sagemaker.train.evaluate.execution.EvaluationPipelineExecution(*, arn: str | None = None, name: str, status: ~sagemaker.train.evaluate.execution.PipelineExecutionStatus = <factory>, last_modified_time: ~datetime.datetime | None = None, eval_type: ~sagemaker.train.evaluate.constants.EvalType | None = None, s3_output_path: str | None = None, steps: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>)[source]#

Bases: BaseModel

Manages SageMaker pipeline-based evaluation execution lifecycle.

This class wraps SageMaker Pipeline execution to provide a simplified interface for running, monitoring, and managing evaluation jobs. Users typically don’t instantiate this class directly, but receive instances from evaluator classes.

Example

from sagemaker.train.evaluate import BenchmarkEvaluator
from sagemaker.train.evaluate.execution import EvaluationPipelineExecution

# Start evaluation through evaluator
evaluator = BenchmarkEvaluator(...)
execution = evaluator.evaluate()

# Monitor execution
print(f"Status: {execution.status.overall_status}")
print(f"Steps: {len(execution.status.step_details)}")

# Wait for completion
execution.wait()

# Display results
execution.show_results()

# Retrieve past executions
all_executions = list(EvaluationPipelineExecution.get_all())
specific_execution = EvaluationPipelineExecution.get(arn="arn:...")
Parameters:
  • arn (Optional[str]) – ARN of the pipeline execution.

  • name (str) – Name of the evaluation execution.

  • status (PipelineExecutionStatus) – Combined status with step details and failure reason.

  • last_modified_time (Optional[datetime]) – Last modification timestamp.

  • eval_type (Optional[EvalType]) – Type of evaluation (BENCHMARK, CUSTOM_SCORER, LLM_AS_JUDGE).

  • s3_output_path (Optional[str]) – S3 location where evaluation results are stored.

  • steps (List[Dict[str, Any]]) – Raw step information from SageMaker.

class Config[source]#

Bases: object

arbitrary_types_allowed = True#
arn: str | None#
eval_type: EvalType | None#
classmethod get(arn: str, session: Session | None = None, region: str | None = None) EvaluationPipelineExecution[source]#

Get a SageMaker pipeline execution instance by ARN.

Parameters:
  • arn (str) – ARN of the pipeline execution.

  • session (Optional[Session]) – Boto3 session. Will be inferred if not provided.

  • region (Optional[str]) – AWS region. Will be inferred if not provided.

Returns:

Retrieved pipeline execution instance.

Return type:

EvaluationPipelineExecution

Raises:

ClientError – If AWS service call fails.

Example

# Get execution by ARN
arn = "arn:aws:sagemaker:us-west-2:123456789012:pipeline/eval-pipeline/execution/abc123"
execution = EvaluationPipelineExecution.get(arn=arn)
print(execution.status.overall_status)
classmethod get_all(eval_type: EvalType | None = None, session: Session | None = None, region: str | None = None)[source]#

Get all pipeline executions, optionally filtered by evaluation type.

Searches for existing pipelines using prefix and tag validation, then retrieves executions from those pipelines.

Parameters:
  • eval_type (Optional[EvalType]) – Evaluation type to filter by (e.g., EvalType.BENCHMARK). If None, returns executions from all evaluation pipelines.

  • session (Optional[Session]) – Boto3 session. Will be inferred if not provided.

  • region (Optional[str]) – AWS region. Will be inferred if not provided.

Yields:

EvaluationPipelineExecution – Pipeline execution instances.

Example

# Get all evaluation executions as an iterator
executions_iter = EvaluationPipelineExecution.get_all()
all_executions = list(executions_iter)

# Get only benchmark evaluations
benchmark_iter = EvaluationPipelineExecution.get_all(eval_type=EvalType.BENCHMARK)
benchmark_executions = list(benchmark_iter)
last_modified_time: datetime | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str#
refresh() None[source]#

Describe the pipeline execution and update the job status.

s3_output_path: str | None#
classmethod start(eval_type: EvalType, name: str, pipeline_definition: str, role_arn: str, s3_output_path: str | None = None, session: Session | None = None, region: str | None = None, tags: List[Dict[str, str | PipelineVariable]] | None = []) EvaluationPipelineExecution[source]#

Create a SageMaker pipeline execution, optionally creating the pipeline itself first.

Parameters:
  • eval_type (EvalType) – Type of evaluation (BENCHMARK, CUSTOM_SCORER, LLM_AS_JUDGE).

  • name (str) – Name for the evaluation execution.

  • pipeline_definition (str) – Complete rendered pipeline definition as JSON string.

  • role_arn (str) – IAM role ARN for pipeline execution.

  • s3_output_path (Optional[str]) – S3 location where evaluation results are stored.

  • session (Optional[Session]) – Boto3 session for API calls.

  • region (Optional[str]) – AWS region for the pipeline.

  • tags (Optional[List[TagsDict]]) – List of tags to include in the pipeline.

Returns:

Started pipeline execution instance.

Return type:

EvaluationPipelineExecution

Raises:
  • ValueError – If pipeline_definition is not valid JSON.

  • ClientError – If AWS service call fails.

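start() raises ValueError when pipeline_definition is not valid JSON, so that condition can be caught locally before any AWS call is made. The helper below is an illustrative sketch of that check (validate_pipeline_definition is not part of the SDK):

```python
import json

def validate_pipeline_definition(pipeline_definition: str) -> dict:
    """Parse a rendered pipeline definition string.

    Local sketch mirroring the JSON validity check start() performs;
    this helper is illustrative and not part of the SDK.
    """
    try:
        return json.loads(pipeline_definition)
    except json.JSONDecodeError as exc:
        raise ValueError(f"pipeline_definition is not valid JSON: {exc}") from exc
```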
status: PipelineExecutionStatus#
steps: List[Dict[str, Any]]#
stop() None[source]#

Stop the pipeline execution.

wait(target_status: Literal['Executing', 'Stopping', 'Stopped', 'Failed', 'Succeeded'] = 'Succeeded', poll: int = 5, timeout: int | None = None) None[source]#

Wait for the pipeline execution to reach a certain status.

This method provides a hybrid implementation that works in both Jupyter notebooks and terminal environments, with appropriate visual feedback for each.

Parameters:
  • target_status – The status to wait for.

  • poll – The number of seconds to wait between each poll.

  • timeout – The maximum number of seconds to wait before timing out.

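The polling behavior wait() describes can be sketched as a plain loop. The helper below is a hypothetical illustration of the pattern, where get_status stands in for refreshing the execution and reading status.overall_status; the SDK's actual implementation additionally renders progress feedback for notebooks and terminals:

```python
import time

def wait_for_status(get_status, target_status="Succeeded", poll=5, timeout=None):
    """Poll get_status() until it returns target_status.

    Illustrative sketch of the polling pattern behind wait();
    get_status stands in for refreshing the execution and reading
    status.overall_status.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target_status:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"timed out waiting for status {target_status!r}")
        time.sleep(poll)
```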
class sagemaker.train.evaluate.execution.LLMAJEvaluationExecution(*, arn: str | None = None, name: str, status: ~sagemaker.train.evaluate.execution.PipelineExecutionStatus = <factory>, last_modified_time: ~datetime.datetime | None = None, eval_type: ~sagemaker.train.evaluate.constants.EvalType | None = None, s3_output_path: str | None = None, steps: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>)[source]#

Bases: EvaluationPipelineExecution

LLM As Judge evaluation execution subclass with type-specific show_results().

Provides LLM-as-Judge-specific result display functionality with pagination and detailed judge explanations.

arn: str | None#
eval_type: EvalType | None#
last_modified_time: datetime | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str#
s3_output_path: str | None#
show_results(limit: int = 5, offset: int = 0, show_explanations: bool = False) None[source]#

Display LLM As Judge evaluation results with pagination.

Shows per-evaluation results with prompt, response, and scores.

Parameters:
  • limit (int) – Number of evaluation prompts to display. Set to None for all. Defaults to 5.

  • offset (int) – Starting index for pagination. Defaults to 0.

  • show_explanations (bool) – Whether to show judge explanations. Defaults to False.

Raises:

ValueError – If execution hasn’t succeeded.

Example

execution = evaluator.evaluate()
execution.wait()

# Show first 5 evaluations
execution.show_results()

# Show next 5
execution.show_results(limit=5, offset=5)

# Show all with explanations
execution.show_results(limit=None, show_explanations=True)
status: PipelineExecutionStatus#
steps: List[Dict[str, Any]]#
class sagemaker.train.evaluate.execution.PipelineExecutionStatus(*, overall_status: str, step_details: ~typing.List[~sagemaker.train.evaluate.execution.StepDetail] = <factory>, failure_reason: str | None = None)[source]#

Bases: BaseModel

Combined pipeline execution status with step details and failure reason.

Aggregates the overall execution status along with detailed information about individual pipeline steps and any failure reasons.

Parameters:
  • overall_status (str) – Overall execution status (e.g., Executing, Succeeded, Failed, Stopped).

  • step_details (List[StepDetail]) – List of individual pipeline step details.

  • failure_reason (Optional[str]) – Detailed reason if the execution failed.

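A common use of this object is to summarize step progress and surface the failure reason after a refresh. The sketch below uses SimpleNamespace stand-ins for PipelineExecutionStatus and StepDetail purely for illustration; real instances come from execution.status, and summarize is not part of the SDK:

```python
from types import SimpleNamespace

def summarize(status):
    """One line per step, plus the overall failure reason if present.

    `status` is expected to expose the fields documented above; this
    helper is an illustrative sketch, not part of the SDK.
    """
    lines = [f"{step.name}: {step.status}" for step in status.step_details]
    if status.failure_reason:
        lines.append(f"Failure: {status.failure_reason}")
    return lines

# SimpleNamespace stand-ins for PipelineExecutionStatus / StepDetail;
# real instances come from execution.status.
status = SimpleNamespace(
    overall_status="Failed",
    failure_reason="Evaluation step exited with a non-zero code",
    step_details=[
        SimpleNamespace(name="preprocess", status="Completed"),
        SimpleNamespace(name="evaluate", status="Failed"),
    ],
)
for line in summarize(status):
    print(line)
```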
failure_reason: str | None#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

overall_status: str#
step_details: List[StepDetail]#
class sagemaker.train.evaluate.execution.StepDetail(*, name: str, status: str, start_time: str | None = None, end_time: str | None = None, display_name: str | None = None, failure_reason: str | None = None, job_arn: str | None = None)[source]#

Bases: BaseModel

Pipeline step details for tracking execution progress.

Represents the status and timing information for a single step in a SageMaker pipeline execution.

Parameters:
  • name (str) – Name of the pipeline step.

  • status (str) – Status of the step (Completed, Executing, Waiting, Failed).

  • start_time (Optional[str]) – ISO format timestamp when step started.

  • end_time (Optional[str]) – ISO format timestamp when step ended.

  • display_name (Optional[str]) – Human-readable display name for the step.

  • failure_reason (Optional[str]) – Detailed reason if the step failed.

  • job_arn (Optional[str]) – ARN of the job backing the step, if any.

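Because start_time and end_time are ISO-format strings, a completed step's duration can be computed with the standard library. The helper below is an illustrative sketch (step_duration_seconds is not part of the SDK):

```python
from datetime import datetime

def step_duration_seconds(start_time, end_time):
    """Duration of a step from its ISO-format timestamps.

    Illustrative helper, not part of the SDK; returns None when either
    timestamp is missing, e.g. for a step that is still executing.
    """
    if start_time is None or end_time is None:
        return None
    delta = datetime.fromisoformat(end_time) - datetime.fromisoformat(start_time)
    return delta.total_seconds()
```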
display_name: str | None#
end_time: str | None#
failure_reason: str | None#
job_arn: str | None#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str#
start_time: str | None#
status: str#