sagemaker.train.evaluate#

SageMaker Model Evaluation Module.

This module provides comprehensive evaluation capabilities for SageMaker models:

Classes:
  • BaseEvaluator: Abstract base class for all evaluators

  • BenchMarkEvaluator: Standard benchmark evaluations

  • CustomScorerEvaluator: Custom scorer and preset metrics evaluations

  • LLMAsJudgeEvaluator: LLM-as-judge evaluations

  • EvaluationPipelineExecution: Pipeline-based evaluation execution implementation

  • PipelineExecutionStatus: Combined status with step details and failure reason

  • StepDetail: Individual pipeline step information

class sagemaker.train.evaluate.BaseEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None)[source]#

Bases: BaseModel

Base class for SageMaker model evaluators.

Provides common functionality for all evaluators including model resolution, MLflow integration, and AWS resource configuration. Subclasses must implement the evaluate() method.

region#

AWS region for evaluation jobs. If not provided, will use SAGEMAKER_REGION env var or default region.

Type:

Optional[str]

role#

IAM execution role ARN for SageMaker pipeline and training jobs. If not provided, will be derived from the session’s caller identity. Use this when running outside SageMaker-managed environments (e.g., local notebooks, CI/CD) where the caller identity is not a SageMaker-assumable role.

Type:

Optional[str]

sagemaker_session#

SageMaker session object. If not provided, a default session will be created automatically.

Type:

Optional[Any]

model#

Model for evaluation. Can be:

  • JumpStart model ID (str): e.g., ‘llama3-2-1b-instruct’

  • ModelPackage object: A fine-tuned model package

  • ModelPackage ARN (str): e.g., ‘arn:aws:sagemaker:region:account:model-package/name/version’

  • BaseTrainer object: A completed training job (i.e., it must have _latest_training_job with output_model_package_arn populated)

Type:

Union[str, Any]

base_eval_name#

Optional base name for evaluation jobs. This name is used as the PipelineExecutionDisplayName when creating the SageMaker pipeline execution. The actual display name will be “{base_eval_name}-{timestamp}”. This parameter can be used to cross-reference the pipeline execution ARN with a human-readable display name in the SageMaker console. If not provided, a unique name will be generated automatically in the format “eval-{model_name}-{uuid}”.

Type:

Optional[str]
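The two naming formats described above can be sketched as follows. The base name, model name, and uuid slicing here are illustrative; the SDK's exact timestamp and uuid formatting may differ.

```python
import time
import uuid

# Display name when base_eval_name is provided: "{base_eval_name}-{timestamp}".
base_eval_name = "nightly-mmlu"
display_name = f"{base_eval_name}-{int(time.time())}"

# Auto-generated name when base_eval_name is omitted: "eval-{model_name}-{uuid}".
model_name = "llama3-2-1b-instruct"
auto_name = f"eval-{model_name}-{uuid.uuid4().hex[:8]}"

assert display_name.startswith("nightly-mmlu-")
assert auto_name.startswith("eval-llama3-2-1b-instruct-")
```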

s3_output_path#

S3 location for evaluation outputs. Required.

Type:

str

mlflow_resource_arn#

MLflow resource ARN for experiment tracking. Optional. If not provided, the system will attempt to resolve it using the default MLflow app experience (checks domain match, account default, or creates a new app). Supported formats:

  • MLflow tracking server: arn:aws:sagemaker:region:account:mlflow-tracking-server/name

  • MLflow app: arn:aws:sagemaker:region:account:mlflow-app/app-id

Type:

Optional[str]

mlflow_experiment_name#

Optional MLflow experiment name for tracking evaluation runs.

Type:

Optional[str]

mlflow_run_name#

Optional MLflow run name for tracking individual evaluation executions.

Type:

Optional[str]

networking#

VPC configuration for evaluation jobs. Accepts a sagemaker_core.shapes.VpcConfig object with security_group_ids and subnets attributes. When provided, evaluation jobs will run within the specified VPC for enhanced security and access to private resources.

Type:

Optional[VpcConfig]

kms_key_id#

AWS KMS key ID for encrypting output data. When provided, evaluation job outputs will be encrypted using this KMS key for enhanced data security.

Type:

Optional[str]

model_package_group#

Model package group. Accepts:

  1. ARN string (e.g., ‘arn:aws:sagemaker:region:account:model-package-group/name’)

  2. ModelPackageGroup object (the ARN is extracted from the model_package_group_arn attribute)

  3. Model package group name string (the object is fetched and its ARN extracted)

Required when model is a JumpStart model ID. Optional when model is a ModelPackage ARN/object (it will be inferred automatically).

Type:

Optional[Union[str, ModelPackageGroup]]
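For illustration, a hypothetical helper (not part of the SDK; the real resolution happens inside BaseEvaluator) showing how the name form relates to the ARN form documented above:

```python
# Hypothetical helper: extract the group name from an ARN of the documented
# form arn:aws:sagemaker:region:account:model-package-group/name
def group_name_from_arn(arn: str) -> str:
    resource = arn.split(":", 5)[5]      # "model-package-group/name"
    return resource.split("/", 1)[1]

name = group_name_from_arn(
    "arn:aws:sagemaker:us-west-2:123456789012:model-package-group/my-group"
)
assert name == "my-group"
```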

class Config[source]#

Bases: object

arbitrary_types_allowed = True#
base_eval_name: str | None#
evaluate() Any[source]#

Create and start an evaluation execution.

This method must be implemented by subclasses to define the specific evaluation logic for different evaluation types (benchmark, custom scorer, LLM-as-judge, etc.).

Returns:

The created evaluation execution object.

Return type:

EvaluationPipelineExecution

Raises:

NotImplementedError – This is an abstract method that must be implemented by subclasses.

Example

>>> # In a subclass implementation
>>> class CustomEvaluator(BaseEvaluator):
...     def evaluate(self):
...         # Create pipeline definition
...         pipeline_definition = self._build_pipeline()
...         # Start execution
...         return EvaluationPipelineExecution.start(...)
kms_key_id: str | None#
mlflow_experiment_name: str | None#
mlflow_resource_arn: str | None#
mlflow_run_name: str | None#
model: str | BaseTrainer | ModelPackage#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_package_group: str | ModelPackageGroup | None#
networking: VpcConfig | None#
region: str | None#
role: str | None#
s3_output_path: str#
sagemaker_session: Any | None#
class sagemaker.train.evaluate.BenchMarkEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, benchmark: _Benchmark, subtasks: str | List[str] | None = None, evaluate_base_model: bool = False)[source]#

Bases: BaseEvaluator

Benchmark evaluator for standard model evaluation tasks.

This evaluator accepts a benchmark enum and automatically deduces the appropriate metrics, strategy, and subtask availability based on the benchmark configuration. Supports standard benchmarks such as MMLU, BBH, MATH, and MMMU.

benchmark#

Benchmark type from the Benchmark enum obtained via get_benchmarks(). Required. Use get_benchmarks() to access available benchmark types.

Type:

_Benchmark

subtasks#

Benchmark subtask(s) to evaluate. Defaults to ‘ALL’ for benchmarks that support subtasks. Can be a single subtask string, a list of subtasks, or ‘ALL’ to run all subtasks. For benchmarks without subtask support, must be None.

Type:

Optional[Union[str, list[str]]]

mlflow_resource_arn#

ARN of the MLflow tracking server for experiment tracking. Optional. If not provided, the system will attempt to resolve it using the default MLflow app experience (checks domain match, account default, or creates a new app). Format: arn:aws:sagemaker:region:account:mlflow-tracking-server/name

Type:

Optional[str]

evaluate_base_model#

Whether to evaluate the base model in addition to the custom model. Set to True to evaluate both the base and custom models for comparison. Defaults to False (only the custom model is evaluated).

Type:

bool

region#

AWS region. Inherited from BaseEvaluator.

Type:

Optional[str]

sagemaker_session#

SageMaker session object. Inherited from BaseEvaluator.

Type:

Optional[Any]

model#

Model for evaluation. Inherited from BaseEvaluator.

Type:

Union[str, Any]

base_eval_name#

Base name for evaluation jobs. Inherited from BaseEvaluator.

Type:

Optional[str]

s3_output_path#

S3 location for evaluation outputs. Inherited from BaseEvaluator.

Type:

str

mlflow_experiment_name#

MLflow experiment name. Inherited from BaseEvaluator.

Type:

Optional[str]

mlflow_run_name#

MLflow run name. Inherited from BaseEvaluator.

Type:

Optional[str]

networking#

VPC configuration. Inherited from BaseEvaluator.

Type:

Optional[VpcConfig]

kms_key_id#

KMS key ID for encryption. Inherited from BaseEvaluator.

Type:

Optional[str]

model_package_group#

Model package group. Inherited from BaseEvaluator.

Type:

Optional[Union[str, ModelPackageGroup]]

Example

# Get available benchmarks
Benchmark = get_benchmarks()

# Create evaluator with benchmark and subtasks
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks=["abstract_algebra", "anatomy", "astronomy"],
    model="llama3-2-1b-instruct",
    s3_output_path="s3://bucket/outputs/",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server"
)

# Run evaluation with configured subtasks
execution = evaluator.evaluate()
execution.wait()

# Or override subtasks at evaluation time
execution = evaluator.evaluate(subtask="abstract_algebra")
benchmark: _Benchmark#
evaluate(subtask: str | List[str] | None = None) EvaluationPipelineExecution[source]#

Create and start a benchmark evaluation job.

Parameters:

subtask (Optional[Union[str, list[str]]]) – Optional subtask(s) to evaluate. If not provided, uses the subtasks from constructor. Can be a single subtask string, a list of subtasks, or ‘ALL’ to run all subtasks.

Returns:

The created benchmark evaluation execution.

Return type:

EvaluationPipelineExecution

Example

Benchmark = get_benchmarks()
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks="ALL",
    model="llama3-2-1b-instruct",
    s3_output_path="s3://bucket/outputs/"
)

# Evaluate single subtask
execution = evaluator.evaluate(subtask="abstract_algebra")

# Evaluate multiple subtasks
execution = evaluator.evaluate(subtask=["abstract_algebra", "anatomy"])

# Evaluate all subtasks (uses constructor default)
execution = evaluator.evaluate()
evaluate_base_model: bool#
classmethod get_all(session: Any | None = None, region: str | None = None) Iterator[EvaluationPipelineExecution][source]#

Get all benchmark evaluation executions.

Uses EvaluationPipelineExecution.get_all() to retrieve all benchmark evaluation executions as an iterator.

Parameters:
  • session (Optional[Any]) – Optional boto3 session. If not provided, will be inferred.

  • region (Optional[str]) – Optional AWS region. If not provided, will be inferred.

Yields:

EvaluationPipelineExecution – Benchmark evaluation execution instances.

Example

# Get all benchmark evaluations as iterator
eval_iter = BenchMarkEvaluator.get_all()
all_executions = list(eval_iter)

# Or iterate directly
for execution in BenchMarkEvaluator.get_all():
    print(f"{execution.name}: {execution.status.overall_status}")

# With specific session/region
eval_iter = BenchMarkEvaluator.get_all(session=my_session, region='us-west-2')
all_executions = list(eval_iter)
property hyperparameters#

Get evaluation hyperparameters as a FineTuningOptions object.

This property provides access to evaluation hyperparameters with validation, type checking, and user-friendly information display. Hyperparameters are lazily loaded from the JumpStart Hub when first accessed.

Returns:

Dynamic object with evaluation hyperparameters

Return type:

FineTuningOptions

Raises:

ValueError – If base model name is not available or if hyperparameters cannot be loaded

Example

evaluator = BenchMarkEvaluator(...)

# Access current values
print(evaluator.hyperparameters.temperature)

# Modify values (with validation)
evaluator.hyperparameters.temperature = 0.5

# Get as dictionary
params = evaluator.hyperparameters.to_dict()

# Display parameter information
evaluator.hyperparameters.get_info()
evaluator.hyperparameters.get_info('temperature')
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(context: Any, /) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

subtasks: str | List[str] | None#
class sagemaker.train.evaluate.CustomScorerEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, evaluator: str | Any, dataset: Any, evaluate_base_model: bool = False)[source]#

Bases: BaseEvaluator

Custom scorer evaluation job for preset or custom evaluator metrics.

This evaluator supports both preset metrics (via built-in metrics enum) and custom evaluator implementations for specialized evaluation needs.

evaluator#

Built-in metric enum value, Evaluator object, or Evaluator ARN string. Required. Use get_builtin_metrics() for available preset metrics.

Type:

Union[str, Any]

dataset#

Dataset for evaluation. Required. Accepts S3 URI, Dataset ARN, or DataSet object.

Type:

Any

mlflow_resource_arn#

ARN of the MLflow tracking server for experiment tracking. Optional. If not provided, the system will attempt to resolve it using the default MLflow app experience (checks domain match, account default, or creates a new app). Inherited from BaseEvaluator.

Type:

Optional[str]

evaluate_base_model#

Whether to evaluate the base model in addition to the custom model. Set to True to evaluate both the base and custom models for comparison. Defaults to False (only the custom model is evaluated).

Type:

bool

region#

AWS region. Inherited from BaseEvaluator.

Type:

Optional[str]

sagemaker_session#

SageMaker session object. Inherited from BaseEvaluator.

Type:

Optional[Any]

model#

Model for evaluation. Inherited from BaseEvaluator.

Type:

Union[str, Any]

base_eval_name#

Base name for evaluation jobs. Inherited from BaseEvaluator.

Type:

Optional[str]

s3_output_path#

S3 location for evaluation outputs. Inherited from BaseEvaluator.

Type:

str

mlflow_experiment_name#

MLflow experiment name. Inherited from BaseEvaluator.

Type:

Optional[str]

mlflow_run_name#

MLflow run name. Inherited from BaseEvaluator.

Type:

Optional[str]

networking#

VPC configuration. Inherited from BaseEvaluator.

Type:

Optional[VpcConfig]

kms_key_id#

KMS key ID for encryption. Inherited from BaseEvaluator.

Type:

Optional[str]

model_package_group#

Model package group. Inherited from BaseEvaluator.

Type:

Optional[Union[str, ModelPackageGroup]]

Example

from sagemaker.train.evaluate.custom_scorer_evaluator import (
    CustomScorerEvaluator,
    get_builtin_metrics
)
from sagemaker.ai_registry.evaluator import Evaluator

# Using preset metric
BuiltInMetric = get_builtin_metrics()
evaluator = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset=my_dataset,
    model="my-model",
    s3_output_path="s3://bucket/output",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server"
)

# Using custom evaluator
my_evaluator = Evaluator.create(
    name="my-custom-evaluator",
    function_source="/path/to/evaluator.py",
    sub_type="AWS/Evaluator"
)
evaluator = CustomScorerEvaluator(
    evaluator=my_evaluator,
    dataset=my_dataset,
    model="my-model",
    s3_output_path="s3://bucket/output",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server"
)

# Using evaluator ARN string
evaluator = CustomScorerEvaluator(
    evaluator="arn:aws:sagemaker:us-west-2:123456789012:hub-content/AIRegistry/Evaluator/my-evaluator/1",
    dataset=my_dataset,
    model="my-model",
    s3_output_path="s3://bucket/output",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server"
)

job = evaluator.evaluate()
dataset: Any#
evaluate() EvaluationPipelineExecution[source]#

Create and start a custom scorer evaluation job.

Returns:

The created custom scorer evaluation execution

Return type:

EvaluationPipelineExecution

Example

evaluator = CustomScorerEvaluator(
    evaluator=BuiltInMetric.CODE_EXECUTIONS,
    dataset=my_dataset,
    model="my-model",
    s3_output_path="s3://bucket/output",
    mlflow_resource_arn="arn:..."
)
execution = evaluator.evaluate()
execution.wait()
evaluate_base_model: bool#
evaluator: str | Any#
classmethod get_all(session: Any | None = None, region: str | None = None)[source]#

Get all custom scorer evaluation executions.

Uses EvaluationPipelineExecution.get_all() to retrieve all custom scorer evaluation executions as an iterator.

Parameters:
  • session (Optional[Any]) – Optional boto3 session. If not provided, will be inferred.

  • region (Optional[str]) – Optional AWS region. If not provided, will be inferred.

Yields:

EvaluationPipelineExecution – Custom scorer evaluation execution instances

Example

# Get all custom scorer evaluations as iterator
evaluations = CustomScorerEvaluator.get_all()
all_executions = list(evaluations)

# Or iterate directly
for execution in CustomScorerEvaluator.get_all():
    print(f"{execution.name}: {execution.status.overall_status}")

# With specific session/region
evaluations = CustomScorerEvaluator.get_all(session=my_session, region='us-west-2')
all_executions = list(evaluations)
property hyperparameters#

Get evaluation hyperparameters as a FineTuningOptions object.

This property provides access to evaluation hyperparameters with validation, type checking, and user-friendly information display. Hyperparameters are lazily loaded from the JumpStart Hub when first accessed.

Returns:

Dynamic object with evaluation hyperparameters

Return type:

FineTuningOptions

Raises:

ValueError – If base model name is not available or if hyperparameters cannot be loaded

Example

evaluator = CustomScorerEvaluator(...)

# Access current values
print(evaluator.hyperparameters.temperature)

# Modify values (with validation)
evaluator.hyperparameters.temperature = 0.5

# Get as dictionary
params = evaluator.hyperparameters.to_dict()

# Display parameter information
evaluator.hyperparameters.get_info()
evaluator.hyperparameters.get_info('temperature')
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(context: Any, /) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

class sagemaker.train.evaluate.EvaluationPipelineExecution(*, arn: str | None = None, name: str, status: ~sagemaker.train.evaluate.execution.PipelineExecutionStatus = <factory>, last_modified_time: ~datetime.datetime | None = None, eval_type: ~sagemaker.train.evaluate.constants.EvalType | None = None, s3_output_path: str | None = None, steps: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>)[source]#

Bases: BaseModel

Manages SageMaker pipeline-based evaluation execution lifecycle.

This class wraps SageMaker Pipeline execution to provide a simplified interface for running, monitoring, and managing evaluation jobs. Users typically don’t instantiate this class directly, but receive instances from evaluator classes.

Example

from sagemaker.train.evaluate import BenchMarkEvaluator
from sagemaker.train.evaluate.execution import EvaluationPipelineExecution

# Start evaluation through evaluator
evaluator = BenchMarkEvaluator(...)
execution = evaluator.evaluate()

# Monitor execution
print(f"Status: {execution.status.overall_status}")
print(f"Steps: {len(execution.status.step_details)}")

# Wait for completion
execution.wait()

# Display results
execution.show_results()

# Retrieve past executions
all_executions = list(EvaluationPipelineExecution.get_all())
specific_execution = EvaluationPipelineExecution.get(arn="arn:...")
Parameters:
  • arn (Optional[str]) – ARN of the pipeline execution.

  • name (str) – Name of the evaluation execution.

  • status (PipelineExecutionStatus) – Combined status with step details and failure reason.

  • last_modified_time (Optional[datetime]) – Last modification timestamp.

  • eval_type (Optional[EvalType]) – Type of evaluation (BENCHMARK, CUSTOM_SCORER, LLM_AS_JUDGE).

  • s3_output_path (Optional[str]) – S3 location where evaluation results are stored.

  • steps (List[Dict[str, Any]]) – Raw step information from SageMaker.

class Config[source]#

Bases: object

arbitrary_types_allowed = True#
arn: str | None#
eval_type: EvalType | None#
classmethod get(arn: str, session: Session | None = None, region: str | None = None) EvaluationPipelineExecution[source]#

Get a SageMaker pipeline execution instance by ARN.

Parameters:
  • arn (str) – ARN of the pipeline execution.

  • session (Optional[Session]) – Boto3 session. Will be inferred if not provided.

  • region (Optional[str]) – AWS region. Will be inferred if not provided.

Returns:

Retrieved pipeline execution instance.

Return type:

EvaluationPipelineExecution

Raises:

ClientError – If AWS service call fails.

Example

# Get execution by ARN
arn = "arn:aws:sagemaker:us-west-2:123456789012:pipeline/eval-pipeline/execution/abc123"
execution = EvaluationPipelineExecution.get(arn=arn)
print(execution.status.overall_status)
classmethod get_all(eval_type: EvalType | None = None, session: Session | None = None, region: str | None = None)[source]#

Get all pipeline executions, optionally filtered by evaluation type.

Searches for existing pipelines using prefix and tag validation, then retrieves executions from those pipelines.

Parameters:
  • eval_type (Optional[EvalType]) – Evaluation type to filter by (e.g., EvalType.BENCHMARK). If None, returns executions from all evaluation pipelines.

  • session (Optional[Session]) – Boto3 session. Will be inferred if not provided.

  • region (Optional[str]) – AWS region. Will be inferred if not provided.

Yields:

EvaluationPipelineExecution – Pipeline execution instances.

Example

# Get all evaluation executions as iterator
eval_iter = EvaluationPipelineExecution.get_all()
all_executions = list(eval_iter)

# Get only benchmark evaluations
eval_iter = EvaluationPipelineExecution.get_all(eval_type=EvalType.BENCHMARK)
benchmark_executions = list(eval_iter)
last_modified_time: datetime | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str#
refresh() None[source]#

Describe a pipeline execution and update the job status.

s3_output_path: str | None#
classmethod start(eval_type: EvalType, name: str, pipeline_definition: str, role_arn: str, s3_output_path: str | None = None, session: Session | None = None, region: str | None = None, tags: List[Dict[str, str | PipelineVariable]] | None = []) EvaluationPipelineExecution[source]#

Create a SageMaker pipeline execution, creating the pipeline first if it does not already exist.

Parameters:
  • eval_type (EvalType) – Type of evaluation (BENCHMARK, CUSTOM_SCORER, LLM_AS_JUDGE).

  • name (str) – Name for the evaluation execution.

  • pipeline_definition (str) – Complete rendered pipeline definition as JSON string.

  • role_arn (str) – IAM role ARN for pipeline execution.

  • s3_output_path (Optional[str]) – S3 location where evaluation results are stored.

  • session (Optional[Session]) – Boto3 session for API calls.

  • region (Optional[str]) – AWS region for the pipeline.

  • tags (Optional[List[TagsDict]]) – List of tags to apply to the pipeline.

Returns:

Started pipeline execution instance.

Return type:

EvaluationPipelineExecution

Raises:
  • ValueError – If pipeline_definition is not valid JSON.

  • ClientError – If AWS service call fails.

status: PipelineExecutionStatus#
steps: List[Dict[str, Any]]#
stop() None[source]#

Stop a pipeline execution.

wait(target_status: Literal['Executing', 'Stopping', 'Stopped', 'Failed', 'Succeeded'] = 'Succeeded', poll: int = 5, timeout: int | None = None) None[source]#

Wait for a pipeline execution to reach certain status.

This method provides a hybrid implementation that works in both Jupyter notebooks and terminal environments, with appropriate visual feedback for each.

Parameters:
  • target_status – The status to wait for

  • poll – The number of seconds to wait between each poll

  • timeout – The maximum number of seconds to wait before timing out
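The polling behavior wait() provides can be sketched in plain Python. Here `wait_for` and `get_status` are illustrative stand-ins, not SDK APIs; the real method additionally renders progress feedback for notebooks and terminals.

```python
import time

# Minimal sketch of status polling: check the status, sleep between polls,
# stop at the target or any terminal status, and raise on timeout.
def wait_for(get_status, target="Succeeded", poll=5, timeout=60):
    terminal = {"Stopped", "Failed", "Succeeded"}
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target or status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout}s")
        time.sleep(poll)

# Simulated status sequence standing in for repeated pipeline polling.
states = iter(["Executing", "Executing", "Succeeded"])
assert wait_for(lambda: next(states), poll=0) == "Succeeded"
```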

class sagemaker.train.evaluate.LLMAsJudgeEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, evaluator_model: str, dataset: str | Any, builtin_metrics: List[str] | None = None, custom_metrics: str | None = None, evaluate_base_model: bool = False)[source]#

Bases: BaseEvaluator

LLM-as-judge evaluation job.

This evaluator uses foundation models to evaluate LLM responses based on various quality and responsible AI metrics.

This feature is powered by Amazon Bedrock Evaluations. Your use of this feature is subject to the pricing of Amazon Bedrock Evaluations, the Service Terms applicable to Amazon Bedrock, and the terms that apply to your usage of third-party models. Amazon Bedrock Evaluations may securely transmit data across AWS Regions within your geography for processing. For more information, see the Amazon Bedrock Evaluations documentation.

Documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html

evaluator_model#

AWS Bedrock foundation model identifier to use as the judge. Required. For supported models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html#evaluation-judge-supported

Type:

str

dataset#

Evaluation dataset. Required. Accepts:

  • S3 URI (str): e.g., ‘s3://bucket/path/dataset.jsonl’

  • Dataset ARN (str): e.g., ‘arn:aws:sagemaker:…:hub-content/AIRegistry/DataSet/…’

  • DataSet object: sagemaker.ai_registry.dataset.DataSet instance (ARN inferred automatically)

Type:

Union[str, Any]

builtin_metrics#

List of built-in evaluation metric names to compute. The ‘Builtin.’ prefix from Bedrock documentation is optional and will be automatically removed if present. Examples: [‘Correctness’, ‘Faithfulness’] or [‘Builtin.Correctness’, ‘Builtin.Faithfulness’]. Optional.

Type:

Optional[List[str]]
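The optional-prefix behavior described above can be sketched as follows; `normalize` is an illustrative helper, not an SDK function.

```python
# 'Builtin.' is optional, so both spellings resolve to the same metric name.
def normalize(metric: str) -> str:
    prefix = "Builtin."
    return metric[len(prefix):] if metric.startswith(prefix) else metric

assert normalize("Builtin.Correctness") == normalize("Correctness") == "Correctness"
```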

custom_metrics#

JSON string containing array of custom metric definitions. Optional. For format details, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html

Type:

Optional[str]

mlflow_resource_arn#

ARN of the MLflow tracking server for experiment tracking. Optional. If not provided, the system will attempt to resolve it using the default MLflow app experience (checks domain match, account default, or creates a new app). Inherited from BaseEvaluator.

Type:

Optional[str]

evaluate_base_model#

Whether to evaluate the base model in addition to the custom model. Set to True to evaluate both the base and custom models for comparison. Defaults to False (only the custom model is evaluated).

Type:

bool

region#

AWS region. Inherited from BaseEvaluator.

Type:

Optional[str]

sagemaker_session#

SageMaker session object. Inherited from BaseEvaluator.

Type:

Optional[Any]

model#

Model for evaluation. Inherited from BaseEvaluator.

Type:

Union[str, Any]

base_eval_name#

Base name for evaluation jobs. Inherited from BaseEvaluator.

Type:

Optional[str]

s3_output_path#

S3 location for evaluation outputs. Inherited from BaseEvaluator.

Type:

str

mlflow_experiment_name#

MLflow experiment name. Inherited from BaseEvaluator.

Type:

Optional[str]

mlflow_run_name#

MLflow run name. Inherited from BaseEvaluator.

Type:

Optional[str]

networking#

VPC configuration. Inherited from BaseEvaluator.

Type:

Optional[VpcConfig]

kms_key_id#

KMS key ID for encryption. Inherited from BaseEvaluator.

Type:

Optional[str]

model_package_group#

Model package group. Inherited from BaseEvaluator.

Type:

Optional[Union[str, ModelPackageGroup]]

Example

from sagemaker.train.evaluate import LLMAsJudgeEvaluator

# Example with built-in metrics (prefix optional)
# Both formats work - with or without 'Builtin.' prefix
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness", "Helpfulness"],  # Prefix optional
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server",
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()

# Example with custom metrics (custom_metrics expects a JSON string)
import json

custom_metric_defs = [
    {
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": "Assess if the response has positive sentiment. Prompt: {{prompt}}\nResponse: {{prediction}}",
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1.0}},
                {"definition": "Poor", "value": {"floatValue": 0.0}}
            ]
        }
    }
]

evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-haiku-20240307-v1:0",
    dataset="s3://my-bucket/dataset.jsonl",
    custom_metrics=json.dumps(custom_metric_defs),
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()

# Example evaluating only custom model (skip base model)
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness"],  # Prefix optional
    evaluate_base_model=False,
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()
builtin_metrics: List[str] | None#
custom_metrics: str | None#
dataset: str | Any#
evaluate()[source]#

Create and start an LLM-as-judge evaluation job.

This method initiates a 2-phase evaluation job:

  1. Phase 1: Generate inference responses from base and custom models

  2. Phase 2: Use judge model to evaluate responses with built-in and custom metrics

Returns:

The created LLM-as-judge evaluation execution

Return type:

EvaluationPipelineExecution

Raises:

ValueError – If invalid model, dataset, or metric configurations are provided

Example

evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness", "Helpfulness"],  # Prefix optional
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()
execution.wait()

evaluate_base_model: bool#
evaluator_model: str#
classmethod get_all(session: Any | None = None, region: str | None = None)[source]#

Get all LLM-as-judge evaluation executions.

Uses EvaluationPipelineExecution.get_all() to retrieve all LLM-as-judge evaluation executions as an iterator.

Parameters:
  • session (Optional[Any]) – Optional boto3 session. If not provided, will be inferred.

  • region (Optional[str]) – Optional AWS region. If not provided, will be inferred.

Yields:

EvaluationPipelineExecution – LLM-as-judge evaluation execution instances

Example

# Get all LLM-as-judge evaluations as iterator
evaluations = LLMAsJudgeEvaluator.get_all()
all_executions = list(evaluations)

# Or iterate directly
for execution in LLMAsJudgeEvaluator.get_all():
    print(f"{execution.name}: {execution.status.overall_status}")

# With specific session/region
evaluations = LLMAsJudgeEvaluator.get_all(session=my_session, region='us-west-2')
all_executions = list(evaluations)
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class sagemaker.train.evaluate.PipelineExecutionStatus(*, overall_status: str, step_details: ~typing.List[~sagemaker.train.evaluate.execution.StepDetail] = <factory>, failure_reason: str | None = None)[source]#

Bases: BaseModel

Combined pipeline execution status with step details and failure reason.

Aggregates the overall execution status along with detailed information about individual pipeline steps and any failure reasons.

Parameters:
  • overall_status (str) – Overall execution status (Starting, Executing, Completed, Failed, etc.).

  • step_details (List[StepDetail]) – List of individual pipeline step details.

  • failure_reason (Optional[str]) – Detailed reason if the execution failed.

failure_reason: str | None#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

overall_status: str#
step_details: List[StepDetail]#
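The fields above make it straightforward to summarize an execution at a glance. A minimal sketch using lightweight dataclass stand-ins that mirror the documented StepDetail and PipelineExecutionStatus fields (the real classes are pydantic models; the summarize helper is illustrative, not part of the SDK):

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StepDetail:
    name: str
    status: str
    failure_reason: Optional[str] = None

@dataclass
class PipelineExecutionStatus:
    overall_status: str
    step_details: List[StepDetail] = field(default_factory=list)
    failure_reason: Optional[str] = None

def summarize(status: PipelineExecutionStatus) -> str:
    """One-line summary: overall status plus per-state step counts."""
    counts = Counter(step.status for step in status.step_details)
    parts = ", ".join(f"{state}: {n}" for state, n in sorted(counts.items()))
    line = f"{status.overall_status} ({parts})"
    if status.failure_reason:
        line += f" - {status.failure_reason}"
    return line

status = PipelineExecutionStatus(
    overall_status="Executing",
    step_details=[
        StepDetail(name="Inference", status="Completed"),
        StepDetail(name="Judge", status="Executing"),
    ],
)
print(summarize(status))  # Executing (Completed: 1, Executing: 1)
```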
class sagemaker.train.evaluate.StepDetail(*, name: str, status: str, start_time: str | None = None, end_time: str | None = None, display_name: str | None = None, failure_reason: str | None = None, job_arn: str | None = None)[source]#

Bases: BaseModel

Pipeline step details for tracking execution progress.

Represents the status and timing information for a single step in a SageMaker pipeline execution.

Parameters:
  • name (str) – Name of the pipeline step.

  • status (str) – Status of the step (Completed, Executing, Waiting, Failed).

  • start_time (Optional[str]) – ISO format timestamp when step started.

  • end_time (Optional[str]) – ISO format timestamp when step ended.

  • display_name (Optional[str]) – Human-readable display name for the step.

  • failure_reason (Optional[str]) – Detailed reason if the step failed.

  • job_arn (Optional[str]) – ARN of the underlying SageMaker job backing this step, if available.

display_name: str | None#
end_time: str | None#
failure_reason: str | None#
job_arn: str | None#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str#
start_time: str | None#
status: str#
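Since start_time and end_time are ISO-format timestamp strings, a step's duration can be computed directly with the standard library. A small sketch (the step_duration_seconds helper is illustrative, not part of the SDK):

```python
from datetime import datetime
from typing import Optional

def step_duration_seconds(start_time: Optional[str], end_time: Optional[str]) -> Optional[float]:
    """Duration of a step in seconds from its ISO timestamps, or None if incomplete."""
    if not (start_time and end_time):
        return None
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time)
    return (end - start).total_seconds()

print(step_duration_seconds("2024-01-01T10:00:00", "2024-01-01T10:05:30"))  # 330.0
print(step_duration_seconds("2024-01-01T10:00:00", None))  # None (step still running)
```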
sagemaker.train.evaluate.get_benchmark_properties(benchmark: _Benchmark) → Dict[str, Any][source]#

Get properties for a specific benchmark.

This utility method returns the properties associated with a given benchmark as a dictionary, including information about modality, metrics, strategy, and available subtasks.

Parameters:

benchmark (_Benchmark) – The benchmark to get properties for (from get_benchmarks()).

Returns:

Dictionary containing benchmark properties with keys:

  • modality (str): The modality type (e.g., “Text”, “Multi-Modal”)

  • description (str): Description of the benchmark

  • metrics (list[str]): List of supported metrics

  • strategy (str): The evaluation strategy used

  • subtask_available (bool): Whether subtasks are supported

  • subtasks (Optional[list[str]]): List of available subtasks, if applicable

Return type:

Dict[str, Any]

Raises:

ValueError – If the provided benchmark is not found in the configuration.

Example

Benchmark = get_benchmarks()
props = get_benchmark_properties(Benchmark.MMLU)
print(props['description'])
# 'Multi-task Language Understanding – Tests knowledge across 57 subjects.'
print(props['subtasks'][:3])
# ['abstract_algebra', 'anatomy', 'astronomy']

Note

In the future, this will be extended to dynamically fetch benchmark properties from a backend API call instead of using the internal static configuration.

sagemaker.train.evaluate.get_benchmarks() → Type[_Benchmark][source]#

Get the Benchmark enum for selecting available benchmarks.

This utility method provides access to the internal Benchmark enum, allowing users to reference available benchmarks without directly accessing internal implementation details.

Returns:

The Benchmark enum class containing all available benchmarks.

Return type:

Type[_Benchmark]

Example

Benchmark = get_benchmarks()
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    sagemaker_session=session,
    s3_output_path="s3://bucket/output"
)

Note

In the future, this will be extended to dynamically generate the enum from a backend API call to fetch the latest available benchmarks.

sagemaker.train.evaluate.get_builtin_metrics() → Type[_BuiltInMetric][source]#

Get the built-in metrics enum for custom scorer evaluation.

This utility function provides access to preset metrics for custom scorer evaluation.

Returns:

The built-in metric enum class

Return type:

Type[_BuiltInMetric]

Example

from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()
evaluator = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset=my_dataset,
    base_model="my-model",
    s3_output_path="s3://bucket/output",
    mlflow_resource_arn="arn:..."
)

Modules

base_evaluator

Base evaluator module for SageMaker Model Evaluation.

benchmark_evaluator

Benchmark evaluator module for SageMaker Model Evaluation.

constants

Constants for SageMaker Evaluation Module.

custom_scorer_evaluator

Custom Scorer Evaluator for SageMaker Model Evaluation Module.

execution

SageMaker Evaluation Execution Module.

llm_as_judge_evaluator

LLM-as-Judge Evaluator for SageMaker Model Evaluation Module.

pipeline_templates

Pipeline templates for SageMaker Model Evaluation.