sagemaker.train.evaluate.benchmark_evaluator#

Benchmark evaluator module for SageMaker Model Evaluation.

This module provides benchmark evaluation capabilities for SageMaker models, supporting various standard benchmarks like MMLU, BBH, MATH, and others. It handles benchmark configuration, validation, and execution of evaluation pipelines.

Functions

get_benchmark_properties(benchmark)

Get properties for a specific benchmark.

get_benchmarks()

Get the Benchmark enum for selecting available benchmarks.

Classes

BenchMarkEvaluator(*[, region, role, ...])

Benchmark evaluator for standard model evaluation tasks.

class sagemaker.train.evaluate.benchmark_evaluator.BenchMarkEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, benchmark: _Benchmark, subtasks: str | List[str] | None = None, evaluate_base_model: bool = False)[source]#

Bases: BaseEvaluator

Benchmark evaluator for standard model evaluation tasks.

This evaluator accepts a benchmark enum and automatically deduces the appropriate metrics, strategy, and subtask availability based on the benchmark configuration. Supports various standard benchmarks like MMLU, BBH, MATH, MMMU, and others.

benchmark#

Benchmark type from the Benchmark enum obtained via get_benchmarks(). Required. Use get_benchmarks() to access available benchmark types.

Type:

_Benchmark

subtasks#

Benchmark subtask(s) to evaluate. Defaults to ‘ALL’ for benchmarks that support subtasks. Can be a single subtask string, a list of subtasks, or ‘ALL’ to run all subtasks. For benchmarks without subtask support, must be None.

Type:

Optional[Union[str, list[str]]]
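The accepted shapes for subtasks can be summarized with a small helper. The normalize_subtasks function below is a hypothetical illustration of the documented contract (single string, list, ‘ALL’, or None when subtasks are unsupported); it is not part of the SDK:

```python
from typing import List, Union

def normalize_subtasks(
    subtasks: Union[str, List[str], None],
    subtask_available: bool,
) -> Union[str, List[str], None]:
    # Hypothetical helper: mirrors only the documented subtasks contract.
    if not subtask_available:
        # Benchmarks without subtask support require subtasks=None.
        if subtasks is not None:
            raise ValueError("This benchmark does not support subtasks; pass None.")
        return None
    if subtasks is None or subtasks == "ALL":
        return "ALL"  # run every subtask the benchmark defines
    if isinstance(subtasks, str):
        return [subtasks]  # a single subtask name
    return list(subtasks)  # an explicit list of subtask names
```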

mlflow_resource_arn#

ARN of the MLflow tracking server for experiment tracking. Optional. If not provided, the system attempts to resolve one via the default MLflow app experience (checking for a domain match, then the account default, then creating a new app). Format: arn:aws:sagemaker:region:account:mlflow-tracking-server/name

Type:

Optional[str]
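The documented ARN format can be sanity-checked locally before constructing an evaluator. The regex below is an assumption derived from the format string above, not an SDK validator:

```python
import re

# Assumed pattern for arn:aws:sagemaker:region:account:mlflow-tracking-server/name;
# region and name character classes are illustrative guesses.
MLFLOW_ARN_PATTERN = re.compile(
    r"^arn:aws:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/[A-Za-z0-9._-]+$"
)

def looks_like_mlflow_arn(arn: str) -> bool:
    # Returns True when the string matches the documented ARN shape.
    return MLFLOW_ARN_PATTERN.match(arn) is not None
```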

evaluate_base_model#

Whether to evaluate the base model in addition to the custom model. Set to True to evaluate both models. Defaults to False (evaluates only the custom model).

Type:

bool

region#

AWS region. Inherited from BaseEvaluator.

Type:

Optional[str]

sagemaker_session#

SageMaker session object. Inherited from BaseEvaluator.

Type:

Optional[Any]

model#

Model for evaluation. Inherited from BaseEvaluator.

Type:

Union[str, Any]

base_eval_name#

Base name for evaluation jobs. Inherited from BaseEvaluator.

Type:

Optional[str]

s3_output_path#

S3 location for evaluation outputs. Inherited from BaseEvaluator.

Type:

str

mlflow_experiment_name#

MLflow experiment name. Inherited from BaseEvaluator.

Type:

Optional[str]

mlflow_run_name#

MLflow run name. Inherited from BaseEvaluator.

Type:

Optional[str]

networking#

VPC configuration. Inherited from BaseEvaluator.

Type:

Optional[VpcConfig]

kms_key_id#

KMS key ID for encryption. Inherited from BaseEvaluator.

Type:

Optional[str]

model_package_group#

Model package group. Inherited from BaseEvaluator.

Type:

Optional[Union[str, ModelPackageGroup]]

Example

from sagemaker.train.evaluate.benchmark_evaluator import (
    BenchMarkEvaluator,
    get_benchmarks,
)

# Get available benchmarks
Benchmark = get_benchmarks()

# Create evaluator with benchmark and subtasks
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks=["abstract_algebra", "anatomy", "astronomy"],
    model="llama3-2-1b-instruct",
    s3_output_path="s3://bucket/outputs/",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server"
)

# Run evaluation with configured subtasks
execution = evaluator.evaluate()
execution.wait()

# Or override subtasks at evaluation time
execution = evaluator.evaluate(subtask="abstract_algebra")

base_eval_name: str | None#
benchmark: _Benchmark#
evaluate(subtask: str | List[str] | None = None) EvaluationPipelineExecution[source]#

Create and start a benchmark evaluation job.

Parameters:

subtask (Optional[Union[str, list[str]]]) – Optional subtask(s) to evaluate. If not provided, uses the subtasks configured in the constructor. Can be a single subtask string, a list of subtasks, or ‘ALL’ to run all subtasks.

Returns:

The created benchmark evaluation execution.

Return type:

EvaluationPipelineExecution

Example

Benchmark = get_benchmarks()
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks="ALL",
    model="llama3-2-1b-instruct",
    s3_output_path="s3://bucket/outputs/"
)

# Evaluate single subtask
execution = evaluator.evaluate(subtask="abstract_algebra")

# Evaluate multiple subtasks
execution = evaluator.evaluate(subtask=["abstract_algebra", "anatomy"])

# Evaluate all subtasks (uses constructor default)
execution = evaluator.evaluate()

evaluate_base_model: bool#
classmethod get_all(session: Any | None = None, region: str | None = None) Iterator[EvaluationPipelineExecution][source]#

Get all benchmark evaluation executions.

Uses EvaluationPipelineExecution.get_all() to retrieve all benchmark evaluation executions as an iterator.

Parameters:
  • session (Optional[Any]) – Optional boto3 session. If not provided, will be inferred.

  • region (Optional[str]) – Optional AWS region. If not provided, will be inferred.

Yields:

EvaluationPipelineExecution – Benchmark evaluation execution instances.

Example

# Get all benchmark evaluations as iterator
eval_iter = BenchMarkEvaluator.get_all()
all_executions = list(eval_iter)

# Or iterate directly
for execution in BenchMarkEvaluator.get_all():
    print(f"{execution.name}: {execution.status.overall_status}")

# With specific session/region
eval_iter = BenchMarkEvaluator.get_all(session=my_session, region='us-west-2')
all_executions = list(eval_iter)

property hyperparameters#

Get evaluation hyperparameters as a FineTuningOptions object.

This property provides access to evaluation hyperparameters with validation, type checking, and user-friendly information display. Hyperparameters are lazily loaded from the JumpStart Hub when first accessed.

Returns:

Dynamic object with evaluation hyperparameters.

Return type:

FineTuningOptions

Raises:

ValueError – If base model name is not available or if hyperparameters cannot be loaded

Example

evaluator = BenchMarkEvaluator(...)

# Access current values
print(evaluator.hyperparameters.temperature)

# Modify values (with validation)
evaluator.hyperparameters.temperature = 0.5

# Get as dictionary
params = evaluator.hyperparameters.to_dict()

# Display parameter information
evaluator.hyperparameters.get_info()
evaluator.hyperparameters.get_info('temperature')

kms_key_id: str | None#
mlflow_experiment_name: str | None#
mlflow_resource_arn: str | None#
mlflow_run_name: str | None#
model: str | BaseTrainer | ModelPackage#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_package_group: str | ModelPackageGroup | None#
model_post_init(context: Any, /) None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

networking: VpcConfig | None#
region: str | None#
role: str | None#
s3_output_path: str#
sagemaker_session: Any | None#
subtasks: str | List[str] | None#

sagemaker.train.evaluate.benchmark_evaluator.get_benchmark_properties(benchmark: _Benchmark) Dict[str, Any][source]#

Get properties for a specific benchmark.

This utility method returns the properties associated with a given benchmark as a dictionary, including information about modality, metrics, strategy, and available subtasks.

Parameters:

benchmark (_Benchmark) – The benchmark to get properties for (from get_benchmarks()).

Returns:

Dictionary containing benchmark properties with keys:

  • modality (str): The modality type (e.g., “Text”, “Multi-Modal”)

  • description (str): Description of the benchmark

  • metrics (list[str]): List of supported metrics

  • strategy (str): The evaluation strategy used

  • subtask_available (bool): Whether subtasks are supported

  • subtasks (Optional[list[str]]): List of available subtasks, if applicable

Return type:

Dict[str, Any]

Raises:

ValueError – If the provided benchmark is not found in the configuration.

Example

Benchmark = get_benchmarks()
props = get_benchmark_properties(Benchmark.MMLU)
print(props['description'])
# 'Multi-task Language Understanding – Tests knowledge across 57 subjects.'
print(props['subtasks'][:3])
# ['abstract_algebra', 'anatomy', 'astronomy']

Note

In the future, this will be extended to dynamically fetch benchmark properties from a backend API call instead of using the internal static configuration.

sagemaker.train.evaluate.benchmark_evaluator.get_benchmarks() Type[_Benchmark][source]#

Get the Benchmark enum for selecting available benchmarks.

This utility method provides access to the internal Benchmark enum, allowing users to reference available benchmarks without directly accessing internal implementation details.

Returns:

The Benchmark enum class containing all available benchmarks.

Return type:

Type[_Benchmark]

Example

Benchmark = get_benchmarks()
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    sagemaker_session=session,
    s3_output_path="s3://bucket/output"
)

Note

In the future, this will be extended to dynamically generate the enum from a backend API call to fetch the latest available benchmarks.