sagemaker.train.evaluate.benchmark_evaluator#
Benchmark evaluator module for SageMaker Model Evaluation.
This module provides benchmark evaluation capabilities for SageMaker models, supporting various standard benchmarks like MMLU, BBH, MATH, and others. It handles benchmark configuration, validation, and execution of evaluation pipelines.
Functions
| get_benchmark_properties(benchmark) | Get properties for a specific benchmark. |
| get_benchmarks() | Get the Benchmark enum for selecting available benchmarks. |
Classes
| BenchMarkEvaluator(*, benchmark, ...) | Benchmark evaluator for standard model evaluation tasks. |
- class sagemaker.train.evaluate.benchmark_evaluator.BenchMarkEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, benchmark: _Benchmark, subtasks: str | List[str] | None = None, evaluate_base_model: bool = False)[source]#
Bases: BaseEvaluator
Benchmark evaluator for standard model evaluation tasks.
This evaluator accepts a benchmark enum and automatically deduces the appropriate metrics, strategy, and subtask availability based on the benchmark configuration. Supports various standard benchmarks like MMLU, BBH, MATH, MMMU, and others.
- benchmark#
Benchmark type from the Benchmark enum. Required. Use get_benchmarks() to access available benchmark types.
- Type:
_Benchmark
- subtasks#
Benchmark subtask(s) to evaluate. Defaults to ‘ALL’ for benchmarks that support subtasks. Can be a single subtask string, a list of subtasks, or ‘ALL’ to run all subtasks. For benchmarks without subtask support, must be None.
- Type:
Optional[Union[str, list[str]]]
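The three accepted forms of subtasks can be illustrated with a small, self-contained sketch. Note that normalize_subtasks is a hypothetical helper written for illustration only; it is not the evaluator's actual implementation.

```python
from typing import List, Optional, Union

def normalize_subtasks(
    subtasks: Optional[Union[str, List[str]]],
    available: List[str],
) -> Optional[List[str]]:
    """Illustrative normalization of the three accepted subtask forms."""
    if subtasks is None:
        return None  # benchmark without subtask support
    if subtasks == "ALL":
        return list(available)  # expand to every available subtask
    if isinstance(subtasks, str):
        return [subtasks]  # single subtask string
    return list(subtasks)  # explicit list of subtasks

# Three forms accepted by the evaluator:
mmlu_subset = ["abstract_algebra", "anatomy", "astronomy"]
print(normalize_subtasks("abstract_algebra", mmlu_subset))  # ['abstract_algebra']
print(normalize_subtasks(["anatomy", "astronomy"], mmlu_subset))
print(normalize_subtasks("ALL", mmlu_subset))
```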
- mlflow_resource_arn#
ARN of the MLflow tracking server for experiment tracking. Optional. If not provided, the system will attempt to resolve it using the default MLflow app experience (checks domain match, account default, or creates a new app). Format: arn:aws:sagemaker:region:account:mlflow-tracking-server/name
- Type:
Optional[str]
- evaluate_base_model#
Whether to evaluate the base model in addition to the custom model. Defaults to False (only the custom model is evaluated); set to True to also evaluate the base model.
- Type:
bool
- region#
AWS region. Inherited from BaseEvaluator.
- Type:
Optional[str]
- sagemaker_session#
SageMaker session object. Inherited from BaseEvaluator.
- Type:
Optional[Any]
- model#
Model for evaluation. Inherited from BaseEvaluator.
- Type:
Union[str, Any]
- base_eval_name#
Base name for evaluation jobs. Inherited from BaseEvaluator.
- Type:
Optional[str]
- s3_output_path#
S3 location for evaluation outputs. Inherited from BaseEvaluator.
- Type:
str
- mlflow_experiment_name#
MLflow experiment name. Inherited from BaseEvaluator.
- Type:
Optional[str]
- mlflow_run_name#
MLflow run name. Inherited from BaseEvaluator.
- Type:
Optional[str]
- kms_key_id#
KMS key ID for encryption. Inherited from BaseEvaluator.
- Type:
Optional[str]
- model_package_group#
Model package group. Inherited from BaseEvaluator.
- Type:
Optional[Union[str, ModelPackageGroup]]
Example
# Get available benchmarks
Benchmark = get_benchmarks()

# Create evaluator with benchmark and subtasks
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks=["abstract_algebra", "anatomy", "astronomy"],
    model="llama3-2-1b-instruct",
    s3_output_path="s3://bucket/outputs/",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server"
)

# Run evaluation with configured subtasks
execution = evaluator.evaluate()
execution.wait()

# Or override subtasks at evaluation time
execution = evaluator.evaluate(subtask="abstract_algebra")
- base_eval_name: str | None#
- benchmark: _Benchmark#
- evaluate(subtask: str | List[str] | None = None) EvaluationPipelineExecution[source]#
Create and start a benchmark evaluation job.
- Parameters:
subtask (Optional[Union[str, list[str]]]) – Optional subtask(s) to evaluate. If not provided, uses the subtasks from constructor. Can be a single subtask string, a list of subtasks, or ‘ALL’ to run all subtasks.
- Returns:
The created benchmark evaluation execution.
- Return type:
EvaluationPipelineExecution
Example
Benchmark = get_benchmarks()
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks="ALL",
    model="llama3-2-1b-instruct",
    s3_output_path="s3://bucket/outputs/"
)

# Evaluate single subtask
execution = evaluator.evaluate(subtask="abstract_algebra")

# Evaluate multiple subtasks
execution = evaluator.evaluate(subtask=["abstract_algebra", "anatomy"])

# Evaluate all subtasks (uses constructor default)
execution = evaluator.evaluate()
- evaluate_base_model: bool#
- classmethod get_all(session: Any | None = None, region: str | None = None) Iterator[EvaluationPipelineExecution][source]#
Get all benchmark evaluation executions.
Uses EvaluationPipelineExecution.get_all() to retrieve all benchmark evaluation executions as an iterator.
- Parameters:
session (Optional[Any]) – Optional boto3 session. If not provided, will be inferred.
region (Optional[str]) – Optional AWS region. If not provided, will be inferred.
- Yields:
EvaluationPipelineExecution – Benchmark evaluation execution instances.
Example
# Get all benchmark evaluations as iterator
eval_iter = BenchMarkEvaluator.get_all()
all_executions = list(eval_iter)

# Or iterate directly
for execution in BenchMarkEvaluator.get_all():
    print(f"{execution.name}: {execution.status.overall_status}")

# With specific session/region
eval_iter = BenchMarkEvaluator.get_all(session=my_session, region='us-west-2')
all_executions = list(eval_iter)
- property hyperparameters#
Get evaluation hyperparameters as a FineTuningOptions object.
This property provides access to evaluation hyperparameters with validation, type checking, and user-friendly information display. Hyperparameters are lazily loaded from the JumpStart Hub when first accessed.
- Returns:
Dynamic object with evaluation hyperparameters
- Return type:
FineTuningOptions
- Raises:
ValueError – If base model name is not available or if hyperparameters cannot be loaded
Example
evaluator = BenchMarkEvaluator(...)

# Access current values
print(evaluator.hyperparameters.temperature)

# Modify values (with validation)
evaluator.hyperparameters.temperature = 0.5

# Get as dictionary
params = evaluator.hyperparameters.to_dict()

# Display parameter information
evaluator.hyperparameters.get_info()
evaluator.hyperparameters.get_info('temperature')
- kms_key_id: str | None#
- mlflow_experiment_name: str | None#
- mlflow_resource_arn: str | None#
- mlflow_run_name: str | None#
- model: str | BaseTrainer | ModelPackage#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
- model_package_group: str | ModelPackageGroup | None#
- model_post_init(context: Any, /) None#
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- region: str | None#
- role: str | None#
- s3_output_path: str#
- sagemaker_session: Any | None#
- subtasks: str | List[str] | None#
- sagemaker.train.evaluate.benchmark_evaluator.get_benchmark_properties(benchmark: _Benchmark) Dict[str, Any][source]#
Get properties for a specific benchmark.
This utility method returns the properties associated with a given benchmark as a dictionary, including information about modality, metrics, strategy, and available subtasks.
- Parameters:
benchmark (_Benchmark) – The benchmark to get properties for (from get_benchmarks()).
- Returns:
Dictionary containing benchmark properties with keys:
- modality (str): The modality type (e.g., “Text”, “Multi-Modal”)
- description (str): Description of the benchmark
- metrics (list[str]): List of supported metrics
- strategy (str): The evaluation strategy used
- subtask_available (bool): Whether subtasks are supported
- subtasks (Optional[list[str]]): List of available subtasks, if applicable
- Return type:
Dict[str, Any]
- Raises:
ValueError – If the provided benchmark is not found in the configuration.
Example
Benchmark = get_benchmarks()
props = get_benchmark_properties(Benchmark.MMLU)
print(props['description'])
# 'Multi-task Language Understanding – Tests knowledge across 57 subjects.'
print(props['subtasks'][:3])
# ['abstract_algebra', 'anatomy', 'astronomy']
Note
In the future, this will be extended to dynamically fetch benchmark properties from a backend API call instead of using the internal static configuration.
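As a rough illustration of consuming the returned dictionary, a caller might validate requested subtasks against subtask_available before constructing an evaluator. The property values below are hypothetical stand-ins, not fetched from the service, and check_subtasks is an illustrative helper rather than part of the library.

```python
# Hypothetical properties dict with the documented keys; real values
# would come from get_benchmark_properties(Benchmark.MMLU).
props = {
    "modality": "Text",
    "description": "Multi-task Language Understanding",
    "metrics": ["accuracy"],
    "strategy": "multiple_choice",
    "subtask_available": True,
    "subtasks": ["abstract_algebra", "anatomy", "astronomy"],
}

def check_subtasks(props, requested):
    """Reject subtasks the benchmark does not offer (illustrative only)."""
    if not props["subtask_available"]:
        if requested is not None:
            raise ValueError("This benchmark does not support subtasks")
        return
    unknown = [s for s in requested if s not in props["subtasks"]]
    if unknown:
        raise ValueError(f"Unknown subtasks: {unknown}")

check_subtasks(props, ["anatomy", "astronomy"])  # passes silently
```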
- sagemaker.train.evaluate.benchmark_evaluator.get_benchmarks() Type[_Benchmark][source]#
Get the Benchmark enum for selecting available benchmarks.
This utility method provides access to the internal Benchmark enum, allowing users to reference available benchmarks without directly accessing internal implementation details.
- Returns:
The Benchmark enum class containing all available benchmarks.
- Return type:
Type[_Benchmark]
Example
Benchmark = get_benchmarks()
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    sagemaker_session=session,
    s3_output_path="s3://bucket/output"
)
Note
In the future, this will be extended to dynamically generate the enum from a backend API call to fetch the latest available benchmarks.