SageMaker Benchmark Evaluation - Basic Usage
This notebook demonstrates the basic user-facing flow for creating and managing benchmark evaluation jobs using the BenchMarkEvaluator with Jinja2 template-based pipeline generation.
Step 1: Discover Available Benchmarks
Discover the available benchmarks and the properties of each:
# Configure AWS credentials and region
#! ada credentials update --provider=isengard --account=<> --role=Admin --profile=default --once
#! aws configure set region us-west-2
from sagemaker.train.evaluate import get_benchmarks, get_benchmark_properties
from rich.pretty import pprint
# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)
# Get available benchmarks
Benchmark = get_benchmarks()
pprint(list(Benchmark))
# Print properties for a specific benchmark
pprint(get_benchmark_properties(benchmark=Benchmark.MMLU))
Step 2: Create BenchMarkEvaluator
Create a BenchMarkEvaluator instance with the desired benchmark. The evaluator uses Jinja2 templates to render a complete pipeline definition.
Required Parameters:
- benchmark: Benchmark type from the Benchmark enum
- model: Model ARN from SageMaker hub content (or a model package ARN)
- s3_output_path: S3 location for evaluation outputs
- mlflow_resource_arn: MLflow tracking server ARN for experiment tracking
Optional Template Fields: These fields are used for template rendering. If not provided, defaults will be used:
- model_package_group: Model package group ARN
- source_model_package: Source model package ARN
- dataset: S3 URI of the evaluation dataset
- model_artifact: ARN of the model artifact for lineage tracking (auto-inferred from source_model_package)
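For illustration, a hedged sketch of supplying these optional fields (the ARNs and paths are placeholders, and it assumes the fields are accepted as plain constructor keyword arguments alongside the required ones, as model_package_group is in the tested example below):
from sagemaker.train.evaluate import BenchMarkEvaluator
# Hypothetical values for illustration only
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:us-east-1:<account-id>:model-package/my-models/1",
    s3_output_path="s3://my-bucket/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:us-east-1:<account-id>:mlflow-tracking-server/my-server",
    dataset="s3://my-bucket/datasets/eval.jsonl",  # optional: evaluation dataset S3 URI
    model_package_group="arn:aws:sagemaker:us-east-1:<account-id>:model-package-group/my-models",  # optional
)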
from sagemaker.train.evaluate import BenchMarkEvaluator
# Create evaluator with MMLU benchmark
# These values match our successfully tested configuration
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    # subtask="abstract_algebra",  # or "all"
    model="arn:aws:sagemaker:us-east-1:<account-id>:model-package/sdk-test-finetuned-models/2",
    s3_output_path="s3://sagemaker-us-east-1-<account-id>/model-customization/eval/",
    model_package_group="arn:aws:sagemaker:us-east-1:<account-id>:model-package-group/sdk-test-finetuned-models",  # Optional; inferred from `model` when it is a model package ARN
    base_eval_name="mmlu-eval-demo1",
    # Note: sagemaker_session is optional and will be auto-created if not provided
    # Note: region is optional and is auto-deduced from the SAGEMAKER_REGION or AWS_REGION environment variables
)
pprint(evaluator)
# # [Optional] BASE MODEL EVAL
# from sagemaker.train.evaluate import BenchMarkEvaluator
# # Create evaluator with MMLU benchmark
# # These values match our successfully tested configuration
# evaluator = BenchMarkEvaluator(
# benchmark=Benchmark.MMLU,
# model="meta-textgeneration-llama-3-2-1b-instruct",
# s3_output_path="s3://mufi-test-serverless-smtj/eval/",
# mlflow_resource_arn="arn:aws:sagemaker:us-west-2:<>:mlflow-tracking-server/mmlu-eval-experiment",
# # model_package_group="arn:aws:sagemaker:us-west-2:<>:model-package-group/example-name-aovqo", # Optional inferred from model if model package
# base_eval_name="gen-qa-eval-demo",
# # Note: sagemaker_session is optional and will be auto-created if not provided
# # Note: region is optional and will be auto deduced using environment variables - SAGEMAKER_REGION, AWS_REGION
# )
# pprint(evaluator)
# # [Optional] Nova testing IAD Prod
# from sagemaker.train.evaluate import BenchMarkEvaluator
# # Create evaluator with MMLU benchmark
# # These values match our successfully tested configuration
# evaluator = BenchMarkEvaluator(
# benchmark=Benchmark.MMLU,
# # model="arn:aws:sagemaker:us-east-1:<>:model-package/bgrv-nova-micro-sft-lora/1",
# model="arn:aws:sagemaker:us-east-1:<>:model-package/test-nova-finetuned-models/3",
# s3_output_path="s3://mufi-test-serverless-iad/eval/",
# mlflow_resource_arn="arn:aws:sagemaker:us-east-1:<>:mlflow-tracking-server/mlflow-prod-server",
# model_package_group="arn:aws:sagemaker:us-east-1:<>:model-package-group/test-nova-finetuned-models", # Optional inferred from model if model package
# base_eval_name="gen-qa-eval-demo",
# region="us-east-1",
# # Note: sagemaker_session is optional and will be auto-created if not provided
# # Note: region is optional and will be auto deduced using environment variables - SAGEMAKER_REGION, AWS_REGION
# )
# pprint(evaluator)
Optionally update the hyperparameters
pprint(evaluator.hyperparameters.to_dict())
# optionally update hyperparameters
# evaluator.hyperparameters.temperature = "0.1"
# optionally get more info on types, limits, defaults.
# evaluator.hyperparameters.get_info()
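For example, to lower the sampling temperature and confirm the change took effect (a small sketch using the attributes shown above):
# Update a hyperparameter, then re-inspect the full set
evaluator.hyperparameters.temperature = "0.1"
pprint(evaluator.hyperparameters.to_dict())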
Step 3: Run Evaluation
Start a benchmark evaluation job. The system will:
1. Build the template context with all required parameters
2. Render the pipeline definition from DETERMINISTIC_TEMPLATE using Jinja2
3. Create or update the pipeline with the rendered definition
4. Start the pipeline execution with empty parameters (all values pre-substituted)
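DETERMINISTIC_TEMPLATE itself is internal to the SDK, but the mechanism is ordinary Jinja2 substitution. A toy stand-in (not the actual template) shows why the rendered pipeline can start with empty parameters:
from jinja2 import Template
# Toy stand-in for DETERMINISTIC_TEMPLATE; the real template renders a
# full SageMaker pipeline definition, not this two-field document.
toy_template = Template(
    '{"PipelineName": "{{ name }}", "OutputPath": "{{ s3_output_path }}"}'
)
# Every value is substituted at render time, so nothing is left to pass
# as a pipeline parameter at execution time.
print(toy_template.render(name="mmlu-eval-demo1", s3_output_path="s3://my-bucket/eval/"))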
What happens during execution:
- CreateEvaluationAction: Sets up lineage tracking
- EvaluateBaseModel & EvaluateCustomModel: Run in parallel as serverless training jobs
- AssociateLineage: Links evaluation results to lineage tracking
# Run evaluation with configured parameters
execution = evaluator.evaluate()
pprint(execution)
print(f"\nPipeline Execution ARN: {execution.arn}")
print(f"Initial Status: {execution.status.overall_status}")
Alternative: Override Subtasks at Runtime
For benchmarks with subtask support, you can override subtasks when calling evaluate():
# Override subtasks at evaluation time
# execution = evaluator.evaluate(subtask="abstract_algebra") # Single subtask
# execution = evaluator.evaluate(subtask=["abstract_algebra", "anatomy"]) # Multiple subtasks
Step 4: Monitor Execution
Check the job status and refresh as needed:
# Refresh status
execution.refresh()
# Display job status with step details
pprint(execution.status)
# Display individual step statuses
if execution.status.step_details:
    print("\nStep Details:")
    for step in execution.status.step_details:
        print(f"  {step.name}: {step.status}")
Step 5: Wait for Completion
Wait for the pipeline to complete. This provides rich progress updates in Jupyter notebooks:
# Wait for job completion with progress updates
# This will show a rich progress display in Jupyter
execution.wait(target_status="Succeeded", poll=5, timeout=3600)
print(f"\nFinal Status: {execution.status.overall_status}")
Step 6: View Results
Display the evaluation results in a formatted table:
Output Structure:
Evaluation results are stored in S3:
s3://your-bucket/output/
└── job_name/
└── output/
└── output.tar.gz
Extract output.tar.gz to reveal:
run_name/
├── eval_results/
│ ├── results_[timestamp].json
│ ├── inference_output.jsonl
│ └── details/
│ └── model/
│ └── <execution-date-time>/
│ └── details_<task_name>_#_<datetime>.parquet
└── tensorboard_results/
└── eval/
└── events.out.tfevents.[timestamp]
pprint(execution.s3_output_path)
# Display results in a formatted table
execution.show_results()
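If you prefer to inspect the raw files rather than rely on show_results(), a minimal sketch for downloading and unpacking the archive (the bucket, prefix, and job name are placeholders; substitute values from execution.s3_output_path and your job):
import tarfile
import boto3
# Hypothetical bucket/key matching the layout described above
bucket = "your-bucket"
key = "output/<job_name>/output/output.tar.gz"
boto3.client("s3").download_file(bucket, key, "output.tar.gz")
with tarfile.open("output.tar.gz") as tar:
    tar.extractall("eval_output")  # reveals run_name/eval_results/ and run_name/tensorboard_results/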
Step 7: Retrieve an Existing Job
You can retrieve and inspect any existing evaluation job:
from sagemaker.train.evaluate import EvaluationPipelineExecution
from rich.pretty import pprint
# Get an existing job by ARN
# Replace with your actual pipeline execution ARN
existing_arn = "arn:aws:sagemaker:us-west-2:<>:pipeline/SagemakerEvaluation-BenchmarkEvaluation-c344c91d-6f62-4907-85cc-7e6b29171c42/execution/inlsexrd7jes"
# base model only example
# existing_arn = "arn:aws:sagemaker:us-west-2:<>:pipeline/SagemakerEvaluation-benchmark/execution/gdp9f4dbv2vi"
existing_execution = EvaluationPipelineExecution.get(
    arn=existing_arn,
    region="us-west-2"
)
pprint(existing_execution)
print(f"\nStatus: {existing_execution.status.overall_status}")
existing_execution.show_results()
Step 8: List All Benchmark Evaluations
Retrieve all benchmark evaluation executions:
# Get all benchmark evaluations (returns iterator)
all_executions_iter = BenchMarkEvaluator.get_all(region="us-west-2")
all_executions = list(all_executions_iter)
print(f"Found {len(all_executions)} evaluation(s)\n")
for ex in all_executions[:5]:  # Show first 5 (avoid shadowing the built-in `exec`)
    print(f"  {ex.name}: {ex.status.overall_status}")
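The list can also be filtered with plain Python; for example, to keep only runs that finished successfully (assuming the same "Succeeded" status string used with wait() above):
# Keep only executions whose overall status reports success
succeeded = [
    e for e in all_executions
    if e.status.overall_status == "Succeeded"
]
print(f"{len(succeeded)} succeeded out of {len(all_executions)}")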
Step 9: Stop a Running Job (Optional)
You can stop a running evaluation if needed:
# Uncomment to stop the job
# existing_execution.stop()
# print(f"Execution stopped. Status: {existing_execution.status.overall_status}")
Understanding the Pipeline Structure
The rendered pipeline definition includes:
4 Steps:
1. CreateEvaluationAction (Lineage): Sets up tracking
2. EvaluateBaseModel (Training): Evaluates the base model
3. EvaluateCustomModel (Training): Evaluates the custom model
4. AssociateLineage (Lineage): Links results
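To match these against a live run, the step details from Step 4 can be checked by name; a small sketch reusing the execution object from above:
# Expected step names from the rendered pipeline definition
EXPECTED_STEPS = [
    "CreateEvaluationAction",
    "EvaluateBaseModel",
    "EvaluateCustomModel",
    "AssociateLineage",
]
execution.refresh()
reported = {step.name: step.status for step in execution.status.step_details or []}
for name in EXPECTED_STEPS:
    print(f"{name}: {reported.get(name, 'not reported yet')}")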
Key Features:
- Template-based: Uses Jinja2 for flexible pipeline generation
- Parallel execution: Base and custom models are evaluated simultaneously
- Serverless: No need to manage compute resources
- MLflow integration: Automatic experiment tracking
- Lineage tracking: Full traceability of evaluation artifacts
Typical Execution Time:
- Total: ~10-12 minutes
- Downloading phase: ~5-7 minutes (model)
- Training phase: ~3-5 minutes (running evaluation)
- Lineage steps: ~2-4 seconds each