SageMaker Benchmark Evaluation - Basic Usage#

This notebook demonstrates the basic user-facing flow for creating and managing benchmark evaluation jobs with BenchMarkEvaluator, using Jinja2 template-based pipeline generation.

Step 1: Discover Available Benchmarks#

Discover the available benchmarks and inspect the properties of a specific benchmark:

# Configure AWS credentials and region
#! ada credentials update --provider=isengard --account=<> --role=Admin --profile=default --once
#! aws configure set region us-west-2
from sagemaker.train.evaluate import get_benchmarks, get_benchmark_properties
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

# Get available benchmarks
Benchmark = get_benchmarks()
pprint(list(Benchmark))

# Print properties for a specific benchmark
pprint(get_benchmark_properties(benchmark=Benchmark.MMLU))

Step 2: Create a BenchMarkEvaluator#

Create a BenchMarkEvaluator instance with the desired benchmark. The evaluator uses Jinja2 templates to render a complete pipeline definition.

Required Parameters:

  • benchmark: Benchmark type from the Benchmark enum

  • model: Model package ARN, or SageMaker hub content model ID, for the model to evaluate

  • s3_output_path: S3 location for evaluation outputs

  • mlflow_resource_arn: MLflow tracking server ARN for experiment tracking

Optional Template Fields: These fields are used for template rendering. If not provided, defaults will be used:

  • model_package_group: Model package group ARN

  • source_model_package: Source model package ARN

  • dataset: S3 URI of evaluation dataset

  • model_artifact: ARN of model artifact for lineage tracking (auto-inferred from source_model_package)

from sagemaker.train.evaluate import BenchMarkEvaluator

# Create evaluator with MMLU benchmark
# These values match our successfully tested configuration
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    # subtask="abstract_algebra",  # or "all"
    model="arn:aws:sagemaker:us-east-1:729646638167:model-package/sdk-test-finetuned-models/2",
    s3_output_path="s3://sagemaker-us-east-1-729646638167/model-customization/eval/",
    model_package_group="arn:aws:sagemaker:us-east-1:729646638167:model-package-group/sdk-test-finetuned-models",  # Optional; inferred from model when it is a model package
    base_eval_name="mmlu-eval-demo1",
    # Note: sagemaker_session is optional and will be auto-created if not provided
    # Note: region is optional and will be auto deduced using environment variables - SAGEMAKER_REGION, AWS_REGION
)

pprint(evaluator)
# # [Optional] BASE MODEL EVAL

# from sagemaker.train.evaluate import BenchMarkEvaluator

# # Create evaluator with MMLU benchmark
# # These values match our successfully tested configuration
# evaluator = BenchMarkEvaluator(
#     benchmark=Benchmark.MMLU,
#     model="meta-textgeneration-llama-3-2-1b-instruct",
#     s3_output_path="s3://mufi-test-serverless-smtj/eval/",
#     mlflow_resource_arn="arn:aws:sagemaker:us-west-2:<>:mlflow-tracking-server/mmlu-eval-experiment",
#     # model_package_group="arn:aws:sagemaker:us-west-2:<>:model-package-group/example-name-aovqo", # Optional inferred from model if model package
#     base_eval_name="gen-qa-eval-demo",
#     # Note: sagemaker_session is optional and will be auto-created if not provided
#     # Note: region is optional and will be auto deduced using environment variables - SAGEMAKER_REGION, AWS_REGION
# )

# pprint(evaluator)
# # [Optional] Nova testing IAD Prod

# from sagemaker.train.evaluate import BenchMarkEvaluator

# # Create evaluator with MMLU benchmark
# # These values match our successfully tested configuration
# evaluator = BenchMarkEvaluator(
#     benchmark=Benchmark.MMLU,
#     # model="arn:aws:sagemaker:us-east-1:<>:model-package/bgrv-nova-micro-sft-lora/1",
#     model="arn:aws:sagemaker:us-east-1:<>:model-package/test-nova-finetuned-models/3",
#     s3_output_path="s3://mufi-test-serverless-iad/eval/",
#     mlflow_resource_arn="arn:aws:sagemaker:us-east-1:<>:mlflow-tracking-server/mlflow-prod-server",
#     model_package_group="arn:aws:sagemaker:us-east-1:<>:model-package-group/test-nova-finetuned-models", # Optional inferred from model if model package
#     base_eval_name="gen-qa-eval-demo",
#     region="us-east-1",
#     # Note: sagemaker_session is optional and will be auto-created if not provided
#     # Note: region is optional and will be auto deduced using environment variables - SAGEMAKER_REGION, AWS_REGION
# )

# pprint(evaluator)

Optionally update the hyperparameters#

pprint(evaluator.hyperparameters.to_dict())

# optionally update hyperparameters
# evaluator.hyperparameters.temperature = "0.1"

# optionally get more info on types, limits, defaults.
# evaluator.hyperparameters.get_info()

Step 3: Run Evaluation#

Start a benchmark evaluation job. The system will:

  1. Build template context with all required parameters

  2. Render the pipeline definition from DETERMINISTIC_TEMPLATE using Jinja2

  3. Create or update the pipeline with the rendered definition

  4. Start the pipeline execution with empty parameters (all values pre-substituted)

What happens during execution:

  • CreateEvaluationAction: Sets up lineage tracking

  • EvaluateBaseModel & EvaluateCustomModel: Run in parallel as serverless training jobs

  • AssociateLineage: Links evaluation results to lineage tracking

# Run evaluation with configured parameters
execution = evaluator.evaluate()
pprint(execution)

print(f"\nPipeline Execution ARN: {execution.arn}")
print(f"Initial Status: {execution.status.overall_status}")

Alternative: Override Subtasks at Runtime#

For benchmarks with subtask support, you can override subtasks when calling evaluate():

# Override subtasks at evaluation time
# execution = evaluator.evaluate(subtask="abstract_algebra")  # Single subtask
# execution = evaluator.evaluate(subtask=["abstract_algebra", "anatomy"])  # Multiple subtasks

Step 4: Monitor Execution#

Check the job status and refresh as needed:

# Refresh status
execution.refresh()

# Display job status with step details
pprint(execution.status)

# Display individual step statuses
if execution.status.step_details:
    print("\nStep Details:")
    for step in execution.status.step_details:
        print(f"  {step.name}: {step.status}")

Step 5: Wait for Completion#

Wait for the pipeline to complete. This provides rich progress updates in Jupyter notebooks:

# Wait for job completion with progress updates
# This will show a rich progress display in Jupyter
execution.wait(target_status="Succeeded", poll=5, timeout=3600)

print(f"\nFinal Status: {execution.status.overall_status}")
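execution.wait() already handles polling and display; for a plain script outside Jupyter, an equivalent loop can be sketched generically. Everything below is illustrative and not part of the SDK: wait_for_status and the get_status callable are hypothetical stand-ins for calling execution.refresh() and reading execution.status.overall_status.

```python
import time

def wait_for_status(get_status, target="Succeeded", poll=5, timeout=3600):
    """Poll get_status() until it returns the target or a terminal status.

    get_status is a zero-argument callable standing in for
    `execution.refresh(); return execution.status.overall_status`.
    """
    terminal = {"Succeeded", "Failed", "Stopped"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == target or status in terminal:
            return status
        time.sleep(poll)
    raise TimeoutError(f"Did not reach {target!r} within {timeout}s")

# Demo with a canned status sequence instead of a live execution
statuses = iter(["Executing", "Executing", "Succeeded"])
print(wait_for_status(lambda: next(statuses), poll=0))  # Succeeded
```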

Step 6: View Results#

Display the evaluation results in a formatted table:

Output Structure:

Evaluation results are stored in S3:

s3://your-bucket/output/
└── job_name/
    └── output/
        └── output.tar.gz

Extract output.tar.gz to reveal:

run_name/
├── eval_results/
│   ├── results_[timestamp].json
│   ├── inference_output.jsonl
│   └── details/
│       └── model/
│           └── <execution-date-time>/
│               └── details_<task_name>_#_<datetime>.parquet
└── tensorboard_results/
    └── eval/
        └── events.out.tfevents.[timestamp]

pprint(execution.s3_output_path)

# Display results in a formatted table
execution.show_results()
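Once output.tar.gz has been downloaded locally from the S3 location above (for example with `aws s3 cp`), a small helper can unpack it and locate the result files. The helper below is a local sketch, not part of the SDK; the only assumption is the file layout documented above.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_eval_output(archive_path, dest_dir):
    """Extract output.tar.gz and return relative paths of eval result JSON files."""
    dest = Path(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("eval_results/*.json"))

# Demo: build a tiny archive with the documented layout, then extract it
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp, "run_name", "eval_results")
    results.mkdir(parents=True)
    (results / "results_20240101.json").write_text("{}")
    archive = Path(tmp, "output.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(Path(tmp, "run_name"), arcname="run_name")
    print(extract_eval_output(archive, Path(tmp, "extracted")))
    # ['run_name/eval_results/results_20240101.json']
```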

Step 7: Retrieve an Existing Job#

You can retrieve and inspect any existing evaluation job:

from sagemaker.train.evaluate import EvaluationPipelineExecution
from rich.pretty import pprint


# Get an existing job by ARN
# Replace with your actual pipeline execution ARN
existing_arn = "arn:aws:sagemaker:us-west-2:<>:pipeline/SagemakerEvaluation-BenchmarkEvaluation-c344c91d-6f62-4907-85cc-7e6b29171c42/execution/inlsexrd7jes"

# base model only example
# existing_arn = "arn:aws:sagemaker:us-west-2:<>:pipeline/SagemakerEvaluation-benchmark/execution/gdp9f4dbv2vi"
existing_execution = EvaluationPipelineExecution.get(
    arn=existing_arn,
    region="us-west-2"
)

pprint(existing_execution)
print(f"\nStatus: {existing_execution.status.overall_status}")

existing_execution.show_results()

Step 8: List All Benchmark Evaluations#

Retrieve all benchmark evaluation executions:

# Get all benchmark evaluations (returns iterator)
all_executions_iter = BenchMarkEvaluator.get_all(region="us-west-2")
all_executions = list(all_executions_iter)

print(f"Found {len(all_executions)} evaluation(s)\n")
for ex in all_executions[:5]:  # Show first 5
    print(f"  {ex.name}: {ex.status.overall_status}")
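When many evaluations are listed, a quick tally by status can be handy. The snippet below is a plain-Python sketch: it only assumes each listed item exposes status.overall_status, as shown above, and it is demonstrated with stub objects rather than live executions.

```python
from collections import Counter
from types import SimpleNamespace

def summarize_statuses(executions):
    """Count executions by their overall pipeline status."""
    return Counter(ex.status.overall_status for ex in executions)

# Demo with stub objects mimicking EvaluationPipelineExecution
def stub(status):
    return SimpleNamespace(status=SimpleNamespace(overall_status=status))

demo = [stub("Succeeded"), stub("Succeeded"), stub("Failed"), stub("Executing")]
print(summarize_statuses(demo))  # Counter({'Succeeded': 2, 'Failed': 1, 'Executing': 1})
```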

Step 9: Stop a Running Job (Optional)#

You can stop a running evaluation if needed:

# Uncomment to stop the job
# existing_execution.stop()
# print(f"Execution stopped. Status: {existing_execution.status.overall_status}")

Understanding the Pipeline Structure#

The rendered pipeline definition includes:

4 Steps:

  1. CreateEvaluationAction (Lineage): Sets up tracking

  2. EvaluateBaseModel (Training): Evaluates base model

  3. EvaluateCustomModel (Training): Evaluates custom model

  4. AssociateLineage (Lineage): Links results

Key Features:

  • Template-based: Uses Jinja2 for flexible pipeline generation

  • Parallel execution: Base and custom models evaluated simultaneously

  • Serverless: No need to manage compute resources

  • MLflow integration: Automatic experiment tracking

  • Lineage tracking: Full traceability of evaluation artifacts

Typical Execution Time:

  • Total: ~10-12 minutes

  • Downloading phase: ~5-7 minutes (model)

  • Training phase: ~3-5 minutes (running evaluation)

  • Lineage steps: ~2-4 seconds each