SageMaker Custom Scorer Evaluation - Demo#
This notebook demonstrates how to use the CustomScorerEvaluator to evaluate models with custom evaluator functions.
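Conceptually, a custom scorer is a function that maps a model response and a reference answer to a numeric score. The exact contract is defined by your AI Registry evaluator (for example, a Lambda-backed evaluator); the function name and return shape below are illustrative assumptions, not the registry's actual interface.

```python
# Illustrative only: the real custom-evaluator contract is defined by your
# AI Registry evaluator (e.g., a Lambda handler). Names here are assumptions.
def exact_match_scorer(model_output: str, reference: str) -> dict:
    """Score 1.0 when the normalized model output matches the reference."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    score = 1.0 if normalize(model_output) == normalize(reference) else 0.0
    return {"score": score}

print(exact_match_scorer("Paris ", "paris"))  # {'score': 1.0}
```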
Setup#
Import necessary modules.
# Configure AWS credentials and region
#! ada credentials update --provider=isengard --account=<> --role=Admin --profile=default --once
#! aws configure set region us-west-2
from sagemaker.train.evaluate import CustomScorerEvaluator
from rich.pretty import pprint
# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)
Configure Evaluation Parameters#
Set up the parameters for your custom scorer evaluation.
# Evaluator ARN (custom evaluator from AI Registry)
# evaluator_arn = "arn:aws:sagemaker:us-west-2:<>:hub-content/AIRegistry/JsonDoc/00-goga-qa-evaluation/1.0.0"
# evaluator_arn = "arn:aws:sagemaker:us-west-2:<>:hub-content/AIRegistry/JsonDoc/nikmehta-reward-function/1.0.0"
# evaluator_arn = "arn:aws:sagemaker:us-west-2:<>:hub-content/AIRegistry/JsonDoc/eval-lambda-test/0.0.1"
evaluator_arn = "arn:aws:sagemaker:us-west-2:<>:hub-content/F3LMYANDKWPZCROJVCKMJ7TOML6QMZBZRRQOVTUL45VUK7PJ4SXA/JsonDoc/eval-lambda-test/0.0.1"
# Dataset - can be S3 URI or AIRegistry DataSet ARN
dataset = "s3://sagemaker-us-west-2-<>/studio-users/d20251107t195443/datasets/2025-11-07T19-55-37-609Z/zc_test.jsonl"
# Base model - can be:
# 1. Model package ARN: "arn:aws:sagemaker:region:account:model-package/name/version"
# 2. JumpStart model ID: "llama-3-2-1b-instruct" (evaluation with a base model only is not yet implemented and does not currently work)
base_model = "arn:aws:sagemaker:us-west-2:<>:model-package/test-finetuned-models-gamma/28"
# S3 location for outputs
s3_output_path = "s3://mufi-test-serverless-smtj/eval/"
# Optional: MLflow tracking server ARN
mlflow_resource_arn = "arn:aws:sagemaker:us-west-2:<>:mlflow-tracking-server/mmlu-eval-experiment"
print("Configuration:")
print(f" Evaluator: {evaluator_arn}")
print(f" Dataset: {dataset}")
print(f" Base Model: {base_model}")
print(f" Output Location: {s3_output_path}")
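The dataset can be a JSONL file in S3 or an AIRegistry DataSet ARN. The record schema depends on what your evaluator expects; the `prompt`/`reference` field names below are an illustrative assumption of what a question-answering scorer might consume, written to a local file for inspection.

```python
import json
import os
import tempfile

# Hypothetical record layout -- field names depend on your evaluator's contract.
records = [
    {"prompt": "What is 2 + 2?", "reference": "4"},
    {"prompt": "Capital of France?", "reference": "Paris"},
]

# Write the records as JSONL (one JSON object per line) to a temp directory.
path = os.path.join(tempfile.mkdtemp(), "eval_dataset.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

print(f"Wrote {len(records)} records to {path}")
```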
Create CustomScorerEvaluator Instance#
Instantiate the evaluator with your configuration. The evaluator can accept:
- Custom evaluator ARN (string): points to your custom evaluator in AI Registry
- Built-in metric (string or enum): preset metrics like “code_executions”, “math_answers”, etc.
- Evaluator object: a sagemaker.ai_registry.evaluator.Evaluator instance
# Create evaluator with custom evaluator ARN
evaluator = CustomScorerEvaluator(
    evaluator=evaluator_arn,  # Custom evaluator ARN
    dataset=dataset,
    model=base_model,
    s3_output_path=s3_output_path,
    mlflow_resource_arn=mlflow_resource_arn,
    # model_package_group="arn:aws:sagemaker:us-west-2:<>:model-package-group/Demo-test-deb-2",
    evaluate_base_model=False  # Set to True to also evaluate the base model
)
print("\n✓ CustomScorerEvaluator created successfully")
pprint(evaluator)
Optionally update the hyperparameters#
pprint(evaluator.hyperparameters.to_dict())
# optionally update hyperparameters
# evaluator.hyperparameters.temperature = "0.1"
# optionally get more info on types, limits, defaults.
# evaluator.hyperparameters.get_info()
Alternative: Using Built-in Metrics#
Instead of a custom evaluator ARN, you can use built-in metrics:
# Example with built-in metrics (commented out)
# from sagemaker.train.evaluate import get_builtin_metrics
#
# BuiltInMetric = get_builtin_metrics()
#
# evaluator_builtin = CustomScorerEvaluator(
#     evaluator=BuiltInMetric.PRIME_MATH,  # Or use string: "prime_math"
#     dataset=dataset,
#     model=base_model,
#     s3_output_path=s3_output_path
# )
Start Evaluation#
Call evaluate() to start the evaluation job. This will:
- Create or update the evaluation pipeline
- Start a pipeline execution
- Return an EvaluationPipelineExecution object for monitoring
# Start evaluation
execution = evaluator.evaluate()
print("\n✓ Evaluation execution started successfully!")
print(f" Execution Name: {execution.name}")
print(f" Pipeline Execution ARN: {execution.arn}")
print(f" Status: {execution.status.overall_status}")
Monitor Job Progress#
Use refresh() to update the job status, or wait() to block until completion.
# Check current status
execution.refresh()
print(f"Current Status: {execution.status.overall_status}")
pprint(execution.status)
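If you want finer control than wait() offers, refresh() can be wrapped in a manual polling loop. The sketch below uses a stand-in get_status callable so it runs anywhere; in the notebook you would call execution.refresh() and read execution.status.overall_status instead. The terminal-state names are an assumption based on typical pipeline statuses.

```python
import time

def poll_until_terminal(get_status, poll=5, timeout=60):
    """Poll a status callable until it reports a terminal state or times out."""
    terminal = {"Succeeded", "Failed", "Stopped"}  # assumed terminal states
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll)
    raise TimeoutError(f"Still '{status}' after {timeout}s")

# Stand-in status source: succeeds on the third poll.
statuses = iter(["Executing", "Executing", "Succeeded"])
print(poll_until_terminal(lambda: next(statuses), poll=0.01, timeout=5))  # Succeeded
```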
Wait for Completion#
Block execution until the job completes. This provides a rich visual experience in Jupyter notebooks.
# Wait for job to complete (with rich visual feedback)
execution.wait(poll=30, timeout=3600)
print(f"\nFinal Status: {execution.status.overall_status}")
# show results
execution.show_results()
Retrieve Existing Job#
You can retrieve a previously started evaluation job using its ARN.
from sagemaker.train.evaluate import EvaluationPipelineExecution
# Get existing job by ARN
existing_arn = execution.arn # Or use a specific ARN
existing_exec = EvaluationPipelineExecution.get(arn=existing_arn)
print(f"Retrieved job: {existing_exec.name}")
print(f"Status: {existing_exec.status.overall_status}")
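Pipeline execution ARNs follow the standard SageMaker ARN shape, so components like the region, account, and execution resource path can be recovered with simple string parsing. The ARN below is a made-up example for illustration.

```python
def parse_sagemaker_arn(arn: str) -> dict:
    """Split a SageMaker ARN into region, account, and resource parts."""
    parts = arn.split(":", 5)
    assert parts[0] == "arn" and parts[2] == "sagemaker", "not a SageMaker ARN"
    return {"region": parts[3], "account": parts[4], "resource": parts[5]}

# Made-up example ARN:
example = "arn:aws:sagemaker:us-west-2:123456789012:pipeline/my-eval/execution/abc123"
print(parse_sagemaker_arn(example)["resource"])  # pipeline/my-eval/execution/abc123
```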
List All Custom Scorer Evaluations#
Retrieve all custom scorer evaluation executions.
# Get all custom scorer evaluations
all_executions = list(CustomScorerEvaluator.get_all())
print(f"Found {len(all_executions)} custom scorer evaluation(s):\n")
for execution in all_executions:
    print(f"  - {execution.name} - {execution.arn}: {execution.status.overall_status}")
Stop a Running Job (Optional)#
You can stop a running evaluation if needed.
# Uncomment to stop the job
# execution.stop()
# print(f"Execution stopped. Status: {execution.status.overall_status}")
Summary#
This notebook demonstrated:
✅ Creating a CustomScorerEvaluator with a custom evaluator ARN
✅ Starting an evaluation job
✅ Monitoring job progress with refresh() and wait()
✅ Retrieving existing jobs
✅ Listing all custom scorer evaluations
Key Points:#
- The evaluator parameter accepts:
  - Custom evaluator ARN (for AI Registry evaluators)
  - Built-in metric names (“code_executions”, “math_answers”, “exact_match”)
  - Evaluator objects from sagemaker.ai_registry.evaluator.Evaluator
- Set evaluate_base_model=False to only evaluate the custom model
- Use execution.wait() for automatic monitoring with rich visual feedback
- Use execution.refresh() for manual status updates
- The SageMaker session is automatically inferred from your environment