SageMaker Custom Scorer Evaluation - Demo#

This notebook demonstrates how to use the CustomScorerEvaluator to evaluate models with custom evaluator functions.

Setup#

Import the required modules and configure logging so INFO-level messages are visible.

# Configure AWS credentials and region
#! ada credentials update --provider=isengard --account=<> --role=Admin --profile=default --once
#! aws configure set region us-west-2
from sagemaker.train.evaluate import CustomScorerEvaluator
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

Configure Evaluation Parameters#

Set up the parameters for your custom scorer evaluation.

# Evaluator ARN (custom evaluator from AI Registry)
evaluator_arn = "arn:aws:sagemaker:us-west-2:<>:hub-content/F3LMYANDKWPZCROJVCKMJ7TOML6QMZBZRRQOVTUL45VUK7PJ4SXA/JsonDoc/eval-lambda-test/0.0.1"

# Dataset - can be S3 URI or AIRegistry DataSet ARN
dataset = "s3://sagemaker-us-west-2-<>/studio-users/d20251107t195443/datasets/2025-11-07T19-55-37-609Z/zc_test.jsonl"

# Base model - can be:
# 1. Model package ARN: "arn:aws:sagemaker:region:account:model-package/name/version"
# 2. JumpStart model ID: "llama-3-2-1b-instruct" (note: evaluating with a base model only is not yet implemented and does not currently work)
base_model = "arn:aws:sagemaker:us-west-2:<>:model-package/test-finetuned-models-gamma/28"

# S3 location for outputs
s3_output_path = "s3://mufi-test-serverless-smtj/eval/"

# Optional: MLflow tracking server ARN
mlflow_resource_arn = "arn:aws:sagemaker:us-west-2:<>:mlflow-tracking-server/mmlu-eval-experiment"

print("Configuration:")
print(f"  Evaluator: {evaluator_arn}")
print(f"  Dataset: {dataset}")
print(f"  Base Model: {base_model}")
print(f"  Output Location: {s3_output_path}")
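The dataset above is a JSONL file, i.e. one JSON object per line. As a sketch of that format (the field names `prompt` and `completion` here are illustrative assumptions; the schema your evaluator actually expects may differ):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical evaluation records; adjust fields to your evaluator's schema.
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris"},
    {"prompt": "Compute 2 + 2.", "completion": "4"},
]

path = Path(tempfile.mkdtemp()) / "zc_test.jsonl"
with path.open("w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line

# Reading it back: each line parses independently.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(loaded == records)  # → True
```

A file in this shape can then be uploaded to S3 and referenced via its `s3://` URI, as in the `dataset` variable above.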

Create CustomScorerEvaluator Instance#

Instantiate the evaluator with your configuration. The evaluator can accept:

  • Custom Evaluator ARN (string): Points to your custom evaluator in AI Registry

  • Built-in Metric (string or enum): Use preset metrics like "code_executions", "math_answers", etc.

  • Evaluator Object: A sagemaker.ai_registry.evaluator.Evaluator instance

# Create evaluator with custom evaluator ARN
evaluator = CustomScorerEvaluator(
    evaluator=evaluator_arn,  # Custom evaluator ARN
    dataset=dataset,
    model=base_model,
    s3_output_path=s3_output_path,
    mlflow_resource_arn=mlflow_resource_arn,
    # model_package_group="arn:aws:sagemaker:us-west-2:<>:model-package-group/Demo-test-deb-2", 
    evaluate_base_model=False  # Set to True to also evaluate the base model
)

print("\n✓ CustomScorerEvaluator created successfully")
pprint(evaluator)

Optionally update the hyperparameters#

pprint(evaluator.hyperparameters.to_dict())

# optionally update hyperparameters
# evaluator.hyperparameters.temperature = "0.1"

# optionally get more info on types, limits, defaults.
# evaluator.hyperparameters.get_info()

Alternative: Using Built-in Metrics#

Instead of a custom evaluator ARN, you can use built-in metrics:

# Example with built-in metrics (commented out)
# from sagemaker.train.evaluate import get_builtin_metrics
# 
# BuiltInMetric = get_builtin_metrics()
# 
# evaluator_builtin = CustomScorerEvaluator(
#     evaluator=BuiltInMetric.PRIME_MATH,  # Or use string: "prime_math"
#     dataset=dataset,
#     model=base_model,
#     s3_output_path=s3_output_path
# )

Start Evaluation#

Call evaluate() to start the evaluation job. This will:

  1. Create or update the evaluation pipeline

  2. Start a pipeline execution

  3. Return an EvaluationPipelineExecution object for monitoring

# Start evaluation
execution = evaluator.evaluate()

print("\n✓ Evaluation execution started successfully!")
print(f"  Execution Name: {execution.name}")
print(f"  Pipeline Execution ARN: {execution.arn}")
print(f"  Status: {execution.status.overall_status}")

Monitor Job Progress#

Use refresh() to update the job status, or wait() to block until completion.

# Check current status
execution.refresh()
print(f"Current Status: {execution.status.overall_status}")

pprint(execution.status)

Wait for Completion#

Block until the job completes. In Jupyter notebooks, wait() renders rich visual progress feedback while it polls.

# Wait for job to complete (with rich visual feedback)
execution.wait(poll=30, timeout=3600)

print(f"\nFinal Status: {execution.status.overall_status}")
# show results
execution.show_results()

Retrieve Existing Job#

You can retrieve a previously started evaluation job using its ARN.

from sagemaker.train.evaluate import EvaluationPipelineExecution

# Get existing job by ARN
existing_arn = execution.arn  # Or use a specific ARN

existing_exec = EvaluationPipelineExecution.get(arn=existing_arn)

print(f"Retrieved job: {existing_exec.name}")
print(f"Status: {existing_exec.status.overall_status}")

List All Custom Scorer Evaluations#

Retrieve all custom scorer evaluation executions.

# Get all custom scorer evaluations
all_executions = list(CustomScorerEvaluator.get_all())

print(f"Found {len(all_executions)} custom scorer evaluation(s):\n")
# Use a distinct loop variable so the `execution` handle above is not shadowed
for eval_execution in all_executions:
    print(f"  - {eval_execution.name} - {eval_execution.arn}: {eval_execution.status.overall_status}")
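When the list grows long, a per-status tally can be more readable than a raw listing. A sketch using stand-in (name, status) tuples — in the notebook these values would come from each execution's `name` and `status.overall_status`:

```python
from collections import Counter

# Stand-in results for demonstration; replace with values read from
# the executions returned by CustomScorerEvaluator.get_all().
results = [
    ("eval-1", "Succeeded"),
    ("eval-2", "Failed"),
    ("eval-3", "Succeeded"),
]

counts = Counter(status for _, status in results)
for status, n in counts.most_common():
    print(f"{status}: {n}")
```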

Stop a Running Job (Optional)#

You can stop a running evaluation if needed.

# Uncomment to stop the job
# execution.stop()
# print(f"Execution stopped. Status: {execution.status.overall_status}")

Summary#

This notebook demonstrated:

  1. ✅ Creating a CustomScorerEvaluator with a custom evaluator ARN

  2. ✅ Starting an evaluation job

  3. ✅ Monitoring job progress with refresh() and wait()

  4. ✅ Retrieving existing jobs

  5. ✅ Listing all custom scorer evaluations

Key Points:#

  • The evaluator parameter accepts:

    • Custom evaluator ARN (for AI Registry evaluators)

    • Built-in metric names ("code_executions", "math_answers", "exact_match")

    • Evaluator objects from sagemaker.ai_registry.evaluator.Evaluator

  • Set evaluate_base_model=False to only evaluate the custom model

  • Use execution.wait() for automatic monitoring with rich visual feedback

  • Use execution.refresh() for manual status updates

  • The SageMaker session is automatically inferred from your environment