SageMaker Custom Scorer Evaluation - Demo#

This notebook demonstrates how to use the CustomScorerEvaluator to evaluate models with custom evaluator functions.

Setup#

Import the required modules and configure logging so INFO-level messages are visible.

# Configure AWS credentials and region
#! ada credentials update --provider=isengard --account=<> --role=Admin --profile=default --once
#! aws configure set region us-west-2
from sagemaker.train.evaluate import CustomScorerEvaluator
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

Configure Evaluation Parameters#

Set up the parameters for your custom scorer evaluation.

# Evaluator ARN (custom evaluator from AI Registry)
evaluator_arn = "arn:aws:sagemaker:us-west-2:<>:hub-content/F3LMYANDKWPZCROJVCKMJ7TOML6QMZBZRRQOVTUL45VUK7PJ4SXA/JsonDoc/eval-lambda-test/0.0.1"

# Dataset - can be S3 URI or AIRegistry DataSet ARN
dataset = "s3://sagemaker-us-west-2-<>/studio-users/d20251107t195443/datasets/2025-11-07T19-55-37-609Z/zc_test.jsonl"

# Base model - can be:
# 1. Model package ARN: "arn:aws:sagemaker:region:account:model-package/name/version"
# 2. JumpStart model ID: "llama-3-2-1b-instruct" (note: evaluating with a base model only is not yet implemented and does not currently work)
base_model = "arn:aws:sagemaker:us-west-2:<>:model-package/test-finetuned-models-gamma/28"

# S3 location for outputs
s3_output_path = "s3://mufi-test-serverless-smtj/eval/"

# Optional: MLflow tracking server ARN
mlflow_resource_arn = "arn:aws:sagemaker:us-west-2:<>:mlflow-tracking-server/mmlu-eval-experiment"

print("Configuration:")
print(f"  Evaluator: {evaluator_arn}")
print(f"  Dataset: {dataset}")
print(f"  Base Model: {base_model}")
print(f"  Output Location: {s3_output_path}")
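The dataset above is a JSONL file, i.e. one JSON object per line. As a sketch of that format (the field names `prompt` and `completion` here are illustrative assumptions; the schema your evaluator actually expects may differ):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical evaluation records; adjust fields to your evaluator's schema.
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris"},
    {"prompt": "Compute 2 + 2.", "completion": "4"},
]

path = Path(tempfile.mkdtemp()) / "zc_test.jsonl"
with path.open("w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line

# Reading it back: each line parses independently.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(loaded == records)  # → True
```

A file in this shape can then be uploaded to S3 and referenced via its `s3://` URI, as in the `dataset` variable above.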

Create CustomScorerEvaluator Instance#

Instantiate the evaluator with your configuration. The evaluator can accept:

  • Custom Evaluator ARN (string): Points to your custom evaluator in AI Registry

  • Built-in Metric (string or enum): Use preset metrics like "code_executions", "math_answers", etc.

  • Evaluator Object: A sagemaker.ai_registry.evaluator.Evaluator instance

# Create evaluator with custom evaluator ARN
evaluator = CustomScorerEvaluator(
    evaluator=evaluator_arn,  # Custom evaluator ARN
    dataset=dataset,
    model=base_model,
    s3_output_path=s3_output_path,
    mlflow_resource_arn=mlflow_resource_arn,
    # model_package_group="arn:aws:sagemaker:us-west-2:<>:model-package-group/Demo-test-deb-2", 
    evaluate_base_model=False  # Set to True to also evaluate the base model
)

print("\n✓ CustomScorerEvaluator created successfully")
pprint(evaluator)

Optionally update the hyperparameters#

pprint(evaluator.hyperparameters.to_dict())

# optionally update hyperparameters
# evaluator.hyperparameters.temperature = "0.1"

# optionally get more info on types, limits, defaults.
# evaluator.hyperparameters.get_info()

Alternative: Using Built-in Metrics#

Instead of a custom evaluator ARN, you can use built-in metrics:

# Example with built-in metrics (commented out)
# from sagemaker.train.evaluate import get_builtin_metrics
# 
# BuiltInMetric = get_builtin_metrics()
# 
# evaluator_builtin = CustomScorerEvaluator(
#     evaluator=BuiltInMetric.PRIME_MATH,  # Or use string: "prime_math"
#     dataset=dataset,
#     model=base_model,
#     s3_output_path=s3_output_path
# )

Start Evaluation#

Call evaluate() to start the evaluation job. This will:

  1. Create or update the evaluation pipeline

  2. Start a pipeline execution

  3. Return an EvaluationPipelineExecution object for monitoring

# Start evaluation
execution = evaluator.evaluate()

print("\n✓ Evaluation execution started successfully!")
print(f"  Execution Name: {execution.name}")
print(f"  Pipeline Execution ARN: {execution.arn}")
print(f"  Status: {execution.status.overall_status}")

Monitor Job Progress#

Use refresh() to update the job status, or wait() to block until completion.

# Check current status
execution.refresh()
print(f"Current Status: {execution.status.overall_status}")

pprint(execution.status)

Wait for Completion#

Block until the job completes. In Jupyter notebooks, wait() renders rich visual progress feedback while it polls.

# Wait for job to complete (with rich visual feedback)
execution.wait(poll=30, timeout=3600)

print(f"\nFinal Status: {execution.status.overall_status}")
# show results
execution.show_results()

Retrieve Existing Job#

You can retrieve a previously started evaluation job using its ARN.

from sagemaker.train.evaluate import EvaluationPipelineExecution

# Get existing job by ARN
existing_arn = execution.arn  # Or use a specific ARN

existing_exec = EvaluationPipelineExecution.get(arn=existing_arn)

print(f"Retrieved job: {existing_exec.name}")
print(f"Status: {existing_exec.status.overall_status}")

List All Custom Scorer Evaluations#

Retrieve all custom scorer evaluation executions.

# Get all custom scorer evaluations
all_executions = list(CustomScorerEvaluator.get_all())

print(f"Found {len(all_executions)} custom scorer evaluation(s):\n")
# Use a distinct loop variable so the `execution` handle above is not shadowed
for eval_execution in all_executions:
    print(f"  - {eval_execution.name} - {eval_execution.arn}: {eval_execution.status.overall_status}")
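When the list grows long, a per-status tally can be more readable than a raw listing. A sketch using stand-in (name, status) tuples — in the notebook these values would come from each execution's `name` and `status.overall_status`:

```python
from collections import Counter

# Stand-in results for demonstration; replace with values read from
# the executions returned by CustomScorerEvaluator.get_all().
results = [
    ("eval-1", "Succeeded"),
    ("eval-2", "Failed"),
    ("eval-3", "Succeeded"),
]

counts = Counter(status for _, status in results)
for status, n in counts.most_common():
    print(f"{status}: {n}")
```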

Stop a Running Job (Optional)#

You can stop a running evaluation if needed.

# Uncomment to stop the job
# execution.stop()
# print(f"Execution stopped. Status: {execution.status.overall_status}")

Summary#

This notebook demonstrated:

  1. ✅ Creating a CustomScorerEvaluator with a custom evaluator ARN

  2. ✅ Starting an evaluation job

  3. ✅ Monitoring job progress with refresh() and wait()

  4. ✅ Retrieving existing jobs

  5. ✅ Listing all custom scorer evaluations

Key Points:#

  • The evaluator parameter accepts:

    • Custom evaluator ARN (for AI Registry evaluators)

    • Built-in metric names ("code_executions", "math_answers", "exact_match")

    • Evaluator objects from sagemaker.ai_registry.evaluator.Evaluator

  • Set evaluate_base_model=False to only evaluate the custom model

  • Use execution.wait() for automatic monitoring with rich visual feedback

  • Use execution.refresh() for manual status updates

  • The SageMaker session is automatically inferred from your environment