SageMaker LLM-as-Judge Evaluation - Basic Usage#

This notebook demonstrates the basic user-facing flow for creating and managing LLM-as-Judge evaluation jobs using the LLMAsJudgeEvaluator.

# Configure AWS credentials and region
#! ada credentials update --provider=isengard --account=<> --role=Admin --profile=default --once
#! aws configure set region us-west-2

Configuration#

# Configuration
REGION = 'us-west-2'
S3_BUCKET = 's3://mufi-test-serverless-smtj/eval/'
# DATASET = 'arn:aws:sagemaker:us-west-2:<>:hub-content/AIRegistry/DataSet/gen-qa-test-content/1.0.1'  # Dataset ARN or S3 URI
DATASET = "s3://my-sagemaker-sherpa-dataset/dataset/gen-qa-formatted-dataset/gen_qa.jsonl"
MLFLOW_ARN = 'arn:aws:sagemaker:us-west-2:<>:mlflow-tracking-server/mmlu-eval-experiment'

Step 1: Import Required Libraries#

Import the LLMAsJudgeEvaluator class along with supporting utilities, and configure logging so progress messages are visible.

import json
from sagemaker.train.evaluate import LLMAsJudgeEvaluator
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

Step 2: Create LLMAsJudgeEvaluator#

Create an LLMAsJudgeEvaluator instance with the desired evaluator model, dataset, and metrics.

Key Parameters:#

  • model: Model Package (or Base Model) to be evaluated (required)

  • evaluator_model: Bedrock model ID to use as judge (required)

  • dataset: S3 URI or Dataset ARN (required)

  • builtin_metrics: List of built-in metrics (optional, no 'Builtin.' prefix needed)

  • custom_metrics: JSON string of custom metrics (optional)

  • evaluate_base_model: Whether to also evaluate the base model in addition to the custom model (optional, default=True)

  • mlflow_resource_arn: MLflow tracking server ARN (optional)

  • model_package_group: Model package group ARN (optional)

  • s3_output_path: S3 output location (required)

A. Using custom metrics (as JSON string)#

Custom metrics must be provided as a properly escaped JSON string. You can either:

  1. Create a Python dict and use json.dumps() to convert it

  2. Provide a pre-escaped JSON string directly

# Method 1: Create dict and convert to JSON string
custom_metric_dict = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}

# Convert to JSON string
custom_metrics_json = json.dumps([custom_metric_dict])  # Note: wrap in list
# Create evaluator with custom metrics
evaluator = LLMAsJudgeEvaluator(
    # base_model='arn:aws:sagemaker:us-west-2:<>:model-package/Demo-test-deb-2/1',  # Alternative: pass the base model ARN
    model="arn:aws:sagemaker:us-west-2:<>:model-package/test-finetuned-models-gamma/28",
    evaluator_model="anthropic.claude-3-5-haiku-20241022-v1:0",  # Required
    dataset=DATASET,  # Required: S3 URI or Dataset ARN
    builtin_metrics=["Completeness", "Faithfulness"],  # Optional: Can combine with custom metrics
    custom_metrics=custom_metrics_json,  # Optional: JSON string of custom metrics
    mlflow_resource_arn=MLFLOW_ARN,  # Optional
    # model_package_group=MODEL_PACKAGE_GROUP,  # Optional if BASE_MODEL is a Model Package ARN/Object
    s3_output_path=S3_BUCKET,  # Required
    evaluate_base_model=False
)

pprint(evaluator)
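Because custom_metrics is passed as a raw JSON string, a malformed string will only fail once the job is submitted. A quick local sanity check can catch shape errors earlier. The helper below is a minimal sketch with no SageMaker dependency; the key names it checks ("customMetricDefinition", "name", "instructions", "ratingScale") follow the examples in this notebook.

```python
import json

def validate_custom_metrics(metrics_json: str) -> list:
    """Parse a custom-metrics JSON string and check each entry's basic shape.

    Raises ValueError unless the string is a JSON list of objects that each
    contain a 'customMetricDefinition' with 'name', 'instructions', and
    'ratingScale' keys (the structure used in this notebook's examples).
    """
    metrics = json.loads(metrics_json)
    if not isinstance(metrics, list):
        raise ValueError("custom_metrics must be a JSON list")
    for entry in metrics:
        definition = entry.get("customMetricDefinition")
        if definition is None:
            raise ValueError("each entry needs a 'customMetricDefinition'")
        for key in ("name", "instructions", "ratingScale"):
            if key not in definition:
                raise ValueError(f"metric '{definition.get('name')}' is missing '{key}'")
    return metrics

# Example: validate a minimal metric definition
sample = json.dumps([{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": "Rate positivity. Prompt: {{prompt}} Response: {{prediction}}",
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}},
        ],
    }
}])
parsed = validate_custom_metrics(sample)
print(f"{len(parsed)} custom metric(s) parsed OK")  # 1 custom metric(s) parsed OK
```

The same check can be run on custom_metrics_json from the cell above before creating the evaluator.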

[Optional] Example with multiple custom metrics#

# # Create multiple custom metrics
# custom_metrics_list = [
#     {
#         "customMetricDefinition": {
#             "name": "GoodMetric",
#             "instructions": (
#                 "Assess if the response has positive sentiment. "
#                 "Prompt: {{prompt}}\nResponse: {{prediction}}"
#             ),
#             "ratingScale": [
#                 {"definition": "Good", "value": {"floatValue": 1}},
#                 {"definition": "Poor", "value": {"floatValue": 0}}
#             ]
#         }
#     },
#     {
#         "customMetricDefinition": {
#             "name": "BadMetric",
#             "instructions": (
#                 "Assess if the response has negative sentiment. "
#                 "Prompt: {{prompt}}\nResponse: {{prediction}}"
#             ),
#             "ratingScale": [
#                 {"definition": "Bad", "value": {"floatValue": 1}},
#                 {"definition": "Good", "value": {"floatValue": 0}}
#             ]
#         }
#     }
# ]

# # Convert list to JSON string
# custom_metrics_json = json.dumps(custom_metrics_list)

# # Create evaluator
# evaluator = LLMAsJudgeEvaluator(
#     model=BASE_MODEL,
#     evaluator_model="anthropic.claude-3-5-haiku-20241022-v1:0",
#     dataset=DATASET,
#     custom_metrics=custom_metrics_json,  # Multiple custom metrics
#     s3_output_path=S3_BUCKET,
# )

# print(f"✅ Created evaluator with {len(json.loads(custom_metrics_json))} custom metrics")
# pprint(evaluator)

[Optional] Skipping base model evaluation (evaluate custom model only)#

By default, LLM-as-Judge evaluates both the base model and the custom model. You can skip base model evaluation to save time and cost by setting evaluate_base_model=False.

# # Define custom metrics as a pre-escaped JSON string (Method 2)
# custom_metrics = "[{\"customMetricDefinition\":{\"name\":\"GoodMetric\",\"instructions\":\"You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\\n\\nConsider the following:\\n- Does the response have a positive, encouraging tone?\\n- Is the response helpful and constructive?\\n- Does it avoid negative language or criticism?\\n\\nRate on this scale:\\n- Good: Response has positive sentiment\\n- Poor: Response lacks positive sentiment\\n\\nHere is the actual task:\\nPrompt: {{prompt}}\\nResponse: {{prediction}}\",\"ratingScale\":[{\"definition\":\"Good\",\"value\":{\"floatValue\":1}},{\"definition\":\"Poor\",\"value\":{\"floatValue\":0}}]}},{\"customMetricDefinition\":{\"name\":\"BadMetric\",\"instructions\":\"You are an expert evaluator. Your task is to assess if the sentiment of the response is negative. Rate the response based on whether it conveys negative sentiment, unhelpfulness, or destructive tone.\\n\\nConsider the following:\\n- Does the response have a negative, discouraging tone?\\n- Is the response unhelpful or destructive?\\n- Does it use negative language or harsh criticism?\\n\\nRate on this scale:\\n- Bad: Response has negative sentiment\\n- Good: Response lacks negative sentiment\\n\\nHere is the actual task:\\nPrompt: {{prompt}}\\nResponse: {{prediction}}\",\"ratingScale\":[{\"definition\":\"Bad\",\"value\":{\"floatValue\":1}},{\"definition\":\"Good\",\"value\":{\"floatValue\":0}}]}}]"

# # Create evaluator that only evaluates the custom model
# evaluator = LLMAsJudgeEvaluator(
#     model=BASE_MODEL,
#     evaluator_model="anthropic.claude-3-5-haiku-20241022-v1:0",
#     dataset=DATASET,
#     builtin_metrics=["Completeness", "Faithfulness", "Helpfulness"],
#     custom_metrics=custom_metrics,
#     mlflow_resource_arn=MLFLOW_ARN,
#     model_package_group=MODEL_PACKAGE_GROUP,
#     model_artifact=MODEL_ARTIFACT,
#     s3_output_path=S3_BUCKET,
#     evaluate_base_model=False,  # KEY: Skip base model evaluation
# )

# print("✅ Created evaluator (custom model only)")
# pprint(evaluator)

Step 3: Run LLM-as-Judge Evaluation#

Start the evaluation job. The evaluator will:

  1. Generate inference responses from the base model (if evaluate_base_model=True)

  2. Generate inference responses from the custom model

  3. Use the judge model to evaluate responses with built-in and custom metrics

# Run evaluation
execution = evaluator.evaluate()

print("✅ Evaluation job started!")
print(f"Job ARN: {execution.arn}")
print(f"Job Name: {execution.name}")
print(f"Status: {execution.status.overall_status}")

pprint(execution)

Step 4: Check Job Status#

Refresh and display the current job status with step details.

# Refresh status
execution.refresh()

# Display job status using rich pprint
pprint(execution.status)

Step 5: Monitor Pipeline Execution#

Poll the pipeline status until it reaches a terminal state (Succeeded, Failed, or Stopped).

# Wait for job completion (optional)
# This will poll every 5 seconds for up to 1 hour
execution.wait(poll=5, timeout=3600)
# Display results
execution.show_results(limit=10, offset=0, show_explanations=False)

Retrieve an Existing Job#

You can retrieve and inspect any existing evaluation job using its ARN.

# Get an existing job by ARN
# Replace with your actual pipeline execution ARN
existing_arn = 'arn:aws:sagemaker:us-west-2:<>:pipeline/SagemakerEvaluation-llmasjudge/execution/4hr7446yft1d'  # or use a specific ARN

from sagemaker.train.evaluate import EvaluationPipelineExecution
from rich.pretty import pprint

existing_execution = EvaluationPipelineExecution.get(
    arn=existing_arn,
    region="us-west-2"
)
pprint(existing_execution.status)

existing_execution.show_results(limit=5, offset=0, show_explanations=False)

Get All LLM-as-Judge Evaluations#

Retrieve all LLM-as-Judge evaluation jobs.

from sagemaker.train.evaluate import LLMAsJudgeEvaluator

# Get all LLM-as-Judge evaluations as an iterator
all_executions = list(LLMAsJudgeEvaluator.get_all(region="us-west-2"))

print(f"Found {len(all_executions)} LLM-as-Judge evaluation jobs")
for execution in all_executions:
    print(f"  - {execution.name}: {execution.status.overall_status}")

Stop a Running Job (Optional)#

If needed, you can stop a running evaluation job.

# Uncomment to stop the job
# execution.stop()
# print(f"Execution stopped. Status: {execution.status.overall_status}")

Dataset Support#

The dataset parameter supports two formats:

1. S3 URI#

dataset="s3://my-bucket/path/to/dataset.jsonl"
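A .jsonl dataset holds one JSON object per line. As a rough illustration only, the snippet below builds two hypothetical gen-QA records; the field names ("prompt", "referenceResponse") are assumptions for illustration and are not a schema confirmed by this notebook.

```python
import json

# Hypothetical gen-QA records; field names are illustrative assumptions.
records = [
    {"prompt": "What is the capital of France?", "referenceResponse": "Paris"},
    {"prompt": "Who wrote 'Hamlet'?", "referenceResponse": "William Shakespeare"},
]

# JSONL format: one serialized JSON object per line
jsonl_text = "\n".join(json.dumps(r) for r in records)
print(jsonl_text)
```

Check your evaluation service's dataset documentation for the exact required fields.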

2. Dataset ARN (AI Registry)#

dataset="arn:aws:sagemaker:us-west-2:123456789012:hub-content/AIRegistry/DataSet/my-dataset/1.0.0"

The evaluator automatically detects which format is provided and uses the appropriate data source configuration.
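The detection described above can be approximated with a simple prefix check. The helper below is an illustrative sketch, not the library's actual implementation:

```python
def detect_dataset_source(dataset: str) -> str:
    """Classify a dataset reference as an ARN or an S3 URI (illustrative sketch)."""
    if dataset.startswith("arn:aws:sagemaker:"):
        return "dataset_arn"
    if dataset.startswith("s3://"):
        return "s3_uri"
    raise ValueError(f"Unrecognized dataset reference: {dataset}")

print(detect_dataset_source("s3://my-bucket/path/to/dataset.jsonl"))  # s3_uri
print(detect_dataset_source(
    "arn:aws:sagemaker:us-west-2:123456789012:hub-content/AIRegistry/DataSet/my-dataset/1.0.0"
))  # dataset_arn
```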