sagemaker.train.evaluate.llm_as_judge_evaluator#

LLM-as-Judge Evaluator for SageMaker Model Evaluation Module.

This module provides evaluation capabilities using foundation models as judges to evaluate LLM responses based on quality and responsible AI metrics.

Classes

LLMAsJudgeEvaluator(*[, region, role, ...])

LLM-as-judge evaluation job.

class sagemaker.train.evaluate.llm_as_judge_evaluator.LLMAsJudgeEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, evaluator_model: str, dataset: str | Any, builtin_metrics: List[str] | None = None, custom_metrics: str | None = None, evaluate_base_model: bool = False)[source]#

Bases: BaseEvaluator

LLM-as-judge evaluation job.

This evaluator uses foundation models to evaluate LLM responses based on various quality and responsible AI metrics.

This feature is powered by Amazon Bedrock Evaluations. Your use of this feature is subject to the pricing of Amazon Bedrock Evaluations, the Service Terms applicable to Amazon Bedrock, and the terms that apply to your usage of third-party models. Amazon Bedrock Evaluations may securely transmit data across AWS Regions within your geography for processing. For more information, see the Amazon Bedrock Evaluations documentation.

Documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html

evaluator_model#

AWS Bedrock foundation model identifier to use as the judge. Required. For supported models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html#evaluation-judge-supported

Type:

str

dataset#

Evaluation dataset. Required. Accepts:

  • S3 URI (str): e.g., 's3://bucket/path/dataset.jsonl'

  • Dataset ARN (str): e.g., 'arn:aws:sagemaker:…:hub-content/AIRegistry/DataSet/…'

  • DataSet object: a sagemaker.ai_registry.dataset.DataSet instance (ARN inferred automatically)

Type:

Union[str, Any]
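As a sketch of the three accepted forms, a hypothetical helper like `describe_dataset_arg` below distinguishes them; it is illustrative only and not part of the SageMaker SDK, which performs its own validation:

```python
def describe_dataset_arg(dataset):
    """Classify a `dataset` value into the three forms the evaluator accepts.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    if isinstance(dataset, str):
        if dataset.startswith("s3://"):
            return "s3_uri"        # e.g. 's3://bucket/path/dataset.jsonl'
        if dataset.startswith("arn:aws:sagemaker:"):
            return "dataset_arn"   # AIRegistry DataSet hub-content ARN
        raise ValueError(f"Unrecognized dataset string: {dataset!r}")
    # Any non-string value is assumed to be a DataSet object whose ARN
    # the evaluator infers automatically.
    return "dataset_object"
```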

builtin_metrics#

List of built-in evaluation metric names to compute. Optional. The 'Builtin.' prefix used in the Bedrock documentation is optional and is stripped automatically if present. Examples: ['Correctness', 'Faithfulness'] or ['Builtin.Correctness', 'Builtin.Faithfulness'].

Type:

Optional[List[str]]

custom_metrics#

JSON string containing array of custom metric definitions. Optional. For format details, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html

Type:

Optional[str]
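Because custom_metrics must be a JSON string rather than a Python list, one way to build it is to serialize a list of metric definitions with json.dumps. The 'Conciseness' metric below is a hypothetical example following the format in the linked documentation:

```python
import json

# Hypothetical custom metric definition; the evaluator expects
# custom_metrics to be a JSON *string* containing an array of these.
metric = {
    "customMetricDefinition": {
        "name": "Conciseness",
        "instructions": (
            "Rate how concise the response is. "
            "Prompt: {{prompt}}\nResponse: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Concise", "value": {"floatValue": 1.0}},
            {"definition": "Verbose", "value": {"floatValue": 0.0}},
        ],
    }
}
custom_metrics = json.dumps([metric])  # serialize the array to a string
```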

mlflow_resource_arn#

ARN of the MLflow tracking server for experiment tracking. Optional. If not provided, the system will attempt to resolve it using the default MLflow app experience (checks domain match, account default, or creates a new app). Inherited from BaseEvaluator.

Type:

Optional[str]

evaluate_base_model#

Whether to evaluate the base model in addition to the custom model. Defaults to False (only the custom model is evaluated); set to True to evaluate both models.

Type:

bool

region#

AWS region. Inherited from BaseEvaluator.

Type:

Optional[str]

sagemaker_session#

SageMaker session object. Inherited from BaseEvaluator.

Type:

Optional[Any]

model#

Model for evaluation. Inherited from BaseEvaluator.

Type:

Union[str, Any]

base_eval_name#

Base name for evaluation jobs. Inherited from BaseEvaluator.

Type:

Optional[str]

s3_output_path#

S3 location for evaluation outputs. Inherited from BaseEvaluator.

Type:

str

mlflow_experiment_name#

MLflow experiment name. Inherited from BaseEvaluator.

Type:

Optional[str]

mlflow_run_name#

MLflow run name. Inherited from BaseEvaluator.

Type:

Optional[str]

networking#

VPC configuration. Inherited from BaseEvaluator.

Type:

Optional[VpcConfig]

kms_key_id#

KMS key ID for encryption. Inherited from BaseEvaluator.

Type:

Optional[str]

model_package_group#

Model package group. Inherited from BaseEvaluator.

Type:

Optional[Union[str, ModelPackageGroup]]

Example

from sagemaker.train.evaluate import LLMAsJudgeEvaluator

# Example with built-in metrics (prefix optional)
# Both formats work - with or without 'Builtin.' prefix
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness", "Helpfulness"],  # Prefix optional
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server",
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()

# Example with custom metrics (custom_metrics must be a JSON string)
import json

custom_metrics = json.dumps([
    {
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": "Assess if the response has positive sentiment. Prompt: {{prompt}}\nResponse: {{prediction}}",
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1.0}},
                {"definition": "Poor", "value": {"floatValue": 0.0}}
            ]
        }
    }
])

evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-haiku-20240307-v1:0",
    dataset="s3://my-bucket/dataset.jsonl",
    custom_metrics=custom_metrics,
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()

# Example evaluating only custom model (skip base model)
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness"],  # Prefix optional
    evaluate_base_model=False,
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()
base_eval_name: str | None#
builtin_metrics: List[str] | None#
custom_metrics: str | None#
dataset: str | Any#
evaluate()[source]#

Create and start an LLM-as-judge evaluation job.

This method initiates a 2-phase evaluation job:

  1. Phase 1: Generate inference responses from base and custom models

  2. Phase 2: Use judge model to evaluate responses with built-in and custom metrics

Returns:

The created LLM-as-judge evaluation execution

Return type:

EvaluationPipelineExecution

Raises:

ValueError – If invalid model, dataset, or metric configurations are provided

Example

evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness", "Helpfulness"],  # Prefix optional
    s3_output_path="s3://my-bucket/output"
)
execution = evaluator.evaluate()
execution.wait()

evaluate_base_model: bool#
evaluator_model: str#
classmethod get_all(session: Any | None = None, region: str | None = None)[source]#

Get all LLM-as-judge evaluation executions.

Uses EvaluationPipelineExecution.get_all() to retrieve all LLM-as-judge evaluation executions as an iterator.

Parameters:
  • session (Optional[Any]) – Optional boto3 session. If not provided, will be inferred.

  • region (Optional[str]) – Optional AWS region. If not provided, will be inferred.

Yields:

EvaluationPipelineExecution – LLM-as-judge evaluation execution instances

Example

# Get all LLM-as-judge evaluations as iterator
evaluations = LLMAsJudgeEvaluator.get_all()
all_executions = list(evaluations)

# Or iterate directly
for execution in LLMAsJudgeEvaluator.get_all():
    print(f"{execution.name}: {execution.status.overall_status}")

# With specific session/region
evaluations = LLMAsJudgeEvaluator.get_all(session=my_session, region='us-west-2')
all_executions = list(evaluations)
kms_key_id: str | None#
mlflow_experiment_name: str | None#
mlflow_resource_arn: str | None#
mlflow_run_name: str | None#
model: str | BaseTrainer | ModelPackage#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

model_package_group: str | ModelPackageGroup | None#
networking: VpcConfig | None#
region: str | None#
role: str | None#
s3_output_path: str#
sagemaker_session: Any | None#