sagemaker.train.evaluate.llm_as_judge_evaluator#
LLM-as-Judge Evaluator for SageMaker Model Evaluation Module.
This module provides evaluation capabilities using foundation models as judges to evaluate LLM responses based on quality and responsible AI metrics.
Classes
LLMAsJudgeEvaluator
    LLM-as-judge evaluation job.
- class sagemaker.train.evaluate.llm_as_judge_evaluator.LLMAsJudgeEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, evaluator_model: str, dataset: str | Any, builtin_metrics: List[str] | None = None, custom_metrics: str | None = None, evaluate_base_model: bool = False)[source]#
Bases: BaseEvaluator

LLM-as-judge evaluation job.
This evaluator uses foundation models to evaluate LLM responses based on various quality and responsible AI metrics.
This feature is powered by Amazon Bedrock Evaluations. Your use of this feature is subject to the pricing of Amazon Bedrock Evaluations, the Service Terms applicable to Amazon Bedrock, and the terms that apply to your usage of third-party models. Amazon Bedrock Evaluations may securely transmit data across AWS Regions within your geography for processing. For more information, see the Amazon Bedrock Evaluations documentation.
Documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html
- evaluator_model#
AWS Bedrock foundation model identifier to use as the judge. Required. For supported models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html#evaluation-judge-supported
- Type:
str
- dataset#
Evaluation dataset. Required. Accepts:
- S3 URI (str): e.g., 's3://bucket/path/dataset.jsonl'
- Dataset ARN (str): e.g., 'arn:aws:sagemaker:…:hub-content/AIRegistry/DataSet/…'
- DataSet object: a sagemaker.ai_registry.dataset.DataSet instance (ARN inferred automatically)
- Type:
Union[str, Any]
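When passing an S3 URI, the dataset is a JSON Lines file with one record per line. A minimal sketch of building such a body is below; the field names ("prompt" plus an optional "referenceResponse") are assumed from the Bedrock evaluation prompt-dataset convention, so confirm them against the linked Bedrock documentation before use.

```python
import json

# Hypothetical records; field names assumed from the Bedrock
# prompt-dataset convention, not defined by this class.
records = [
    {"prompt": "What is the capital of France?", "referenceResponse": "Paris"},
    {"prompt": "Name a primary color.", "referenceResponse": "Red"},
]

# One JSON object per line, as expected for a .jsonl dataset file
jsonl_body = "\n".join(json.dumps(r) for r in records)
```

The resulting body would then be uploaded to S3 (for example with boto3's put_object) and referenced by its s3:// URI in the dataset argument.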
- builtin_metrics#
List of built-in evaluation metric names to compute. The 'Builtin.' prefix from the Bedrock documentation is optional and is automatically removed if present. Examples: ['Correctness', 'Faithfulness'] or ['Builtin.Correctness', 'Builtin.Faithfulness']. Optional.
- Type:
Optional[List[str]]
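The prefix handling described above can be sketched as a small normalizer. The helper name normalize_builtin_metrics is hypothetical; the evaluator is documented to perform equivalent stripping internally, so this is only an illustration of the accepted input forms.

```python
def normalize_builtin_metrics(metrics):
    # Strip the optional 'Builtin.' prefix so both documented
    # input forms map to the same bare metric name.
    prefix = "Builtin."
    return [m[len(prefix):] if m.startswith(prefix) else m for m in metrics]
```

Both ["Correctness"] and ["Builtin.Correctness"] therefore describe the same metric to the evaluator.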
- custom_metrics#
JSON string containing array of custom metric definitions. Optional. For format details, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
- Type:
Optional[str]
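Because custom_metrics is a JSON string rather than a Python list, definitions are typically built as Python objects and serialized. A minimal sketch follows; the metric name "Conciseness" is hypothetical, and the exact schema (customMetricDefinition, ratingScale, the {{prompt}}/{{prediction}} placeholders) should be checked against the linked Bedrock format documentation.

```python
import json

# Hypothetical custom metric definition following the documented
# Bedrock custom-metric schema.
definition = [
    {
        "customMetricDefinition": {
            "name": "Conciseness",
            "instructions": (
                "Rate how concise the response is. "
                "Prompt: {{prompt}}\nResponse: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Concise", "value": {"floatValue": 1.0}},
                {"definition": "Verbose", "value": {"floatValue": 0.0}},
            ],
        }
    }
]

# The custom_metrics parameter expects a JSON string, not a list
custom_metrics = json.dumps(definition)
```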
- mlflow_resource_arn#
ARN of the MLflow tracking server for experiment tracking. Optional. If not provided, the system will attempt to resolve it using the default MLflow app experience (checks domain match, account default, or creates a new app). Inherited from BaseEvaluator.
- Type:
Optional[str]
- evaluate_base_model#
Whether to evaluate the base model in addition to the custom model. When False, base model evaluation is skipped and only the custom model is evaluated. Defaults to False, per the constructor signature.
- Type:
bool
- region#
AWS region. Inherited from BaseEvaluator.
- Type:
Optional[str]
- sagemaker_session#
SageMaker session object. Inherited from BaseEvaluator.
- Type:
Optional[Any]
- model#
Model for evaluation. Inherited from BaseEvaluator.
- Type:
Union[str, Any]
- base_eval_name#
Base name for evaluation jobs. Inherited from BaseEvaluator.
- Type:
Optional[str]
- s3_output_path#
S3 location for evaluation outputs. Inherited from BaseEvaluator.
- Type:
str
- mlflow_experiment_name#
MLflow experiment name. Inherited from BaseEvaluator.
- Type:
Optional[str]
- mlflow_run_name#
MLflow run name. Inherited from BaseEvaluator.
- Type:
Optional[str]
- kms_key_id#
KMS key ID for encryption. Inherited from BaseEvaluator.
- Type:
Optional[str]
- model_package_group#
Model package group. Inherited from BaseEvaluator.
- Type:
Optional[Union[str, ModelPackageGroup]]
Example
import json
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

# Example with built-in metrics
# Both formats work - with or without the 'Builtin.' prefix
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness", "Helpfulness"],  # Prefix optional
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server",
    s3_output_path="s3://my-bucket/output",
)
execution = evaluator.evaluate()

# Example with custom metrics (passed as a JSON string)
custom_metrics = json.dumps([
    {
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": "Assess if the response has positive sentiment. Prompt: {{prompt}}\nResponse: {{prediction}}",
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1.0}},
                {"definition": "Poor", "value": {"floatValue": 0.0}},
            ],
        }
    }
])
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-haiku-20240307-v1:0",
    dataset="s3://my-bucket/dataset.jsonl",
    custom_metrics=custom_metrics,
    s3_output_path="s3://my-bucket/output",
)
execution = evaluator.evaluate()

# Example evaluating only the custom model (skip the base model)
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness"],  # Prefix optional
    evaluate_base_model=False,
    s3_output_path="s3://my-bucket/output",
)
execution = evaluator.evaluate()
- base_eval_name: str | None#
- builtin_metrics: List[str] | None#
- custom_metrics: str | None#
- dataset: str | Any#
- evaluate()[source]#
Create and start an LLM-as-judge evaluation job.
This method initiates a 2-phase evaluation job:
Phase 1: Generate inference responses from base and custom models
Phase 2: Use judge model to evaluate responses with built-in and custom metrics
- Returns:
The created LLM-as-judge evaluation execution
- Return type:
EvaluationPipelineExecution
- Raises:
ValueError – If invalid model, dataset, or metric configurations are provided
Example
evaluator = LLMAsJudgeEvaluator(
    model="llama-3-3-70b-instruct",
    evaluator_model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    dataset="s3://my-bucket/my-dataset.jsonl",
    builtin_metrics=["Correctness", "Helpfulness"],  # Prefix optional
    s3_output_path="s3://my-bucket/output",
)
execution = evaluator.evaluate()
execution.wait()
- evaluate_base_model: bool#
- evaluator_model: str#
- classmethod get_all(session: Any | None = None, region: str | None = None)[source]#
Get all LLM-as-judge evaluation executions.
Uses EvaluationPipelineExecution.get_all() to retrieve all LLM-as-judge evaluation executions as an iterator.
- Parameters:
session (Optional[Any]) – Optional boto3 session. If not provided, will be inferred.
region (Optional[str]) – Optional AWS region. If not provided, will be inferred.
- Yields:
EvaluationPipelineExecution – LLM-as-judge evaluation execution instances
Example
# Get all LLM-as-judge evaluations as an iterator
evaluations = LLMAsJudgeEvaluator.get_all()
all_executions = list(evaluations)

# Or iterate directly
for execution in LLMAsJudgeEvaluator.get_all():
    print(f"{execution.name}: {execution.status.overall_status}")

# With a specific session/region
evaluations = LLMAsJudgeEvaluator.get_all(session=my_session, region='us-west-2')
all_executions = list(evaluations)
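Since get_all() yields plain execution objects, they can be post-filtered with ordinary Python. A minimal sketch follows; completed_executions is a hypothetical helper, and the status.overall_status attribute shape is assumed from the example above (stand-in objects are used here instead of real AWS calls).

```python
from types import SimpleNamespace

def completed_executions(executions):
    # Keep only executions whose overall_status is "Completed";
    # the attribute shape is assumed from the get_all() example.
    return [e for e in executions if e.status.overall_status == "Completed"]

# Stand-in objects for illustration; real ones come from get_all()
fake_executions = [
    SimpleNamespace(name="eval-1", status=SimpleNamespace(overall_status="Completed")),
    SimpleNamespace(name="eval-2", status=SimpleNamespace(overall_status="InProgress")),
]
done = completed_executions(fake_executions)
```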
- kms_key_id: str | None#
- mlflow_experiment_name: str | None#
- mlflow_resource_arn: str | None#
- mlflow_run_name: str | None#
- model: str | BaseTrainer | ModelPackage#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model; a dictionary conforming to pydantic.config.ConfigDict.
- model_package_group: str | ModelPackageGroup | None#
- region: str | None#
- role: str | None#
- s3_output_path: str#
- sagemaker_session: Any | None#