sagemaker.core.clarify#
This module configures the SageMaker Clarify bias and model explainability processor jobs.
SageMaker Clarify#
Classes
- AsymmetricShapleyValueConfig – Config class for Asymmetric Shapley value algorithm for time series explainability.
- BiasConfig – Config object with user-defined bias configurations of the input dataset.
- DataConfig – Config object related to configurations of the input and output dataset.
- DatasetType – Enum to store different dataset types supported in the analysis config file.
- ExplainabilityConfig – Abstract config class to configure an explainability method.
- ImageConfig – Config object for handling images.
- ModelConfig – Config object related to a model and its endpoint to be created.
- ModelPredictedLabelConfig – Config object to extract a predicted label from the model output.
- PDPConfig – Config class for Partial Dependence Plots (PDP).
- ProcessingOutputHandler – Class to handle the parameters for the processing job's ProcessingOutput.
- SHAPConfig – Config class for SHAP.
- SageMakerClarifyProcessor – Handles SageMaker Processing tasks to compute bias metrics and model explanations.
- SegmentationConfig – Config object that defines segment(s) of the dataset on which metrics are computed.
- TextConfig – Config object to handle text features for text explainability.
- TimeSeriesDataConfig – Config object for TimeSeries explainability data configuration fields.
- TimeSeriesJSONDatasetFormat – Possible dataset formats for JSON time series data files.
- TimeSeriesModelConfig – Config object for TimeSeries predictor configuration fields.
- class sagemaker.core.clarify.AsymmetricShapleyValueConfig(direction: Literal['chronological', 'anti_chronological', 'bidirectional'] = 'chronological', granularity: Literal['timewise', 'fine_grained'] = 'timewise', num_samples: int | None = None, baseline: str | Dict[str, Any] | None = None)[source]#
Bases: ExplainabilityConfig
Config class for Asymmetric Shapley value algorithm for time series explainability.
Asymmetric Shapley Values are a variant of the Shapley Value that drop the symmetry axiom [1]. We use these to determine how features contribute to the forecasting outcome. Asymmetric Shapley values can take into account the temporal dependencies of the time series that forecasting models take as input.
[1] Frye, Christopher, Colin Rowat, and Ilya Feige. “Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability.” NeurIPS (2020). https://doi.org/10.48550/arXiv.1910.06358
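A minimal construction sketch (the SageMaker SDK and an AWS environment are assumed; only documented defaults are used):

```python
from sagemaker.core.clarify import AsymmetricShapleyValueConfig

# Attribute each forecast to its inputs in time order, producing one
# attribution score per timestep of the input window.
asym_shap_config = AsymmetricShapleyValueConfig(
    direction="chronological",
    granularity="timewise",
)
```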
- class sagemaker.core.clarify.BiasConfig(label_values_or_threshold: List[int | float | str], facet_name: str | int | List[str] | List[int], facet_values_or_threshold: int | float | str | None = None, group_name: str | None = None)[source]#
Bases: object
Config object with user-defined bias configurations of the input dataset.
- class sagemaker.core.clarify.DataConfig(s3_data_input_path: str, s3_output_path: str, s3_analysis_config_output_path: str | None = None, label: str | None = None, headers: List[str] | None = None, features: str | None = None, dataset_type: str = 'text/csv', s3_compression_type: str = 'None', joinsource: str | int | None = None, facet_dataset_uri: str | None = None, facet_headers: List[str] | None = None, predicted_label_dataset_uri: str | None = None, predicted_label_headers: List[str] | None = None, predicted_label: str | int | None = None, excluded_columns: List[int] | List[str] | None = None, segmentation_config: List[SegmentationConfig] | None = None, time_series_data_config: TimeSeriesDataConfig | None = None)[source]#
Bases: object
Config object related to configurations of the input and output dataset.
- class sagemaker.core.clarify.DatasetType(value)[source]#
Bases: Enum
Enum to store different dataset types supported in the analysis config file.
- IMAGE = 'application/x-image'#
- JSON = 'application/json'#
- JSONLINES = 'application/jsonlines'#
- PARQUET = 'application/x-parquet'#
- TEXTCSV = 'text/csv'#
- class sagemaker.core.clarify.ExplainabilityConfig[source]#
Bases: ABC
Abstract config class to configure an explainability method.
- class sagemaker.core.clarify.ImageConfig(model_type: str, num_segments: int | None = None, feature_extraction_method: str | None = None, segment_compactness: float | None = None, max_objects: int | None = None, iou_threshold: float | None = None, context: float | None = None)[source]#
Bases: object
Config object for handling images.
- class sagemaker.core.clarify.ModelConfig(model_name: str | None = None, instance_count: int | None = None, instance_type: str | None = None, accept_type: str | None = None, content_type: str | None = None, content_template: str | None = None, record_template: str | None = None, custom_attributes: str | None = None, accelerator_type: str | None = None, endpoint_name_prefix: str | None = None, target_model: str | None = None, endpoint_name: str | None = None, time_series_model_config: TimeSeriesModelConfig | None = None)[source]#
Bases: object
Config object related to a model and its endpoint to be created.
- class sagemaker.core.clarify.ModelPredictedLabelConfig(label: str | int | None = None, probability: str | int | None = None, probability_threshold: float | None = None, label_headers: List[str] | None = None)[source]#
Bases: object
Config object to extract a predicted label from the model output.
- class sagemaker.core.clarify.PDPConfig(features: List | None = None, grid_resolution: int = 15, top_k_features: int = 10)[source]#
Bases: ExplainabilityConfig
Config class for Partial Dependence Plots (PDP).
PDPs show the marginal effect (the dependence) a subset of features has on the predicted outcome of an ML model.
When PDP is requested (by passing in a PDPConfig to the explainability_config parameter of SageMakerClarifyProcessor), the Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
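A construction sketch (the feature names are hypothetical):

```python
from sagemaker.core.clarify import PDPConfig

# Compute partial dependence for two named features on a 20-point grid.
pdp_config = PDPConfig(features=["Age", "Income"], grid_resolution=20)
```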
- class sagemaker.core.clarify.ProcessingOutputHandler[source]#
Bases: object
Class to handle the parameters for the processing job's ProcessingOutput.
- class sagemaker.core.clarify.SHAPConfig(baseline: str | List | Dict | None = None, num_samples: int | None = None, agg_method: str | None = None, use_logit: bool = False, save_local_shap_values: bool = True, seed: int | None = None, num_clusters: int | None = None, text_config: TextConfig | None = None, image_config: ImageConfig | None = None, features_to_explain: List[int | str] | None = None)[source]#
Bases: ExplainabilityConfig
Config class for SHAP.
The SHAP algorithm calculates feature attributions by computing the contribution of each feature to the prediction outcome, using the concept of Shapley values.
These attributions can be provided for specific predictions (locally) and at a global level for the model as a whole.
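A minimal sketch, assuming the baseline is left unset so that Clarify derives one from the input data:

```python
from sagemaker.core.clarify import SHAPConfig

# Draw 100 coalition samples per example and aggregate per-example
# attributions into global importance by mean absolute value.
shap_config = SHAPConfig(
    num_samples=100,
    agg_method="mean_abs",
    save_local_shap_values=True,
)
```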
- class sagemaker.core.clarify.SageMakerClarifyProcessor(role: str | None = None, instance_count: int | None = None, instance_type: str | None = None, volume_size_in_gb: int = 30, volume_kms_key: str | None = None, output_kms_key: str | None = None, max_runtime_in_seconds: int | None = None, sagemaker_session: Session | None = None, env: Dict[str, str] | None = None, tags: List[Dict[str, str | PipelineVariable]] | Dict[str, str | PipelineVariable] | None = None, network_config: NetworkConfig | None = None, job_name_prefix: str | None = None, version: str | None = None, skip_early_validation: bool = False)[source]#
Bases: Processor
Handles SageMaker Processing tasks to compute bias metrics and model explanations.
- run_bias(data_config: DataConfig, bias_config: BiasConfig, model_config: ModelConfig | None = None, model_predicted_label_config: ModelPredictedLabelConfig | None = None, pre_training_methods: str | List[str] = 'all', post_training_methods: str | List[str] = 'all', wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a ProcessingJob to compute the requested bias methods.
Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in s3_data_input_path (from the DataConfig) to obtain predicted labels.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
pre_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]. Defaults to "all" to run all metrics if left unspecified.
post_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Bias" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_bias_and_explainability(data_config: DataConfig, model_config: ModelConfig, explainability_config: ExplainabilityConfig | List[ExplainabilityConfig], bias_config: BiasConfig, pre_training_methods: str | List[str] = 'all', post_training_methods: str | List[str] = 'all', model_predicted_label_config: ModelPredictedLabelConfig | None = None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)[source]#
Runs a ProcessingJob to compute bias metrics and feature attributions.
For bias: Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in s3_data_input_path (from the DataConfig) to obtain predicted labels.
For explainability: Spins up a model endpoint.
Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.
When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model's prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.
When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two methods supported. You can request multiple methods at once by passing in a list of ExplainabilityConfig.
bias_config (BiasConfig) – Config of sensitive groups.
pre_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]. Defaults to "all" to run all metrics if left unspecified.
post_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
model_predicted_label_config (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Explainability" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_explainability(data_config: DataConfig, model_config: ModelConfig, explainability_config: ExplainabilityConfig | List, model_scores: int | str | ModelPredictedLabelConfig | None = None, wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a ProcessingJob computing feature attributions.
Spins up a model endpoint.
Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.
When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model's prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.
When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two methods supported. You can request multiple methods at once by passing in a list of ExplainabilityConfig.
model_scores (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Explainability" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_post_training_bias(data_config: DataConfig, data_bias_config: BiasConfig, model_config: ModelConfig | None = None, model_predicted_label_config: ModelPredictedLabelConfig | None = None, methods: str | List[str] = 'all', wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a ProcessingJob to compute post-training bias methods.
Spins up a model endpoint and runs inference over the input dataset in the s3_data_input_path (from the DataConfig) to obtain predicted labels. Using the model predictions, computes the requested post-training bias methods that compare metrics (e.g. accuracy, precision, recall) for the sensitive group(s) versus the other examples.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
data_bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Posttraining-Bias" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_pre_training_bias(data_config: DataConfig, data_bias_config: BiasConfig, methods: str | List[str] = 'all', wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a
ProcessingJobto compute pre-training bias methodsComputes the requested
methodson the input data. Themethodscompare metrics (e.g. fraction of examples) for the sensitive group(s) vs. the other examples.- Parameters:
data_config (
DataConfig) – Config of the input/output data.data_bias_config (
BiasConfig) – Config of sensitive groups.methods (str or list[str]) –
Selects a subset of potential metrics: [”CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to str “all” to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when
waitis True (default: True).job_name (str) – Processing job name. When
job_nameis not specified, ifjob_name_prefixinSageMakerClarifyProcessoris specified, the job name will be thejob_name_prefixand current timestamp; otherwise use"Clarify-Pretraining-Bias"as prefix.kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) –
Experiment management configuration. Optionally, the dict can contain three keys:
'ExperimentName','TrialName', and'TrialComponentDisplayName'.The behavior of setting these keys is as follows:
If
'ExperimentName'is supplied but'TrialName'is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.If
'TrialName'is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.If both
'ExperimentName'and'TrialName'are not supplied, the Trial Component will be unassociated.'TrialComponentDisplayName'is used for display in Amazon SageMaker Studio.
- class sagemaker.core.clarify.SegmentationConfig(name_or_index: str | int, segments: List[List[int | str]], config_name: str | None = None, display_aliases: List[str] | None = None)[source]#
Bases: object
Config object that defines segment(s) of the dataset on which metrics are computed.
- class sagemaker.core.clarify.TextConfig(granularity: str, language: str)[source]#
Bases: object
Config object to handle text features for text explainability.
SHAP analysis breaks down longer text into chunks (e.g. tokens, sentences, or paragraphs) and replaces them with the strings specified in the baseline for that feature. The SHAP value of a chunk then captures how much replacing it affects the prediction.
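A sketch of pairing TextConfig with SHAPConfig (the "<UNK>" baseline string is an illustrative assumption for a single-text-feature dataset):

```python
from sagemaker.core.clarify import SHAPConfig, TextConfig

# Explain English text sentence by sentence: each sentence is replaced with
# the baseline string when its contribution to the prediction is measured.
text_config = TextConfig(granularity="sentence", language="english")

shap_config = SHAPConfig(
    baseline=[["<UNK>"]],  # replacement for masked text chunks (assumption)
    num_samples=100,
    text_config=text_config,
)
```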
- class sagemaker.core.clarify.TimeSeriesDataConfig(target_time_series: str | int, item_id: str | int, timestamp: str | int, related_time_series: List[int | str] | None = None, static_covariates: List[int | str] | None = None, dataset_format: TimeSeriesJSONDatasetFormat | None = None)[source]#
Bases: object
Config object for TimeSeries explainability data configuration fields.
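A construction sketch for a columns-oriented JSON dataset (the JMESPath key names are assumptions matching the COLUMNS example shown under TimeSeriesJSONDatasetFormat):

```python
from sagemaker.core.clarify import (
    TimeSeriesDataConfig,
    TimeSeriesJSONDatasetFormat,
)

# JMESPaths locating each field in a columns-oriented JSON file.
ts_data_config = TimeSeriesDataConfig(
    target_time_series="target_ts",
    item_id="ids",
    timestamp="timestamps",
    related_time_series=["rts1", "rts2"],
    static_covariates=["scv1", "scv2"],
    dataset_format=TimeSeriesJSONDatasetFormat.COLUMNS,
)
```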
- class sagemaker.core.clarify.TimeSeriesJSONDatasetFormat(value)[source]#
Bases: Enum
Possible dataset formats for JSON time series data files.
Below is an example COLUMNS dataset for time series explainability:
{
    "ids": [1, 2],
    "timestamps": [3, 4],
    "target_ts": [5, 6],
    "rts1": [0.25, 0.5],
    "rts2": [1.25, 1.5],
    "scv1": [10, 20],
    "scv2": [30, 40]
}
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="ids"
timestamp="timestamps"
target_time_series="target_ts"
related_time_series=["rts1", "rts2"]
static_covariates=["scv1", "scv2"]
Below is an example ITEM_RECORDS dataset for time series explainability:
[
    {
        "id": 1,
        "scv1": 10,
        "scv2": "red",
        "timeseries": [
            {"timestamp": 1, "target_ts": 5, "rts1": 0.25, "rts2": 10},
            {"timestamp": 2, "target_ts": 6, "rts1": 0.35, "rts2": 20},
            {"timestamp": 3, "target_ts": 4, "rts1": 0.45, "rts2": 30}
        ]
    },
    {
        "id": 2,
        "scv1": 20,
        "scv2": "blue",
        "timeseries": [
            {"timestamp": 1, "target_ts": 4, "rts1": 0.25, "rts2": 40},
            {"timestamp": 2, "target_ts": 2, "rts1": 0.35, "rts2": 50}
        ]
    }
]
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="[*].id"
timestamp="[*].timeseries[].timestamp"
target_time_series="[*].timeseries[].target_ts"
related_time_series=["[*].timeseries[].rts1", "[*].timeseries[].rts2"]
static_covariates=["[*].scv1", "[*].scv2"]
Below is an example TIMESTAMP_RECORDS dataset for time series explainability:
[
    {"id": 1, "timestamp": 1, "target_ts": 5, "scv1": 10, "rts1": 0.25},
    {"id": 1, "timestamp": 2, "target_ts": 6, "scv1": 10, "rts1": 0.5},
    {"id": 1, "timestamp": 3, "target_ts": 3, "scv1": 10, "rts1": 0.75},
    {"id": 2, "timestamp": 5, "target_ts": 10, "scv1": 20, "rts1": 1}
]
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="[*].id"
timestamp="[*].timestamp"
target_time_series="[*].target_ts"
related_time_series=["[*].rts1"]
static_covariates=["[*].scv1"]
- COLUMNS = 'columns'#
- ITEM_RECORDS = 'item_records'#
- TIMESTAMP_RECORDS = 'timestamp_records'#