sagemaker.core.clarify#
This module configures the SageMaker Clarify bias and model explainability processor jobs.
SageMaker Clarify#
Classes
- AsymmetricShapleyValueConfig – Config class for Asymmetric Shapley value algorithm for time series explainability.
- BiasConfig – Config object with user-defined bias configurations of the input dataset.
- DataConfig – Config object related to configurations of the input and output dataset.
- DatasetType – Enum to store different dataset types supported in the analysis config file.
- ExplainabilityConfig – Abstract config class to configure an explainability method.
- ImageConfig – Config object for handling images.
- ModelConfig – Config object related to a model and its endpoint to be created.
- ModelPredictedLabelConfig – Config object to extract a predicted label from the model output.
- PDPConfig – Config class for Partial Dependence Plots (PDP).
- ProcessingOutputHandler – Class to handle the parameters for the processing job's ProcessingOutput.
- SHAPConfig – Config class for SHAP.
- SageMakerClarifyProcessor – Handles SageMaker Processing tasks to compute bias metrics and model explanations.
- SegmentationConfig – Config object that defines segment(s) of the dataset on which metrics are computed.
- TextConfig – Config object to handle text features for text explainability.
- TimeSeriesDataConfig – Config object for TimeSeries explainability data configuration fields.
- TimeSeriesJSONDatasetFormat – Possible dataset formats for JSON time series data files.
- TimeSeriesModelConfig – Config object for TimeSeries predictor configuration fields.
- class sagemaker.core.clarify.AsymmetricShapleyValueConfig(direction: Literal['chronological', 'anti_chronological', 'bidirectional'] = 'chronological', granularity: Literal['timewise', 'fine_grained'] = 'timewise', num_samples: int | None = None, baseline: str | Dict[str, Any] | None = None)[source]#
Bases: ExplainabilityConfig
Config class for Asymmetric Shapley value algorithm for time series explainability.
Asymmetric Shapley Values are a variant of the Shapley Value that drop the symmetry axiom [1]. We use these to determine how features contribute to the forecasting outcome. Asymmetric Shapley values can take into account the temporal dependencies of the time series that forecasting models take as input.
[1] Frye, Christopher, Colin Rowat, and Ilya Feige. “Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability.” NeurIPS (2020). https://doi.org/10.48550/arXiv.1910.06358
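A minimal construction sketch (the SageMaker SDK and an AWS environment are assumed; only documented defaults are used):

```python
from sagemaker.core.clarify import AsymmetricShapleyValueConfig

# Attribute each forecast to its inputs in time order, producing one
# attribution score per timestep of the input window.
asym_shap_config = AsymmetricShapleyValueConfig(
    direction="chronological",
    granularity="timewise",
)
```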
- class sagemaker.core.clarify.BiasConfig(label_values_or_threshold: List[int | float | str], facet_name: str | int | List[str] | List[int], facet_values_or_threshold: int | float | str | None = None, group_name: str | None = None)[source]#
Bases: object
Config object with user-defined bias configurations of the input dataset.
- class sagemaker.core.clarify.DataConfig(s3_data_input_path: str, s3_output_path: str, s3_analysis_config_output_path: str | None = None, label: str | None = None, headers: List[str] | None = None, features: str | None = None, dataset_type: str = 'text/csv', s3_compression_type: str = 'None', joinsource: str | int | None = None, facet_dataset_uri: str | None = None, facet_headers: List[str] | None = None, predicted_label_dataset_uri: str | None = None, predicted_label_headers: List[str] | None = None, predicted_label: str | int | None = None, excluded_columns: List[int] | List[str] | None = None, segmentation_config: List[SegmentationConfig] | None = None, time_series_data_config: TimeSeriesDataConfig | None = None)[source]#
Bases: object
Config object related to configurations of the input and output dataset.
- class sagemaker.core.clarify.DatasetType(value)[source]#
Bases: Enum
Enum to store different dataset types supported in the analysis config file.
- IMAGE = 'application/x-image'#
- JSON = 'application/json'#
- JSONLINES = 'application/jsonlines'#
- PARQUET = 'application/x-parquet'#
- TEXTCSV = 'text/csv'#
- class sagemaker.core.clarify.ExplainabilityConfig[source]#
Bases: ABC
Abstract config class to configure an explainability method.
- class sagemaker.core.clarify.ImageConfig(model_type: str, num_segments: int | None = None, feature_extraction_method: str | None = None, segment_compactness: float | None = None, max_objects: int | None = None, iou_threshold: float | None = None, context: float | None = None)[source]#
Bases: object
Config object for handling images.
- class sagemaker.core.clarify.ModelConfig(model_name: str | None = None, instance_count: int | None = None, instance_type: str | None = None, accept_type: str | None = None, content_type: str | None = None, content_template: str | None = None, record_template: str | None = None, custom_attributes: str | None = None, accelerator_type: str | None = None, endpoint_name_prefix: str | None = None, target_model: str | None = None, endpoint_name: str | None = None, time_series_model_config: TimeSeriesModelConfig | None = None)[source]#
Bases: object
Config object related to a model and its endpoint to be created.
- class sagemaker.core.clarify.ModelPredictedLabelConfig(label: str | int | None = None, probability: str | int | None = None, probability_threshold: float | None = None, label_headers: List[str] | None = None)[source]#
Bases: object
Config object to extract a predicted label from the model output.
- class sagemaker.core.clarify.PDPConfig(features: List | None = None, grid_resolution: int = 15, top_k_features: int = 10)[source]#
Bases: ExplainabilityConfig
Config class for Partial Dependence Plots (PDP).
PDPs show the marginal effect (the dependence) a subset of features has on the predicted outcome of an ML model.
When PDP is requested (by passing in a PDPConfig to the explainability_config parameter of SageMakerClarifyProcessor), the Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
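A construction sketch (the feature names are hypothetical):

```python
from sagemaker.core.clarify import PDPConfig

# Compute partial dependence for two named features on a 20-point grid.
pdp_config = PDPConfig(features=["Age", "Income"], grid_resolution=20)
```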
- class sagemaker.core.clarify.ProcessingOutputHandler[source]#
Bases: object
Class to handle the parameters for the processing job's ProcessingOutput.
- class sagemaker.core.clarify.SHAPConfig(baseline: str | List | Dict | None = None, num_samples: int | None = None, agg_method: str | None = None, use_logit: bool = False, save_local_shap_values: bool = True, seed: int | None = None, num_clusters: int | None = None, text_config: TextConfig | None = None, image_config: ImageConfig | None = None, features_to_explain: List[int | str] | None = None)[source]#
Bases: ExplainabilityConfig
Config class for SHAP.
The SHAP algorithm calculates feature attributions by computing the contribution of each feature to the prediction outcome, using the concept of Shapley values.
These attributions can be provided for specific predictions (locally) and at a global level for the model as a whole.
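A minimal sketch, assuming the baseline is left unset so that Clarify derives one from the input data:

```python
from sagemaker.core.clarify import SHAPConfig

# Draw 100 coalition samples per example and aggregate per-example
# attributions into global importance by mean absolute value.
shap_config = SHAPConfig(
    num_samples=100,
    agg_method="mean_abs",
    save_local_shap_values=True,
)
```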
- class sagemaker.core.clarify.SageMakerClarifyProcessor(role: str | None = None, instance_count: int | None = None, instance_type: str | None = None, volume_size_in_gb: int = 30, volume_kms_key: str | None = None, output_kms_key: str | None = None, max_runtime_in_seconds: int | None = None, sagemaker_session: Session | None = None, env: Dict[str, str] | None = None, tags: List[Dict[str, str | PipelineVariable]] | Dict[str, str | PipelineVariable] | None = None, network_config: NetworkConfig | None = None, job_name_prefix: str | None = None, version: str | None = None, skip_early_validation: bool = False)[source]#
Bases: Processor
Handles SageMaker Processing tasks to compute bias metrics and model explanations.
- run_bias(data_config: DataConfig, bias_config: BiasConfig, model_config: ModelConfig | None = None, model_predicted_label_config: ModelPredictedLabelConfig | None = None, pre_training_methods: str | List[str] = 'all', post_training_methods: str | List[str] = 'all', wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a ProcessingJob to compute the requested bias methods.
Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in s3_data_input_path (from the DataConfig) to obtain predicted labels.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
pre_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]. Defaults to "all" to run all metrics if left unspecified.
post_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Bias" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_bias_and_explainability(data_config: DataConfig, model_config: ModelConfig, explainability_config: ExplainabilityConfig | List[ExplainabilityConfig], bias_config: BiasConfig, pre_training_methods: str | List[str] = 'all', post_training_methods: str | List[str] = 'all', model_predicted_label_config: ModelPredictedLabelConfig | None = None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)[source]#
Runs a ProcessingJob to compute bias metrics and feature attributions.
For bias: Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in s3_data_input_path (from the DataConfig) to obtain predicted labels.
For explainability: Spins up a model endpoint.
Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.
When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model's prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.
When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two methods supported. You can request multiple methods at once by passing in a list of ExplainabilityConfig.
bias_config (BiasConfig) – Config of sensitive groups.
pre_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]. Defaults to "all" to run all metrics if left unspecified.
post_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
model_predicted_label_config (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Explainability" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_explainability(data_config: DataConfig, model_config: ModelConfig, explainability_config: ExplainabilityConfig | List, model_scores: int | str | ModelPredictedLabelConfig | None = None, wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a ProcessingJob computing feature attributions.
Spins up a model endpoint.
Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.
When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model's prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.
When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two methods supported. You can request multiple methods at once by passing in a list of ExplainabilityConfig.
model_scores (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Explainability" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_post_training_bias(data_config: DataConfig, data_bias_config: BiasConfig, model_config: ModelConfig | None = None, model_predicted_label_config: ModelPredictedLabelConfig | None = None, methods: str | List[str] = 'all', wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a ProcessingJob to compute post-training bias methods.
Spins up a model endpoint and runs inference over the input dataset in the s3_data_input_path (from the DataConfig) to obtain predicted labels. Using the model predictions, computes the requested post-training bias methods that compare metrics (e.g. accuracy, precision, recall) for the sensitive group(s) versus the other examples.
- Parameters:
data_config (DataConfig) – Config of the input/output data.
data_bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Posttraining-Bias" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'. The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
- run_pre_training_bias(data_config: DataConfig, data_bias_config: BiasConfig, methods: str | List[str] = 'all', wait: bool = True, logs: bool = True, job_name: str | None = None, kms_key: str | None = None, experiment_config: Dict[str, str] | None = None)[source]#
Runs a
ProcessingJobto compute pre-training bias methodsComputes the requested
methodson the input data. Themethodscompare metrics (e.g. fraction of examples) for the sensitive group(s) vs. the other examples.- Parameters:
data_config (
DataConfig) – Config of the input/output data.data_bias_config (
BiasConfig) – Config of sensitive groups.methods (str or list[str]) –
Selects a subset of potential metrics: [”CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to str “all” to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when
waitis True (default: True).job_name (str) – Processing job name. When
job_nameis not specified, ifjob_name_prefixinSageMakerClarifyProcessoris specified, the job name will be thejob_name_prefixand current timestamp; otherwise use"Clarify-Pretraining-Bias"as prefix.kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) –
Experiment management configuration. Optionally, the dict can contain three keys:
'ExperimentName','TrialName', and'TrialComponentDisplayName'.The behavior of setting these keys is as follows:
If
'ExperimentName'is supplied but'TrialName'is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.If
'TrialName'is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.If both
'ExperimentName'and'TrialName'are not supplied, the Trial Component will be unassociated.'TrialComponentDisplayName'is used for display in Amazon SageMaker Studio.
- class sagemaker.core.clarify.SegmentationConfig(name_or_index: str | int, segments: List[List[int | str]], config_name: str | None = None, display_aliases: List[str] | None = None)[source]#
Bases: object
Config object that defines segment(s) of the dataset on which metrics are computed.
- class sagemaker.core.clarify.TextConfig(granularity: str, language: str)[source]#
Bases: object
Config object to handle text features for text explainability.
SHAP analysis breaks down longer text into chunks (e.g. tokens, sentences, or paragraphs) and replaces them with the strings specified in the baseline for that feature. The SHAP value of a chunk then captures how much replacing it affects the prediction.
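A sketch of pairing TextConfig with SHAPConfig (the "<UNK>" baseline string is an illustrative assumption for a single-text-feature dataset):

```python
from sagemaker.core.clarify import SHAPConfig, TextConfig

# Explain English text sentence by sentence: each sentence is replaced with
# the baseline string when its contribution to the prediction is measured.
text_config = TextConfig(granularity="sentence", language="english")

shap_config = SHAPConfig(
    baseline=[["<UNK>"]],  # replacement for masked text chunks (assumption)
    num_samples=100,
    text_config=text_config,
)
```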
- class sagemaker.core.clarify.TimeSeriesDataConfig(target_time_series: str | int, item_id: str | int, timestamp: str | int, related_time_series: List[int | str] | None = None, static_covariates: List[int | str] | None = None, dataset_format: TimeSeriesJSONDatasetFormat | None = None)[source]#
Bases: object
Config object for TimeSeries explainability data configuration fields.
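A construction sketch for a columns-oriented JSON dataset (the JMESPath key names are assumptions matching the COLUMNS example shown under TimeSeriesJSONDatasetFormat):

```python
from sagemaker.core.clarify import (
    TimeSeriesDataConfig,
    TimeSeriesJSONDatasetFormat,
)

# JMESPaths locating each field in a columns-oriented JSON file.
ts_data_config = TimeSeriesDataConfig(
    target_time_series="target_ts",
    item_id="ids",
    timestamp="timestamps",
    related_time_series=["rts1", "rts2"],
    static_covariates=["scv1", "scv2"],
    dataset_format=TimeSeriesJSONDatasetFormat.COLUMNS,
)
```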
- class sagemaker.core.clarify.TimeSeriesJSONDatasetFormat(value)[source]#
Bases: Enum
Possible dataset formats for JSON time series data files.
Below is an example COLUMNS dataset for time series explainability:
{
    "ids": [1, 2],
    "timestamps": [3, 4],
    "target_ts": [5, 6],
    "rts1": [0.25, 0.5],
    "rts2": [1.25, 1.5],
    "scv1": [10, 20],
    "scv2": [30, 40]
}
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="ids"
timestamp="timestamps"
target_time_series="target_ts"
related_time_series=["rts1", "rts2"]
static_covariates=["scv1", "scv2"]
Below is an example ITEM_RECORDS dataset for time series explainability:
[
    {
        "id": 1,
        "scv1": 10,
        "scv2": "red",
        "timeseries": [
            {"timestamp": 1, "target_ts": 5, "rts1": 0.25, "rts2": 10},
            {"timestamp": 2, "target_ts": 6, "rts1": 0.35, "rts2": 20},
            {"timestamp": 3, "target_ts": 4, "rts1": 0.45, "rts2": 30}
        ]
    },
    {
        "id": 2,
        "scv1": 20,
        "scv2": "blue",
        "timeseries": [
            {"timestamp": 1, "target_ts": 4, "rts1": 0.25, "rts2": 40},
            {"timestamp": 2, "target_ts": 2, "rts1": 0.35, "rts2": 50}
        ]
    }
]
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="[*].id"
timestamp="[*].timeseries[].timestamp"
target_time_series="[*].timeseries[].target_ts"
related_time_series=["[*].timeseries[].rts1", "[*].timeseries[].rts2"]
static_covariates=["[*].scv1", "[*].scv2"]
Below is an example TIMESTAMP_RECORDS dataset for time series explainability:
[
    {"id": 1, "timestamp": 1, "target_ts": 5, "scv1": 10, "rts1": 0.25},
    {"id": 1, "timestamp": 2, "target_ts": 6, "scv1": 10, "rts1": 0.5},
    {"id": 1, "timestamp": 3, "target_ts": 3, "scv1": 10, "rts1": 0.75},
    {"id": 2, "timestamp": 5, "target_ts": 10, "scv1": 20, "rts1": 1}
]
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="[*].id"
timestamp="[*].timestamp"
target_time_series="[*].target_ts"
related_time_series=["[*].rts1"]
static_covariates=["[*].scv1"]
- COLUMNS = 'columns'#
- ITEM_RECORDS = 'item_records'#
- TIMESTAMP_RECORDS = 'timestamp_records'#