sagemaker.core.debugger.debugger

Amazon SageMaker Debugger provides full visibility into ML training jobs.

This module provides high-level SageMaker Debugger classes and functions for setting up Debugger objects, such as Debugger built-in rules, tensor collections, and hook configuration. Pass these objects as parameters when constructing a SageMaker estimator to launch a training job.

Functions

get_default_profiler_processing_job([...])

Return the default profiler processing job (a rule) with a unique name.

get_rule_container_image_uri(name, region)

Return the Debugger rule image URI for the given AWS Region.

Classes

CollectionConfig(name[, parameters])

Creates tensor collections for SageMaker Debugger.

DebuggerHookConfig([s3_output_path, ...])

Create a Debugger hook configuration object to save the tensor for debugging.

DetailedProfilerProcessingJobConfig()

ProfilerRule-like class.

ProfilerRule(name, image_uri, instance_type, ...)

The SageMaker Debugger ProfilerRule class configures profiling rules.

Rule(name, image_uri, instance_type, ...[, ...])

The SageMaker Debugger Rule class configures debugging rules to debug your training job.

RuleBase(name, image_uri, instance_type, ...)

The SageMaker Debugger rule base class that cannot be instantiated directly.

TensorBoardOutputConfig(s3_output_path[, ...])

Create a tensor output configuration object for debugging visualizations on TensorBoard.

class sagemaker.core.debugger.debugger.CollectionConfig(name: str | PipelineVariable, parameters: Dict[str, str | PipelineVariable] | None = None)

Bases: object

Creates tensor collections for SageMaker Debugger.
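As a rough sketch of where this object ends up: each collection corresponds to one entry of the CollectionConfigurations list inside the DebugHookConfig field of a CreateTrainingJob request. The dependency-free code below builds such an entry. The helper collection_request is hypothetical, the field names are assumed from the CreateTrainingJob API reference rather than taken from this module, and "save_interval" is only an illustrative parameter key.

```python
from typing import Dict, Optional


def collection_request(name: str, parameters: Optional[Dict[str, str]] = None) -> dict:
    """Build one CollectionConfigurations entry (hypothetical helper).

    Mirrors the assumed CreateTrainingJob request shape, not the SDK's own code.
    """
    entry = {"CollectionName": name}
    if parameters is not None:
        # Values are strings, so numeric settings such as intervals are passed as text.
        entry["CollectionParameters"] = dict(parameters)
    return entry


print(collection_request("gradients", {"save_interval": "100"}))
print(collection_request("losses"))
```

Copying the parameters dictionary keeps later mutations of the caller's dictionary from leaking into the request.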

class sagemaker.core.debugger.debugger.DebuggerHookConfig(s3_output_path: str | PipelineVariable | None = None, container_local_output_path: str | PipelineVariable | None = None, hook_parameters: Dict[str, str | PipelineVariable] | None = None, collection_configs: List[CollectionConfig] | None = None)

Bases: object

Create a Debugger hook configuration object to save the tensor for debugging.

DebuggerHookConfig provides options to customize how debugging information is emitted and saved. This high-level class is built on top of the smdebug.SaveConfig class.
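A similar sketch for the hook configuration, assuming the DebugHookConfig field names of the CreateTrainingJob API (S3OutputPath, LocalPath, HookParameters, CollectionConfigurations). The helper hook_config_request and the bucket name are hypothetical. The sketch takes s3_output_path as a required argument because the API field is required, whereas this class leaves it optional.

```python
from typing import Dict, List, Optional


def hook_config_request(
    s3_output_path: str,
    container_local_output_path: Optional[str] = None,
    hook_parameters: Optional[Dict[str, str]] = None,
    collection_configs: Optional[List[dict]] = None,
) -> dict:
    """Assemble a DebugHookConfig request fragment (hypothetical helper)."""
    request = {"S3OutputPath": s3_output_path}
    if container_local_output_path is not None:
        request["LocalPath"] = container_local_output_path
    if hook_parameters is not None:
        request["HookParameters"] = dict(hook_parameters)
    if collection_configs is not None:
        # Each item is assumed to already be a {"CollectionName": ...} dictionary.
        request["CollectionConfigurations"] = list(collection_configs)
    return request


config = hook_config_request(
    "s3://amzn-s3-demo-bucket/debug-output",  # hypothetical bucket
    hook_parameters={"save_interval": "100"},  # illustrative hook parameter
    collection_configs=[{"CollectionName": "losses"}],
)
print(config)
```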

class sagemaker.core.debugger.debugger.DetailedProfilerProcessingJobConfig

Bases: object

ProfilerRule-like class.

Serves as a vehicle for passing information to the processing instance.

class sagemaker.core.debugger.debugger.ProfilerRule(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters)

Bases: RuleBase

The SageMaker Debugger ProfilerRule class configures profiling rules.

SageMaker Debugger profiling rules automatically analyze hardware system resource utilization and framework metrics of a training job to identify performance bottlenecks.

SageMaker Debugger comes pre-packaged with built-in profiling rules. For example, the profiling rules can detect whether GPUs are underutilized because of CPU bottlenecks or I/O bottlenecks. For a full list of built-in rules, see List of Debugger Built-in Rules. You can also write your own profiling rules using the Amazon SageMaker Debugger APIs.
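To make the idea of a bottleneck check concrete, here is a toy heuristic that flags samples where CPU utilization is high while GPU utilization is low. It is not the logic of any built-in Debugger rule; the function name, the thresholds, and the sample data are all hypothetical.

```python
from typing import Sequence


def fraction_cpu_bound(
    cpu_util: Sequence[float],
    gpu_util: Sequence[float],
    cpu_threshold: float = 90.0,
    gpu_threshold: float = 10.0,
) -> float:
    """Return the share of samples where the CPU is saturated while the GPU idles.

    Toy heuristic for illustration only; thresholds are hypothetical percentages.
    """
    if len(cpu_util) != len(gpu_util) or not cpu_util:
        raise ValueError("expected two non-empty series of equal length")
    flagged = sum(
        1
        for cpu, gpu in zip(cpu_util, gpu_util)
        if cpu > cpu_threshold and gpu < gpu_threshold
    )
    return flagged / len(cpu_util)


cpu = [95.0, 97.0, 40.0, 92.0]
gpu = [5.0, 8.0, 70.0, 60.0]
share = fraction_cpu_bound(cpu, gpu)
print(f"{share:.0%} of samples look CPU-bound")  # 50% of samples look CPU-bound
```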

Tip

Use the ProfilerRule.sagemaker class method for built-in profiling rules or the ProfilerRule.custom class method for custom profiling rules. Do not call the ProfilerRule initialization method directly.

classmethod custom(name, image_uri, instance_type, volume_size_in_gb, source=None, rule_to_invoke=None, container_local_output_path=None, s3_output_path=None, rule_parameters=None)

Initialize a ProfilerRule object for a custom profiling rule.

You can create a rule that analyzes system and framework metrics emitted during the training of a model and monitors conditions that are critical for the success of a training job.

Parameters:
  • name (str) – The name of the profiler rule.

  • image_uri (str) – The URI of the image to be used by the profiler rule.

  • instance_type (str) – Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data.

  • source (str) – A source file containing a rule to invoke. If provided, you must also provide rule_to_invoke. This can be either an S3 URI or a local path.

  • rule_to_invoke (str) – The name of the rule to invoke within the source. If provided, you must also provide the source.

  • container_local_output_path (str) – The path in the container.

  • s3_output_path (str) – The location in Amazon S3 to store the output. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.

  • rule_parameters (dict) – A dictionary of parameters for the rule.

Returns:

The instance of the custom ProfilerRule.

Return type:

ProfilerRule
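The source and rule_to_invoke parameters must be supplied together. A minimal sketch of that pairing check follows; check_custom_rule_source is a hypothetical helper, and raising ValueError is an assumption about how such a check might report the problem, not a description of what ProfilerRule.custom does internally.

```python
from typing import Optional


def check_custom_rule_source(source: Optional[str], rule_to_invoke: Optional[str]) -> None:
    """Raise ValueError unless source and rule_to_invoke are both given or both omitted."""
    if (source is None) != (rule_to_invoke is None):
        raise ValueError("source and rule_to_invoke must be provided together")


check_custom_rule_source("s3://amzn-s3-demo-bucket/rules/my_rule.py", "MyCustomRule")  # passes
check_custom_rule_source(None, None)  # passes: no custom source at all
try:
    check_custom_rule_source("my_rule.py", None)
except ValueError as err:
    print(err)  # source and rule_to_invoke must be provided together
```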

classmethod sagemaker(base_config, name=None, container_local_output_path=None, s3_output_path=None, instance_type=None, volume_size_in_gb=None)

Initialize a ProfilerRule object for a built-in profiling rule.

The rule analyzes system and framework metrics of a given training job to identify performance bottlenecks.

Parameters:
  • base_config (rule_configs.ProfilerRule) –

    The base rule configuration object returned from the rule_configs method. For example, ‘rule_configs.ProfilerReport()’. For a full list of built-in rules, see List of Debugger Built-in Rules.

  • name (str) – The name of the profiler rule. If one is not provided, the name of the base_config will be used.

  • container_local_output_path (str) – The path in the container.

  • s3_output_path (str) – The location in Amazon S3 to store the profiling output data. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.

Returns:

The instance of the built-in ProfilerRule.

Return type:

ProfilerRule
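The default S3 output paths documented above follow a fixed pattern. The sketch below only reproduces that documented example pattern for illustration; example_default_output_path is hypothetical, the account ID and job name are placeholders, and the actual default is derived from the estimator's output path, which differs if you configure another bucket.

```python
def example_default_output_path(region: str, account_id: str, job_name: str, suffix: str) -> str:
    """Reproduce the documented example pattern of Debugger's default S3 output paths."""
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError("account_id must be a 12-digit AWS account ID")
    return f"s3://sagemaker-{region}-{account_id}/{job_name}/{suffix}/"


# All values below are placeholders; use "debug-output" for debugging rules.
print(example_default_output_path("us-west-2", "111122223333", "my-training-job", "profiler-output"))
# s3://sagemaker-us-west-2-111122223333/my-training-job/profiler-output/
```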

to_profiler_rule_config_dict()

Generate a request dictionary from the parameters provided when the object was initialized.

Returns:

A portion of an API request as a dictionary.

Return type:

dict
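The returned dictionary presumably corresponds to one entry of the ProfilerRuleConfigurations list in a CreateTrainingJob request. A dependency-free sketch of that shape, assuming the field names from the CreateTrainingJob API reference (DebugRuleConfiguration entries use the same field names); profiler_rule_request, the image URI, and the rule parameter are hypothetical, and dropping unset fields is a design choice of the sketch.

```python
from typing import Dict, Optional


def profiler_rule_request(
    name: str,
    image_uri: str,
    instance_type: Optional[str] = None,
    volume_size_in_gb: Optional[int] = None,
    container_local_output_path: Optional[str] = None,
    s3_output_path: Optional[str] = None,
    rule_parameters: Optional[Dict[str, str]] = None,
) -> dict:
    """Map rule attributes onto rule-configuration field names (hypothetical helper)."""
    fields = {
        "RuleConfigurationName": name,
        "RuleEvaluatorImage": image_uri,
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_size_in_gb,
        "LocalPath": container_local_output_path,
        "S3OutputPath": s3_output_path,
        "RuleParameters": rule_parameters,
    }
    # Optional fields that were not set are left out of the request entirely.
    return {key: value for key, value in fields.items() if value is not None}


request = profiler_rule_request(
    "ProfilerReport",
    "123456789012.dkr.ecr.us-west-2.amazonaws.com/example-rules:latest",  # placeholder URI
    rule_parameters={"example_threshold": "10"},  # hypothetical parameter
)
print(sorted(request))  # ['RuleConfigurationName', 'RuleEvaluatorImage', 'RuleParameters']
```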

class sagemaker.core.debugger.debugger.Rule(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters, collections_to_save, actions=None)

Bases: RuleBase

The SageMaker Debugger Rule class configures debugging rules to debug your training job.

The debugging rules analyze tensor outputs from your training job and monitor conditions that are critical for the success of the training job.

SageMaker Debugger comes pre-packaged with built-in debugging rules. For example, the debugging rules can detect whether gradients are getting too large or too small, or if a model is overfitting. For a full list of built-in rules for debugging, see List of Debugger Built-in Rules. You can also write your own rules using the custom rule classmethod.
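The following toy check illustrates the kind of condition a debugging rule monitors: it classifies a gradient as vanishing or exploding from its L2 norm. It is not the algorithm of Debugger's built-in rules; the function name and both thresholds are hypothetical.

```python
import math
from typing import Sequence


def gradient_status(gradients: Sequence[float], low: float = 1e-7, high: float = 1e3) -> str:
    """Classify a gradient vector by its L2 norm (toy thresholds, for illustration only)."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm < low:
        return "vanishing"
    if norm > high:
        return "exploding"
    return "ok"


print(gradient_status([1e-9, -2e-9]))  # vanishing
print(gradient_status([0.5, -0.25]))   # ok
print(gradient_status([2e3, 1e3]))     # exploding
```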

classmethod custom(name: str, image_uri: str | PipelineVariable, instance_type: str | PipelineVariable, volume_size_in_gb: int | PipelineVariable, source: str | None = None, rule_to_invoke: str | PipelineVariable | None = None, container_local_output_path: str | PipelineVariable | None = None, s3_output_path: str | PipelineVariable | None = None, other_trials_s3_input_paths: List[str | PipelineVariable] | None = None, rule_parameters: Dict[str, str | PipelineVariable] | None = None, collections_to_save: List[CollectionConfig] | None = None, actions=None)

Initialize a Rule object for a custom debugging rule.

You can create a custom rule that analyzes tensors emitted during the training of a model and monitors conditions that are critical for the success of a training job. For more information, see Create Debugger Custom Rules for Training Job Analysis.

Parameters:
  • name (str) – Required. The name of the debugger rule.

  • image_uri (str or PipelineVariable) – Required. The URI of the image to be used by the debugger rule.

  • instance_type (str or PipelineVariable) – Required. Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.

  • volume_size_in_gb (int or PipelineVariable) – Required. Size in GB of the EBS volume to use for storing data.

  • source (str) – Optional. A source file containing a rule to invoke. If provided, you must also provide rule_to_invoke. This can be either an S3 URI or a local path.

  • rule_to_invoke (str or PipelineVariable) – Optional. The name of the rule to invoke within the source. If provided, you must also provide source.

  • container_local_output_path (str or PipelineVariable) – Optional. The local path in the container.

  • s3_output_path (str or PipelineVariable) – Optional. The location in Amazon S3 to store the output tensors. The default Debugger output path for debugging data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/.

  • other_trials_s3_input_paths (list[str] or list[PipelineVariable]) – Optional. The Amazon S3 input paths of other trials, used by the SimilarAcrossRuns rule.

  • rule_parameters (dict[str, str] or dict[str, PipelineVariable]) – Optional. A dictionary of parameters for the rule.

  • collections_to_save ([sagemaker.debugger.CollectionConfig]) – Optional. A list of CollectionConfig objects to be saved.

Returns:

The instance of the custom rule.

Return type:

Rule

prepare_actions(training_job_name)

Prepare actions for Debugger Rule.

Parameters:

training_job_name (str) – The training job name. It is used as the default training job prefix for the StopTraining action, if that action is specified.

classmethod sagemaker(base_config, name=None, container_local_output_path=None, s3_output_path=None, other_trials_s3_input_paths=None, rule_parameters=None, collections_to_save=None, actions=None)

Initialize a Rule object for a built-in debugging rule.

Parameters:
  • base_config (dict) –

    Required. This is the base rule config dictionary returned from the rule_configs method. For example, rule_configs.dead_relu(). For a full list of built-in rules for debugging, see List of Debugger Built-in Rules.

  • name (str) – Optional. The name of the debugger rule. If one is not provided, the name of the base_config will be used.

  • container_local_output_path (str) – Optional. The local path in the rule processing container.

  • s3_output_path (str) – Optional. The location in Amazon S3 to store the output tensors. The default Debugger output path for debugging data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/.

  • other_trials_s3_input_paths ([str]) – Optional. The Amazon S3 input paths of other trials, used by the SimilarAcrossRuns rule.

  • rule_parameters (dict) – Optional. A dictionary of parameters for the rule.

  • collections_to_save (list[CollectionConfig]) – Optional. A list of CollectionConfig objects to be saved.

Returns:

An instance of the built-in rule.

Return type:

Rule

Example of how to create a built-in rule instance:

from sagemaker.debugger import Rule, rule_configs

built_in_rules = [
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_2()),
    # ... add one Rule.sagemaker(...) entry per built-in rule ...
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_n()),
]

Replace each built_in_rule_name_in_pysdk_format_* placeholder with the name of a built-in rule. You can find the rule names at List of Debugger Built-in Rules.

Example of creating a built-in rule instance with adjusted parameter values:

from sagemaker.debugger import CollectionConfig, Rule, rule_configs

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name_in_pysdk_format(),
        # Override the rule's default parameter values.
        rule_parameters={
            "key": "value"
        },
        # Tensor collections the rule needs saved during training.
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]

For more information about setting up the rule_parameters parameter, see List of Debugger Built-in Rules.

For more information about setting up the collections_to_save parameter, see the CollectionConfig class.

to_debugger_rule_config_dict()

Generate a request dictionary from the parameters provided when the object was initialized.

Returns:

A portion of an API request as a dictionary.

Return type:

dict

class sagemaker.core.debugger.debugger.RuleBase(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters)

Bases: ABC

The SageMaker Debugger rule base class that cannot be instantiated directly.

Tip

Debugger rule classes inheriting this RuleBase class are Rule and ProfilerRule. Do not directly use the rule base class to instantiate a SageMaker Debugger rule. Use the Rule classmethods for debugging and the ProfilerRule classmethods for profiling.
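The general Python mechanism behind "cannot be instantiated directly" is the abc module. Note that inheriting from ABC only blocks instantiation when the class declares at least one abstract method; the sketch declares one to show the mechanism and makes no claim about which abstract methods, if any, RuleBase itself defines. All class and method names below are hypothetical.

```python
from abc import ABC, abstractmethod


class RuleBaseSketch(ABC):
    """Hypothetical stand-in for an abstract rule base class."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def to_request_dict(self) -> dict:
        """Subclasses must define how they serialize themselves."""


class DebugRuleSketch(RuleBaseSketch):
    """Concrete subclass: implementing the abstract method makes it instantiable."""

    def to_request_dict(self) -> dict:
        return {"RuleConfigurationName": self.name}


try:
    RuleBaseSketch("base")
except TypeError as err:
    print("cannot instantiate:", err)

print(DebugRuleSketch("dead_relu").to_request_dict())  # {'RuleConfigurationName': 'dead_relu'}
```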

name

The name of the rule.

Type:

str

image_uri

The URI of the image used to run the rule.

Type:

str

instance_type

Type of EC2 instance to use. For example, ‘ml.c4.xlarge’.

Type:

str

container_local_output_path

The local path to store the Rule output.

Type:

str

s3_output_path

The location in S3 to store the output.

Type:

str

volume_size_in_gb

Size in GB of the EBS volume to use for storing data.

Type:

int

rule_parameters

A dictionary of parameters for the rule.

Type:

dict

class sagemaker.core.debugger.debugger.TensorBoardOutputConfig(s3_output_path: str | PipelineVariable, container_local_output_path: str | PipelineVariable | None = None)

Bases: object

Create a tensor output configuration object for debugging visualizations on TensorBoard.
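A sketch of the corresponding TensorBoardOutputConfig request fragment, assuming the CreateTrainingJob field names S3OutputPath and LocalPath. The helper name, the bucket, and the local path are illustrative placeholders.

```python
from typing import Optional


def tensorboard_output_request(
    s3_output_path: str, container_local_output_path: Optional[str] = None
) -> dict:
    """Build a TensorBoardOutputConfig request fragment (hypothetical helper)."""
    request = {"S3OutputPath": s3_output_path}
    if container_local_output_path is not None:
        request["LocalPath"] = container_local_output_path
    return request


print(tensorboard_output_request("s3://amzn-s3-demo-bucket/tensorboard", "/opt/ml/output/tensorboard"))
```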

sagemaker.core.debugger.debugger.get_default_profiler_processing_job(instance_type=None, volume_size_in_gb=None)

Return the default profiler processing job (a rule) with a unique name.

Returns:

The instance of the built-in ProfilerRule.

Return type:

sagemaker.debugger.ProfilerRule

sagemaker.core.debugger.debugger.get_rule_container_image_uri(name, region)

Return the Debugger rule image URI for the given AWS Region.

For a full list of rule image URIs, see Use Debugger Docker Images for Built-in or Custom Rules.

Parameters:

region (str) – The AWS Region, for example, 'us-east-1'.

Returns:

Formatted image URI for the given AWS Region and the rule container type.

Return type:

str
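Rule images are hosted in Amazon ECR, and ECR image URIs take the form <account>.dkr.ecr.<region>.<domain>/<repository>:<tag>. The sketch below only formats such a URI; ecr_image_uri is hypothetical, and the account ID and repository name are placeholders, not the real Debugger registry values that get_rule_container_image_uri resolves for each Region.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Format an Amazon ECR image URI; China Regions use the amazonaws.com.cn domain."""
    domain = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{domain}/{repository}:{tag}"


# Placeholder account ID and repository name, not the real Debugger registry values.
print(ecr_image_uri("123456789012", "us-east-1", "example-debugger-rules"))
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/example-debugger-rules:latest
```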