sagemaker.train.distributed

Distributed module.

Classes

`DistributedConfig`()	Abstract base class for distributed training configurations.
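`MPI`(*[, process_count_per_node, ...])	Configuration for launching distributed training with mpirun.
`SMP`(*[, hybrid_shard_degree, ...])	SageMaker Model Parallelism v2 parameters.
`Torchrun`(*[, process_count_per_node, smp])	Configuration for launching distributed training with torchrun or torch.distributed.launch.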

class sagemaker.train.distributed.DistributedConfig[source]#

Bases: BaseConfig, ABC

Abstract base class for distributed training configurations.

This class defines the interface that all distributed training configurations must implement. It provides a standardized way to specify driver scripts and their locations for distributed training jobs.

abstract property driver_dir: str#

Directory containing the driver script.

This property should return the path to the directory containing the driver script, relative to the container’s working directory.

Returns: Path to the directory containing the driver script
Return type: str

abstract property driver_script: str#

Name of the driver script.

This property should return the name of the Python script that implements the distributed training driver logic.

Returns: Name of the driver script file
Return type: str

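model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'validate_assignment': True}#

Pydantic configuration for the model; a dictionary conforming to pydantic.config.ConfigDict. With 'extra': 'forbid', unknown fields are rejected at construction, and with 'validate_assignment': True, values are validated again whenever a field is assigned.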
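The interface described above can be illustrated with a minimal standard-library sketch. `DriverConfigSketch` is a stand-in written for this example, not the SDK's `DistributedConfig` (which also derives from `BaseConfig`), and `MyLauncher`, `driver_path`, and the directory and file names are hypothetical.

```python
from abc import ABC, abstractmethod


class DriverConfigSketch(ABC):
    """Stand-in for the documented interface; NOT the SDK class."""

    @property
    @abstractmethod
    def driver_dir(self) -> str:
        """Directory containing the driver script, relative to the working directory."""

    @property
    @abstractmethod
    def driver_script(self) -> str:
        """File name of the Python driver script."""


class MyLauncher(DriverConfigSketch):
    """Hypothetical configuration; both return values are made up."""

    @property
    def driver_dir(self) -> str:
        return "drivers"

    @property
    def driver_script(self) -> str:
        return "my_launcher.py"


def driver_path(config: DriverConfigSketch) -> str:
    """Join the two properties into a path relative to the working directory."""
    return f"{config.driver_dir}/{config.driver_script}"
```

Because both properties are abstract, instantiating `DriverConfigSketch` directly raises `TypeError`; only a subclass that implements both can be created.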

class sagemaker.train.distributed.MPI(*, process_count_per_node: int | None = None, mpi_additional_options: List[str] | None = None)[source]#

Bases: DistributedConfig

MPI.

The MPI class configures a job that uses mpirun in the backend to launch distributed training.

Parameters:

process_count_per_node (Optional[int]) – The number of processes to run on each node in the training job. Defaults to the number of GPUs available in the container.
mpi_additional_options (Optional[List[str]]) – Custom MPI options to use for the training job, one list item per command-line token.
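To make the two parameters concrete, here is a rough sketch of how they might map onto an mpirun command line. The helper `build_mpirun_command`, its `node_count` and `gpus_per_node` arguments, and the exact command layout are assumptions for illustration; this reference does not show the command the SDK's driver actually builds.

```python
from typing import List, Optional


def build_mpirun_command(
    script: str,
    node_count: int,
    gpus_per_node: int,
    process_count_per_node: Optional[int] = None,
    mpi_additional_options: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: assemble an mpirun-style argument list."""
    # Documented default: one process per available GPU when no count is given.
    per_node = process_count_per_node if process_count_per_node is not None else gpus_per_node
    if per_node < 1:
        raise ValueError("process_count_per_node must be at least 1")

    # -np is the total number of processes across all nodes.
    command = ["mpirun", "-np", str(node_count * per_node)]
    # Each extra option is its own list item, mirroring List[str].
    command.extend(mpi_additional_options or [])
    command.extend(["python", script])
    return command
```

With the SDK installed, the matching configuration would be written as `MPI(process_count_per_node=8, mpi_additional_options=["-x", "NCCL_DEBUG=WARN"])`, where `-x` is the Open MPI flag for exporting an environment variable to the launched processes.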

property driver_dir: str#

Directory containing the driver script.

Returns: Path to the directory containing the driver script
Return type: str

property driver_script: str#

Name of the driver script.

Returns: Name of the driver script
Return type: str

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'validate_assignment': True}#

Pydantic configuration for the model; a dictionary conforming to pydantic.config.ConfigDict.

mpi_additional_options: List[str] | None#

process_count_per_node: int | None#

class sagemaker.train.distributed.SMP(*, hybrid_shard_degree: int | None = None, sm_activation_offloading: bool | None = None, activation_loading_horizon: int | None = None, fsdp_cache_flush_warnings: bool | None = None, allow_empty_shards: bool | None = None, tensor_parallel_degree: int | None = None, context_parallel_degree: int | None = None, expert_parallel_degree: int | None = None, random_seed: int | None = None)[source]#

Bases: BaseConfig

SMP.

This class is used for configuring the SageMaker Model Parallelism v2 parameters. For more information on the model parallelism parameters, see: https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-model-parallel-v2-reference.html#distributed-model-parallel-v2-reference-init-config

Parameters:

hybrid_shard_degree (Optional[int]) – Specifies a sharded parallelism degree for the model.
sm_activation_offloading (Optional[bool]) – Specifies whether to enable the SMP activation offloading implementation.
activation_loading_horizon (Optional[int]) – An integer specifying the activation offloading horizon type for FSDP. This is the maximum number of checkpointed or offloaded layers whose inputs can be in the GPU memory simultaneously.
fsdp_cache_flush_warnings (Optional[bool]) – Detects and warns if cache flushes happen in the PyTorch memory manager, because they can degrade computational performance.
allow_empty_shards (Optional[bool]) – Whether to allow empty shards when sharding a tensor whose size is not evenly divisible. This is an experimental fix for a crash during checkpointing in certain scenarios. Disabling it falls back to the original PyTorch behavior.
tensor_parallel_degree (Optional[int]) – Specifies a tensor parallelism degree. The value must be between 1 and world_size.
context_parallel_degree (Optional[int]) – Specifies the context parallelism degree. The value must be between 1 and world_size, and must be <= hybrid_shard_degree.
expert_parallel_degree (Optional[int]) – Specifies an expert parallelism degree. The value must be between 1 and world_size.
random_seed (Optional[int]) – A seed number for the random operations in distributed modules by SMP tensor parallelism or expert parallelism.
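The range constraints listed above can be checked before submitting a job. The following is a minimal sketch, assuming the caller already knows `world_size`; `check_smp_degrees` is a hypothetical helper, not part of the SDK, and it covers only the constraints stated in this parameter list.

```python
from typing import List, Optional


def check_smp_degrees(
    world_size: int,
    hybrid_shard_degree: Optional[int] = None,
    tensor_parallel_degree: Optional[int] = None,
    context_parallel_degree: Optional[int] = None,
    expert_parallel_degree: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: return a list of violated constraints (empty if none)."""
    problems: List[str] = []

    # Each of these degrees is documented as lying between 1 and world_size.
    bounded = {
        "tensor_parallel_degree": tensor_parallel_degree,
        "context_parallel_degree": context_parallel_degree,
        "expert_parallel_degree": expert_parallel_degree,
    }
    for name, value in bounded.items():
        if value is not None and not 1 <= value <= world_size:
            problems.append(f"{name}={value} is outside 1..{world_size}")

    # context_parallel_degree must also be <= hybrid_shard_degree.
    if (
        context_parallel_degree is not None
        and hybrid_shard_degree is not None
        and context_parallel_degree > hybrid_shard_degree
    ):
        problems.append(
            f"context_parallel_degree={context_parallel_degree} exceeds "
            f"hybrid_shard_degree={hybrid_shard_degree}"
        )
    return problems
```

Unset (`None`) values are skipped, which mirrors the fact that every SMP field is optional.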

activation_loading_horizon: int | None#

allow_empty_shards: bool | None#

context_parallel_degree: int | None#

expert_parallel_degree: int | None#

fsdp_cache_flush_warnings: bool | None#

hybrid_shard_degree: int | None#

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'validate_assignment': True}#

Pydantic configuration for the model; a dictionary conforming to pydantic.config.ConfigDict.

random_seed: int | None#

sm_activation_offloading: bool | None#

tensor_parallel_degree: int | None#

class sagemaker.train.distributed.Torchrun(*, process_count_per_node: int | None = None, smp: SMP | None = None)[source]#

Bases: DistributedConfig

Torchrun.

The Torchrun class configures a job that uses torchrun or torch.distributed.launch in the backend to launch distributed training.

Parameters:

process_count_per_node (Optional[int]) – The number of processes to run on each node in the training job. Defaults to the number of GPUs available in the container.
smp (Optional[SMP]) – The SageMaker Model Parallelism v2 parameters.
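As an illustration of the per-node process count, the sketch below assembles a torchrun-style argument list. `build_torchrun_command`, `node_count`, and `gpus_per_node` are hypothetical names; the rendezvous arguments that a multi-node launch needs are omitted, and the SDK's actual launch command is not shown in this reference.

```python
from typing import List, Optional


def build_torchrun_command(
    script: str,
    node_count: int,
    gpus_per_node: int,
    process_count_per_node: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: assemble a torchrun-style argument list."""
    # Documented default: one process per available GPU when no count is given.
    per_node = process_count_per_node if process_count_per_node is not None else gpus_per_node
    if per_node < 1:
        raise ValueError("process_count_per_node must be at least 1")
    return [
        "torchrun",
        f"--nnodes={node_count}",
        f"--nproc_per_node={per_node}",
        script,
    ]


def world_size(node_count: int, per_node: int) -> int:
    """Total number of processes across the job."""
    return node_count * per_node
```

With the SDK installed, model parallelism settings are attached through the `smp` field, for example `Torchrun(process_count_per_node=8, smp=SMP(tensor_parallel_degree=2))`.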

property driver_dir: str#

Directory containing the driver script.

Returns: Path to the directory containing the driver script
Return type: str

property driver_script: str#

Name of the driver script.

Returns: Name of the driver script file
Return type: str

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'validate_assignment': True}#

Pydantic configuration for the model; a dictionary conforming to pydantic.config.ConfigDict.

process_count_per_node: int | None#

smp: SMP | None#
