sagemaker.train.remote_function.runtime_environment.mpi_utils_remote

Utility functions for the runtime environment. This module must be kept independent of the SageMaker PySDK.

Functions

bootstrap_master_node(worker_hosts)

Bootstrap the master node.

bootstrap_worker_node(master_host, current_host)

Bootstrap the worker nodes.

main([sys_args])

Entry point for the bootstrap script.

start_sshd_daemon()

Start the SSH daemon on the current node.

write_status_file_to_workers(worker_hosts[, ...])

Write the status file to all worker nodes.

Classes

CustomHostKeyPolicy()

Class to handle host key policy for SageMaker distributed training SSH connections.

class sagemaker.train.remote_function.runtime_environment.mpi_utils_remote.CustomHostKeyPolicy

Bases: MissingHostKeyPolicy

Class to handle host key policy for SageMaker distributed training SSH connections.

Example:

>>> client = paramiko.SSHClient()
>>> client.set_missing_host_key_policy(CustomHostKeyPolicy())
>>> # Will succeed for SageMaker algorithm containers
>>> client.connect('algo-1234.internal')
>>> # Will raise SSHException for other unknown hosts
>>> client.connect('unknown-host')  # raises SSHException

missing_host_key(client, hostname, key)

Accept host keys for algo-* hostnames, reject others.

Parameters:
  • client – The SSHClient instance

  • hostname – The hostname attempting to connect

  • key – The host key

Raises:

paramiko.SSHException – If hostname doesn’t match algo-* pattern
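The accept/reject rule described above can be sketched without an SSH connection. This is a minimal illustration only: the glob pattern `algo-*` is taken from the docstring, while `is_sagemaker_algo_host`, `check_host`, and `HostKeyRejected` are hypothetical names introduced here, and the real policy raises `paramiko.SSHException` and may match hostnames differently.

```python
from fnmatch import fnmatchcase

# Assumption: the policy treats "algo-*" as a shell-style glob on the hostname.
ALGO_HOST_PATTERN = "algo-*"


class HostKeyRejected(Exception):
    """Hypothetical stand-in for paramiko.SSHException in this sketch."""


def is_sagemaker_algo_host(hostname: str) -> bool:
    """Return True if the hostname matches the algo-* pattern."""
    return fnmatchcase(hostname, ALGO_HOST_PATTERN)


def check_host(hostname: str) -> None:
    """Accept algo-* hostnames silently; raise for anything else."""
    if not is_sagemaker_algo_host(hostname):
        raise HostKeyRejected(f"Unknown host key for {hostname}")
```

With this sketch, `check_host('algo-1234.internal')` returns without error, while `check_host('unknown-host')` raises `HostKeyRejected`, mirroring the two cases in the class example.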

sagemaker.train.remote_function.runtime_environment.mpi_utils_remote.bootstrap_master_node(worker_hosts: List[str])

Bootstrap the master node.

sagemaker.train.remote_function.runtime_environment.mpi_utils_remote.bootstrap_worker_node(master_host: str, current_host: str, status_file: str = '/tmp/done.algo-1')

Bootstrap the worker nodes.

sagemaker.train.remote_function.runtime_environment.mpi_utils_remote.main(sys_args=None)

Entry point for the bootstrap script.

sagemaker.train.remote_function.runtime_environment.mpi_utils_remote.start_sshd_daemon()

Start the SSH daemon on the current node.

sagemaker.train.remote_function.runtime_environment.mpi_utils_remote.write_status_file_to_workers(worker_hosts: List[str], status_file: str = '/tmp/done.algo-1')

Write the status file to all worker nodes.
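The shared `status_file` default in `bootstrap_worker_node` and `write_status_file_to_workers` suggests a marker-file handshake between master and workers. The sketch below illustrates that general idea on a local filesystem only. It is an assumption about the pattern, not the module's implementation: the real function writes to remote worker hosts, and `write_status_file`, `wait_for_status_file`, and their timeout parameters are hypothetical names introduced here.

```python
import os
import time


def write_status_file(path: str) -> None:
    """Create an empty marker file at `path` (local stand-in for the remote write)."""
    with open(path, "w"):
        pass


def wait_for_status_file(path: str, timeout_s: float = 5.0, poll_s: float = 0.1) -> bool:
    """Poll until `path` exists or the timeout elapses.

    Returns True if the marker file was seen, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    # One final check so a file created during the last sleep is not missed.
    return os.path.exists(path)
```

In this sketch a worker would call `wait_for_status_file(status_file)` and block until the master signals completion by creating the file. Polling with a timeout keeps the worker from hanging forever if the signal never arrives.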