sagemaker.core.processing#

This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation and interpretation on Amazon SageMaker.

Functions

logs_for_processing_job(sagemaker_session, ...)

Display logs for a given processing job, optionally tailing them until the job is complete.

Classes

FeatureStoreOutput(**kwargs)

Configuration for processing job outputs in Amazon SageMaker Feature Store.

FrameworkProcessor(image_uri[, role, ...])

Handles Amazon SageMaker processing tasks using ModelTrainer for code packaging.

Processor([role, image_uri, instance_count, ...])

Handles Amazon SageMaker Processing tasks.

ScriptProcessor([role, image_uri, command, ...])

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

class sagemaker.core.processing.FeatureStoreOutput(**kwargs)[source]#

Bases: ApiObject

Configuration for processing job outputs in Amazon SageMaker Feature Store.

feature_group_name: str | None = None#
class sagemaker.core.processing.FrameworkProcessor(image_uri: str | PipelineVariable, role: str | PipelineVariable | None = None, instance_count: int | PipelineVariable | None = None, instance_type: str | PipelineVariable | None = None, command: List[str] | None = None, volume_size_in_gb: int | PipelineVariable = 30, volume_kms_key: str | PipelineVariable | None = None, output_kms_key: str | PipelineVariable | None = None, code_location: str | None = None, max_runtime_in_seconds: int | PipelineVariable | None = None, base_job_name: str | None = None, sagemaker_session: Session | None = None, env: Dict[str, str | PipelineVariable] | None = None, tags: List[Dict[str, str | PipelineVariable]] | Dict[str, str | PipelineVariable] | None = None, network_config: NetworkConfig | None = None)[source]#

Bases: ScriptProcessor

Handles Amazon SageMaker processing tasks using ModelTrainer for code packaging.

framework_entrypoint_command = ['/bin/bash']#
run(code: str, source_dir: str | None = None, requirements: str | None = None, inputs: List[ProcessingInput] | None = None, outputs: List[ProcessingOutput] | None = None, arguments: List[str | PipelineVariable] | None = None, wait: bool = True, logs: bool = True, job_name: str | None = None, experiment_config: Dict[str, str] | None = None, kms_key: str | None = None)[source]#

Runs a processing job.

Parameters:
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • source_dir (str) – Path (absolute, relative or an S3 URI) to a directory with any other processing source code dependencies aside from the entry point file (default: None).

  • requirements (str) – Path to a requirements.txt file relative to source_dir (default: None).

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

Returns:

None, or pipeline step arguments if the Processor instance is built with a PipelineSession.
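The job_name description says a default name is derived from the base job name and the current timestamp. The following is a minimal sketch of one way such a name could be built. The helper `default_job_name`, the timestamp layout, and the character clean-up are assumptions made for illustration and are not taken from the library; the 63-character cap is assumed from the CreateProcessingJob API limit.

```python
import re
from datetime import datetime, timezone
from typing import Optional

MAX_JOB_NAME_LENGTH = 63  # assumed limit for a processing job name


def default_job_name(base_job_name: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: combine a base name with a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    # Replace anything other than letters, digits, and hyphens with a hyphen.
    safe_base = re.sub(r"[^a-zA-Z0-9-]", "-", base_job_name).strip("-")
    # Timestamp with millisecond precision, e.g. 2024-01-02-03-04-05-678.
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S-%f")[:-3]
    # Truncate the base so that base + "-" + stamp fits within the limit.
    max_base = MAX_JOB_NAME_LENGTH - len(stamp) - 1
    return f"{safe_base[:max_base]}-{stamp}"
```

Passing an explicit job_name to run bypasses any generated default, which is useful when a job has to be looked up later by a known name.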

class sagemaker.core.processing.Processor(role: str | None = None, image_uri: str | PipelineVariable | None = None, instance_count: int | PipelineVariable | None = None, instance_type: str | PipelineVariable | None = None, entrypoint: List[str | PipelineVariable] | None = None, volume_size_in_gb: int | PipelineVariable = 30, volume_kms_key: str | PipelineVariable | None = None, output_kms_key: str | PipelineVariable | None = None, max_runtime_in_seconds: int | PipelineVariable | None = None, base_job_name: str | None = None, sagemaker_session: Session | None = None, env: Dict[str, str | PipelineVariable] | None = None, tags: List[Dict[str, str | PipelineVariable]] | Dict[str, str | PipelineVariable] | None = None, network_config: NetworkConfig | None = None)[source]#

Bases: object

Handles Amazon SageMaker Processing tasks.

JOB_CLASS_NAME = 'processing-job'#
run(inputs: List[ProcessingInput] | None = None, outputs: List[ProcessingOutput] | None = None, arguments: List[str | PipelineVariable] | None = None, wait: bool = True, logs: bool = True, job_name: str | None = None, experiment_config: Dict[str, str] | None = None, kms_key: str | None = None)[source]#

Runs a processing job.

Parameters:
  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

      • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
      • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
      • If both ExperimentName and TrialName are not supplied, the trial component will be unassociated.
      • TrialComponentDisplayName is used for display in Studio.
      • Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

Returns:

None, or pipeline step arguments if the Processor instance is built with a PipelineSession.

Raises:

ValueError – if logs is True but wait is False.
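The rule under Raises can be expressed as a small standalone check. This is a sketch of the documented behavior only: `check_wait_and_logs` is a hypothetical helper and the error message is illustrative, not the library's wording.

```python
def check_wait_and_logs(wait: bool = True, logs: bool = True) -> None:
    """Hypothetical check mirroring the documented rule for run()."""
    # Logs are streamed while the call blocks, so logs=True needs wait=True.
    if logs and not wait:
        raise ValueError(
            "logs=True requires wait=True; pass logs=False to start the job "
            "without blocking."
        )


# Valid combinations pass silently.
check_wait_and_logs(wait=True, logs=True)
check_wait_and_logs(wait=True, logs=False)
check_wait_and_logs(wait=False, logs=False)
```

In practice this means a non-blocking call should pass both wait=False and logs=False, because logs defaults to True.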

class sagemaker.core.processing.ScriptProcessor(role: str | PipelineVariable | None = None, image_uri: str | PipelineVariable | None = None, command: List[str] | None = None, instance_count: int | PipelineVariable | None = None, instance_type: str | PipelineVariable | None = None, volume_size_in_gb: int | PipelineVariable = 30, volume_kms_key: str | PipelineVariable | None = None, output_kms_key: str | PipelineVariable | None = None, max_runtime_in_seconds: int | PipelineVariable | None = None, base_job_name: str | None = None, sagemaker_session: Session | None = None, env: Dict[str, str | PipelineVariable] | None = None, tags: List[Dict[str, str | PipelineVariable]] | Dict[str, str | PipelineVariable] | None = None, network_config: NetworkConfig | None = None)[source]#

Bases: Processor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
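A minimal usage sketch follows, using only the keyword names shown in the ScriptProcessor signature and its run method. It takes the processor class as an argument so that it can run offline: `launch_script_job` and `RecordingProcessor` are hypothetical names, and the role ARN, image URI, instance type, and script name are placeholders. With real credentials, `sagemaker.core.processing.ScriptProcessor` would be passed as `processor_cls`. The sketch assumes ScriptProcessor.run applies the same wait/logs rule documented for Processor.run.

```python
from typing import Any, Dict, List, Optional, Type


def launch_script_job(
    processor_cls: Type[Any],
    role: str,
    image_uri: str,
    code: str,
    arguments: Optional[List[str]] = None,
) -> Any:
    """Construct a processor and start a job without blocking on completion."""
    processor = processor_cls(
        role=role,
        image_uri=image_uri,
        command=["python3"],           # interpreter used to run `code`
        instance_count=1,
        instance_type="ml.m5.xlarge",  # illustrative choice only
        base_job_name="example-preprocess",
    )
    # logs=False accompanies wait=False (assumed to match Processor.run).
    processor.run(code=code, arguments=arguments, wait=False, logs=False)
    return processor


class RecordingProcessor:
    """Stand-in that records keyword arguments so the sketch runs offline."""

    def __init__(self, **kwargs: Any) -> None:
        self.init_kwargs: Dict[str, Any] = kwargs
        self.run_kwargs: Dict[str, Any] = {}

    def run(self, **kwargs: Any) -> None:
        self.run_kwargs = kwargs


demo = launch_script_job(
    RecordingProcessor,
    role="arn:aws:iam::111122223333:role/ExampleRole",  # placeholder ARN
    image_uri="<your-image-uri>",                       # placeholder
    code="preprocess.py",                               # placeholder script
    arguments=["--split", "0.2"],
)
```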

run(code: str, inputs: List[ProcessingInput] | None = None, outputs: List[ProcessingOutput] | None = None, arguments: List[str | PipelineVariable] | None = None, wait: bool = True, logs: bool = True, job_name: str | None = None, experiment_config: Dict[str, str] | None = None, kms_key: str | None = None)[source]#

Runs a processing job.

Parameters:
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

      • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
      • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
      • If both ExperimentName and TrialName are not supplied, the trial component will be unassociated.
      • TrialComponentDisplayName is used for display in Studio.
      • Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

Returns:

None, or pipeline step arguments if the Processor instance is built with a PipelineSession.
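The experiment_config rules can be easier to follow as a decision function. The sketch below encodes only the cases the parameter description states. `resolve_experiment_config` and its return labels are hypothetical, and the case where TrialName names a Trial that does not exist is not covered by the description, so it is not modeled here.

```python
from typing import Dict, Optional


def resolve_experiment_config(
    experiment_config: Optional[Dict[str, str]],
    built_with_pipeline_session: bool = False,
) -> Dict[str, Optional[str]]:
    """Hypothetical summary of how the documented keys are interpreted."""
    config = dict(experiment_config or {})
    if built_with_pipeline_session:
        # ExperimentName and TrialName are ignored with a PipelineSession.
        config.pop("ExperimentName", None)
        config.pop("TrialName", None)

    if config.get("TrialName"):
        association = "existing-trial"      # assumes the Trial already exists
    elif config.get("ExperimentName"):
        association = "auto-created-trial"  # a Trial is created automatically
    else:
        association = "unassociated"

    return {
        "association": association,
        # The display name is honored in every case, including pipelines.
        "display_name": config.get("TrialComponentDisplayName"),
    }
```

For example, a config containing only ExperimentName maps to "auto-created-trial", while the same config on a processor built with PipelineSession maps to "unassociated".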

sagemaker.core.processing.logs_for_processing_job(sagemaker_session, job_name, wait=False, poll=10)[source]#

Display logs for a given processing job, optionally tailing them until the job is complete.

Parameters:
  • job_name (str) – Name of the processing job to display the logs for.

  • wait (bool) – Whether to keep looking for new log entries until the job completes (default: False).

  • poll (int) – The interval in seconds between polling for new log entries and job completion (default: 10).

Raises:

ValueError – If the processing job fails.
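The wait and poll parameters describe a polling loop: fetch new log entries, check the job status, sleep, and repeat until the job reaches a terminal state. The sketch below shows that pattern with injected callables instead of AWS calls. `tail_until_done`, `get_status`, and `get_new_lines` are hypothetical names, and this is not the library's implementation; the status strings are assumed to follow the processing job statuses reported by the SageMaker API.

```python
import time
from typing import Callable, List

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def tail_until_done(
    get_status: Callable[[], str],
    get_new_lines: Callable[[], List[str]],
    wait: bool = False,
    poll: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Collect log lines, polling every `poll` seconds while `wait` is True."""
    lines: List[str] = []
    while True:
        lines.extend(get_new_lines())  # fetch entries not seen before
        status = get_status()
        # Stop after one pass when not waiting, or once the job has finished.
        if not wait or status in TERMINAL_STATUSES:
            break
        sleep(poll)
    if status == "Failed":
        raise ValueError("Processing job failed")
    return lines
```

Injecting `sleep` keeps the loop testable without real delays; a caller can pass a no-op or a recorder in tests and leave the default in normal use.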