sagemaker.core.local.data#

Data sources, splitters, and batch strategies for reading local input data.

Functions

get_batch_strategy_instance(strategy, splitter)

Return an instance of sagemaker.core.local.data.BatchStrategy according to strategy.

get_data_source_instance(data_source, ...)

Return an instance of sagemaker.core.local.data.DataSource.

get_splitter_instance(split_type)

Return an instance of sagemaker.core.local.data.Splitter.

Classes

BatchStrategy(splitter)

Base class for strategies that group records into payloads for batch inference.

DataSource()

Base class for data sources that expose a root directory and a list of files.

LineSplitter()

Split records by new line.

LocalFileDataSource(root_path)

Represents a data source within the local filesystem.

MultiRecordStrategy(splitter)

Feed multiple records at a time for batch inference.

NoneSplitter()

Does not split records, essentially reads the whole file.

RecordIOSplitter()

Split using Amazon RecordIO.

S3DataSource(bucket, prefix, sagemaker_session)

Defines a data source given by a bucket and S3 prefix.

SingleRecordStrategy(splitter)

Feed a single record at a time for batch inference.

Splitter()

Base class for splitters that break a file into records.

class sagemaker.core.local.data.BatchStrategy(splitter)[source]#

Bases: object

Base class for strategies that group records into payloads for batch inference.

abstract pad(file, size)[source]#

Group together as many records as possible to fit in the specified size.

Parameters:
  • file (str) – file path to read the records from.

  • size (int) – maximum size, in MB, of each group of records. Passing 0 means unlimited size.

Returns:

generator of records

class sagemaker.core.local.data.DataSource[source]#

Bases: object

Base class for data sources that expose a root directory and a list of files.

abstract get_file_list()[source]#

Retrieve the list of absolute paths to all the files in this data source.

Returns:

List of absolute paths.

Return type:

List[str]

abstract get_root_dir()[source]#

Retrieve the absolute path to the root directory of this data source.

Returns:

absolute path to the root directory of this data source.

Return type:

str

class sagemaker.core.local.data.LineSplitter[source]#

Bases: Splitter

Split records by new line.

split(file)[source]#

Split a file into records using a specific strategy.

This LineSplitter splits the file on each line break.

Parameters:

file (str) – path to the file to split

Returns:

generator for the individual records that were split from the file
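The line-by-line behaviour described above can be sketched without the SDK. This is a minimal illustration, not the library's implementation: `split_lines` is a hypothetical helper, and keeping the trailing newline on each record is an assumption.

```python
import os
import tempfile


def split_lines(path):
    """Yield each line of the file at ``path`` as one record (hypothetical helper)."""
    with open(path, "r") as handle:
        for line in handle:
            # Assumption: a record keeps its trailing newline.
            yield line


# Write a small three-record file, then split it into records.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2\n3,4\n5,6\n")

records = list(split_lines(tmp.name))
os.remove(tmp.name)
print(records)
```

Because the function is a generator, records are produced lazily and the file is never held in memory as a whole.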

class sagemaker.core.local.data.LocalFileDataSource(root_path)[source]#

Bases: DataSource

Represents a data source within the local filesystem.

get_file_list()[source]#

Retrieve the list of absolute paths to all the files in this data source.

Returns:

List of absolute paths.

Return type:

List[str]

get_root_dir()[source]#

Retrieve the absolute path to the root directory of this data source.

Returns:

absolute path to the root directory of this data source.

Return type:

str
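To make the two-method contract concrete, here is a rough sketch of a directory-backed data source. `SketchDataSource` and `DirectorySource` are hypothetical stand-ins written for this example; they are not the SDK classes, and the choice to list only regular files directly under the root is an assumption.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class SketchDataSource(ABC):
    """Hypothetical stand-in for the abstract data source interface."""

    @abstractmethod
    def get_file_list(self):
        """Return absolute paths to all files in this data source."""

    @abstractmethod
    def get_root_dir(self):
        """Return the absolute path to the root directory."""


class DirectorySource(SketchDataSource):
    """Hypothetical data source backed by a local directory."""

    def __init__(self, root_path):
        self.root_path = os.path.abspath(root_path)

    def get_file_list(self):
        # Assumption: only regular files directly under the root are listed.
        return sorted(
            os.path.join(self.root_path, name)
            for name in os.listdir(self.root_path)
            if os.path.isfile(os.path.join(self.root_path, name))
        )

    def get_root_dir(self):
        return self.root_path


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("1,2\n")
    source = DirectorySource(tmp)
    file_names = [os.path.basename(p) for p in source.get_file_list()]
    all_absolute = all(os.path.isabs(p) for p in source.get_file_list())

print(file_names, all_absolute)
```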

class sagemaker.core.local.data.MultiRecordStrategy(splitter)[source]#

Bases: BatchStrategy

Feed multiple records at a time for batch inference.

Groups as many records as possible within the specified payload size.

pad(file, size=6)[source]#

Group together as many records as possible to fit in the specified size.

Parameters:
  • file (str) – file path to read the records from.

  • size (int) – maximum size, in MB, of each group of records. Passing 0 means unlimited size.

Returns:

generator of records
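The grouping described above can be sketched as a greedy buffer that grows until the next record would exceed the limit. This is an illustration under stated assumptions rather than the library's code: `group_records` is a hypothetical function, 1 MB is counted as 1024 * 1024 bytes, and raising RuntimeError for a single oversized record is a choice made for this sketch.

```python
MB = 1024 * 1024  # assumption: 1 MB is counted as 1024 * 1024 bytes


def group_records(records, size_mb=6):
    """Yield concatenated records, each group at most ``size_mb`` MB (0 = unlimited)."""

    def fits(payload):
        return size_mb == 0 or len(payload.encode("utf-8")) <= size_mb * MB

    buffer = ""
    for record in records:
        if fits(buffer + record):
            buffer += record  # still room: keep growing the current group
            continue
        if buffer:
            yield buffer  # current group is full: emit it
        if not fits(record):
            # Assumption: a record that can never fit is an error.
            raise RuntimeError("record exceeds the maximum payload size")
        buffer = record  # start a new group with the record that did not fit
    if buffer:
        yield buffer


# Five records of roughly 0.4 MB each, grouped under a 1 MB limit.
records = ["x" * (400 * 1024) + "\n" for _ in range(5)]
groups = list(group_records(records, size_mb=1))
print([len(group) for group in groups])
```

With these inputs two records fit in each group and the fifth is emitted on its own; passing `size_mb=0` produces a single group containing every record.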

class sagemaker.core.local.data.NoneSplitter[source]#

Bases: Splitter

Does not split records, essentially reads the whole file.

split(filename)[source]#

Split a file into records using a specific strategy.

For this NoneSplitter there is no actual split happening and the file is returned as a whole.

Parameters:

filename (str) – path to the file to split

Returns:

generator for the individual records that were split from the file

class sagemaker.core.local.data.RecordIOSplitter[source]#

Bases: Splitter

Split using Amazon RecordIO.

Not useful for string content.

Note: This class depends on the deprecated sagemaker.core.amazon module and is no longer functional.

split(file)[source]#

Split a file into records using a specific strategy.

This RecordIOSplitter splits the data into individual RecordIO records.

Parameters:

file (str) – path to the file to split

Returns:

generator for the individual records that were split from the file

Raises:

NotImplementedError – This functionality has been removed because the sagemaker.core.amazon module is deprecated.

class sagemaker.core.local.data.S3DataSource(bucket, prefix, sagemaker_session)[source]#

Bases: DataSource

Defines a data source given by a bucket and S3 prefix.

The contents will be downloaded and then processed as local data.

get_file_list()[source]#

Retrieve the list of absolute paths to all the files in this data source.

Returns:

List of absolute paths.

Return type:

List[str]

get_root_dir()[source]#

Retrieve the absolute path to the root directory of this data source.

Returns:

absolute path to the root directory of this data source.

Return type:

str

class sagemaker.core.local.data.SingleRecordStrategy(splitter)[source]#

Bases: BatchStrategy

Feed a single record at a time for batch inference.

If a single record does not fit within the specified payload size, a RuntimeError is raised.

pad(file, size=6)[source]#

Group together as many records as possible to fit in the specified size.

This SingleRecordStrategy will not group any record and will return them one by one as long as they are within the maximum size.

Parameters:
  • file (str) – file path to read the records from.

  • size (int) – maximum size, in MB, of each group of records. Passing 0 means unlimited size.

Returns:

generator of records
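The one-record-at-a-time behaviour, including the RuntimeError for an oversized record, might look roughly like the sketch below. `one_by_one` is a hypothetical function, and counting 1 MB as 1024 * 1024 bytes is an assumption; the SDK may measure payload size differently.

```python
MB = 1024 * 1024  # assumption: 1 MB is counted as 1024 * 1024 bytes


def one_by_one(records, size_mb=6):
    """Yield records individually, rejecting any that exceed ``size_mb`` MB."""
    for record in records:
        too_big = size_mb != 0 and len(record.encode("utf-8")) > size_mb * MB
        if too_big:
            raise RuntimeError("record exceeds the maximum payload size")
        yield record


# Small records pass through unchanged.
small = ["a,b\n", "c,d\n"]
emitted = list(one_by_one(small, size_mb=1))

# A 2 MB record does not fit in a 1 MB payload.
oversized = ["y" * (2 * MB)]
try:
    list(one_by_one(oversized, size_mb=1))
    failed = False
except RuntimeError:
    failed = True

print(emitted, failed)
```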

class sagemaker.core.local.data.Splitter[source]#

Bases: object

Base class for splitters that break a file into records.

abstract split(file)[source]#

Split a file into records using a specific strategy.

Parameters:

file (str) – path to the file to split

Returns:

generator for the individual records that were split from the file

sagemaker.core.local.data.get_batch_strategy_instance(strategy, splitter)[source]#

Return an instance of sagemaker.core.local.data.BatchStrategy according to strategy.

Parameters:
  • strategy (str) – Either ‘SingleRecord’ or ‘MultiRecord’.

  • splitter (sagemaker.core.local.data.Splitter) – splitter to get the data from.

Returns:

an instance of a BatchStrategy

Return type:

sagemaker.core.local.data.BatchStrategy

sagemaker.core.local.data.get_data_source_instance(data_source, sagemaker_session)[source]#

Return an instance of sagemaker.core.local.data.DataSource.

The instance can handle the provided data_source URI, which must use either the file:// or the s3:// scheme.

Parameters:
  • data_source (str) – a valid URI that points to a data source.

  • sagemaker_session (sagemaker.core.helper.session.Session) – a SageMaker Session to interact with S3 if required.

Returns:

an instance of a DataSource

Return type:

sagemaker.core.local.data.DataSource

Raises:

ValueError – If the scheme of the parsed data_source URI is neither file nor s3.
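The scheme-based dispatch described above can be illustrated with the standard library's `urlparse`. This sketch only classifies the URI and returns plain tuples; `describe_data_source` and its return shape are hypothetical, and no SDK objects or S3 calls are involved.

```python
from urllib.parse import urlparse


def describe_data_source(data_source):
    """Classify a data source URI by scheme (hypothetical helper)."""
    parsed = urlparse(data_source)
    if parsed.scheme == "file":
        # Local path: netloc is empty for file:///abs/path style URIs.
        return ("local", parsed.netloc + parsed.path)
    if parsed.scheme == "s3":
        # The bucket is the netloc; the prefix is the path without its leading slash.
        return ("s3", parsed.netloc, parsed.path.lstrip("/"))
    raise ValueError(
        "data_source must be either file or s3. parsed_uri.scheme: %s" % parsed.scheme
    )


local = describe_data_source("file:///tmp/data")
remote = describe_data_source("s3://my-bucket/prefix/data")
print(local, remote)
```

Any other scheme, such as `ftp://`, falls through to the ValueError branch.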

sagemaker.core.local.data.get_splitter_instance(split_type)[source]#

Return an instance of sagemaker.core.local.data.Splitter.

The returned instance is chosen according to the specified split_type.

Parameters:

split_type (str) – either ‘Line’ or ‘RecordIO’. Can be left as None to signal that no data split will happen.

Returns:

an instance of a Splitter

Return type:

sagemaker.core.local.data.Splitter
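The mapping from split_type to a splitter can be sketched as a small lookup table. Everything here is a hypothetical stand-in: the splitters work on in-memory strings rather than files, ‘RecordIO’ is left out because the corresponding class is documented as no longer functional, and raising ValueError for an unknown split_type is an assumption.

```python
def split_none(text):
    """Yield the whole text as a single record."""
    yield text


def split_line(text):
    """Yield one record per line, keeping line endings."""
    yield from text.splitlines(keepends=True)


# Hypothetical lookup table from split_type to a splitter function.
SPLITTERS = {None: split_none, "Line": split_line}


def pick_splitter(split_type):
    """Return the splitter registered for ``split_type``."""
    try:
        return SPLITTERS[split_type]
    except KeyError:
        # Assumption: an unknown split_type is reported as a ValueError.
        raise ValueError("Invalid split type: %s" % split_type) from None


by_line = list(pick_splitter("Line")("a\nb\n"))
whole = list(pick_splitter(None)("a\nb\n"))
print(by_line, whole)
```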