data.sampler package
Submodules
data.sampler.base_sampler module
- class data.sampler.base_sampler.BaseSampler(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
Sampler
Base class for standard and DataParallel Sampler.
Every subclass should implement the __iter__ method, providing a way to iterate over indices of dataset elements.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training mode or not. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
- get_indices() List[int] [source]
Returns a list of indices of dataset elements to iterate over.
- Note:
If repeated augmentation is enabled, then indices will be repeated.
- update_scales(epoch: int, is_master_node: bool = False, *args, **kwargs) None [source]
Helper function to update scales in each sampler; typically useful for variable-batch samplers.
Subclasses are expected to implement this function. By default, it does nothing.
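Example: a minimal sketch of a custom sampler built on BaseSampler. The import path follows this page; the subclass name and the skipping logic are hypothetical, and only __iter__ is required of subclasses:

    from typing import Iterator

    from data.sampler.base_sampler import BaseSampler

    class EveryOtherSampler(BaseSampler):
        # Hypothetical sampler that iterates over every other dataset index.
        def __iter__(self) -> Iterator[int]:
            # get_indices() already accounts for repeated augmentation, if enabled.
            indices = self.get_indices()
            yield from indices[::2]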
- class data.sampler.base_sampler.BaseSamplerDDP(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
Sampler
Base class for DistributedDataParallel Sampler.
Every subclass should implement the __iter__ method, providing a way to iterate over indices of dataset elements.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
- get_indices_rank_i() List[int] [source]
Returns a list of indices of dataset elements for each rank to iterate over.
- Note:
If repeated augmentation is enabled, then indices will be repeated.
If sharding is enabled, then each rank will process a subset of the dataset.
- update_scales(epoch: int, is_master_node: bool = False, *args, **kwargs) None [source]
Helper function to update scales in each sampler; typically useful for variable-batch samplers.
Subclasses are expected to implement this function. By default, it does nothing.
- data.sampler.base_sampler.get_batch_size_from_opts(opts: Namespace, is_training: bool = False) int [source]
Helper function to extract the batch size for training or validation/test.
- Parameters:
opts – Command-line arguments
is_training – Training or validation mode. Default: False
- Returns:
The batch size as an integer.
data.sampler.batch_sampler module
- class data.sampler.batch_sampler.BatchSampler(opts, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSampler
Standard batch sampler for data parallel. This sampler yields batches of a fixed batch size and spatial resolution.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- class data.sampler.batch_sampler.BatchSamplerDDP(opts, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSamplerDDP
DDP variant of BatchSampler
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
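Example: a usage sketch for BatchSampler. Because the sampler yields whole batches, it is assumed here to go into the batch_sampler slot of a PyTorch DataLoader; opts and dataset are placeholders for objects produced by the project's argument parser and dataset registry:

    from torch.utils.data import DataLoader

    from data.sampler.batch_sampler import BatchSampler

    # opts: argparse.Namespace from the project's argument parser (assumed).
    # dataset: a map-style dataset compatible with this sampler (assumed).
    sampler = BatchSampler(opts, n_data_samples=len(dataset), is_training=True)
    loader = DataLoader(dataset, batch_sampler=sampler, num_workers=4)

    for batch in loader:
        ...  # each batch has a fixed batch size and spatial resolution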
data.sampler.chain_sampler module
- class data.sampler.chain_sampler.ChainSampler(opts: Namespace, *args, **kwargs)[source]
Bases:
Sampler
This class is a wrapper for iterating over datasets of multiple or similar tasks, typically useful for multi-task training. task_name and sampler_config are two mandatory keys that allow us to use task-specific data samplers. For specifying batch sizes, we use train_batch_size0 and val_batch_size0 as keys for the training and validation sets, respectively. Note that batch sizes are scaled automatically depending on the number of GPUs.
- Parameters:
opts – Command-line arguments
data_samplers – dictionary containing different samplers
Example:
# Example yaml config for combining different samplers is given below.
# Please note that configuration for each sampler should start with - in chain_sampler.
sampler:
  name: "chain_sampler"
  chain_sampler_mode: "sequential"
  chain_sampler:
    - task_name: "segmentation"
      train_batch_size0: 10
      sampler_config:
        name: "variable_batch_sampler"
        use_shards: false
        num_repeats: 4
        truncated_repeat_aug_sampler: false
        vbs:
          crop_size_width: 512
          crop_size_height: 512
          max_n_scales: 25
          min_crop_size_width: 256
          max_crop_size_width: 768
          min_crop_size_height: 256
          max_crop_size_height: 768
          check_scale: 16
    - task_name: "classification"
      train_batch_size0: 20
      sampler_config:
        name: "batch_sampler"
        bs:
          crop_size_width: 512
          crop_size_height: 512
- classmethod add_arguments(parser: ArgumentParser) ArgumentParser [source]
Add arguments for chain sampler.
- classmethod build_chain_sampler(opts: Namespace, n_data_samples: Mapping[str, int], is_training: bool = False, *args, **kwargs) Mapping[str, Sampler] [source]
Build chain sampler from command-line arguments and the sampler registry.
- Parameters:
opts – Command-line arguments
n_data_samples – Mapping containing the task name and the number of dataset samples in the task-specific dataset
is_training – Training mode or not
- Returns:
A dictionary, sampler_dict, containing information about sampler name and module.
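Example: a sketch of building the chain sampler for the two tasks in the YAML example above. The sample counts are illustrative, and opts is assumed to carry that configuration:

    sampler_dict = ChainSampler.build_chain_sampler(
        opts,
        n_data_samples={"segmentation": 10000, "classification": 50000},
        is_training=True,
    )
    # sampler_dict maps sampler names to sampler modules, per the Returns note.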
- set_epoch(epoch: int) None [source]
Helper function to set the epoch in each sampler.
- Parameters:
epoch – Current epoch
- Returns:
Nothing
- update_scales(epoch: int, is_master_node: bool | None = False, *args, **kwargs) None [source]
Helper function to update scales in each sampler; typically useful for variable-batch samplers.
- Parameters:
epoch – Current epoch
is_master_node – Master node or not.
- Returns:
Nothing
- update_indices(new_indices: List[int]) None [source]
Update sample indices of the datasets with these new indices.
- Parameters:
new_indices – Filtered indices of the samples that need to be used in the next epoch.
- Returns:
Nothing
- Note:
This function is useful for sample-efficient training. It may be implemented in the future (depending on the use case).
data.sampler.multi_scale_sampler module
- class data.sampler.multi_scale_sampler.MultiScaleSampler(opts, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSampler
Multi-scale batch sampler for data parallel. This sampler yields batches of a fixed batch size, but each batch has a different spatial resolution.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- class data.sampler.multi_scale_sampler.MultiScaleSamplerDDP(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSamplerDDP
DDP version of MultiScaleSampler
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
data.sampler.utils module
- data.sampler.utils.image_batch_pairs(crop_size_w: int, crop_size_h: int, batch_size_gpu0: int, max_scales: float | None = 5, check_scale_div_factor: int | None = 32, min_crop_size_w: int | None = 160, max_crop_size_w: int | None = 320, min_crop_size_h: int | None = 160, max_crop_size_h: int | None = 320, *args, **kwargs) List[Tuple[int, int, int]] [source]
This function creates batch and image size pairs. For a given batch size and image size, different image sizes are generated, and the batch size is adjusted so that GPU memory is utilized efficiently.
- Parameters:
crop_size_w – Base image width (e.g., 224)
crop_size_h – Base image height (e.g., 224)
batch_size_gpu0 – Batch size on GPU 0 for base image
max_scales – Number of scales, i.e., how many image sizes to generate between the min and max scale factors. Default: 5
check_scale_div_factor – Check if image scales are divisible by this factor. Default: 32
min_crop_size_w – Min. crop size along width. Default: 160
max_crop_size_w – Max. crop size along width. Default: 320
min_crop_size_h – Min. crop size along height. Default: 160
max_crop_size_h – Max. crop size along height. Default: 320
- Returns:
A sorted list of tuples, each of the form (h, w, batch_size).
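Example: a usage sketch relying on the documented defaults for the optional arguments. The exact pairs depend on the implementation, so the values in the comment are illustrative only:

    from data.sampler.utils import image_batch_pairs

    pairs = image_batch_pairs(crop_size_w=224, crop_size_h=224, batch_size_gpu0=32)
    # Each entry is (h, w, batch_size); smaller crops can afford larger
    # batches for similar GPU memory, e.g. (160, 160, 62) ... (320, 320, 15).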
- data.sampler.utils.make_video_pairs(crop_size_h: int, crop_size_w: int, min_crop_size_h: int, max_crop_size_h: int, min_crop_size_w: int, max_crop_size_w: int, default_frames: int, max_scales: int | None = 5, check_scale_div_factor: int | None = 32, *args, **kwargs) List[Tuple[int, int, int]] [source]
This function creates pairs of the number of frames and spatial sizes for videos.
- Parameters:
crop_size_h – Base image height (e.g., 224)
crop_size_w – Base image width (e.g., 224)
min_crop_size_w – Min. crop size along width.
max_crop_size_w – Max. crop size along width.
min_crop_size_h – Min. crop size along height.
max_crop_size_h – Max. crop size along height.
default_frames – Default number of frames per clip in a video.
max_scales – Number of scales. Default: 5
check_scale_div_factor – Check if spatial scales are divisible by this factor. Default: 32.
- Returns:
A sorted list of tuples, each of the form (h, w, n_frames).
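Example: the analogous sketch for videos; the returned values are illustrative:

    from data.sampler.utils import make_video_pairs

    pairs = make_video_pairs(
        crop_size_h=224, crop_size_w=224,
        min_crop_size_h=160, max_crop_size_h=320,
        min_crop_size_w=160, max_crop_size_w=320,
        default_frames=8,
    )
    # Each entry is (h, w, n_frames), pairing a spatial size with a clip length.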
- data.sampler.utils.create_intervallic_integer_list(base_val: int | float, min_val: float, max_val: float, num_scales: int | None = 5, scale_div_factor: int | None = 1) List[int] [source]
This function creates a list of num_scales integer values by scaling base_val between min_val and max_val.
- Parameters:
base_val – The base value to scale.
min_val – The lower end of the range.
max_val – The higher end of the range.
num_scales – Number of scaled values to generate.
scale_div_factor – Check if scaled values are divisible by this factor.
- Returns:
A sorted list of integers.
- data.sampler.utils.make_tuple_list(*val_list: List) List[Tuple] [source]
Convert a list of lists into a list of tuples, where the ith element of each input list is placed in the ith tuple of the returned list.
For example, [[1, 2], [3, 4], [5, 6]] is converted to [(1, 3, 5), (2, 4, 6)].
- Parameters:
val_list – m lists, where each is a list of n values.
- Returns:
A list of size n, where each element is a tuple of m values.
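Example: the documented behavior matches Python's built-in zip over the input lists:

    rows = [[1, 2], [3, 4], [5, 6]]
    print(list(zip(*rows)))  # [(1, 3, 5), (2, 4, 6)]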
data.sampler.variable_batch_sampler module
- class data.sampler.variable_batch_sampler.VariableBatchSampler(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSampler
Variable-size multi-scale batch sampler (https://arxiv.org/abs/2110.02178) for data parallel. This sampler yields batches with variable spatial resolutions and batch sizes.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
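Example: a sketch of an epoch loop that refreshes the variable (resolution, batch size) pairs via the inherited update_scales; opts and dataset are placeholders for objects from the surrounding training setup:

    from torch.utils.data import DataLoader

    from data.sampler.variable_batch_sampler import VariableBatchSampler

    sampler = VariableBatchSampler(opts, n_data_samples=len(dataset), is_training=True)
    loader = DataLoader(dataset, batch_sampler=sampler, num_workers=4)

    num_epochs = 50  # illustrative
    for epoch in range(num_epochs):
        # Re-sample the (resolution, batch size) pairs for this epoch.
        sampler.update_scales(epoch=epoch, is_master_node=True)
        for batch in loader:
            ...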
- class data.sampler.variable_batch_sampler.VariableBatchSamplerDDP(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSamplerDDP
DDP version of VariableBatchSampler
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
Module contents
- data.sampler.build_sampler(opts: Namespace, n_data_samples: int | Mapping[str, int], is_training: bool = False, get_item_metadata: Callable[[int], Dict] | None = None, *args, **kwargs) Sampler [source]
Helper function to build a data sampler from command-line arguments.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of data samples. It can be an integer specifying the number of data samples for a given task, or a mapping from task name to the number of samples per task in the case of a chain sampler.
get_item_metadata – A callable that provides sample metadata, given a sample index.
is_training – Training mode or not. Defaults to False.
- Returns:
Data sampler over which we can iterate.
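Example: two calling sketches matching the documented single-task and chain-sampler conventions; opts is assumed to come from the project's argument parser:

    from data.sampler import build_sampler

    # Single task: n_data_samples is an integer.
    train_sampler = build_sampler(opts, n_data_samples=50000, is_training=True)

    # Chain sampler: n_data_samples maps task names to dataset sizes.
    multi_task_sampler = build_sampler(
        opts,
        n_data_samples={"segmentation": 10000, "classification": 50000},
        is_training=True,
    )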