data.sampler package
Submodules
data.sampler.base_sampler module
- class data.sampler.base_sampler.BaseSampler(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
Sampler
Base class for standard and DataParallel Sampler.
Every subclass should implement the __iter__ method, providing a way to iterate over indices of dataset elements.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training mode or not. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
- get_indices() List[int] [source]
Returns a list of indices of dataset elements to iterate over.
- Note:
If repeated augmentation is enabled, then indices will be repeated.
- update_scales(epoch: int, is_master_node: bool = False, *args, **kwargs) None [source]
Helper function to update scales in each sampler; typically useful for variable-batch samplers.
Subclasses are expected to implement this function. By default, it does nothing.
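Example: a minimal sketch of a custom sampler built on BaseSampler. The import path follows this page; the subclass name and the skipping logic are hypothetical, and only __iter__ is required of subclasses:

    from typing import Iterator

    from data.sampler.base_sampler import BaseSampler

    class EveryOtherSampler(BaseSampler):
        # Hypothetical sampler that iterates over every other dataset index.
        def __iter__(self) -> Iterator[int]:
            # get_indices() already accounts for repeated augmentation, if enabled.
            indices = self.get_indices()
            yield from indices[::2]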
- class data.sampler.base_sampler.BaseSamplerDDP(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
Sampler
Base class for DistributedDataParallel Sampler.
Every subclass should implement the __iter__ method, providing a way to iterate over indices of dataset elements.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
- get_indices_rank_i() List[int] [source]
Returns a list of indices of dataset elements for each rank to iterate over.
- Note:
If repeated augmentation is enabled, then indices will be repeated.
If sharding is enabled, then each rank will process a subset of the dataset.
- update_scales(epoch: int, is_master_node: bool = False, *args, **kwargs) None [source]
Helper function to update scales in each sampler; typically useful for variable-batch samplers.
Subclasses are expected to implement this function. By default, it does nothing.
- data.sampler.base_sampler.get_batch_size_from_opts(opts: Namespace, is_training: bool = False) int [source]
Helper function to extract the batch size for training or validation/test.
- Parameters:
opts – Command-line arguments
is_training – Training or validation mode. Default: False
- Returns:
The batch size as an integer.
data.sampler.batch_sampler module
- class data.sampler.batch_sampler.BatchSampler(opts, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSampler
Standard batch sampler for data parallel. This sampler yields batches of a fixed batch size and spatial resolution.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- class data.sampler.batch_sampler.BatchSamplerDDP(opts, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSamplerDDP
DDP variant of BatchSampler
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
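Example: a usage sketch for BatchSampler. Because the sampler yields whole batches, it is assumed here to go into the batch_sampler slot of a PyTorch DataLoader; opts and dataset are placeholders for objects produced by the project's argument parser and dataset registry:

    from torch.utils.data import DataLoader

    from data.sampler.batch_sampler import BatchSampler

    # opts: argparse.Namespace from the project's argument parser (assumed).
    # dataset: a map-style dataset compatible with this sampler (assumed).
    sampler = BatchSampler(opts, n_data_samples=len(dataset), is_training=True)
    loader = DataLoader(dataset, batch_sampler=sampler, num_workers=4)

    for batch in loader:
        ...  # each batch has a fixed batch size and spatial resolution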
data.sampler.chain_sampler module
- class data.sampler.chain_sampler.ChainSampler(opts: Namespace, *args, **kwargs)[source]
Bases:
Sampler
This class is a wrapper for iterating over datasets of multiple or similar tasks, typically useful for multi-task training. task_name and sampler_config are two mandatory keys that allow us to use task-specific data samplers. For specifying batch sizes, we use train_batch_size0 and val_batch_size0 as keys for the training and validation sets, respectively. Note that batch sizes are scaled automatically depending on the number of GPUs.
- Parameters:
opts – Command-line arguments
data_samplers – dictionary containing different samplers
Example:
# Example yaml config for combining different samplers is given below.
# Please note that configuration for each sampler should start with - in chain_sampler.
sampler:
  name: "chain_sampler"
  chain_sampler_mode: "sequential"
  chain_sampler:
    - task_name: "segmentation"
      train_batch_size0: 10
      sampler_config:
        name: "variable_batch_sampler"
        use_shards: false
        num_repeats: 4
        truncated_repeat_aug_sampler: false
        vbs:
          crop_size_width: 512
          crop_size_height: 512
          max_n_scales: 25
          min_crop_size_width: 256
          max_crop_size_width: 768
          min_crop_size_height: 256
          max_crop_size_height: 768
          check_scale: 16
    - task_name: "classification"
      train_batch_size0: 20
      sampler_config:
        name: "batch_sampler"
        bs:
          crop_size_width: 512
          crop_size_height: 512
- classmethod add_arguments(parser: ArgumentParser) ArgumentParser [source]
Add arguments for chain sampler.
- classmethod build_chain_sampler(opts: Namespace, n_data_samples: Mapping[str, int], is_training: bool = False, *args, **kwargs) Mapping[str, Sampler] [source]
Build chain sampler from command-line arguments and the sampler registry.
- Parameters:
opts – Command-line arguments
n_data_samples – Mapping containing the task name and the number of dataset samples in the task-specific dataset
is_training – Training mode or not
- Returns:
A dictionary, sampler_dict, containing information about sampler name and module.
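Example: a sketch of building the chain sampler for the two tasks in the YAML example above. The sample counts are illustrative, and opts is assumed to carry that configuration:

    sampler_dict = ChainSampler.build_chain_sampler(
        opts,
        n_data_samples={"segmentation": 10000, "classification": 50000},
        is_training=True,
    )
    # sampler_dict maps sampler names to sampler modules, per the Returns note.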
- set_epoch(epoch: int) None [source]
Helper function to set the epoch in each sampler.
- Parameters:
epoch – Current epoch
- Returns:
Nothing
- update_scales(epoch: int, is_master_node: bool | None = False, *args, **kwargs) None [source]
Helper function to update scales in each sampler; typically useful for variable-batch samplers.
- Parameters:
epoch – Current epoch
is_master_node – Master node or not.
- Returns:
Nothing
- update_indices(new_indices: List[int]) None [source]
Update sample indices of the datasets with these new indices.
- Parameters:
new_indices – Filtered indices of the samples that need to be used in the next epoch.
- Returns:
Nothing
- Note:
This function is useful for sample-efficient training. It may be implemented in the future (depending on the use case).
data.sampler.multi_scale_sampler module
- class data.sampler.multi_scale_sampler.MultiScaleSampler(opts, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSampler
Multi-scale batch sampler for data parallel. This sampler yields batches of a fixed batch size, but each batch has a different spatial resolution.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- class data.sampler.multi_scale_sampler.MultiScaleSamplerDDP(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSamplerDDP
DDP version of MultiScaleSampler
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
data.sampler.utils module
- data.sampler.utils.image_batch_pairs(crop_size_w: int, crop_size_h: int, batch_size_gpu0: int, max_scales: float | None = 5, check_scale_div_factor: int | None = 32, min_crop_size_w: int | None = 160, max_crop_size_w: int | None = 320, min_crop_size_h: int | None = 160, max_crop_size_h: int | None = 320, *args, **kwargs) List[Tuple[int, int, int]] [source]
This function creates batch and image size pairs. For a given batch size and image size, different image sizes are generated, and the batch size is adjusted so that GPU memory is utilized efficiently.
- Parameters:
crop_size_w – Base image width (e.g., 224)
crop_size_h – Base image height (e.g., 224)
batch_size_gpu0 – Batch size on GPU 0 for base image
max_scales – Number of scales, i.e., how many image sizes to generate between the min and max scale factors. Default: 5
check_scale_div_factor – Check if image scales are divisible by this factor. Default: 32
min_crop_size_w – Min. crop size along width. Default: 160
max_crop_size_w – Max. crop size along width. Default: 320
min_crop_size_h – Min. crop size along height. Default: 160
max_crop_size_h – Max. crop size along height. Default: 320
- Returns:
A sorted list of tuples, each of the form (h, w, batch_size).
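Example: a usage sketch relying on the documented defaults for the optional arguments. The exact pairs depend on the implementation, so the values in the comment are illustrative only:

    from data.sampler.utils import image_batch_pairs

    pairs = image_batch_pairs(crop_size_w=224, crop_size_h=224, batch_size_gpu0=32)
    # Each entry is (h, w, batch_size); smaller crops can afford larger
    # batches for similar GPU memory, e.g. (160, 160, 62) ... (320, 320, 15).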
- data.sampler.utils.make_video_pairs(crop_size_h: int, crop_size_w: int, min_crop_size_h: int, max_crop_size_h: int, min_crop_size_w: int, max_crop_size_w: int, default_frames: int, max_scales: int | None = 5, check_scale_div_factor: int | None = 32, *args, **kwargs) List[Tuple[int, int, int]] [source]
This function creates pairs of the number of frames and spatial sizes for videos.
- Parameters:
crop_size_h – Base image height (e.g., 224)
crop_size_w – Base image width (e.g., 224)
min_crop_size_w – Min. crop size along width.
max_crop_size_w – Max. crop size along width.
min_crop_size_h – Min. crop size along height.
max_crop_size_h – Max. crop size along height.
default_frames – Default number of frames per clip in a video.
max_scales – Number of scales. Default: 5
check_scale_div_factor – Check if spatial scales are divisible by this factor. Default: 32.
- Returns:
A sorted list of tuples, each of the form (h, w, n_frames).
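Example: the analogous sketch for videos; the returned values are illustrative:

    from data.sampler.utils import make_video_pairs

    pairs = make_video_pairs(
        crop_size_h=224, crop_size_w=224,
        min_crop_size_h=160, max_crop_size_h=320,
        min_crop_size_w=160, max_crop_size_w=320,
        default_frames=8,
    )
    # Each entry is (h, w, n_frames), pairing a spatial size with a clip length.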
- data.sampler.utils.create_intervallic_integer_list(base_val: int | float, min_val: float, max_val: float, num_scales: int | None = 5, scale_div_factor: int | None = 1) List[int] [source]
This function creates a list of num_scales integer values by scaling base_val between min_val and max_val.
- Parameters:
base_val – The base value to scale.
min_val – The lower end of the range.
max_val – The higher end of the range.
num_scales – Number of scaled values to generate.
scale_div_factor – Check if scaled values are divisible by this factor.
- Returns:
A sorted list of integers.
- data.sampler.utils.make_tuple_list(*val_list: List) List[Tuple] [source]
Convert a list of lists into a list of tuples, where the ith element of each input list is placed in the ith tuple of the returned list.
For example, [[1, 2], [3, 4], [5, 6]] is converted to [(1, 3, 5), (2, 4, 6)].
- Parameters:
val_list – m lists, where each is a list of n values.
- Returns:
A list of size n, where each element is a tuple of m values.
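Example: the documented behavior matches Python's built-in zip over the input lists:

    rows = [[1, 2], [3, 4], [5, 6]]
    print(list(zip(*rows)))  # [(1, 3, 5), (2, 4, 6)]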
data.sampler.variable_batch_sampler module
- class data.sampler.variable_batch_sampler.VariableBatchSampler(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSampler
Variable-size multi-scale batch sampler (https://arxiv.org/abs/2110.02178) for data parallel. This sampler yields batches with variable spatial resolutions and batch sizes.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
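Example: a sketch of an epoch loop that refreshes the variable (resolution, batch size) pairs via the inherited update_scales; opts and dataset are placeholders for objects from the surrounding training setup:

    from torch.utils.data import DataLoader

    from data.sampler.variable_batch_sampler import VariableBatchSampler

    sampler = VariableBatchSampler(opts, n_data_samples=len(dataset), is_training=True)
    loader = DataLoader(dataset, batch_sampler=sampler, num_workers=4)

    num_epochs = 50  # illustrative
    for epoch in range(num_epochs):
        # Re-sample the (resolution, batch size) pairs for this epoch.
        sampler.update_scales(epoch=epoch, is_master_node=True)
        for batch in loader:
            ...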
- class data.sampler.variable_batch_sampler.VariableBatchSamplerDDP(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs)[source]
Bases:
BaseSamplerDDP
DDP version of VariableBatchSampler
- Parameters:
opts – Command-line arguments
n_data_samples – Number of samples in the dataset
is_training – Training or validation mode. Default: False
- __init__(opts: Namespace, n_data_samples: int, is_training: bool = False, *args, **kwargs) None [source]
Module contents
- data.sampler.build_sampler(opts: Namespace, n_data_samples: int | Mapping[str, int], is_training: bool = False, get_item_metadata: Callable[[int], Dict] | None = None, *args, **kwargs) Sampler [source]
Helper function to build a data sampler from command-line arguments.
- Parameters:
opts – Command-line arguments
n_data_samples – Number of data samples. It can be an integer specifying the number of data samples for a given task, or a mapping from task name to the number of samples per task in the case of a chain sampler.
get_item_metadata – A callable that provides sample metadata, given a sample index.
is_training – Training mode or not. Defaults to False.
- Returns:
Data sampler over which we can iterate.
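Example: two calling sketches matching the documented single-task and chain-sampler conventions; opts is assumed to come from the project's argument parser:

    from data.sampler import build_sampler

    # Single task: n_data_samples is an integer.
    train_sampler = build_sampler(opts, n_data_samples=50000, is_training=True)

    # Chain sampler: n_data_samples maps task names to dataset sizes.
    multi_task_sampler = build_sampler(
        opts,
        n_data_samples={"segmentation": 10000, "classification": 50000},
        is_training=True,
    )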