Batch Processors#
DNIKit provides a wide variety of Processors for batches. For instance, a processor might resize the images in a batch, perform data augmentation, remove batch fields, attach metadata, or rename labels. These processors are chained together in pipelines, acting as PipelineStages. Note that Processors always come after a data Producer, which is what generates batches to begin with.
Here are most of the available batch processors and data loaders, with links to their API documentation for more information.
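As a quick orientation, here is a minimal sketch of a pipeline, assuming pipeline() is importable from dnikit.base and using a hypothetical producer and field name:

```python
from dnikit.base import pipeline  # assumed import path for pipeline()
from dnikit.processors import FieldRemover, MeanStdNormalizer

producer = ...  # create a valid dnikit Producer

# Processors run in order after the producer, each acting as a PipelineStage.
pipelined_producer = pipeline(
    producer,
    MeanStdNormalizer(mean=127.5, std=127.5),
    FieldRemover(fields="mask"),  # "mask" is a hypothetical field name
)

# Nothing is computed until batches are requested.
for batch in pipelined_producer(batch_size=32):
    ...
```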
Batch filtering and concatenation#
Composer for filtering#
Concatenator for merging fields#
- class dnikit.processors.Concatenator(dim, output_field, fields)[source]
This PipelineStage will concatenate 2 or more fields in the Batch and produce a new field with the given output_field.
Example
If there were fields M and N with dimensions BxM1xZ and BxN1xZ, and they were concatenated along dimension 1, the result would have a new field of size Bx(M1+N1)xZ.
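Translated into code, that example might look like the following sketch; the producer and the field names "M" and "N" are hypothetical:

```python
from dnikit.base import pipeline  # assumed import path
from dnikit.processors import Concatenator

producer = ...  # hypothetical Producer whose batches contain fields "M" and "N"

# Concatenate "M" (BxM1xZ) and "N" (BxN1xZ) along dimension 1 into a
# new field "MN" of size Bx(M1+N1)xZ.
concatenated = pipeline(
    producer,
    Concatenator(dim=1, output_field="MN", fields=["M", "N"]),
)
```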
Renaming fields and metadata#
FieldRenamer#
MetadataRenamer#
- class dnikit.processors.MetadataRenamer(mapping, *, meta_keys=None)[source]
A PipelineStage that renames some metadata fields in a Batch. This only works with metadata that has key type Batch.DictMetaKey.
- Parameters:
mapping (Mapping[str, str]) – a dictionary (or similar) whose keys are the old metadata field names and values are the new metadata field names.
meta_keys (None | DictMetaKey | Collection[DictMetaKey]) – [keyword arg, optional] either a single instance or an iterable of metadata keys of type Batch.DictMetaKey whose key-fields will be renamed. If None (the default case), all key-fields for all metadata keys will be renamed.
Note
MetadataRenamer only works with Batch.DictMetaKey (which has entries that can be renamed).
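A sketch of renaming fields and metadata together; it assumes FieldRenamer accepts an old-to-new name mapping like MetadataRenamer does, and the producer, field names, and metadata key are hypothetical:

```python
from dnikit.base import Batch, pipeline  # assumed import paths
from dnikit.processors import FieldRenamer, MetadataRenamer

producer = ...  # hypothetical Producer

# Hypothetical metadata key whose "raw_label" entries will be renamed.
LABELS = Batch.DictMetaKey("LABELS")

renamed = pipeline(
    producer,
    FieldRenamer({"logits": "scores"}),  # assumed mapping-style signature
    MetadataRenamer({"raw_label": "label"}, meta_keys=LABELS),
)
```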
Removing fields and metadata#
FieldRemover#
- class dnikit.processors.FieldRemover(*, fields, keep=False)[source]
A PipelineStage that removes some fields from a Batch.
MetadataRemover#
- class dnikit.processors.MetadataRemover(*, meta_keys=None, keys=None, keep=False)[source]
A PipelineStage that removes some metadata from a Batch.
- Parameters:
meta_keys (None | MetaKey | DictMetaKey | Collection[MetaKey | DictMetaKey]) – [keyword arg, optional] either a single instance or an iterable of Batch.MetaKey / Batch.DictMetaKey that may be removed. If None (the default case), this processor will operate on all metadata keys.
keys (Any) – [keyword arg, optional] the key within metadata to be removed. Metadata with metadata key type Batch.DictMetaKey is a mapping from str to a data type. This argument specifies the str key-field that will be removed from the batch's metadata, where the metadata must have metadata key type Batch.DictMetaKey. If None (the default case), this processor will operate on all key-fields for metadata with a Batch.DictMetaKey metadata key.
keep (bool) – [keyword arg, optional] if True, the selected meta_keys and keys instead specify what to keep, and all other data will be removed.
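A sketch combining the two removers under the documented signatures; the producer, field name, and metadata key are hypothetical:

```python
from dnikit.base import Batch, pipeline  # assumed import paths
from dnikit.processors import FieldRemover, MetadataRemover

producer = ...  # hypothetical Producer

DETECTIONS = Batch.DictMetaKey("DETECTIONS")  # hypothetical metadata key

slimmed = pipeline(
    producer,
    # keep=True inverts the selection: keep only "embeddings", drop the rest.
    FieldRemover(fields="embeddings", keep=True),
    # Remove the "debug_scores" key-field from the DETECTIONS metadata.
    MetadataRemover(meta_keys=DETECTIONS, keys="debug_scores"),
)
```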
SnapshotRemover#
- class dnikit.processors.SnapshotRemover(snapshots=None, keep=False)[source]
A PipelineStage that removes snapshots from a Batch. If used with no arguments, this will remove all snapshots.
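For instance (a sketch assuming snapshots are selected by their str names):

```python
from dnikit.processors import SnapshotRemover

remove_all = SnapshotRemover()  # no arguments: remove every snapshot

# keep=True inverts the selection; "pre_resize" is a hypothetical snapshot name.
keep_one = SnapshotRemover(snapshots="pre_resize", keep=True)
```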
General data transforms#
MeanStdNormalizer#
- class dnikit.processors.MeanStdNormalizer(*, mean, std, fields=None)[source]
A Processor that standardizes a field of a Batch by subtracting the mean and adjusting the standard deviation to 1. More precisely, if x is the data to be processed, the following processing is applied: (x - mean) / std.
- Parameters:
mean (float) – [keyword arg] The mean to be applied
std (float) – [keyword arg] The standard deviation to be applied
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
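For example, mapping 0–255 pixel values to roughly [-1, 1] (the field name "images" is hypothetical):

```python
from dnikit.processors import MeanStdNormalizer

# (x - 127.5) / 127.5 maps pixel values in [0, 255] onto [-1.0, 1.0].
normalizer = MeanStdNormalizer(mean=127.5, std=127.5, fields="images")
```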
Pooler (Max Pooling)#
- class dnikit.processors.Pooler(*, dim, method, fields=None)[source]
A Processor that pools the axes of a data field from a Batch with a specific method.
- Parameters:
dim (int | Collection[int]) – [keyword arg] the dimension (one or many) to be pooled. E.g., spatial pooling is generally (1, 2).
method (Method) – [keyword arg] pooling method. See Pooler.Method for the full list of options.
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be pooled. If the fields param is None, then all the fields in the batch will be pooled.
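For example, max pooling the spatial axes of a BxHxWxC response down to BxC; the field name is hypothetical and Pooler.Method.MAX is assumed to be among the available methods:

```python
from dnikit.processors import Pooler

# Pool over the H and W axes (dims 1 and 2) of a BxHxWxC field,
# producing a BxC field.
spatial_max_pool = Pooler(dim=(1, 2), method=Pooler.Method.MAX, fields="conv5")
```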
Transposer#
- class dnikit.processors.Transposer(*, dim, fields=None)[source]
A Processor that transposes dimensions in a data field from a Batch. This processor will reorder the dimensions of the data as specified in the dim param.
Example
To reorder NHWC to NCHW, specify Transposer(dim=[0, 3, 1, 2]); to go the other way (NCHW to NHWC), use Transposer(dim=[0, 2, 3, 1]).
- Parameters:
dim (Sequence[int]) – [keyword arg] the new order of the dimensions. It is illegal to reorder the 0th dimension.
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be transposed. If the fields param is None, then all fields will be transposed.
- Raises:
ValueError – if input specifies reordering the 0th dimension
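A sketch converting channels-last image batches to channels-first (the field name is hypothetical):

```python
from dnikit.processors import Transposer

# Reorder a BxHxWxC field to BxCxHxW; dim 0 (the batch axis) stays first.
to_channels_first = Transposer(dim=[0, 3, 1, 2], fields="images")
```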
Image operations#
ImageResizer to resize images#
- class dnikit.processors.ImageResizer(*, pixel_format, size, fields=None)[source]
Initialize an ImageResizer. This uses OpenCV to resize images. It can convert responses with the structure BxHxWxC (see ImageFormat for alternatives) to a new HxW value. This does not honor aspect ratio – the new image will be exactly the size given. This uses the default OpenCV interpolation, INTER_LINEAR.
- Parameters:
pixel_format (ImageFormat) – [keyword arg] the layout of the pixel data, see ImageFormat
size (Tuple[int, int]) – [keyword arg] the size to scale to, (width, height)
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be resized.
- Raises:
if OpenCV is not installed.
ValueError – if size elements ((width, height)) are not positive
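For example, forcing images to exactly 224x224; the import location of ImageFormat is assumed:

```python
from dnikit.processors import ImageFormat, ImageResizer  # ImageFormat path assumed

# Resize every field's images to exactly 224x224 (width, height),
# without preserving aspect ratio.
resizer = ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224))
```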
ImageRotationProcessor to rotate images#
- class dnikit.processors.ImageRotationProcessor(angle=0.0, pixel_format=ImageFormat.HWC, *, cval=(0, 0, 0), fields=None)[source]
Processor that performs image rotation along the y-axis on data in a data field from a Batch. BxCHW and BxHWC images are accepted, with non-normalized values (between 0 and 255).
- Parameters:
angle (float) – [optional] angle (in degrees) of image rotation; positive values mean counter-clockwise rotation
pixel_format (ImageFormat) – [optional] the layout of the pixel data, see ImageFormat
cval (Tuple[int, int, int]) – [keyword arg, optional] RGB color value to fill areas outside the image; defaults to (0, 0, 0) (black)
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
- Raises:
if OpenCV is not installed.
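For example, a 15-degree counter-clockwise rotation with white fill (the field name is hypothetical and the ImageFormat import path is assumed):

```python
from dnikit.processors import ImageFormat, ImageRotationProcessor

# Rotate 15 degrees counter-clockwise, filling exposed corners with white.
rotate = ImageRotationProcessor(
    angle=15.0,
    pixel_format=ImageFormat.HWC,
    cval=(255, 255, 255),
    fields="images",  # hypothetical field name
)
```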
Augmentations#
ImageGaussianBlurProcessor#
- class dnikit.processors.ImageGaussianBlurProcessor(sigma=0.0, *, fields=None)[source]
Processor that blurs images in a data field from a Batch. BxCHW and BxHWC images are accepted, with non-normalized values (between 0 and 255).
- Parameters:
sigma (float) – [optional] blur filter size; recommended values between 0 and 3, but values beyond this range are acceptable.
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
- Raises:
if OpenCV is not installed.
ValueError – if sigma is not positive
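For example, a mild blur applied to all fields:

```python
from dnikit.processors import ImageGaussianBlurProcessor

# sigma=1.5 sits in the recommended 0-3 range.
blur = ImageGaussianBlurProcessor(sigma=1.5)
```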
ImageGammaContrastProcessor#
- class dnikit.processors.ImageGammaContrastProcessor(gamma=1.0, *, fields=None)[source]
Processor that gamma corrects images in a data field from a Batch. BxCHW and BxHWC images are accepted, with non-normalized values (between 0 and 255). An image I is contrast-adjusted using the formula (I/255)^gamma * 255.
- Parameters:
gamma (float) – [optional] the gamma value applied via (I/255)^gamma * 255; values below 1 brighten the image and values above 1 darken it
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
- Raises:
if OpenCV is not installed.
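For example, slightly darkening all image fields:

```python
from dnikit.processors import ImageGammaContrastProcessor

# gamma > 1 darkens: (I/255)^1.8 * 255 compresses bright values.
darken = ImageGammaContrastProcessor(gamma=1.8)
```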
Utility processors#
Cacher to cache responses from pipelines#
- class dnikit.processors.Cacher(storage_path=None)[source]
Cacher is a PipelineStage that will cache to disk the batches produced by the previous Producer in a pipeline created with pipeline().
The first time a pipeline with a Cacher is executed, Cacher stores the batches to disk. Every time the pipeline is called after that, batches will be read directly from disk, without doing any computation for previous stages.
Note that batches may be quite large and this may require a large portion of available disk space. Be mindful when using Cacher.
If the data from the producer does not have Batch.StdKeys.IDENTIFIER, this class will assign a numeric identifier. This cannot be used across calls to Cacher but will be consistent for all uses of the pipelined_producer.
Example
```python
producer = ...   # create a valid dnikit Producer
processor = ...  # create a valid dnikit Processor
cacher = Cacher()

# Pipeline everything
pipelined_producer = pipeline(producer, processor, cacher)

# No results have been cached
cacher.cached  # returns False

# Trigger pipeline
batches = list(pipelined_producer(batch_size=32))  # producer and processor are invoked

# Results have been cached
cacher.cached  # returns True

# Trigger pipeline again (fast, because batch_size has the same value as before)
list(pipelined_producer(batch_size=32))  # producer and processor are NOT invoked

# Trigger pipeline once more (slower, because batch_size is different from first time)
list(pipelined_producer(batch_size=48))  # producer and processor are NOT invoked
```
The typical use-case for this class is to cache the results of expensive computation (such as inference and post-processing) so that computation does not have to be re-done.
Note
Just as with Model and Processor, no computation (or in this case, caching) will be executed until the pipeline is triggered.
See also
dnikit.base.multi_introspect(), which allows several introspectors to use the same batches without storing them in the file system. multi_introspect() may be a better option for very large datasets.
Warning
Cacher has the ability to resize batches if batches of different sizes are requested (see example). However, doing so is relatively computationally expensive since it involves concatenating and splitting batches. Therefore it's recommended to use this feature sparingly.
Warning
Unlike other PipelineStages, Cacher will raise a DNIKitException if it is used with more than one pipeline. This is to avoid reading batches generated from another pipeline with different characteristics.
- Parameters:
storage_path (Path | None) – [optional] if set, Cacher will store batches in storage_path; otherwise it will create a random temporary directory.