Batch Processors#
DNIKit provides a variety of Processors for batches. For instance, a processor might resize the images in a batch, perform data augmentation, remove batch fields, attach metadata, rename labels, etc. These processors are chained together in pipelines, acting as PipelineStages. Note that Processors always come after a data Producer, which is what generates the batches in the first place.
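For instance, a resize-and-normalize preprocessing chain might look like the following sketch. The producer is a placeholder, and the imports assume pipeline() is available from dnikit.base and ImageFormat from dnikit.processors; both processors are documented below.

from dnikit.base import pipeline
from dnikit.processors import ImageFormat, ImageResizer, MeanStdNormalizer

producer = ...  # any dnikit Producer that yields batches of HWC image data

# Chain the processors after the producer. Nothing executes until the
# pipelined producer is called with a batch size.
pipelined_producer = pipeline(
    producer,
    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),
    MeanStdNormalizer(mean=127.5, std=127.5),
)

for batch in pipelined_producer(batch_size=32):
    ...  # consume resized, normalized batches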
Below are most of the available batch processors and data loaders, with links to their API documentation for more information.
Batch filtering and concatenation#
Composer for filtering#
Concatenator for merging fields#
- class dnikit.processors.Concatenator(dim, output_field, fields)[source]
This PipelineStage will concatenate two or more fields in the Batch and produce a new field with the given output_field.
Example
If there were fields M and N with dimensions BxM1xZ and BxN1xZ, and they were concatenated along dimension 1, the result would have a new field of size Bx(M1+N1)xZ.
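As a code sketch (the field names here are illustrative):

from dnikit.processors import Concatenator

# Concatenate two embedding fields of shapes BxM1xZ and BxN1xZ along
# dimension 1, producing a new field "embedding" of shape Bx(M1+N1)xZ.
concatenator = Concatenator(
    dim=1,
    output_field="embedding",
    fields=["embedding_a", "embedding_b"],
)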
Renaming fields and metadata#
FieldRenamer#
MetadataRenamer#
- class dnikit.processors.MetadataRenamer(mapping, *, meta_keys=None)[source]
A PipelineStage that renames some metadata fields in a Batch. This only works with metadata that has key type Batch.DictMetaKey.
- Parameters:
mapping (Mapping[str, str]) – a dictionary (or similar) whose keys are the old metadata field names and whose values are the new metadata field names.
meta_keys (None | DictMetaKey | Collection[DictMetaKey]) – [keyword arg, optional] either a single instance or an iterable of metadata keys of type Batch.DictMetaKey whose key-fields will be renamed. If None (the default case), all key-fields for all metadata keys will be renamed.
Note
MetadataRenamer only works with Batch.DictMetaKey (which has entries that can be renamed).
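For example (the key-field names are illustrative, and Batch.StdKeys.LABELS is assumed here to be a standard DictMetaKey):

from dnikit.base import Batch
from dnikit.processors import MetadataRenamer

# Rename the "class" key-field to "category" under every DictMetaKey
# (meta_keys=None, the default, selects all metadata keys).
rename_all = MetadataRenamer({"class": "category"})

# Restrict the renaming to a single metadata key.
rename_labels = MetadataRenamer(
    {"class": "category"},
    meta_keys=Batch.StdKeys.LABELS,
)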
Removing fields and metadata#
FieldRemover#
- class dnikit.processors.FieldRemover(*, fields, keep=False)[source]
A PipelineStage that removes some fields from a Batch.
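A quick sketch (field names are illustrative; by analogy with MetadataRemover below, keep=True is assumed to invert the selection):

from dnikit.processors import FieldRemover

# Drop a single intermediate field from every batch.
drop_logits = FieldRemover(fields="raw_logits")

# Keep only the listed fields and remove everything else.
keep_images = FieldRemover(fields=["images"], keep=True)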
MetadataRemover#
- class dnikit.processors.MetadataRemover(*, meta_keys=None, keys=None, keep=False)[source]
A PipelineStage that removes some metadata from a Batch.
- Parameters:
meta_keys (None | MetaKey | DictMetaKey | Collection[MetaKey | DictMetaKey]) – [keyword arg, optional] either a single instance or an iterable of Batch.MetaKey / Batch.DictMetaKey that may be removed. If None (the default case), this processor will operate on all metadata keys.
keys (Any) – [keyword arg, optional] key within metadata to be removed. Metadata with key type Batch.DictMetaKey is a mapping from str to data. This argument specifies the str key-field to remove from the batch's metadata (the metadata must have key type Batch.DictMetaKey). If None (the default case), this processor will operate on all key-fields of all Batch.DictMetaKey metadata.
keep (bool) – [keyword arg, optional] if True, the selected meta_keys and keys instead specify what to keep, and all other data will be removed.
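For example (the key name is illustrative):

from dnikit.processors import MetadataRemover

# Remove all metadata from each batch (both selectors default to None,
# which matches everything).
strip_all = MetadataRemover()

# Remove only the "bounding_box" key-field from every DictMetaKey.
strip_boxes = MetadataRemover(keys="bounding_box")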
SnapshotRemover#
- class dnikit.processors.SnapshotRemover(snapshots=None, keep=False)[source]
A PipelineStage that removes snapshots from a Batch. If used with no arguments, this will remove all snapshots.
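For example (the snapshot name is illustrative, and passing a single name rather than a collection is assumed to be accepted):

from dnikit.processors import SnapshotRemover

# Remove every snapshot attached to each batch.
remove_all = SnapshotRemover()

# Remove one named snapshot, keeping the rest.
remove_one = SnapshotRemover(snapshots="pre_processing")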
General data transforms#
MeanStdNormalizer#
- class dnikit.processors.MeanStdNormalizer(*, mean, std, fields=None)[source]
A Processor that standardizes a field of a Batch by subtracting the mean and scaling the standard deviation to 1. More precisely, if x is the data to be processed, the following transformation is applied: (x - mean) / std.
- Parameters:
mean (float) – [keyword arg] the mean to subtract
std (float) – [keyword arg] the standard deviation to divide by
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
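For instance, to map uint8 pixel data from [0, 255] to roughly [-1, 1]:

from dnikit.processors import MeanStdNormalizer

# (x - 127.5) / 127.5 sends 0 -> -1 and 255 -> 1.
normalizer = MeanStdNormalizer(mean=127.5, std=127.5)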
Pooler (Max Pooling)#
- class dnikit.processors.Pooler(*, dim, method, fields=None)[source]
A Processor that pools the axes of a data field from a Batch with a specific method.
- Parameters:
dim (int | Collection[int]) – [keyword arg] the dimension (one or many) to be pooled. E.g., spatial pooling is generally (1, 2).
method (Method) – [keyword arg] pooling method. See Pooler.Method for the full list of options.
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be pooled. If the fields param is None, then all the fields in the batch will be pooled.
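A sketch of spatial max pooling (the field name is illustrative, and Method.MAX is assumed to be among the Pooler.Method options, per the heading above):

from dnikit.processors import Pooler

# Max-pool dimensions 1 and 2 of a BxHxWxC response, leaving BxC.
pooler = Pooler(dim=(1, 2), method=Pooler.Method.MAX, fields="conv_features")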
Transposer#
- class dnikit.processors.Transposer(*, dim, fields=None)[source]
A Processor that transposes dimensions in a data field from a Batch. This processor will reorder the dimensions of the data as specified in the dim param.
Example
To reorder NHWC to NCHW, specify Transposer(dim=[0, 3, 1, 2]); to reorder NCHW to NHWC, specify Transposer(dim=[0, 2, 3, 1]).
- Parameters:
dim (Sequence[int]) – [keyword arg] the new order of the dimensions. It is illegal to reorder the 0th dimension.
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be transposed. If the fields param is None, then all fields will be transposed.
- Raises:
ValueError – if input specifies reordering the 0th dimension
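As a sketch:

from dnikit.processors import Transposer

# NHWC -> NCHW: the new order picks the old axes (0, 3, 1, 2).
to_nchw = Transposer(dim=[0, 3, 1, 2])

# NCHW -> NHWC: the inverse permutation.
to_nhwc = Transposer(dim=[0, 2, 3, 1])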
Image operations#
ImageResizer to resize images#
- class dnikit.processors.ImageResizer(*, pixel_format, size, fields=None)[source]
Initialize an ImageResizer. This uses OpenCV to resize images. It can convert responses with the structure BxHxWxC (see ImageFormat for alternatives) to a new HxW value. This does not honor aspect ratio – the new image will be exactly the size given. This uses the default OpenCV interpolation, INTER_LINEAR.
- Parameters:
pixel_format (ImageFormat) – [keyword arg] the layout of the pixel data, see ImageFormat
size (Tuple[int, int]) – [keyword arg] the size to scale to, (width, height)
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be resized.
- Raises:
if OpenCV is not installed.
ValueError – if size elements ((width, height)) are not positive
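For example (assuming ImageFormat is importable from dnikit.processors):

from dnikit.processors import ImageFormat, ImageResizer

# Resize every image field to exactly 224x224 (width, height),
# ignoring aspect ratio.
resizer = ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224))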
ImageRotationProcessor to rotate images#
- class dnikit.processors.ImageRotationProcessor(angle=0.0, pixel_format=ImageFormat.HWC, *, cval=(0, 0, 0), fields=None)[source]
A Processor that rotates images in a data field from a Batch. BxCHW and BxHWC images are accepted, with non-normalized values (between 0 and 255).
- Parameters:
angle (float) – [optional] angle (in degrees) of image rotation; positive values mean counter-clockwise rotation
pixel_format (ImageFormat) – [optional] the layout of the pixel data, see ImageFormat
cval (Tuple[int, int, int]) – [keyword arg, optional] RGB color value to fill areas outside the image; defaults to (0, 0, 0) (black)
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
- Raises:
if OpenCV is not installed.
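For example:

from dnikit.processors import ImageFormat, ImageRotationProcessor

# Rotate images 15 degrees counter-clockwise, filling the exposed
# corners with white rather than the default black.
rotator = ImageRotationProcessor(
    angle=15.0,
    pixel_format=ImageFormat.HWC,
    cval=(255, 255, 255),
)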
Augmentations#
ImageGaussianBlurProcessor#
- class dnikit.processors.ImageGaussianBlurProcessor(sigma=0.0, *, fields=None)[source]
A Processor that blurs images in a data field from a Batch. BxCHW and BxHWC images are accepted, with non-normalized values (between 0 and 255).
- Parameters:
sigma (float) – [optional] blur filter size; values between 0 and 3 are recommended, but values beyond this range are acceptable
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
- Raises:
if OpenCV is not installed.
ValueError – if sigma is negative
ImageGammaContrastProcessor#
- class dnikit.processors.ImageGammaContrastProcessor(gamma=1.0, *, fields=None)[source]
A Processor that gamma-corrects images in a data field from a Batch. BxCHW and BxHWC images are accepted, with non-normalized values (between 0 and 255). The image I is contrast-adjusted using the formula (I/255)^gamma * 255.
- Parameters:
gamma (float) – [optional] the gamma correction value to apply
fields (None | str | Collection[str]) – [keyword arg, optional] a single field name, or an iterable of field names, to be processed. If the fields param is None, then all fields will be processed.
- Raises:
if OpenCV is not installed.
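Both augmentation processors can be chained in a single pipeline, as in this sketch (the producer is a placeholder, and pipeline() is assumed to be importable from dnikit.base):

from dnikit.base import pipeline
from dnikit.processors import (
    ImageGammaContrastProcessor,
    ImageGaussianBlurProcessor,
)

producer = ...  # any dnikit Producer that yields 0-255 image batches

augmented = pipeline(
    producer,
    ImageGaussianBlurProcessor(sigma=1.5),    # mild blur
    ImageGammaContrastProcessor(gamma=0.8),   # gamma < 1 brightens
)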
Utility processors#
Cacher to cache responses from pipelines#
- class dnikit.processors.Cacher(storage_path=None)[source]
Cacher is a PipelineStage that will cache to disk the batches produced by the previous Producer in a pipeline created with pipeline().
The first time a pipeline with a Cacher is executed, Cacher stores the batches to disk. Every time the pipeline is called after that, batches will be read directly from disk, without redoing any computation for the previous stages.
Note that batches may be quite large, and caching them may require a large portion of the available disk space. Be mindful when using Cacher.
If the data from the producer does not have Batch.StdKeys.IDENTIFIER, this class will assign a numeric identifier. These identifiers cannot be used across calls to Cacher, but they will be consistent for all uses of the pipelined_producer.
Example
producer = ...  # create a valid dnikit Producer
processor = ...  # create a valid dnikit Processor
cacher = Cacher()

# Pipeline everything
pipelined_producer = pipeline(producer, processor, cacher)

# No results have been cached
cacher.cached  # returns False

# Trigger pipeline
batches = list(pipelined_producer(batch_size=32))  # producer and processor are invoked

# Results have been cached
cacher.cached  # returns True

# Trigger pipeline again (fast, because batch_size has the same value as before)
list(pipelined_producer(batch_size=32))  # producer and processor are NOT invoked

# Trigger pipeline once more (slower, because batch_size is different from first time)
list(pipelined_producer(batch_size=48))  # producer and processor are NOT invoked
The typical use case for this class is to cache the results of expensive computation (such as inference and post-processing) to avoid re-doing that computation.
Note
Just as with Model and Processor, no computation (or in this case, caching) will be executed until the pipeline is triggered.
See also
dnikit.base.multi_introspect(), which allows several introspectors to use the same batches without storing them in the file system. multi_introspect() may be a better option for very large datasets.
Warning
Cacher has the ability to resize batches if batches of different sizes are requested (see example). However, doing so is relatively computationally expensive, since it involves concatenating and splitting batches. Therefore it's recommended to use this feature sparingly.
Warning
Unlike other PipelineStages, Cacher will raise a DNIKitException if it is used with more than one pipeline. This is to avoid reading batches generated by another pipeline with different characteristics.
- Parameters:
storage_path (Path | None) – [optional] if set, Cacher will store batches in storage_path; otherwise it will create a random temporary directory.