cvnets.modules package

Submodules

cvnets.modules.aspp_block module

class cvnets.modules.aspp_block.ASPP(opts, in_channels: int, out_channels: int, atrous_rates: Tuple[int], is_sep_conv: bool | None = False, dropout: float | None = 0.0, *args, **kwargs)[source]

Bases: BaseModule

ASPP module as defined in the DeepLab papers.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H, W)$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H, W)$
atrous_rates (Tuple[int]) – Atrous rates for the different branches.
is_sep_conv (Optional[bool]) – Use separable convolution instead of standard convolution. Default: False
dropout (Optional[float]) – Apply dropout. Default is 0.0

Shape:

Input: $(N, C_{in}, H, W)$
Output: $(N, C_{out}, H, W)$
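The shape contract above (same H and W in and out) holds for a dilated convolution only when the padding compensates for the enlarged receptive field. The following sketch checks this with the standard convolution output-size formula. It assumes a 3x3 kernel with padding equal to the atrous rate, which is a common choice but is not confirmed by this page; the rates and the helper name `conv_out_size` are hypothetical.

```python
def conv_out_size(size: int, kernel: int, stride: int = 1,
                  padding: int = 0, dilation: int = 1) -> int:
    """Spatial output size of a 2D convolution along one axis."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


# Hypothetical atrous rates; each branch pads by its own rate (assumption).
atrous_rates = (6, 12, 18)
height = 32
branch_heights = [
    conv_out_size(height, kernel=3, padding=rate, dilation=rate)
    for rate in atrous_rates
]
```

Because every branch keeps the input resolution, the branch outputs can be concatenated along the channel axis before being projected to `out_channels`.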

__init__(opts, in_channels: int, out_channels: int, atrous_rates: Tuple[int], is_sep_conv: bool | None = False, dropout: float | None = 0.0, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
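The difference between calling the instance and calling `forward` directly can be illustrated with a toy class. This is a plain-Python stand-in written for this page, not PyTorch's implementation; `TinyModule` and its hook list are hypothetical and only mimic the behavior described in the note.

```python
from typing import Callable, List


class TinyModule:
    """Toy stand-in for nn.Module; NOT PyTorch's actual implementation."""

    def __init__(self) -> None:
        self._forward_hooks: List[Callable] = []

    def register_forward_hook(self, hook: Callable) -> None:
        self._forward_hooks.append(hook)

    def forward(self, x: float) -> float:
        return 2.0 * x

    def __call__(self, x: float) -> float:
        # Calling the instance runs forward() and then every registered hook.
        out = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, x, out)
        return out


seen = []
m = TinyModule()
m.register_forward_hook(lambda module, inp, out: seen.append(out))

m(3.0)          # hook runs and records the output
m.forward(3.0)  # hook is silently skipped
```

After both calls, `seen` holds a single entry, because only the call through the instance triggered the hook.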

class cvnets.modules.aspp_block.ASPPConv2d(opts, in_channels: int, out_channels: int, dilation: int, *args, **kwargs)[source]

Bases: ConvLayer2d

Convolution with a dilation for the ASPP module.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H, W)$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H, W)$
dilation (int) – Dilation rate

Shape:

Input: $(N, C_{in}, H, W)$
Output: $(N, C_{out}, H, W)$

__init__(opts, in_channels: int, out_channels: int, dilation: int, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

adjust_atrous_rate(rate: int) → None[source]: Adjusts the dilation rate.

class cvnets.modules.aspp_block.ASPPSeparableConv2d(opts, in_channels: int, out_channels: int, dilation: int, *args, **kwargs)[source]

Bases: SeparableConv2d

Separable convolution with a dilation for the ASPP module.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H, W)$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H, W)$
dilation (int) – Dilation rate

Shape:

Input: $(N, C_{in}, H, W)$
Output: $(N, C_{out}, H, W)$

__init__(opts, in_channels: int, out_channels: int, dilation: int, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

adjust_atrous_rate(rate: int) → None[source]: Adjusts the dilation rate.

class cvnets.modules.aspp_block.ASPPPooling(opts, in_channels: int, out_channels: int, *args, **kwargs)[source]

Bases: BaseLayer

ASPP pooling layer.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H, W)$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H, W)$

Shape:

Input: $(N, C_{in}, H, W)$
Output: $(N, C_{out}, H, W)$

__init__(opts, in_channels: int, out_channels: int, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]: Forward function.

cvnets.modules.base_module module

class cvnets.modules.base_module.BaseModule(*args, **kwargs)[source]

Bases: Module

Base class for all modules

__init__(*args, **kwargs)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Any, *args, **kwargs) → Any[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.efficientnet module

class cvnets.modules.efficientnet.EfficientNetBlock(stochastic_depth_prob: float, *args, **kwargs)[source]

Bases: InvertedResidualSE

This class implements a variant of the inverted residual block with squeeze-excitation unit, as described in MobileNetv3 paper. This variant includes stochastic depth, as used in EfficientNet paper.

Parameters:

stochastic_depth_prob (float) – Stochastic depth probability.
For other arguments, refer to the parent class.

Shape:

Input: $(N, C_{in}, H_{in}, W_{in})$
Output: $(N, C_{out}, H_{out}, W_{out})$
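Stochastic depth, as mentioned in the class description, randomly skips the residual branch during training. The sketch below shows the idea on a single sample represented as a list of floats. It is an illustration under assumptions, not the code of this class: the function name, the per-sample drop decision, and the rescaling by the survival probability are choices made here for clarity.

```python
import random
from typing import Callable, List


def residual_with_stochastic_depth(
    x: List[float],
    branch: Callable[[List[float]], List[float]],
    drop_prob: float,
    training: bool,
    rng: random.Random,
) -> List[float]:
    """Return x + branch(x), randomly dropping the branch during training."""
    out = branch(x)
    if training and drop_prob > 0.0:
        survival = 1.0 - drop_prob
        keep = rng.random() < survival
        # Rescale a surviving branch so the expected output is unchanged.
        scale = (1.0 / survival) if keep else 0.0
        out = [v * scale for v in out]
    return [a + b for a, b in zip(x, out)]
```

At evaluation time (`training=False`) the branch is always kept and no rescaling is applied, so the block behaves like an ordinary residual block.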

__init__(stochastic_depth_prob: float, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.fastvit module

cvnets.modules.fastvit.convolutional_stem(opts: Namespace, in_channels: int, out_channels: int) → Sequential[source]

Build convolutional stem with MobileOne blocks.

Parameters:

opts – Command line arguments.
in_channels – Number of input channels.
out_channels – Number of output channels.

Returns:

nn.Sequential object with stem elements.

class cvnets.modules.fastvit.PatchEmbed(opts: Namespace, patch_size: int, stride: int, in_channels: int, embed_dim: int)[source]

Bases: BaseModule

Convolutional Patch embedding layer.

Parameters:

opts – Command line arguments.
patch_size – Patch size for embedding computation.
stride – Stride for convolutional embedding layer.
in_channels – Number of channels of input tensor.
embed_dim – Number of embedding dimensions.

__init__(opts: Namespace, patch_size: int, stride: int, in_channels: int, embed_dim: int)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H // s, W // s)$, where s is the stride provided while instantiating the layer.

class cvnets.modules.fastvit.RepMixer(opts: Namespace, dim: int, kernel_size: int = 3, use_layer_scale: bool = True, layer_scale_init_value: float = 1e-05, inference_mode: bool = False)[source]

Bases: BaseModule

Reparameterizable token mixer

For more details, please refer to our paper: FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Parameters:

opts – Command line arguments.
dim – Input feature map dimension. $C_{in}$ from an expected input of size $(B, C_{in}, H, W)$.
kernel_size – Kernel size for spatial mixing. Default: 3
use_layer_scale – If True, learnable layer scale is used. Default: True
layer_scale_init_value – Initial value for layer scale. Default: 1e-5
inference_mode – If True, instantiates model in inference mode. Default: False

__init__(opts: Namespace, dim: int, kernel_size: int = 3, use_layer_scale: bool = True, layer_scale_init_value: float = 1e-05, inference_mode: bool = False)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass implements inference logic for module before and after reparameterization.

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

reparameterize() → None[source]: Reparameterize mixer and norm into a single convolutional layer for efficient inference.

class cvnets.modules.fastvit.ConvFFN(opts: Namespace, in_channels: int, hidden_channels: int | None = None, out_channels: int | None = None, drop: float = 0.0)[source]

Bases: BaseModule

Convolutional FFN Module.

Parameters:

opts – Command line arguments.
in_channels – Number of input channels.
hidden_channels – Number of channels after expansion. Default: None
out_channels – Number of output channels. Default: None
drop – Dropout rate. Default: 0.0.

__init__(opts: Namespace, in_channels: int, hidden_channels: int | None = None, out_channels: int | None = None, drop: float = 0.0)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

class cvnets.modules.fastvit.RepMixerBlock(opts: Namespace, dim: int, kernel_size: int = 3, mlp_ratio: float = 4.0, drop: float = 0.0, drop_path: float = 0.0, use_layer_scale: bool = True, layer_scale_init_value: float = 1e-05, inference_mode: bool = False)[source]

Bases: BaseModule

Implementation of Metaformer block with RepMixer as token mixer. For more details on Metaformer structure, please refer to: MetaFormer Is Actually What You Need for Vision

Parameters:

opts – Command line arguments.
dim – Number of embedding dimensions.
kernel_size – Kernel size for repmixer. Default: 3
mlp_ratio – MLP expansion ratio. Default: 4.0
drop – Dropout rate. Default: 0.0
drop_path – Drop path rate. Default: 0.0
use_layer_scale – Flag to turn on layer scale. Default: True
layer_scale_init_value – Layer scale value at initialization. Default: 1e-5
inference_mode – Flag to instantiate block in inference mode. Default: False

__init__(opts: Namespace, dim: int, kernel_size: int = 3, mlp_ratio: float = 4.0, drop: float = 0.0, drop_path: float = 0.0, use_layer_scale: bool = True, layer_scale_init_value: float = 1e-05, inference_mode: bool = False)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

class cvnets.modules.fastvit.AttentionBlock(opts: Namespace, dim: int, mlp_ratio: float = 4.0, drop: float = 0.0, drop_path: float = 0.0, use_layer_scale: bool = True, layer_scale_init_value: float = 1e-05)[source]

Bases: BaseModule

Implementation of metaformer block with MHSA as token mixer. For more details on Metaformer structure, please refer to: MetaFormer Is Actually What You Need for Vision

Parameters:

opts – Command line arguments.
dim – Number of embedding dimensions.
mlp_ratio – MLP expansion ratio. Default: 4.0
drop – Dropout rate. Default: 0.0
drop_path – Drop path rate. Default: 0.0
use_layer_scale – Flag to turn on layer scale. Default: True
layer_scale_init_value – Layer scale value at initialization. Default: 1e-5

__init__(opts: Namespace, dim: int, mlp_ratio: float = 4.0, drop: float = 0.0, drop_path: float = 0.0, use_layer_scale: bool = True, layer_scale_init_value: float = 1e-05)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor output from the attention block.

class cvnets.modules.fastvit.RepCPE(opts: Namespace, in_channels: int, embed_dim: int = 768, spatial_shape: int | Tuple[int, int] = (7, 7), inference_mode: bool = False)[source]

Bases: BaseModule

Implementation of reparameterizable conditional positional encoding. For more details refer to paper: Conditional Positional Encodings for Vision Transformers

Parameters:

opts – Command line arguments.
in_channels – Number of input channels.
embed_dim – Number of embedding dimensions. Default: 768
spatial_shape – Spatial shape of kernel for positional encoding. Default: (7, 7)
inference_mode – Flag to instantiate block in inference mode. Default: False

__init__(opts: Namespace, in_channels: int, embed_dim: int = 768, spatial_shape: int | Tuple[int, int] = (7, 7), inference_mode: bool = False)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass implements inference logic for module before and after reparameterization.

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

reparameterize() → None[source]: Reparameterize linear branches.

cvnets.modules.feature_pyramid module

class cvnets.modules.feature_pyramid.FeaturePyramidNetwork(opts, in_channels: List[int], output_strides: List[str], out_channels: int, *args, **kwargs)[source]

Bases: BaseModule

This class implements the Feature Pyramid Network module for object detection.

Parameters:

opts – command-line arguments
in_channels (List[int]) – List of channels at different output strides
output_strides (List[int]) – Feature maps from these output strides will be used in FPN
out_channels (int) – Output channels
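For readers unfamiliar with feature pyramids, the core of the top-down pathway can be sketched on plain nested lists. This is a generic illustration, not this class's implementation: it omits the lateral and output convolutions, assumes each level is exactly twice the resolution of the next coarser one, and uses hypothetical dictionary keys such as `os_8`.

```python
from typing import Dict, List

Grid = List[List[float]]


def upsample_nearest_2x(x: Grid) -> Grid:
    """Nearest-neighbour upsampling by a factor of 2 along both axes."""
    return [[v for v in row for _ in range(2)] for row in x for _ in range(2)]


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two grids of equal size."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(features: Dict[str, Grid], order: List[str]) -> Dict[str, Grid]:
    """Merge features from coarsest to finest; `order` lists keys fine -> coarse."""
    out = {order[-1]: features[order[-1]]}
    for fine, coarse in zip(reversed(order[:-1]), reversed(order[1:])):
        out[fine] = add(features[fine], upsample_nearest_2x(out[coarse]))
    return out


# Hypothetical single-channel feature maps at three output strides.
features = {
    "os_8": [[0.0] * 4 for _ in range(4)],
    "os_16": [[1.0] * 2 for _ in range(2)],
    "os_32": [[1.0]],
}
merged = top_down(features, ["os_8", "os_16", "os_32"])
```

Each finer level receives the upsampled sum of all coarser levels, which is how semantic information from deep layers reaches the high-resolution maps.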

__init__(opts, in_channels: List[int], output_strides: List[str], out_channels: int, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

reset_weights() → None[source]: Resets the weights of FPN layers

forward(x: Dict[str, Tensor], *args, **kwargs) → Dict[str, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

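cvnets.modules.mobilenetv2 module

class cvnets.modules.mobilenetv2.InvertedResidualSE(opts, in_channels: int, out_channels: int, expand_ratio: int | float, dilation: int | None = 1, stride: int | None = 1, use_se: bool | None = False, act_fn_name: str | None = 'relu', se_scale_fn_name: str | None = 'hard_sigmoid', kernel_size: int | None = 3, squeeze_factor: int | None = 4, *args, **kwargs)[source]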

Bases: BaseModule

This class implements the inverted residual block with squeeze-excitation unit, as described in MobileNetv3 paper

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H_{in}, W_{in})$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H_{out}, W_{out})$
expand_ratio (Union[int, float]) – Expand the input channels by this factor in depth-wise conv
dilation (Optional[int]) – Use conv with dilation. Default: 1
stride (Optional[int]) – Use convolutions with a stride. Default: 1
use_se (Optional[bool]) – Use squeeze-excitation block. Default: False
act_fn_name (Optional[str]) – Activation function name. Default: relu
se_scale_fn_name (Optional[str]) – Scale activation function inside SE unit. Default: hard_sigmoid
kernel_size (Optional[int]) – Kernel size in depth-wise convolution. Default: 3
squeeze_factor (Optional[int]) – Squeezing factor in SE unit. Default: 4

Shape:

Input: $(N, C_{in}, H_{in}, W_{in})$
Output: $(N, C_{out}, H_{out}, W_{out})$

__init__(opts, in_channels: int, out_channels: int, expand_ratio: int | float, dilation: int | None = 1, stride: int | None = 1, use_se: bool | None = False, act_fn_name: str | None = 'relu', se_scale_fn_name: str | None = 'hard_sigmoid', kernel_size: int | None = 3, squeeze_factor: int | None = 4, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.mobilenetv2.InvertedResidual(opts, in_channels: int, out_channels: int, stride: int, expand_ratio: int | float, dilation: int = 1, skip_connection: bool | None = True, *args, **kwargs)[source]

Bases: BaseModule

This class implements the inverted residual block, as described in MobileNetv2 paper

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H_{in}, W_{in})$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H_{out}, W_{out})$
stride (int) – Use convolutions with a stride.
expand_ratio (Union[int, float]) – Expand the input channels by this factor in depth-wise conv
dilation (Optional[int]) – Use conv with dilation. Default: 1
skip_connection (Optional[bool]) – Use skip-connection. Default: True

Shape:

Input: $(N, C_{in}, H_{in}, W_{in})$
Output: $(N, C_{out}, H_{out}, W_{out})$

Note

If in_channels != out_channels and stride > 1, we set skip_connection=False

__init__(opts, in_channels: int, out_channels: int, stride: int, expand_ratio: int | float, dilation: int = 1, skip_connection: bool | None = True, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.mobileone_block module

class cvnets.modules.mobileone_block.MobileOneBlock(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, inference_mode: bool = False, use_se: bool = False, use_act: bool = True, use_scale_branch: bool = True, num_conv_branches: int = 1)[source]

Bases: BaseModule

MobileOne building block.

For more details, please refer to our paper: An Improved One millisecond Mobile Backbone <https://arxiv.org/pdf/2206.04040.pdf>

__init__(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, inference_mode: bool = False, use_se: bool = False, use_act: bool = True, use_scale_branch: bool = True, num_conv_branches: int = 1) → None[source]

Construct a MobileOneBlock.

Parameters:

opts – Command line arguments.
in_channels – Number of channels in the input.
out_channels – Number of channels produced by the block.
kernel_size – Size of the convolution kernel.
stride – Stride size. Default: 1
padding – Zero-padding size. Default: 0
dilation – Kernel dilation factor. Default: 1
groups – Group number. Default: 1
inference_mode – If True, instantiates model in inference mode. Default: False
use_se – Whether to use SE-ReLU activations. Default: False
use_act – Whether to use activation. Default: True
use_scale_branch – Whether to use scale branch. Default: True
num_conv_branches – Number of linear conv branches. Default: 1

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass implements inference logic for module before and after reparameterization.

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

reparameterize() → None[source]: Following works like RepVGG: Making VGG-style ConvNets Great Again - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference.

class cvnets.modules.mobileone_block.RepLKBlock(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, groups: int = 1, small_kernel_size: int | None = None, inference_mode: bool = False, use_act: bool = True)[source]

Bases: BaseModule

This class defines the overparameterized large-kernel conv block in RepLKNet. Reference: https://github.com/DingXiaoH/RepLKNet-pytorch

Parameters:

opts – Command-line arguments.
in_channels – Number of input channels.
out_channels – Number of output channels.
kernel_size – Kernel size of the large kernel conv branch.
stride – Stride size. Default: 1
dilation – Kernel dilation factor. Default: 1
groups – Group number. Default: 1
small_kernel_size – Kernel size of the small kernel conv branch. Default: None
inference_mode – If True, instantiates model in inference mode. Default: False
use_act – If True, activation is used. Default: True

__init__(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, groups: int = 1, small_kernel_size: int | None = None, inference_mode: bool = False, use_act: bool = True) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass implements inference logic for module before and after reparameterization.

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

reparameterize() → None[source]: Following works like RepVGG: Making VGG-style ConvNets Great Again - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference.

cvnets.modules.mobilevit_block module

Bases: BaseModule

This class defines the MobileViT block

Parameters:

opts – command line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H, W)$
transformer_dim (int) – Input dimension to the transformer unit
ffn_dim (int) – Dimension of the FFN block
n_transformer_blocks (Optional[int]) – Number of transformer blocks. Default: 2
head_dim (Optional[int]) – Head dimension in the multi-head attention. Default: 32
attn_dropout (Optional[float]) – Dropout in multi-head attention. Default: 0.0
dropout (Optional[float]) – Dropout rate. Default: 0.0
ffn_dropout (Optional[float]) – Dropout between FFN layers in transformer. Default: 0.0
patch_h (Optional[int]) – Patch height for unfolding operation. Default: 8
patch_w (Optional[int]) – Patch width for unfolding operation. Default: 8
transformer_norm_layer (Optional[str]) – Normalization layer in the transformer block. Default: layer_norm
conv_ksize (Optional[int]) – Kernel size to learn local representations in MobileViT block. Default: 3
dilation (Optional[int]) – Dilation rate in convolutions. Default: 1
no_fusion (Optional[bool]) – Do not combine the input and output feature maps. Default: False

__init__(opts, in_channels: int, transformer_dim: int, ffn_dim: int, n_transformer_blocks: int | None = 2, head_dim: int | None = 32, attn_dropout: float | None = 0.0, dropout: int | None = 0.0, ffn_dropout: int | None = 0.0, patch_h: int | None = 8, patch_w: int | None = 8, transformer_norm_layer: str | None = 'layer_norm', conv_ksize: int | None = 3, dilation: int | None = 1, no_fusion: bool | None = False, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

unfolding(feature_map: Tensor) → Tuple[Tensor, Dict][source]

folding(patches: Tensor, info_dict: Dict) → Tensor[source]
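The unfolding and folding methods are listed without descriptions, so the following sketch only illustrates the general idea suggested by the `patch_h` and `patch_w` parameters: split a feature map into non-overlapping patches, then reassemble it. It works on a single-channel grid of integers and uses hypothetical function names. The tensor layout used by the real methods is not documented here and may differ.

```python
from typing import List, Tuple

Grid = List[List[int]]


def unfold(x: Grid, patch_h: int, patch_w: int) -> Tuple[List[List[int]], Tuple[int, int]]:
    """Split an H x W grid into non-overlapping patches, each flattened row-major."""
    h, w = len(x), len(x[0])
    assert h % patch_h == 0 and w % patch_w == 0, "resize the input first"
    patches = [
        [x[i + di][j + dj] for di in range(patch_h) for dj in range(patch_w)]
        for i in range(0, h, patch_h)
        for j in range(0, w, patch_w)
    ]
    return patches, (h, w)


def fold(patches: List[List[int]], output_size: Tuple[int, int],
         patch_h: int, patch_w: int) -> Grid:
    """Inverse of unfold: write each patch back to its original location."""
    h, w = output_size
    out = [[0] * w for _ in range(h)]
    patches_per_row = w // patch_w
    for p, patch in enumerate(patches):
        i, j = (p // patches_per_row) * patch_h, (p % patches_per_row) * patch_w
        for t, v in enumerate(patch):
            out[i + t // patch_w][j + t % patch_w] = v
    return out


x = [[r * 4 + c for c in range(4)] for r in range(4)]
patches, size = unfold(x, 2, 2)
restored = fold(patches, size, 2, 2)
```

A 4 x 4 grid with 2 x 2 patches yields four patches of four pixels each, and folding them restores the original grid.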

forward_spatial(x: Tensor) → Tensor[source]

forward_temporal(x: Tensor, x_prev: Tensor | None = None) → Tensor | Tuple[Tensor, Tensor][source]

forward(x: Tensor | Tuple[Tensor], *args, **kwargs) → Tensor | Tuple[Tensor, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Bases: BaseModule

This class defines the MobileViTv2 block

Parameters:

opts – command line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H, W)$
attn_unit_dim (int) – Input dimension to the attention unit
ffn_multiplier (int) – Expand the input dimensions by this factor in FFN. Default is 2.
n_attn_blocks (Optional[int]) – Number of attention units. Default: 2
attn_dropout (Optional[float]) – Dropout in multi-head attention. Default: 0.0
dropout (Optional[float]) – Dropout rate. Default: 0.0
ffn_dropout (Optional[float]) – Dropout between FFN layers in transformer. Default: 0.0
patch_h (Optional[int]) – Patch height for unfolding operation. Default: 8
patch_w (Optional[int]) – Patch width for unfolding operation. Default: 8
conv_ksize (Optional[int]) – Kernel size to learn local representations in MobileViT block. Default: 3
dilation (Optional[int]) – Dilation rate in convolutions. Default: 1
attn_norm_layer (Optional[str]) – Normalization layer in the attention block. Default: layer_norm_2d

__init__(opts, in_channels: int, attn_unit_dim: int, ffn_multiplier: Sequence[int | float] | int | float | None = 2.0, n_attn_blocks: int | None = 2, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, patch_h: int | None = 8, patch_w: int | None = 8, conv_ksize: int | None = 3, dilation: int | None = 1, attn_norm_layer: str | None = 'layer_norm_2d', *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

unfolding_pytorch(feature_map: Tensor) → Tuple[Tensor, Tuple[int, int]][source]

folding_pytorch(patches: Tensor, output_size: Tuple[int, int]) → Tensor[source]

unfolding_coreml(feature_map: Tensor) → Tuple[Tensor, Tuple[int, int]][source]

folding_coreml(patches: Tensor, output_size: Tuple[int, int]) → Tensor[source]

resize_input_if_needed(x)[source]

forward_spatial(x: Tensor, *args, **kwargs) → Tensor[source]

forward_temporal(x: Tensor, x_prev: Tensor, *args, **kwargs) → Tensor | Tuple[Tensor, Tensor][source]

forward(x: Tensor | Tuple[Tensor], *args, **kwargs) → Tensor | Tuple[Tensor, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.pspnet_module module

class cvnets.modules.pspnet_module.PSP(opts, in_channels: int, out_channels: int, pool_sizes: Tuple[int, ...] | None = (1, 2, 3, 6), dropout: float | None = 0.0, *args, **kwargs)[source]

Bases: BaseModule

This class defines the Pyramid Scene Parsing module in the PSPNet paper

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H, W)$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H, W)$
pool_sizes (Optional[Tuple[int, ...]]) – List or Tuple of pool sizes. Default: (1, 2, 3, 6)
dropout (Optional[float]) – Apply dropout. Default is 0.0
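To make the role of `pool_sizes` concrete, the sketch below pools a one-dimensional signal to each of the default sizes using the bin boundaries of adaptive average pooling (start = floor(i * size / out), end = ceil((i + 1) * size / out)). That this module pools adaptively is an assumption made for illustration, and the helper names are hypothetical; the real module operates on 2D feature maps.

```python
from typing import Dict, List, Tuple


def adaptive_bins(size: int, out_size: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) index ranges averaged into each output cell."""
    return [
        ((i * size) // out_size, ((i + 1) * size + out_size - 1) // out_size)
        for i in range(out_size)
    ]


def adaptive_avg_pool_1d(x: List[float], out_size: int) -> List[float]:
    """Average the input over each adaptive bin."""
    return [sum(x[s:e]) / (e - s) for s, e in adaptive_bins(len(x), out_size)]


pool_sizes = (1, 2, 3, 6)  # default from the signature above
row = [float(v) for v in range(12)]
pyramid: Dict[int, List[float]] = {
    p: adaptive_avg_pool_1d(row, p) for p in pool_sizes
}
```

Pool size 1 summarizes the whole input in a single value (global context), while larger sizes keep progressively finer sub-region statistics.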

__init__(opts, in_channels: int, out_channels: int, pool_sizes: Tuple[int, ...] | None = (1, 2, 3, 6), dropout: float | None = 0.0, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.regnet_modules module

class cvnets.modules.regnet_modules.XRegNetBlock(opts: Namespace, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stochastic_depth_prob: float = 0.0)[source]

Bases: BaseModule

This class implements the X block based on the ResNet bottleneck block. See Figure 4 of the RegNet paper.

Parameters:

opts – command-line arguments
width_in – The number of input channels
width_out – The number of output channels
stride – Stride for convolution
groups – Number of groups for convolution
bottleneck_multiplier – The number of in/out channels of the intermediate conv layer will be scaled by this value
se_ratio – The squeeze-excitation ratio. The number of channels in the SE module will be scaled by this value
stochastic_depth_prob – The stochastic depth probability

__init__(opts: Namespace, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stochastic_depth_prob: float = 0.0) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]

Forward pass for XRegNetBlock.

Parameters:

x – Batch of images

Returns:

Output of XRegNetBlock, including the stochastic depth layer and residual.

Shape:

x: $(N, C_{in}, H_{in}, W_{in})$
Output: $(N, C_{out}, H_{out}, W_{out})$

class cvnets.modules.regnet_modules.AnyRegNetStage(opts: Namespace, depth: int, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stage_index: int, stochastic_depth_probs: List[float])[source]

Bases: BaseModule

This class implements a ‘stage’ as defined in the RegNet paper. It consists of a sequence of bottleneck blocks.

Parameters:

opts – command-line arguments
depth – The number of XRegNetBlocks in the stage
width_in – The number of input channels of the first block
width_out – The number of output channels of each block
stride – Stride for convolution of first block
groups – Number of groups for the intermediate convolution (bottleneck) layer in each block
bottleneck_multiplier – The number of in/out channels of the intermediate conv layer of each block will be scaled by this value
se_ratio – The squeeze-excitation ratio. The number of channels in the SE module of each block will be scaled by this value
stage_depths – A list of the number of blocks in each stage
stage_index – The index of the current stage being constructed
stochastic_depth_probs – The stochastic depth probabilities
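The parameter descriptions above imply a simple layout for the blocks of a stage: only the first block sees `width_in` and applies `stride`, and every later block maps `width_out` to `width_out`. The sketch below spells that out. Using stride 1 for the remaining blocks is an assumption based on the wording "first block", and the helper name is hypothetical.

```python
from typing import List, Tuple


def stage_block_configs(depth: int, width_in: int, width_out: int,
                        stride: int) -> List[Tuple[int, int, int]]:
    """(in_channels, out_channels, stride) for each block of one stage."""
    return [
        (width_in if i == 0 else width_out, width_out, stride if i == 0 else 1)
        for i in range(depth)
    ]


# Example values chosen for illustration only.
configs = stage_block_configs(depth=3, width_in=32, width_out=64, stride=2)
```

Here the first block changes both the channel count and the resolution, and the two remaining blocks keep them fixed.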

__init__(opts: Namespace, depth: int, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stage_index: int, stochastic_depth_probs: List[float]) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]

Forward pass through all blocks in the stage.

Parameters:

x – Batch of images.

Returns:

output of passing x through all blocks in the stage.

Shape:

x: $(N, C_{in}, H_{in}, W_{in})$
Output: $(N, C_{out}, H_{out}, W_{out})$

cvnets.modules.resnet_modules module

class cvnets.modules.resnet_modules.BasicResNetBlock(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs)[source]

Bases: BaseModule

This class defines the Basic block in the ResNet model.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{in}$ from an expected input of size $(N, C_{in}, H_{in}, W_{in})$
mid_channels (int) – $C_{mid}$ from an expected tensor of size $(N, C_{mid}, H_{out}, W_{out})$
out_channels (int) – $C_{out}$ from an expected output of size $(N, C_{out}, H_{out}, W_{out})$
stride (Optional[int]) – Stride for convolution. Default: 1
dilation (Optional[int]) – Dilation for convolution. Default: 1
dropout (Optional[float]) – Dropout after second convolution. Default: 0.0
stochastic_depth_prob (Optional[float]) – Stochastic depth drop probability (1 - survival_prob). Default: 0.0
squeeze_channels (Optional[int]) – The number of channels to use in the Squeeze-Excitation block for SE-ResNet. Default: None

Shape:

Input: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$
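The stochastic_depth_prob parameter above is documented as a drop probability, i.e. 1 - survival_prob. The following is a minimal pure-Python sketch of how such a probability is commonly applied to a residual branch. The helper names and the list-of-floats representation are illustrative assumptions and are not taken from the cvnets implementation.

```python
import random
from typing import List


def stochastic_depth(branch: List[float], drop_prob: float, training: bool,
                     rng: random.Random) -> List[float]:
    """Randomly drop a residual branch (hypothetical helper, not cvnets API)."""
    if not training or drop_prob == 0.0:
        # At inference time, or with drop_prob = 0, the branch is always kept.
        return list(branch)
    survival_prob = 1.0 - drop_prob
    if rng.random() >= survival_prob:
        # Branch dropped: only the skip connection contributes.
        return [0.0 for _ in branch]
    # Branch kept: rescale so the expected value matches inference behaviour.
    return [v / survival_prob for v in branch]


def residual_add(x: List[float], branch: List[float], drop_prob: float,
                 training: bool, rng: random.Random) -> List[float]:
    """Skip connection plus the (possibly dropped) branch output."""
    kept = stochastic_depth(branch, drop_prob, training, rng)
    return [a + b for a, b in zip(x, kept)]
```

With drop_prob = 0.0 the block behaves like a plain residual block, and with drop_prob = 1.0 in training mode it reduces to the identity mapping.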

expansion: int = 1

__init__(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.resnet_modules.BottleneckResNetBlock(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs)[source]

Bases: BaseModule

This class defines the Bottleneck block in the ResNet model.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H_{i n}, W_{i n})$
mid_channels (int) – $C_{m i d}$ from an expected tensor of size $(N, C_{m i d}, H_{o u t}, W_{o u t})$
out_channels (int) – $C_{o u t}$ from an expected output of size $(N, C_{o u t}, H_{o u t}, W_{o u t})$
stride (Optional[int]) – Stride for convolution. Default: 1
dilation (Optional[int]) – Dilation for convolution. Default: 1
dropout (Optional[float]) – Dropout after third convolution. Default: 0.0
stochastic_depth_prob (Optional[float]) – Stochastic depth drop probability (1 - survival_prob). Default: 0.0
squeeze_channels (Optional[int]) – The number of channels to use in the Squeeze-Excitation block for SE-ResNet.

Shape:

Input: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$
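The Shape entries distinguish $H_{i n}$ from $H_{o u t}$ because a strided convolution changes the spatial size. The sketch below evaluates the standard convolution output-size formula. The function name, the 3x3 default kernel, and the default padding rule are assumptions made for illustration; the block's actual kernel sizes and padding are not stated in this reference.

```python
from typing import Optional


def conv_out_size(size_in: int, kernel_size: int = 3, stride: int = 1,
                  dilation: int = 1, padding: Optional[int] = None) -> int:
    """Spatial output size of a convolution along one dimension."""
    if padding is None:
        # Assumption: padding that preserves the size of an odd kernel at stride 1.
        padding = dilation * (kernel_size - 1) // 2
    # A dilated kernel covers dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (size_in + 2 * padding - effective_kernel) // stride + 1


print(conv_out_size(56, stride=1))  # 56
print(conv_out_size(56, stride=2))  # 28
```

Under these assumptions, stride=2 halves an even input size, while dilation alone leaves it unchanged.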

expansion: int = 4

__init__(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.squeeze_excitation module

class cvnets.modules.squeeze_excitation.SqueezeExcitation(opts, in_channels: int, squeeze_factor: int | None = 4, squeeze_channels: int | None = None, scale_fn_name: str | None = 'sigmoid', *args, **kwargs)[source]

Bases: BaseModule

This class defines the Squeeze-excitation module, as described in the SENet paper

Parameters:

opts – command-line arguments
in_channels (int) – $C$ from an expected input of size $(N, C, H, W)$
squeeze_factor (Optional[int]) – Reduce $C$ by this factor. Default: 4
squeeze_channels (Optional[int]) – This module’s output channels. Overrides squeeze_factor if specified
scale_fn_name (Optional[str]) – Scaling function name. Default: sigmoid

Shape:

Input: $(N, C, H, W)$
Output: $(N, C, H, W)$
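To illustrate why the output keeps the input shape, here is a minimal pure-Python sketch of a squeeze-excitation pass over a single sample stored as nested lists. The function names, the ReLU between the two projections, and the plain integer division for the squeezed width are assumptions; the library may round the channel count or use different layers.

```python
import math
from typing import List, Optional

Channel = List[List[float]]  # one [H][W] feature map


def resolve_squeeze_channels(in_channels: int, squeeze_factor: int = 4,
                             squeeze_channels: Optional[int] = None) -> int:
    """An explicit squeeze_channels overrides squeeze_factor, as documented above."""
    if squeeze_channels is not None:
        return squeeze_channels
    return max(1, in_channels // squeeze_factor)


def squeeze_excite(x: List[Channel], w_reduce: List[List[float]],
                   w_expand: List[List[float]]) -> List[Channel]:
    """x: [C][H][W]; w_reduce: [S][C]; w_expand: [C][S]. Returns [C][H][W]."""
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation, step 1: project C -> S, then ReLU (assumed activation).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    # Excitation, step 2: project S -> C, then sigmoid (the documented default scale).
    scale = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Re-weight every channel by its scale; the shape is unchanged.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(x, scale)]
```

With all-zero weights every channel is scaled by sigmoid(0) = 0.5, which makes the sketch easy to check by hand.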

__init__(opts, in_channels: int, squeeze_factor: int | None = 4, squeeze_channels: int | None = None, scale_fn_name: str | None = 'sigmoid', *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.ssd_heads module

class cvnets.modules.ssd_heads.SSDHead(opts, in_channels: int, n_anchors: int, n_classes: int, n_coordinates: int | None = 4, proj_channels: int | None = -1, kernel_size: int | None = 3, stride: int | None = 1, *args, **kwargs)[source]

Bases: BaseModule

This class defines the SSD object detection Head

Parameters:

opts – command-line arguments
in_channels (int) – $C$ from an expected input of size $(N, C, H, W)$
n_anchors (int) – Number of anchors
n_classes (int) – Number of classes in the dataset
n_coordinates (Optional[int]) – Number of coordinates. Default: 4 (x, y, w, h)
proj_channels (Optional[int]) – Number of projected channels. If -1, then projection layer is not used
kernel_size (Optional[int]) – Kernel size in convolutional layer. If kernel_size=1, then standard point-wise convolution is used. Otherwise, separable convolution is used
stride (Optional[int]) – stride for feature map. If stride > 1, then feature map is sampled at this rate and predictions are made on fewer pixels as compared to the input tensor. Default: 1
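As a rough aid for the parameters above, the sketch below counts how many box predictions a head of this kind would emit for a given feature map. The function name, the output layout (batch, boxes, values), the order of the two returned shapes, and the "every stride-th pixel" sampling rule are all assumptions for illustration and are not confirmed by this reference.

```python
from typing import Tuple

Shape = Tuple[int, int, int]


def ssd_head_output_shapes(batch: int, height: int, width: int, n_anchors: int,
                           n_classes: int, n_coordinates: int = 4,
                           stride: int = 1) -> Tuple[Shape, Shape]:
    """Return assumed (box_locations, box_classes) shapes for one feature map."""
    # Assumption: with stride > 1, predictions are made at every stride-th pixel.
    sampled_h = len(range(0, height, stride))
    sampled_w = len(range(0, width, stride))
    n_boxes = sampled_h * sampled_w * n_anchors
    return (batch, n_boxes, n_coordinates), (batch, n_boxes, n_classes)


locations, classes = ssd_head_output_shapes(2, 19, 19, n_anchors=6, n_classes=81)
print(locations, classes)  # (2, 2166, 4) (2, 2166, 81)
```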

__init__(opts, in_channels: int, n_anchors: int, n_classes: int, n_coordinates: int | None = 4, proj_channels: int | None = -1, kernel_size: int | None = 3, stride: int | None = 1, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

reset_parameters() → None[source]

forward(x: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.ssd_heads.SSDInstanceHead(opts, in_channels: int, n_classes: int | None = 1, inner_dim: int | None = 256, output_stride: int | None = 1, output_size: int | None = 8, *args, **kwargs)[source]

Bases: BaseModule

Instance segmentation head for SSD model.

__init__(opts, in_channels: int, n_classes: int | None = 1, inner_dim: int | None = 256, output_stride: int | None = 1, output_size: int | None = 8, *args, **kwargs) → None[source]

Parameters:

opts – command-line arguments
in_channels (int) – $C$ from an expected input of size $(N, C, H, W)$
n_classes (Optional[int]) – Number of classes. Default: 1
inner_dim (Optional[int]) – Inner dimension of the instance head. Default: 256
output_stride (Optional[int]) – Output stride of the feature map. Output stride is the ratio of input to the feature map size. Default: 1
output_size (Optional[int]) – Output size of the instances extracted from RoIAlign layer. Default: 8

reset_parameters() → None[source]

forward(x: Tensor, boxes: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.swin_transformer_block module

class cvnets.modules.swin_transformer_block.Permute(dims: List[int])[source]

Bases: BaseModule

This module returns a view of the tensor input with its dimensions permuted.

Parameters:

dims (List[int]) – The desired ordering of dimensions
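The effect of dims on the tensor shape can be sketched without any tensor library. The helper name below is hypothetical; it only mirrors the usual permute convention, where output dimension i takes the size of input dimension dims[i].

```python
from typing import List, Sequence, Tuple


def permuted_shape(shape: Sequence[int], dims: List[int]) -> Tuple[int, ...]:
    """Shape of a tensor after permuting its dimensions with dims."""
    if sorted(dims) != list(range(len(shape))):
        raise ValueError("dims must be a permutation of the input dimensions")
    return tuple(shape[d] for d in dims)


# Channels-first (N, C, H, W) to channels-last (N, H, W, C).
print(permuted_shape((8, 96, 56, 56), [0, 2, 3, 1]))  # (8, 56, 56, 96)
```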

__init__(dims: List[int])[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.swin_transformer_block.PatchMerging(opts, dim: int, norm_layer: str, strided: bool | None = True)[source]

Bases: BaseModule

Patch Merging Layer.

Parameters:

dim (int) – Number of input channels.
norm_layer (str) – Normalization layer name.
strided (Optional[bool]) – Down-sample the input by a factor of 2. Default is True.

__init__(opts, dim: int, norm_layer: str, strided: bool | None = True)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Parameters:

x (Tensor) – input tensor with expected layout of […, H, W, C]

Returns:

Tensor with layout of […, H/2, W/2, 2*C]
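The layout change from […, H, W, C] to […, H/2, W/2, 2*C] can be sketched as follows, assuming the layer follows the original Swin design: each 2x2 neighbourhood is concatenated into a 4*C vector, and a learned linear map then reduces it to 2*C. The helper names, the concatenation order, and the omission of normalization are assumptions for illustration.

```python
from typing import List

Grid = List[List[List[float]]]  # [H][W][C]


def gather_2x2_patches(x: Grid) -> Grid:
    """[H][W][C] -> [H/2][W/2][4*C] by concatenating each 2x2 neighbourhood."""
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("this sketch assumes even H and W")
    merged = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # List concatenation stacks the four neighbours along the channel axis.
            row.append(x[i][j] + x[i + 1][j] + x[i][j + 1] + x[i + 1][j + 1])
        merged.append(row)
    return merged


def project(features: Grid, weight: List[List[float]]) -> Grid:
    """Apply an [out][in] linear map to every feature vector (4*C -> 2*C here)."""
    return [[[sum(wk * v for wk, v in zip(w_row, vec)) for w_row in weight]
             for vec in row] for row in features]
```

For a 2x2 input with one channel, gather_2x2_patches produces a single 4-channel vector, and a 2x4 weight matrix brings it down to 2 channels.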

cvnets.modules.swin_transformer_block.shifted_window_attention(input: Tensor, qkv_weight: Tensor, proj_weight: Tensor, relative_position_bias: Tensor, window_size: List[int], num_heads: int, shift_size: List[int], attention_dropout: float = 0.0, dropout: float = 0.0, qkv_bias: Tensor | None = None, proj_bias: Tensor | None = None)[source]

Window-based multi-head self-attention (W-MSA) with relative position bias. It supports both shifted and non-shifted windows.

Parameters:

input (Tensor[N, H, W, C]) – The input tensor of 4 dimensions.
qkv_weight (Tensor[in_dim, out_dim]) – The weight tensor of query, key, value.
proj_weight (Tensor[out_dim, out_dim]) – The weight tensor of projection.
relative_position_bias (Tensor) – The learned relative position bias added to attention.
window_size (List[int]) – Window size.
num_heads (int) – Number of attention heads.
shift_size (List[int]) – Shift size for shifted window attention.
attention_dropout (float) – Dropout ratio of attention weight. Default: 0.0.
dropout (float) – Dropout ratio of output. Default: 0.0.
qkv_bias (Tensor[out_dim], optional) – The bias tensor of query, key, value. Default: None.
proj_bias (Tensor[out_dim], optional) – The bias tensor of projection. Default: None.

Returns:

The output tensor after shifted window attention.

Return type:

Tensor[N, H, W, C]

class cvnets.modules.swin_transformer_block.ShiftedWindowAttention(dim: int, window_size: List[int], shift_size: List[int], num_heads: int, qkv_bias: bool = True, proj_bias: bool = True, attention_dropout: float = 0.0, dropout: float = 0.0)[source]

Bases: BaseModule

See shifted_window_attention().

__init__(dim: int, window_size: List[int], shift_size: List[int], num_heads: int, qkv_bias: bool = True, proj_bias: bool = True, attention_dropout: float = 0.0, dropout: float = 0.0)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Parameters:

x (Tensor) – Tensor with layout of [B, H, W, C]

Returns:

Tensor with same layout as input, i.e. [B, H, W, C]

class cvnets.modules.swin_transformer_block.SwinTransformerBlock(opts, embed_dim: int, num_heads: int, window_size: List[int], shift_size: List[int], mlp_ratio: float = 4.0, dropout: float = 0.0, attn_dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, stochastic_depth_prob: float = 0.0, norm_layer: str | None = 'layer_norm')[source]

Bases: BaseModule

Swin Transformer Block.

Parameters:

embed_dim (int) – Number of input channels.
num_heads (int) – Number of attention heads.
window_size (List[int]) – Window size.
shift_size (List[int]) – Shift size for shifted window attention.
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4.0.
dropout (float) – Dropout rate. Default: 0.0.
attn_dropout (Optional[float]) – Attention dropout rate. Default: 0.0.
stochastic_depth_prob (float) – Stochastic depth rate. Default: 0.0.
norm_layer (Optional[str]) – Normalization layer name. Default: layer_norm.
attn_layer (nn.Module) – Attention layer. Default: ShiftedWindowAttention

__init__(opts, embed_dim: int, num_heads: int, window_size: List[int], shift_size: List[int], mlp_ratio: float = 4.0, dropout: float = 0.0, attn_dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, stochastic_depth_prob: float = 0.0, norm_layer: str | None = 'layer_norm')[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.transformer module

class cvnets.modules.transformer.TransformerEncoder(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, *args, **kwargs)[source]

Bases: BaseModule

This class defines the pre-norm Transformer encoder.

Parameters:

opts – Command line arguments.
embed_dim – $C_{i n}$ from an expected input of size $(N, P, C_{i n})$.
ffn_latent_dim – Inner dimension of the FFN.
num_heads – Number of heads in multi-head attention. Default: 8.
attn_dropout – Dropout rate for attention in multi-head attention. Default: 0.0.
dropout – Dropout rate. Default: 0.0.
ffn_dropout – Dropout between FFN layers. Default: 0.0.
transformer_norm_layer – Normalization layer. Default: layer_norm.
stochastic_dropout – Stochastic dropout setting. Default: 0.0.

Shape:

Input: $(N, P, C_{i n})$ where $N$ is batch size, $P$ is number of patches, and $C_{i n}$ is input embedding dim
Output: same shape as the input
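"Pre-norm" means that normalization is applied before each sub-layer and that the residual is added to the un-normalized input, which is why the output keeps the input shape. A minimal pure-Python sketch of that ordering for one sample is shown below. The function names and the use of plain callables for the attention and FFN sub-layers are assumptions for illustration; dropout and stochastic depth are omitted.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]  # [P][C]: P patches with C channels each


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum of two token sequences."""
    return [[u + v for u, v in zip(ta, tb)] for ta, tb in zip(a, b)]


def pre_norm_encoder(x: Tokens, attn: Callable[[Tokens], Tokens],
                     ffn: Callable[[Tokens], Tokens]) -> Tokens:
    """x + attn(norm(x)), followed by x + ffn(norm(x))."""
    x = add(x, attn([layer_norm(tok) for tok in x]))
    return add(x, ffn([layer_norm(tok) for tok in x]))
```

Passing sub-layers that return zeros makes the block an identity mapping, which is a quick way to confirm that both residual paths are wired correctly.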

__init__(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, x_prev: Tensor | None = None, key_padding_mask: Tensor | None = None, attn_mask: Tensor | None = None, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.transformer.LinearAttnFFN(opts, embed_dim: int, ffn_latent_dim: int, attn_dropout: float | None = 0.0, dropout: float | None = 0.1, ffn_dropout: float | None = 0.0, norm_layer: str | None = 'layer_norm_2d', *args, **kwargs)[source]

Bases: BaseModule

This class defines the pre-norm transformer encoder with linear self-attention, as described in the MobileViTv2 paper.

Parameters:

opts – command line arguments
embed_dim (int) – $C_{i n}$ from an expected input of size $(B, C_{i n}, P, N)$
ffn_latent_dim (int) – Inner dimension of the FFN
attn_dropout (Optional[float]) – Dropout rate for attention in multi-head attention. Default: 0.0
dropout (Optional[float]) – Dropout rate. Default: 0.1
ffn_dropout (Optional[float]) – Dropout between FFN layers. Default: 0.0
norm_layer (Optional[str]) – Normalization layer. Default: layer_norm_2d

Shape:

Input: $(B, C_{i n}, P, N)$ where $B$ is batch size, $C_{i n}$ is input embedding dim, $P$ is number of pixels in a patch, and $N$ is number of patches
Output: same shape as the input

__init__(opts, embed_dim: int, ffn_latent_dim: int, attn_dropout: float | None = 0.0, dropout: float | None = 0.1, ffn_dropout: float | None = 0.0, norm_layer: str | None = 'layer_norm_2d', *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, x_prev: Tensor | None = None, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cvnets.modules.windowed_transformer module

cvnets.modules.windowed_transformer.window_partition(t: Tensor, window_size: int) → Tensor[source]

Partition tensor @t into chunks of size @window_size.

@t’s sequence length must be divisible by @window_size.

Parameters:

t – A tensor of shape [batch_size, sequence_length, embed_dim].
window_size – The desired window size.

Returns:

A tensor of shape [batch_size * sequence_length // window_size, window_size, embed_dim].
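The reshape described above can be sketched on nested lists. This is only an illustration of the documented input and output shapes; the function name is hypothetical and the ordering of windows (all windows of the first sample, then the second, and so on) is an assumption.

```python
from typing import List

Tokens = List[List[float]]  # [sequence_length][embed_dim]


def window_partition_sketch(t: List[Tokens], window_size: int) -> List[Tokens]:
    """[B][N][C] -> [B * N // window_size][window_size][C]."""
    windows = []
    for seq in t:
        if len(seq) % window_size != 0:
            raise ValueError("sequence length must be divisible by window_size")
        # Consecutive, non-overlapping chunks of window_size tokens.
        for start in range(0, len(seq), window_size):
            windows.append(seq[start:start + window_size])
    return windows


batch = [[[0.0], [1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0], [7.0]]]
print(len(window_partition_sketch(batch, 2)))  # 4
```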

cvnets.modules.windowed_transformer.window_partition_reverse(t: Tensor, B: int, num_windows: int, C: int) → Tensor[source]

Undo the @window_partition operation.

Parameters:

t – The input tensor of shape [batch_size * num_windows, window_size, embed_dim].
B – The batch size.
num_windows – The number of windows.
C – The embedding dimension.

Returns:

A tensor of shape [batch_size, num_windows * window_size, embed_dim].
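The inverse operation regroups windows by sample and concatenates them along the sequence axis. The sketch below uses nested lists and a hypothetical function name, and it assumes that the windows of each sample are stored contiguously.

```python
from typing import List

Tokens = List[List[float]]  # [tokens][embed_dim]


def window_partition_reverse_sketch(t: List[Tokens], B: int, num_windows: int,
                                    C: int) -> List[Tokens]:
    """[B * num_windows][window_size][C] -> [B][num_windows * window_size][C]."""
    if len(t) != B * num_windows:
        raise ValueError("expected B * num_windows windows")
    out = []
    for b in range(B):
        seq = []
        # Windows b * num_windows .. (b + 1) * num_windows - 1 belong to sample b.
        for w in range(num_windows):
            for tok in t[b * num_windows + w]:
                if len(tok) != C:
                    raise ValueError("unexpected embedding dimension")
                seq.append(tok)
        out.append(seq)
    return out
```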

cvnets.modules.windowed_transformer.get_windows_shift_mask(N: int, window_size: int, window_shift: int, device: device) → Tensor[source]

Get the mask window required due to window shifting (needed for shifted window attention).

This produces a tensor with mask values for each window. Most windows don’t require masking, but windows that bleed across the beginning/end of the tensor (due to shifting) require it.

Parameters:

N – The sequence length.
window_size – The window size.
window_shift – The window shift.
device – The device on which to create the tensor.

Returns:

A tensor of shape [N // window_size, window_size, window_size] containing mask values. The values are 0 (unmasked) or float(“-inf”) (masked).
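One way to build a mask with the documented shape and values is sketched below. It assumes the sequence is cyclically shifted to the left by window_shift before partitioning, so that only the last window mixes tokens from the end of the sequence with tokens wrapped around from the start. The function name and that shift direction are assumptions, not details confirmed by this reference.

```python
from typing import List


def windows_shift_mask_sketch(n: int, window_size: int,
                              window_shift: int) -> List[List[List[float]]]:
    """Return a [n // window_size][window_size][window_size] mask of 0 / -inf."""
    if n % window_size != 0:
        raise ValueError("n must be divisible by window_size")
    if not 0 <= window_shift < window_size:
        raise ValueError("this sketch assumes 0 <= window_shift < window_size")
    neg_inf = float("-inf")
    num_windows = n // window_size
    mask = [[[0.0] * window_size for _ in range(window_size)]
            for _ in range(num_windows)]
    if window_shift > 0:
        # In the last window, the first `boundary` tokens come from the end of the
        # sequence and the remaining ones are wrapped around from the start.
        boundary = window_size - window_shift
        for i in range(window_size):
            for j in range(window_size):
                if (i < boundary) != (j < boundary):
                    # Tokens from different ends must not attend to each other.
                    mask[-1][i][j] = neg_inf
    return mask
```

Every window except the last stays fully unmasked, which matches the statement that most windows do not require masking.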

cvnets.modules.windowed_transformer.window_x_and_key_padding_mask(x: Tensor, key_padding_mask: Tensor, window_size: int, window_shift: int) → Tuple[Tensor, Tensor, Tensor][source]

Perform windowing on @x and @key_padding_mask in preparation for windowed attention.

Parameters:

x – The input tensor of shape [batch_size, sequence_length, num_channels].
key_padding_mask – The mask, as a tensor of shape [batch_size, sequence_length].
window_size – The window size to be used for windowed attention.
window_shift – The window shift to be used for windowed attention.

Returns:

A tuple containing 3 tensors. The first is the windowed input. The second is the windowed mask. The third is the mask needed to perform shifted window attention (to prevent the first and last windows from bleeding into each other).

cvnets.modules.windowed_transformer.unwindow_x(x_windows: Tensor, B: int, N: int, C: int, window_shift: int)[source]

Undoes the operation of @window_x_and_key_padding_mask on the input tensor @x_windows.

Parameters:

x_windows – The input tensor to unwindow. Its shape is [batch_size * padded_sequence_length // window_size, window_size, embed_dim].
B – The batch size. Referred to as batch_size in this docstring.
N – The sequence length of the tensor before windowing. Referred to as sequence_length in this docstring.
C – The number of channels. Referred to as embed_dim in this docstring.
window_shift – The shift applied to the sequence before the windowing originally occurred.

Returns:

A tensor of shape [batch_size, sequence_length, embed_dim].
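The windowing and unwindowing steps documented above should round-trip. The pure-Python sketch below illustrates one plausible sequence of operations: pad the sequence to a multiple of window_size, shift it cyclically, partition it, and then invert each step. The function names, zero padding at the end of the sequence, the shift direction, and the order of padding and shifting are all assumptions; handling of the key padding mask is omitted.

```python
from typing import List

Tokens = List[List[float]]  # [tokens][embed_dim]


def window_x_sketch(x: List[Tokens], window_size: int,
                    window_shift: int) -> List[Tokens]:
    """[B][N][C] -> [B * padded_N // window_size][window_size][C]."""
    windows = []
    for seq in x:
        channels = len(seq[0])
        pad = (-len(seq)) % window_size  # tokens needed to reach a multiple
        padded = seq + [[0.0] * channels for _ in range(pad)]
        # Cyclic left shift: the first window_shift tokens wrap to the end.
        shifted = padded[window_shift:] + padded[:window_shift]
        for start in range(0, len(shifted), window_size):
            windows.append(shifted[start:start + window_size])
    return windows


def unwindow_x_sketch(x_windows: List[Tokens], B: int, N: int, C: int,
                      window_shift: int) -> List[Tokens]:
    """Invert window_x_sketch and drop the padding: result is [B][N][C]."""
    per_sample = len(x_windows) // B
    out = []
    for b in range(B):
        flat = [tok for win in x_windows[b * per_sample:(b + 1) * per_sample]
                for tok in win]
        assert all(len(tok) == C for tok in flat)
        # Cyclic right shift undoes the left shift applied before windowing.
        unshifted = flat[-window_shift:] + flat[:-window_shift] if window_shift else flat
        out.append(unshifted[:N])
    return out
```

For a sequence of length 5 with window_size=3 and window_shift=1, the padded length is 6, two windows are produced, and unwindowing returns the original 5 tokens.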

class cvnets.modules.windowed_transformer.WindowedTransformerEncoder(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, window_size: int | None = None, window_shift: int | None = None, *args, **kwargs)[source]

Bases: TransformerEncoder

This class defines the pre-norm Transformer encoder with the addition of windowed attention.

This class first partitions the input sequence into a series of windows (with an optional offset to use when defining windows). Then, it calls a TransformerEncoder module. Then, it undoes windowing.

Parameters:

opts – Command line arguments.
embed_dim – $C_{i n}$ from an expected input of size $(N, P, C_{i n})$ .
ffn_latent_dim – Inner dimension of the FFN.
num_heads – Number of heads in multi-head attention. Default: 8.
attn_dropout – Dropout rate for attention in multi-head attention. Default: 0.0.
dropout – Dropout rate. Default: 0.0.
ffn_dropout – Dropout between FFN layers. Default: 0.0.
transformer_norm_layer – Normalization layer. Default: layer_norm.
stochastic_dropout – Stochastic dropout setting. Default: 0.0.
window_size – The size of the window, if using windowed attention. Default: None.
window_shift – The size of the shift, if using shifted windowed attention. Default: None.

Shape:

Input: $(N, P, C_{i n})$ where $N$ is batch size, $P$ is number of patches, and $C_{i n}$ is input embedding dim
Output: same shape as the input

__init__(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, window_size: int | None = None, window_shift: int | None = None, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, x_prev: Tensor | None = None, key_padding_mask: Tensor | None = None, attn_mask: Tensor | None = None, *args, **kwargs) → Tensor[source]

Compute the outputs of the WindowedTransformerEncoder on an input.

Parameters:

x – The input tensor, of shape [batch_size, sequence_length, embed_dim].
x_prev – The context input, if using cross-attention. Its shape is [batch_size, sequence_length_2, embed_dim].
key_padding_mask – An optional tensor of masks to be applied to the inputs @x. Its shape is [batch_size, sequence_length].
attn_mask – An optional attention mask. Its shape is [batch_size, sequence_length, sequence_length_2]. (If using self-attention, the sequence lengths will be equal.)

Returns:

The WindowedTransformerEncoder output.

Module contents

class cvnets.modules.InvertedResidual(opts, in_channels: int, out_channels: int, stride: int, expand_ratio: int | float, dilation: int = 1, skip_connection: bool | None = True, *args, **kwargs)[source]

Bases: BaseModule

This class implements the inverted residual block, as described in MobileNetv2 paper

Parameters:

opts – command-line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H_{i n}, W_{i n})$
out_channels (int) – $C_{o u t}$ from an expected output of size $(N, C_{out}, H_{out}, W_{out})$
stride (Optional[int]) – Use convolutions with a stride. Default: 1
expand_ratio (Union[int, float]) – Expand the input channels by this factor in depth-wise conv
dilation (Optional[int]) – Use conv with dilation. Default: 1
skip_connection (Optional[bool]) – Use skip-connection. Default: True

Shape:

Input: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$

Note

If in_channels != out_channels and stride > 1, we set skip_connection=False

__init__(opts, in_channels: int, out_channels: int, stride: int, expand_ratio: int | float, dilation: int = 1, skip_connection: bool | None = True, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Bases: BaseModule

This class implements the inverted residual block with squeeze-excitation unit, as described in MobileNetv3 paper

Parameters:

opts – command-line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H_{i n}, W_{i n})$
out_channels (int) – $C_{o u t}$ from an expected output of size $(N, C_{out}, H_{out}, W_{out})$
expand_ratio (Union[int, float]) – Expand the input channels by this factor in depth-wise conv
dilation (Optional[int]) – Use conv with dilation. Default: 1
stride (Optional[int]) – Use convolutions with a stride. Default: 1
use_se (Optional[bool]) – Use squeeze-excitation block. Default: False
act_fn_name (Optional[str]) – Activation function name. Default: relu
se_scale_fn_name (Optional[str]) – Scale activation function inside SE unit. Defaults to hard_sigmoid
kernel_size (Optional[int]) – Kernel size in depth-wise convolution. Defaults to 3.
squeeze_factor (Optional[int]) – Squeezing factor in SE unit. Defaults to 4.

Shape:

Input: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$

__init__(opts, in_channels: int, out_channels: int, expand_ratio: int | float, dilation: int | None = 1, stride: int | None = 1, use_se: bool | None = False, act_fn_name: str | None = 'relu', se_scale_fn_name: str | None = 'hard_sigmoid', kernel_size: int | None = 3, squeeze_factor: int | None = 4, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.BasicResNetBlock(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs)[source]

Bases: BaseModule

This class defines the Basic block in the ResNet model.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H_{i n}, W_{i n})$
mid_channels (int) – $C_{m i d}$ from an expected tensor of size $(N, C_{m i d}, H_{o u t}, W_{o u t})$
out_channels (int) – $C_{o u t}$ from an expected output of size $(N, C_{o u t}, H_{o u t}, W_{o u t})$
stride (Optional[int]) – Stride for convolution. Default: 1
dilation (Optional[int]) – Dilation for convolution. Default: 1
dropout (Optional[float]) – Dropout after second convolution. Default: 0.0
stochastic_depth_prob (Optional[float]) – Stochastic depth drop probability (1 - survival_prob). Default: 0.0
squeeze_channels (Optional[int]) – The number of channels to use in the Squeeze-Excitation block for SE-ResNet. Default: None

Shape:

Input: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$

expansion: int = 1

__init__(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool

class cvnets.modules.BottleneckResNetBlock(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs)[source]

Bases: BaseModule

This class defines the Bottleneck block in the ResNet model.

Parameters:

opts – command-line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H_{i n}, W_{i n})$
mid_channels (int) – $C_{m i d}$ from an expected tensor of size $(N, C_{m i d}, H_{o u t}, W_{o u t})$
out_channels (int) – $C_{o u t}$ from an expected output of size $(N, C_{o u t}, H_{o u t}, W_{o u t})$
stride (Optional[int]) – Stride for convolution. Default: 1
dilation (Optional[int]) – Dilation for convolution. Default: 1
dropout (Optional[float]) – Dropout after third convolution. Default: 0.0
stochastic_depth_prob (Optional[float]) – Stochastic depth drop probability (1 - survival_prob). Default: 0.0
squeeze_channels (Optional[int]) – The number of channels to use in the Squeeze-Excitation block for SE-ResNet.

Shape:

Input: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$

expansion: int = 4

__init__(opts: Namespace, in_channels: int, mid_channels: int, out_channels: int, stride: int | None = 1, dilation: int | None = 1, dropout: float | None = 0.0, stochastic_depth_prob: float | None = 0.0, squeeze_channels: int | None = None, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool

class cvnets.modules.ASPP(opts, in_channels: int, out_channels: int, atrous_rates: Tuple[int], is_sep_conv: bool | None = False, dropout: float | None = 0.0, *args, **kwargs)[source]

Bases: BaseModule

ASPP module defined in DeepLab papers, here and here

Parameters:

opts – command-line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H, W)$
out_channels (int) – $C_{o u t}$ from an expected output of size $(N, C_{o u t}, H, W)$
atrous_rates (Tuple[int]) – atrous rates for different branches.
is_sep_conv (Optional[bool]) – Use separable convolution instead of standard convolution. Default: False
dropout (Optional[float]) – Apply dropout. Default is 0.0

Shape:

Input: $(N, C_{i n}, H, W)$
Output: $(N, C_{o u t}, H, W)$
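Each atrous rate widens the span of a branch without adding weights. As a rough illustration, assuming 3x3 kernels in the atrous branches (not stated on this page), the span and the padding that keeps $H$ and $W$ unchanged can be computed as below. The helper names and the example rates are hypothetical.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span covered by a kernel whose taps are spaced `rate` pixels apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def same_padding(kernel_size: int, rate: int) -> int:
    """Padding that keeps H and W unchanged at stride 1 (odd kernels)."""
    return (effective_kernel_size(kernel_size, rate) - 1) // 2


atrous_rates = (6, 12, 18)  # illustrative values only
for rate in atrous_rates:
    print(rate, effective_kernel_size(3, rate), same_padding(3, rate))
```

For a 3x3 kernel the size-preserving padding equals the atrous rate, which is consistent with the input and output sharing the same $H$ and $W$.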

__init__(opts, in_channels: int, out_channels: int, atrous_rates: Tuple[int], is_sep_conv: bool | None = False, dropout: float | None = 0.0, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.TransformerEncoder(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, *args, **kwargs)[source]

Bases: BaseModule

This class defines the pre-norm Transformer encoder.

Parameters:

opts – Command line arguments.
embed_dim – $C_{i n}$ from an expected input of size $(N, P, C_{i n})$ .
ffn_latent_dim – Inner dimension of the FFN.
num_heads – Number of heads in multi-head attention. Default: 8.
attn_dropout – Dropout rate for attention in multi-head attention. Default: 0.0.
dropout – Dropout rate. Default: 0.0.
ffn_dropout – Dropout between FFN layers. Default: 0.0.
transformer_norm_layer – Normalization layer. Default: layer_norm.
stochastic_dropout – Stochastic dropout setting. Default: 0.0.

Shape:

Input: $(N, P, C_{i n})$ where $N$ is batch size, $P$ is number of patches, and $C_{i n}$ is input embedding dim
Output: same shape as the input
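"Pre-norm" means the normalization is applied before each sub-layer, inside the residual branch. The sketch below illustrates that ordering for a single token vector in plain Python. It omits dropout and stochastic dropout, and the `attention` and `ffn` callables are placeholders rather than the cvnets layers.

```python
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def pre_norm_residual(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm: normalize first, apply the sub-layer, then add the skip."""
    y = sublayer(layer_norm(x))
    return [xi + yi for xi, yi in zip(x, y)]


def encoder_block(x: Vector, attention: Callable[[Vector], Vector],
                  ffn: Callable[[Vector], Vector]) -> Vector:
    """Attention sub-layer followed by FFN sub-layer, both pre-norm."""
    x = pre_norm_residual(x, attention)
    return pre_norm_residual(x, ffn)
```

Because both sub-layers sit on residual branches, the output keeps the input shape, which matches the Shape entry above.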

__init__(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, x_prev: Tensor | None = None, key_padding_mask: Tensor | None = None, attn_mask: Tensor | None = None, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.WindowedTransformerEncoder(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, window_size: int | None = None, window_shift: int | None = None, *args, **kwargs)[source]

Bases: TransformerEncoder

This class defines the pre-norm Transformer encoder with the addition of windowed attention.

This class first partitions the input sequence into a series of windows (with an optional offset to use when defining windows). Then, it calls a TransformerEncoder module. Then, it undoes windowing.
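The partition, encode, and undo sequence can be pictured with plain lists. This is only a sketch under stated assumptions: it applies the offset as a cyclic roll and requires the sequence length to be a multiple of the window size, whereas the actual module may pad or mask instead. `partition_windows` and `merge_windows` are hypothetical names.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_windows(tokens: Sequence[T], window_size: int,
                      window_shift: int = 0) -> List[List[T]]:
    """Split a non-empty sequence into windows after an optional offset."""
    if len(tokens) % window_size != 0:
        raise ValueError("sketch assumes length is a multiple of window_size")
    shift = window_shift % len(tokens)
    # Assumption: the offset is applied as a cyclic roll to the left.
    rolled = list(tokens[shift:]) + list(tokens[:shift])
    return [rolled[i:i + window_size] for i in range(0, len(rolled), window_size)]


def merge_windows(windows: List[List[T]], window_shift: int = 0) -> List[T]:
    """Concatenate the windows and undo the cyclic roll."""
    rolled = [token for window in windows for token in window]
    shift = window_shift % len(rolled)
    split = len(rolled) - shift
    return rolled[split:] + rolled[:split]


tokens = list(range(8))
windows = partition_windows(tokens, window_size=4, window_shift=2)
# An encoder would process each window independently at this point.
restored = merge_windows(windows, window_shift=2)
print(windows, restored)
```

The round trip returns the tokens in their original order, so the output keeps the input shape.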

Parameters:

opts – Command line arguments.
embed_dim – $C_{i n}$ from an expected input of size $(N, P, C_{i n})$ .
ffn_latent_dim – Inner dimension of the FFN.
num_heads – Number of heads in multi-head attention. Default: 8.
attn_dropout – Dropout rate for attention in multi-head attention. Default: 0.0.
dropout – Dropout rate. Default: 0.0.
ffn_dropout – Dropout between FFN layers. Default: 0.0.
transformer_norm_layer – Normalization layer. Default: layer_norm.
stochastic_dropout – Stochastic dropout setting. Default: 0.0.
window_size – The size of the window, if using windowed attention. Default: None.
window_shift – The size of the shift, if using shifted windowed attention. Default: None.

Shape:

Input: $(N, P, C_{i n})$ where $N$ is batch size, $P$ is number of patches, and $C_{i n}$ is input embedding dim
Output: same shape as the input

__init__(opts: Namespace, embed_dim: int, ffn_latent_dim: int, num_heads: int | None = 8, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, transformer_norm_layer: str | None = 'layer_norm', stochastic_dropout: float | None = 0.0, window_size: int | None = None, window_shift: int | None = None, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, x_prev: Tensor | None = None, key_padding_mask: Tensor | None = None, attn_mask: Tensor | None = None, *args, **kwargs) → Tensor[source]

Compute the outputs of the WindowedTransformerEncoder on an input.

Parameters:

x – The input tensor, of shape [batch_size, sequence_length, embed_dim].
x_prev – The context input, if using cross-attention. Its shape is [batch_size, sequence_length_2, embed_dim].
key_padding_mask – An optional tensor of masks to be applied to the inputs @x. Its shape is [batch_size, sequence_length].
attn_mask – An optional attention mask. Its shape is [batch_size, sequence_length, sequence_length_2]. (If using self-attention, the sequence lengths will be equal.)

Returns:

The WindowedTransformerEncoder output.

class cvnets.modules.SqueezeExcitation(opts, in_channels: int, squeeze_factor: int | None = 4, squeeze_channels: int | None = None, scale_fn_name: str | None = 'sigmoid', *args, **kwargs)[source]

Bases: BaseModule

This class defines the Squeeze-excitation module, in the SENet paper

Parameters:

opts – command-line arguments
in_channels (int) – $C$ from an expected input of size $(N, C, H, W)$
squeeze_factor (Optional[int]) – Reduce $C$ by this factor. Default: 4
squeeze_channels (Optional[int]) – This module’s output channels. Overrides squeeze_factor if specified
scale_fn_name (Optional[str]) – Scaling function name. Default: sigmoid

Shape:

Input: $(N, C, H, W)$
Output: $(N, C, H, W)$
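The squeeze, excite, and scale steps can be sketched in plain Python for a single image. This follows the general SENet recipe with a ReLU between the two projections, sigmoid gating, and no biases; these are assumptions, and the weight layout below is hypothetical rather than taken from cvnets.

```python
import math
from typing import List

Channel = List[List[float]]  # one H x W feature map


def squeeze_excite(x: List[Channel], w_reduce: List[List[float]],
                   w_expand: List[List[float]]) -> List[Channel]:
    """x has C channels; w_reduce is S x C and w_expand is C x S."""
    # Squeeze: global average pool gives one value per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excite: reduce to S channels, ReLU, expand back to C, sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: every channel is multiplied by its gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
```

The gates only rescale channels, so the output shape matches the input shape $(N, C, H, W)$.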

__init__(opts, in_channels: int, squeeze_factor: int | None = 4, squeeze_channels: int | None = None, scale_fn_name: str | None = 'sigmoid', *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.PSP(opts, in_channels: int, out_channels: int, pool_sizes: Tuple[int, ...] | None = (1, 2, 3, 6), dropout: float | None = 0.0, *args, **kwargs)[source]

Bases: BaseModule

This class defines the Pyramid Scene Parsing module in the PSPNet paper

Parameters:

opts – command-line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H, W)$
out_channels (int) – $C_{o u t}$ from an expected output of size $(N, C_{o u t}, H, W)$
pool_sizes (Optional[Tuple[int, ...]]) – List or Tuple of pool sizes. Default: (1, 2, 3, 6)
dropout (Optional[float]) – Apply dropout. Default is 0.0
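Each pool size defines a grid into which the feature map is averaged. As a one-dimensional sketch, assuming adaptive average pooling is used for the pyramid levels (not confirmed by this page), the bins can be computed as follows. `adaptive_avg_pool_1d` is a hypothetical helper.

```python
from typing import List


def adaptive_avg_pool_1d(values: List[float], output_size: int) -> List[float]:
    """Average `values` into `output_size` bins that cover the whole input."""
    length = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * length) // output_size                          # floor
        end = ((i + 1) * length + output_size - 1) // output_size    # ceil
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
for pool_size in (1, 2, 3, 6):  # the default pool_sizes
    print(pool_size, adaptive_avg_pool_1d(row, pool_size))
```

Small pool sizes summarize global context and larger ones keep more local detail.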

__init__(opts, in_channels: int, out_channels: int, pool_sizes: Tuple[int, ...] | None = (1, 2, 3, 6), dropout: float | None = 0.0, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Bases: BaseModule

This class defines the MobileViT block

Parameters:

opts – command line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H, W)$
transformer_dim (int) – Input dimension to the transformer unit
ffn_dim (int) – Dimension of the FFN block
n_transformer_blocks (Optional[int]) – Number of transformer blocks. Default: 2
head_dim (Optional[int]) – Head dimension in the multi-head attention. Default: 32
attn_dropout (Optional[float]) – Dropout in multi-head attention. Default: 0.0
dropout (Optional[float]) – Dropout rate. Default: 0.0
ffn_dropout (Optional[float]) – Dropout between FFN layers in transformer. Default: 0.0
patch_h (Optional[int]) – Patch height for unfolding operation. Default: 8
patch_w (Optional[int]) – Patch width for unfolding operation. Default: 8
transformer_norm_layer (Optional[str]) – Normalization layer in the transformer block. Default: layer_norm
conv_ksize (Optional[int]) – Kernel size to learn local representations in MobileViT block. Default: 3
dilation (Optional[int]) – Dilation rate in convolutions. Default: 1
no_fusion (Optional[bool]) – Do not combine the input and output feature maps. Default: False

__init__(opts, in_channels: int, transformer_dim: int, ffn_dim: int, n_transformer_blocks: int | None = 2, head_dim: int | None = 32, attn_dropout: float | None = 0.0, dropout: int | None = 0.0, ffn_dropout: int | None = 0.0, patch_h: int | None = 8, patch_w: int | None = 8, transformer_norm_layer: str | None = 'layer_norm', conv_ksize: int | None = 3, dilation: int | None = 1, no_fusion: bool | None = False, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

unfolding(feature_map: Tensor) → Tuple[Tensor, Dict][source]

folding(patches: Tensor, info_dict: Dict) → Tensor[source]
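To picture what unfolding and folding do, here is a plain-Python sketch on a single-channel 2D grid. It assumes the height and width are multiples of the patch size and that pixels sharing an in-patch position are grouped into one sequence across patches. The dictionary keys are hypothetical and need not match the `info_dict` returned by the actual method.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]


def unfold(feature_map: Grid, patch_h: int, patch_w: int) -> Tuple[List[List[int]], Dict]:
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_h or width % patch_w:
        raise ValueError("sketch assumes sizes divisible by the patch size")
    n_h, n_w = height // patch_h, width // patch_w
    sequences = []
    for py in range(patch_h):
        for px in range(patch_w):
            # One sequence per in-patch position, gathered over all patches.
            sequences.append([feature_map[ny * patch_h + py][nx * patch_w + px]
                              for ny in range(n_h) for nx in range(n_w)])
    info = {"orig_size": (height, width), "num_patches_w": n_w}
    return sequences, info


def fold(sequences: List[List[int]], info: Dict, patch_h: int, patch_w: int) -> Grid:
    height, width = info["orig_size"]
    n_w = info["num_patches_w"]
    grid = [[0] * width for _ in range(height)]
    for p, sequence in enumerate(sequences):
        py, px = divmod(p, patch_w)
        for n, value in enumerate(sequence):
            ny, nx = divmod(n, n_w)
            grid[ny * patch_h + py][nx * patch_w + px] = value
    return grid


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
sequences, info = unfold(grid, 2, 2)
print(sequences[0], fold(sequences, info, 2, 2) == grid)
```

Folding is the exact inverse of unfolding, so the spatial layout is restored after the transformer layers run on the sequences.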

forward_spatial(x: Tensor) → Tensor[source]

forward_temporal(x: Tensor, x_prev: Tensor | None = None) → Tensor | Tuple[Tensor, Tensor][source]

forward(x: Tensor | Tuple[Tensor], *args, **kwargs) → Tensor | Tuple[Tensor, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Bases: BaseModule

This class defines the MobileViTv2 block

Parameters:

opts – command line arguments
in_channels (int) – $C_{i n}$ from an expected input of size $(N, C_{i n}, H, W)$
attn_unit_dim (int) – Input dimension to the attention unit
ffn_multiplier (int) – Expand the input dimensions by this factor in FFN. Default is 2.
n_attn_blocks (Optional[int]) – Number of attention units. Default: 2
attn_dropout (Optional[float]) – Dropout in multi-head attention. Default: 0.0
dropout (Optional[float]) – Dropout rate. Default: 0.0
ffn_dropout (Optional[float]) – Dropout between FFN layers in transformer. Default: 0.0
patch_h (Optional[int]) – Patch height for unfolding operation. Default: 8
patch_w (Optional[int]) – Patch width for unfolding operation. Default: 8
conv_ksize (Optional[int]) – Kernel size to learn local representations in MobileViT block. Default: 3
dilation (Optional[int]) – Dilation rate in convolutions. Default: 1
attn_norm_layer (Optional[str]) – Normalization layer in the attention block. Default: layer_norm_2d

__init__(opts, in_channels: int, attn_unit_dim: int, ffn_multiplier: Sequence[int | float] | int | float | None = 2.0, n_attn_blocks: int | None = 2, attn_dropout: float | None = 0.0, dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, patch_h: int | None = 8, patch_w: int | None = 8, conv_ksize: int | None = 3, dilation: int | None = 1, attn_norm_layer: str | None = 'layer_norm_2d', *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

unfolding_pytorch(feature_map: Tensor) → Tuple[Tensor, Tuple[int, int]][source]

folding_pytorch(patches: Tensor, output_size: Tuple[int, int]) → Tensor[source]

unfolding_coreml(feature_map: Tensor) → Tuple[Tensor, Tuple[int, int]][source]

folding_coreml(patches: Tensor, output_size: Tuple[int, int]) → Tensor[source]

resize_input_if_needed(x)[source]

forward_spatial(x: Tensor, *args, **kwargs) → Tensor[source]

forward_temporal(x: Tensor, x_prev: Tensor, *args, **kwargs) → Tensor | Tuple[Tensor, Tensor][source]

forward(x: Tensor | Tuple[Tensor], *args, **kwargs) → Tensor | Tuple[Tensor, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.MobileOneBlock(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, inference_mode: bool = False, use_se: bool = False, use_act: bool = True, use_scale_branch: bool = True, num_conv_branches: int = 1)[source]

Bases: BaseModule

MobileOne building block.

For more details, please refer to our paper: An Improved One millisecond Mobile Backbone <https://arxiv.org/pdf/2206.04040.pdf>

__init__(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, inference_mode: bool = False, use_se: bool = False, use_act: bool = True, use_scale_branch: bool = True, num_conv_branches: int = 1) → None[source]

Construct a MobileOneBlock.

Parameters:

opts – Command line arguments.
in_channels – Number of channels in the input.
out_channels – Number of channels produced by the block.
kernel_size – Size of the convolution kernel.
stride – Stride size. Default: 1
padding – Zero-padding size. Default: 0
dilation – Kernel dilation factor. Default: 1
groups – Group number. Default: 1
inference_mode – If True, instantiates model in inference mode. Default: False
use_se – Whether to use SE-ReLU activations. Default: False
use_act – Whether to use activation. Default: True
use_scale_branch – Whether to use scale branch. Default: True
num_conv_branches – Number of linear conv branches. Default: 1

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass implements inference logic for module before and after reparameterization.

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

reparameterize() → None[source]: Following works like RepVGG: Making VGG-style ConvNets Great Again - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference.
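A core step in this kind of re-parameterization is folding a batch-norm layer into the preceding convolution, after which parallel branches with the same kernel shape can be summed. The sketch below shows this for one output channel with flattened weights. It illustrates the general technique, not the cvnets code, and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def fuse_conv_bn(weight: List[float], bias: float, gamma: float, beta: float,
                 running_mean: float, running_var: float,
                 eps: float = 1e-5) -> Tuple[List[float], float]:
    """Fold batch-norm statistics into the conv weight and bias."""
    scale = gamma / math.sqrt(running_var + eps)
    return [w * scale for w in weight], beta + (bias - running_mean) * scale


def merge_branches(branches: List[Tuple[List[float], float]]) -> Tuple[List[float], float]:
    """Sum already-fused branches; valid because convolution is linear."""
    weights = [sum(ws) for ws in zip(*(w for w, _ in branches))]
    return weights, sum(b for _, b in branches)


def conv(x: List[float], weight: List[float], bias: float) -> float:
    return sum(w * v for w, v in zip(weight, x)) + bias


x = [1.0, -2.0, 0.5]
w, b = [0.3, 0.1, -0.7], 0.2
gamma, beta, mean, var = 1.5, -0.4, 0.1, 2.0
reference = gamma * (conv(x, w, b) - mean) / math.sqrt(var + 1e-5) + beta
fused_w, fused_b = fuse_conv_bn(w, b, gamma, beta, mean, var)
print(abs(reference - conv(x, fused_w, fused_b)) < 1e-9)
```

After fusion, inference needs a single convolution per block instead of several conv and batch-norm branches.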

class cvnets.modules.RepLKBlock(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, groups: int = 1, small_kernel_size: int | None = None, inference_mode: bool = False, use_act: bool = True)[source]

Bases: BaseModule

This class defines the overparameterized large-kernel conv block in RepLKNet.

Reference: https://github.com/DingXiaoH/RepLKNet-pytorch

Parameters:

opts – Command-line arguments.
in_channels – Number of input channels.
out_channels – Number of output channels.
kernel_size – Kernel size of the large kernel conv branch.
stride – Stride size. Default: 1
dilation – Kernel dilation factor. Default: 1
groups – Group number. Default: 1
small_kernel_size – Kernel size of small kernel conv branch.
inference_mode – If True, instantiates model in inference mode. Default: False
use_act – If True, activation is used. Default: True

__init__(opts: Namespace, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, groups: int = 1, small_kernel_size: int | None = None, inference_mode: bool = False, use_act: bool = True) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Forward pass implements inference logic for module before and after reparameterization.

Parameters:

x – Input tensor of shape $(B, C, H, W)$.

Returns:

torch.Tensor of shape $(B, C, H, W)$.

reparameterize() → None[source]: Following works like RepVGG: Making VGG-style ConvNets Great Again - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference.
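For a block with a large-kernel and a small-kernel branch, the merge can be sketched by zero-padding the small kernel to the large size and adding the two. The sketch assumes batch norm has already been folded into each branch and that both branches share stride and dilation; `merge_kernels` is a hypothetical helper for one input/output channel pair.

```python
from typing import List

Kernel = List[List[float]]


def merge_kernels(large: Kernel, small: Kernel) -> Kernel:
    """Add a centered, zero-padded small kernel to a large kernel."""
    k_large, k_small = len(large), len(small)
    if k_small > k_large or (k_large - k_small) % 2 != 0:
        raise ValueError("small kernel must be centerable inside the large one")
    pad = (k_large - k_small) // 2
    merged = [row[:] for row in large]  # copy so the input is not modified
    for i in range(k_small):
        for j in range(k_small):
            merged[i + pad][j + pad] += small[i][j]
    return merged


large = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
small = [[2.0]]
print(merge_kernels(large, small))
```

Since convolution is linear, one convolution with the merged kernel gives the same response as summing the outputs of the two branches.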

class cvnets.modules.FeaturePyramidNetwork(opts, in_channels: List[int], output_strides: List[str], out_channels: int, *args, **kwargs)[source]

Bases: BaseModule

This class implements the Feature Pyramid Network module for object detection.

Parameters:

opts – command-line arguments
in_channels (List[int]) – List of channels at different output strides
output_strides (List[int]) – Feature maps from these output strides will be used in FPN
out_channels (int) – Output channels

__init__(opts, in_channels: List[int], output_strides: List[str], out_channels: int, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

reset_weights() → None[source]: Resets the weights of FPN layers

forward(x: Dict[str, Tensor], *args, **kwargs) → Dict[str, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.SSDHead(opts, in_channels: int, n_anchors: int, n_classes: int, n_coordinates: int | None = 4, proj_channels: int | None = -1, kernel_size: int | None = 3, stride: int | None = 1, *args, **kwargs)[source]

Bases: BaseModule

This class defines the SSD object detection Head

Parameters:

opts – command-line arguments
in_channels (int) – $C$ from an expected input of size $(N, C, H, W)$
n_anchors (int) – Number of anchors
n_classes (int) – Number of classes in the dataset
n_coordinates (Optional[int]) – Number of coordinates. Default: 4 (x, y, w, h)
proj_channels (Optional[int]) – Number of projected channels. If -1, then projection layer is not used
kernel_size (Optional[int]) – Kernel size in convolutional layer. If kernel_size=1, then standard point-wise convolution is used. Otherwise, separable convolution is used
stride (Optional[int]) – stride for feature map. If stride > 1, then feature map is sampled at this rate and predictions are made on fewer pixels as compared to the input tensor. Default: 1
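The parameters above determine the sizes of the two returned tensors. The sketch below assumes a single convolution that predicts `n_coordinates + n_classes` values per anchor at every pixel, reshaped so that each anchor at each location is one prediction. That layout is an assumption, and the sketch covers only `stride=1`; `ssd_head_shapes` is a hypothetical helper.

```python
from typing import Dict, Tuple


def ssd_head_shapes(batch: int, height: int, width: int, n_anchors: int,
                    n_classes: int, n_coordinates: int = 4) -> Dict[str, Tuple[int, ...]]:
    """Expected sizes for a head applied to a (batch, C, height, width) input."""
    # Assumption: one conv predicts coordinates and class scores for all anchors.
    conv_out_channels = n_anchors * (n_coordinates + n_classes)
    n_predictions = height * width * n_anchors
    return {
        "conv_out": (batch, conv_out_channels, height, width),
        "locations": (batch, n_predictions, n_coordinates),
        "classes": (batch, n_predictions, n_classes),
    }


print(ssd_head_shapes(batch=1, height=10, width=10, n_anchors=6, n_classes=81))
```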

__init__(opts, in_channels: int, n_anchors: int, n_classes: int, n_coordinates: int | None = 4, proj_channels: int | None = -1, kernel_size: int | None = 3, stride: int | None = 1, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

reset_parameters() → None[source]

forward(x: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.SSDInstanceHead(opts, in_channels: int, n_classes: int | None = 1, inner_dim: int | None = 256, output_stride: int | None = 1, output_size: int | None = 8, *args, **kwargs)[source]

Bases: BaseModule

Instance segmentation head for SSD model.

__init__(opts, in_channels: int, n_classes: int | None = 1, inner_dim: int | None = 256, output_stride: int | None = 1, output_size: int | None = 8, *args, **kwargs) → None[source]

Parameters:

opts – command-line arguments
in_channels (int) – $C$ from an expected input of size $(N, C, H, W)$
n_classes (Optional[int]) – Number of classes. Default: 1
inner_dim (Optional[int]) – Inner dimension of the instance head. Default: 256
output_stride (Optional[int]) – Output stride of the feature map. Output stride is the ratio of input to the feature map size. Default: 1
output_size (Optional[int]) – Output size of the instances extracted from RoIAlign layer. Default: 8

reset_parameters() → None[source]

forward(x: Tensor, boxes: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.EfficientNetBlock(stochastic_depth_prob: float, *args, **kwargs)[source]

Bases: InvertedResidualSE

This class implements a variant of the inverted residual block with squeeze-excitation unit, as described in MobileNetv3 paper. This variant includes stochastic depth, as used in EfficientNet paper.

Parameters:

stochastic_depth_prob (float) – Stochastic depth probability.

For other arguments, refer to the parent class.

Shape:

Input: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$
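Stochastic depth randomly skips the residual branch during training. The sketch below shows one common formulation for a single sample: the branch is dropped with probability `drop_prob` and otherwise rescaled by the survival probability. Whether cvnets applies it per sample or per batch is not stated here, so treat this as an assumption; the function name is hypothetical.

```python
import random
from typing import List, Optional


def stochastic_depth_residual(x: List[float], branch_out: List[float],
                              drop_prob: float, training: bool,
                              rng: Optional[random.Random] = None) -> List[float]:
    """Return x + branch_out, randomly dropping the branch while training."""
    if not training or drop_prob == 0.0:
        return [a + b for a, b in zip(x, branch_out)]
    rng = rng or random.Random()
    survival_prob = 1.0 - drop_prob
    if rng.random() >= survival_prob:
        # Branch dropped: only the skip connection remains.
        return list(x)
    # Branch kept: rescale so the expected output matches inference.
    return [a + b / survival_prob for a, b in zip(x, branch_out)]
```

At inference time the branch is always kept, so the block behaves like a plain residual block.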

__init__(stochastic_depth_prob: float, *args, **kwargs) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.SwinTransformerBlock(opts, embed_dim: int, num_heads: int, window_size: List[int], shift_size: List[int], mlp_ratio: float = 4.0, dropout: float = 0.0, attn_dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, stochastic_depth_prob: float = 0.0, norm_layer: str | None = 'layer_norm')[source]

Bases: BaseModule

Swin Transformer Block.

Parameters:

embed_dim (int) – Number of input channels.
num_heads (int) – Number of attention heads.
window_size (List[int]) – Window size.
shift_size (List[int]) – Shift size for shifted window attention.
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4.0.
dropout (float) – Dropout rate. Default: 0.0.
attn_dropout (Optional[float]) – Attention dropout rate. Default: 0.0.
stochastic_depth_prob (float) – Stochastic depth rate. Default: 0.0.
norm_layer (Optional[str]) – Normalization layer. Default: layer_norm.
attn_layer (nn.Module) – Attention layer. Default: ShiftedWindowAttention

__init__(opts, embed_dim: int, num_heads: int, window_size: List[int], shift_size: List[int], mlp_ratio: float = 4.0, dropout: float = 0.0, attn_dropout: float | None = 0.0, ffn_dropout: float | None = 0.0, stochastic_depth_prob: float = 0.0, norm_layer: str | None = 'layer_norm')[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.PatchMerging(opts, dim: int, norm_layer: str, strided: bool | None = True)[source]

Bases: BaseModule

Patch Merging Layer.

Parameters:

dim (int) – Number of input channels.
norm_layer (str) – Normalization layer name.
strided (Optional[bool]) – Down-sample the input by a factor of 2. Default is True.

__init__(opts, dim: int, norm_layer: str, strided: bool | None = True)[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, *args, **kwargs) → Tensor[source]

Parameters:

x (Tensor) – input tensor with expected layout of […, H, W, C]

Returns:

Tensor with layout of […, H/2, W/2, 2*C]
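The halving of $H$ and $W$ comes from grouping each 2x2 neighborhood into one position. The sketch below shows only that gathering step on nested lists, producing 4*C channels. The normalization and the linear projection that would reduce 4*C to 2*C are omitted, and the concatenation order is an assumption. `gather_2x2` is a hypothetical helper.

```python
from typing import List

Feature = List[List[List[float]]]  # H x W x C


def gather_2x2(x: Feature) -> Feature:
    """Concatenate the channels of each 2x2 neighborhood: H/2 x W/2 x 4C."""
    height, width = len(x), len(x[0])
    if height % 2 or width % 2:
        raise ValueError("sketch assumes even height and width")
    merged = []
    for i in range(0, height, 2):
        row = []
        for j in range(0, width, 2):
            # Assumed order: top-left, bottom-left, top-right, bottom-right.
            row.append(x[i][j] + x[i + 1][j] + x[i][j + 1] + x[i + 1][j + 1])
        merged.append(row)
    return merged


x = [[[1.0], [2.0]], [[3.0], [4.0]]]  # H=2, W=2, C=1
print(gather_2x2(x))
```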

class cvnets.modules.Permute(dims: List[int])[source]

Bases: BaseModule

This module returns a view of the tensor input with its dimensions permuted.

Parameters:

dims (List[int]) – The desired ordering of dimensions

__init__(dims: List[int])[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cvnets.modules.XRegNetBlock(opts: Namespace, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stochastic_depth_prob: float = 0.0)[source]

Bases: BaseModule

This class implements the X block based on the ResNet bottleneck block. See Figure 4 of the RegNet paper.

Parameters:

opts – command-line arguments
width_in – The number of input channels
width_out – The number of output channels
stride – Stride for convolution
groups – Number of groups for convolution
bottleneck_multiplier – The number of in/out channels of the intermediate conv layer will be scaled by this value
se_ratio – The squeeze-excitation ratio. The number of channels in the SE module will be scaled by this value
stochastic_depth_prob – The stochastic depth probability
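As a rough sketch of how the two scaling parameters might translate into channel counts, the snippet below assumes, as in common RegNet implementations, that the bottleneck width is scaled from `width_out` and the SE width from `width_in`. This page does not say which base each factor applies to, so both choices are assumptions and `xregnet_widths` is a hypothetical helper.

```python
from typing import Tuple


def xregnet_widths(width_in: int, width_out: int,
                   bottleneck_multiplier: float, se_ratio: float) -> Tuple[int, int]:
    """Return (bottleneck_width, se_channels) under the stated assumptions."""
    # Assumption: the intermediate conv width is scaled from width_out.
    bottleneck_width = int(round(width_out * bottleneck_multiplier))
    # Assumption: the SE width is scaled from width_in.
    se_channels = int(round(se_ratio * width_in))
    return bottleneck_width, se_channels


print(xregnet_widths(width_in=32, width_out=64,
                     bottleneck_multiplier=1.0, se_ratio=0.25))
```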

__init__(opts: Namespace, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stochastic_depth_prob: float = 0.0) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]

Forward pass for XRegNetBlock.

Parameters:

x – Batch of images

Returns:

output of XRegNetBlock including stochastic depth layer and residual.

Shape:

x: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$

class cvnets.modules.AnyRegNetStage(opts: Namespace, depth: int, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stage_index: int, stochastic_depth_probs: List[float])[source]

Bases: BaseModule

This class implements a ‘stage’ as defined in the RegNet paper. It consists of a sequence of bottleneck blocks.

Parameters:

opts – command-line arguments
depth – The number of XRegNetBlocks in the stage
width_in – The number of input channels of the first block
width_out – The number of output channels of each block
stride – Stride for convolution of first block
groups – Number of groups for the intermediate convolution (bottleneck) layer in each block
bottleneck_multiplier – The number of in/out channels of the intermediate conv layer of each block will be scaled by this value
se_ratio – The squeeze-excitation ratio. The number of channels in the SE module of each block will be scaled by this value
stage_depths – A list of the number of blocks in each stage
stage_index – The index of the current stage being constructed
stochastic_depth_probs – The stochastic depth probabilities
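The parameter descriptions imply that only the first block of a stage changes the channel count and applies the stride. A minimal sketch of the resulting per-block settings is given below. It assumes later blocks use stride 1, which this page does not state, and `stage_block_configs` is a hypothetical helper.

```python
from typing import Dict, List


def stage_block_configs(depth: int, width_in: int, width_out: int,
                        stride: int) -> List[Dict[str, int]]:
    """Per-block settings for a stage made of `depth` blocks."""
    configs = []
    for index in range(depth):
        first = index == 0
        configs.append({
            "width_in": width_in if first else width_out,
            "width_out": width_out,
            # Assumption: only the first block down-samples.
            "stride": stride if first else 1,
        })
    return configs


for config in stage_block_configs(depth=3, width_in=32, width_out=64, stride=2):
    print(config)
```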

__init__(opts: Namespace, depth: int, width_in: int, width_out: int, stride: int, groups: int, bottleneck_multiplier: float, se_ratio: float, stage_index: int, stochastic_depth_probs: List[float]) → None[source]: Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) → Tensor[source]

Forward pass through all blocks in the stage.

Parameters:

x – Batch of images.

Returns:

output of passing x through all blocks in the stage.

Shape:

x: $(N, C_{i n}, H_{i n}, W_{i n})$
Output: $(N, C_{o u t}, H_{o u t}, W_{o u t})$