# Training MobileViTv2 Models
## ImageNet-1k Training
Single node training of MobileViTv2-2.0 with 8 A100 GPUs. On average, training takes about 2 days.
```bash
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_2_0_0 --common.override-kwargs scheduler.cosine.max_lr=0.0020 scheduler.cosine.min_lr=0.0002 optim.weight_decay=0.050 model.classification.mitv2.width_multiplier=2.00
```
You may need to change the training and validation dataset locations in the config file. We train each model using 8 A100 GPUs with an effective batch size of 1024 images (128 images per GPU x 8 GPUs). If you are not using A100 GPUs and run into out-of-memory (OOM) errors, you may try one of the following options:
- Reduce the batch size and increase the number of GPUs, so that the effective batch size is 1024 images.
- Reduce the batch size and use the gradient accumulation option (`--common.accum-freq`), so that the effective batch size (= images per GPU x number of GPUs x `--common.accum-freq`) is 1024 images; see the sketch after this list.
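As an illustration of the second option, here is a minimal sketch that keeps the effective batch size at 1024 on a single 4-GPU node (4 GPUs x 128 images per GPU x accumulation frequency 2). The per-GPU batch-size key `dataset.train_batch_size0` is an assumption; verify the exact key name in the config file before use.

```bash
# Sketch: 4 GPUs x 128 images per GPU x accum-freq 2 = 1024 effective batch size.
# NOTE: dataset.train_batch_size0 (per-GPU batch size) is an assumed key name;
# check config/classification/imagenet/mobilevit_v2.yaml for the exact key.
PYTHONWARNINGS="ignore" cvnets-train \
    --common.config-file config/classification/imagenet/mobilevit_v2.yaml \
    --common.results-loc mobilevitv2_results/width_2_0_0 \
    --common.accum-freq 2 \
    --common.override-kwargs dataset.train_batch_size0=128 \
    scheduler.cosine.max_lr=0.0020 scheduler.cosine.min_lr=0.0002 \
    optim.weight_decay=0.050 model.classification.mitv2.width_multiplier=2.00
```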
Evaluation can be done using the following command:
```bash
CUDA_VISIBLE_DEVICES=0 cvnets-eval --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_2_0_0 --model.classification.pretrained mobilevitv2_results/width_2_0_0/checkpoint_ema_best.pt --common.override-kwargs model.classification.mitv2.width_multiplier=2.00
```
Examples for training other MobileViTv2 variants are given below. Note that, relative to the MobileViTv2-2.0 configuration, we linearly (1) increase the learning rate and (2) decrease the weight decay as the width multiplier decreases.
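For reference, the per-variant hyperparameter values used in the commands below are:

| Model | max LR | min LR | weight decay |
|---|---|---|---|
| MobileViTv2-2.0 | 0.0020 | 0.0002 | 0.050 |
| MobileViTv2-1.75 | 0.0026 | 0.00026 | 0.039 |
| MobileViTv2-1.5 | 0.0035 | 0.00035 | 0.029 |
| MobileViTv2-1.25 | 0.0049 | 0.00049 | 0.020 |
| MobileViTv2-1.00 | 0.0075 | 0.00075 | 0.013 |
| MobileViTv2-0.75 | 0.0090 | 0.00090 | 0.008 |
| MobileViTv2-0.5 | 0.0090 | 0.00090 | 0.004 |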
### MobileViTv2-1.75

Single node training of MobileViTv2-1.75 with 8 A100 GPUs:

```bash
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_1_7_5 --common.override-kwargs scheduler.cosine.max_lr=0.0026 scheduler.cosine.min_lr=0.00026 optim.weight_decay=0.039 model.classification.mitv2.width_multiplier=1.75
```
### MobileViTv2-1.5

Single node training of MobileViTv2-1.5 with 8 A100 GPUs:

```bash
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_1_5_0 --common.override-kwargs scheduler.cosine.max_lr=0.0035 scheduler.cosine.min_lr=0.00035 optim.weight_decay=0.029 model.classification.mitv2.width_multiplier=1.50
```
### MobileViTv2-1.25

Single node training of MobileViTv2-1.25 with 8 A100 GPUs:

```bash
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_1_2_5 --common.override-kwargs scheduler.cosine.max_lr=0.0049 scheduler.cosine.min_lr=0.00049 optim.weight_decay=0.020 model.classification.mitv2.width_multiplier=1.25
```
### MobileViTv2-1.00

Single node training of MobileViTv2-1.00 with 8 A100 GPUs:

```bash
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_1_0_0 --common.override-kwargs scheduler.cosine.max_lr=0.0075 scheduler.cosine.min_lr=0.00075 optim.weight_decay=0.013 model.classification.mitv2.width_multiplier=1.00
```
### MobileViTv2-0.75

Single node training of MobileViTv2-0.75 with 8 A100 GPUs:

```bash
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_0_7_5 --common.override-kwargs scheduler.cosine.max_lr=0.0090 scheduler.cosine.min_lr=0.00090 optim.weight_decay=0.008 model.classification.mitv2.width_multiplier=0.75
```
### MobileViTv2-0.5

Single node training of MobileViTv2-0.5 with 8 A100 GPUs:

```bash
PYTHONWARNINGS="ignore" cvnets-train --common.config-file config/classification/imagenet/mobilevit_v2.yaml --common.results-loc mobilevitv2_results/width_0_5_0 --common.override-kwargs scheduler.cosine.max_lr=0.0090 scheduler.cosine.min_lr=0.00090 optim.weight_decay=0.004 model.classification.mitv2.width_multiplier=0.50
```
## ImageNet-21k-P Pre-training
ImageNet-21k-P is significantly larger than ImageNet-1k in terms of dataset size. To train on ImageNet-21k-P, we use 32 A100 GPUs, i.e., four 8-GPU nodes. For faster convergence, we initialize the model with the ImageNet-1k checkpoint.
Multi-node training of MobileViTv2-2.0 with four 8-GPU nodes on the ImageNet-21k-P dataset:

Node-0:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_2_0_0 --ddp.rank 0 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.05 model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-1:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_2_0_0 --ddp.rank 8 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.05 model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-2:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_2_0_0 --ddp.rank 16 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.05 model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-3:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_2_0_0 --ddp.rank 24 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.05 model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
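The four per-node commands differ only in `--ddp.rank`, which is the global rank of the node's first GPU: node k (of 4) gets rank k x 8, and the world size is 4 nodes x 8 GPUs = 32. A minimal sketch of a single launch script, assuming a hypothetical `NODE_ID` variable set to 0-3 on each machine:

```bash
# Hypothetical launcher: set NODE_ID (0-3) on each machine before running.
# Base rank of node k is k * 8 (8 GPUs per node); world size is 4 * 8 = 32.
NODE_ID=0
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml \
    --common.results-loc mobilevitv2_results_in21k/width_2_0_0 \
    --ddp.rank $((NODE_ID * 8)) --ddp.world-size 32 \
    --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl \
    --common.override-kwargs optim.weight_decay=0.05 \
    model.classification.mitv2.width_multiplier=2.0 \
    model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```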
Evaluation can be done using the following command:
```bash
CUDA_VISIBLE_DEVICES=0 cvnets-eval --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_2_0_0 --model.classification.pretrained mobilevitv2_results_in21k/width_2_0_0/checkpoint_ema_best.pt --common.override-kwargs model.classification.mitv2.width_multiplier=2.00 model.classification.finetune_pretrained_model=false
```
Multi-node training of MobileViTv2-1.75 with four 8-GPU nodes on the ImageNet-21k-P dataset:

Node-0:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_7_5 --ddp.rank 0 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.039 model.classification.mitv2.width_multiplier=1.75 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-1:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_7_5 --ddp.rank 8 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.039 model.classification.mitv2.width_multiplier=1.75 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-2:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_7_5 --ddp.rank 16 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.039 model.classification.mitv2.width_multiplier=1.75 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-3:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_7_5 --ddp.rank 24 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.039 model.classification.mitv2.width_multiplier=1.75 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
Multi-node training of MobileViTv2-1.5 with four 8-GPU nodes on the ImageNet-21k-P dataset:

Node-0:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_5_0 --ddp.rank 0 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.029 model.classification.mitv2.width_multiplier=1.5 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-1:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_5_0 --ddp.rank 8 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.029 model.classification.mitv2.width_multiplier=1.5 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-2:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_5_0 --ddp.rank 16 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.029 model.classification.mitv2.width_multiplier=1.5 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```

Node-3:
```bash
cvnets-train --common.config-file config/classification/imagenet_21k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k/width_1_5_0 --ddp.rank 24 --ddp.world-size 32 --ddp.dist-url 'tcp://IP_OF_NODE0:FREEPORT' --ddp.backend nccl --common.override-kwargs optim.weight_decay=0.029 model.classification.mitv2.width_multiplier=1.5 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
## ImageNet-1k Fine-tuning
### Finetune from ImageNet-1k pre-training at 384x384 resolution
Single node finetuning of MobileViTv2-2.0 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_2_0_0 --common.override-kwargs model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
where `LOCATION_OF_IMAGENET_1k_CHECKPOINT` is the location of the best EMA checkpoint obtained after pre-training on the ImageNet-1k dataset at 256x256 resolution.
Evaluation can be done using the following command:
```bash
CUDA_VISIBLE_DEVICES=0 cvnets-eval --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_2_0_0 --model.classification.pretrained mobilevitv2_results_in1k_ft_384/width_2_0_0/checkpoint_ema_best.pt --common.override-kwargs model.classification.mitv2.width_multiplier=2.00
```
Single node finetuning of MobileViTv2-1.75 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_1_7_5 --common.override-kwargs model.classification.mitv2.width_multiplier=1.75 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
Single node finetuning of MobileViTv2-1.5 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_1_5_0 --common.override-kwargs model.classification.mitv2.width_multiplier=1.5 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
Single node finetuning of MobileViTv2-1.25 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_1_2_5 --common.override-kwargs model.classification.mitv2.width_multiplier=1.25 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
Single node finetuning of MobileViTv2-1.00 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_1_0_0 --common.override-kwargs model.classification.mitv2.width_multiplier=1.00 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
Single node finetuning of MobileViTv2-0.75 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_0_7_5 --common.override-kwargs model.classification.mitv2.width_multiplier=0.75 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
Single node finetuning of MobileViTv2-0.5 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_0_5_0 --common.override-kwargs model.classification.mitv2.width_multiplier=0.5 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```
Note: If a node has more than 2 GPUs, `CUDA_VISIBLE_DEVICES=0,1` can be used to run the job on GPU-0 and GPU-1 only.
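For example, restricting the MobileViTv2-2.0 finetuning job above to the first two GPUs of a larger node:

```bash
# Run the 384x384 finetuning job on GPU-0 and GPU-1 only.
CUDA_VISIBLE_DEVICES=0,1 cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in1k.yaml --common.results-loc mobilevitv2_results_in1k_ft_384/width_2_0_0 --common.override-kwargs model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_1k_CHECKPOINT"
```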
### Finetune from ImageNet-21k-P pre-training at 256x256 resolution
Single node finetuning of MobileViTv2-2.0 with 4 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_in21k_to_1k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k_ft_256/width_2_0_0 --common.override-kwargs model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_21k_CHECKPOINT"
```
where `LOCATION_OF_IMAGENET_21k_CHECKPOINT` is the location of the best EMA checkpoint obtained after pre-training on the ImageNet-21k-P dataset at 256x256 resolution.
Evaluation can be done using the following command:
```bash
CUDA_VISIBLE_DEVICES=0 cvnets-eval --common.config-file config/classification/finetune_in21k_to_1k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k_ft_256/width_2_0_0 --model.classification.pretrained mobilevitv2_results_in21k_ft_256/width_2_0_0/checkpoint_ema_best.pt --common.override-kwargs model.classification.mitv2.width_multiplier=2.00 model.classification.finetune_pretrained_model=false
```
Single node finetuning of MobileViTv2-1.75 with 4 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_in21k_to_1k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k_ft_256/width_1_7_5 --common.override-kwargs model.classification.mitv2.width_multiplier=1.75 model.classification.pretrained="LOCATION_OF_IMAGENET_21k_CHECKPOINT"
```
Single node finetuning of MobileViTv2-1.5 with 4 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_in21k_to_1k/mobilevit_v2.yaml --common.results-loc mobilevitv2_results_in21k_ft_256/width_1_5_0 --common.override-kwargs model.classification.mitv2.width_multiplier=1.5 model.classification.pretrained="LOCATION_OF_IMAGENET_21k_CHECKPOINT"
```
### Finetune from ImageNet-21k-P+ImageNet-1k pre-training at 384x384 resolution
Single node finetuning of MobileViTv2-2.0 with 2 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in21k_in1k.yaml --common.results-loc mobilevitv2_results_in21k_1k_ft_384/width_2_0_0 --common.override-kwargs model.classification.mitv2.width_multiplier=2.0 model.classification.pretrained="LOCATION_OF_IMAGENET_21k_1k_CHECKPOINT"
```
where `LOCATION_OF_IMAGENET_21k_1k_CHECKPOINT` is the location of the best EMA checkpoint obtained after finetuning a pre-trained ImageNet-21k-P model on ImageNet-1k at 256x256 resolution.
Evaluation can be done using the following command:
```bash
CUDA_VISIBLE_DEVICES=0 cvnets-eval --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in21k_in1k.yaml --common.results-loc mobilevitv2_results_in21k_1k_ft_384/width_2_0_0 --model.classification.pretrained mobilevitv2_results_in21k_1k_ft_384/width_2_0_0/checkpoint_ema_best.pt --common.override-kwargs model.classification.mitv2.width_multiplier=2.00 model.classification.finetune_pretrained_model=false
```
Single node finetuning of MobileViTv2-1.75 with 4 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in21k_in1k.yaml --common.results-loc mobilevitv2_results_in21k_1k_ft_384/width_1_7_5 --common.override-kwargs model.classification.mitv2.width_multiplier=1.75 model.classification.pretrained="LOCATION_OF_IMAGENET_21k_1k_CHECKPOINT"
```
Single node finetuning of MobileViTv2-1.5 with 4 A100 GPUs:

```bash
cvnets-train --common.config-file config/classification/finetune_higher_res/mobilevit_v2_in21k_in1k.yaml --common.results-loc mobilevitv2_results_in21k_1k_ft_384/width_1_5_0 --common.override-kwargs model.classification.mitv2.width_multiplier=1.5 model.classification.pretrained="LOCATION_OF_IMAGENET_21k_1k_CHECKPOINT"
```
## Citation
```bibtex
@article{mehta2022separable,
    title={Separable Self-attention for Mobile Vision Transformers},
    author={Sachin Mehta and Mohammad Rastegari},
    journal={arXiv preprint arXiv:2206.02680},
    year={2022}
}
```