How to Create a New Dataset Type

Each dataset class in CVNets should be registered with data.datasets.DATASET_REGISTRY. You can either create a new dataset class from scratch or extend one of the existing ones.

The DATASET_REGISTRY.register class decorator lets you set a name and task type for the dataset class:

from data.datasets import DATASET_REGISTRY
from data.datasets.dataset_base import BaseImageDataset

@DATASET_REGISTRY.register(name="ade20k", type="segmentation")
class ADE20KDataset(BaseImageDataset):
    # PyTorch Dataset type; implement __getitem__ and __len__ here.
    ...

This allows you to specify this dataset in your config file with the following format:

dataset:
  name: "ade20k"
  category: "segmentation"
  # Where the data is stored for train/validation (can be different)
  root_train: "/mnt/vision_datasets/ADEChallengeData2016/"
  root_val: "/mnt/vision_datasets/ADEChallengeData2016/"

The name and category keys identify the dataset and its task; in the example above, category matches the type given in the decorator. You can optionally specify the data location using root_train and root_val. BaseImageDataset will choose the correct path based on the is_training and is_evaluation parameters.
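
To make the flow from decorator to config concrete, here is a minimal, self-contained sketch of the registry pattern. It is not the CVNets implementation: ToyRegistry, ToyBaseDataset, ToyADE20K, and the /data/train and /data/val paths are all hypothetical stand-ins, and the root selection only approximates what BaseImageDataset is described as doing with is_training.

```python
from typing import Callable, Dict, Tuple, Type


class ToyRegistry:
    """Hypothetical stand-in for DATASET_REGISTRY (not the CVNets code)."""

    def __init__(self) -> None:
        self._classes: Dict[Tuple[str, str], Type] = {}

    def register(self, name: str, type: str) -> Callable[[Type], Type]:
        # `type` shadows the builtin only to mirror the decorator keyword above.
        def decorator(cls: Type) -> Type:
            self._classes[(name, type)] = cls
            return cls  # return the class unchanged so it stays usable directly

        return decorator

    def get(self, name: str, category: str) -> Type:
        # Look up a class by the same (name, task) pair used at registration.
        return self._classes[(name, category)]


TOY_REGISTRY = ToyRegistry()


class ToyBaseDataset:
    """Hypothetical base class that picks a data root from the config."""

    def __init__(self, config: dict, is_training: bool = True) -> None:
        key = "root_train" if is_training else "root_val"
        self.root = config["dataset"][key]


@TOY_REGISTRY.register(name="ade20k", type="segmentation")
class ToyADE20K(ToyBaseDataset):
    pass


# Dict equivalent of the YAML config; the paths are made up for illustration.
config = {
    "dataset": {
        "name": "ade20k",
        "category": "segmentation",
        "root_train": "/data/train",
        "root_val": "/data/val",
    }
}

dataset_cls = TOY_REGISTRY.get(
    config["dataset"]["name"], config["dataset"]["category"]
)
train_set = dataset_cls(config, is_training=True)
val_set = dataset_cls(config, is_training=False)
print(train_set.root, val_set.root)
```

Keying the lookup on both name and task is one way to let the same dataset name be reused across tasks; whether CVNets does exactly this is an assumption of the sketch.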

Currently, all datasets in CVNets subclass either BaseImageDataset or BaseVideoDataset, both of which subclass BaseDataset. For now, this is only a soft requirement.

Extending an Existing Dataset

Most of the time, there is no need to create a new dataset class from scratch. Instead, you can simply extend an existing dataset like ImagenetDataset.

ImagenetDataset follows the ImageFolder class in torchvision.datasets. If your data follows the same format, you can extend ImagenetDataset and change only the parts you need, such as adding your amazing new transforms:

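from typing import Tuple, Union

from data.datasets import DATASET_REGISTRY
from data.datasets.classification.imagenet import ImagenetDataset

@DATASET_REGISTRY.register(name="my-new-dataset", type="classification")
class AmazingDataset(ImagenetDataset):
    def training_transforms(self, size: Union[Tuple[int, int], int]):
        # My amazing new training-time transforms go here.
        # Placeholder (assumes the parent defines the same method signature):
        # fall back to the parent's transforms until yours are written.
        return super().training_transforms(size)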

Keep in mind that you should probably change the root_train and root_val paths to where your data is located:

dataset:
  name: "my-new-dataset"
  category: "classification"
  root_train: "<path-to-training-data>"
  root_val: "<path-to-validation-data>"
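
Because the roots are expected to follow the ImageFolder layout (one sub-directory per class, with label indices assigned by sorted directory name), it can help to sanity-check a directory before training. The sketch below uses only the standard library; find_class_folders is a hypothetical helper written for illustration, not a CVNets or torchvision function.

```python
import os
import tempfile
from typing import Dict


def find_class_folders(root: str) -> Dict[str, int]:
    """Map each class sub-directory of `root` to an index, in sorted order."""
    class_names = sorted(
        entry.name for entry in os.scandir(root) if entry.is_dir()
    )
    if not class_names:
        raise FileNotFoundError(f"No class folders found in {root}")
    return {name: idx for idx, name in enumerate(class_names)}


# Build a throwaway directory tree that demonstrates the expected layout:
#   <root>/cat/0.jpg
#   <root>/dog/0.jpg
with tempfile.TemporaryDirectory() as root:
    for class_name in ("dog", "cat"):
        os.makedirs(os.path.join(root, class_name))
        # Empty placeholder file standing in for an image.
        open(os.path.join(root, class_name, "0.jpg"), "wb").close()
    class_to_idx = find_class_folders(root)

print(class_to_idx)  # {'cat': 0, 'dog': 1}
```

Running the same helper on your own root_train and root_val should list the same class names for both; a mismatch usually means one of the paths is wrong.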