data.datasets.multi_modal_img_text package

Subpackages

Submodules

data.datasets.multi_modal_img_text.base_multi_modal_img_text module

class data.datasets.multi_modal_img_text.base_multi_modal_img_text.BaseMultiModalImgText(opts, *args, **kwargs)[source]

Bases: BaseImageDataset

Base class for Image-Text multi-modal learning

Parameters:

opts – command-line arguments

__init__(opts, *args, **kwargs) None[source]
get_zero_shot_dataset(*args, **kwargs) BaseZeroShotDataset | None[source]

If zero-shot evaluation is enabled, the zero-shot dataset is returned; otherwise, None is returned.

get_dataset(*args, **kwargs) Any[source]

Helper function to get the dataset. Child classes must override this function

share_dataset_arguments() Dict[str, Any][source]

Returns the number of classes in the dataset along with super-class arguments.

classmethod add_arguments(parser: ArgumentParser) ArgumentParser[source]

Add dataset-specific arguments to the parser.

get_zero_shot_pair(img_index: int) Tuple[Image, str | List[str] | List[List[str]], int][source]

Get image-text pair for zero-shot dataset along with classification label.

Parameters:

img_index – Image index

Returns:

A tuple of PIL image, captions, and class label

get_dataset_pair(img_index: int) Any[source]

Get image-text pair from the dataset. Sub-classes must implement this method.

extra_repr() str[source]

Extra information to be represented in __repr__. Each line in the output string should be prefixed with \t.
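The subclassing contract above (override get_dataset and get_dataset_pair; have extra_repr emit tab-prefixed lines) can be illustrated with a minimal sketch. BaseMultiModalImgText itself is not imported here, so a hypothetical stand-in base class is used, and all names below are illustrative:

```python
from typing import Any, Tuple


class StandInBase:
    """Stand-in for BaseMultiModalImgText: get_dataset and
    get_dataset_pair must be overridden by child classes."""

    def get_dataset(self, *args, **kwargs) -> Any:
        raise NotImplementedError("Child classes must override get_dataset")

    def get_dataset_pair(self, img_index: int) -> Any:
        raise NotImplementedError("Sub-classes must implement get_dataset_pair")

    def extra_repr(self) -> str:
        # Each line in the output should be prefixed with \t
        return "\tnum_samples=1"


class ToyImgTextDataset(StandInBase):
    """A toy subclass holding (image path, caption, label) triplets."""

    def __init__(self) -> None:
        self.samples = [("img_0.jpg", "a caption", -1)]

    def get_dataset(self, *args, **kwargs) -> Any:
        return self.samples

    def get_dataset_pair(self, img_index: int) -> Tuple[str, str, int]:
        # Returns (image, caption, class label); -1 when no label exists
        return self.samples[img_index]
```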

data.datasets.multi_modal_img_text.base_multi_modal_img_text.multi_modal_img_text_collate_fn(batch: List[Mapping[str, Tensor | Mapping[str, Tensor]]], opts: Namespace) Mapping[str, Tensor | Mapping[str, Tensor]][source]

Combines a list of dictionaries into a single dictionary by concatenating matching fields.
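A simplified sketch of that collation pattern is shown below. Plain Python lists stand in for tensors so the example stays dependency-free; the actual function concatenates torch tensors and recurses into nested mappings:

```python
from typing import Any, Dict, List, Mapping


def toy_collate_fn(batch: List[Mapping[str, Any]]) -> Dict[str, Any]:
    """Combine a list of per-sample dicts into one dict by gathering
    matching fields. The real multi_modal_img_text_collate_fn
    concatenates tensor fields instead of appending to lists."""
    collated: Dict[str, Any] = {}
    for sample in batch:
        for key, value in sample.items():
            collated.setdefault(key, []).append(value)
    return collated
```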

data.datasets.multi_modal_img_text.flickr module

class data.datasets.multi_modal_img_text.flickr.FlickrDataset(opts, *args, **kwargs)[source]

Bases: BaseMultiModalImgText

Dataset loader for Flickr-30k and Flickr-8k datasets.

For more info see:

http://hockenmaier.cs.illinois.edu/8k-pictures.html
https://shannon.cs.illinois.edu/DenotationGraph/

Splits: train, val, and test

Also known in the literature as the Karpathy splits: https://cs.stanford.edu/people/karpathy/deepimagesent/

Tracking license info:

Captions are under the CC BY 3.0 license (see links above). Splits are under the BSD license (see the GitHub repository of NeuralTalk by Karpathy et al.). Images are from Flickr; we do not own them, and they are used only for research purposes.

Parameters:
  • opts – command-line arguments

  • is_training (Optional[bool]) – A flag used to indicate training or validation mode. Default: True

  • is_evaluation (Optional[bool]) – A flag used to indicate evaluation (or inference) mode. Default: False

get_dataset(*args, **kwargs) None[source]

The data under self.root is expected to consist of:

dataset.json  # Karpathy splits + captions
images/       # Raw images

The metadata can be downloaded from:

https://cs.stanford.edu/people/karpathy/deepimagesent/flickr30k.zip

Images can be obtained from:

Flickr-8k: http://hockenmaier.cs.illinois.edu/8k-pictures.html
Flickr-30k: https://shannon.cs.illinois.edu/DenotationGraph/
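A short sketch of reading dataset.json is given below. The schema (each image entry carrying a filename, a split name, and a list of sentences) is an assumption based on the publicly released Karpathy-splits file, not taken from this library, and a tiny inline string stands in for the real file:

```python
import json
from collections import defaultdict

# Tiny inline stand-in for dataset.json (schema assumed from the
# public Karpathy-splits release linked above)
raw = json.loads(
    '{"images": ['
    '{"filename": "img_a.jpg", "split": "train",'
    ' "sentences": [{"raw": "a dog runs"}]},'
    '{"filename": "img_b.jpg", "split": "val",'
    ' "sentences": [{"raw": "a cat sits"}]}'
    ']}'
)

# Group (filename, captions) pairs by split, as a loader might
splits = defaultdict(list)
for entry in raw["images"]:
    captions = [s["raw"] for s in entry["sentences"]]
    splits[entry["split"]].append((entry["filename"], captions))
```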

data.datasets.multi_modal_img_text.img_text_tar_dataset module

data.datasets.multi_modal_img_text.img_text_tar_dataset.extract_content(tar_file: TarFile, file_name: str) AnyStr[source]

Extracts the content of a particular file inside a tar file and returns it.
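The helper can be sketched with the standard-library tarfile module. This is an illustrative reimplementation under assumed behavior (returning the member's raw bytes), not the library's own code; an in-memory archive is built to exercise it:

```python
import io
import tarfile


def extract_content(tar_file: tarfile.TarFile, file_name: str) -> bytes:
    """Read one member of an open tar file and return its raw bytes
    (a sketch of the documented helper)."""
    member = tar_file.extractfile(file_name)
    if member is None:
        raise FileNotFoundError(f"{file_name} not found in archive")
    return member.read()


# Build a small in-memory tar to exercise the helper
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    data = b"a photo of a dog"
    info = tarfile.TarInfo(name="00000000_0_text")
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tf:
    text = extract_content(tf, "00000000_0_text").decode("utf-8")
```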

data.datasets.multi_modal_img_text.img_text_tar_dataset.decode_image(byte_data) Image[source]

Reads the byte image data and returns the PIL image.

data.datasets.multi_modal_img_text.img_text_tar_dataset.decode_text(byte_data) str[source]

Reads the byte text data and returns the decoded string.

data.datasets.multi_modal_img_text.img_text_tar_dataset.async_download_file_from_s3(opts: Namespace, tar_file_name: str, cache_loc: str, *args, **kwargs) None[source]

Helper function to download the files asynchronously from S3.

Parameters:
  • opts – command-line arguments

  • tar_file_name – Name of the tar file

  • cache_loc – Caching location on the local machine

class data.datasets.multi_modal_img_text.img_text_tar_dataset.ImgTextTarDataset(opts, *args, **kwargs)[source]

Bases: BaseMultiModalImgText

ImgTextTarDataset class for datasets that store image-text pairs in tar files, with multiple pairs per tar file.

The dataset should be stored in the following format, where img_text_tar_dataset is the directory that contains all tar files.

img_text_tar_dataset
├── 00000000_0_1000.tar.gz
│   ├── 00000000_0_image
│   ├── 00000000_0_text
│   ├── 00000000_1_image
│   ├── 00000000_1_text
│   └── …
└── 00000000_1000_2000.tar.gz
    ├── 00000000_1000_image
    ├── 00000000_1000_text
    ├── 00000000_1001_image
    ├── 00000000_1001_text
    └── …

Parameters:

opts – An argparse.Namespace instance.

__init__(opts, *args, **kwargs) None[source]
get_dataset(*args, **kwargs) Dict[str, str][source]

Reads the metadata file and returns a mapping between the indices of the samples stored in each tar file and that tar file's name.
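A sketch of how such a mapping might be used is shown below. The '<prefix>_<start>_<end>.tar.gz' naming convention is inferred from the layout above and is an assumption, as is every name in the snippet:

```python
from typing import List, Tuple


def build_shard_index(tar_names: List[str]) -> List[Tuple[int, int, str]]:
    """Parse '<prefix>_<start>_<end>.tar.gz' names (convention assumed
    from the directory layout above) into (start, end, name) ranges."""
    index = []
    for name in tar_names:
        stem = name.split(".")[0]          # e.g. '00000000_0_1000'
        _, start, end = stem.rsplit("_", 2)
        index.append((int(start), int(end), name))
    return sorted(index)


def shard_for_index(index: List[Tuple[int, int, str]], img_index: int) -> str:
    """Return the tar file whose [start, end) range covers img_index."""
    for start, end, name in index:
        if start <= img_index < end:
            return name
    raise IndexError(f"no shard covers index {img_index}")
```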

classmethod add_arguments(parser: ArgumentParser) ArgumentParser[source]

Add dataset-specific arguments to the parser.

get_dataset_pair(img_index: int) Tuple[Image, str, int][source]

For a given image index, read the image file, the corresponding caption, and the class label. If no class label is present, -1 is returned as the label.

Module contents

data.datasets.multi_modal_img_text.arguments_multi_modal_img_text(parser: ArgumentParser) ArgumentParser[source]