loss_fn.multi_modal_img_text package

Submodules

loss_fn.multi_modal_img_text.base_multi_modal_img_text_criteria module

class loss_fn.multi_modal_img_text.base_multi_modal_img_text_criteria.BaseMultiModalImageTextCriteria(opts: Namespace, *args, **kwargs)[source]

Bases: BaseCriteria

Base class for defining multi-modal image-text loss functions. Subclasses must implement the forward function.

Parameters:

opts – command-line arguments

__init__(opts: Namespace, *args, **kwargs) None[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

classmethod add_arguments(parser: ArgumentParser) ArgumentParser[source]

Add criterion-specific arguments to the parser.
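As a rough sketch of the pattern above, a subclass typically extends add_arguments by registering its own flags on an argument group. The class name and the --loss.multi-modal.temperature flag below are hypothetical, for illustration only:

```python
import argparse

class HypotheticalCriteria:
    # Illustrative only: mirrors the classmethod pattern described above.
    # The flag name --loss.multi-modal.temperature is a made-up example.
    @classmethod
    def add_arguments(cls, parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        group = parser.add_argument_group(cls.__name__)
        group.add_argument(
            "--loss.multi-modal.temperature",
            type=float,
            default=0.07,
            help="Softmax temperature for the contrastive loss.",
        )
        return parser
```

Because add_arguments returns the parser, criteria can be chained when building the full command-line interface.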

loss_fn.multi_modal_img_text.contrastive_loss_clip module

class loss_fn.multi_modal_img_text.contrastive_loss_clip.ContrastiveLossClip(opts: Namespace, *args, **kwargs)[source]

Bases: BaseMultiModalImageTextCriteria

Compute contrastive loss between image and text pairs.

Parameters:

opts – command-line arguments

__init__(opts: Namespace, *args, **kwargs) None[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(input_sample: Any, prediction: Dict[str, Tensor], *args, **kwargs) Dict[source]

Computes contrastive loss between image and text representations, optionally with neural augmentation.

Parameters:
  • input_sample – Input to the model.

  • prediction – A mapping of the form (string: Tensor). image and text are mandatory keys.

Shape:

input_sample: This loss function ignores this argument.
prediction["image"]: Shape is [N, d]
prediction["text"]: Shape is [N, d]

where N is the local batch size and d is the feature dimension.

Returns:

The output dictionary contains four keys (total_loss, image_loss, text_loss, logit_scale), each mapping to a scalar value. total_loss is the sum of image_loss and text_loss.
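The computation described above can be sketched as a standalone function. This is a minimal CLIP-style sketch, not the package's actual implementation: features are L2-normalized, an [N, N] similarity matrix is scaled by logit_scale, and cross-entropy is applied in both directions with the diagonal as the target:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, logit_scale):
    # Normalize both modalities to unit length so the dot product
    # is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # [N, N] similarity logits; entry (i, j) compares image i with text j,
    # so the diagonal holds the matching pairs.
    logits_per_image = logit_scale * image_feats @ text_feats.t()
    logits_per_text = logits_per_image.t()

    # The i-th image should match the i-th text and vice versa.
    targets = torch.arange(image_feats.shape[0])
    image_loss = F.cross_entropy(logits_per_image, targets)
    text_loss = F.cross_entropy(logits_per_text, targets)

    return {
        "total_loss": image_loss + text_loss,
        "image_loss": image_loss,
        "text_loss": text_loss,
        "logit_scale": logit_scale,
    }
```

Following the docstring above, total_loss is the sum of the two directional losses; some CLIP variants average them instead.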

loss_fn.multi_modal_img_text.contrastive_loss_clip.gather_features(image_features: Tensor, text_features: Tensor, use_distributed: bool) Tuple[Tensor, Tensor][source]

Helper function that gathers image and text features from all DDP ranks in a differentiable manner.

Parameters:
  • image_features – Image features

  • text_features – Text features

use_distributed – Whether DDP training is enabled

Shapes:

image_features: Shape is [N, d]
text_features: Shape is [N, d]

where N is the local batch size and d is the feature dimension.

Returns:

A tuple of gathered image and text features across all GPUs. For a DDP task, each returned tensor has shape [G, d], where G = N * W is the effective batch size and W is the world size (the total number of GPUs).
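A common way to implement such a helper (a sketch, not necessarily this package's exact code) is to all_gather from every rank and then re-insert the local tensor, since torch.distributed.all_gather does not propagate gradients to the gathered copies:

```python
import torch
import torch.distributed as dist

def gather_features(image_features, text_features, use_distributed):
    # Without DDP, gathering is a no-op: return the local features as-is.
    if not use_distributed:
        return image_features, text_features

    world_size = dist.get_world_size()
    gathered_image = [torch.zeros_like(image_features) for _ in range(world_size)]
    gathered_text = [torch.zeros_like(text_features) for _ in range(world_size)]
    dist.all_gather(gathered_image, image_features)
    dist.all_gather(gathered_text, text_features)

    # all_gather returns detached copies; swap this rank's slot back to the
    # original tensor so gradients flow through the local features.
    rank = dist.get_rank()
    gathered_image[rank] = image_features
    gathered_text[rank] = text_features

    # Concatenate along the batch dimension: [N, d] per rank -> [N * W, d].
    return torch.cat(gathered_image, dim=0), torch.cat(gathered_text, dim=0)
```

The distributed branch requires an initialized process group, so only the single-process path runs outside a DDP job.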

Module contents