loss_fn.multi_modal_img_text package
Submodules
loss_fn.multi_modal_img_text.base_multi_modal_img_text_criteria module
- class loss_fn.multi_modal_img_text.base_multi_modal_img_text_criteria.BaseMultiModalImageTextCriteria(opts: Namespace, *args, **kwargs)[source]
Bases: BaseCriteria
Base class for defining multi-modal image-text loss functions. Sub-classes must implement the forward function; a minimal subclass sketch is shown after this entry.
- Parameters:
opts – command-line arguments
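A minimal subclass sketch, assuming a hypothetical DummyImageTextCriteria and a placeholder loss; nn.Module stands in for the real base class so the snippet stays self-contained:

```python
import argparse
from typing import Any, Dict

from torch import Tensor, nn


class DummyImageTextCriteria(nn.Module):
    """Illustrative sketch of a BaseMultiModalImageTextCriteria-style subclass."""

    def __init__(self, opts: argparse.Namespace, *args, **kwargs) -> None:
        super().__init__()
        self.opts = opts

    def forward(
        self, input_sample: Any, prediction: Dict[str, Tensor], *args, **kwargs
    ) -> Dict[str, Tensor]:
        image = prediction["image"]  # [N, d]
        text = prediction["text"]    # [N, d]
        # Placeholder objective: mean squared distance between the two modalities.
        loss = ((image - text) ** 2).mean()
        return {"total_loss": loss}
```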
loss_fn.multi_modal_img_text.contrastive_loss_clip module
- class loss_fn.multi_modal_img_text.contrastive_loss_clip.ContrastiveLossClip(opts: Namespace, *args, **kwargs)[source]
Bases: BaseMultiModalImageTextCriteria
Compute contrastive loss between image and text pairs.
- Parameters:
opts – command-line arguments
- __init__(opts: Namespace, *args, **kwargs) None [source]
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(input_sample: Any, prediction: Dict[str, Tensor], *args, **kwargs) Dict [source]
Computes contrastive loss between image and text representations, optionally with neural augmentation. A simplified sketch of the computation is shown after this entry.
- Parameters:
input_sample – Input to the model.
prediction – A mapping of the form (string: Tensor). image and text are mandatory keys.
- Shape:
input_sample: This argument is ignored by this loss function.
prediction["image"]: Shape is [N, d]
prediction["text"]: Shape is [N, d]
where N is the local batch size and d is the feature dimension.
- Returns:
The output dictionary contains four keys (total_loss, image_loss, text_loss, logit_scale), each mapping to a scalar value. total_loss is the sum of image_loss and text_loss.
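A simplified CLIP-style sketch of this computation; the function name and the assumption that the features are already L2-normalized are mine, and total_loss is taken as the plain sum of the two directional losses as described above:

```python
from typing import Dict

import torch
import torch.nn.functional as F
from torch import Tensor


def clip_style_contrastive_loss(
    image_features: Tensor,  # [N, d], assumed L2-normalized
    text_features: Tensor,   # [N, d], assumed L2-normalized
    logit_scale: Tensor,     # scalar temperature (e.g., exp of a learnable parameter)
) -> Dict[str, Tensor]:
    # Pairwise similarities between every image and every text in the batch.
    logits_per_image = logit_scale * image_features @ text_features.t()  # [N, N]
    logits_per_text = logits_per_image.t()

    # The i-th image matches the i-th text, so the target class is the row index.
    targets = torch.arange(image_features.shape[0], device=image_features.device)

    image_loss = F.cross_entropy(logits_per_image, targets)
    text_loss = F.cross_entropy(logits_per_text, targets)

    return {
        "total_loss": image_loss + text_loss,  # sum of the two directional losses
        "image_loss": image_loss,
        "text_loss": text_loss,
        "logit_scale": logit_scale,
    }
```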
- loss_fn.multi_modal_img_text.contrastive_loss_clip.gather_features(image_features: Tensor, text_features: Tensor, use_distributed: bool) Tuple[Tensor, Tensor] [source]
Helper function that gathers image and text features from all DDP ranks in a differentiable manner; a sketch of this pattern is shown after this entry.
- Parameters:
image_features – Image features
text_features – Text features
use_distributed – Whether DDP (distributed) training is enabled
- Shapes:
image_features: Shape is [N, d]
text_features: Shape is [N, d]
where N is the local batch size and d is the feature dimension.
- Returns:
A tuple of image and text features gathered across all GPUs. For a DDP task, each gathered tensor has shape [G, d], where G = N * W is the effective batch size and W is the world size (total number of GPUs).
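A sketch of the usual differentiable all-gather pattern; the function name and zero-initialized gather buffers are assumptions, not the library's code. The key idea is re-inserting the local tensor at this rank's index so gradients flow back to the local shard:

```python
from typing import Tuple

import torch
import torch.distributed as dist
from torch import Tensor


def gather_features_sketch(
    image_features: Tensor, text_features: Tensor, use_distributed: bool
) -> Tuple[Tensor, Tensor]:
    if not use_distributed:
        return image_features, text_features

    world_size = dist.get_world_size()
    rank = dist.get_rank()

    gathered_image = [torch.zeros_like(image_features) for _ in range(world_size)]
    gathered_text = [torch.zeros_like(text_features) for _ in range(world_size)]
    dist.all_gather(gathered_image, image_features)
    dist.all_gather(gathered_text, text_features)

    # all_gather does not propagate gradients, so keep the local tensors
    # (which carry autograd history) at this rank's position.
    gathered_image[rank] = image_features
    gathered_text[rank] = text_features

    # [G, d] with G = N * world_size.
    return torch.cat(gathered_image, dim=0), torch.cat(gathered_text, dim=0)
```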