Note
This page was generated from a Jupyter notebook. The original can be downloaded from here.
DNIKit Familiarity: Dataset Distribution#
Compare the distributions of two datasets, e.g. train/test datasets, synthetic/real datasets, etc.
Please see the doc page for a discussion on applying Familiarity to dataset distribution analysis, including what actions can be taken to improve the dataset.
For a more detailed guide on using all of these DNIKit components, try the Familiarity for Rare Data Discovery Notebook.
[1]:
# Fix the NumPy random seed for reproducibility; skip this cell if stochasticity is desired
import numpy as np
np.random.seed(42)
Optional: Download MobileNet and CIFAR-10#
This example uses MobileNet (trained on ImageNet) and CIFAR-10, but feel free to use any other model and dataset. This notebook uses TFModelExamples and TFDatasetExamples to load in MobileNet and CIFAR-10. Please see the DNIKit docs for information about how to load a model or dataset. This page also describes how responses can be collected outside of DNIKit, and passed into Familiarity via a Producer.
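For instance, a minimal sketch of such a Producer might look like the following, assuming responses were already computed elsewhere. The response name "conv_pw_13", the placeholder array, and the exact Batch construction are assumptions to verify against the DNIKit API reference.
import numpy as np
from dnikit.base import Batch

# Placeholder for responses gathered outside of DNIKit (assumed shape: (N, D))
precomputed_responses = np.random.rand(100, 1024).astype(np.float32)

def precomputed_producer(batch_size: int):
    # A Producer is a callable that yields Batch objects of (up to) the requested size
    for start in range(0, len(precomputed_responses), batch_size):
        yield Batch({"conv_pw_13": precomputed_responses[start:start + batch_size]})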
[2]:
##########################
# User-Defined Variables #
##########################
# Change the following labels to compare the distributions of different classes.
# This example compares the train and test distributions of the
# 'automobile' class, using 100 images from each split.
TRAIN_CLASS_LABEL = 'automobile'
TEST_CLASS_LABEL = 'automobile'
N_SAMPLES = 100
[3]:
from dnikit.processors import ImageResizer, SnapshotSaver
from dnikit.base import Batch, PixelFormat, pipeline, ImageFormat
from dnikit_tensorflow import TFDatasetExamples, TFModelExamples
# Load CIFAR10 dataset and feed into MobileNet,
# observing responses from layer conv_pw_13
mobilenet = TFModelExamples.MobileNet()
mobilenet_preprocessor = mobilenet.preprocessing
assert mobilenet_preprocessor is not None
# Load CIFAR-10 with train and test datasets, and
# attach metadata (labels, dataset origins, image filepaths) to each batch
cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)
# Create pre-processing pipeline
preprocessing_stages = (
# Save a snapshot of the raw image data to refer back to later
SnapshotSaver(),
# Preprocess the image batches in the manner expected by MobileNet
mobilenet_preprocessor,
# Resize images to MobileNet's expected input size of (224, 224)
ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),
)
# Create producers for subsets of the dataset for comparing train / test distribution
# Note: the subset method filters batches on their attached metadata (here, labels and dataset origin)
data_producers = {
'train': cifar10.subset(labels=[TRAIN_CLASS_LABEL], datasets=["train"], max_samples=N_SAMPLES),
'test': cifar10.subset(labels=[TEST_CLASS_LABEL], datasets=["test"], max_samples=N_SAMPLES),
}
2023-08-03 12:41:13.227858: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2023-08-03 12:41:13.227907: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-08-03 12:41:13.227913: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2023-08-03 12:41:13.227982: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-08-03 12:41:13.228013: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
/private/tmp/dnikit-2.0.0/lib/python3.9/site-packages/keras/src/engine/training.py:3000: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.
saving_api.save_model(
Put it all together to produce familiarity scores#
For a more detailed breakdown of these steps, see the Familiarity for Rare Data Discovery Notebook.
A. Define user variables#
The user variables defined above can be modified to experiment with different classes or datasets.
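For example, to check how familiar test-set trucks look relative to train-set automobiles, the variables could be set as follows (any other pair of CIFAR-10 class names would work as well):
TRAIN_CLASS_LABEL = 'automobile'
TEST_CLASS_LABEL = 'truck'
N_SAMPLES = 100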
B. Create producers#
[4]:
from dnikit.processors import Cacher, Pooler
producers = {
split: pipeline(
data_producers[split],
# Apply the previously defined preprocessing stages for MobileNet & CIFAR-10
*preprocessing_stages,
# Run inference -- pass a single response name or a list of requested responses
mobilenet.model('conv_pw_13'),
# Perform spatial max pooling on the responses
Pooler(dim=(1, 2), method=Pooler.Method.MAX),
# Cache results to re-run the pipeline later without recomputing the responses
Cacher()
)
for split in ('train', 'test')
}
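As the comment above notes, the model stage also accepts a list of response names. A hypothetical variant (not used in this notebook) could request several layers at once; the extra layer name 'conv_pw_12' is an assumption about the MobileNet graph:
multi_response_stage = mobilenet.model(['conv_pw_12', 'conv_pw_13'])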
Reduce dimensionality of responses#
[5]:
from dnikit.introspectors import DimensionReduction
# Configure the DimensionReduction Introspector
# The dimensionality of the data will be reduced from 1024 to 40
n_dim = 40
# Trigger the pipeline & fit the PCA model on the train dataset, which will be used as the base
pca = DimensionReduction.introspect(producers["train"], strategies=DimensionReduction.Strategy.PCA(n_dim))
# Apply the PipelineStage pca object to both train/test pipelines to reduce responses in all batches to a lower dimension
reduced_producers = {
name: pipeline(producer, pca)
for name, producer in producers.items()
}
Build Familiarity model on the train data and score train & test#
[6]:
from dnikit.introspectors import Familiarity
# The Familiarity model is first fit on the base dataset, which is "train" in this case
# Trigger pipeline & run DNIKit Familiarity, default strategy is Familiarity.Strategy.GMM
familiarity = Familiarity.introspect(reduced_producers['train'])
# Use dict-comprehension to apply familiarity to the train and test datasets individually
scored_producers = {
producer_name: pipeline(
cached_response_producer,
familiarity
)
# reduced_producers maps 'train'/'test' to the split's reduced producer
for producer_name, cached_response_producer in reduced_producers.items()
}
Compute familiarity likelihood score#
Produce the final familiarity likelihood score.
If the likelihood score is close to 0, both distributions are equivalent.
Typically, the train dataset’s mean log score will be larger than the test dataset’s, since familiarity was fit to this first (train) dataset, so the likelihood score is usually negative or near zero. The more negative the overall likelihood score, the larger the distribution gap, and one of the datasets may need to be re-collected.
It may still happen that the likelihood score is greater than 0. This also indicates a distribution gap, and warrants analysis and possibly data re-collection.
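As a purely illustrative example with invented numbers: if the train split’s mean log score were -45.0 and the test split’s were -52.5, the likelihood score would be -52.5 - (-45.0) = -7.5, signaling that the test data is noticeably less familiar under the model fit on the train split.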
Please refer to the doc page for more information, and check out the other Familiarity use case, discovering rare samples, or the DatasetReport to evaluate why there is a distribution gap.
[7]:
from dnikit.base import Producer
def compute_score_mean(producer: Producer, response_name: str, meta_key: Batch.DictMetaKey) -> float:
""" Compute mean of score, for given metadata key, response name, and producer """
scores = [
batch.metadata[meta_key][response_name][index].score
for batch in producer(32)
for index in range(batch.batch_size)
]
return float(np.mean(scores))
# Trigger remaining pipeline, compute mean of familiarity scores for both train and test datasets
stats = {
producer_name: compute_score_mean(
producer=producer,
response_name="conv_pw_13",
meta_key=familiarity.meta_key
)
# scored_producers maps 'train'/'test' to the split's scored producer
for producer_name, producer in scored_producers.items()
}
# The likelihood score: difference of mean log scores (a mean log-likelihood ratio of test vs. train)
familiarity_ratio = stats['test'] - stats['train']
print(f"Likelihood ratio [{TRAIN_CLASS_LABEL}]->[{TEST_CLASS_LABEL}] = {familiarity_ratio:0.4f}")
Likelihood ratio [automobile]->[automobile] = 12.3229