Dataset Report#

Explore a dataset to find rare data samples, duplicate data, annotation errors, or dataset bias. The DatasetReport is a combination of three DNIKit dataset introspection algorithms:

To explore the dataset in an interactive UI, the Dataset Report results can be fed directly into Symphony, a research platform for creating interactive data science components that allows for filtering, sorting, and exporting data samples.

For motivation behind the Dataset Report, see Description below.

General Usage#

For getting started with DNIKit code, please see the how-to pages.

Assuming a pipeline is set up to produce responses from a model, the DatasetReport can be run as so:

from dnikit.introspectors import DatasetReport

producer = ...  # pipeline setup here

# Run DatasetReport on responses from a producer
report = DatasetReport.introspect(producer, batch_size=128)

Introspection is typically performed on intermediate model responses (rather than the final outputs of a network). Here’s a full example using the CIFAR10 dataset, which uses the outputs of the last convolution layer conv_pw_13 from a MobileNet model to run the analysis:

from dnikit.introspectors import DatasetReport
from dnikit_tensorflow import TFDatasetExamples, TFModelExamples
from dnikit.processors import Cacher, ImageResizer
from dnikit.base import pipeline

# Load CIFAR10 dataset and feed into MobileNet,
# observing responses from layer conv_pw_13
cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)
mobilenet = TFModelExamples.MobileNet()
producer = pipeline(
   cifar10,
   ImageResizer(pixel_format=ImageResizer.Format.HWC, size=(224, 224)),
   mobilenet(requested_responses=['conv_pw_13']),
   Pooler(dim=(1, 2), method=Pooler.Method.MAX),
   Cacher()
)

# Run DatasetReport on intermediate layer conv_pw_13's responses to the data:
report = DatasetReport.introspect(producer)

Visualization#

Exploring with Symphony#

DNIKit’s DatasetReport can also connect with the Symphony UI framework to explore a dataset in a web browser or in a jupyter notebook. Please see Symphony’s documentation for an example of how to feed the output of DatasetReport.introspect directly into Symphony. These reports created with Symphony are interactive and shareable.

Warning

The current release of Symphony operates only on images, audio, and tabular data. To visualize other data types, it’s possible to run the DNIKit side of the DatasetReport on any dataset type and visualize in a custom manner.

pip install "dnikit[dataset-report]"

Exploring as Pandas DataFrame#

The resulting Dataset Report object has a property, data, that is a Pandas DataFrame of all DatasetReport results. Each row represents a data sample, and each column is report data, e.g., duplicate set.

report.data

These results can be visualized in a custom manner, but it’s recommended to try Symphony for image, audio, or tabular data.

Saving and Loading#

To save the report, call to_disk() on the report object. To load a saved report, use DatasetReport.from_disk(filepath).

Config Options#

Dataset Report’s introspect method has a parameter config that accepts a ReportConfig object. The config can be used to run only a subset of introspectors. For instance, to run only duplicates analysis:

from dnikit.introspectors import DatasetReport, ReportConfig

config = ReportConfig(
    projection=None,
    familiarity=None
)

The strategies used in the underlying algorithms can also be modified via the config. See ReportConfig in the API docs for more details.

Description#

Automated dataset diversity analysis often looks inter-class diversity, i.e. diversity across classes, as defined by metadata labels. Known methods include grouping data by label and performing various statistical analyses to see how well the number of data samples or model accuracy is distributed across these different labels.

Intra-class diversity, like fairness within a particular class label, is also important, yet harder to evaluate in an automated fashion. Intra-class diversity analysis is often manual, which doesn’t scale to large datasets. Manual analysis can also make it harder to communicate findings with team members or partners. Because of these problems, sometimes intra-class diversity analysis is skipped altogether.

The Dataset Report aims to automate and simplify the process of analyzing datasets for both inter and intra-class diversity, in a manner that enables sharing and exploration. With the Symphony framework, it’s possible to build a standalone static report or explore results live in a Jupyter notebook. Symphony also contains a centralized filtering, grouping, highlighting, and selection across all widgets, to form a cohesive workspace for dataset exploration.

Example#

A Jupyter notebook that demonstrates how to run the Dataset Report on the CIFAR-10 dataset:

Relevant API#