Dataset Report#
Explore a dataset to find rare data samples, duplicate data, annotation errors,
or dataset bias. The DatasetReport is a combination of three DNIKit dataset introspection algorithms: familiarity, duplicates, and projection (dimension reduction).
To explore the dataset in an interactive UI, the Dataset Report results can be fed directly into Symphony, a research platform for creating interactive data science components that allows for filtering, sorting, and exporting data samples.
For motivation behind the Dataset Report, see Description below.
General Usage#
For getting started with DNIKit code, please see the how-to pages.
Assuming a pipeline is set up to produce responses from a model, the DatasetReport can be run as follows:
from dnikit.introspectors import DatasetReport
producer = ... # pipeline setup here
# Run DatasetReport on responses from a producer
report = DatasetReport.introspect(producer, batch_size=128)
Introspection is typically performed on intermediate model responses
(rather than the final outputs of a network).
Here’s a full example that runs the analysis on the CIFAR10 dataset, using the outputs of the last convolution layer, conv_pw_13, of a MobileNet model:
from dnikit.introspectors import DatasetReport
from dnikit_tensorflow import TFDatasetExamples, TFModelExamples
from dnikit.processors import Cacher, ImageResizer, Pooler
from dnikit.base import pipeline
# Load CIFAR10 dataset and feed into MobileNet,
# observing responses from layer conv_pw_13
cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)
mobilenet = TFModelExamples.MobileNet()
producer = pipeline(
    cifar10,
    ImageResizer(pixel_format=ImageResizer.Format.HWC, size=(224, 224)),
    mobilenet(requested_responses=['conv_pw_13']),
    Pooler(dim=(1, 2), method=Pooler.Method.MAX),
    Cacher()
)
# Run DatasetReport on intermediate layer conv_pw_13's responses to the data:
report = DatasetReport.introspect(producer)
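As a rough illustration of what the Pooler step in the pipeline above does before the report is computed, the NumPy sketch below max-pools a stand-in batch of responses over its two spatial axes. It assumes batch-first responses of shape (batch, height, width, channels) and a 7x7x1024 feature map for conv_pw_13 at 224x224 input; the array is random data, not real model output, and the code does not use DNIKit itself.

```python
import numpy as np

# Stand-in batch of conv_pw_13 responses: (batch, height, width, channels).
# Random values are used here purely for illustration.
rng = np.random.default_rng(0)
responses = rng.random((4, 7, 7, 1024))

# Max over the two spatial axes (1 and 2), analogous to
# Pooler(dim=(1, 2), method=Pooler.Method.MAX) in the pipeline above.
pooled = responses.max(axis=(1, 2))

# Each sample is now a single 1024-dimensional vector.
print(responses.shape, "->", pooled.shape)
```

Pooling this way leaves one compact vector per data sample, which is typically easier to compare across samples than a full spatial feature map.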
Visualization#
Exploring with Symphony#
DNIKit’s DatasetReport can also connect with the Symphony UI framework to explore a dataset in a web browser or in a Jupyter notebook. Please see Symphony’s documentation for an example of how to feed the output of DatasetReport.introspect directly into Symphony.
These reports created with Symphony are interactive and shareable.
Warning
The current release of Symphony operates only on images, audio, and tabular data. To visualize other data types, it’s possible to run the DNIKit side of the DatasetReport on any dataset type and visualize in a custom manner.
To install DNIKit with the Dataset Report extras:
pip install "dnikit[dataset-report]"
Exploring as Pandas DataFrame#
The resulting DatasetReport object has a property, data, that is a Pandas DataFrame of all DatasetReport results. Each row represents a data sample, and each column is report data, e.g., duplicate set.
report.data
These results can be visualized in a custom manner, but it’s recommended to try Symphony for image, audio, or tabular data.
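A minimal sketch of one such custom exploration with plain pandas is shown below. It uses a small stand-in DataFrame rather than a real report, and the column names (`id`, `label`, `duplicate_set`) and the use of -1 for "no duplicate set" are hypothetical; inspect `report.data.columns` to find the actual names before adapting it.

```python
import pandas as pd

# Stand-in for `report.data`. Column names and the -1 convention are
# hypothetical and chosen only for this illustration.
data = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "label": ["cat", "cat", "dog", "dog", "dog"],
    "duplicate_set": [-1, 0, 0, 1, 1],
})

# Keep only the samples that were assigned to some duplicate set.
in_a_set = data[data["duplicate_set"] >= 0]

# Number of samples in each duplicate set.
set_sizes = in_a_set.groupby("duplicate_set").size()

# Duplicate sets whose members carry more than one label: these are
# candidates for annotation errors worth reviewing by hand.
labels_per_set = in_a_set.groupby("duplicate_set")["label"].nunique()
mixed_label_sets = labels_per_set[labels_per_set > 1]

print(set_sizes)
print(mixed_label_sets)
```

Grouping by duplicate set and then counting distinct labels is one simple way to surface near-identical samples that were annotated differently.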
Saving and Loading#
To save the report, call to_disk() on the report object. To load a saved report, use DatasetReport.from_disk(filepath).
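The save and load calls can be wrapped in a small round-trip helper, sketched below. The helper name `save_and_reload` is hypothetical, and the sketch assumes that to_disk accepts a file path argument mirroring from_disk(filepath); check the API docs before relying on that signature.

```python
from pathlib import Path


def save_and_reload(report, report_cls, filepath):
    """Save `report` to `filepath`, then load it back via `report_cls.from_disk`.

    Assumes `report.to_disk` accepts a file path (an assumption, see above).
    """
    path = Path(filepath)
    # Make sure the target directory exists before saving.
    path.parent.mkdir(parents=True, exist_ok=True)
    report.to_disk(str(path))
    return report_cls.from_disk(str(path))
```

With a real report this might be called as `restored = save_and_reload(report, DatasetReport, "reports/cifar10_report.pkl")`, where the path is only an example.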
Config Options#
Dataset Report’s introspect method has a parameter config that accepts a ReportConfig object. The config can be used to run only a subset of introspectors. For instance, to run only duplicates analysis:
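from dnikit.introspectors import DatasetReport, ReportConfig
# Disable projection and familiarity so that only duplicates analysis runs
config = ReportConfig(
    projection=None,
    familiarity=None
)
# Pass the config to introspect (producer is set up as in General Usage)
report = DatasetReport.introspect(producer, config=config)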
The strategies used in the underlying algorithms can also be modified via the config. See ReportConfig in the API docs for more details.
Description#
Automated dataset diversity analysis often looks at inter-class diversity, i.e., diversity across classes, as defined by metadata labels. Known methods include grouping data by label and performing various statistical analyses to see how well the number of data samples or model accuracy is distributed across these different labels.
Intra-class diversity, like fairness within a particular class label, is also important, yet harder to evaluate in an automated fashion. Intra-class diversity analysis is often manual, which doesn’t scale to large datasets. Manual analysis can also make it harder to communicate findings with team members or partners. Because of these problems, sometimes intra-class diversity analysis is skipped altogether.
The Dataset Report aims to automate and simplify the process of analyzing datasets for both inter- and intra-class diversity, in a manner that enables sharing and exploration. With the Symphony framework, it’s possible to build a standalone static report or explore results live in a Jupyter notebook. Symphony also provides centralized filtering, grouping, highlighting, and selection across all widgets, to form a cohesive workspace for dataset exploration.
Example#
A Jupyter notebook that demonstrates how to run the Dataset Report on the CIFAR-10 dataset: