Dimension Reduction#
DNIKit provides a DimensionReduction introspector with a variety of strategies (algorithms).
DimensionReduction has two primary uses: (1) reducing high-dimensional data to fewer dimensions for consumption by a different Introspector, and (2) reducing data to 2D or 3D for visualization (e.g. Dataset Report).
Model responses often have a very large number of dimensions, but some algorithms work better on lower-dimensional data.
For example, Familiarity and even the DimensionReduction Strategies other than PCA work better on e.g. 40-dimensional data.
Some of these algorithms note that reducing the dimensions first also helps reduce the noise in very high-dimensional data.
PCA (Principal Component Analysis) is a great strategy to perform this reduction.
The Dimension Reduction Example Notebook below gives an example of reducing high-dimensional data for use with various DimensionReduction strategies.
DimensionReduction to 2D is also a nice way to visualize the clusters and relationships in the data. UMAP, PaCMAP and t-SNE are all algorithms that are well suited to this task. The notebook below also shows examples of doing this.
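To make the PCA step concrete, here is a minimal NumPy sketch of what a 1024 -> 40 reduction involves. It is not DNIKit's implementation: the helper names `pca_fit` and `pca_project` are hypothetical, and the random array only stands in for real model responses.

```python
import numpy as np


def pca_fit(data: np.ndarray, n_components: int):
    """Fit PCA: return the feature mean and the top principal axes."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def pca_project(data: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project data onto previously fitted principal axes."""
    return (data - mean) @ components.T


rng = np.random.default_rng(0)
# Synthetic stand-in for model responses: 200 samples, 1024 dimensions each.
responses = rng.normal(size=(200, 1024))

# Fit once on the available data ...
mean, components = pca_fit(responses, n_components=40)
reduced = pca_project(responses, mean, components)

# ... then reuse the same fit to project data that was not seen during fitting.
new_batch = rng.normal(size=(8, 1024))
reduced_batch = pca_project(new_batch, mean, components)

print(reduced.shape, reduced_batch.shape)  # (200, 40) (8, 40)
```

The fit-then-project split mirrors the idea of a reducer that is fit to input data once and can then project any further data to the same lower-dimensional space.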
General Usage#
For getting started with DNIKit code, please see the how-to pages.
from dnikit.base import pipeline
from dnikit.introspectors import DimensionReduction

# a source of embeddings (typically high dimensional data)
response_producer = pipeline(...)

# first, create a dimension reduction `PipelineStage` object (`reducer`, here) that is fit
# to the input data and will be able to project any data to a lower number of dimensions
reducer = DimensionReduction.introspect(
    response_producer,
    strategies=DimensionReduction.Strategy.PCA(40)
)

# Next, chain the reducer PipelineStage into a new pipeline that will reduce all output data
# from `response_producer` into 40 dimensions
reduced_producer = pipeline(response_producer, reducer)
See the example notebook below for more detailed usage.
Config Options#
DNIKit comes with four Strategies for performing dimension reduction, each with its own advantages and disadvantages:
PCA: very fast and good for reducing e.g. 1024 -> 40 dimensions; memory efficient; not suitable for 2D projection
UMAP: excellent 2D projections; preserves local but not global structure
PaCMAP: excellent 2D projections; preserves local and global structure
TSNE (t-SNE): largely replaced by newer strategies
For a more in-depth comparison, please see the example notebook below.
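The trade-offs above can be summarized as a small decision helper. This is only a sketch: `plan_reduction` is a hypothetical function, not part of DNIKit, and it returns plain strategy names rather than DNIKit objects. Choosing PaCMAP over UMAP for the visualization step is an assumption made here for illustration; either fits the guidance above.

```python
from typing import List, Tuple


def plan_reduction(input_dims: int, target_dims: int, pca_dims: int = 40) -> List[Tuple[str, int]]:
    """Return a list of (strategy name, output dimensions) steps."""
    if target_dims >= input_dims:
        raise ValueError("target_dims must be smaller than input_dims")
    if target_dims > 3:
        # Not a visualization target: PCA alone is fast and memory efficient.
        return [("PCA", target_dims)]
    steps: List[Tuple[str, int]] = []
    if input_dims > pca_dims:
        # The non-PCA strategies tend to work better on e.g. 40-dimensional
        # input, so reduce with PCA first.
        steps.append(("PCA", pca_dims))
    # Assumption: PaCMAP for the 2D/3D projection; UMAP would also be reasonable.
    steps.append(("PaCMAP", target_dims))
    return steps


print(plan_reduction(1024, 40))  # [('PCA', 40)]
print(plan_reduction(1024, 2))   # [('PCA', 40), ('PaCMAP', 2)]
```

Each step in the returned plan would correspond to one fitted reducer chained into the pipeline, in order.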
Relevant API#
DimensionReduction: introspector for Dimension Reduction