Dimension Reduction

DNIKit provides a DimensionReduction introspector with a variety of strategies (algorithms). DimensionReduction has two primary uses:

  • reduce high-dimensional data to a lower number of dimensions for consumption by a different Introspector

  • reduce data to 2D or 3D for visualization (e.g. Dataset Report)

Model responses often have a very large number of dimensions, but some algorithms work better on lower-dimensional data. For example, Familiarity, and even the DimensionReduction strategies other than PCA, work better on data with around 40 dimensions. Some of these algorithms note that reducing the dimensionality first also helps reduce noise in very high-dimensional data. PCA (Principal Component Analysis) is a good strategy for performing this reduction.
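To make the PCA step concrete, here is a minimal NumPy sketch of what reducing 1024-dimensional data to 40 dimensions involves. It is not DNIKit's implementation: the function name `pca_reduce` and the random stand-in data are assumptions made for illustration only.

```python
import numpy as np


def pca_reduce(data: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of `data` onto the top `n_components` principal components."""
    # PCA works on mean-centered data.
    centered = data - data.mean(axis=0)
    # The rows of `vt` are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
responses = rng.normal(size=(200, 1024))  # hypothetical stand-in for model responses
reduced = pca_reduce(responses, 40)
print(reduced.shape)  # (200, 40)
```

The reduced array keeps one row per input sample, so it can be handed to any downstream algorithm that expects one vector per sample.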

The Dimension Reduction Example Notebook below gives an example of reducing high-dimensional data for use with various DimensionReduction strategies.

Reducing data to 2D is also a good way to visualize clusters and relationships in the data. UMAP, PaCMAP, and t-SNE are all well suited to this task. The notebook below also shows examples of this.

General Usage

For getting started with DNIKit code, please see the how-to pages.

    from dnikit.base import pipeline
    from dnikit.introspectors import DimensionReduction

    # A source of embeddings (typically high-dimensional data)
    response_producer = pipeline(...)

    # First, create a dimension reduction `PipelineStage` object (`reducer`, here) that is fit
    #    to the input data and will be able to project any data to a lower number of dimensions
    reducer = DimensionReduction.introspect(
        response_producer,
        strategies=DimensionReduction.Strategy.PCA(40)
    )

    # Next, chain the reducer PipelineStage into a new pipeline that will reduce all output data
    #    from `response_producer` into 40 dimensions
    reduced_producer = pipeline(response_producer, reducer)

See the example notebook below for more detailed usage.
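The fit-once, project-many pattern described in the comments above can be sketched independently of DNIKit. The class below is a hypothetical stand-in (`LinearReducer` is not a DNIKit type) that fits a PCA projection on one dataset and then applies it batch by batch, which is roughly the role a fitted reducer plays when chained into a pipeline.

```python
import numpy as np


class LinearReducer:
    """Hypothetical stand-in for a fitted reducer; not part of DNIKit."""

    def __init__(self, mean: np.ndarray, components: np.ndarray) -> None:
        self.mean = mean              # shape: (n_features,)
        self.components = components  # shape: (n_components, n_features)

    @classmethod
    def fit(cls, data: np.ndarray, n_components: int) -> "LinearReducer":
        # Learn the mean and principal directions from the fitting data.
        mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
        return cls(mean, vt[:n_components])

    def project(self, batch: np.ndarray) -> np.ndarray:
        # Apply the already-fitted projection to any batch with the same feature count.
        return (batch - self.mean) @ self.components.T


rng = np.random.default_rng(0)
training = rng.normal(size=(300, 256))  # synthetic data used to fit the reducer
reducer = LinearReducer.fit(training, 40)

# Project the data in batches of 100 rows, as a pipeline stage would.
batches = [reducer.project(training[i:i + 100]) for i in range(0, 300, 100)]
unseen = rng.normal(size=(5, 256))      # data the reducer was not fit on
print(reducer.project(unseen).shape)    # (5, 40)
```

Because the projection is fixed after fitting, projecting in batches gives the same result as projecting the whole dataset at once.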

Config Options

DNIKit comes with four Strategies for performing dimension reduction, each with its own advantages and disadvantages:

  • PCA
    • very fast and good for reducing, for example, 1024 dimensions to 40

    • memory efficient

    • not suitable for 2D projection

  • UMAP
    • excellent 2D projections

    • preserves local but not global structure

  • PaCMAP
    • excellent 2D projections

    • preserves local and global structure

  • TSNE (t-SNE)
    • largely replaced by newer strategies

For a more in-depth comparison, please see the example notebook below.
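One way to see why PCA suits a reduction to around 40 dimensions better than a 2D projection is to compare how much variance each keeps. The sketch below uses plain NumPy on synthetic data. The sizes (30 latent directions mixed into 1024 dimensions) are assumptions chosen for illustration, and real embeddings may behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 30 latent directions mixed into 1024
# dimensions, plus a small amount of noise.
latent = rng.normal(size=(500, 30))
mixing = rng.normal(size=(30, 1024))
data = latent @ mixing + 0.1 * rng.normal(size=(500, 1024))

# Squared singular values of the centered data are proportional to the
# variance captured by each principal component.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
variance = singular_values ** 2


def retained(k: int) -> float:
    """Fraction of total variance kept by the top `k` principal components."""
    return float(variance[:k].sum() / variance.sum())


print(f"2 components:  {retained(2):.2f}")
print(f"40 components: {retained(40):.2f}")
```

On this synthetic data the first two components keep only a small share of the variance, while 40 components keep nearly all of it. Nonlinear strategies such as UMAP and PaCMAP take a different approach to 2D layouts and are not modeled by this sketch.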

Relevant API

Example

References