Command Line Utility
The embedding-atlas Python package includes a command-line utility for quickly exploring large text datasets with metadata.


Installation
pip install embedding-atlas
Then launch the command line tool:
embedding-atlas [OPTIONS] INPUTS...
TIP
To avoid package installation issues, we recommend using the uv package manager to install Embedding Atlas and its dependencies. uv allows you to launch the command line tool with a single command:
uvx embedding-atlas
On Windows, you may install the package either in the Windows Subsystem for Linux (WSL) or directly on Windows. To use NVIDIA GPUs, you'll need to install a PyTorch version that supports CUDA; see the PyTorch installation instructions for details.
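To confirm that the installed PyTorch build can see an NVIDIA GPU before computing embeddings, a quick check along these lines may help. This is a minimal sketch: the helper name cuda_available is hypothetical, and it relies only on torch.cuda.is_available().

```python
def cuda_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

If this prints False on a machine with an NVIDIA GPU, the installed PyTorch build is probably CPU-only.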
Loading Data
You can load your data in two ways: locally or from Hugging Face.
Loading Local Data
To get started with your own data, run:
embedding-atlas path_to_dataset.parquet
Loading Hugging Face Data
You can instead load datasets from Hugging Face:
embedding-atlas huggingface_org/dataset_name
Visualizing Embeddings
The script will use SentenceTransformers to compute embedding vectors for the specified column containing the text or image data. You may use the --model option to specify an embedding model. If not specified, a default model will be used. The current defaults are all-MiniLM-L6-v2 for text and google/vit-base-patch16-384 for images, but these are subject to change in future releases.
After embedding vectors are computed, the script will then project the high-dimensional vectors to 2D with UMAP.
TIP
If you already know which column holds your text data, you can specify it with the --text flag. For example:
embedding-atlas path_to_dataset.parquet --text text_column
Similarly, you may supply the --image flag for image data, or the --vector flag for pre-computed embedding vectors.
If you have already computed the embedding projection (e.g., by running your own embedding model and projecting the vectors with UMAP), you may store the coordinates as two columns, such as projection_x and projection_y, and pass them to embedding-atlas with the --x and --y flags:
embedding-atlas path_to_dataset.parquet --x projection_x --y projection_y
You may also pass the --neighbors flag to specify the column name for pre-computed nearest neighbors. Each value in the neighbors column should have the following format: {"ids": [id1, id2, ...], "distances": [d1, d2, ...]}. The IDs should be zero-based row indices. If this column is specified, you'll be able to see nearest neighbors for a selected point in the tool.
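The neighbors format described above can be sketched in plain Python. This is a brute-force illustration, not how the tool computes neighbors. The helper names are hypothetical, the cosine metric is an assumption chosen to match the documented UMAP default, and excluding each row from its own neighbor list is also an assumption, since the text does not say either way.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_neighbors(vectors, k):
    """Return one {"ids": [...], "distances": [...]} dict per row."""
    result = []
    for i, v in enumerate(vectors):
        # Distance from row i to every other row, keyed by zero-based row index.
        scored = [
            (cosine_distance(v, w), j)
            for j, w in enumerate(vectors)
            if j != i  # assumption: a row is not listed as its own neighbor
        ]
        scored.sort()
        nearest = scored[:k]
        result.append({
            "ids": [j for _, j in nearest],
            "distances": [d for d, _ in nearest],
        })
    return result


# Four toy 2D embedding vectors; real embeddings have many more dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
neighbors = build_neighbors(vectors, k=2)
```

Each element of neighbors would be stored in the corresponding row of the neighbors column. For large datasets, an approximate nearest-neighbor library is a more practical choice than this quadratic loop.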
Once this script completes, it will print out a URL like http://localhost:5055/. Open the URL in a web browser to view the embedding.
Reproducibility
For reproducible embedding visualizations, we recommend pre-computing both the embedding vectors and their UMAP projections, and storing them with your dataset. This ensures consistency, since the default embedding model may change over time, floating-point precision may vary across devices, and UMAP introduces randomness through both its default random initialization and its use of parallelism (see the UMAP documentation on reproducibility).
The embedding_atlas package provides utility functions to compute the embedding projections:
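import pandas as pd
from embedding_atlas.projection import compute_text_projection
df = pd.read_parquet("path_to_dataset.parquet")
# Compute the projection from text_column and store the results in the named columns.
compute_text_projection(df, text="text_column",
    x="projection_x", y="projection_y", neighbors="neighbors"
)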
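Building on the snippet above, one way to store the computed columns with the dataset is sketched below. It assumes pandas with Parquet support is installed and that compute_text_projection adds the named columns to the data frame in place, as the call above suggests. The helper name precompute_and_save and the file paths are hypothetical.

```python
def precompute_and_save(input_path, output_path, text_column):
    """Compute projection columns once and save them with the dataset."""
    # Imported here so the heavy dependencies load only when needed.
    import pandas as pd
    from embedding_atlas.projection import compute_text_projection

    df = pd.read_parquet(input_path)
    # Assumption: the function writes the x, y, and neighbors columns into df.
    compute_text_projection(
        df,
        text=text_column,
        x="projection_x",
        y="projection_y",
        neighbors="neighbors",
    )
    df.to_parquet(output_path)
    return output_path


# Example call (paths are placeholders):
# precompute_and_save("path_to_dataset.parquet",
#                     "dataset_with_projection.parquet", "text_column")
```

The saved file can then be opened with the --x, --y, and --neighbors flags described earlier, so the tool reuses the stored columns instead of recomputing them.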
Usage
Usage: embedding-atlas [OPTIONS] INPUTS...
Command Line Options
--text text
Column containing text data.
--image text
Column containing image data.
--vector text
Column containing pre-computed vector embeddings.
--split text
Dataset split name(s) to load from Hugging Face datasets. Can be specified multiple times for multiple splits.
--enable-projection / --disable-projection boolean
Compute embedding projections from text/image/vector data. If disabled without pre-computed projections, the embedding view will be unavailable.
--model text
Model name for generating embeddings (e.g., 'all-MiniLM-L6-v2').
--trust-remote-code boolean
Allow execution of remote code when loading models from Hugging Face Hub.
--batch-size integer
Batch size for processing embeddings (default: 32 for text, 16 for images). Larger values use more memory but may be faster.
--x text
Column containing pre-computed X coordinates for the embedding view.
--y text
Column containing pre-computed Y coordinates for the embedding view.
--neighbors text
Column containing pre-computed nearest neighbors in format: {"ids": [n1, n2, ...], "distances": [d1, d2, ...]}. IDs should be zero-based row indices.
--sample integer
Number of random samples to draw from the dataset. Useful for large datasets.
--umap-n-neighbors integer
Number of neighbors to consider for UMAP dimensionality reduction (default: 15).
--umap-min-dist float
The min_dist parameter for UMAP.
--umap-metric text
Distance metric for UMAP computation (default: 'cosine').
--umap-random-state integer
Random seed for reproducible UMAP results.
--duckdb text
DuckDB connection mode: 'wasm' (run in browser), 'server' (run on this server), or URI (e.g., 'ws://localhost:3000').
--host text
Host address for the web server (default: localhost).
--port integer
Port number for the web server (default: 5055).
--auto-port / --no-auto-port boolean
Automatically find an available port if the specified port is in use.
--static text
Custom path to frontend static files directory.
--export-application text
Export the visualization as a standalone web application to the specified ZIP file and exit.
--point-size float
Size of points in the embedding view (default: automatically calculated based on density).
--stop-words text
Path to a file containing stop words to exclude from the text embedding. The file should be a data frame with column 'word'.
--labels text
Path to a file containing labels for the embedding view. The file should be a data frame with columns 'x', 'y', 'text', and optionally 'level' and 'priority'.
--version boolean
Show the version and exit.
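When launching the tool from a script, the options above can be assembled into an argument list. The sketch below is one way to do that and is not part of the package: the helper name build_command is hypothetical, and the split names, sample size, and port are illustrative values only. It covers the three option shapes in the list: options that take a value, boolean switches, and options that may be repeated (such as --split).

```python
def build_command(inputs, options):
    """Build an embedding-atlas argument list from inputs and option values."""
    args = ["embedding-atlas", *inputs]
    for name, value in options.items():
        flag = "--" + name
        if value is None or value is False:
            # Unset options are left out entirely.
            continue
        if value is True:
            # Boolean switches such as --no-auto-port take no value.
            args.append(flag)
        elif isinstance(value, (list, tuple)):
            # Repeatable options such as --split are passed once per value.
            for item in value:
                args += [flag, str(item)]
        else:
            args += [flag, str(value)]
    return args


command = build_command(
    ["huggingface_org/dataset_name"],
    {
        "split": ["train", "test"],  # illustrative split names
        "text": "text_column",
        "sample": 10000,
        "port": 8080,
        "no-auto-port": True,
    },
)
# To launch the tool: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with column names that contain spaces.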