Command Line Utility
The Python package includes a command-line utility for quickly exploring large text datasets with metadata.


Installation
Install the package with pip:
pip install embedding-atlas
Then launch the command-line tool:
embedding-atlas [OPTIONS] INPUTS...
TIP
To avoid package installation issues, we recommend using the uv package manager to install Embedding Atlas and its dependencies. With uv, you can launch the command-line tool with a single command:
uvx embedding-atlas
Loading Data
You can load data from a local file or from Hugging Face.
Loading Local Data
To get started with your own data, run:
embedding-atlas path_to_dataset.parquet
Loading Hugging Face Data
You can instead load datasets from Hugging Face:
embedding-atlas huggingface_org/dataset_name
Visualizing Embeddings
The tool uses SentenceTransformers to compute embedding vectors for the column containing the text data, then projects the high-dimensional embedding vectors to 2D with UMAP.
TIP
Optionally, if you know what column your text data is in beforehand, you can specify which column to use with the --text flag, for example:
embedding-atlas path_to_dataset.parquet --text text_column
If you've already pre-computed the embedding projection (e.g., by running your own embedding model and projecting the vectors with UMAP), you may store the coordinates as two columns such as projection_x and projection_y, and pass them to embedding-atlas with the --x and --y flags:
embedding-atlas path_to_dataset.parquet --x projection_x --y projection_y
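As a rough sketch of how such a file might be prepared, the Python below embeds a list of texts, projects them to 2D, and attaches the coordinates as projection_x and projection_y. It is an illustration under assumptions, not the tool's own implementation: the helper names (attach_projection, embed_and_project, write_parquet) are hypothetical, the model name is the example given in the --model help text, the UMAP settings mirror the documented defaults, and it assumes sentence-transformers, umap-learn, pandas, and a Parquet engine such as pyarrow are installed.

```python
def attach_projection(records, coords, x_col="projection_x", y_col="projection_y"):
    """Return copies of the records with 2D coordinates added as two columns."""
    if len(records) != len(coords):
        raise ValueError("records and coords must have the same length")
    return [
        {**rec, x_col: float(x), y_col: float(y)}
        for rec, (x, y) in zip(records, coords)
    ]


def embed_and_project(texts, model_name="all-MiniLM-L6-v2", random_state=42):
    """Embed texts and reduce them to 2D (needs third-party packages)."""
    # Imported lazily so the rest of the sketch runs without these packages.
    from sentence_transformers import SentenceTransformer
    import umap

    embeddings = SentenceTransformer(model_name).encode(texts)
    # n_neighbors=15 and metric="cosine" follow the defaults listed in the help text.
    # UMAP needs a reasonably sized dataset; a handful of rows is not enough.
    reducer = umap.UMAP(
        n_components=2, n_neighbors=15, metric="cosine", random_state=random_state
    )
    return reducer.fit_transform(embeddings)


def write_parquet(records, path):
    """Write a list of dicts to a Parquet file (needs pandas and pyarrow)."""
    import pandas as pd

    pd.DataFrame(records).to_parquet(path)


# Runnable demo with hand-made coordinates instead of a real projection.
demo = attach_projection([{"text": "a"}, {"text": "b"}], [(0.5, 1.0), (2, 3)])
print(demo)

# With the third-party packages installed and a real dataset (my_texts is a placeholder):
# records = [{"text": t} for t in my_texts]
# coords = embed_and_project([r["text"] for r in records])
# write_parquet(attach_projection(records, coords), "path_to_dataset.parquet")
```

The resulting file can then be passed to embedding-atlas with --x projection_x --y projection_y, as in the command above.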
You may also pass in the --neighbors flag to specify the column name for pre-computed nearest neighbors. The neighbors column should have values in the following format: {"ids": [id1, id2, ...], "distances": [d1, d2, ...]}. If this column is specified, you'll be able to see nearest neighbors for a selected point in the tool.
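To make the neighbors format concrete, here is a minimal Python sketch that builds one such value per row with a brute-force search. Several things in it are assumptions rather than documented behavior: the function name brute_force_neighbors is hypothetical, the ids are taken to be zero-based row indices, and distances are Euclidean over whatever vectors you pass in. The documentation above only specifies the {"ids": ..., "distances": ...} shape, so check how your own ids should be defined before relying on this.

```python
import json
import math


def brute_force_neighbors(points, k=2):
    """For each point, find its k nearest other points.

    Returns one dict per point in the shape
    {"ids": [...], "distances": [...]}, ordered from nearest to farthest.
    """
    result = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        candidates = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        nearest = candidates[:k]
        result.append(
            {
                "ids": [j for _, j in nearest],
                "distances": [d for d, _ in nearest],
            }
        )
    return result


points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
neighbors = brute_force_neighbors(points, k=2)
print(json.dumps(neighbors[0]))  # {"ids": [2, 1], "distances": [1.0, 5.0]}
```

Brute force is quadratic in the number of rows, so for large datasets an approximate nearest-neighbor library would be the more practical choice.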
Once this script completes, it will print out a URL like http://localhost:5055/. Open the URL in a web browser to view the embedding.
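If you prefer to drive the tool from a Python script, the invocations in this section can be assembled as an argument list and run with the standard library. This is only a convenience sketch: build_command is a hypothetical helper, not part of the package, and it uses only the --text, --x, --y, and --neighbors flags described above.

```python
import shlex
import subprocess


def build_command(dataset, text=None, x=None, y=None, neighbors=None):
    """Assemble an embedding-atlas argument list from optional column names."""
    cmd = ["embedding-atlas", dataset]
    options = (("--text", text), ("--x", x), ("--y", y), ("--neighbors", neighbors))
    for flag, value in options:
        # Only include flags the caller actually set.
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = build_command("path_to_dataset.parquet", x="projection_x", y="projection_y")
print(shlex.join(cmd))

# To launch the tool (this blocks until the server is stopped), uncomment:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a path or column name contains spaces.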
Usage
Usage: embedding-atlas [OPTIONS] INPUTS...

Options:
  --text TEXT                     Column containing text data.
  --image TEXT                    Column containing image data.
  --split TEXT                    Dataset split name(s) to load from Hugging
                                  Face datasets. Can be specified multiple times
                                  for multiple splits.
  --embedding / --no-embedding    Whether to compute embeddings for the data.
                                  Disable if embeddings are pre-computed or if
                                  you do not want an embedding view.
  --model TEXT                    Model name for generating embeddings (e.g.,
                                  'all-MiniLM-L6-v2').
  --trust-remote-code             Allow execution of remote code when loading
                                  models from Hugging Face Hub.
  --x TEXT                        Column containing pre-computed X coordinates
                                  for the embedding view.
  --y TEXT                        Column containing pre-computed Y coordinates
                                  for the embedding view.
  --neighbors TEXT                Column containing pre-computed nearest
                                  neighbors in format: {"ids": [n1, n2, ...],
                                  "distances": [d1, d2, ...]}.
  --sample INTEGER                Number of random samples to draw from the
                                  dataset. Useful for large datasets.
  --umap-n-neighbors INTEGER      Number of neighbors to consider for UMAP
                                  dimensionality reduction (default: 15).
  --umap-min-dist FLOAT           The min_dist parameter for UMAP.
  --umap-metric TEXT              Distance metric for UMAP computation (default:
                                  'cosine').
  --umap-random-state INTEGER     Random seed for reproducible UMAP results.
  --duckdb TEXT                   DuckDB connection mode: 'wasm' (run in
                                  browser), 'server' (run on this server), or
                                  URI (e.g., 'ws://localhost:3000').
  --host TEXT                     Host address for the web server (default:
                                  localhost).
  --port INTEGER                  Port number for the web server (default:
                                  5055).
  --auto-port / --no-auto-port    Automatically find an available port if the
                                  specified port is in use.
  --static TEXT                   Custom path to frontend static files
                                  directory.
  --export-application TEXT       Export the visualization as a standalone web
                                  application to the specified ZIP file and
                                  exit.
  --version                       Show the version and exit.
  --help                          Show this message and exit.