Command Line Utility

The Python package includes a command-line utility for quickly exploring large text datasets with metadata.

Installation

bash
pip install embedding-atlas

Then launch the command line tool:

bash
embedding-atlas [OPTIONS] INPUTS...

TIP

To avoid package installation issues, we recommend using the uv package manager to install Embedding Atlas and its dependencies. uv allows you to launch the command line tool with a single command:

bash
uvx embedding-atlas

On Windows, you may install the package either in the Windows Subsystem for Linux (WSL) or directly on Windows. To use NVIDIA GPUs, you'll need to install a PyTorch version that supports CUDA; see here for more details.

Loading Data

You can load your data in two ways: locally or from Hugging Face.

Loading Local Data

To get started with your own data, run:

bash
embedding-atlas path_to_dataset.parquet

Loading Hugging Face Data

You can instead load datasets from Hugging Face:

bash
embedding-atlas huggingface_org/dataset_name

Visualizing Embeddings

The script will use SentenceTransformers to compute embedding vectors for the specified column containing the text or image data. You may use the --model option to specify an embedding model. If not specified, a default model will be used. The current defaults are all-MiniLM-L6-v2 for text and google/vit-base-patch16-384 for images, but these are subject to change in future releases.

After embedding vectors are computed, the script will then project the high-dimensional vectors to 2D with UMAP.

TIP

If you already know which column holds your text data, you can specify it with the --text flag, for example:

bash
embedding-atlas path_to_dataset.parquet --text text_column

Similarly, you may supply the --image flag for image data, or the --vector flag for pre-computed embedding vectors.

If you've already pre-computed the embedding projection (e.g., by running your own embedding model and projecting the resulting vectors with UMAP), you may store it as two columns such as projection_x and projection_y, and pass them to embedding-atlas with the --x and --y flags:

bash
embedding-atlas path_to_dataset.parquet --x projection_x --y projection_y

You may also pass in the --neighbors flag to specify the column name for pre-computed nearest neighbors. The neighbors column should have values in the following format: {"ids": [id1, id2, ...], "distances": [d1, d2, ...]}. The IDs should be zero-based row indices. If this column is specified, you'll be able to see nearest neighbors for a selected point in the tool.
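As a rough illustration of the neighbors format described above, the following sketch builds one such value per row with a brute-force search over a small list of vectors. The helper name `build_neighbors`, the sample vectors, and the choice of Euclidean distance are assumptions made for this example; the metric the tool expects and whether a row should list itself are not specified here, and this sketch excludes the row itself.

```python
import math

def build_neighbors(vectors, k):
    """Return one {"ids": [...], "distances": [...]} dict per row (hypothetical helper)."""
    result = []
    for i, v in enumerate(vectors):
        # Distance from row i to every other row; ids are zero-based row indices.
        scored = [(math.dist(v, w), j) for j, w in enumerate(vectors) if j != i]
        scored.sort()
        nearest = scored[:k]
        result.append({
            "ids": [j for _, j in nearest],
            "distances": [d for d, _ in nearest],
        })
    return result

# Placeholder vectors, for illustration only.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = build_neighbors(vectors, k=2)
```

Each element of `neighbors` could then be stored in the row's neighbors column. Brute force is quadratic in the number of rows, so for large datasets an approximate nearest-neighbor library is likely a better fit.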

Once the script finishes processing, it prints a URL such as http://localhost:5055/. Open the URL in a web browser to view the embedding.

Reproducibility

For reproducible embedding visualizations, we recommend pre-computing both the embedding vectors and their UMAP projections, and storing them with your dataset. This ensures consistency since the default embedding model may change over time, floating-point precision may vary across different devices, and UMAP introduces randomness through both its default random initialization and its use of parallelism (see here).

The embedding_atlas package provides utility functions to compute the embedding projections:

python
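import pandas as pd
from embedding_atlas.projection import compute_text_projection

# Assumes the dataset is a Parquet file readable into a pandas DataFrame.
df = pd.read_parquet("path_to_dataset.parquet")

# Compute the projection, storing results under the given column names.
compute_text_projection(df, text="text_column",
    x="projection_x", y="projection_y", neighbors="neighbors"
)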

MCP Support

The command line utility supports the Model Context Protocol (MCP). You can enable it with the --mcp flag. When enabled, the utility exposes an MCP server that allows AI agents to query the data schema, run SQL queries, create and modify charts, adjust the layout, and capture screenshots.

Usage

Usage: embedding-atlas [OPTIONS] INPUTS...
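If you want to launch the tool from a Python script, the usage line above can be assembled as an argument list. The sketch below is one possible way to do that; `build_command` is a hypothetical helper, not part of the package, and it covers only a few of the options documented on this page (--text, --query, --sample, --umap-random-state). The column name and SQL query are placeholders.

```python
import shlex

def build_command(inputs, text=None, query=None, sample=None, umap_random_state=None):
    """Assemble an embedding-atlas argument list (hypothetical helper)."""
    argv = ["embedding-atlas", *inputs]
    if text is not None:
        argv += ["--text", text]
    if query is not None:
        argv += ["--query", query]
    if sample is not None:
        argv += ["--sample", str(sample)]
    if umap_random_state is not None:
        argv += ["--umap-random-state", str(umap_random_state)]
    return argv

argv = build_command(
    ["path_to_dataset.parquet"],
    text="text_column",
    # The query refers to the original data as 'data'.
    query="SELECT * FROM data WHERE text_column IS NOT NULL",
    sample=10000,
    umap_random_state=42,
)
command_line = shlex.join(argv)  # shell-quoted form, e.g. for logging
# To actually launch the tool: subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with the SQL query; `shlex.join` is used here only to produce a copy-pasteable command line.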

Command Line Options

--text text

Column containing text data.

--image text

Column containing image data.

--vector text

Column containing pre-computed vector embeddings.

--split text

Dataset split name(s) to load from Hugging Face datasets. Can be specified multiple times for multiple splits.

--enable-projection / --disable-projection boolean

Compute embedding projections from text/image/vector data. If disabled without pre-computed projections, the embedding view will be unavailable.

--model text

Model name for generating embeddings (e.g., 'all-MiniLM-L6-v2').

--trust-remote-code boolean

Allow execution of remote code when loading models from Hugging Face Hub.

--batch-size integer

Batch size for processing embeddings (default: 32 for text, 16 for images). Larger values use more memory but may be faster.

--text-projector choice

Embedding provider: 'sentence_transformers' (local) or 'litellm' (API-based).

--api-key text

API key for litellm embedding provider.

--api-base text

API endpoint for litellm embedding provider.

--dimensions integer

Number of dimensions for output embeddings (litellm only, supported by OpenAI text-embedding-3+).

--sync boolean

Process embeddings synchronously (litellm only). Use for local servers like Ollama to avoid memory issues.

--x text

Column containing pre-computed X coordinates for the embedding view.

--y text

Column containing pre-computed Y coordinates for the embedding view.

--neighbors text

Column containing pre-computed nearest neighbors in format: {"ids": [n1, n2, ...], "distances": [d1, d2, ...]}. IDs should be zero-based row indices.

--query text

Use the result of the given SQL query as input data. In the query, you may refer to the original data as 'data'.

--sample integer

Number of random samples to draw from the dataset. Useful for large datasets. If query is specified, sampling applies after the query.

--umap-n-neighbors integer

Number of neighbors to consider for UMAP dimensionality reduction (default: 15).

--umap-min-dist float

The min_dist parameter for UMAP.

--umap-metric text

Distance metric for UMAP computation (default: 'cosine').

--umap-random-state integer

Random seed for reproducible UMAP results.

--duckdb text

DuckDB connection mode: 'wasm' (run in browser), 'server' (run on this server), or URI (e.g., 'ws://localhost:3000').

--host text

Host address for the web server (default: localhost).

--port integer

Port number for the web server (default: 5055).

--auto-port / --no-auto-port boolean

Automatically find an available port if the specified port is in use.

--cors text

Allow cross-origin requests. Use --cors to allow all origins, or --cors http://example.com for a specific domain (or a comma-separated list of domains).

--static text

Custom path to frontend static files directory.

--export-application text

Export the visualization as a standalone web application to the specified ZIP file and exit.

--with text

Import the given Python module before loading data. For example, you can use this to import fsspec filesystems. Can be specified multiple times to import multiple modules.

--point-size float

Size of points in the embedding view (default: automatically calculated based on density).

--stop-words text

Path to a file containing stop words to exclude from the text embedding. The file should be a table with column 'word'.

--labels text

Path to a file containing labels for the embedding view. The file should be a table with columns 'x', 'y', 'text', and optionally 'level' and 'priority'.
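As an illustration of the labels table layout, the sketch below writes the three required columns as CSV. That CSV is an accepted file format for --labels is an assumption here, since the description only says "a table"; the helper `write_labels` and the label values are placeholders, and the optional 'level' and 'priority' columns are left out.

```python
import csv
import io

# Placeholder label positions and texts, for illustration only.
labels = [
    {"x": 1.5, "y": -0.25, "text": "Sports"},
    {"x": -3.0, "y": 2.0, "text": "Politics"},
]

def write_labels(rows, fileobj):
    """Write label rows as a CSV table with the required columns (hypothetical helper)."""
    writer = csv.DictWriter(fileobj, fieldnames=["x", "y", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Written to an in-memory buffer here; use open(path, "w", newline="") for a real file.
buffer = io.StringIO()
write_labels(labels, buffer)
csv_text = buffer.getvalue()
```

The x and y values are presumably in the same coordinate space as the projection columns, so labels would normally be derived from the projected data rather than written by hand.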

--mcp / --no-mcp boolean

Enable MCP (Model Context Protocol) server endpoints for external tool integration.

--version boolean

Show the version and exit.