Jupyter Widget

The Python package also provides a Jupyter widget to use Embedding Atlas in your notebooks.

Installation

bash
pip install embedding-atlas

Example

python
from embedding_atlas.widget import EmbeddingAtlasWidget

# `df` is an existing DataFrame (not defined in this snippet).
# Create an Embedding Atlas widget without a projection.
# This widget shows only the table and charts, not the embedding view.
EmbeddingAtlasWidget(df)

# Compute text embeddings and their 2D projection.
# The results are stored in the columns named by x, y, and neighbors.
from embedding_atlas.projection import compute_text_projection

compute_text_projection(df, text="description",
    x="projection_x", y="projection_y", neighbors="neighbors"
)

# Create an Embedding Atlas widget with the pre-computed projection
widget = EmbeddingAtlasWidget(df, text="description",
    x="projection_x", y="projection_y", neighbors="neighbors"
)

# Display the widget
widget

The widget embeds the Embedding Atlas UI into your notebook. You can make selections in the widget, and then use:

python
df = widget.selection()

to get the selection back as a data frame.

Reference

python
from embedding_atlas.widget import EmbeddingAtlasWidget

Below are the constructor options of the widget:

Create an Embedding Atlas widget.

Args:
data_frame:

A DataFrame or Arrow object to register with DuckDB.

row_id:

The column name for row id (if not specified, a row id column will be added).

x:

The column name for the X axis in the embedding.

y:

The column name for the Y axis in the embedding.

text:

The column name for the textual data.

neighbors:

The column name containing precomputed K-nearest neighbors for each point. Each value in the column should be a dictionary with the format: { "ids": [id1, id2, ...], "distances": [distance1, distance2, ...] }.

  • "ids" should be an array of row ids of the neighbors (if row_id is specified, match the value in row_id, otherwise use the row index), sorted by distance.

  • "distances" should contain the corresponding distances to each neighbor.
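To make the expected shape of a neighbors value concrete, here is a minimal sketch that builds one dictionary per row with a brute-force search over a handful of 2D points. The helper name brute_force_neighbors is hypothetical and not part of the package. The sketch uses row indices as ids (the case where row_id is not specified), and it excludes each point from its own neighbor list, which is an assumption since the description above does not say either way. In practice the neighbors would normally be computed in the original embedding space, not on 2D coordinates.

```python
import math


def brute_force_neighbors(points, k):
    """Return one {"ids": [...], "distances": [...]} dict per point.

    ids are row indices, sorted by ascending Euclidean distance.
    Assumption: a point is not listed as its own neighbor.
    """
    result = []
    for i, (xi, yi) in enumerate(points):
        # Score every other point by distance, then keep the k closest.
        scored = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )[:k]
        result.append({
            "ids": [j for _, j in scored],
            "distances": [d for d, _ in scored],
        })
    return result


# Toy data for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = brute_force_neighbors(points, k=2)
print(neighbors[0])
```

Each element of the resulting list would then be stored in the corresponding row of the column passed as neighbors.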

labels:

Labels for the embedding view. Set to the string "automatic" to generate labels automatically, or "disabled" to disable automatic labels. Automatic labels are generated by clustering the 2D density distribution and selecting representative keywords by TF-IDF ranking. You can also pass a list of labels. Each label must contain x and y coordinates and text for the label content. Optionally, a label may specify an integer level, which roughly controls the zoom level at which the label appears, and a priority. Higher-priority labels are more likely to be shown when multiple labels overlap.
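As an illustration of the list form, the sketch below builds two labels as plain dictionaries. The dictionary representation and the helper make_label are assumptions made for this example. Only the field names x, y, text, level, and priority come from the description above, and the coordinates and text are made-up values.

```python
def make_label(x, y, text, level=None, priority=None):
    """Build one label; level and priority are optional (hypothetical helper)."""
    label = {"x": x, "y": y, "text": text}
    if level is not None:
        # level is described as an integer that roughly controls zoom level.
        label["level"] = int(level)
    if priority is not None:
        # Higher priority labels are favored when labels overlap.
        label["priority"] = priority
    return label


labels = [
    make_label(-2.5, 1.0, "sports", level=0, priority=10),
    make_label(3.2, -0.7, "cooking"),
]
print(labels)
```

A list like this would be passed as the labels argument in place of "automatic" or "disabled".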

stop_words:

Stop words for automatic label generation.

point_size:

Override the default point size for the embedding view.

show_table:

Whether to display the data table when the widget opens.

show_charts:

Whether to display charts when the widget opens.

show_embedding:

Whether to display the embedding view when the widget opens.

connection (DuckDBPyConnection, optional):

A DuckDB connection. Defaults to duckdb.connect().