Jupyter Widget
The Python package also provides a Jupyter widget to use Embedding Atlas in your notebooks.
Installation
pip install embedding-atlas
Example
from embedding_atlas.widget import EmbeddingAtlasWidget
# Create an Embedding Atlas widget without a projection.
# `df` is assumed to be an existing data frame (for example, a pandas DataFrame).
# This widget shows the table and charts only, not the embedding view.
EmbeddingAtlasWidget(df)
# Compute text embedding and projection of the embedding
from embedding_atlas.projection import compute_text_projection
compute_text_projection(
    df, text="description",
    x="projection_x", y="projection_y", neighbors="neighbors",
)
# Create an Embedding Atlas widget with the pre-computed projection
widget = EmbeddingAtlasWidget(
    df, text="description",
    x="projection_x", y="projection_y", neighbors="neighbors",
)
# Display the widget
widget
The widget embeds the Embedding Atlas UI into your notebook. You can make selections in the widget, and then use:
df = widget.selection()
to get the selection back as a data frame.
Reference
from embedding_atlas.widget import EmbeddingAtlasWidget
Below are the constructor options of the widget:
Create an Embedding Atlas widget.
- Args:
- data_frame:
A DataFrame/Arrow object to "register" with DuckDB.
- row_id:
The column name for row id (if not specified, a row id column will be added).
- x:
The column name for X axis in the embedding.
- y:
The column name for Y axis in the embedding.
- text:
The column name for the textual data.
- neighbors:
The column name containing precomputed K-nearest neighbors for each point. Each value in the column should be a dictionary with the format:
{ "ids": [id1, id2, ...], "distances": [distance1, distance2, ...] }
"ids" should be an array of row ids of the neighbors (if row_id is specified, match the values in row_id; otherwise use the row index), sorted by distance. "distances" should contain the corresponding distances to each neighbor.
- labels:
Labels for the embedding view. Set to the string "automatic" to generate labels automatically, or "disabled" to disable auto labels. Automatic labels are generated by clustering the 2D density distribution and selecting representative keywords using TF-IDF ranking. You can also pass in a list of labels. Each label must contain x and y coordinates and text for the label content. Optionally, you may specify an integer level to roughly control the zoom level at which the label appears, and a priority. Higher-priority labels have a better chance of appearing when multiple labels overlap.
- stop_words:
Stop words for automatic label generation.
- point_size:
Override the default point size for the embedding view.
- show_table:
Whether to display the data table when the widget opens.
- show_charts:
Whether to display charts when the widget opens.
- show_embedding:
Whether to display the embedding view when the widget opens.
- connection (DuckDBPyConnection, optional):
A DuckDB connection. Defaults to duckdb.connect().
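The value formats described above for neighbors and labels can be illustrated with a small, pure-Python sketch. It is not the library's own implementation: the helper build_neighbors is hypothetical, it uses brute-force Euclidean distance on toy 2D points, and it excludes each point from its own neighbor list (a choice made here, not a documented requirement). Representing each label as a dict keyed by x, y, text, level, and priority is likewise an assumption based on the field names listed above.

```python
import math


def build_neighbors(points, k=2):
    """Hypothetical helper: one {"ids": [...], "distances": [...]} dict per point.

    Ids are row indices (the case where row_id is not specified),
    sorted by ascending distance. The point itself is excluded.
    """
    column = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with its row index.
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()  # nearest first
        nearest = others[:k]
        column.append({
            "ids": [j for _, j in nearest],
            "distances": [d for d, _ in nearest],
        })
    return column


# Toy 2D coordinates standing in for real embedding vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = build_neighbors(points, k=2)

# A manual label list (assumed dict layout): x, y, and text are required,
# level and priority are optional.
labels = [
    {"x": 0.5, "y": 0.0, "text": "cluster A", "level": 0, "priority": 2},
    {"x": 5.0, "y": 5.0, "text": "outlier"},
]
```

With a data frame whose rows are in the same order as points, the list could be stored as a column (for example df["neighbors"] = neighbors) and its name passed as neighbors="neighbors", while the label list would be passed as labels=labels. For real data, compute_text_projection already writes a neighbors column, so a hand-built one is only needed when the projection is computed elsewhere.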