turicreate.dbscan.create

turicreate.dbscan.create(dataset, features=None, distance=None, radius=1.0, min_core_neighbors=10, verbose=True)

Create a DBSCAN clustering model. The DBSCAN method partitions the input dataset into three types of points, based on the estimated probability density at each point.

  • Core points have a large number of points within a given neighborhood. Specifically, at least min_core_neighbors points must lie within distance radius of a point for it to be considered a core point.
  • Boundary points are within distance radius of a core point, but don’t have sufficient neighbors of their own to be considered core.
  • Noise points comprise the remainder of the data. These points have too few neighbors to be considered core points, and are further than distance radius from all core points.

Clusters are formed by connecting core points that are neighbors of each other, then assigning boundary points to their nearest core neighbor’s cluster.
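The three point types and the cluster-forming step described above can be illustrated with a minimal pure-Python sketch. This is not the Turi Create implementation: the function name `dbscan_sketch` is hypothetical, the distance is fixed to Euclidean, and it assumes a point is not counted as its own neighbor.

```python
import math

def dbscan_sketch(points, radius=1.0, min_core_neighbors=10):
    """Return (cluster_ids, types) for a list of numeric tuples.

    Assumption: a point does not count as its own neighbor.
    """
    n = len(points)
    # Neighbors of each point: every other point within `radius`.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    is_core = [len(nbrs) >= min_core_neighbors for nbrs in neighbors]

    # Connect core points that are neighbors of each other.
    cluster = [None] * n
    next_id = 0
    for i in range(n):
        if not is_core[i] or cluster[i] is not None:
            continue
        cluster[i] = next_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if is_core[q] and cluster[q] is None:
                    cluster[q] = next_id
                    stack.append(q)
        next_id += 1

    # Boundary points join their nearest core neighbor's cluster;
    # points with no core neighbor are noise and keep cluster None.
    types = []
    for i in range(n):
        if is_core[i]:
            types.append('core')
            continue
        core_nbrs = [j for j in neighbors[i] if is_core[j]]
        if core_nbrs:
            nearest = min(core_nbrs,
                          key=lambda j: math.dist(points[i], points[j]))
            cluster[i] = cluster[nearest]
            types.append('boundary')
        else:
            types.append('noise')
    return cluster, types
```

The sketch compares all pairs of points, so it is only suitable for small inputs.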

Parameters:
dataset : SFrame

Training data, with each row corresponding to an observation. Must include all features specified in the features parameter, but may have additional columns as well.

features : list[str], optional

Names of the columns with features to use in comparing records. ‘None’ (the default) indicates that all columns of the input dataset should be used to train the model. All features must be numeric, i.e. integer or float types.

distance : str or list[list], optional

Function to measure the distance between any two input data rows. This may be one of two types:

  • String: the name of a standard distance function. One of ‘euclidean’, ‘squared_euclidean’, ‘manhattan’, ‘levenshtein’, ‘jaccard’, ‘weighted_jaccard’, ‘cosine’, or ‘transformed_dot_product’.
  • Composite distance: the weighted sum of several standard distance functions applied to various features. This is specified as a list of distance components, each of which is itself a list containing three items:
    1. list or tuple of feature names (str)
    2. standard distance name (str)
    3. scaling factor (int or float)

For more information about Turi Create distance functions, please see the distances module.

For sparse vectors, missing keys are assumed to have value 0.0.

If ‘distance’ is left unspecified, a composite distance is constructed automatically based on feature types.
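To make the composite distance format concrete, the sketch below evaluates a weighted sum of standard distances for one pair of rows. The helper `composite_distance`, the `STANDARD` lookup table, and the column names `x1`, `x2`, `x3` are all hypothetical and exist only for illustration; only two standard distances are implemented, and this is not how Turi Create computes distances internally.

```python
import math

# Illustrative subset of standard distances on numeric feature lists.
STANDARD = {
    'euclidean': lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
    'manhattan': lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
}

def composite_distance(spec, row_a, row_b):
    """Weighted sum of standard distances, each applied to its own features."""
    total = 0.0
    for feature_names, distance_name, weight in spec:
        a = [row_a[name] for name in feature_names]
        b = [row_b[name] for name in feature_names]
        total += weight * STANDARD[distance_name](a, b)
    return total

# Each component: [feature names, standard distance name, scaling factor].
spec = [
    [['x1', 'x2'], 'euclidean', 1.0],
    [['x3'], 'manhattan', 0.5],
]
row_a = {'x1': 0.0, 'x2': 0.0, 'x3': 1.0}
row_b = {'x1': 3.0, 'x2': 4.0, 'x3': 5.0}
print(composite_distance(spec, row_a, row_b))
```

A list shaped like `spec` is the form the distance parameter expects for a composite distance, according to the description above.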

radius : int or float, optional

Size of each point’s neighborhood, with respect to the specified distance function.

min_core_neighbors : int, optional

Number of neighbors that must be within distance radius of a point in order for that point to be considered a “core point” of a cluster.

verbose : bool, optional

If True, print progress updates and model details during model creation.

Returns:
out : DBSCANModel

A model containing a cluster label for each row in the input dataset. Also contains the indices of the core points, cluster boundary points, and noise points.

Notes

  • Our implementation of DBSCAN first computes the similarity graph on the input dataset, which can be a computationally intensive process. In the current implementation, some distances are substantially faster than others; in particular “euclidean”, “squared_euclidean”, “cosine”, and “transformed_dot_product” are quite fast, while composite distances can be slow.
  • Any distance function in the Turi Create library may be used with DBSCAN, but the results may be poor for distances that violate the standard metric properties, i.e. symmetry, non-negativity, triangle inequality, and identity of indiscernibles. In particular, the DBSCAN algorithm is based on the concept of connecting high-density points that are close to each other into a single cluster, but the notion of “close” may be very counterintuitive if the chosen distance function is not a valid metric. The distances “euclidean”, “manhattan”, “jaccard”, and “levenshtein” will likely yield the best results.
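As a small illustration of a violated metric property, the sketch below checks the triangle inequality for three collinear points. The functions are written from the usual mathematical definitions rather than taken from Turi Create, and `satisfies_triangle_inequality` is a hypothetical helper.

```python
def squared_euclidean(a, b):
    # Sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return squared_euclidean(a, b) ** 0.5

def satisfies_triangle_inequality(dist, a, b, c):
    # A metric requires d(a, c) <= d(a, b) + d(b, c).
    return dist(a, c) <= dist(a, b) + dist(b, c)

a, b, c = (0.0,), (1.0,), (2.0,)
print(satisfies_triangle_inequality(euclidean, a, b, c))
print(satisfies_triangle_inequality(squared_euclidean, a, b, c))
```

For these points the squared Euclidean distance from a to c is 4, while the two shorter hops sum to 2, so the inequality fails. The plain Euclidean distance satisfies it.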

Examples

>>> sf = turicreate.SFrame({
...     'x1': [0.6777, -9.391, 7.0385, 2.2657, 7.7864, -10.16, -8.162,
...            8.8817, -9.525, -9.153, 2.0860, 7.6619, 6.5511, 2.7020],
...     'x2': [5.6110, 8.5139, 5.3913, 5.4743, 8.3606, 7.8843, 2.7305,
...            5.1679, 6.7231, 3.7051, 1.7682, 7.4608, 3.1270, 6.5624]})
...
>>> model = turicreate.dbscan.create(sf, radius=4.25, min_core_neighbors=3)
>>> model.cluster_id.print_rows(15)
+--------+------------+----------+
| row_id | cluster_id |   type   |
+--------+------------+----------+
|   8    |     0      |   core   |
|   7    |     2      |   core   |
|   0    |     1      |   core   |
|   2    |     2      |   core   |
|   3    |     1      |   core   |
|   11   |     2      |   core   |
|   4    |     2      |   core   |
|   1    |     0      | boundary |
|   6    |     0      | boundary |
|   5    |     0      | boundary |
|   9    |     0      | boundary |
|   12   |     2      | boundary |
|   10   |     1      | boundary |
|   13   |     1      | boundary |
+--------+------------+----------+
[14 rows x 3 columns]