turicreate.dbscan.create

turicreate.dbscan.create(dataset, features=None, distance=None, radius=1.0, min_core_neighbors=10, verbose=True)

Create a DBSCAN clustering model. The DBSCAN method partitions the input dataset into three types of points, based on the estimated probability density at each point.
 Core points have a large number of points within a given neighborhood. Specifically, at least min_core_neighbors points must be within distance radius of a point for it to be considered a core point.
 Boundary points are within distance radius of a core point, but don’t have sufficient neighbors of their own to be considered core.
 Noise points comprise the remainder of the data. These points have too few neighbors to be considered core points, and are further than distance radius from all core points.
Clusters are formed by connecting core points that are neighbors of each other, then assigning boundary points to their nearest core neighbor’s cluster.
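The three point types described above can be illustrated with a minimal pure-Python sketch. This is not the Turi Create implementation: the function name `classify_points` is hypothetical, Euclidean distance is assumed, and two details the description leaves open are chosen arbitrarily here (a point does not count as its own neighbor, and "within distance radius" is taken to mean less than or equal to radius).

```python
import math

def classify_points(points, radius=1.0, min_core_neighbors=10):
    """Label each point 'core', 'boundary', or 'noise' (illustrative only)."""
    n = len(points)
    # neighbors[i] holds the indices of the *other* points within `radius`
    # of point i (assumption: a point is not its own neighbor).
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # Core points have at least `min_core_neighbors` neighbors.
    core = {i for i in range(n) if len(neighbors[i]) >= min_core_neighbors}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            # Not core itself, but within `radius` of a core point.
            labels.append("boundary")
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.4, 1), (10, 10)]
print(classify_points(points, radius=1.5, min_core_neighbors=3))
# ['core', 'core', 'core', 'core', 'boundary', 'noise']
```

The four corners of the unit square are each within 1.5 of the other three, so they are core points; (2.4, 1) is near only one of them, and (10, 10) is near none.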
Parameters:  dataset : SFrame
Training data, with each row corresponding to an observation. Must include all features specified in the features parameter, but may have additional columns as well.
 features : list[str], optional
Names of the columns with features to use in comparing records. ‘None’ (the default) indicates that all columns of the input dataset should be used to train the model. All features must be numeric, i.e. integer or float types.
 distance : str or list[list], optional
Function to measure the distance between any two input data rows. This may be one of two types:
 String: the name of a standard distance function. One of ‘euclidean’, ‘squared_euclidean’, ‘manhattan’, ‘levenshtein’, ‘jaccard’, ‘weighted_jaccard’, ‘cosine’, or ‘transformed_dot_product’.
 Composite distance: the weighted sum of several standard distance
functions applied to various features. This is specified as a list of
distance components, each of which is itself a list containing three
items:
 list or tuple of feature names (str)
 standard distance name (str)
 scaling factor (int or float)
For more information about Turi Create distance functions, please see the distances module.
For sparse vectors, missing keys are assumed to have value 0.0.
If ‘distance’ is left unspecified, a composite distance is constructed automatically based on feature types.
 radius : int or float, optional
Size of each point’s neighborhood, with respect to the specified distance function.
 min_core_neighbors : int, optional
Number of neighbors that must be within distance radius of a point in order for that point to be considered a “core point” of a cluster.
 verbose : bool, optional
If True, print progress updates and model details during model creation.
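To make the composite form of the distance parameter concrete, the sketch below builds a list with the three-item component shape described above and evaluates the weighted sum by hand. It is a pure-Python illustration, not Turi Create code: the helper `composite_distance`, the `STANDARD` lookup table, and the column name 'x3' are hypothetical, and only two of the standard distances are reimplemented.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical lookup from standard distance name to a Python function.
STANDARD = {"euclidean": euclidean, "manhattan": manhattan}

# Each component is [feature names, standard distance name, scaling factor],
# the same shape the `distance` parameter description calls for.
composite = [
    [["x1", "x2"], "euclidean", 1.0],
    [["x3"], "manhattan", 0.5],
]

def composite_distance(row_a, row_b, components):
    """Weighted sum of standard distances over the listed feature groups."""
    total = 0.0
    for features, name, weight in components:
        a = [row_a[f] for f in features]
        b = [row_b[f] for f in features]
        total += weight * STANDARD[name](a, b)
    return total

row_a = {"x1": 0.0, "x2": 0.0, "x3": 1.0}
row_b = {"x1": 3.0, "x2": 4.0, "x3": 5.0}
print(composite_distance(row_a, row_b, composite))  # 1.0 * 5.0 + 0.5 * 4.0 = 7.0
```

A list shaped like `composite` is what would be passed as distance=composite, provided the dataset contains the named columns.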
Returns:  out : DBSCANModel
A model containing a cluster label for each row in the input dataset. Also contains the indices of the core points, cluster boundary points, and noise points.
See also
Notes
 Our implementation of DBSCAN first computes the similarity graph on the input dataset, which can be a computationally intensive process. In the current implementation, some distances are substantially faster than others; in particular “euclidean”, “squared_euclidean”, “cosine”, and “transformed_dot_product” are quite fast, while composite distances can be slow.
 Any distance function in the Turi Create library may be used with DBSCAN, but the results may be poor for distances that violate the standard metric properties, i.e. symmetry, non-negativity, triangle inequality, and identity of indiscernibles. In particular, the DBSCAN algorithm is based on the concept of connecting high-density points that are close to each other into a single cluster, but the notion of close may be very counterintuitive if the chosen distance function is not a valid metric. The distances “euclidean”, “manhattan”, “jaccard”, and “levenshtein” will likely yield the best results.
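As a small illustration of the metric-property caveat, the following sketch searches a handful of points for a triangle-inequality violation. It is pure Python and independent of Turi Create; the helper name `find_triangle_violation` is hypothetical, and the two distance functions are reimplemented here only for the comparison.

```python
import itertools
import math

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))

def find_triangle_violation(dist, points, tol=1e-12):
    """Return the first (a, b, c) with d(a, c) > d(a, b) + d(b, c), else None."""
    for a, b, c in itertools.permutations(points, 3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return (a, b, c)
    return None

points = [(0.0,), (1.0,), (2.0,)]

# Squared Euclidean: d(0, 2) = 4 but d(0, 1) + d(1, 2) = 2, so a violation exists.
print(find_triangle_violation(squared_euclidean, points))
# ((0.0,), (1.0,), (2.0,))

# Plain Euclidean distance satisfies the triangle inequality on these points.
print(find_triangle_violation(euclidean, points))
# None
```

With a distance like this, a point can be "far" from another even though both are "close" to a point between them, which is one way the neighborhoods used by DBSCAN can behave counterintuitively.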
References
 Ester, M., et al. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. pp. 226-231.
 Wikipedia - DBSCAN
 Visualizing DBSCAN Clustering
Examples
>>> sf = turicreate.SFrame({
...     'x1': [0.6777, -9.391, 7.0385, 2.2657, 7.7864, -10.16, -8.162,
...            8.8817, -9.525, -9.153, 2.0860, 7.6619, 6.5511, 2.7020],
...     'x2': [5.6110, 8.5139, 5.3913, 5.4743, 8.3606, 7.8843, 2.7305,
...            5.1679, 6.7231, 3.7051, 1.7682, 7.4608, 3.1270, 6.5624]})
...
>>> model = turicreate.dbscan.create(sf, radius=4.25, min_core_neighbors=3)
>>> model.cluster_id.print_rows(15)
+--------+------------+----------+
| row_id | cluster_id |   type   |
+--------+------------+----------+
|   8    |     0      |   core   |
|   7    |     2      |   core   |
|   0    |     1      |   core   |
|   2    |     2      |   core   |
|   3    |     1      |   core   |
|   11   |     2      |   core   |
|   4    |     2      |   core   |
|   1    |     0      | boundary |
|   6    |     0      | boundary |
|   5    |     0      | boundary |
|   9    |     0      | boundary |
|   12   |     2      | boundary |
|   10   |     1      | boundary |
|   13   |     1      | boundary |
+--------+------------+----------+
[14 rows x 3 columns]