turicreate.kmeans.create

turicreate.kmeans.create(dataset, num_clusters=None, features=None, label=None, initial_centers=None, max_iterations=10, batch_size=None, verbose=True)

Create a k-means clustering model. The KmeansModel object contains the computed cluster centers and the cluster assignment for each instance in the input ‘dataset’.

Given a number of clusters, k-means alternates between assigning each point to its nearest cluster center and recomputing each center as the mean of the points assigned to it. If no points change cluster membership between iterations, the algorithm terminates.
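The loop described above can be sketched in plain Python. This is only an illustration of the textbook k-means iteration, not Turi Create's implementation; the helper names `assign`, `update`, and `lloyd` are hypothetical, and points are assumed to be tuples of floats.

```python
import math


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of its members; empty clusters keep their center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def lloyd(points, centers, max_iterations=10):
    """Alternate update and assignment steps until assignments stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iterations):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return centers, labels
```

In this sketch, `max_iterations=0` returns the initial centers together with the assignments to those centers.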

Parameters:

dataset : SFrame

Each row in the SFrame is an observation.

num_clusters : int

Number of clusters. This is the ‘k’ in k-means. Must be specified unless ‘initial_centers’ is provided.

features : list[str], optional

Names of feature columns to use in computing distances between observations and cluster centers. ‘None’ (the default) indicates that all columns should be used as features. Columns may be of the following types:

Numeric: values of numeric type integer or float.
Array: list of numeric (int or float) values. Each list element is treated as a distinct feature in the model.
Dict: dictionary of keys mapped to numeric values. Each unique key is treated as a distinct feature in the model.

Note that columns of type list are not supported. Convert them to array columns if all entries in the list are of numeric types.

label : str, optional

Name of the column to use as row labels in the k-means output. The values in this column must be integers or strings. If not specified, row numbers are used by default.

initial_centers : SFrame, optional

Initial centers to use when starting the K-means algorithm. If specified, this parameter overrides the num_clusters parameter. The ‘initial_centers’ SFrame must contain the same features used in the input ‘dataset’.

If not specified (the default), initial centers are chosen with the k-means++ algorithm, which spreads the starting centers apart rather than picking them uniformly at random.
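For readers unfamiliar with k-means++ seeding, a minimal sketch following Arthur and Vassilvitskii (2007) is shown here. It is not Turi Create's internal code; the function name `kmeans_pp_init` and its `seed` argument are hypothetical, and points are assumed to be tuples of floats.

```python
import math
import random


def kmeans_pp_init(points, k, seed=None):
    """Choose k starting centers with k-means++ style weighted sampling."""
    rng = random.Random(seed)
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Points far from all current centers are proportionally more likely.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

Because a point that coincides with an existing center has weight zero, it is not drawn again unless every point already has zero distance to some center.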

max_iterations : int, optional

The maximum number of iterations to run. A warning is printed if the algorithm does not converge within max_iterations iterations. If set to 0, the model returns clusters defined by the initial centers and assignments to those centers.

batch_size : int, optional

Number of randomly chosen data points to use in each iteration. If ‘None’ (the default) or greater than the number of rows in ‘dataset’, this parameter is ignored: all rows of ‘dataset’ are used in each iteration, and model training terminates once point assignments stop changing or max_iterations is reached.
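To give a sense of what a batched iteration can look like, here is a sketch of one mini-batch step using the per-center learning rate from Sculley (2010). Whether Turi Create applies exactly this update rule is an assumption; the function name `minibatch_step` and the `counts` bookkeeping list are hypothetical.

```python
import math
import random


def minibatch_step(points, centers, counts, batch_size, rng):
    """Run one mini-batch update; `counts` is modified in place."""
    # Sample the batch without replacement.
    batch = rng.sample(points, min(batch_size, len(points)))
    # Cache the nearest center for each sampled point before moving any center.
    nearest = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
               for p in batch]
    centers = [list(c) for c in centers]
    for p, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate shrinks over time
        centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return [tuple(c) for c in centers]
```

A caller would initialize `counts` to a list of zeros, one per center, and call `minibatch_step` once per iteration with a shared `random.Random` instance.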

verbose : bool, optional

If True, print model training progress to the screen.

Returns:

out : KmeansModel

A model object containing a cluster id for each observation and the centers of the clusters.

See also

KmeansModel

Notes

Integer features in the ‘dataset’ or ‘initial_centers’ inputs are converted internally to float type, and the corresponding features in the output centers are float-typed.
K-means distances are sensitive to feature scale, so it is often important to standardize the features to a common scale before training. This function does not standardize automatically.
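One common way to put features on the same scale is z-score standardization. The sketch below works on a plain dictionary of columns before it is turned into an SFrame; the helper name `standardize` is hypothetical, and constant columns are mapped to zeros by assumption.

```python
import statistics


def standardize(columns):
    """Return a copy of `columns` (name -> list of floats) with z-scored values."""
    out = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # population standard deviation
        # A constant column has zero spread; map it to zeros instead of dividing by 0.
        out[name] = [(v - mean) / std if std > 0 else 0.0 for v in values]
    return out
```

The returned dictionary has the same shape as the one passed to turicreate.SFrame in the Examples section.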

References

Wikipedia - k-means clustering
Arthur, D. and Vassilvitskii, S. (2007) k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 1027-1035.
Elkan, C. (2003) Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning, Volume 3, pp. 147-153.
Sculley, D. (2010) Web Scale K-Means Clustering. In Proceedings of the 19th International Conference on World Wide Web. pp. 1177-1178.

Examples

>>> sf = turicreate.SFrame({
...     'x1': [0.6777, -9.391, 7.0385, 2.2657, 7.7864, -10.16, -8.162,
...            8.8817, -9.525, -9.153, 2.0860, 7.6619, 6.5511, 2.7020],
...     'x2': [5.6110, 8.5139, 5.3913, 5.4743, 8.3606, 7.8843, 2.7305,
...            5.1679, 6.7231, 3.7051, 1.7682, 7.4608, 3.1270, 6.5624]})
...
>>> model = turicreate.kmeans.create(sf, num_clusters=3)