Turi Create
4.0
|
#include <toolkits/clustering/kmeans.hpp>
Public Member Functions | |
kmeans_model () | |
~kmeans_model () | |
void | init_options (const std::map< std::string, flexible_type > &_opts) override |
void | train (const sframe &X, const sframe &init_centers, std::string method, bool allow_categorical=false) |
void | train (const sframe &X, const sframe &init_centers, std::string method, const std::vector< flexible_type > &row_labels, const std::string row_label_name, bool allow_categorical=false) |
sframe | predict (const sframe &X) |
sframe | get_cluster_assignments () |
sframe | get_cluster_info () |
size_t | get_version () const override |
void | save_impl (turi::oarchive &oarc) const override |
void | load_version (turi::iarchive &iarc, size_t version) override |
std::vector< std::string > | list_fields () |
const variant_type & | get_value_from_state (std::string key) |
const std::map< std::string, flexible_type > & | get_current_options () const |
std::map< std::string, flexible_type > | get_default_options () const |
const flexible_type & | get_option_value (const std::string &name) const |
const std::map< std::string, variant_type > & | get_state () const |
bool | is_trained () const |
void | set_options (const std::map< std::string, flexible_type > &_options) |
void | add_or_update_state (const std::map< std::string, variant_type > &dict) |
const std::vector< option_handling::option_info > & | get_option_info () const |
virtual const char * | name ()=0 |
virtual const std::string & | uid ()=0 |
void | save_to_url (const std::string &url, const variant_map_type &side_data={}) |
void | save_model_to_data (std::ostream &out) |
const std::map< std::string, std::vector< std::string > > & | list_functions () |
const std::vector< std::string > & | list_get_properties () |
const std::vector< std::string > & | list_set_properties () |
variant_type | call_function (const std::string &function, variant_map_type argument) |
variant_type | get_property (const std::string &property) |
variant_type | set_property (const std::string &property, variant_map_type argument) |
const std::string & | get_docstring (const std::string &symbol) |
virtual void | perform_registration () |
Protected Member Functions | |
void | register_function (std::string fnname, const std::vector< std::string > &arguments, impl_fn fn) |
void | register_defaults (const std::string &fnname, const variant_map_type &arguments) |
void | register_setter (const std::string &propname, impl_fn setfn) |
void | register_getter (const std::string &propname, impl_fn getfn) |
void | register_docstring (const std::pair< std::string, std::string > &fnname_docstring) |
Protected Attributes | |
std::map< std::string, variant_type > | state |
Kmeans clustering model. By default, the model uses the KMeans++ algorithm to choose initial cluster centers, although users may also pass custom initial centers. The implementation follows the algorithm of Elkan (2003), which takes advantage of the triangle inequality to reduce the number of distance computations. In addition to storing the n x 1 vectors of cluster assignments and distances from each point to its assigned cluster center (necessary for any Kmeans implementation), the Elkan algorithm also requires computation and storage of all pairwise distances between cluster centers.
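The triangle-inequality argument can be illustrated with a small standalone sketch. It is not taken from kmeans.hpp: the names `point`, `euclidean`, and `can_skip_center` are hypothetical, and the check shown is the standard pruning lemma from Elkan's paper, which may differ in detail from what the class does internally.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense observation; not a Turi Create type.
using point = std::vector<double>;

// Euclidean distance between two points of equal dimension.
double euclidean(const point& a, const point& b) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return std::sqrt(sum);
}

// Returns true when center j cannot be closer to a point than the point's
// currently assigned center a. By the triangle inequality,
//   d(x, c_j) >= d(c_a, c_j) - d(x, c_a) >= d(c_a, c_j) - upper_bound,
// so d(c_a, c_j) >= 2 * upper_bound implies d(x, c_j) >= upper_bound,
// and upper_bound >= d(x, c_a) by definition of the bound.
bool can_skip_center(double upper_bound, double center_dist_a_j) {
  return center_dist_a_j >= 2.0 * upper_bound;
}
```

When `can_skip_center` returns true, the exact distance from the point to center j never has to be computed in that iteration, which is where the savings over a naive scan come from.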
The Kmeans model contains the following private data objects:
mldata: The data, in ml_data_2 form. Each row is an observation. After the model is trained, this member is no longer needed, so it is not serialized when the model is saved.
num_examples: Number of points.
assignments: Cluster assignment for each data point.
clusters: Vector of cluster structs, each of which contains a center vector, a count of assigned points, and a mutex lock, so the cluster can safely be updated in parallel.
num_clusters: Number of clusters, set by the user.
max_iterations: Maximum iterations of the main Kmeans loop, excluding initial selection of the centers.
upper_bounds: For every point, an upper bound on the distance from the point to its currently assigned center. Whenever the distance between a point and its assigned center is computed exactly, then this bound is tight, but the bounds also must be adjusted at the end of each iteration to account for the movement of the assigned center. Despite this adjustment, the upper bound often remains small enough to avoid computing the exact distance to other candidate centers.
center_dists: Exact distance between all pairs of cluster centers. Computing this K x K matrix allows us to use the triangle inequality to avoid computing all n x K point-to-center distances in every iteration.
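As a rough illustration of how `upper_bounds` and `center_dists` could be maintained, the sketch below builds the K x K center distance matrix and loosens each point's bound by the distance its assigned center moved. The function names and the plain `std::vector` containers are assumptions made for this example; the private members of kmeans_model are not shown on this page and may be organized differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense observation; not a Turi Create type.
using point = std::vector<double>;

// Euclidean distance between two points of equal dimension.
double euclidean(const point& a, const point& b) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return std::sqrt(sum);
}

// Exact K x K matrix of distances between all pairs of centers.
std::vector<std::vector<double>> pairwise_center_dists(
    const std::vector<point>& centers) {
  const std::size_t k = centers.size();
  std::vector<std::vector<double>> dists(k, std::vector<double>(k, 0.0));
  for (std::size_t i = 0; i < k; ++i) {
    for (std::size_t j = i + 1; j < k; ++j) {
      dists[i][j] = dists[j][i] = euclidean(centers[i], centers[j]);
    }
  }
  return dists;
}

// After the centers move, add the distance each center moved to the bound of
// every point assigned to it. The result is still a valid upper bound because
// d(x, new_center) <= d(x, old_center) + d(old_center, new_center).
void adjust_upper_bounds(std::vector<double>& upper_bounds,
                         const std::vector<std::size_t>& assignments,
                         const std::vector<point>& old_centers,
                         const std::vector<point>& new_centers) {
  assert(upper_bounds.size() == assignments.size());
  assert(old_centers.size() == new_centers.size());
  std::vector<double> shift(old_centers.size());
  for (std::size_t c = 0; c < old_centers.size(); ++c) {
    shift[c] = euclidean(old_centers[c], new_centers[c]);
  }
  for (std::size_t i = 0; i < upper_bounds.size(); ++i) {
    assert(assignments[i] < shift.size());
    upper_bounds[i] += shift[assignments[i]];
  }
}
```

A bound adjusted this way is no longer tight, but it stays cheap to maintain: one distance per center per iteration rather than one per point.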
The Kmeans model stores the following members in its state object, which is exposed to the Python API:
options: Option manager which keeps track of default options, current options, option ranges, type etc. This must be initialized only once in the init_options() function.
training_time: Run time of the algorithm, in seconds.
training_iterations: Number of iterations of the main Kmeans loop. If the algorithm does not converge, this is equal to 'max_iterations'.
In addition, the following public methods return information from a trained Kmeans model:
get_cluster_assignments: Returns an SFrame with the fields "row_id", "cluster_id", and "distance" for each input data point. The "cluster_id" field (integer) contains the cluster assignment of the data point, and the "distance" field (float) contains the Euclidean distance of the data point to the assigned cluster's center.
get_cluster_info: Returns an SFrame with metadata about each cluster. For each cluster, the output SFrame includes the features describing the center, the number of points assigned to the cluster, and the within-cluster sum of squared Euclidean distances from points to the center.
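The per-cluster quantities reported by get_cluster_info can be sketched without any SFrame machinery. The code below is an illustrative assumption, not the library's implementation: `cluster_summary` and `summarize_clusters` are hypothetical names, and plain vectors stand in for the model's data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense observation; not a Turi Create type.
using point = std::vector<double>;

// Per-cluster metadata: number of assigned points and the within-cluster
// sum of squared Euclidean distances to the center.
struct cluster_summary {
  std::size_t size = 0;
  double sum_squared_distance = 0.0;
};

// Squared Euclidean distance between two points of equal dimension.
double squared_euclidean(const point& a, const point& b) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// Accumulates the count and sum of squared distances for every cluster.
std::vector<cluster_summary> summarize_clusters(
    const std::vector<point>& points,
    const std::vector<point>& centers,
    const std::vector<std::size_t>& assignments) {
  assert(points.size() == assignments.size());
  std::vector<cluster_summary> info(centers.size());
  for (std::size_t i = 0; i < points.size(); ++i) {
    const std::size_t c = assignments[i];
    assert(c < centers.size());
    info[c].size += 1;
    info[c].sum_squared_distance += squared_euclidean(points[i], centers[c]);
  }
  return info;
}
```

For example, two points at (0, 0) and (2, 0) assigned to a center at (1, 0) give a cluster of size 2 with a sum of squared distances of 2.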
Definition at line 179 of file kmeans.hpp.
turi::kmeans::kmeans_model::kmeans_model | ( | ) |
Constructor
turi::kmeans::kmeans_model::~kmeans_model | ( | ) |
Destructor.
|
inherited |
Append to the key-value store of the model.
[in] | dict | Options (Key-Value pairs) to set |
|
inherited |
Calls a user defined function.
sframe turi::kmeans::kmeans_model::get_cluster_assignments | ( | ) |
Write cluster assignments to an SFrame and return it. Also records the row index of each point and the distance from the point to its assigned cluster center.
sframe turi::kmeans::kmeans_model::get_cluster_info | ( | ) |
Write cluster metadata to an SFrame and return. Records the features for each cluster, the count of assigned points, and the within-cluster sum of squared distances.
|
inherited |
Get current options.
Interfaces with the get_current_options function on the Python side.
|
inherited |
Get default options.
Interfaces with the get_default_options function on the Python side.
|
inherited |
Returns the toolkit documentation for a function or property.
|
inherited |
Returns the option information struct for each of the set parameters.
|
inherited |
Returns the value of an option. Throws an error if the option does not exist.
[in] | name | Name of the option to get. |
|
inherited |
Reads a property.
|
inherited |
Get model.
|
inherited |
Returns the value of a particular key from the state.
On the Python side, this is accessed through the get() function or the [] operator.
|
inline override virtual |
Get the model version number.
GLC version | Kmeans version |
<= 1.3 | 1 |
1.4 | 2 |
1.5 | 3 |
1.9 | 4 |
Reimplemented from turi::model_base.
Definition at line 402 of file kmeans.hpp.
|
override virtual |
Set the model options. The option manager should throw errors if the options do not satisfy the option manager's conditions.
Reimplemented from turi::ml_model_base.
|
inherited |
Returns true if the model has been trained.
|
inherited |
Lists all the keys accessible in the "model" map.
This is the function that list_fields should call in Python.
|
inherited |
Lists all the registered functions. Returns a map of function name to array of argument names for the function.
|
inherited |
Lists all the gettable properties of the class.
|
inherited |
Lists all the settable properties of the class.
|
override virtual |
De-serialize the model.
Reimplemented from turi::model_base.
|
pure virtual inherited |
Returns the name of the toolkit class, as exposed to client code. For example, the Python proxy for this instance will have a type with this name.
Note: this function is typically overridden using the BEGIN_CLASS_MEMBER_REGISTRATION macro.
|
virtual inherited |
Declare the base registration function. This class has to be handled specially; the macros don't work here due to the override declarations.
Reimplemented in turi::model_proxy.
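sframe turi::kmeans::kmeans_model::predict | ( | const sframe & | X | ) |
Predict the cluster assignment for new data, according to a trained Kmeans model.
X | Input dataset. |
Returns an SFrame of cluster assignments for the rows of X (see get_cluster_assignments).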
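Conceptually, prediction assigns each new observation to its nearest trained center. The sketch below shows that idea with plain vectors; `assignment` and `assign_to_nearest` are hypothetical names, and the real predict method operates on SFrames and may use a different search strategy.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a dense observation; not a Turi Create type.
using point = std::vector<double>;

// Cluster id and Euclidean distance to that cluster's center for one row.
struct assignment {
  std::size_t cluster_id;
  double distance;
};

// Assigns every point to the closest center by exhaustive search.
std::vector<assignment> assign_to_nearest(const std::vector<point>& points,
                                          const std::vector<point>& centers) {
  assert(!centers.empty());
  std::vector<assignment> out;
  out.reserve(points.size());
  for (const point& x : points) {
    std::size_t best = 0;
    double best_sq = std::numeric_limits<double>::infinity();
    for (std::size_t c = 0; c < centers.size(); ++c) {
      assert(centers[c].size() == x.size());
      double sq = 0.0;
      for (std::size_t d = 0; d < x.size(); ++d) {
        const double diff = x[d] - centers[c][d];
        sq += diff * diff;
      }
      if (sq < best_sq) {  // ties keep the lowest cluster id
        best_sq = sq;
        best = c;
      }
    }
    out.push_back({best, std::sqrt(best_sq)});
  }
  return out;
}
```

With centers at (0, 0) and (10, 0), the point (3, 4) is assigned to cluster 0 at distance 5.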
|
protected inherited |
Registers default argument values
|
protected inherited |
Adds a docstring for the specified function or property name.
|
protected inherited |
Adds a function with the specified name, and argument list.
|
protected inherited |
Adds a property getter with the specified name.
|
protected inherited |
Adds a property setter with the specified name.
|
override virtual |
Serialize the model.
Reimplemented from turi::model_base.
|
inherited |
Save a toolkit class to a data stream.
|
inherited |
Save a toolkit class to disk.
url | The destination url to store the class. |
side_data | Any additional side information. |
|
inherited |
Set one of the options in the algorithm.
The values are checked against the requirements given by the option instance.
[in] | name | Name of the option. |
[in] | value | Value for the option. |
|
inherited |
Sets a property. The new value of the property should appear in the argument map under the key "value".
void turi::kmeans::kmeans_model::train | ( | const sframe & | X, |
const sframe & | init_centers, | ||
std::string | method, | ||
bool | allow_categorical = false |
||
) |
Train the kmeans model, without row labels.
X | Input data. Each row is an observation. |
init_centers | Custom initial centers provided by the user. |
method | Indicate if Lloyd's algorithm should be used instead of Elkan's. Lloyd's is generally substantially slower, but uses very little memory, while Elkan's requires storage of all pairwise cluster center distances. |
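For comparison with Elkan's method, one iteration of Lloyd's algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind train: `lloyd_iteration` is a hypothetical name, data is held in plain vectors, and empty clusters simply keep their previous center.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a dense observation; not a Turi Create type.
using point = std::vector<double>;

// Runs one Lloyd iteration: reassign every point to its nearest center
// (computing all n x K distances), then move each center to the mean of its
// assigned points. Returns true if any assignment changed.
bool lloyd_iteration(const std::vector<point>& points,
                     std::vector<point>& centers,
                     std::vector<std::size_t>& assignments) {
  assert(!centers.empty());
  assert(points.size() == assignments.size());
  const std::size_t k = centers.size();
  const std::size_t dim = centers[0].size();
  bool changed = false;

  // Assignment step.
  for (std::size_t i = 0; i < points.size(); ++i) {
    assert(points[i].size() == dim);
    std::size_t best = 0;
    double best_sq = std::numeric_limits<double>::infinity();
    for (std::size_t c = 0; c < k; ++c) {
      double sq = 0.0;
      for (std::size_t d = 0; d < dim; ++d) {
        const double diff = points[i][d] - centers[c][d];
        sq += diff * diff;
      }
      if (sq < best_sq) {
        best_sq = sq;
        best = c;
      }
    }
    if (assignments[i] != best) {
      assignments[i] = best;
      changed = true;
    }
  }

  // Update step.
  std::vector<point> sums(k, point(dim, 0.0));
  std::vector<std::size_t> counts(k, 0);
  for (std::size_t i = 0; i < points.size(); ++i) {
    const std::size_t c = assignments[i];
    counts[c] += 1;
    for (std::size_t d = 0; d < dim; ++d) sums[c][d] += points[i][d];
  }
  for (std::size_t c = 0; c < k; ++c) {
    if (counts[c] == 0) continue;  // empty cluster: keep the old center
    for (std::size_t d = 0; d < dim; ++d) {
      centers[c][d] = sums[c][d] / static_cast<double>(counts[c]);
    }
  }
  return changed;
}
```

Calling `lloyd_iteration` repeatedly until it returns false, or until an iteration cap is reached, gives the basic training loop. The only state it needs beyond the data is the centers and assignments, which is why Lloyd's method uses so little memory.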
void turi::kmeans::kmeans_model::train | ( | const sframe & | X, |
const sframe & | init_centers, | ||
std::string | method, | ||
const std::vector< flexible_type > & | row_labels, | ||
const std::string | row_label_name, | ||
bool | allow_categorical = false |
||
) |
Train the kmeans model, with row labels.
X | Input data. Each row is an observation. |
init_centers | Custom initial centers provided by the user. |
method | Indicate if Lloyd's algorithm should be used instead of Elkan's. Lloyd's is generally substantially slower, but uses very little memory, while Elkan's requires storage of all pairwise cluster center distances. |
row_labels | Flexible type row labels. |
row_label_name | Name of the row label column. |
|
pure virtual inherited |
Returns a unique identifier for the toolkit class. It can be any unique ID. The UID is only used at runtime (to determine the concrete type of an arbitrary model_base instance) and is never stored.
Note: this function is typically overridden using the BEGIN_CLASS_MEMBER_REGISTRATION macro.
Implemented in turi::model_proxy.
|
protected inherited |
Model state that is exposed to the Python API.
Definition at line 206 of file ml_model.hpp.