Turi Create  4.0
turi::kmeans::kmeans_model Class Reference (abstract)

#include <toolkits/clustering/kmeans.hpp>

Public Member Functions

 kmeans_model ()
 
 ~kmeans_model ()
 
void init_options (const std::map< std::string, flexible_type > &_opts) override
 
void train (const sframe &X, const sframe &init_centers, std::string method, bool allow_categorical=false)
 
void train (const sframe &X, const sframe &init_centers, std::string method, const std::vector< flexible_type > &row_labels, const std::string row_label_name, bool allow_categorical=false)
 
sframe predict (const sframe &X)
 
sframe get_cluster_assignments ()
 
sframe get_cluster_info ()
 
size_t get_version () const override
 
void save_impl (turi::oarchive &oarc) const override
 
void load_version (turi::iarchive &iarc, size_t version) override
 
std::vector< std::string > list_fields ()
 
const variant_type & get_value_from_state (std::string key)
 
const std::map< std::string, flexible_type > & get_current_options () const
 
std::map< std::string, flexible_type > get_default_options () const
 
const flexible_type & get_option_value (const std::string &name) const
 
const std::map< std::string, variant_type > & get_state () const
 
bool is_trained () const
 
void set_options (const std::map< std::string, flexible_type > &_options)
 
void add_or_update_state (const std::map< std::string, variant_type > &dict)
 
const std::vector< option_handling::option_info > & get_option_info () const
 
virtual const char * name ()=0
 
virtual const std::string & uid ()=0
 
void save_to_url (const std::string &url, const variant_map_type &side_data={})
 
void save_model_to_data (std::ostream &out)
 
const std::map< std::string, std::vector< std::string > > & list_functions ()
 
const std::vector< std::string > & list_get_properties ()
 
const std::vector< std::string > & list_set_properties ()
 
variant_type call_function (const std::string &function, variant_map_type argument)
 
variant_type get_property (const std::string &property)
 
variant_type set_property (const std::string &property, variant_map_type argument)
 
const std::string & get_docstring (const std::string &symbol)
 
virtual void perform_registration ()
 

Protected Member Functions

void register_function (std::string fnname, const std::vector< std::string > &arguments, impl_fn fn)
 
void register_defaults (const std::string &fnname, const variant_map_type &arguments)
 
void register_setter (const std::string &propname, impl_fn setfn)
 
void register_getter (const std::string &propname, impl_fn getfn)
 
void register_docstring (const std::pair< std::string, std::string > &fnname_docstring)
 

Protected Attributes

std::map< std::string, variant_type > state
 

Detailed Description


Kmeans clustering model

By default, the model uses the KMeans++ algorithm to choose initial cluster centers, although users may also pass custom initial centers. This implementation uses the algorithm of Elkan (2003), which takes advantage of the triangle inequality to reduce the number of distance computations. In addition to storing the n x 1 vectors of cluster assignments and distances from each point to its assigned cluster center (necessary for any Kmeans implementation), the Elkan algorithm also requires computation and storage of all pairwise distances between cluster centers.
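The KMeans++ seeding mentioned above can be illustrated with a small standalone sketch. This is not the Turi Create code: the names point, squared_distance, and kmeanspp_seed are hypothetical, and the sketch assumes dense in-memory points with at least k distinct rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using point = std::vector<double>;  // hypothetical dense row type

// Squared Euclidean distance between two points of equal dimension.
double squared_distance(const point& a, const point& b) {
  assert(a.size() == b.size());
  double total = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j) {
    const double diff = a[j] - b[j];
    total += diff * diff;
  }
  return total;
}

// Hypothetical helper: returns the row indices of k seed points.
// Assumes the data contains at least k distinct rows, so the sampling
// weights never sum to zero.
std::vector<std::size_t> kmeanspp_seed(const std::vector<point>& data,
                                       std::size_t k, unsigned seed) {
  assert(k >= 1 && k <= data.size());
  std::mt19937 rng(seed);

  // First center: chosen uniformly at random.
  std::uniform_int_distribution<std::size_t> uniform(0, data.size() - 1);
  std::vector<std::size_t> chosen{uniform(rng)};

  // Squared distance from each point to its nearest chosen center so far.
  std::vector<double> nearest(data.size(),
                              std::numeric_limits<double>::max());

  while (chosen.size() < k) {
    // Only the most recently added center can shrink a point's distance.
    const point& latest = data[chosen.back()];
    for (std::size_t i = 0; i < data.size(); ++i) {
      nearest[i] = std::min(nearest[i], squared_distance(data[i], latest));
    }
    // Next center: sampled with probability proportional to nearest[i].
    std::discrete_distribution<std::size_t> weighted(nearest.begin(),
                                                     nearest.end());
    chosen.push_back(weighted(rng));
  }
  return chosen;
}
```

A point that is already a center has weight zero, so far-away points are favored and the seeds tend to be spread across the data.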

Note
This implementation does not currently use the second lemma from the Elkan (2003) paper, which further reduces the number of exact distance computations by storing the lower bound on the distance between every point and every cluster center. This n x K matrix is generally too big to store in memory and too slow to write to as an SFrame.

The Kmeans model contains the following private data objects:

mldata: The data, in ml_data_2 form. Each row is an observation. After the model is trained, this member is no longer needed, so it is not serialized when the model is saved.

num_examples: Number of points.

assignments: Cluster assignment for each data point.

clusters: Vector of cluster structs, each of which contains a center vector, a count of assigned points, and a mutex lock, so the cluster can safely be updated in parallel.

num_clusters: Number of clusters, set by the user.

max_iterations: Maximum iterations of the main Kmeans loop, excluding initial selection of the centers.

upper_bounds: For every point, an upper bound on the distance from the point to its currently assigned center. Whenever the distance between a point and its assigned center is computed exactly, then this bound is tight, but the bounds also must be adjusted at the end of each iteration to account for the movement of the assigned center. Despite this adjustment, the upper bound often remains small enough to avoid computing the exact distance to other candidate centers.

center_dists: Exact distance between all pairs of cluster centers. Computing this K x K matrix allows us to use the triangle inequality to avoid computing all n x K point-to-center distances in every iteration.
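How upper_bounds and center_dists work together can be sketched as follows. This is a simplified standalone illustration of the first Elkan lemma only, not the Turi Create code: elkan_state, elkan_assign, and shift_upper_bounds are hypothetical names, and adding the distance moved by the assigned center to each bound is the adjustment from Elkan's paper, assumed here rather than taken from this class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using point = std::vector<double>;  // hypothetical dense row type

// Euclidean distance between two points of equal dimension.
double euclidean(const point& a, const point& b) {
  assert(a.size() == b.size());
  double total = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j) {
    const double diff = a[j] - b[j];
    total += diff * diff;
  }
  return std::sqrt(total);
}

// Hypothetical container mirroring two of the members described above.
struct elkan_state {
  std::vector<std::size_t> assignments;  // assigned center per point
  std::vector<double> upper_bounds;      // bound on distance to that center
};

// One assignment pass. Returns the number of exact point-to-center
// distances that had to be computed.
std::size_t elkan_assign(const std::vector<point>& data,
                         const std::vector<point>& centers,
                         elkan_state& st) {
  const std::size_t K = centers.size();
  // K x K matrix of exact distances between centers.
  std::vector<std::vector<double>> center_dists(K, std::vector<double>(K, 0.0));
  for (std::size_t a = 0; a < K; ++a) {
    for (std::size_t b = a + 1; b < K; ++b) {
      center_dists[a][b] = center_dists[b][a] = euclidean(centers[a], centers[b]);
    }
  }

  std::size_t exact = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    std::size_t a = st.assignments[i];
    double u = st.upper_bounds[i];
    bool tight = false;  // true once u is an exact distance
    for (std::size_t c = 0; c < K; ++c) {
      if (c == a) continue;
      // Triangle inequality: if d(a, c) >= 2u, then c cannot be closer.
      if (center_dists[a][c] >= 2.0 * u) continue;
      if (!tight) {
        // Tighten the bound first; this alone may rule out c.
        u = euclidean(data[i], centers[a]);
        ++exact;
        tight = true;
        if (center_dists[a][c] >= 2.0 * u) continue;
      }
      const double d = euclidean(data[i], centers[c]);
      ++exact;
      if (d < u) {
        a = c;
        u = d;
      }
    }
    st.assignments[i] = a;
    st.upper_bounds[i] = u;
  }
  return exact;
}

// After centers move, loosen each bound by the distance its center moved.
void shift_upper_bounds(elkan_state& st, const std::vector<double>& center_shift) {
  for (std::size_t i = 0; i < st.upper_bounds.size(); ++i) {
    st.upper_bounds[i] += center_shift[st.assignments[i]];
  }
}
```

With well-separated centers, a second pass over unchanged centers computes no exact distances at all, which is the saving the triangle inequality provides.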

The Kmeans model stores the following members in its state object, which is exposed to the Python API:

options: Option manager which keeps track of default options, current options, option ranges, type etc. This must be initialized only once in the init_options() function.

training_time: Run time of the algorithm, in seconds.

training_iterations: Number of iterations of the main Kmeans loop. If the algorithm does not converge, this is equal to 'max_iterations'.

In addition, the following public methods return information from a trained Kmeans model:

get_cluster_assignments: Returns an SFrame with the fields "row_id", "cluster_id", and "distance" for each input data point. The "cluster_id" field (integer) contains the cluster assignment of the data point, and the "distance" field (float) contains the Euclidean distance of the data point to the assigned cluster's center.

get_cluster_info: Returns an SFrame with metadata about each cluster. For each cluster, the output SFrame includes the features describing the center, the number of points assigned to the cluster, and the within-cluster sum of squared Euclidean distances from points to the center.
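The relationship between the two outputs can be sketched in standalone C++: the per-cluster count and within-cluster sum of squared distances follow directly from the "cluster_id" and "distance" columns. The struct cluster_summary and function summarize_clusters are hypothetical names, and plain vectors stand in for SFrame columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-cluster summary (the center features are omitted).
struct cluster_summary {
  std::size_t size = 0;               // number of points assigned
  double sum_squared_distance = 0.0;  // within-cluster sum of squares
};

// Aggregates assignment columns into one summary per cluster.
std::vector<cluster_summary> summarize_clusters(
    const std::vector<std::size_t>& cluster_id,
    const std::vector<double>& distance,
    std::size_t num_clusters) {
  assert(cluster_id.size() == distance.size());
  std::vector<cluster_summary> out(num_clusters);
  for (std::size_t i = 0; i < cluster_id.size(); ++i) {
    assert(cluster_id[i] < num_clusters);
    cluster_summary& c = out[cluster_id[i]];
    c.size += 1;
    // "distance" is Euclidean, so square it before accumulating.
    c.sum_squared_distance += distance[i] * distance[i];
  }
  return out;
}
```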

Definition at line 179 of file kmeans.hpp.

Constructor & Destructor Documentation

◆ kmeans_model()

turi::kmeans::kmeans_model::kmeans_model ( )

Constructor

◆ ~kmeans_model()

turi::kmeans::kmeans_model::~kmeans_model ( )

Destructor.

Member Function Documentation

◆ add_or_update_state()

void turi::ml_model_base::add_or_update_state ( const std::map< std::string, variant_type > &  dict)
inherited

Append the key value store of the model.

Parameters
[in] dict  Options (key-value pairs) to set

◆ call_function()

variant_type turi::model_base::call_function ( const std::string &  function,
variant_map_type  argument 
)
inherited

Calls a user defined function.

◆ get_cluster_assignments()

sframe turi::kmeans::kmeans_model::get_cluster_assignments ( )

Write cluster assignments to an SFrame and return it. Also records the row index of the point and the distance from the point to its assigned cluster.

Returns
out SFrame with row index, cluster assignment, and distance to assigned cluster center for each input data point.

◆ get_cluster_info()

sframe turi::kmeans::kmeans_model::get_cluster_info ( )

Write cluster metadata to an SFrame and return. Records the features for each cluster, the count of assigned points, and the within-cluster sum of squared distances.

Returns
out SFrame with metadata about each cluster, including the center vector, count of assigned points, and within-cluster sum of squared Euclidean distances.

◆ get_current_options()

const std::map<std::string, flexible_type>& turi::ml_model_base::get_current_options ( ) const
inherited

Get current options.

Returns
Dictionary containing current options.

Python side interface

Interfaces with the get_current_options function on the Python side.

◆ get_default_options()

std::map<std::string, flexible_type> turi::ml_model_base::get_default_options ( ) const
inherited

Get default options.

Returns
Dictionary with default options.

Python side interface

Interfaces with the get_default_options function on the Python side.

◆ get_docstring()

const std::string& turi::model_base::get_docstring ( const std::string &  symbol)
inherited

Returns the toolkit documentation for a function or property.

◆ get_option_info()

const std::vector<option_handling::option_info>& turi::ml_model_base::get_option_info ( ) const
inherited

Returns the option information struct for each of the set parameters.

◆ get_option_value()

const flexible_type& turi::ml_model_base::get_option_value ( const std::string &  name) const
inherited

Returns the value of an option. Throws an error if the option does not exist.

Parameters
[in] name  Name of the option to get.

◆ get_property()

variant_type turi::model_base::get_property ( const std::string &  property)
inherited

Reads a property.

◆ get_state()

const std::map<std::string, variant_type>& turi::ml_model_base::get_state ( ) const
inherited

Get the model state map.

Returns
Model map.

◆ get_value_from_state()

const variant_type& turi::ml_model_base::get_value_from_state ( std::string  key)
inherited

Returns the value of a particular key from the state.

Returns
Value of a key. See model_base for details.

Python side interface

From the Python side, this is accessed through the get() function or the [] operator.

◆ get_version()

size_t turi::kmeans::kmeans_model::get_version ( ) const
inline override virtual

Get the model version number.

Version map

GLC version -> Kmeans version
<= 1.3 -> 1
1.4 -> 2
1.5 -> 3
1.9 -> 4

Reimplemented from turi::model_base.

Definition at line 402 of file kmeans.hpp.

◆ init_options()

void turi::kmeans::kmeans_model::init_options ( const std::map< std::string, flexible_type > &  _opts)
override virtual

Set the model options. The option manager should throw errors if the options do not satisfy the option manager's conditions.

Reimplemented from turi::ml_model_base.

◆ is_trained()

bool turi::ml_model_base::is_trained ( ) const
inherited

Check whether this model has been trained.

Returns
True if already trained.

◆ list_fields()

std::vector<std::string> turi::ml_model_base::list_fields ( )
inherited

Methods with already meaningful default implementations.

Lists all the keys accessible in the "model" map.

Returns
List of keys in the model map. See model_base for details.

Python side interface

This is the function that list_fields should call on the Python side.

◆ list_functions()

const std::map<std::string, std::vector<std::string> >& turi::model_base::list_functions ( )
inherited

Lists all the registered functions. Returns a map of function name to array of argument names for the function.

◆ list_get_properties()

const std::vector<std::string>& turi::model_base::list_get_properties ( )
inherited

Lists all the get-table properties of the class.

◆ list_set_properties()

const std::vector<std::string>& turi::model_base::list_set_properties ( )
inherited

Lists all the set-table properties of the class.

◆ load_version()

void turi::kmeans::kmeans_model::load_version ( turi::iarchive &  iarc,
size_t  version 
)
override virtual

De-serialize the model.

Reimplemented from turi::model_base.

◆ name()

virtual const char* turi::model_base::name ( )
pure virtual inherited

Returns the name of the toolkit class, as exposed to client code. For example, the Python proxy for this instance will have a type with this name.

Note: this function is typically overridden using the BEGIN_CLASS_MEMBER_REGISTRATION macro.

◆ perform_registration()

virtual void turi::model_base::perform_registration ( )
virtual inherited

Declare the base registration function. This class has to be handled specially; the macros don't work here due to the override declarations.

Reimplemented in turi::model_proxy.

◆ predict()

sframe turi::kmeans::kmeans_model::predict ( const sframe &  X)

Predict the cluster assignment for new data, according to a trained Kmeans model.

Parameters
X  Input dataset.
Returns
out SFrame with row index, cluster assignment and distance to assigned cluster center (similar to get_cluster_assignments).
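Conceptually, prediction assigns each new row to its nearest trained center. The standalone sketch below illustrates that idea with a brute-force search. It is not the Turi Create code: the struct assignment and function assign_to_nearest are hypothetical, and plain vectors stand in for SFrames.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using point = std::vector<double>;  // hypothetical dense row type

// Hypothetical output row, mirroring the fields described above.
struct assignment {
  std::size_t row_id;
  std::size_t cluster_id;
  double distance;  // Euclidean distance to the assigned center
};

// Euclidean distance between two points of equal dimension.
double euclidean(const point& a, const point& b) {
  assert(a.size() == b.size());
  double total = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j) {
    const double diff = a[j] - b[j];
    total += diff * diff;
  }
  return std::sqrt(total);
}

// Assigns every row to its nearest center; ties go to the lower index.
std::vector<assignment> assign_to_nearest(const std::vector<point>& data,
                                          const std::vector<point>& centers) {
  assert(!centers.empty());
  std::vector<assignment> out;
  out.reserve(data.size());
  for (std::size_t i = 0; i < data.size(); ++i) {
    std::size_t best = 0;
    double best_dist = euclidean(data[i], centers[0]);
    for (std::size_t c = 1; c < centers.size(); ++c) {
      const double d = euclidean(data[i], centers[c]);
      if (d < best_dist) {
        best = c;
        best_dist = d;
      }
    }
    out.push_back({i, best, best_dist});
  }
  return out;
}
```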

◆ register_defaults()

void turi::model_base::register_defaults ( const std::string &  fnname,
const variant_map_type &  arguments 
)
protected inherited

Registers default argument values

◆ register_docstring()

void turi::model_base::register_docstring ( const std::pair< std::string, std::string > &  fnname_docstring)
protected inherited

Adds a docstring for the specified function or property name.

◆ register_function()

void turi::model_base::register_function ( std::string  fnname,
const std::vector< std::string > &  arguments,
impl_fn  fn 
)
protected inherited

Adds a function with the specified name, and argument list.

◆ register_getter()

void turi::model_base::register_getter ( const std::string &  propname,
impl_fn  getfn 
)
protected inherited

Adds a property getter with the specified name.

◆ register_setter()

void turi::model_base::register_setter ( const std::string &  propname,
impl_fn  setfn 
)
protected inherited

Adds a property setter with the specified name.

◆ save_impl()

void turi::kmeans::kmeans_model::save_impl ( turi::oarchive &  oarc) const
override virtual

Serialize the model.

Reimplemented from turi::model_base.

◆ save_model_to_data()

void turi::model_base::save_model_to_data ( std::ostream &  out)
inherited

Save a toolkit class to a data stream.

◆ save_to_url()

void turi::model_base::save_to_url ( const std::string &  url,
const variant_map_type &  side_data = {} 
)
inherited

Save a toolkit class to disk.

Parameters
url  The destination URL to store the class.
side_data  Any additional side information.

◆ set_options()

void turi::ml_model_base::set_options ( const std::map< std::string, flexible_type > &  _options)
inherited

Set one of the options in the algorithm.

The values are checked against the requirements given by the option instance.

Parameters
[in] _options  Option names mapped to their values.

◆ set_property()

variant_type turi::model_base::set_property ( const std::string &  property,
variant_map_type  argument 
)
inherited

Sets a property. The new value of the property should appear in the argument map under the key "value".

◆ train() [1/2]

void turi::kmeans::kmeans_model::train ( const sframe &  X,
const sframe &  init_centers,
std::string  method,
bool  allow_categorical = false 
)

Train the kmeans model, without row labels.

Parameters
X  Input data. Each row is an observation.
init_centers  Custom initial centers provided by the user.
method  Indicate if Lloyd's algorithm should be used instead of Elkan's. Lloyd's is generally substantially slower, but uses very little memory, while Elkan's requires storage of all pairwise cluster center distances.
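For comparison with Elkan's method, Lloyd's algorithm can be sketched in a few lines of standalone C++. This is an illustration under simplifying assumptions (dense in-memory points, convergence defined as "no assignment changed"), not the Turi Create code. The names lloyd_result and lloyd_kmeans are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using point = std::vector<double>;  // hypothetical dense row type

// Hypothetical result type for the sketch.
struct lloyd_result {
  std::vector<point> centers;
  std::vector<std::size_t> assignments;
  std::size_t iterations;  // equals max_iterations if not converged
};

// Squared Euclidean distance between two points of equal dimension.
double squared_distance(const point& a, const point& b) {
  assert(a.size() == b.size());
  double total = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j) {
    const double diff = a[j] - b[j];
    total += diff * diff;
  }
  return total;
}

lloyd_result lloyd_kmeans(const std::vector<point>& data,
                          std::vector<point> centers,
                          std::size_t max_iterations) {
  assert(!centers.empty());
  const std::size_t n = data.size();
  const std::size_t K = centers.size();
  const std::size_t dim = centers[0].size();
  // K is an invalid cluster id, so the first pass always counts as a change.
  std::vector<std::size_t> assignments(n, K);
  std::size_t iterations = 0;

  while (iterations < max_iterations) {
    ++iterations;
    bool changed = false;
    // Assignment step: all n x K distances are computed, with no pruning.
    for (std::size_t i = 0; i < n; ++i) {
      std::size_t best = 0;
      double best_dist = squared_distance(data[i], centers[0]);
      for (std::size_t c = 1; c < K; ++c) {
        const double d = squared_distance(data[i], centers[c]);
        if (d < best_dist) {
          best = c;
          best_dist = d;
        }
      }
      if (assignments[i] != best) {
        assignments[i] = best;
        changed = true;
      }
    }
    // Update step: each center becomes the mean of its assigned points.
    std::vector<point> sums(K, point(dim, 0.0));
    std::vector<std::size_t> counts(K, 0);
    for (std::size_t i = 0; i < n; ++i) {
      for (std::size_t j = 0; j < dim; ++j) sums[assignments[i]][j] += data[i][j];
      ++counts[assignments[i]];
    }
    for (std::size_t c = 0; c < K; ++c) {
      if (counts[c] == 0) continue;  // leave an empty cluster's center as is
      for (std::size_t j = 0; j < dim; ++j) {
        centers[c][j] = sums[c][j] / static_cast<double>(counts[c]);
      }
    }
    if (!changed) break;
  }
  return {centers, assignments, iterations};
}
```

The only state carried between iterations is the centers and assignments, which reflects the low memory use noted above. The cost is that every point is compared with every center in every iteration.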

◆ train() [2/2]

void turi::kmeans::kmeans_model::train ( const sframe &  X,
const sframe &  init_centers,
std::string  method,
const std::vector< flexible_type > &  row_labels,
const std::string  row_label_name,
bool  allow_categorical = false 
)

Train the kmeans model, with row labels.

Parameters
X  Input data. Each row is an observation.
init_centers  Custom initial centers provided by the user.
method  Indicate if Lloyd's algorithm should be used instead of Elkan's. Lloyd's is generally substantially slower, but uses very little memory, while Elkan's requires storage of all pairwise cluster center distances.
row_labels  Flexible type row labels.
row_label_name  Name of the row label column.

◆ uid()

virtual const std::string& turi::model_base::uid ( )
pure virtual inherited

Returns a unique identifier for the toolkit class. It can be any unique ID. The UID is only used at runtime (to determine the concrete type of an arbitrary model_base instance) and is never stored.

Note: this function is typically overridden using the BEGIN_CLASS_MEMBER_REGISTRATION macro.

Implemented in turi::model_proxy.

Member Data Documentation

◆ state

std::map<std::string, variant_type> turi::ml_model_base::state
protected inherited

The model state, which is exposed to the Python API.

Definition at line 206 of file ml_model.hpp.


The documentation for this class was generated from the following file: