Turi Create  4.0
turi::ml_data Class Reference

#include <ml/ml_data/ml_data.hpp>

Public Types

typedef std::map< std::string, ml_column_mode > column_mode_map
 This is here to get around 2 clang bugs!
 

Public Member Functions

 ml_data ()
 
 ml_data (const std::shared_ptr< ml_metadata > &metadata)
 
void fill (const sframe &data, const std::string &target_column="", const column_mode_map mode_overrides=column_mode_map(), bool immutable_metadata=false, ml_missing_value_action mva=ml_missing_value_action::ERROR)
 
void fill (const sframe &data, const std::pair< size_t, size_t > &row_bounds, const std::string &target_column="", const column_mode_map mode_overrides=column_mode_map(), bool immutable_metadata=false, ml_missing_value_action _mva=ml_missing_value_action::ERROR)
 
const std::shared_ptr< ml_metadata > & metadata () const
 
size_t num_columns () const
 
size_t num_rows () const
 
size_t size () const
 
bool empty () const
 
ml_data_iterator get_iterator (size_t thread_idx=0, size_t num_threads=1) const
 
bool has_target () const
 
bool has_untranslated_columns () const
 
bool has_translated_columns () const
 
size_t max_row_size () const
 
ml_data create_subsampled_copy (size_t n_rows, size_t random_seed) const
 
ml_data select_rows (const std::vector< size_t > &selection_indices) const
 
ml_data slice (size_t start_row, size_t end_row) const
 
size_t get_version () const
 
void _reindex_blocks (const std::vector< std::vector< size_t > > &reindex_maps)
 

Detailed Description

Row based, SFrame-Like Data storage for Learning and Optimization tasks.

ml_data is a data normalization data structure that translates user input tables (which can contain arbitrary types like strings, lists, dictionaries, etc.) into sparse and dense numeric vectors. This allows toolkits to be implemented on purely numeric, mathematical assumptions while still supporting a much richer set of input types.

To support this, ml_data is a fairly complex data structure that performs several tasks:

  • interpret string columns as categorical onto a sparse vector representation, using either one-hot encoding or reference encoding.
  • map list columns onto a sparse vector representation.
  • map dictionary columns onto a sparse vector representation.
  • map dense numeric arrays onto a dense vector representation.
  • etc.

Each row of a user input table is hence translated into a mixed dense-sparse vector. This vector then has to be materialized as an SFrame (allowing it to scale to datasets larger than memory).
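The two categorical encodings named above differ only in how many dimensions they use. The following is a minimal standalone sketch, not Turi Create code: the function names are hypothetical, and treating category 0 as the reference level is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One-hot encoding: n categories become n indicator dimensions.
std::vector<double> one_hot_encode(std::size_t category,
                                   std::size_t num_categories) {
  assert(category < num_categories);
  std::vector<double> v(num_categories, 0.0);
  v[category] = 1.0;
  return v;
}

// Reference encoding: category 0 is taken as the baseline (an assumption
// of this sketch) and maps to the all-zero vector, leaving n - 1 dimensions.
std::vector<double> reference_encode(std::size_t category,
                                     std::size_t num_categories) {
  assert(category < num_categories);
  std::vector<double> v(num_categories - 1, 0.0);
  if (category > 0) v[category - 1] = 1.0;
  return v;
}
```

Reference encoding avoids the redundant dimension that one-hot encoding introduces, which tends to matter for models with an intercept term.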

The result can then be used to train machine learning models.

Finally, the ml_data data structure has to remember and store the translation mappings so that exactly the same procedure can be applied later to new data (when using the trained model).

Additionally, ml_data implements strategies for automatic imputation of missing data. For instance, missing values in numeric columns can be imputed with the mean, missing values in categorical columns can be imputed with the most common value, etc.
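As an illustration of the mean strategy mentioned above, here is a minimal standalone sketch. It is not the Turi Create implementation: the helper name is hypothetical, and representing a missing value as NaN is an assumption of the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a copy of `column` in which every missing
// entry (encoded as NaN in this sketch) is replaced by the mean of the
// observed entries. At least one entry must be observed.
std::vector<double> impute_with_mean(const std::vector<double>& column) {
  double sum = 0.0;
  std::size_t observed = 0;
  for (double v : column) {
    if (!std::isnan(v)) {
      sum += v;
      ++observed;
    }
  }
  assert(observed > 0);  // the mean is undefined otherwise
  const double mean = sum / static_cast<double>(observed);

  std::vector<double> out(column);
  for (double& v : out) {
    if (std::isnan(v)) v = mean;
  }
  return out;
}
```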

ml_data loads data from an existing sframe, indexes it by mapping all categorical values to unique indices in 0, 1, 2, ..., n, and records statistics about the values. It then puts the data into an efficient row-based storage structure for learning algorithms that need fast row-wise iteration through the features and target. ml_data also speeds up data access via caching and a compact layout.
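The indexing step can be pictured with the following standalone sketch. The class category_indexer is hypothetical and only mirrors the idea described above (assign each new category the next free index and count occurrences); it is not how ml_data stores its metadata internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of Turi Create: assigns each distinct
// category the next unused index and counts how often it occurs.
class category_indexer {
 public:
  // Returns the index for `value`, creating a new one on first sight.
  std::size_t index_of(const std::string& value) {
    auto it = index_.find(value);
    if (it == index_.end()) {
      it = index_.emplace(value, counts_.size()).first;
      counts_.push_back(0);
    }
    ++counts_[it->second];
    return it->second;
  }

  // Number of distinct categories seen so far.
  std::size_t num_categories() const { return counts_.size(); }

  // Number of times the category with index `idx` has been seen.
  std::size_t count(std::size_t idx) const {
    assert(idx < counts_.size());
    return counts_[idx];
  }

 private:
  std::unordered_map<std::string, std::size_t> index_;
  std::vector<std::size_t> counts_;
};
```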

Illustration of the API

Using ml_data

There are a number of use cases for ml_data. The following should address the current use cases.

To construct the data at train time:

// Constructs an empty ml_data object
ml_data data;
// Sets the data source from X, with target_column_name being the
// target column. (Alternatively, target_column_name may be a
// single-column SFrame giving the target. "" denotes no target
// column present).
data.fill(X, target_column_name);
// After filling, a serializable shared pointer to the metadata
// can be saved for the predict stage. this->metadata is of type
// std::shared_ptr<ml_metadata>.
this->metadata = data.metadata();

To iterate through the data, single threaded.

for(auto it = data.get_iterator(); !it.done(); ++it) {
....
it->target_value();
it->fill(...);
}

To iterate through the data, threaded.

in_parallel([&](size_t thread_idx, size_t num_threads) {
for(auto it = data.get_iterator(thread_idx, num_threads); !it.done(); ++it) {
....
it->target_value();
it->fill(...);
}
});
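One way to picture how a (thread_idx, num_threads) pair can select a disjoint part of the data is an even split into contiguous row ranges. The sketch below is standalone and hypothetical: thread_row_range is not a Turi Create function, and whether ml_data_iterator partitions rows in exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: the half-open row range [first, second) that one
// thread would visit if rows were split into contiguous, near-equal parts.
std::pair<std::size_t, std::size_t> thread_row_range(std::size_t num_rows,
                                                     std::size_t thread_idx,
                                                     std::size_t num_threads) {
  assert(num_threads > 0 && thread_idx < num_threads);
  const std::size_t begin = (thread_idx * num_rows) / num_threads;
  const std::size_t end = ((thread_idx + 1) * num_rows) / num_threads;
  return std::make_pair(begin, end);
}
```

With this split, every row belongs to exactly one thread, and the default arguments thread_idx=0, num_threads=1 cover the whole data set.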

To construct the data at predict time:

// Constructs an empty ml_data object, takes construction options
// from original ml_data.
ml_data data(this->metadata);
// Sets the data source from X, with no target column.
data.fill(X);

To serialize the metadata for model serialization

// Type std::shared_ptr<ml_metadata> is fully serializable.
oarc << this->metadata;
iarc >> this->metadata;

To access statistics at train/predict time.

Statistics about each of the columns are fully accessible at any point after training time, and do not change. They are stored with the metadata.

// The number of columns. column_index
// below is between 0 and this value.
this->metadata->num_columns();
// This gives the number of index values at train time. Will never
// change after training time. For categorical types, it gives
// the number of categories at train time. For numerical it is 1
// if scalar and the width of the vector if a numeric vector.
// feature_idx below is between 0 and this value.
this->metadata->index_size(column_index);
// The number of rows having this feature.
this->metadata->statistics(column_index)->count(feature_idx);
// The mean of this feature. Missing is counted as 0.
this->metadata->statistics(column_index)->mean(feature_idx);
// The std dev of this feature. Missing is counted as 0.
this->metadata->statistics(column_index)->stdev(feature_idx);
// The number of rows in which the value of this feature is
// strictly greater than 0.
this->metadata->statistics(column_index)->num_positive(feature_idx);
// The same methods above, but for the target.
this->metadata->target_statistics()->count();
this->metadata->target_statistics()->mean();
this->metadata->target_statistics()->stdev();
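The phrase "Missing is counted as 0" affects both the mean and the standard deviation of sparse features. The following standalone sketch shows one way such statistics could be computed. The names feature_stats and summarize_feature are hypothetical, and the use of the population (divide by N) variance is an assumption; the library may use a different estimator.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of one feature index.
struct feature_stats {
  std::size_t count;         // rows in which the feature is present
  double mean;               // mean over all rows, missing counted as 0
  double stdev;              // population std dev, missing counted as 0
  std::size_t num_positive;  // rows with a value strictly greater than 0
};

// `present_values` holds the values from rows that have the feature;
// the remaining (total_rows - present_values.size()) rows count as 0.
feature_stats summarize_feature(const std::vector<double>& present_values,
                                std::size_t total_rows) {
  assert(total_rows > 0 && present_values.size() <= total_rows);
  double sum = 0.0, sum_sq = 0.0;
  std::size_t positive = 0;
  for (double v : present_values) {
    sum += v;
    sum_sq += v * v;
    if (v > 0.0) ++positive;
  }
  const double n = static_cast<double>(total_rows);
  const double mean = sum / n;
  // Clamp at 0 to guard against tiny negative values from rounding.
  const double variance = std::max(sum_sq / n - mean * mean, 0.0);
  return feature_stats{present_values.size(), mean, std::sqrt(variance),
                       positive};
}
```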

Forcing certain column modes.

The different column modes control the behavior of each column. These modes are defined in ml_data_column_modes as an enum and currently allow NUMERIC, NUMERIC_VECTOR, CATEGORICAL, CATEGORICAL_VECTOR, DICTIONARY.

In most cases, there is an obvious default. However, to force some columns to be set to a particular mode, the fill function accepts a mode_overrides parameter, a map from column name to ml_column_mode. This overrides the default choice. The main use case for this is recsys, where user_id and item_id will always be categorical:

data.fill(recsys_data, "rating",
{{"user_id", ml_column_mode::CATEGORICAL},
{"item_id", ml_column_mode::CATEGORICAL}});

Untranslated Columns

Untranslated columns can be specified with the fill(...) method. The untranslated columns are tracked alongside the regular ones, but are not themselves translated, indexed, or even loaded until iteration. These additional columns are then available using the iterator's fill_untranslated_values function.

To mark a column as untranslated, manually specify its mode as ml_column_mode::UNTRANSLATED using the mode_overrides parameter of the fill method. The example code below illustrates this:

sframe X = make_integer_testing_sframe( {"C1", "C2"}, { {0, 0}, {1, 1}, {2, 2}, {3, 3}, {4, 4} } );
ml_data data;
// Mark C2 as untranslated through the mode_overrides argument of fill().
data.fill(X, "", { {"C2", ml_column_mode::UNTRANSLATED} });
std::vector<ml_data_entry> x_d;
std::vector<flexible_type> x_f;
for(auto it = data.get_iterator(); !it.done(); ++it) {
// Only the translated column C1 shows up in x_d.
it->fill(x_d);
ASSERT_EQ(x_d.size(), 1);
ASSERT_EQ(x_d[0].column_index, 0);
ASSERT_EQ(x_d[0].index, 0);
ASSERT_EQ(x_d[0].value, it.row_index());
// The untranslated column C2 comes back as raw flexible_type values.
it->fill_untranslated_values(x_f);
ASSERT_EQ(x_f.size(), 1);
ASSERT_TRUE(x_f[0] == it.row_index());
}

Definition at line 257 of file ml_data.hpp.

Constructor & Destructor Documentation

◆ ml_data() [1/2]

turi::ml_data::ml_data ( )

Construct an ml_data object based on the current options.

◆ ml_data() [2/2]

turi::ml_data::ml_data ( const std::shared_ptr< ml_metadata > &  metadata)
explicit

Construct an ml_data object based on previous ml_data metadata.

Member Function Documentation

◆ _reindex_blocks()

void turi::ml_data::_reindex_blocks ( const std::vector< std::vector< size_t > > &  reindex_maps)

Remap all the block indices.

◆ create_subsampled_copy()

ml_data turi::ml_data::create_subsampled_copy ( size_t  n_rows,
size_t  random_seed 
) const

Create a subsampled copy of the current ml_data structure. This allows us to quickly create a subset of the data to be used for things like SGD, etc.

If n_rows < size(), exactly n_rows are sampled IID from the dataset. Otherwise, a copy of the current ml_data is returned.
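The row-count contract above (exactly n_rows when n_rows < size(), everything otherwise) can be sketched standalone as follows. The function subsample_row_indices is hypothetical, and drawing distinct rows in increasing order via selection sampling is an assumption of this sketch; the library's actual sampling scheme may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns the indices of the rows kept by a subsample.
// If n_rows >= total_rows, every row is kept. Otherwise exactly n_rows
// distinct indices are chosen, in increasing order, reproducibly for a
// given seed (selection sampling: keep a row with probability
// still_needed / remaining_rows).
std::vector<std::size_t> subsample_row_indices(std::size_t total_rows,
                                               std::size_t n_rows,
                                               std::size_t random_seed) {
  std::vector<std::size_t> picked;
  if (n_rows >= total_rows) {
    picked.resize(total_rows);
    std::iota(picked.begin(), picked.end(), std::size_t(0));
    return picked;
  }
  std::mt19937_64 rng(random_seed);
  picked.reserve(n_rows);
  for (std::size_t row = 0; row < total_rows && picked.size() < n_rows; ++row) {
    const std::size_t remaining_rows = total_rows - row;
    const std::size_t still_needed = n_rows - picked.size();
    std::uniform_int_distribution<std::size_t> dist(0, remaining_rows - 1);
    if (dist(rng) < still_needed) picked.push_back(row);
  }
  assert(picked.size() == n_rows);
  return picked;
}
```

Producing the indices in sorted order keeps the result compatible with a forward pass over row-based storage.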

◆ empty()

bool turi::ml_data::empty ( ) const
inline

Returns true if there is no data in the container.

Definition at line 391 of file ml_data.hpp.

◆ fill() [1/2]

void turi::ml_data::fill ( const sframe &  data,
const std::string &  target_column = "",
const column_mode_map  mode_overrides = column_mode_map(),
bool  immutable_metadata = false,
ml_missing_value_action  mva = ml_missing_value_action::ERROR 
)

Fills the data from an SFrame.

Parameters
data  The data sframe.
target_column  If not reusing metadata, specifies the target column. If no target column is present, then use "".
mode_overrides  A dictionary of column-name to ml_column_mode mode overrides. These will be used instead of the default flex_type_enum -> ml_column_mode mappings. The main use is to specify integers as categorical or designate some columns as untranslated.
immutable_metadata  If true, then any new values in categorical columns will be mapped to size_t(-1) and not indexed.
mva  The behavior when missing values are present.
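The effect of immutable_metadata on unseen categories can be illustrated with a standalone sketch. The class frozen_category_index and the constant unseen_index are hypothetical; only the convention of returning size_t(-1) for values not seen at train time comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sentinel for values not present at train time, i.e. size_t(-1).
const std::size_t unseen_index = static_cast<std::size_t>(-1);

// Hypothetical read-only index built from the categories seen in training.
class frozen_category_index {
 public:
  explicit frozen_category_index(
      const std::vector<std::string>& train_categories) {
    for (const std::string& c : train_categories) {
      // emplace leaves the map unchanged if `c` was already indexed.
      index_.emplace(c, index_.size());
    }
  }

  // Never adds new entries: unknown values map to unseen_index.
  std::size_t lookup(const std::string& value) const {
    auto it = index_.find(value);
    if (it == index_.end()) return unseen_index;
    return it->second;
  }

 private:
  std::unordered_map<std::string, std::size_t> index_;
};
```

A model consuming such rows has to decide how to treat the sentinel, for example by skipping the entry.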

◆ fill() [2/2]

void turi::ml_data::fill ( const sframe &  data,
const std::pair< size_t, size_t > &  row_bounds,
const std::string &  target_column = "",
const column_mode_map  mode_overrides = column_mode_map(),
bool  immutable_metadata = false,
ml_missing_value_action  _mva = ml_missing_value_action::ERROR 
)

Fills the data from an SFrame.

Parameters
data  The data sframe.
row_bounds  The (lower, upper) bounds on which rows from the original data sframe are considered. It is as if the original sframe has only these rows.
target_column  If not reusing metadata, specifies the target column. If no target column is present, then use "".
mode_overrides  A dictionary of column-name to ml_column_mode mode overrides. These will be used instead of the default flex_type_enum -> ml_column_mode mappings. The main use is to specify integers as categorical or designate some columns as untranslated.
immutable_metadata  If true, then any new values in categorical columns will be mapped to size_t(-1) and not indexed.
_mva  The behavior when missing values are present.

◆ get_iterator()

ml_data_iterator turi::ml_data::get_iterator ( size_t  thread_idx = 0,
size_t  num_threads = 1 
) const

Return an iterator over part of the data. See iterators/ml_data_iterator.hpp for documentation on the returned iterator.

◆ get_version()

size_t turi::ml_data::get_version ( ) const
inline

Get the current serialization format.

Definition at line 477 of file ml_data.hpp.

◆ has_target()

bool turi::ml_data::has_target ( ) const
inline

Returns true if a target column is present, and false otherwise.

Definition at line 410 of file ml_data.hpp.

◆ has_translated_columns()

bool turi::ml_data::has_translated_columns ( ) const
inline

Returns true if any of the non-target columns are translated.

Definition at line 423 of file ml_data.hpp.

◆ has_untranslated_columns()

bool turi::ml_data::has_untranslated_columns ( ) const
inline

Returns true if there are untranslated columns present, and false otherwise.

Definition at line 417 of file ml_data.hpp.

◆ max_row_size()

size_t turi::ml_data::max_row_size ( ) const
inline

Returns the maximum row size present in the data. This information is calculated when the data is indexed and the ml_data structure is filled. A buffer sized to this is guaranteed to hold any row encountered while iterating through the data.
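The buffer guarantee can be illustrated with a standalone sketch in which rows are plain vectors. The struct row_entry and both functions are hypothetical stand-ins, not Turi Create types; the point is only that one buffer reserved to the maximum row size never needs to grow during iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one translated (column, index, value) entry.
struct row_entry {
  std::size_t column_index;
  std::size_t index;
  double value;
};

// Largest number of entries in any row; computed once, at fill time.
std::size_t compute_max_row_size(
    const std::vector<std::vector<row_entry> >& rows) {
  std::size_t result = 0;
  for (const std::vector<row_entry>& row : rows) {
    result = std::max(result, row.size());
  }
  return result;
}

// Sums every value while reusing a single preallocated buffer.
double sum_all_values(const std::vector<std::vector<row_entry> >& rows) {
  std::vector<row_entry> buffer;
  buffer.reserve(compute_max_row_size(rows));
  const std::size_t capacity = buffer.capacity();
  double total = 0.0;
  for (const std::vector<row_entry>& row : rows) {
    buffer.assign(row.begin(), row.end());
    assert(buffer.capacity() == capacity);  // no reallocation was needed
    for (const row_entry& e : buffer) total += e.value;
  }
  return total;
}
```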

Definition at line 433 of file ml_data.hpp.

◆ metadata()

const std::shared_ptr<ml_metadata>& turi::ml_data::metadata ( ) const
inline

Direct access to the metadata.

Definition at line 367 of file ml_data.hpp.

◆ num_columns()

size_t turi::ml_data::num_columns ( ) const
inline

Returns the number of columns present.

Definition at line 373 of file ml_data.hpp.

◆ num_rows()

size_t turi::ml_data::num_rows ( ) const
inline

The number of rows present.

Definition at line 379 of file ml_data.hpp.

◆ select_rows()

ml_data turi::ml_data::select_rows ( const std::vector< size_t > &  selection_indices) const

Create a copy of the current ml_data structure, selecting the rows given by selection_indices.

Parameters
selection_indices  A vector of row indices that must be in sorted order. Duplicates are allowed. The returned ml_data contains all the rows given by selection_indices.
Returns
A new ml_data object containing only the rows given by selection_indices.
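The contract on selection_indices (sorted, duplicates allowed) can be shown on a plain vector. The function select_sorted_rows below is a hypothetical standalone sketch, not the ml_data implementation; a plausible reason for the sorted requirement is that it allows a single forward pass over row storage, which is an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies rows[idx] for every idx in selection_indices.
// The indices must be sorted; a repeated index repeats the row.
template <typename Row>
std::vector<Row> select_sorted_rows(
    const std::vector<Row>& rows,
    const std::vector<std::size_t>& selection_indices) {
  assert(std::is_sorted(selection_indices.begin(), selection_indices.end()));
  std::vector<Row> out;
  out.reserve(selection_indices.size());
  for (std::size_t idx : selection_indices) {
    assert(idx < rows.size());
    out.push_back(rows[idx]);
  }
  return out;
}
```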

◆ size()

size_t turi::ml_data::size ( ) const
inline

The number of rows present.

Definition at line 385 of file ml_data.hpp.

◆ slice()

ml_data turi::ml_data::slice ( size_t  start_row,
size_t  end_row 
) const

Create a sliced copy of the current ml_data structure. This copy is cheap.
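A copy of this kind can be cheap when the slice shares the underlying storage and only records new row bounds. The standalone sketch below illustrates that idea. The class row_view is hypothetical, and both the shared-storage design and the half-open interpretation [start_row, end_row) are assumptions, not statements about the ml_data internals.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical view over shared, immutable row storage.
class row_view {
 public:
  explicit row_view(std::vector<double> rows)
      : storage_(std::make_shared<const std::vector<double> >(std::move(rows))),
        begin_(0),
        end_(storage_->size()) {}

  // Returns a view of rows [start_row, end_row) relative to this view.
  // Only a shared pointer and two indices are copied, never the rows.
  row_view slice(std::size_t start_row, std::size_t end_row) const {
    assert(start_row <= end_row && end_row <= size());
    row_view result(*this);
    result.begin_ = begin_ + start_row;
    result.end_ = begin_ + end_row;
    return result;
  }

  std::size_t size() const { return end_ - begin_; }

  double at(std::size_t i) const {
    assert(i < size());
    return (*storage_)[begin_ + i];
  }

  bool shares_storage_with(const row_view& other) const {
    return storage_ == other.storage_;
  }

 private:
  std::shared_ptr<const std::vector<double> > storage_;
  std::size_t begin_;
  std::size_t end_;
};
```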


The documentation for this class was generated from the following file: