Turi Create
4.0
#include <ml/ml_data/ml_data.hpp>
Public Types | |
typedef std::map< std::string, ml_column_mode > | column_mode_map |
This is here to get around 2 clang bugs! | |
Public Member Functions | |
ml_data () | |
ml_data (const std::shared_ptr< ml_metadata > &metadata) | |
void | fill (const sframe &data, const std::string &target_column="", const column_mode_map mode_overrides=column_mode_map(), bool immutable_metadata=false, ml_missing_value_action mva=ml_missing_value_action::ERROR) |
void | fill (const sframe &data, const std::pair< size_t, size_t > &row_bounds, const std::string &target_column="", const column_mode_map mode_overrides=column_mode_map(), bool immutable_metadata=false, ml_missing_value_action _mva=ml_missing_value_action::ERROR) |
const std::shared_ptr< ml_metadata > & | metadata () const |
size_t | num_columns () const |
size_t | num_rows () const |
size_t | size () const |
bool | empty () const |
ml_data_iterator | get_iterator (size_t thread_idx=0, size_t num_threads=1) const |
bool | has_target () const |
bool | has_untranslated_columns () const |
bool | has_translated_columns () const |
size_t | max_row_size () const |
ml_data | create_subsampled_copy (size_t n_rows, size_t random_seed) const |
ml_data | select_rows (const std::vector< size_t > &selection_indices) const |
ml_data | slice (size_t start_row, size_t end_row) const |
size_t | get_version () const |
void | _reindex_blocks (const std::vector< std::vector< size_t > > &reindex_maps) |
Row-based, SFrame-like data storage for learning and optimization tasks.
ml_data is a data normalization data structure that translates user input tables (which can contain arbitrary types such as strings, lists, and dictionaries) into sparse and dense numeric vectors. This lets toolkits be implemented against purely numeric, mathematical assumptions while still supporting a much richer set of input types.
To support this, ml_data is a fairly complex data structure that performs several tasks.
The translated data can then be used to train machine learning models.
Finally, ml_data has to remember and store the translation mappings so that exactly the same procedure can be applied later to new data (when using the trained model).
Additionally, ml_data implements strategies for automatic imputation of missing data. For instance, missing values in numeric columns can be imputed with the column mean, and missing values in categorical columns can be imputed with the most common value.
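The two imputation strategies mentioned above can be sketched in isolation. This is a minimal illustration and not the ml_data implementation: the function names are hypothetical, NaN is assumed to mark a missing numeric value, and the empty string is assumed to mark a missing categorical value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins: NaN marks a missing numeric value and the empty
// string marks a missing categorical value. Neither is Turi Create API.

// Replace missing numeric entries with the mean of the observed entries.
inline void impute_with_mean(std::vector<double>& column) {
  double sum = 0.0;
  std::size_t observed = 0;
  for (double v : column) {
    if (!std::isnan(v)) { sum += v; ++observed; }
  }
  if (observed == 0) return;  // nothing to base the mean on
  const double mean = sum / static_cast<double>(observed);
  for (double& v : column) {
    if (std::isnan(v)) v = mean;
  }
}

// Replace missing categorical entries with the most common observed value.
// Ties go to the lexicographically smallest value (std::map is ordered).
inline void impute_with_mode(std::vector<std::string>& column) {
  std::map<std::string, std::size_t> counts;
  for (const std::string& v : column) {
    if (!v.empty()) ++counts[v];
  }
  if (counts.empty()) return;  // every entry is missing
  std::string mode;
  std::size_t best = 0;
  for (const auto& kv : counts) {
    if (kv.second > best) { best = kv.second; mode = kv.first; }
  }
  assert(!mode.empty());  // counts only holds non-empty values
  for (std::string& v : column) {
    if (v.empty()) v = mode;
  }
}
```

Calling impute_with_mean on a column holding 1.0, NaN, 3.0 fills the gap with 2.0. Calling impute_with_mode on "a", "", "b", "a" fills the gap with "a".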
ml_data loads data from an existing sframe, indexes it by mapping all categorical values to unique indices in 0, 1, 2, ..., n, and records statistics about the values. It then stores the result in an efficient row-based structure designed for learning algorithms that need fast row-wise iteration through the features and target. ml_data also speeds up data access via caching and a compact layout.
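As a rough illustration of the indexing step, the sketch below assigns each distinct categorical value the next unused index and keeps a per-category count as a simple statistic. The categorical_indexer class and index_column helper are hypothetical names for this example and are not part of Turi Create.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the per-column index kept with the metadata.
class categorical_indexer {
 public:
  // Return the index of `value`, assigning the next unused index the first
  // time the value is seen. Every call also bumps that category's count.
  std::size_t index_of(const std::string& value) {
    auto it = index_.find(value);
    if (it == index_.end()) {
      it = index_.emplace(value, counts_.size()).first;
      counts_.push_back(0);
    }
    ++counts_[it->second];
    return it->second;
  }

  std::size_t num_categories() const { return counts_.size(); }

  // Number of times the category with this index has been observed.
  std::size_t count(std::size_t index) const {
    assert(index < counts_.size());
    return counts_[index];
  }

 private:
  std::unordered_map<std::string, std::size_t> index_;
  std::vector<std::size_t> counts_;
};

// Translate a whole column of values into indices.
inline std::vector<std::size_t> index_column(
    categorical_indexer& indexer, const std::vector<std::string>& column) {
  std::vector<std::size_t> out;
  out.reserve(column.size());
  for (const std::string& v : column) out.push_back(indexer.index_of(v));
  return out;
}
```

Indexing the column "red", "blue", "red" with a fresh indexer yields 0, 1, 0, with two categories recorded and a count of 2 for index 0.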
There are a number of use cases for ml_data. The following should address the current use cases.
### To construct the data at train time:
### To iterate through the data, single threaded.
Statistics about each column are fully accessible at any point after training time and do not change. They are stored with the metadata.
The different column modes control the behavior of each column. These modes are defined in ml_data_column_modes as an enum and currently allow NUMERIC, NUMERIC_VECTOR, CATEGORICAL, CATEGORICAL_VECTOR, DICTIONARY.
In most cases, there is an obvious default. However, to force some columns into a particular mode, a mode_overrides parameter is available in the fill functions as a map from column name to ml_column_mode. This overrides the default choice. The main use case for this is recsys, where user_id and item_id will always be categorical:
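The override logic can be sketched with stand-in types. Everything in this example is an assumption made for illustration: the value_type and column_mode enums, the default mapping in default_mode, and the choose_mode helper are hypothetical and do not mirror the real flex_type_enum to ml_column_mode mapping.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins; the real enums live in the Turi Create headers.
enum class value_type { INTEGER, FLOAT, STRING, LIST, DICT };
enum class column_mode { NUMERIC, CATEGORICAL, CATEGORICAL_VECTOR, DICTIONARY };

// An assumed default mapping from storage type to column mode.
inline column_mode default_mode(value_type t) {
  switch (t) {
    case value_type::INTEGER:
    case value_type::FLOAT:  return column_mode::NUMERIC;
    case value_type::STRING: return column_mode::CATEGORICAL;
    case value_type::LIST:   return column_mode::CATEGORICAL_VECTOR;
    case value_type::DICT:   return column_mode::DICTIONARY;
  }
  return column_mode::NUMERIC;  // unreachable for valid enum values
}

using mode_map = std::map<std::string, column_mode>;

// An override, when present, wins over the default for that column.
inline column_mode choose_mode(const std::string& column, value_type t,
                               const mode_map& overrides) {
  auto it = overrides.find(column);
  return it != overrides.end() ? it->second : default_mode(t);
}
```

With overrides for user_id and item_id set to CATEGORICAL, an integer user_id column resolves to CATEGORICAL, while an integer rating column with no override keeps the NUMERIC default.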
Untranslated columns can be specified with the fill(...) method. They are tracked alongside the regular columns, but are not themselves translated, indexed, or even loaded until iteration. These additional columns are then available through the iterator's fill_untranslated_values function.
To mark a column as untranslated, specify its mode as ml_column_mode::UNTRANSLATED using the mode_overrides parameter of the fill method. The example code below illustrates this:
Definition at line 257 of file ml_data.hpp.
turi::ml_data::ml_data | ( | ) |
Construct an ml_data object based on the current options.
turi::ml_data::ml_data | ( | const std::shared_ptr< ml_metadata > & | metadata | ) | explicit |
void turi::ml_data::_reindex_blocks | ( | const std::vector< std::vector< size_t > > & | reindex_maps | ) |
Remap all the block indices.
ml_data turi::ml_data::create_subsampled_copy | ( | size_t | n_rows, |
size_t | random_seed | ||
) | const |
bool turi::ml_data::empty | ( | ) | const | inline |
Returns true if there is no data in the container.
Definition at line 391 of file ml_data.hpp.
void turi::ml_data::fill | ( | const sframe & | data, |
const std::string & | target_column = "" , |
||
const column_mode_map | mode_overrides = column_mode_map() , |
||
bool | immutable_metadata = false , |
||
ml_missing_value_action | mva = ml_missing_value_action::ERROR |
||
) |
Fills the data from an SFrame.
data | The data sframe. |
target_column | If not reusing metadata, specifies the target column. If no target column is present, then use "". |
mode_overrides | A dictionary of column-name to ml_column_mode mode overrides. These will be used instead of the default flex_type_enum -> ml_column_mode mappings. The main use is to specify integers as categorical or designate some columns as untranslated. |
immutable_metadata | If true, then any new values in categorical columns will be mapped to size_t(-1) and not indexed. |
mva | The behavior when missing values are present. |
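The effect of immutable_metadata on categorical lookups can be sketched as follows. This is an assumed, simplified model of the behavior described in the parameter table: the category_index alias and lookup function are hypothetical names, not Turi Create API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

using category_index = std::unordered_map<std::string, std::size_t>;

// Hypothetical helper: look up `value`. If it is unseen, either assign the
// next index, or, when the metadata is immutable, return size_t(-1) and
// leave the index untouched.
inline std::size_t lookup(category_index& index, const std::string& value,
                          bool immutable_metadata) {
  auto it = index.find(value);
  if (it != index.end()) return it->second;
  if (immutable_metadata) return static_cast<std::size_t>(-1);
  const std::size_t next = index.size();
  index.emplace(value, next);
  return next;
}
```

After "a" and "b" have been indexed as 0 and 1, looking up "c" with immutable_metadata set to true returns size_t(-1) and the index still holds two entries. Known values resolve as before.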
void turi::ml_data::fill | ( | const sframe & | data, |
const std::pair< size_t, size_t > & | row_bounds, | ||
const std::string & | target_column = "" , |
||
const column_mode_map | mode_overrides = column_mode_map() , |
||
bool | immutable_metadata = false , |
||
ml_missing_value_action | _mva = ml_missing_value_action::ERROR |
||
) |
Fills the data from an SFrame.
data | The data sframe. |
row_bounds | The (lower, upper) bounds on which rows from the original data sframe are considered. It is as if the original sframe has only these rows. |
target_column | If not reusing metadata, specifies the target column. If no target column is present, then use "". |
mode_overrides | A dictionary of column-name to ml_column_mode mode overrides. These will be used instead of the default flex_type_enum -> ml_column_mode mappings. The main use is to specify integers as categorical or designate some columns as untranslated. |
immutable_metadata | If true, then any new values in categorical columns will be mapped to size_t(-1) and not indexed. |
mva | The behavior when missing values are present. |
ml_data_iterator turi::ml_data::get_iterator | ( | size_t | thread_idx = 0 , |
size_t | num_threads = 1 |
||
) | const |
Return an iterator over part of the data. See iterators/ml_data_iterator.hpp for documentation on the returned iterator.
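One way to picture the thread_idx and num_threads arguments is as a split of the rows into num_threads contiguous ranges, one per iterator. The even-split formula below is an assumption for illustration, and thread_row_range is a hypothetical helper rather than a function in the library.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open row range [first, second) assigned to thread `thread_idx`.
// The even-split formula is an assumption made for this sketch.
inline std::pair<std::size_t, std::size_t> thread_row_range(
    std::size_t num_rows, std::size_t thread_idx, std::size_t num_threads) {
  assert(num_threads > 0 && thread_idx < num_threads);
  const std::size_t begin = (thread_idx * num_rows) / num_threads;
  const std::size_t end = ((thread_idx + 1) * num_rows) / num_threads;
  return {begin, end};
}
```

Under this scheme the ranges are disjoint and together cover every row, so each thread can iterate over its own part without coordination. The defaults thread_idx=0 and num_threads=1 correspond to a single range over all rows.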
size_t turi::ml_data::get_version | ( | ) | const | inline |
Get the current serialization format.
Definition at line 477 of file ml_data.hpp.
bool turi::ml_data::has_target | ( | ) | const | inline |
Returns true if a target column is present, and false otherwise.
Definition at line 410 of file ml_data.hpp.
bool turi::ml_data::has_translated_columns | ( | ) | const | inline |
Returns true if any of the non-target columns are translated.
Definition at line 423 of file ml_data.hpp.
bool turi::ml_data::has_untranslated_columns | ( | ) | const | inline |
Returns true if there are untranslated columns present, and false otherwise.
Definition at line 417 of file ml_data.hpp.
size_t turi::ml_data::max_row_size | ( | ) | const | inline |
Returns the maximum row size present in the data. This information is calculated when the data is indexed and the ml_data structure is filled. A buffer sized to this is guaranteed to hold any row encountered while iterating through the data.
Definition at line 433 of file ml_data.hpp.
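The buffer guarantee can be illustrated with a simplified model in which a row is just a vector of values. The row alias and both functions are hypothetical. In ml_data itself the row size is computed during fill rather than by a separate pass like the one shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using row = std::vector<double>;  // simplified stand-in for one data row

// Largest number of entries in any row.
inline std::size_t compute_max_row_size(const std::vector<row>& rows) {
  std::size_t m = 0;
  for (const row& r : rows) m = std::max(m, r.size());
  return m;
}

// Sum every entry, copying each row into one buffer allocated up front.
inline double sum_all(const std::vector<row>& rows) {
  std::vector<double> buffer(compute_max_row_size(rows));
  double total = 0.0;
  for (const row& r : rows) {
    assert(r.size() <= buffer.size());  // holds because of the sizing above
    std::copy(r.begin(), r.end(), buffer.begin());
    for (std::size_t i = 0; i < r.size(); ++i) total += buffer[i];
  }
  return total;
}
```

The point of the sketch is that the buffer is allocated once and never resized inside the loop, which is what a known maximum row size makes possible.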
const std::shared_ptr< ml_metadata > & turi::ml_data::metadata | ( | ) | const | inline |
Direct access to the metadata.
Definition at line 367 of file ml_data.hpp.
size_t turi::ml_data::num_columns | ( | ) | const | inline |
Returns the number of columns present.
Definition at line 373 of file ml_data.hpp.
size_t turi::ml_data::num_rows | ( | ) | const | inline |
The number of rows present.
Definition at line 379 of file ml_data.hpp.
ml_data turi::ml_data::select_rows | ( | const std::vector< size_t > & | selection_indices | ) | const |
Create a copy of the current ml_data structure, selecting the rows given by selection_indices.
selection_indices | A vector of row indices that must be in sorted order. Duplicates are allowed. The returned ml_data contains all the rows given by selection_indices. |
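The contract on selection_indices (sorted, duplicates allowed) can be shown on a plain vector. This free function is a hypothetical stand-in that copies rows eagerly. It says nothing about how ml_data stores or shares the selected rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copy out the rows named by `selection_indices`, which must be sorted in
// non-decreasing order and in range. Duplicate indices yield repeated rows.
template <typename Row>
std::vector<Row> select_rows(const std::vector<Row>& rows,
                             const std::vector<std::size_t>& selection_indices) {
  assert(std::is_sorted(selection_indices.begin(), selection_indices.end()));
  std::vector<Row> out;
  out.reserve(selection_indices.size());
  for (std::size_t i : selection_indices) {
    assert(i < rows.size());
    out.push_back(rows[i]);
  }
  return out;
}
```

Selecting indices 0, 2, 2, 3 from the rows 10, 20, 30, 40 gives 10, 30, 30, 40. The result has one row per index, so it can be longer than the number of distinct rows selected.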
size_t turi::ml_data::size | ( | ) | const | inline |
The number of rows present.
Definition at line 385 of file ml_data.hpp.
ml_data turi::ml_data::slice | ( | size_t | start_row, |
size_t | end_row | ||
) | const |
Create a sliced copy of the current ml_data structure. This copy is cheap.
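A common way to make such a copy cheap is to share the underlying storage and change only the row bounds. The sketch below shows that idea with a hypothetical row_view class. Whether ml_data uses exactly this layout, and whether its slice bounds are relative to the current view as they are here, are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical view type: many views can share one block of row storage.
class row_view {
 public:
  explicit row_view(std::vector<double> rows)
      : storage_(std::make_shared<std::vector<double>>(std::move(rows))),
        begin_(0), end_(storage_->size()) {}

  // New view over [start_row, end_row) of this view. No rows are copied;
  // only the shared pointer and the two bounds are.
  row_view slice(std::size_t start_row, std::size_t end_row) const {
    assert(start_row <= end_row && end_row <= size());
    row_view out(*this);
    out.begin_ = begin_ + start_row;
    out.end_ = begin_ + end_row;
    return out;
  }

  std::size_t size() const { return end_ - begin_; }

  double operator[](std::size_t i) const {
    assert(i < size());
    return (*storage_)[begin_ + i];
  }

  bool shares_storage_with(const row_view& other) const {
    return storage_ == other.storage_;
  }

 private:
  std::shared_ptr<const std::vector<double>> storage_;
  std::size_t begin_, end_;
};
```

Slicing rows 1 to 4 of a five-row view gives a three-row view whose first element is the original second row, and both views report the same storage. The cost of the slice does not depend on the number of rows.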