Turi Create
4.0
#include <core/storage/sframe_data/sframe.hpp>
Public Types | |
typedef sframe_reader | reader_type |
The reader type. | |
typedef sframe_output_iterator | iterator |
The iterator type which get_output_iterator returns. | |
typedef std::vector< flexible_type > | value_type |
The type contained in the sframe. | |
Public Member Functions | |
sframe () | |
sframe (const sframe &other) | |
sframe (sframe &&other) | |
sframe & | operator= (const sframe &other) |
sframe & | operator= (sframe &&other) |
sframe (std::string frame_idx_file) | |
sframe (sframe_index_file_information frame_index_info) | |
sframe (const std::vector< std::shared_ptr< sarray< flexible_type > > > &new_columns, const std::vector< std::string > &column_names={}, bool fail_on_column_names=true) | |
std::map< std::string, std::shared_ptr< sarray< flexible_type > > > | init_from_csvs (const std::string &path, csv_line_tokenizer &tokenizer, bool use_header, bool continue_on_failure, bool store_errors, std::map< std::string, flex_type_enum > column_type_hints, std::vector< std::string > output_columns=std::vector< std::string >(), size_t row_limit=0, size_t skip_rows=0) |
sframe (const dataframe_t &data) | |
void | open_for_read (sframe_index_file_information frame_index_info) |
void | open_for_read (const std::vector< std::shared_ptr< sarray< flexible_type > > > &new_columns, const std::vector< std::string > &column_names={}, bool fail_on_column_names=true) |
void | open_for_write (const std::vector< std::string > &column_names, const std::vector< flex_type_enum > &column_types, const std::string &frame_sidx_file="", size_t nsegments=SFRAME_DEFAULT_NUM_SEGMENTS, bool fail_on_column_names=true) |
bool | is_opened_for_read () const |
bool | is_opened_for_write () const |
const std::string & | get_index_file () const |
bool | get_metadata (const std::string &key, std::string &val) const |
std::pair< bool, std::string > | get_metadata (const std::string &key) const |
size_t | num_columns () const |
Returns the number of columns in the SFrame. Does not throw. | |
size_t | num_rows () const |
Returns the length of each sarray. | |
size_t | size () const |
std::string | column_name (size_t i) const |
flex_type_enum | column_type (size_t i) const |
flex_type_enum | column_type (const std::string &column_name) const |
const std::vector< std::string > & | column_names () const |
std::vector< flex_type_enum > | column_types () const |
bool | contains_column (const std::string &column_name) const |
size_t | num_segments () const |
size_t | segment_length (size_t i) const |
size_t | column_index (const std::string &column_name) const |
const sframe_index_file_information | get_index_info () const |
sframe | append (const sframe &other) const |
std::unique_ptr< reader_type > | get_reader () const |
std::unique_ptr< reader_type > | get_reader (size_t num_segments) const |
std::unique_ptr< reader_type > | get_reader (const std::vector< size_t > &segment_lengths) const |
dataframe_t | to_dataframe () |
std::shared_ptr< sarray< flexible_type > > | select_column (size_t column_id) const |
std::shared_ptr< sarray< flexible_type > > | select_column (const std::string &name) const |
sframe | select_columns (const std::vector< std::string > &names) const |
sframe | add_column (std::shared_ptr< sarray< flexible_type > > sarr_ptr, const std::string &column_name=std::string("")) const |
void | set_column_name (size_t column_id, const std::string &name) |
sframe | remove_column (size_t column_id) const |
sframe | swap_columns (size_t column_1, size_t column_2) const |
sframe | replace_column (std::shared_ptr< sarray< flexible_type >> sarr_ptr, const std::string &column_name) const |
bool | set_num_segments (size_t numseg) |
iterator | get_output_iterator (size_t segmentid) |
void | close () |
void | flush_write_to_segment (size_t segment) |
void | save_as_csv (std::string csv_file, csv_writer &writer) |
bool | set_metadata (const std::string &key, std::string val) |
void | save (std::string index_file) const |
void | save (oarchive &oarc) const |
void | try_compact () |
void | load (iarchive &iarc) |
std::shared_ptr< sarray_group_format_writer< flexible_type > > | get_internal_writer () |
void | debug_print () |
The SFrame is an immutable object that represents a table with rows and columns. Each column is an sarray<flexible_type>, which is a sequence of objects split into segments. The sframe writes one sarray to disk for each column of data it is given, each with a prefix that extends the prefix given to open. On disk, the SFrame is referenced by a single ".frame_idx" file, which lists the file names, one file per column.
The SFrame is write-once, read-many. The SFrame can be opened for writing once, after which it is read-only.
Since each column of the SFrame is an independent sarray, held as an independent shared_ptr<sarray<flexible_type> > object, columns can be added or removed to form new sframes without problems. As such, the results of certain operations (such as the frame returned by add_column) can be "ephemeral" in that there is no .frame_idx file on disk backing them. An "ephemeral" frame can be identified by checking the result of get_index_file(). If this is empty, it is an ephemeral frame.
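The column-sharing idea can be illustrated with a small standalone model. This is only a sketch: MiniFrame, Column, and ColumnPtr are hypothetical stand-ins (not Turi Create types), and the index_file member merely imitates what get_index_file() reports.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: Column models sarray<flexible_type>,
// MiniFrame models sframe. Neither is the Turi Create type.
using Column = std::vector<double>;
using ColumnPtr = std::shared_ptr<const Column>;

struct MiniFrame {
  std::vector<std::string> names;
  std::vector<ColumnPtr> columns;
  std::string index_file;  // empty means no index file backs this frame

  bool is_ephemeral() const { return index_file.empty(); }

  // Returns a new frame that shares the existing columns and appends one.
  MiniFrame add_column(ColumnPtr col, const std::string& name) const {
    MiniFrame out;
    out.names = names;
    out.columns = columns;  // copies the pointers, not the column data
    out.names.push_back(name);
    out.columns.push_back(std::move(col));
    return out;  // index_file stays empty, so the result is ephemeral
  }

  // Returns a new frame without the column at position i.
  MiniFrame remove_column(std::size_t i) const {
    if (i >= columns.size()) throw std::out_of_range("column id out of range");
    MiniFrame out;
    out.names = names;
    out.columns = columns;
    out.names.erase(out.names.begin() + static_cast<std::ptrdiff_t>(i));
    out.columns.erase(out.columns.begin() + static_cast<std::ptrdiff_t>(i));
    return out;
  }
};
```

Because only shared pointers are copied, the original frame is left untouched and the new frame costs no extra column storage in this model.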
The SFrame interface largely matches that of the sarray, as if the sarray's stored type were std::vector<flexible_type>. The SFrame, however, also provides many other capabilities, such as CSV parsing and construction from sarrays.
Definition at line 67 of file sframe.hpp.
|
inline |
Default constructor. Does nothing; use open_for_read or open_for_write after construction to read or create an sframe.
Definition at line 88 of file sframe.hpp.
turi::sframe::sframe | ( | const sframe & | other | ) |
Copy constructor. If the source frame is opened for writing, this will throw an exception. Otherwise, this will create a frame opened for reading, which shares column arrays with the source frame.
|
inline |
Move constructor.
Definition at line 102 of file sframe.hpp.
|
inline explicit |
Attempts to construct an sframe which reads from the given frame index file. This should be a .frame_idx file. If the index cannot be opened, an exception is thrown.
Definition at line 128 of file sframe.hpp.
|
inline explicit |
Construct an sframe from sframe index information.
Definition at line 136 of file sframe.hpp.
|
inline explicit |
Constructs an SFrame from a vector of Sarrays.
columns | List of sarrays to form as columns |
column_names | List of names for each column, with indices corresponding to the list of columns. If column_names is shorter than columns, the remaining columns get default names. For example, if four columns are given and column_names = {"id", "num"}, the columns will be named {"id", "num", "X3", "X4"}. Entries that are zero-length strings will also be given a default name. |
fail_on_column_names | If true, will throw an exception if any column names are not unique. If false, will automatically adjust column names so they are unique. |
Throws an exception if any column names are not unique (if fail_on_column_names is true), or if the number of segments, segment sizes, or total sizes of the sarrays are not equal. The constructed SFrame is ephemeral, and is not backed by a disk index.
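The naming rule described above can be sketched as a standalone function. resolve_column_names is a hypothetical helper, not a Turi Create function. The default "X<position>" names follow the documented example, while the ".1", ".2" renaming used when fail_on_column_names is false is purely an assumption, since the actual adjustment scheme is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: fills in default names "X<position>" (1-based)
// for missing or empty entries, then checks for duplicates.
std::vector<std::string> resolve_column_names(
    std::vector<std::string> names, std::size_t num_columns,
    bool fail_on_column_names = true) {
  // Missing trailing names become empty strings; in this sketch,
  // names beyond num_columns are simply dropped.
  names.resize(num_columns);
  for (std::size_t i = 0; i < num_columns; ++i) {
    if (names[i].empty()) names[i] = "X" + std::to_string(i + 1);
  }

  std::set<std::string> seen;
  for (auto& name : names) {
    if (seen.insert(name).second) continue;  // first use of this name
    if (fail_on_column_names) {
      throw std::invalid_argument("duplicate column name: " + name);
    }
    // Assumed renaming scheme: append ".1", ".2", ... until unused.
    std::size_t suffix = 1;
    std::string candidate;
    do {
      candidate = name + "." + std::to_string(suffix++);
    } while (!seen.insert(candidate).second);
    name = candidate;
  }
  return names;
}
```

Called with the names {"id", "num"} and four columns, this sketch yields {"id", "num", "X3", "X4"}, matching the example in the parameter description.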
Definition at line 159 of file sframe.hpp.
turi::sframe::sframe | ( | const dataframe_t & | data | ) |
Constructs an SFrame from dataframe_t.
sframe turi::sframe::add_column | ( | std::shared_ptr< sarray< flexible_type > > | sarr_ptr, |
const std::string & | column_name = std::string("") |
||
) | const |
Returns a new ephemeral SFrame with the new column added to the end. The new sframe is "ephemeral" in that it is not backed by an index on disk.
sarr_ptr | Shared pointer to the SArray |
column_name | The name to give this column. If empty it will be given a default name (X<column index>) |
Merges another SFrame with the same schema into the current SFrame, returning a new SFrame. Both SFrames can be empty, but neither can be opened for writing.
|
virtual |
Closes the sframe. close() also implicitly closes all segments. After the writer is closed, no segments can be written. After the sframe is closed, it becomes read only and can be read with the get_reader() function.
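The write-once, read-many lifecycle that close() completes can be modeled as a small state machine. WriteOnceTable below is a hypothetical stand-in written only with the standard library. It mirrors the documented transitions (open for write, write, close, read) and does not reproduce how the sframe stores data.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a write-once, read-many table.
class WriteOnceTable {
 public:
  using Row = std::vector<std::string>;

  void open_for_write(std::size_t nsegments) {
    if (state_ != State::kUnopened) throw std::logic_error("already opened");
    if (nsegments == 0) throw std::invalid_argument("nsegments must be > 0");
    segments_.assign(nsegments, std::vector<Row>());
    state_ = State::kWriting;
  }

  // Appends one row to a segment; only legal before close().
  void write(std::size_t segment, Row row) {
    if (state_ != State::kWriting) throw std::logic_error("not open for write");
    segments_.at(segment).push_back(std::move(row));
  }

  // Closing ends the write phase; the table is read-only afterwards.
  void close() {
    if (state_ == State::kWriting) state_ = State::kReading;
  }

  bool is_opened_for_read() const { return state_ == State::kReading; }
  bool is_opened_for_write() const { return state_ == State::kWriting; }

  std::size_t num_rows() const {
    std::size_t n = 0;
    for (const auto& segment : segments_) n += segment.size();
    return n;
  }

 private:
  enum class State { kUnopened, kWriting, kReading };
  State state_ = State::kUnopened;
  std::vector<std::vector<Row>> segments_;
};
```

In this model, any write attempted after close() throws, which is one plausible way to enforce the read-only phase.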
Implements turi::swriter_base< sframe_output_iterator >.
|
inline |
Returns the column index of column_name.
Throws an exception if the column does not exist.
Definition at line 457 of file sframe.hpp.
|
inline |
Returns the name of the given column. Throws an exception if the column id is out of range.
Definition at line 362 of file sframe.hpp.
|
inline |
Returns the column names as a single vector.
Definition at line 401 of file sframe.hpp.
|
inline |
Returns the type of the given column. Throws an exception if the column id is out of range.
Definition at line 374 of file sframe.hpp.
|
inline |
Returns the type of the given column. Throws an exception if the column id is out of range. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Definition at line 394 of file sframe.hpp.
|
inline |
Returns the column types as a single vector.
Definition at line 407 of file sframe.hpp.
|
inline |
Returns true if the sframe contains the given column.
Definition at line 418 of file sframe.hpp.
void turi::sframe::debug_print | ( | ) |
For debugging purposes, prints information about the sframe.
void turi::sframe::flush_write_to_segment | ( | size_t | segment | ) |
Flushes writes for a particular segment.
|
inline |
Returns the index file of the sframe.
Definition at line 308 of file sframe.hpp.
|
inline |
Returns the current index info of the array.
Definition at line 472 of file sframe.hpp.
|
inline |
Internal API. Used to obtain the internal writer object.
Definition at line 691 of file sframe.hpp.
|
inline |
Reads the value of a key associated with the sframe. Returns true on success, false on failure.
Definition at line 318 of file sframe.hpp.
|
inline |
Reads the value of a key associated with the sframe. Returns a pair of (true, value) on success, and (false, empty_string) on failure.
Definition at line 330 of file sframe.hpp.
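The two return conventions of get_metadata can be compared in a standalone sketch. The Metadata alias and both free functions below are hypothetical stand-ins operating on a std::map. They only mirror the documented return shapes, not the sframe's storage.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the key/value metadata of a frame.
using Metadata = std::map<std::string, std::string>;

// Out-parameter form: returns true and fills val on success.
bool get_metadata(const Metadata& meta, const std::string& key,
                  std::string& val) {
  auto it = meta.find(key);
  if (it == meta.end()) return false;
  val = it->second;
  return true;
}

// Pair form: (true, value) on success, (false, "") on failure.
std::pair<bool, std::string> get_metadata(const Metadata& meta,
                                          const std::string& key) {
  std::string val;
  bool found = get_metadata(meta, key, val);
  return {found, val};
}
```

The pair form avoids declaring an output variable up front, at the cost of returning an empty string that the caller must ignore on failure.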
|
virtual |
Gets an output iterator for the given segment. This can be used to write data to the segment, and is currently the only supported way to do so.
The iterator is invalid once the segment is closed (See close). Accessing the iterator after the writer is destroyed is undefined behavior.
Cannot be called until the sframe is open.
Example:
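The original example is not reproduced on this page. As a rough illustration of the output-iterator pattern, the sketch below uses a hypothetical SegmentedWriter built on std::back_insert_iterator, with Row standing in for std::vector<flexible_type>. It is not the Turi Create implementation, only the same assign-then-advance idiom.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Stand-in for std::vector<flexible_type>.
using Row = std::vector<std::string>;

// Hypothetical segmented writer; not the Turi Create sframe.
class SegmentedWriter {
 public:
  using iterator = std::back_insert_iterator<std::vector<Row>>;

  explicit SegmentedWriter(std::size_t nsegments) : segments_(nsegments) {}

  // Returns an output iterator that appends rows to one segment.
  iterator get_output_iterator(std::size_t segmentid) {
    return std::back_inserter(segments_.at(segmentid));
  }

  std::size_t segment_length(std::size_t i) const {
    return segments_.at(i).size();
  }

 private:
  std::vector<std::vector<Row>> segments_;
};

// Writes `count` copies of `row` to segment 0 through the iterator.
void write_repeated(SegmentedWriter& writer, const Row& row,
                    std::size_t count) {
  auto iter = writer.get_output_iterator(0);
  for (std::size_t i = 0; i < count; ++i) {
    *iter = row;  // assigning through the iterator writes one row
    ++iter;       // advance to the next output position
  }
}
```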
Implements turi::swriter_base< sframe_output_iterator >.
std::unique_ptr<reader_type> turi::sframe::get_reader | ( | ) | const |
Gets an sframe reader object with the segment layout of the first column.
std::unique_ptr<reader_type> turi::sframe::get_reader | ( | size_t | num_segments | ) | const |
Gets an sframe reader object with num_segments number of logical segments.
std::unique_ptr<reader_type> turi::sframe::get_reader | ( | const std::vector< size_t > & | segment_lengths | ) | const |
Gets an sframe reader object with a custom segment layout. segment_lengths must sum up to the same length as the original array.
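The constraint on custom layouts (segment lengths must sum to the total length) can be checked with a few lines of standalone code. Both functions below are hypothetical helpers. In particular, the even split used for a given number of logical segments is an assumption, since the actual partitioning scheme is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Assumed even split: the first (num_rows % num_segments) segments
// each receive one extra row.
std::vector<std::size_t> even_segment_lengths(std::size_t num_rows,
                                              std::size_t num_segments) {
  if (num_segments == 0) throw std::invalid_argument("num_segments must be > 0");
  std::vector<std::size_t> lengths(num_segments, num_rows / num_segments);
  for (std::size_t i = 0; i < num_rows % num_segments; ++i) ++lengths[i];
  return lengths;
}

// Validates a custom layout and returns the starting row of each segment.
std::vector<std::size_t> segment_offsets(
    const std::vector<std::size_t>& segment_lengths, std::size_t num_rows) {
  std::size_t total = std::accumulate(segment_lengths.begin(),
                                      segment_lengths.end(), std::size_t{0});
  if (total != num_rows) {
    throw std::invalid_argument("segment lengths must sum to the number of rows");
  }
  std::vector<std::size_t> offsets;
  std::size_t start = 0;
  for (std::size_t len : segment_lengths) {
    offsets.push_back(start);
    start += len;
  }
  return offsets;
}
```

For 10 rows and 3 segments, this sketch produces lengths {4, 3, 3} with starting rows {0, 4, 7}.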
std::map<std::string, std::shared_ptr<sarray<flexible_type> > > turi::sframe::init_from_csvs | ( | const std::string & | path, |
csv_line_tokenizer & | tokenizer, | ||
bool | use_header, | ||
bool | continue_on_failure, | ||
bool | store_errors, | ||
std::map< std::string, flex_type_enum > | column_type_hints, | ||
std::vector< std::string > | output_columns = std::vector< std::string >() , |
||
size_t | row_limit = 0 , |
||
size_t | skip_rows = 0 |
||
) |
Constructs an SFrame from a csv file.
All columns will be parsed into flex_string unless the column type is specified in the column_type_hints.
path | The URL of the CSV file. The URL can point to the local filesystem, HDFS, or S3. |
tokenizer | The tokenization rules to use |
use_header | If true, the first line will be parsed as column headers. Otherwise, R-style column names, i.e. X1, X2, X3... will be used. |
continue_on_failure | If true, lines with parsing errors will be skipped. |
column_type_hints | A map from column name to the column type. |
output_columns | The subset of column names to output |
row_limit | If non-zero, the maximum number of rows to read |
skip_rows | If non-zero, the number of lines to skip at the start of each file |
Throws an exception on an IO error or a CSV parse failure.
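How use_header, row_limit, and skip_rows interact can be shown with a simplified, in-memory parser. Everything below is a hypothetical sketch. It splits on commas only (no quoting or escaping, and trailing empty fields are dropped), it assumes skipped lines come before the header, it always skips malformed rows as if continue_on_failure were true, and it keeps every value as a string, matching the documented default when no type hint is given.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct ParsedCsv {
  std::vector<std::string> column_names;
  std::vector<std::vector<std::string>> rows;
};

// Splits on commas only; a real tokenizer also handles quoting.
std::vector<std::string> split_commas(const std::string& line) {
  std::vector<std::string> fields;
  std::string field;
  std::istringstream in(line);
  while (std::getline(in, field, ',')) fields.push_back(field);
  return fields;
}

ParsedCsv parse_csv_lines(const std::vector<std::string>& lines,
                          bool use_header, std::size_t row_limit = 0,
                          std::size_t skip_rows = 0) {
  ParsedCsv out;
  std::size_t i = skip_rows;  // assumed: skipped lines precede the header
  if (use_header && i < lines.size()) {
    out.column_names = split_commas(lines[i++]);
  }
  for (; i < lines.size(); ++i) {
    // A row_limit of zero means "no limit".
    if (row_limit != 0 && out.rows.size() >= row_limit) break;
    std::vector<std::string> fields = split_commas(lines[i]);
    if (out.column_names.empty()) {
      // No header: R-style names X1, X2, X3, ...
      for (std::size_t c = 0; c < fields.size(); ++c) {
        out.column_names.push_back("X" + std::to_string(c + 1));
      }
    }
    // Rows with the wrong field count are skipped in this sketch.
    if (fields.size() != out.column_names.size()) continue;
    out.rows.push_back(fields);
  }
  return out;
}
```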
|
inline |
Returns true if the sframe is opened for reading, i.e. get_reader() will succeed.
Definition at line 291 of file sframe.hpp.
|
inline |
Returns true if the sframe is opened for writing, i.e. get_output_iterator() will succeed.
Definition at line 300 of file sframe.hpp.
void turi::sframe::load | ( | iarchive & | iarc | ) |
SFrame deserializer. iarc must be associated with a directory. Loads from the next prefix inside the directory.
|
inline virtual |
Returns the number of segments that this SFrame will be written with. Never fails.
Implements turi::swriter_base< sframe_output_iterator >.
Definition at line 430 of file sframe.hpp.
|
inline |
Initializes the SFrame with an index_information. If the SFrame is already initialized, this will throw an exception.
Definition at line 215 of file sframe.hpp.
|
inline |
Initializes the SFrame with a collection of columns. If the SFrame is already initialized, this will throw an exception. Will throw an exception if column_names are not unique and fail_on_column_names is true.
Definition at line 228 of file sframe.hpp.
|
inline |
Opens the SFrame with an arbitrary temporary file. The sframe must not already have been initialized.
column_names | The name for each column. If the vector is shorter than column_types, or empty values are given, default names of "X<column id+1>" are used. Each column name must be unique. This will let you write non-unique column names, but if you do that, the sframe will throw an exception while constructing the output of this class. |
column_types | The type of each column expressed as a flex_type_enum. Currently this is required to tell how many columns are a part of the sframe. Throws an exception if this is an empty vector. |
nsegments | The number of parallel output segments on each sarray. Throws an exception if this is 0. |
frame_sidx_file | If not specified, an arbitrary temporary file will be created. Otherwise, all frame files will be written to the same location as the frame_sidx_file. Must end in ".frame_idx" |
fail_on_column_names | If true, will throw an exception if any column names are not unique. If false, will automatically adjust column names so they are unique. |
Definition at line 265 of file sframe.hpp.
Assignment operator. If the source frame is opened for writing, this will throw an exception. Otherwise, this will create a frame opened for reading, which shares column arrays with the source frame.
Move Assignment operator. Moves other into this. Other will be cleared as if it is a newly constructed sframe object.
sframe turi::sframe::remove_column | ( | size_t | column_id | ) | const |
Returns a new ephemeral SFrame with the column removed. The new sframe is "ephemeral" in that it is not backed by an index on disk.
column_id | The index of the column to remove. |
sframe turi::sframe::replace_column | ( | std::shared_ptr< sarray< flexible_type >> | sarr_ptr, |
const std::string & | column_name | ||
) | const |
Replaces the column with the given column name with a new sarray. Returns a new sframe in which the sarray previously stored under column_name is replaced by the new sarray.
void turi::sframe::save | ( | std::string | index_file | ) | const |
Saves a copy of the current sframe into a different location. Does not modify the current sframe.
void turi::sframe::save | ( | oarchive & | oarc | ) | const |
SFrame serializer. oarc must be associated with a directory. Saves into a prefix inside the directory.
void turi::sframe::save_as_csv | ( | std::string | csv_file, |
csv_writer & | writer | ||
) |
Saves a copy of the current sframe as a CSV file. Does not modify the current sframe.
csv_file | target CSV file to save into |
writer | The CSV writer configuration |
|
inline |
Returns the number of segments in the collection. Will throw an exception if the writer is invalid (there is an error opening/writing files).
Definition at line 445 of file sframe.hpp.
std::shared_ptr<sarray<flexible_type> > turi::sframe::select_column | ( | size_t | column_id | ) | const |
Returns an sarray of the specific column.
Throws an exception if the column does not exist.
std::shared_ptr<sarray<flexible_type> > turi::sframe::select_column | ( | const std::string & | name | ) | const |
Returns an sarray of the specific column by name.
Throws an exception if the column does not exist.
sframe turi::sframe::select_columns | ( | const std::vector< std::string > & | names | ) | const |
Returns new sframe containing only the chosen columns in the same order. The new sframe is "ephemeral" in that it is not backed by an index on disk.
Throws an exception if the column name does not exist.
void turi::sframe::set_column_name | ( | size_t | column_id, |
const std::string & | name | ||
) |
Set the ith column name to name. This can be done when the frame is open in either reading or writing mode. Changes are ephemeral, and do not affect what is stored on disk.
bool turi::sframe::set_metadata | ( | const std::string & | key, |
std::string | val | ||
) |
Adds metadata to the frame. The frame must first be opened for writing.
|
virtual |
Sets the number of segments in the output. The frame must first be opened for writing. Once an output iterator has been obtained, the number of segments can no longer be changed. Returns true on success, false on failure.
Implements turi::swriter_base< sframe_output_iterator >.
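The rule that the segment count is frozen once an output iterator has been obtained can be modeled in a few lines. SegmentConfig is a hypothetical class that only tracks the relevant flags. Rejecting a count of zero is an additional assumption, not something stated above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the segment-count rules for a writer.
class SegmentConfig {
 public:
  void open_for_write() { open_for_write_ = true; }

  // Only records that an iterator was handed out.
  void get_output_iterator() { iterator_obtained_ = true; }

  // Succeeds only while open for write and before any iterator exists.
  bool set_num_segments(std::size_t numseg) {
    if (!open_for_write_ || iterator_obtained_ || numseg == 0) return false;
    num_segments_ = numseg;
    return true;
  }

  std::size_t num_segments() const { return num_segments_; }

 private:
  bool open_for_write_ = false;
  bool iterator_obtained_ = false;
  std::size_t num_segments_ = 1;
};
```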
|
inline |
Returns the number of elements in the sframe. If the sframe was not initialized, returns 0.
Definition at line 354 of file sframe.hpp.
sframe turi::sframe::swap_columns | ( | size_t | column_1, |
size_t | column_2 | ||
) | const |
Returns a new ephemeral SFrame with two columns swapped. The new sframe is "ephemeral" in that it is not backed by an index on disk.
column_1 | The index of the first column. |
column_2 | The index of the second column. |
dataframe_t turi::sframe::to_dataframe | ( | ) |
Converts the sframe into a dataframe_t. Will reset iterators before and after the operation.
void turi::sframe::try_compact | ( | ) |
Attempts to compact if the number of segments in the SArray exceeds SFRAME_COMPACTION_THRESHOLD.