Turi Create  4.0
turi::sframe Class Reference

#include <core/storage/sframe_data/sframe.hpp>

Public Types

typedef sframe_reader reader_type
 The reader type.
 
typedef sframe_output_iterator iterator
 The iterator type which get_output_iterator returns.
 
typedef std::vector< flexible_type > value_type
 The type contained in the sframe.
 

Public Member Functions

 sframe ()
 
 sframe (const sframe &other)
 
 sframe (sframe &&other)
 
sframe & operator= (const sframe &other)
 
sframe & operator= (sframe &&other)
 
 sframe (std::string frame_idx_file)
 
 sframe (sframe_index_file_information frame_index_info)
 
 sframe (const std::vector< std::shared_ptr< sarray< flexible_type > > > &new_columns, const std::vector< std::string > &column_names={}, bool fail_on_column_names=true)
 
std::map< std::string, std::shared_ptr< sarray< flexible_type > > > init_from_csvs (const std::string &path, csv_line_tokenizer &tokenizer, bool use_header, bool continue_on_failure, bool store_errors, std::map< std::string, flex_type_enum > column_type_hints, std::vector< std::string > output_columns=std::vector< std::string >(), size_t row_limit=0, size_t skip_rows=0)
 
 sframe (const dataframe_t &data)
 
void open_for_read (sframe_index_file_information frame_index_info)
 
void open_for_read (const std::vector< std::shared_ptr< sarray< flexible_type > > > &new_columns, const std::vector< std::string > &column_names={}, bool fail_on_column_names=true)
 
void open_for_write (const std::vector< std::string > &column_names, const std::vector< flex_type_enum > &column_types, const std::string &frame_sidx_file="", size_t nsegments=SFRAME_DEFAULT_NUM_SEGMENTS, bool fail_on_column_names=true)
 
bool is_opened_for_read () const
 
bool is_opened_for_write () const
 
const std::string & get_index_file () const
 
bool get_metadata (const std::string &key, std::string &val) const
 
std::pair< bool, std::string > get_metadata (const std::string &key) const
 
size_t num_columns () const
 Returns the number of columns in the SFrame. Does not throw.
 
size_t num_rows () const
 Returns the length of each sarray.
 
size_t size () const
 
std::string column_name (size_t i) const
 
flex_type_enum column_type (size_t i) const
 
flex_type_enum column_type (const std::string &column_name) const
 
const std::vector< std::string > & column_names () const
 
std::vector< flex_type_enum > column_types () const
 
bool contains_column (const std::string &column_name) const
 
size_t num_segments () const
 
size_t segment_length (size_t i) const
 
size_t column_index (const std::string &column_name) const
 
const sframe_index_file_information get_index_info () const
 
sframe append (const sframe &other) const
 
std::unique_ptr< reader_type > get_reader () const
 
std::unique_ptr< reader_type > get_reader (size_t num_segments) const
 
std::unique_ptr< reader_type > get_reader (const std::vector< size_t > &segment_lengths) const
 
dataframe_t to_dataframe ()
 
std::shared_ptr< sarray< flexible_type > > select_column (size_t column_id) const
 
std::shared_ptr< sarray< flexible_type > > select_column (const std::string &name) const
 
sframe select_columns (const std::vector< std::string > &names) const
 
sframe add_column (std::shared_ptr< sarray< flexible_type > > sarr_ptr, const std::string &column_name=std::string("")) const
 
void set_column_name (size_t column_id, const std::string &name)
 
sframe remove_column (size_t column_id) const
 
sframe swap_columns (size_t column_1, size_t column_2) const
 
sframe replace_column (std::shared_ptr< sarray< flexible_type >> sarr_ptr, const std::string &column_name) const
 
bool set_num_segments (size_t numseg)
 
iterator get_output_iterator (size_t segmentid)
 
void close ()
 
void flush_write_to_segment (size_t segment)
 
void save_as_csv (std::string csv_file, csv_writer &writer)
 
bool set_metadata (const std::string &key, std::string val)
 
void save (std::string index_file) const
 
void save (oarchive &oarc) const
 
void try_compact ()
 
void load (iarchive &iarc)
 
std::shared_ptr< sarray_group_format_writer< flexible_type > > get_internal_writer ()
 
void debug_print ()
 

Detailed Description

The SFrame is an immutable object that represents a table with rows and columns. Each column is an sarray<flexible_type>, a sequence of flexible_type values split into segments. The sframe writes one sarray to disk for each column of data it is given, each with a prefix that extends the prefix given to open. The SFrame is referenced on disk by a single ".frame_idx" file, which lists the file names, one file per column.

The SFrame is write-once, read-many. The SFrame can be opened for writing once, after which it is read-only.

Since each column of the SFrame is an independent sarray, held as an independent shared_ptr<sarray<flexible_type> > object, columns can be added or removed to form new sframes without problems. As such, the results of certain operations (such as the object returned by add_column) can be "ephemeral" in that there is no .frame_idx file on disk backing them. An "ephemeral" frame can be identified by checking the result of get_index_file(). If this is empty, it is an ephemeral frame.
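The ephemeral check described above can be sketched as a small helper. This is an illustrative sketch, not code from the library: is_ephemeral and fake_frame are hypothetical names, and the only member relied on is get_index_file(), which is documented on this page. The stand-in type exists so the sketch compiles without the Turi Create headers.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: works with any type exposing get_index_file(),
// which turi::sframe documents as returning the index file path.
template <typename Frame>
bool is_ephemeral(const Frame& frame) {
  // An empty index file path means no .frame_idx file backs the frame.
  return frame.get_index_file().empty();
}

// Stand-in type for illustration only; it is not turi::sframe.
struct fake_frame {
  std::string index_file;
  const std::string& get_index_file() const { return index_file; }
};

// Usage with the stand-in type.
inline void ephemeral_example() {
  fake_frame on_disk{"/tmp/example.frame_idx"};
  fake_frame in_memory{""};
  assert(!is_ephemeral(on_disk));
  assert(is_ephemeral(in_memory));
}
```

Writing the helper as a template keeps it independent of the concrete frame type, so the same function should apply to a real sframe as long as get_index_file() behaves as documented.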

The interface of the SFrame closely matches that of the sarray, as if the sarray's stored type were std::vector<flexible_type>. The SFrame, however, also provides a number of other capabilities such as CSV parsing and construction from sarrays.

Definition at line 67 of file sframe.hpp.

Constructor & Destructor Documentation

◆ sframe() [1/7]

turi::sframe::sframe ( )
inline

Default constructor. Does nothing; use open_for_read or open_for_write after construction to read or create an sframe.

Definition at line 88 of file sframe.hpp.

◆ sframe() [2/7]

turi::sframe::sframe ( const sframe &  other)

Copy constructor. If the source frame is opened for writing, this will throw an exception. Otherwise, this will create a frame opened for reading, which shares column arrays with the source frame.

◆ sframe() [3/7]

turi::sframe::sframe ( sframe &&  other)
inline

Move constructor.

Definition at line 102 of file sframe.hpp.

◆ sframe() [4/7]

turi::sframe::sframe ( std::string  frame_idx_file)
inline explicit

Attempts to construct an sframe which reads from the given frame index file. This should be a .frame_idx file. If the index cannot be opened, an exception is thrown.

Definition at line 128 of file sframe.hpp.

◆ sframe() [5/7]

turi::sframe::sframe ( sframe_index_file_information  frame_index_info)
inline explicit

Construct an sframe from sframe index information.

Definition at line 136 of file sframe.hpp.

◆ sframe() [6/7]

turi::sframe::sframe ( const std::vector< std::shared_ptr< sarray< flexible_type > > > &  new_columns,
const std::vector< std::string > &  column_names = {},
bool  fail_on_column_names = true 
)
inline explicit

Constructs an SFrame from a vector of Sarrays.

Parameters
new_columns  List of sarrays to form as columns.
column_names  List of the names for each column, with indices corresponding to the list of columns. If column_names is shorter than new_columns, the remaining columns get default names. For example, if four columns are given and column_names = {"id", "num"}, the columns will be named {"id", "num", "X3", "X4"}. Entries that are zero-length strings will also be given a default name.
fail_on_column_names  If true, will throw an exception if any column names are not unique. If false, will automatically adjust column names so they are unique.

Throws an exception if any column names are not unique (if fail_on_column_names is true), or if the number of segments, segment sizes, or total sizes of each sarray is not equal. The constructed SFrame is ephemeral, and is not backed by a disk index.

Definition at line 159 of file sframe.hpp.
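The default-naming rule described for column_names can be sketched as a standalone function. This is a minimal sketch of the documented rule, not the library's implementation: fill_default_column_names is a hypothetical helper, and treating "more names than columns" as a precondition violation is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of turi::sframe. Missing or empty names
// are replaced by "X" followed by the 1-based column position.
std::vector<std::string> fill_default_column_names(
    std::size_t num_columns, const std::vector<std::string>& column_names) {
  // Assumption of this sketch: callers never pass more names than columns.
  assert(column_names.size() <= num_columns);
  std::vector<std::string> result(num_columns);
  for (std::size_t i = 0; i < num_columns; ++i) {
    const bool has_name = i < column_names.size() && !column_names[i].empty();
    result[i] = has_name ? column_names[i] : "X" + std::to_string(i + 1);
  }
  return result;
}
```

Called with four columns and the names {"id", "num"}, this returns {"id", "num", "X3", "X4"}, matching the example in the parameter description.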

◆ sframe() [7/7]

turi::sframe::sframe ( const dataframe_t &  data)

Constructs an SFrame from dataframe_t.

Note
Throws an exception if the dataframe contains undefined values (e.g. in sparse rows).

Member Function Documentation

◆ add_column()

sframe turi::sframe::add_column ( std::shared_ptr< sarray< flexible_type > >  sarr_ptr,
const std::string &  column_name = std::string("") 
) const

Returns a new ephemeral SFrame with the new column added to the end. The new sframe is "ephemeral" in that it is not backed by an index on disk.

Parameters
sarr_ptr  Shared pointer to the SArray.
column_name  The name to give this column. If empty, it will be given a default name (X<column index>).

◆ append()

sframe turi::sframe::append ( const sframe &  other) const

Merges another SFrame with the same schema with the current SFrame returning a new SFrame. Both SFrames can be empty, but cannot be opened for writing.

◆ close()

void turi::sframe::close ( )
virtual

Closes the sframe. close() also implicitly closes all segments. After the writer is closed, no segments can be written. After the sframe is closed, it becomes read only and can be read with the get_reader() function.

Implements turi::swriter_base< sframe_output_iterator >.

◆ column_index()

size_t turi::sframe::column_index ( const std::string &  column_name) const
inline

Returns the column index of column_name.

Throws an exception if the column does not exist.

Definition at line 457 of file sframe.hpp.

◆ column_name()

std::string turi::sframe::column_name ( size_t  i) const
inline

Returns the name of the given column. Throws an exception if the column id is out of range.

Definition at line 362 of file sframe.hpp.

◆ column_names()

const std::vector<std::string>& turi::sframe::column_names ( ) const
inline

Returns the column names as a single vector.

Definition at line 401 of file sframe.hpp.

◆ column_type() [1/2]

flex_type_enum turi::sframe::column_type ( size_t  i) const
inline

Returns the type of the given column. Throws an exception if the column id is out of range.

Definition at line 374 of file sframe.hpp.

◆ column_type() [2/2]

flex_type_enum turi::sframe::column_type ( const std::string &  column_name) const
inline

Returns the type of the given column. Throws an exception if the column id is out of range. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 394 of file sframe.hpp.

◆ column_types()

std::vector<flex_type_enum> turi::sframe::column_types ( ) const
inline

Returns the column types as a single vector.

Definition at line 407 of file sframe.hpp.

◆ contains_column()

bool turi::sframe::contains_column ( const std::string &  column_name) const
inline

Returns true if the sframe contains the given column.

Definition at line 418 of file sframe.hpp.

◆ debug_print()

void turi::sframe::debug_print ( )

For debug purpose, print the information about the sframe.

◆ flush_write_to_segment()

void turi::sframe::flush_write_to_segment ( size_t  segment)

Flush writes for a particular segment

◆ get_index_file()

const std::string& turi::sframe::get_index_file ( ) const
inline

Return the index file of the sframe

Definition at line 308 of file sframe.hpp.

◆ get_index_info()

const sframe_index_file_information turi::sframe::get_index_info ( ) const
inline

Returns the current index info of the array.

Definition at line 472 of file sframe.hpp.

◆ get_internal_writer()

std::shared_ptr<sarray_group_format_writer<flexible_type> > turi::sframe::get_internal_writer ( )
inline

Internal API. Used to obtain the internal writer object.

Definition at line 691 of file sframe.hpp.

◆ get_metadata() [1/2]

bool turi::sframe::get_metadata ( const std::string &  key,
std::string &  val 
) const
inline

Reads the value of a key associated with the sframe. Returns true on success, false on failure.

Definition at line 318 of file sframe.hpp.

◆ get_metadata() [2/2]

std::pair<bool, std::string> turi::sframe::get_metadata ( const std::string &  key) const
inline

Reads the value of a key associated with the sframe. Returns a pair of (true, value) on success, and (false, empty_string) on failure.

Definition at line 330 of file sframe.hpp.
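As an illustration of the pair-returning overload, a lookup with a fallback value might be sketched as follows. The names metadata_or and fake_frame are hypothetical; only the get_metadata(key) signature and its (found, value) return convention come from this page. The stand-in type is there so the sketch compiles without the Turi Create headers.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: returns the metadata value for key, or fallback
// when the lookup reports failure through the first pair element.
template <typename Frame>
std::string metadata_or(const Frame& frame, const std::string& key,
                        const std::string& fallback) {
  const std::pair<bool, std::string> result = frame.get_metadata(key);
  return result.first ? result.second : fallback;
}

// Stand-in type for illustration only; it is not turi::sframe.
struct fake_frame {
  std::map<std::string, std::string> entries;
  std::pair<bool, std::string> get_metadata(const std::string& key) const {
    const auto it = entries.find(key);
    if (it == entries.end()) return {false, std::string()};
    return {true, it->second};
  }
};

// Usage with the stand-in type.
inline void metadata_example() {
  fake_frame frame;
  frame.entries["author"] = "alice";
  assert(metadata_or(frame, "author", "unknown") == "alice");
  assert(metadata_or(frame, "license", "unknown") == "unknown");
}
```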

◆ get_output_iterator()

iterator turi::sframe::get_output_iterator ( size_t  segmentid)
virtual

Gets an output iterator for the given segment. This can be used to write data to the segment, and is currently the only supported way to do so.

The iterator is invalid once the segment is closed (See close). Accessing the iterator after the writer is destroyed is undefined behavior.

Cannot be called until the sframe is open.

Example:

// example to write the same vector to 7 rows of segment 1
// let's say the sframe has 5 columns of type FLEX_TYPE_ENUM::INTEGER
// and sfw is the sframe.
auto iter = sfw.get_output_iterator(1);
std::vector<flexible_type> vals{1,2,3,4,5};
for(int i = 0; i < 7; ++i) {
  *iter = vals;
  ++iter;
}

Implements turi::swriter_base< sframe_output_iterator >.

◆ get_reader() [1/3]

std::unique_ptr<reader_type> turi::sframe::get_reader ( ) const

Gets an sframe reader object with the segment layout of the first column.

◆ get_reader() [2/3]

std::unique_ptr<reader_type> turi::sframe::get_reader ( size_t  num_segments) const

Gets an sframe reader object with num_segments number of logical segments.

◆ get_reader() [3/3]

std::unique_ptr<reader_type> turi::sframe::get_reader ( const std::vector< size_t > &  segment_lengths) const

Gets an sframe reader object with a custom segment layout. segment_lengths must sum up to the same length as the original array.
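The constraint on segment_lengths can be checked before requesting a reader. The sketch below is illustrative only: both helpers are hypothetical, and the even-split policy in even_segment_layout is an assumption of this sketch, since this page does not say how get_reader(num_segments) divides rows among logical segments.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical check: a custom layout is valid when its lengths sum to
// the total number of rows.
bool is_valid_segment_layout(const std::vector<std::size_t>& segment_lengths,
                             std::size_t num_rows) {
  return std::accumulate(segment_lengths.begin(), segment_lengths.end(),
                         std::size_t{0}) == num_rows;
}

// Hypothetical helper: splits num_rows as evenly as possible. The first
// (num_rows % num_segments) segments each receive one extra row.
std::vector<std::size_t> even_segment_layout(std::size_t num_rows,
                                             std::size_t num_segments) {
  assert(num_segments > 0);
  std::vector<std::size_t> lengths(num_segments, num_rows / num_segments);
  for (std::size_t i = 0; i < num_rows % num_segments; ++i) {
    ++lengths[i];
  }
  return lengths;
}
```

For 10 rows and 4 segments, even_segment_layout yields {3, 3, 2, 2}, which passes is_valid_segment_layout for 10 rows.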

◆ init_from_csvs()

std::map<std::string, std::shared_ptr<sarray<flexible_type> > > turi::sframe::init_from_csvs ( const std::string &  path,
csv_line_tokenizer &  tokenizer,
bool  use_header,
bool  continue_on_failure,
bool  store_errors,
std::map< std::string, flex_type_enum >  column_type_hints,
std::vector< std::string >  output_columns = std::vector< std::string >(),
size_t  row_limit = 0,
size_t  skip_rows = 0 
)

Constructs an SFrame from a csv file.

All columns will be parsed into flex_string unless the column type is specified in the column_type_hints.

Parameters
path  The URL of the CSV file. The URL can point to the local filesystem, HDFS, or S3.
tokenizer  The tokenization rules to use.
use_header  If true, the first line will be parsed as column headers. Otherwise, R-style column names, i.e. X1, X2, X3... will be used.
continue_on_failure  If true, lines with parsing errors will be skipped.
column_type_hints  A map from column name to the column type.
output_columns  The subset of column names to output.
row_limit  If non-zero, the maximum number of rows to read.
skip_rows  If non-zero, the number of lines to skip at the start of each file.

Throws an exception on an IO error or if CSV parsing fails.

◆ is_opened_for_read()

bool turi::sframe::is_opened_for_read ( ) const
inline

Returns true if the SFrame is opened for reading, i.e. get_reader() will succeed.

Definition at line 291 of file sframe.hpp.

◆ is_opened_for_write()

bool turi::sframe::is_opened_for_write ( ) const
inline

Returns true if the SFrame is opened for writing, i.e. get_output_iterator() will succeed.

Definition at line 300 of file sframe.hpp.

◆ load()

void turi::sframe::load ( iarchive &  iarc)

SFrame deserializer. iarc must be associated with a directory. Loads from the next prefix inside the directory.

◆ num_segments()

size_t turi::sframe::num_segments ( ) const
inline virtual

Returns the number of segments that this SFrame will be written with. Never fails.

Implements turi::swriter_base< sframe_output_iterator >.

Definition at line 430 of file sframe.hpp.

◆ open_for_read() [1/2]

void turi::sframe::open_for_read ( sframe_index_file_information  frame_index_info)
inline

Initializes the SFrame with an index_information. If the SFrame is already initialized, this will throw an exception.

Definition at line 215 of file sframe.hpp.

◆ open_for_read() [2/2]

void turi::sframe::open_for_read ( const std::vector< std::shared_ptr< sarray< flexible_type > > > &  new_columns,
const std::vector< std::string > &  column_names = {},
bool  fail_on_column_names = true 
)
inline

Initializes the SFrame with a collection of columns. If the SFrame is already initialized, this will throw an exception. Will throw an exception if column_names are not unique and fail_on_column_names is true.

Definition at line 228 of file sframe.hpp.

◆ open_for_write()

void turi::sframe::open_for_write ( const std::vector< std::string > &  column_names,
const std::vector< flex_type_enum > &  column_types,
const std::string &  frame_sidx_file = "",
size_t  nsegments = SFRAME_DEFAULT_NUM_SEGMENTS,
bool  fail_on_column_names = true 
)
inline

Opens the SFrame with an arbitrary temporary file. The SFrame must not already have been initialized.

Parameters
column_names  The name for each column. If the vector is shorter than column_types, or empty values are given, the missing names are filled in with default names of "X<column id+1>". Each column name must be unique. This will let you write non-unique column names, but if you do that, the sframe will throw an exception while constructing the output of this class.
column_types  The type of each column expressed as a flexible_type. Currently this is required to tell how many columns are a part of the sframe. Throws an exception if this is an empty vector.
nsegments  The number of parallel output segments on each sarray. Throws an exception if this is 0.
frame_sidx_file  If not specified, an arbitrary temporary file will be created. Otherwise, all frame files will be written to the same location as the frame_sidx_file. Must end in ".frame_idx".
fail_on_column_names  If true, will throw an exception if any column names are not unique. If false, will automatically adjust column names so they are unique.

Definition at line 265 of file sframe.hpp.
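The two behaviors selected by fail_on_column_names can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's implementation: make_unique_column_names is a hypothetical helper, and the ".1", ".2" suffix scheme is invented for this sketch, since this page does not say how duplicate names are adjusted.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of turi::sframe.
std::vector<std::string> make_unique_column_names(
    const std::vector<std::string>& names, bool fail_on_column_names) {
  std::set<std::string> seen;
  std::vector<std::string> result;
  result.reserve(names.size());
  for (const std::string& name : names) {
    std::string candidate = name;
    if (seen.count(candidate) != 0) {
      if (fail_on_column_names) {
        throw std::invalid_argument("duplicate column name: " + name);
      }
      // Assumed scheme: append ".1", ".2", ... until the name is unused.
      for (int suffix = 1; seen.count(candidate) != 0; ++suffix) {
        candidate = name + "." + std::to_string(suffix);
      }
    }
    seen.insert(candidate);
    result.push_back(candidate);
  }
  assert(seen.size() == result.size());  // every emitted name is distinct
  return result;
}
```

With the names {"a", "b", "a"} and fail_on_column_names set to false, this sketch returns {"a", "b", "a.1"}; with it set to true, the same input throws std::invalid_argument.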

◆ operator=() [1/2]

sframe& turi::sframe::operator= ( const sframe &  other)

Assignment operator. If the source frame is opened for writing, this will throw an exception. Otherwise, this will create a frame opened for reading, which shares column arrays with the source frame.

◆ operator=() [2/2]

sframe& turi::sframe::operator= ( sframe &&  other)

Move Assignment operator. Moves other into this. Other will be cleared as if it is a newly constructed sframe object.

◆ remove_column()

sframe turi::sframe::remove_column ( size_t  column_id) const

Returns a new ephemeral SFrame with the column removed. The new sframe is "ephemeral" in that it is not backed by an index on disk.

Parameters
column_id  The index of the column to remove.

◆ replace_column()

sframe turi::sframe::replace_column ( std::shared_ptr< sarray< flexible_type >>  sarr_ptr,
const std::string &  column_name 
) const

Replace the column of the given column name with a new sarray. Return the new sframe with old column_name sarray replaced by the new sarray.

◆ save() [1/2]

void turi::sframe::save ( std::string  index_file) const

Saves a copy of the current sframe into a different location. Does not modify the current sframe.

◆ save() [2/2]

void turi::sframe::save ( oarchive &  oarc) const

SFrame serializer. oarc must be associated with a directory. Saves into a prefix inside the directory.

◆ save_as_csv()

void turi::sframe::save_as_csv ( std::string  csv_file,
csv_writer &  writer 
)

Saves a copy of the current sframe as a CSV file. Does not modify the current sframe.

Parameters
csv_file  Target CSV file to save into.
writer  The CSV writer configuration.

◆ segment_length()

size_t turi::sframe::segment_length ( size_t  i) const
inline

Returns the length of segment i. Will throw an exception if the writer is invalid (there is an error opening/writing files).

Definition at line 445 of file sframe.hpp.

◆ select_column() [1/2]

std::shared_ptr<sarray<flexible_type> > turi::sframe::select_column ( size_t  column_id) const

Returns an sarray of the specific column.

Throws an exception if the column does not exist.

◆ select_column() [2/2]

std::shared_ptr<sarray<flexible_type> > turi::sframe::select_column ( const std::string &  name) const

Returns an sarray of the specific column by name.

Throws an exception if the column does not exist.

◆ select_columns()

sframe turi::sframe::select_columns ( const std::vector< std::string > &  names) const

Returns new sframe containing only the chosen columns in the same order. The new sframe is "ephemeral" in that it is not backed by an index on disk.

Throws an exception if the column name does not exist.

◆ set_column_name()

void turi::sframe::set_column_name ( size_t  column_id,
const std::string &  name 
)

Set the ith column name to name. This can be done when the frame is open in either reading or writing mode. Changes are ephemeral, and do not affect what is stored on disk.

◆ set_metadata()

bool turi::sframe::set_metadata ( const std::string &  key,
std::string  val 
)

Adds meta data to the frame. Frame must be first opened for writing.

◆ set_num_segments()

bool turi::sframe::set_num_segments ( size_t  numseg)
virtual

Sets the number of segments in the output. Frame must be first opened for writing. Once an output iterator has been obtained, the number of segments can no longer be changed. Returns true on success, false on failure.

Implements turi::swriter_base< sframe_output_iterator >.

◆ size()

size_t turi::sframe::size ( ) const
inline

Returns the number of elements in the sframe. If the sframe was not initialized, returns 0.

Definition at line 354 of file sframe.hpp.

◆ swap_columns()

sframe turi::sframe::swap_columns ( size_t  column_1,
size_t  column_2 
) const

Returns a new ephemeral SFrame with two columns swapped. The new sframe is "ephemeral" in that it is not backed by an index on disk.

Parameters
column_1  The index of the first column.
column_2  The index of the second column.

◆ to_dataframe()

dataframe_t turi::sframe::to_dataframe ( )

Converts the sframe into a dataframe_t. Will reset iterators before and after the operation.

◆ try_compact()

void turi::sframe::try_compact ( )

Attempts to compact if the number of segments in the SArray exceeds SFRAME_COMPACTION_THRESHOLD.


The documentation for this class was generated from the following file: