Turi Create  4.0
turi::v2_block_impl::block_manager Class Reference

#include <core/storage/sframe_data/sarray_v2_block_manager.hpp>

Public Member Functions

 block_manager ()
 default constructor.
 
column_address open_column (std::string column_file)
 
void close_column (column_address addr)
 
size_t num_blocks_in_column (column_address addr)
 
const block_info & get_block_info (block_address addr)
 
const std::vector< std::vector< block_info > > & get_all_block_info (size_t segment_id)
 
std::shared_ptr< std::vector< char > > read_block (block_address addr, block_info **ret_info=NULL)
 
bool read_typed_block (block_address addr, std::vector< flexible_type > &ret, block_info **ret_info=NULL)
 
bool read_typed_blocks (block_address addr, size_t nblocks, std::vector< std::vector< flexible_type > > &ret, std::vector< block_info > *ret_info=NULL)
 
template<typename T >
bool read_block (block_address addr, std::vector< T > &ret, block_info **ret_info=NULL)
 

Static Public Member Functions

static block_manager & get_instance ()
 Get singleton instance.
 

Detailed Description

Provides block reading capability in v2 segment files.

This class manages block reading of an SArray/SArray group, and provides functions to query the blocks (such as how many blocks there are in the segment, and how many rows there are in a block).

Array Group

An array group is a collection of segment files which contain and represent a collection of arrays (columns).

Essentially an Array Group comprises the following:

  • group.sidx the group index file. A JSON serialized contents of group_index_file_information. Describes a collection of arrays.
  • group.0000, group.0001, group.0002: each file is one segment of the array group. Multiple segments in an array group exist only to support parallel writing (and appending). On reading, the segment layout is inconsequential, and a logical partitioning across threads is used.

Each segment file internally has the following layout:

  (1) Consecutive block contents, each block 4K aligned.
  (2) A direct serialization of a vector<vector<block_info> > (blocks[column_id][block_id]).
  (3) 8 bytes containing the file offset at which (2) begins.
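The three-part layout can be illustrated with a small in-memory sketch. It is not the library's serializer: the helper names (align_up_4k, build_segment, read_footer_offset) are hypothetical, the metadata is an opaque string standing in for the serialized vector<vector<block_info> >, and the 8-byte footer is assumed to be a native-endian 64-bit integer, which the text above does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: round an offset up to the next 4K boundary.
inline std::uint64_t align_up_4k(std::uint64_t offset) {
  return (offset + 4095) / 4096 * 4096;
}

// Builds a toy segment image: (1) blocks, each starting on a 4K boundary,
// (2) an opaque metadata blob, (3) an 8-byte footer holding the offset of (2).
// ASSUMPTION: the footer is a native-endian uint64_t.
inline std::vector<char> build_segment(const std::vector<std::string>& blocks,
                                       const std::string& metadata) {
  std::vector<char> image;
  for (const std::string& block : blocks) {
    image.resize(align_up_4k(image.size()), 0);  // zero-pad up to the boundary
    image.insert(image.end(), block.begin(), block.end());
  }
  const std::uint64_t metadata_offset = image.size();
  image.insert(image.end(), metadata.begin(), metadata.end());
  char footer[sizeof(std::uint64_t)];
  std::memcpy(footer, &metadata_offset, sizeof(footer));
  image.insert(image.end(), footer, footer + sizeof(footer));
  return image;
}

// Reads the trailing 8 bytes to find where the metadata section begins.
inline std::uint64_t read_footer_offset(const std::vector<char>& image) {
  assert(image.size() >= sizeof(std::uint64_t));
  std::uint64_t offset = 0;
  std::memcpy(&offset, image.data() + image.size() - sizeof(offset),
              sizeof(offset));
  return offset;
}
```

A reader following this scheme would first read the last 8 bytes, then seek to the returned offset to load the block metadata before touching any block.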

For instance, if there are 2 segments with 3 columns each of 20 rows, we may get the following layout:

group.0001:

  • column 0, block 0, 3 rows
  • column 1, block 0, 3 rows
  • column 2, block 0, 4 rows
  • column 0, block 1, 4 rows
  • column 0, block 2, 3 rows
  • column 1, block 1, 7 rows
  • column 2, block 1, 6 rows

group.0002:

  • column 1, block 0, 5 rows
  • column 0, block 0, 10 rows
  • column 1, block 1, 5 rows
  • column 2, block 0, 10 rows

Observe the following:

  1) Each segment contains the same number of rows from each column. (Technically the format does not require this, but the writer will always produce this result.)
  2) Blocks can be of different sizes. (The block_manager and block_writer do not have a block size constraint. The sarray_group_format_writer_v2 tries to keep to a block size of SFRAME_DEFAULT_BLOCK_SIZE after compression, but this is done by performing block size estimation (#bytes written / #rows written). The format itself does not care.)
  3) Blocks can be laid out in arbitrary order across columns. Striping of columns is unnecessary.
  4) Within each segment, the blocks for a given column are consecutive.
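The block size estimation mentioned in point 2 (#bytes written / #rows written) might look roughly like the sketch below. The class name block_size_estimator and its members are hypothetical and are not taken from sarray_group_format_writer_v2; target_block_bytes merely stands in for SFRAME_DEFAULT_BLOCK_SIZE, whose value is not given here.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical estimator: tracks bytes and rows written so far and suggests
// how many rows the next block should hold to land near a target size.
class block_size_estimator {
 public:
  block_size_estimator(std::size_t target_block_bytes, std::size_t initial_rows)
      : target_block_bytes_(target_block_bytes), initial_rows_(initial_rows) {}

  // Record the (post-compression) size and row count of a block just written.
  void record_block(std::size_t bytes, std::size_t rows) {
    bytes_written_ += bytes;
    rows_written_ += rows;
  }

  // Estimated rows per block = target bytes / (bytes written / rows written).
  // Falls back to initial_rows until something has been written.
  std::size_t rows_for_next_block() const {
    if (bytes_written_ == 0 || rows_written_ == 0) return initial_rows_;
    const double bytes_per_row =
        static_cast<double>(bytes_written_) / static_cast<double>(rows_written_);
    const std::size_t rows =
        static_cast<std::size_t>(static_cast<double>(target_block_bytes_) / bytes_per_row);
    return std::max<std::size_t>(rows, 1);  // always write at least one row
  }

 private:
  std::size_t target_block_bytes_;
  std::size_t initial_rows_;
  std::size_t bytes_written_ = 0;
  std::size_t rows_written_ = 0;
};
```

Because the estimate is only a running average, individual blocks can still come out larger or smaller than the target, which is consistent with the format placing no constraint on block size.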

File Addressing

Since an array group (and hence a segment) can contain multiple columns, we need a uniform way of addressing a particular column inside an array group, or inside a segment. Thus the following convention is used:

Given an array group of 3 columns comprising the files:

  • group.sidx
  • group.0000, group.0001, group.0002, group.0003

Column 0 in the array group can be addressed by opening the index file "group.sidx:0". Similarly, column 2 can be addressed using "group.sidx:2".

Column 2 of the array group thus has the segment files:

  • group.0000:2, group.0001:2, group.0002:2, group.0003:2

By convention if "group.sidx" is opened as a single array, it refers to column 0.
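The "file:column" convention can be sketched as a small parser. The function name parse_column_address is hypothetical and is not part of the block manager; the sketch only illustrates the naming rule described above, including the default of column 0 when no column suffix is present.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical parser: splits "file:column" into {file, column}.
// A name without a numeric ":N" suffix is treated as column 0.
inline std::pair<std::string, std::size_t> parse_column_address(
    const std::string& name) {
  const std::size_t colon = name.find_last_of(':');
  if (colon == std::string::npos || colon + 1 == name.size()) {
    return std::make_pair(name, std::size_t{0});
  }
  const std::string suffix = name.substr(colon + 1);
  for (char c : suffix) {
    // A non-digit suffix means the colon belongs to the path itself.
    if (c < '0' || c > '9') return std::make_pair(name, std::size_t{0});
  }
  return std::make_pair(name.substr(0, colon),
                        static_cast<std::size_t>(std::stoul(suffix)));
}
```

With this rule, "group.sidx:2" yields the file "group.sidx" and column 2, while plain "group.sidx" yields column 0.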

Block Manager

The block manager is a singleton reader object that provides read access to columns. The usage convention is:

  • block_manager& manager = block_manager::get_instance()
  • column_address = manager.open_column("group.0000:2") // opens column 2 in segment
  • .. do stuff ..
  • manager.close_column(column_address)

We will expand on .. do stuff .. below.

The reason for having a singleton block manager is to provide better control over file handle utilization. Specifically, the block manager maintains a pool of file handles and will recycle file handles (close them until they are next needed, then reopen and seek) so as to avoid file handle usage exceeding a certain limit (as defined in DEFAULT_FILE_HANDLE_POOL_SIZE). Furthermore, the block manager can combine accesses of multiple columns in the same array group into a single file handle. Future performance improvements involving better IO scheduling can also be performed here.
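The handle recycling idea can be modelled with a bounded pool. The sketch below is an illustration only: the class handle_pool is hypothetical, it tracks which files count as "open" rather than holding real OS handles, and it evicts the least recently used entry, which is an assumption (the text above does not say which handle gets closed first).

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical model of a bounded file handle pool with LRU eviction.
class handle_pool {
 public:
  explicit handle_pool(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Marks `file` as in use. Returns true if the file had to be (re)opened,
  // false if an already-open handle was reused.
  bool acquire(const std::string& file) {
    auto it = index_.find(file);
    if (it != index_.end()) {
      lru_.splice(lru_.begin(), lru_, it->second);  // move to most recent
      return false;
    }
    if (lru_.size() == capacity_) {
      // Pool is full: "close" the least recently used file.
      index_.erase(lru_.back());
      lru_.pop_back();
    }
    lru_.push_front(file);
    index_[file] = lru_.begin();
    return true;
  }

  bool is_open(const std::string& file) const { return index_.count(file) != 0; }
  std::size_t open_count() const { return lru_.size(); }

 private:
  std::size_t capacity_;
  std::list<std::string> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```

A real pool would also remember the file position of an evicted handle so that it can reopen and seek when the file is next needed, as described above.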

When a column is opened by open_column(), a column_address is returned. This is a pair of integers {segment_file_id, column_id}. column_id is the column within the segment. For instance, opening "group.0000:2" will have column_id = 2. The segment_file_id is an internal ID assigned by the block manager to track all accesses to the file group.0000. All open calls to group.0000 will return the same segment_file_id, and a reference counter is used internally to figure out when the file handle and block metadata can be released. close_column() thus must be called for every call to open_column().
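The reference counting behaviour can be sketched with a small stand-alone registry. The class segment_registry and its members are hypothetical and do not reflect the block manager's internals; the sketch only shows why every open needs a matching close before a segment's resources can be released.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical registry: assigns one ID per segment file and counts opens.
class segment_registry {
 public:
  // {segment_file_id, column_id}, modelled here as a std::pair.
  using column_address = std::pair<std::size_t, std::size_t>;

  column_address open_column(const std::string& segment_file,
                             std::size_t column_id) {
    auto it = id_by_file_.find(segment_file);
    if (it == id_by_file_.end()) {
      // First open of this file: assign a fresh segment_file_id.
      it = id_by_file_.emplace(segment_file, next_id_++).first;
      file_by_id_[it->second] = segment_file;
    }
    ++refcount_[it->second];
    return column_address(it->second, column_id);
  }

  // Returns true when the last reference was dropped and the segment's
  // handle and metadata could be released.
  bool close_column(column_address addr) {
    auto rc = refcount_.find(addr.first);
    assert(rc != refcount_.end());
    if (--rc->second > 0) return false;
    id_by_file_.erase(file_by_id_[addr.first]);
    file_by_id_.erase(addr.first);
    refcount_.erase(rc);
    return true;
  }

  std::size_t open_segment_count() const { return refcount_.size(); }

 private:
  std::size_t next_id_ = 0;
  std::map<std::string, std::size_t> id_by_file_;
  std::map<std::size_t, std::string> file_by_id_;
  std::map<std::size_t, std::size_t> refcount_;
};
```

Opening two columns of the same segment file yields the same segment_file_id, and the segment only becomes releasable after both have been closed.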

Once the column is opened, num_blocks_in_column() can be used to obtain the number of blocks in the segment file belonging to the column. read_block() or read_typed_block() can then be used to read the blocks. These functions take a block_address, which is a triple of {segment_file_id, column_id, block_id}. The first 2 fields can be copied from the column_address; the block_id is a sequential counter from 0 to num_blocks_in_column() - 1.
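Building the block addresses for a column can be sketched as follows. The type aliases below model column_address and block_address as std::tuple, which is an assumption made so the example compiles on its own, and block_addresses is a hypothetical helper rather than a block_manager member.

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

// ASSUMPTION: stand-in aliases so the sketch is self-contained.
using column_address = std::tuple<std::size_t, std::size_t>;
using block_address = std::tuple<std::size_t, std::size_t, std::size_t>;

// Hypothetical helper: enumerates {segment_file_id, column_id, block_id}
// for block_id = 0 .. num_blocks - 1.
inline std::vector<block_address> block_addresses(column_address column,
                                                  std::size_t num_blocks) {
  std::vector<block_address> result;
  result.reserve(num_blocks);
  for (std::size_t block_id = 0; block_id < num_blocks; ++block_id) {
    // The first two fields are copied from the column address.
    result.emplace_back(std::get<0>(column), std::get<1>(column), block_id);
  }
  return result;
}
```

In real use, num_blocks would come from num_blocks_in_column() and each address would be passed to read_block() or read_typed_block().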

Definition at line 155 of file sarray_v2_block_manager.hpp.

Member Function Documentation

◆ close_column()

void turi::v2_block_impl::block_manager::close_column ( column_address  addr)

Releases the column opened with open_column()

◆ get_all_block_info()

const std::vector<std::vector<block_info> >& turi::v2_block_impl::block_manager::get_all_block_info ( size_t  segment_id)

Returns all the block_info entries in a segment.

◆ get_block_info()

const block_info& turi::v2_block_impl::block_manager::get_block_info ( block_address  addr)

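Returns the block_info for a block, from which details such as the number of rows in the block can be read. The original comment adds "Returns (size_t)(-1) on failure", which does not match the const block_info & return type shown above.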

◆ num_blocks_in_column()

size_t turi::v2_block_impl::block_manager::num_blocks_in_column ( column_address  addr)

Returns the number of blocks in this column of this segment.

◆ open_column()

column_address turi::v2_block_impl::block_manager::open_column ( std::string  column_file)

Opens a file of the form segment_file:column_number and returns the column address: {segment_file_id, column_id}.

Calling num_blocks_in_column() will return the number of blocks within this column, after which the column's blocks can be read by providing {segment_file_id, column_id, block_id} to read_block().

close_column() must be called for each call to open_column().

◆ read_block() [1/2]

std::shared_ptr<std::vector<char> > turi::v2_block_impl::block_manager::read_block ( block_address  addr,
block_info **  ret_info = NULL 
)

Reads a block as bytes, given a block address (a {segment_file_id, column_id, block_id} tuple).

If ret_info is not NULL, a pointer to the block information will be stored in *ret_info. This is a pointer into internal data structures of the block manager and should not be modified or freed.

Returns an empty pointer on failure.

Safe for concurrent operation.

◆ read_block() [2/2]

template<typename T >
bool turi::v2_block_impl::block_manager::read_block ( block_address  addr,
std::vector< T > &  ret,
block_info **  ret_info = NULL 
)
inline

Reads a block given a block address (a {segment_file_id, column_id, block_id} tuple) and deserializes it into an array. Returns true on success, false on failure.

May return less than nblocks if addr goes past the last block.

Safe for concurrent operation.

Definition at line 247 of file sarray_v2_block_manager.hpp.

◆ read_typed_block()

bool turi::v2_block_impl::block_manager::read_typed_block ( block_address  addr,
std::vector< flexible_type > &  ret,
block_info **  ret_info = NULL 
)

Reads a block, given a block address (a {segment_file_id, column_id, block_id} tuple), into a typed array. The block must have been stored as a typed block. Returns true on success, false on failure.

Safe for concurrent operation.

◆ read_typed_blocks()

bool turi::v2_block_impl::block_manager::read_typed_blocks ( block_address  addr,
size_t  nblocks,
std::vector< std::vector< flexible_type > > &  ret,
std::vector< block_info > *  ret_info = NULL 
)

Reads a few blocks starting from a given block address (a {segment_file_id, column_id, block_id} tuple) into typed arrays. The blocks must have been stored as typed blocks. Returns true on success, false on failure.

May return fewer than nblocks blocks if the requested range goes past the last block.
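The "fewer than nblocks" case amounts to clamping the requested range to the blocks that exist. A minimal sketch, assuming the caller knows the column's block count (for example from num_blocks_in_column()); blocks_available is a hypothetical helper, not part of the class:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: how many blocks a request for `nblocks` blocks
// starting at `first_block` can actually cover in a column that holds
// `num_blocks_in_column` blocks.
inline std::size_t blocks_available(std::size_t first_block,
                                    std::size_t nblocks,
                                    std::size_t num_blocks_in_column) {
  if (first_block >= num_blocks_in_column) return 0;  // start is past the end
  return std::min(nblocks, num_blocks_in_column - first_block);
}
```

A caller can compare this value with the size of the returned vector to detect that the end of the column was reached.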

Safe for concurrent operation.


The documentation for this class was generated from the following file: