Turi Create
4.0
#include <core/storage/sframe_data/sarray_v2_block_manager.hpp>
Public Member Functions | |
block_manager () | |
default constructor. | |
column_address | open_column (std::string column_file) |
void | close_column (column_address addr) |
size_t | num_blocks_in_column (column_address addr) |
const block_info & | get_block_info (block_address addr) |
const std::vector< std::vector< block_info > > & | get_all_block_info (size_t segment_id) |
std::shared_ptr< std::vector< char > > | read_block (block_address addr, block_info **ret_info=NULL) |
bool | read_typed_block (block_address addr, std::vector< flexible_type > &ret, block_info **ret_info=NULL) |
bool | read_typed_blocks (block_address addr, size_t nblocks, std::vector< std::vector< flexible_type > > &ret, std::vector< block_info > *ret_info=NULL) |
template<typename T > | |
bool | read_block (block_address addr, std::vector< T > &ret, block_info **ret_info=NULL) |
Static Public Member Functions | |
static block_manager & | get_instance () |
Get singleton instance. | |
Provides block reading capability in v2 segment files.
This class manages block reading of an SArray/SArray group, and provides functions to query the blocks (such as how many blocks there are in the segment, and how many rows there are in a block).
An array group is a collection of segment files which contain and represent a collection of arrays (columns).
Essentially, an array group comprises the following:
Each segment file internally then has the following layout:
(1) Consecutive block contents, each block 4K aligned.
(2) A direct serialization of a vector<vector<block_info> > (blocks[column_id][block_id]).
(3) 8 bytes containing the file offset at which (2) begins.
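The three-part layout above implies that a reader finds the block_info table by first reading the final 8 bytes of the segment file. The following is a minimal sketch of that lookup on an in-memory toy segment, not the block manager's actual code. It assumes the offset is stored as a uint64_t in native byte order (the text does not state the encoding), and the helper names align_up_4k, read_footer_offset and make_toy_segment are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical constant mirroring the "each block 4K aligned" rule of part (1).
constexpr std::uint64_t kBlockAlignment = 4096;

// Rounds an offset up to the next 4K boundary.
inline std::uint64_t align_up_4k(std::uint64_t offset) {
  return (offset + kBlockAlignment - 1) / kBlockAlignment * kBlockAlignment;
}

// Given the raw bytes of a whole segment file, extracts the offset stored in
// its final 8 bytes, i.e. the position where part (2) begins.
// ASSUMPTION: the offset is a uint64_t in the machine's native byte order.
inline bool read_footer_offset(const std::string& file_bytes,
                               std::uint64_t& offset_out) {
  if (file_bytes.size() < sizeof(std::uint64_t)) return false;
  std::memcpy(&offset_out,
              file_bytes.data() + file_bytes.size() - sizeof(std::uint64_t),
              sizeof(std::uint64_t));
  // The metadata must start before the footer itself.
  return offset_out <= file_bytes.size() - sizeof(std::uint64_t);
}

// Builds a toy in-memory "segment": one padded block, opaque metadata bytes
// standing in for the serialized block_info table, then the 8-byte footer.
inline std::string make_toy_segment(const std::string& block,
                                    const std::string& metadata) {
  std::string out = block;
  out.resize(align_up_4k(out.size()), '\0');  // pad the block to 4K
  const std::uint64_t metadata_offset = out.size();
  out += metadata;
  char footer[sizeof(std::uint64_t)];
  std::memcpy(footer, &metadata_offset, sizeof(footer));
  out.append(footer, sizeof(footer));
  return out;
}
```

Because the footer has a fixed size, a reader can seek to the end of the file, read 8 bytes, and then jump straight to the metadata without scanning any block contents.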
For instance, if there are 2 segments with 3 columns each of 20 rows, we may get the following layout:
group.0001:
group.0002:
Observe the following:
1) Each segment contains the same number of rows from each column. (Technically the format does not require this, but the writer will always produce this result.)
2) Blocks can be of different sizes. The block_manager and block_writer do not have a block size constraint. The sarray_group_format_writer_v2 tries to keep to a block size of SFRAME_DEFAULT_BLOCK_SIZE after compression, but it does so by block size estimation (#bytes written / #rows written); the format itself does not care.
3) Blocks can be laid out in arbitrary order across columns. (Striping of columns is unnecessary.)
4) Within each segment, the blocks for a given column are consecutive.
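The block size estimation mentioned in point 2 can be illustrated with a small running average. This is only a sketch of the "#bytes written / #rows written" idea, not the logic of sarray_group_format_writer_v2. The type block_size_estimator, its members, and the fallback_rows parameter are hypothetical, and the target size is passed in as a parameter because the value of SFRAME_DEFAULT_BLOCK_SIZE is not given here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical estimator: tracks how many compressed bytes and rows have been
// written so far and uses the ratio to size the next block.
struct block_size_estimator {
  std::size_t bytes_written = 0;
  std::size_t rows_written = 0;

  // Records one finished block.
  void record_block(std::size_t compressed_bytes, std::size_t rows) {
    bytes_written += compressed_bytes;
    rows_written += rows;
  }

  // Estimates how many rows the next block should hold so that its compressed
  // size lands near target_block_bytes. Falls back to fallback_rows until
  // there is data to estimate from, and never returns fewer than 1 row.
  std::size_t rows_for_next_block(std::size_t target_block_bytes,
                                  std::size_t fallback_rows) const {
    if (bytes_written == 0 || rows_written == 0) return fallback_rows;
    const double bytes_per_row =
        static_cast<double>(bytes_written) / static_cast<double>(rows_written);
    const std::size_t rows = static_cast<std::size_t>(
        static_cast<double>(target_block_bytes) / bytes_per_row);
    return rows == 0 ? 1 : rows;
  }
};
```

Since the estimate is only an average, individual blocks will overshoot or undershoot the target, which is consistent with the observation that blocks can be of different sizes.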
Since an array group (and hence a segment) can contain multiple columns, we need a uniform way of addressing a particular column inside an array group, or inside a segment. Thus the following convention is used:
Given an array group of 3 columns comprising the files:
Column 0 in the array group can be addressed by opening the index file "group.sidx:0". Similarly, column 2 can be addressed using "group.sidx:2"
Column 2 of the array group thus has the segment files:
By convention if "group.sidx" is opened as a single array, it refers to column 0.
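The "file:column" addressing convention can be sketched as a small parser. This is an illustration only, not the parsing code used by open_column(). The helper name split_column_path is hypothetical, and treating any non-numeric suffix as part of the file name is an assumption made here so that paths containing a colon for other reasons are left intact.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: splits "file:column_number" into {file, column_number}.
// A path with no numeric ":N" suffix is returned unchanged with column 0,
// matching the convention that a bare "group.sidx" refers to column 0.
inline std::pair<std::string, std::size_t> split_column_path(
    const std::string& path) {
  const std::size_t colon = path.rfind(':');
  if (colon == std::string::npos || colon + 1 == path.size()) {
    return {path, 0};
  }
  const std::string suffix = path.substr(colon + 1);
  for (char c : suffix) {
    // ASSUMPTION: a suffix that is not all digits belongs to the file name.
    if (c < '0' || c > '9') return {path, 0};
  }
  return {path.substr(0, colon), std::stoul(suffix)};
}
```

For example, split_column_path("group.sidx:2") yields the file "group.sidx" with column 2, while split_column_path("group.sidx") yields the same file with column 0.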
The block manager is a singleton reader object that provides read access to columns. The usage convention is:
The reason for having a singleton block manager is to provide better control over file handle utilization. Specifically, the block manager maintains a pool of file handles and will recycle file handles (close them until they are next needed, then reopen and seek) so as to keep file handle usage below a limit (defined by DEFAULT_FILE_HANDLE_POOL_SIZE). Furthermore, the block manager can combine accesses to multiple columns in the same array group into a single file handle. Future performance improvements involving better IO scheduling can also be made here.
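The handle recycling described above can be modeled with a capped pool. The sketch below is a toy model, not the block manager's implementation: it only tracks which files currently hold a handle and opens no real files. The class toy_handle_pool is hypothetical, and evicting the least recently used handle is an assumption, since the text says only that handles are recycled.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Toy model of a capped file handle pool. It assumes limit >= 1.
class toy_handle_pool {
 public:
  explicit toy_handle_pool(std::size_t limit) : limit_(limit) {}

  // Marks `file` as in use. Returns true if a handle had to be (re)opened,
  // false if the file already held one.
  bool acquire(const std::string& file) {
    auto it = pos_.find(file);
    if (it != pos_.end()) {
      // Already open: move it to the front as most recently used.
      lru_.splice(lru_.begin(), lru_, it->second);
      return false;
    }
    if (lru_.size() >= limit_ && !lru_.empty()) {
      // Pool is full: recycle the least recently used handle.
      // ASSUMPTION: LRU is the eviction policy; the source does not say.
      pos_.erase(lru_.back());
      lru_.pop_back();
      ++recycled_;
    }
    lru_.push_front(file);
    pos_[file] = lru_.begin();
    return true;
  }

  std::size_t open_handles() const { return lru_.size(); }
  std::size_t recycled() const { return recycled_; }

 private:
  std::size_t limit_;
  std::size_t recycled_ = 0;
  std::list<std::string> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<std::string>::iterator> pos_;
};
```

A real pool would also need to remember each recycled file's read position so that it can reopen and seek, as the description above notes.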
When a column is opened by open_column(), a column_address is returned. This is a pair of integers {segment_file_id, column_id}. column_id is the column within the segment. For instance, opening "group.0000:2" will have column_id = 2. The segment_file_id is an internal ID assigned by the block manager to track all accesses to the file group.0000. All open calls to group.0000 will return the same segment_file_id, and a reference counter is used internally to determine when the file handle and block metadata can be released. close_column() must therefore be called once for every call to open_column().
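The reference counting behind open_column() and close_column() can be sketched with a small registry. This is a toy stand-in, not turi::v2_block_impl::block_manager: the class toy_segment_registry and its members are hypothetical, and it manages IDs only, with no file handles or block metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy registry: every open of the same segment file returns the same ID, and
// the entry is released only when each open has been matched by a close.
class toy_segment_registry {
 public:
  // Returns the ID for segment_file, assigning a new one on first open.
  std::size_t open(const std::string& segment_file) {
    auto it = ids_.find(segment_file);
    if (it == ids_.end()) {
      it = ids_.emplace(segment_file, next_id_++).first;
    }
    ++refcount_[it->second];
    return it->second;
  }

  // Drops one reference. Returns true only when this was the last reference
  // and the entry was released; returns false otherwise, including for IDs
  // that are not open.
  bool close(std::size_t segment_file_id) {
    auto rc = refcount_.find(segment_file_id);
    if (rc == refcount_.end()) return false;
    if (--rc->second > 0) return false;
    refcount_.erase(rc);
    for (auto it = ids_.begin(); it != ids_.end(); ++it) {
      if (it->second == segment_file_id) {
        ids_.erase(it);
        break;
      }
    }
    return true;
  }

  bool is_open(std::size_t segment_file_id) const {
    return refcount_.count(segment_file_id) != 0;
  }

 private:
  std::map<std::string, std::size_t> ids_;       // file name -> ID
  std::map<std::size_t, std::size_t> refcount_;  // ID -> open count
  std::size_t next_id_ = 0;
};
```

The sketch shows why an unmatched open is a leak: the entry for a segment stays alive until the count returns to zero.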
Once the column is opened, num_blocks_in_column() can be used to obtain the number of blocks in the segment file belonging to the column. read_block() or read_typed_block() can then be used to read the blocks. These functions take a block_address, which is a triple of {segment_file_id, column_id, block_id}. The first 2 fields can be copied from the column_address; the block_id is a sequential counter from 0 to num_blocks_in_column() - 1.
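The relationship between a column_address and the block_address values derived from it can be sketched as follows. The aliases toy_column_address and toy_block_address and the helper block_addresses_for are hypothetical; modeling the pair and the triple as std::tuple is an assumption made for illustration, not a statement about the library's typedefs.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// ASSUMPTION: toy stand-ins for column_address {segment_file_id, column_id}
// and block_address {segment_file_id, column_id, block_id}.
using toy_column_address = std::tuple<std::size_t, std::size_t>;
using toy_block_address = std::tuple<std::size_t, std::size_t, std::size_t>;

// Enumerates the address of every block in a column: the first two fields are
// copied from the column address, and block_id counts from 0 to
// num_blocks - 1.
inline std::vector<toy_block_address> block_addresses_for(
    const toy_column_address& column, std::size_t num_blocks) {
  std::vector<toy_block_address> out;
  out.reserve(num_blocks);
  for (std::size_t block_id = 0; block_id < num_blocks; ++block_id) {
    out.emplace_back(std::get<0>(column), std::get<1>(column), block_id);
  }
  return out;
}
```

With the real block manager, num_blocks would come from num_blocks_in_column(), and each resulting address would be passed to read_block() or read_typed_block() before the column is finally released with close_column().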
Definition at line 155 of file sarray_v2_block_manager.hpp.
void turi::v2_block_impl::block_manager::close_column | ( | column_address | addr | ) |
Releases the column opened with open_column()
const std::vector<std::vector<block_info> >& turi::v2_block_impl::block_manager::get_all_block_info | ( | size_t | segment_id | ) |
Returns all the blockinfo in a segment
const block_info& turi::v2_block_impl::block_manager::get_block_info | ( | block_address | addr | ) |
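Returns the block_info of the block at the given address, from which details such as the number of rows in the block can be obtained. Note that the declared return type is const block_info &, so the "(size_t)(-1) on failure" value mentioned in the source comment cannot apply to this signature as declared.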
size_t turi::v2_block_impl::block_manager::num_blocks_in_column | ( | column_address | addr | ) |
Returns the number of blocks in this column of this segment.
column_address turi::v2_block_impl::block_manager::open_column | ( | std::string | column_file | ) |
Opens a file of the form segment_file:column_number and returns the column address: {segment_file_id, column_id}.
Calling num_blocks_in_column() will return the number of blocks within this column, after which blocks can be read by providing {segment_file_id, column_id, block_id} to read_block().
close_column() must be called for each call to open_column()
std::shared_ptr<std::vector<char> > turi::v2_block_impl::block_manager::read_block | ( | block_address | addr, |
block_info ** | ret_info = NULL |
||
) |
Reads a block as bytes, given a block address (a {segment_file_id, column_id, block_id} tuple).
If ret_info is not NULL, a pointer to the block information will be stored in *ret_info. This is a pointer into internal data structures of the block manager and should not be modified or freed.
Returns an empty pointer on failure.
Safe for concurrent operation.
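template<typename T >
bool turi::v2_block_impl::block_manager::read_block | ( | block_address | addr, |
std::vector< T > & | ret, | ||
block_info ** | ret_info = NULL |
||
) |
inline |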
Reads from a given block address (a {segment_file_id, column_id, block_id} tuple) and deserializes the contents into an array. Returns true on success, false on failure.
May return less than nblocks if addr goes past the last block.
Safe for concurrent operation.
Definition at line 247 of file sarray_v2_block_manager.hpp.
bool turi::v2_block_impl::block_manager::read_typed_block | ( | block_address | addr, |
std::vector< flexible_type > & | ret, | ||
block_info ** | ret_info = NULL |
||
) |
Reads a block, given a block address (a {segment_file_id, column_id, block_id} tuple), into a typed array. The block must have been stored as a typed block. Returns true on success, false on failure.
Safe for concurrent operation.
bool turi::v2_block_impl::block_manager::read_typed_blocks | ( | block_address | addr, |
size_t | nblocks, | ||
std::vector< std::vector< flexible_type > > & | ret, | ||
std::vector< block_info > * | ret_info = NULL |
||
) |
Reads a few blocks, starting from a given block address (a {segment_file_id, column_id, block_id} tuple), into typed arrays. The blocks must have been stored as typed blocks. Returns true on success, false on failure.
May return fewer than nblocks blocks if the range starting at addr goes past the last block.
Safe for concurrent operation.