Turi Create
4.0
|
#include <core/storage/sframe_data/groupby_aggregate_impl.hpp>
Public Member Functions | |
group_aggregate_container (size_t max_buffer_size, size_t num_segments) | |
group_aggregate_container (const group_aggregate_container &other)=delete | |
Deleted copy constructor. | |
group_aggregate_container & | operator= (const group_aggregate_container &other)=delete |
Deleted assignment operator. | |
void | define_group (std::vector< size_t > column_numbers, std::shared_ptr< group_aggregate_value > aggregator) |
void | add (const std::vector< flexible_type > &val, size_t num_keys) |
Add a new element to the container. | |
void | add (const sframe_rows::row &val, size_t num_keys) |
Add a new element to the container. | |
void | group_and_write (sframe &out) |
Sort all elements in the container and writes to the output. | |
void | init_tls () |
init tls data members | |
This maintains the complete aggregation result for an SFrame.
This class essentially implements the entire groupby aggregation algorithm. Aggregation groups are defined using define_group. After which, rows are inserted using the add method.
num_segments is the maximum degree of parallelism permissible. This is the number of segments of the output SFrame.
Keys are first hashed into a segment, and each segment is then executed independently. For each segment, rows are aggregated in memory as much as possible until we exceed the max_buffer_size number of keys stored in memory. If more keys are needed the existing keys are all sorted and flushed out to disk. This process repeats until all data is read.
Then when group_and_write is called, a k-way merge is performed across all the sorted ranges of keys on disk to write the final output.
Definition at line 196 of file groupby_aggregate_impl.hpp.
turi::groupby_aggregate_impl::group_aggregate_container::group_aggregate_container | ( | size_t | max_buffer_size, |
size_t | num_segments | ||
) |
max_buffer_size | Maximum number of aggregators to store in memory per segment |
num_segments | Maximum degree of parallelism |
void turi::groupby_aggregate_impl::group_aggregate_container::define_group | ( | std::vector< size_t > | column_numbers, |
std::shared_ptr< group_aggregate_value > | aggregator | ||
) |
Adds a new group operation which groups the values of a column