Turi Create  4.0
Groupby Aggregation

Namespaces

 turi::groupby_aggregate_impl
 
 turi::groupby_operators
 
 turi::rolling_aggregate
 

Classes

class  turi::group_aggregate_value
 
class  turi::hash_bucket< T >
 
class  turi::hash_bucket_container< T >
 

Functions

std::shared_ptr< group_aggregate_value > turi::get_builtin_group_aggregator (const std::string &)
 
sframe turi::group (sframe sframe_in, std::string key_column)
 
sframe turi::groupby_aggregate (const sframe &source, const std::vector< std::string > &keys, const std::vector< std::string > &group_output_columns, const std::vector< std::pair< std::vector< std::string >, std::shared_ptr< group_aggregate_value >>> &groups, size_t max_buffer_size=SFRAME_GROUPBY_BUFFER_NUM_ROWS)
 
std::shared_ptr< sarray< flexible_type > > turi::rolling_aggregate::rolling_apply (const sarray< flexible_type > &input, std::shared_ptr< group_aggregate_value > agg_op, ssize_t window_start, ssize_t window_end, size_t min_observations)
 
template<typename Iterator >
flexible_type turi::rolling_aggregate::full_window_aggregate (std::shared_ptr< group_aggregate_value > agg_op, Iterator first, Iterator last)
 Aggregate functions.
 
template<typename Iterator >
bool turi::rolling_aggregate::has_min_observations (size_t min_observations, Iterator first, Iterator last)
 

Detailed Description

Hash function.

This allows us to add groupby_element to an std::unordered_set

Function Documentation

◆ get_builtin_group_aggregator()

std::shared_ptr<group_aggregate_value> turi::get_builtin_group_aggregator ( const std::string &  )

Helper function that converts an aggregator name, given as a string, into the corresponding builtin aggregator value.

Implementation is in groupby_operators.hpp
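
As a rough illustration of a name-to-aggregator lookup, the sketch below builds a small registry of factory functions. It does not use Turi headers: toy_aggregator, toy_sum, toy_count and get_toy_aggregator are hypothetical stand-ins, and the names "sum" and "count" are chosen for the example rather than taken from Turi's list of builtin names. How Turi reports an unknown name is not documented here; the sketch throws std::invalid_argument as an assumption.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for turi::group_aggregate_value (not the real interface).
struct toy_aggregator {
  virtual ~toy_aggregator() = default;
  virtual void add(double value) = 0;
  virtual double result() const = 0;
};

struct toy_sum : toy_aggregator {
  double total = 0.0;
  void add(double value) override { total += value; }
  double result() const override { return total; }
};

struct toy_count : toy_aggregator {
  double n = 0.0;
  void add(double) override { n += 1.0; }
  double result() const override { return n; }
};

// Returns a freshly constructed aggregator for the given name.
// Assumption of this sketch: unknown names raise std::invalid_argument.
std::shared_ptr<toy_aggregator> get_toy_aggregator(const std::string& name) {
  using factory = std::function<std::shared_ptr<toy_aggregator>()>;
  static const std::unordered_map<std::string, factory> registry = {
      {"sum", [] { return std::make_shared<toy_sum>(); }},
      {"count", [] { return std::make_shared<toy_count>(); }},
  };
  auto it = registry.find(name);
  if (it == registry.end()) {
    throw std::invalid_argument("unknown aggregator: " + name);
  }
  return it->second();
}
```

Each call returns a new object, so two groups that request the same operator name do not share accumulated state.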

◆ group()

sframe turi::group ( sframe  sframe_in,
std::string  key_column 
)

Group the sframe rows by the key_column.

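Like a sort, in that rows with equal keys end up next to each other, but the groups themselves are not put into sorted order.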
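
The difference between grouping and sorting can be sketched without Turi types. In the standalone example below, row and group_rows are hypothetical names; rows with equal keys become contiguous, and groups appear in order of first occurrence. That ordering is a choice made for this sketch and should not be read as a documented property of turi::group.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using row = std::pair<int, std::string>;  // (key, payload); hypothetical row type

// Reorders rows so that rows with equal keys are contiguous.
// Keys are never compared for ordering, only hashed and tested for equality.
std::vector<row> group_rows(const std::vector<row>& rows) {
  std::unordered_map<int, std::size_t> bucket_of;  // key -> index into buckets
  std::vector<std::vector<row>> buckets;
  for (const row& r : rows) {
    // emplace leaves the map unchanged if the key is already present.
    auto res = bucket_of.emplace(r.first, buckets.size());
    if (res.second) buckets.emplace_back();
    buckets[res.first->second].push_back(r);
  }
  std::vector<row> out;
  out.reserve(rows.size());
  for (const auto& bucket : buckets) {
    out.insert(out.end(), bucket.begin(), bucket.end());
  }
  return out;
}
```

For keys arriving as 7, 5, 7, 6, 5, this returns the rows in key order 7, 7, 5, 5, 6: grouped, but not sorted.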

◆ groupby_aggregate()

sframe turi::groupby_aggregate ( const sframe &  source,
const std::vector< std::string > &  keys,
const std::vector< std::string > &  group_output_columns,
const std::vector< std::pair< std::vector< std::string >, std::shared_ptr< group_aggregate_value >>> &  groups,
size_t  max_buffer_size = SFRAME_GROUPBY_BUFFER_NUM_ROWS 
)

Group-by aggregate function for an SFrame. Given the source SFrame, this function performs a group-by aggregate, using one or more columns to define the group key and a descriptor for how to aggregate the other, non-key columns.

For instance, given an SFrame:

user_id  movie_id  rating  time
     5        10       1    4pm
     5        15       2    1pm
     6        12       1    2pm
     7        13       1    3am

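// Group by user_id; count movie_id and sum rating within each group.
// Each group is a {column names, operator} pair, so the column names
// are themselves wrapped in braces.
sframe output = turi::groupby_aggregate(input,
    {"user_id"},
    {"movie_count", "rating_sum"},
    {{{"movie_id"}, std::make_shared<groupby_operators::count>()},
     {{"rating"}, std::make_shared<groupby_operators::sum>()}});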

This call generates groups based on the user_id column and, within each group, counts the movie_id entries and sums the ratings:

user_id  movie_count  rating_sum
     5            2           3
     6            1           1
     7            1           1

See the turi::groupby_operators namespace for the operators that have been implemented.

Describing a Group

A group is a pair of a column name and an operator. The column name can be any existing column in the table. There is no restriction: you can group on user_id and aggregate on user_id, though the result is typically not very meaningful. The empty string "" is also defined as a special column name, in which case the aggregator is sent a flexible_type of type FLEX_UNDEFINED for every row (this is useful for COUNT).

Parameters
source  The input SFrame to group.
keys  An array of column names to generate the group on.
group_output_columns  The output column names for each aggregate. This must be the same length as the 'groups' parameter. Output column names must be unique and must not duplicate any of the key column names. If there are any empty entries, their values will be automatically assigned.
groups  A collection of {column_names, group operator} pairs describing the aggregates to generate. You can have multiple aggregators for each set of columns. You do not need every column in the source to be represented. This must be the same length as the 'group_output_columns' parameter.
max_buffer_size  The maximum size of intermediate aggregation buffers.
Returns
The new aggregated SFrame. Throws a string exception on failure.
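
To make the semantics of the user_id example concrete, here is a minimal standalone sketch of the same count-and-sum aggregation over plain structs. It does not call Turi: rating_row, user_summary and summarize_by_user are hypothetical names, and std::map is used only so the result is easy to inspect, which says nothing about how groupby_aggregate orders or buffers its output.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical row and result types mirroring the example table.
struct rating_row {
  int user_id;
  int movie_id;
  int rating;
};

struct user_summary {
  std::size_t movie_count = 0;  // count of movie_id entries in the group
  long rating_sum = 0;          // sum of rating in the group
};

// One pass over the rows: the key selects the group, and each aggregate
// updates that group's running state.
std::map<int, user_summary> summarize_by_user(
    const std::vector<rating_row>& rows) {
  std::map<int, user_summary> groups;
  for (const rating_row& r : rows) {
    user_summary& s = groups[r.user_id];  // creates a zeroed entry on first use
    s.movie_count += 1;
    s.rating_sum += r.rating;
  }
  return groups;
}
```

Applied to the four example rows, user 5 ends up with a count of 2 and a sum of 3, and users 6 and 7 each have a count of 1 and a sum of 1.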

◆ has_min_observations()

template<typename Iterator >
bool turi::rolling_aggregate::has_min_observations ( size_t  min_observations,
Iterator  first,
Iterator  last 
)

Scans the current window to check for the number of non-NULL values.

Returns true if the number of non-NULL values is >= min_observations, false otherwise.

Definition at line 84 of file rolling_aggregate.hpp.
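
The check described above can be sketched with standard-library types. This is not the code from rolling_aggregate.hpp: has_min_observations_sketch is a hypothetical name, and std::optional<double> stands in for a flexible_type that may be NULL.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Returns true if the range [first, last) holds at least min_observations
// non-NULL values. An empty optional plays the role of NULL in this sketch.
template <typename Iterator>
bool has_min_observations_sketch(std::size_t min_observations,
                                 Iterator first, Iterator last) {
  const auto observed = std::count_if(
      first, last,
      [](const std::optional<double>& v) { return v.has_value(); });
  return static_cast<std::size_t>(observed) >= min_observations;
}
```

For a window holding 1.0, NULL and 3.0, the check passes with min_observations of 2 and fails with 3.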

◆ rolling_apply()

std::shared_ptr<sarray<flexible_type> > turi::rolling_aggregate::rolling_apply ( const sarray< flexible_type > &  input,
std::shared_ptr< group_aggregate_value >  agg_op,
ssize_t  window_start,
ssize_t  window_end,
size_t  min_observations 
)

Apply an aggregate function over a moving window.

Parameters
input  The input SArray (expected to be materialized).
agg_op  The aggregator. These classes are the same as those used by groupby.
window_start  The start of the moving window relative to the current value being calculated, inclusive. For example, 2 values behind the current would be -2, and 0 indicates that the start of the window is the current value.
window_end  The end of the moving window relative to the current value being calculated, inclusive. Must not be less than window_start. For example, 0 would indicate that the current value is the end of the window, and 2 would indicate that the window ends at 2 data values after the current.
min_observations  The minimum allowed number of non-NULL values in the moving window for the emitted value to be non-NULL. size_t(-1) indicates that all values must be non-NULL.

Returns an SArray of the same length as the input, with a type that matches the type output by the aggregation function.

Throws an exception if:

  • window_end < window_start
  • The window size is excessively large (currently hardcoded to UINT_MAX).
  • The given aggregator does not operate on the data type of the input SArray.
  • The aggregation function returns more than one non-NULL type.
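
The window parameters can be illustrated with a standalone rolling mean. This sketch does not use Turi types or the group_aggregate_value interface: rolling_mean_sketch and maybe_double are hypothetical names, and std::optional<double> stands in for a value that may be NULL. Two points are assumptions of the sketch rather than statements from this page: positions that fall outside the input are treated as NULL, and the mean is taken over the non-NULL values in the window.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

using maybe_double = std::optional<double>;  // empty optional stands in for NULL

// Rolling mean over the inclusive window [i + window_start, i + window_end].
std::vector<maybe_double> rolling_mean_sketch(
    const std::vector<maybe_double>& input, std::ptrdiff_t window_start,
    std::ptrdiff_t window_end, std::size_t min_observations) {
  if (window_end < window_start) {
    throw std::invalid_argument("window_end must not be less than window_start");
  }
  const std::size_t window_size =
      static_cast<std::size_t>(window_end - window_start) + 1;
  // size_t(-1) means every slot in the window must hold a value.
  if (min_observations == static_cast<std::size_t>(-1)) {
    min_observations = window_size;
  }
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(input.size());
  std::vector<maybe_double> out(input.size());  // all NULL to start with
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    double sum = 0.0;
    std::size_t observed = 0;
    for (std::ptrdiff_t j = i + window_start; j <= i + window_end; ++j) {
      if (j < 0 || j >= n) continue;  // assumption: out of range counts as NULL
      const maybe_double& value = input[static_cast<std::size_t>(j)];
      if (value.has_value()) {
        sum += *value;
        ++observed;
      }
    }
    // Emit a value only when enough non-NULL observations were seen.
    if (observed > 0 && observed >= min_observations) {
      out[static_cast<std::size_t>(i)] = sum / static_cast<double>(observed);
    }
  }
  return out;
}
```

With input 1, 2, 3, 4, 5, window_start of -2 and window_end of 0, passing size_t(-1) leaves the first two outputs NULL because their windows are incomplete, while min_observations of 2 leaves only the first output NULL.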