turicreate.SFrame.groupby¶

SFrame.groupby(key_column_names, operations, *args)¶

Perform a group on the key_column_names followed by aggregations on the columns listed in operations.

The operations parameter is a dictionary that indicates which aggregation operators to use and which columns to use them on. The available operators are SUM, MAX, MIN, COUNT, AVG, VAR, STDV, CONCAT, SELECT_ONE, ARGMIN, ARGMAX, and QUANTILE. For convenience, aggregators MEAN, STD, and VARIANCE are available as synonyms for AVG, STDV, and VAR. See aggregate for more detail on the aggregators.

Parameters:

key_column_names : string | list[string]: Column(s) to group by. Key columns can be of any type other than dictionary.
operations : dict, list: Dictionary of columns and aggregation operations. Each key is a output column name and each value is an aggregator. This can also be a list of aggregators, in which case column names will be automatically assigned.
*args: All other remaining arguments will be interpreted in the same way as the operations argument.

Returns:

out_sf : SFrame: A new SFrame, with a column for each groupby column and each aggregation operation.

See also

aggregate

Notes

Numeric aggregators (such as sum, mean, stdev etc.) follow the skip

None policy i.e they will omit all missing values from the aggregation. As an example, sum([None, 5, 10]) = 15 because the None value is skipped. * Aggregators have a default value when no values (after skipping all None values) are present. Default values are None for [‘ARGMAX’, ‘ARGMIN’, ‘AVG’, ‘STD’, ‘MEAN’, ‘MIN’, ‘MAX’], 0 for [‘COUNT’ ‘COUNT_DISTINCT’, ‘DISTINCT’] [] for ‘CONCAT’, ‘QUANTILE’, ‘DISTINCT’, and {} for ‘FREQ_COUNT’.

Examples

Suppose we have an SFrame with movie ratings by many users.

>>> import turicreate.aggregate as agg
>>> url = 'https://static.turi.com/datasets/rating_data_example.csv'
>>> sf = turicreate.SFrame.read_csv(url)
>>> sf
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|  25933  |   1663   |   4    |
|  25934  |   1663   |   4    |
|  25935  |   1663   |   4    |
|  25936  |   1663   |   5    |
|  25937  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Compute the number of occurrences of each user.

>>> user_count = sf.groupby(key_column_names='user_id',
...                         operations={'count': agg.COUNT()})
>>> user_count
+---------+-------+
| user_id | count |
+---------+-------+
|  62361  |   1   |
|  30727  |   1   |
|  40111  |   1   |
|  50513  |   1   |
|  35140  |   1   |
|  42352  |   1   |
|  29667  |   1   |
|  46242  |   1   |
|  58310  |   1   |
|  64614  |   1   |
|   ...   |  ...  |
+---------+-------+
[9852 rows x 2 columns]

Compute the mean and standard deviation of ratings per user.

>>> user_rating_stats = sf.groupby(key_column_names='user_id',
...                                operations={
...                                    'mean_rating': agg.MEAN('rating'),
...                                    'std_rating': agg.STD('rating')
...                                })
>>> user_rating_stats
+---------+-------------+------------+
| user_id | mean_rating | std_rating |
+---------+-------------+------------+
|  62361  |     5.0     |    0.0     |
|  30727  |     4.0     |    0.0     |
|  40111  |     2.0     |    0.0     |
|  50513  |     4.0     |    0.0     |
|  35140  |     4.0     |    0.0     |
|  42352  |     5.0     |    0.0     |
|  29667  |     4.0     |    0.0     |
|  46242  |     5.0     |    0.0     |
|  58310  |     2.0     |    0.0     |
|  64614  |     2.0     |    0.0     |
|   ...   |     ...     |    ...     |
+---------+-------------+------------+
[9852 rows x 3 columns]

Compute the movie with the minimum rating per user.

>>> chosen_movies = sf.groupby(key_column_names='user_id',
...                            operations={
...                                'worst_movies': agg.ARGMIN('rating','movie_id')
...                            })
>>> chosen_movies
+---------+-------------+
| user_id | worst_movies |
+---------+-------------+
|  62361  |     1663    |
|  30727  |     1663    |
|  40111  |     1663    |
|  50513  |     1663    |
|  35140  |     1663    |
|  42352  |     1663    |
|  29667  |     1663    |
|  46242  |     1663    |
|  58310  |     1663    |
|  64614  |     1663    |
|   ...   |     ...     |
+---------+-------------+
[9852 rows x 2 columns]

Compute the movie with the max rating per user and also the movie with the maximum imdb-ranking per user.

>>> sf['imdb-ranking'] = sf['rating'] * 10
>>> chosen_movies = sf.groupby(key_column_names='user_id',
...         operations={('max_rating_movie','max_imdb_ranking_movie'): agg.ARGMAX(('rating','imdb-ranking'),'movie_id')})
>>> chosen_movies
+---------+------------------+------------------------+
| user_id | max_rating_movie | max_imdb_ranking_movie |
+---------+------------------+------------------------+
|  62361  |       1663       |          16630         |
|  30727  |       1663       |          16630         |
|  40111  |       1663       |          16630         |
|  50513  |       1663       |          16630         |
|  35140  |       1663       |          16630         |
|  42352  |       1663       |          16630         |
|  29667  |       1663       |          16630         |
|  46242  |       1663       |          16630         |
|  58310  |       1663       |          16630         |
|  64614  |       1663       |          16630         |
|   ...   |       ...        |          ...           |
+---------+------------------+------------------------+
[9852 rows x 3 columns]

Compute the movie with the max rating per user.

>>> chosen_movies = sf.groupby(key_column_names='user_id',
            operations={'best_movies': agg.ARGMAX('rating','movie')})

Compute the movie with the max rating per user and also the movie with the maximum imdb-ranking per user.

>>> chosen_movies = sf.groupby(key_column_names='user_id',
           operations={('max_rating_movie','max_imdb_ranking_movie'): agg.ARGMAX(('rating','imdb-ranking'),'movie')})

Compute the count, mean, and standard deviation of ratings per (user, time), automatically assigning output column names.

>>> sf['time'] = sf.apply(lambda x: (x['user_id'] + x['movie_id']) % 11 + 2000)
>>> user_rating_stats = sf.groupby(['user_id', 'time'],
...                                [agg.COUNT(),
...                                 agg.AVG('rating'),
...                                 agg.STDV('rating')])
>>> user_rating_stats
+------+---------+-------+---------------+----------------+
| time | user_id | Count | Avg of rating | Stdv of rating |
+------+---------+-------+---------------+----------------+
| 2006 |  61285  |   1   |      4.0      |      0.0       |
| 2000 |  36078  |   1   |      4.0      |      0.0       |
| 2003 |  47158  |   1   |      3.0      |      0.0       |
| 2007 |  34446  |   1   |      3.0      |      0.0       |
| 2010 |  47990  |   1   |      3.0      |      0.0       |
| 2003 |  42120  |   1   |      5.0      |      0.0       |
| 2007 |  44940  |   1   |      4.0      |      0.0       |
| 2008 |  58240  |   1   |      4.0      |      0.0       |
| 2002 |   102   |   1   |      1.0      |      0.0       |
| 2009 |  52708  |   1   |      3.0      |      0.0       |
| ...  |   ...   |  ...  |      ...      |      ...       |
+------+---------+-------+---------------+----------------+
[10000 rows x 5 columns]

The groupby function can take a variable length list of aggregation specifiers so if we want the count and the 0.25 and 0.75 quantiles of ratings:

>>> user_rating_stats = sf.groupby(['user_id', 'time'], agg.COUNT(),
...                                {'rating_quantiles': agg.QUANTILE('rating',[0.25, 0.75])})
>>> user_rating_stats
+------+---------+-------+------------------------+
| time | user_id | Count |    rating_quantiles    |
+------+---------+-------+------------------------+
| 2006 |  61285  |   1   | array('d', [4.0, 4.0]) |
| 2000 |  36078  |   1   | array('d', [4.0, 4.0]) |
| 2003 |  47158  |   1   | array('d', [3.0, 3.0]) |
| 2007 |  34446  |   1   | array('d', [3.0, 3.0]) |
| 2010 |  47990  |   1   | array('d', [3.0, 3.0]) |
| 2003 |  42120  |   1   | array('d', [5.0, 5.0]) |
| 2007 |  44940  |   1   | array('d', [4.0, 4.0]) |
| 2008 |  58240  |   1   | array('d', [4.0, 4.0]) |
| 2002 |   102   |   1   | array('d', [1.0, 1.0]) |
| 2009 |  52708  |   1   | array('d', [3.0, 3.0]) |
| ...  |   ...   |  ...  |          ...           |
+------+---------+-------+------------------------+
[10000 rows x 4 columns]

To put all items a user rated into one list value by their star rating:

>>> user_rating_stats = sf.groupby(["user_id", "rating"],
...                                {"rated_movie_ids":agg.CONCAT("movie_id")})
>>> user_rating_stats
+--------+---------+----------------------+
| rating | user_id |     rated_movie_ids  |
+--------+---------+----------------------+
|   3    |  31434  | array('d', [1663.0]) |
|   5    |  25944  | array('d', [1663.0]) |
|   4    |  38827  | array('d', [1663.0]) |
|   4    |  51437  | array('d', [1663.0]) |
|   4    |  42549  | array('d', [1663.0]) |
|   4    |  49532  | array('d', [1663.0]) |
|   3    |  26124  | array('d', [1663.0]) |
|   4    |  46336  | array('d', [1663.0]) |
|   4    |  52133  | array('d', [1663.0]) |
|   5    |  62361  | array('d', [1663.0]) |
|  ...   |   ...   |         ...          |
+--------+---------+----------------------+
[9952 rows x 3 columns]

To put all items and rating of a given user together into a dictionary value:

>>> user_rating_stats = sf.groupby("user_id",
...                                {"movie_rating":agg.CONCAT("movie_id", "rating")})
>>> user_rating_stats
+---------+--------------+
| user_id | movie_rating |
+---------+--------------+
|  62361  |  {1663: 5}   |
|  30727  |  {1663: 4}   |
|  40111  |  {1663: 2}   |
|  50513  |  {1663: 4}   |
|  35140  |  {1663: 4}   |
|  42352  |  {1663: 5}   |
|  29667  |  {1663: 4}   |
|  46242  |  {1663: 5}   |
|  58310  |  {1663: 2}   |
|  64614  |  {1663: 2}   |
|   ...   |     ...      |
+---------+--------------+
[9852 rows x 2 columns]