turicreate.SFrame.groupby

SFrame.groupby(key_column_names, operations, *args)

Perform a group on the key_column_names followed by aggregations on the columns listed in operations.

The operations parameter is a dictionary that indicates which aggregation operators to use and which columns to use them on. The available operators are SUM, MAX, MIN, COUNT, AVG, VAR, STDV, CONCAT, SELECT_ONE, ARGMIN, ARGMAX, and QUANTILE. For convenience, aggregators MEAN, STD, and VARIANCE are available as synonyms for AVG, STDV, and VAR. See aggregate for more detail on the aggregators.

Parameters:
key_column_names : string | list[string]

Column(s) to group by. Key columns can be of any type other than dictionary.

operations : dict, list

Dictionary of columns and aggregation operations. Each key is a output column name and each value is an aggregator. This can also be a list of aggregators, in which case column names will be automatically assigned.

*args

All other remaining arguments will be interpreted in the same way as the operations argument.

Returns:
out_sf : SFrame

A new SFrame, with a column for each groupby column and each aggregation operation.

See also

aggregate

Notes

  • Numeric aggregators (such as sum, mean, stdev etc.) follow the skip

None policy i.e they will omit all missing values from the aggregation. As an example, sum([None, 5, 10]) = 15 because the None value is skipped. * Aggregators have a default value when no values (after skipping all None values) are present. Default values are None for [‘ARGMAX’, ‘ARGMIN’, ‘AVG’, ‘STD’, ‘MEAN’, ‘MIN’, ‘MAX’], 0 for [‘COUNT’ ‘COUNT_DISTINCT’, ‘DISTINCT’] [] for ‘CONCAT’, ‘QUANTILE’, ‘DISTINCT’, and {} for ‘FREQ_COUNT’.

Examples

Suppose we have an SFrame with movie ratings by many users.

>>> import turicreate.aggregate as agg
>>> url = 'https://static.turi.com/datasets/rating_data_example.csv'
>>> sf = turicreate.SFrame.read_csv(url)
>>> sf
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|  25933  |   1663   |   4    |
|  25934  |   1663   |   4    |
|  25935  |   1663   |   4    |
|  25936  |   1663   |   5    |
|  25937  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Compute the number of occurrences of each user.

>>> user_count = sf.groupby(key_column_names='user_id',
...                         operations={'count': agg.COUNT()})
>>> user_count
+---------+-------+
| user_id | count |
+---------+-------+
|  62361  |   1   |
|  30727  |   1   |
|  40111  |   1   |
|  50513  |   1   |
|  35140  |   1   |
|  42352  |   1   |
|  29667  |   1   |
|  46242  |   1   |
|  58310  |   1   |
|  64614  |   1   |
|   ...   |  ...  |
+---------+-------+
[9852 rows x 2 columns]

Compute the mean and standard deviation of ratings per user.

>>> user_rating_stats = sf.groupby(key_column_names='user_id',
...                                operations={
...                                    'mean_rating': agg.MEAN('rating'),
...                                    'std_rating': agg.STD('rating')
...                                })
>>> user_rating_stats
+---------+-------------+------------+
| user_id | mean_rating | std_rating |
+---------+-------------+------------+
|  62361  |     5.0     |    0.0     |
|  30727  |     4.0     |    0.0     |
|  40111  |     2.0     |    0.0     |
|  50513  |     4.0     |    0.0     |
|  35140  |     4.0     |    0.0     |
|  42352  |     5.0     |    0.0     |
|  29667  |     4.0     |    0.0     |
|  46242  |     5.0     |    0.0     |
|  58310  |     2.0     |    0.0     |
|  64614  |     2.0     |    0.0     |
|   ...   |     ...     |    ...     |
+---------+-------------+------------+
[9852 rows x 3 columns]

Compute the movie with the minimum rating per user.

>>> chosen_movies = sf.groupby(key_column_names='user_id',
...                            operations={
...                                'worst_movies': agg.ARGMIN('rating','movie_id')
...                            })
>>> chosen_movies
+---------+-------------+
| user_id | worst_movies |
+---------+-------------+
|  62361  |     1663    |
|  30727  |     1663    |
|  40111  |     1663    |
|  50513  |     1663    |
|  35140  |     1663    |
|  42352  |     1663    |
|  29667  |     1663    |
|  46242  |     1663    |
|  58310  |     1663    |
|  64614  |     1663    |
|   ...   |     ...     |
+---------+-------------+
[9852 rows x 2 columns]

Compute the movie with the max rating per user and also the movie with the maximum imdb-ranking per user.

>>> sf['imdb-ranking'] = sf['rating'] * 10
>>> chosen_movies = sf.groupby(key_column_names='user_id',
...         operations={('max_rating_movie','max_imdb_ranking_movie'): agg.ARGMAX(('rating','imdb-ranking'),'movie_id')})
>>> chosen_movies
+---------+------------------+------------------------+
| user_id | max_rating_movie | max_imdb_ranking_movie |
+---------+------------------+------------------------+
|  62361  |       1663       |          16630         |
|  30727  |       1663       |          16630         |
|  40111  |       1663       |          16630         |
|  50513  |       1663       |          16630         |
|  35140  |       1663       |          16630         |
|  42352  |       1663       |          16630         |
|  29667  |       1663       |          16630         |
|  46242  |       1663       |          16630         |
|  58310  |       1663       |          16630         |
|  64614  |       1663       |          16630         |
|   ...   |       ...        |          ...           |
+---------+------------------+------------------------+
[9852 rows x 3 columns]

Compute the movie with the max rating per user.

>>> chosen_movies = sf.groupby(key_column_names='user_id',
            operations={'best_movies': agg.ARGMAX('rating','movie')})

Compute the movie with the max rating per user and also the movie with the maximum imdb-ranking per user.

>>> chosen_movies = sf.groupby(key_column_names='user_id',
           operations={('max_rating_movie','max_imdb_ranking_movie'): agg.ARGMAX(('rating','imdb-ranking'),'movie')})

Compute the count, mean, and standard deviation of ratings per (user, time), automatically assigning output column names.

>>> sf['time'] = sf.apply(lambda x: (x['user_id'] + x['movie_id']) % 11 + 2000)
>>> user_rating_stats = sf.groupby(['user_id', 'time'],
...                                [agg.COUNT(),
...                                 agg.AVG('rating'),
...                                 agg.STDV('rating')])
>>> user_rating_stats
+------+---------+-------+---------------+----------------+
| time | user_id | Count | Avg of rating | Stdv of rating |
+------+---------+-------+---------------+----------------+
| 2006 |  61285  |   1   |      4.0      |      0.0       |
| 2000 |  36078  |   1   |      4.0      |      0.0       |
| 2003 |  47158  |   1   |      3.0      |      0.0       |
| 2007 |  34446  |   1   |      3.0      |      0.0       |
| 2010 |  47990  |   1   |      3.0      |      0.0       |
| 2003 |  42120  |   1   |      5.0      |      0.0       |
| 2007 |  44940  |   1   |      4.0      |      0.0       |
| 2008 |  58240  |   1   |      4.0      |      0.0       |
| 2002 |   102   |   1   |      1.0      |      0.0       |
| 2009 |  52708  |   1   |      3.0      |      0.0       |
| ...  |   ...   |  ...  |      ...      |      ...       |
+------+---------+-------+---------------+----------------+
[10000 rows x 5 columns]

The groupby function can take a variable length list of aggregation specifiers so if we want the count and the 0.25 and 0.75 quantiles of ratings:

>>> user_rating_stats = sf.groupby(['user_id', 'time'], agg.COUNT(),
...                                {'rating_quantiles': agg.QUANTILE('rating',[0.25, 0.75])})
>>> user_rating_stats
+------+---------+-------+------------------------+
| time | user_id | Count |    rating_quantiles    |
+------+---------+-------+------------------------+
| 2006 |  61285  |   1   | array('d', [4.0, 4.0]) |
| 2000 |  36078  |   1   | array('d', [4.0, 4.0]) |
| 2003 |  47158  |   1   | array('d', [3.0, 3.0]) |
| 2007 |  34446  |   1   | array('d', [3.0, 3.0]) |
| 2010 |  47990  |   1   | array('d', [3.0, 3.0]) |
| 2003 |  42120  |   1   | array('d', [5.0, 5.0]) |
| 2007 |  44940  |   1   | array('d', [4.0, 4.0]) |
| 2008 |  58240  |   1   | array('d', [4.0, 4.0]) |
| 2002 |   102   |   1   | array('d', [1.0, 1.0]) |
| 2009 |  52708  |   1   | array('d', [3.0, 3.0]) |
| ...  |   ...   |  ...  |          ...           |
+------+---------+-------+------------------------+
[10000 rows x 4 columns]

To put all items a user rated into one list value by their star rating:

>>> user_rating_stats = sf.groupby(["user_id", "rating"],
...                                {"rated_movie_ids":agg.CONCAT("movie_id")})
>>> user_rating_stats
+--------+---------+----------------------+
| rating | user_id |     rated_movie_ids  |
+--------+---------+----------------------+
|   3    |  31434  | array('d', [1663.0]) |
|   5    |  25944  | array('d', [1663.0]) |
|   4    |  38827  | array('d', [1663.0]) |
|   4    |  51437  | array('d', [1663.0]) |
|   4    |  42549  | array('d', [1663.0]) |
|   4    |  49532  | array('d', [1663.0]) |
|   3    |  26124  | array('d', [1663.0]) |
|   4    |  46336  | array('d', [1663.0]) |
|   4    |  52133  | array('d', [1663.0]) |
|   5    |  62361  | array('d', [1663.0]) |
|  ...   |   ...   |         ...          |
+--------+---------+----------------------+
[9952 rows x 3 columns]

To put all items and rating of a given user together into a dictionary value:

>>> user_rating_stats = sf.groupby("user_id",
...                                {"movie_rating":agg.CONCAT("movie_id", "rating")})
>>> user_rating_stats
+---------+--------------+
| user_id | movie_rating |
+---------+--------------+
|  62361  |  {1663: 5}   |
|  30727  |  {1663: 4}   |
|  40111  |  {1663: 2}   |
|  50513  |  {1663: 4}   |
|  35140  |  {1663: 4}   |
|  42352  |  {1663: 5}   |
|  29667  |  {1663: 4}   |
|  46242  |  {1663: 5}   |
|  58310  |  {1663: 2}   |
|  64614  |  {1663: 2}   |
|   ...   |     ...      |
+---------+--------------+
[9852 rows x 2 columns]