Introduction to SFrames

SFrames are the primary data structure for extracting data from other sources for use in Turi Create.

An SFrame is a scalable, disk-backed data frame, so you can easily work with datasets that are larger than your available RAM.

SFrames can extract data from several static file formats, including csv and JSON.

A very common data format is the comma-separated values (csv) file, which is what we'll use for these examples. We will use some preprocessed data from the Million Song Dataset to aid our SFrame-related examples [1]. The first table contains metadata about each song in the database. Here's how we load it into an SFrame:

import turicreate as tc
songs = tc.SFrame.read_csv("millionsong/song_data.csv")

No options are needed for the simplest case, as the SFrame parser infers column types. Of course, there are many options you may need to specify when importing a csv file. Some of the more common options come into play when we load the usage data of users listening to these songs online:

usage_data = tc.SFrame.read_csv("millionsong/10000.txt",
                                header=False,
                                delimiter='\t',
                                column_type_hints={'X3':int})

The header and delimiter options are needed because this particular csv file does not provide column names in its first line, and the values are separated by tabs, not commas. The column_type_hints option tells the SFrame csv parser to read column X3 as an integer instead of inferring its datatype, which the parser does by default for every column. For a full list of options when parsing csv files, check our API Reference.
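To make these three options concrete, here is a minimal sketch in plain Python (standard library only, not Turi Create's parser) of what they describe: split each line on tabs, generate the default column names X1, X2, ... because there is no header row, and convert the hinted column to int. The helper name read_headerless and the two sample rows are hypothetical and exist only for illustration. Unlike the SFrame parser, the sketch does no type inference, so columns without a hint stay strings.

```python
import csv
import io

def read_headerless(text, delimiter="\t", type_hints=None):
    """Parse delimited text with no header row into a dict of columns.

    Columns get default names X1, X2, ...; values stay strings unless
    a type is given for that column name in type_hints.
    """
    type_hints = type_hints or {}
    columns = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        for i, value in enumerate(row, start=1):
            name = "X%d" % i
            convert = type_hints.get(name, str)
            columns.setdefault(name, []).append(convert(value))
    return columns

# Two made-up rows in the same shape as the usage file (not real data).
sample = "user_a\tSONG0001\t1\nuser_a\tSONG0002\t5\n"
parsed = read_headerless(sample, type_hints={"X3": int})
print(parsed)
```

Only X3 is converted here, which mirrors passing a dictionary with a single entry to column_type_hints.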

Once that is done, we can inspect the first few rows of the tables we've imported:

songs
Columns:
    song_id    str
    title    str
    release    str
    artist_name    str
    year    int

Rows: 1000000

+--------------------+--------------------------------+
|      song_id       |             title              |
+--------------------+--------------------------------+
| SOQMMHC12AB0180CB8 |          Silent Night          |
| SOVFVAK12A8C1350D9 |          Tanssi vaan           |
| SOGTUKN12AB017F4F1 |       No One Could Ever        |
| SOBNYVR12A8C13558C |         Si Vos Querés          |
| SOHSBXH12A8C13B0DF |        Tantce Of Aspens        |
| SOZVAPQ12A8C13B63C | Symphony No. 1 G minor "Si ... |
| SOQVRHI12A6D4FB2D7 |        We Have Got Love        |
| SOEYRFT12AB018936C |       2 Da Beat Ch'yall        |
| SOPMIYT12A6D4F851E |            Goodbye             |
| SOJCFMH12A8C13B0C2 |   Mama_ mama can't you see ?   |
+--------------------+--------------------------------+
+--------------------------------+--------------------------------+------+
|            release             |          artist_name           | year |
+--------------------------------+--------------------------------+------+
|     Monster Ballads X-Mas      |        Faster Pussy cat        | 2003 |
|          Karkuteillä           |        Karkkiautomaatti        | 1995 |
|             Butter             |         Hudson Mohawke         | 2006 |
|            De Culo             |          Yerba Brava           | 2003 |
| Rene Ablaze Presents Winte ... |           Der Mystic           |  0   |
| Berwald: Symphonies Nos. 1 ... |        David Montgomery        |  0   |
|   Strictly The Best Vol. 34    |       Sasha / Turbulence       |  0   |
|            Da Bomb             |           Kris Kross           | 1993 |
|           Danny Boy            |          Joseph Locke          |  0   |
| March to cadence with the  ... | The Sun Harbor's Chorus-Do ... |  0   |
|              ...               |              ...               | ...  |
+--------------------------------+--------------------------------+------+
[1000000 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
usage_data
Columns:
    X1    str
    X2    str
    X3    int

Rows: 2000000

+--------------------------------+--------------------+-----+
|               X1               |         X2         |  X3 |
+--------------------------------+--------------------+-----+
| b80344d063b5ccb3212f76538f ... | SOAKIMP12A8C130995 |  1  |
| b80344d063b5ccb3212f76538f ... | SOBBMDR12A8C13253B |  2  |
| b80344d063b5ccb3212f76538f ... | SOBXHDL12A81C204C0 |  1  |
| b80344d063b5ccb3212f76538f ... | SOBYHAJ12A6701BF1D |  1  |
| b80344d063b5ccb3212f76538f ... | SODACBL12A8C13C273 |  1  |
| b80344d063b5ccb3212f76538f ... | SODDNQT12A6D4F5F7E |  5  |
| b80344d063b5ccb3212f76538f ... | SODXRTY12AB0180F3B |  1  |
| b80344d063b5ccb3212f76538f ... | SOFGUAY12AB017B0A8 |  1  |
| b80344d063b5ccb3212f76538f ... | SOFRQTD12A81C233C0 |  1  |
| b80344d063b5ccb3212f76538f ... | SOHQWYZ12A6D4FA701 |  1  |
|              ...               |        ...         | ... |
+--------------------------------+--------------------+-----+
[2000000 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Here we might want to rename columns from the default names:

usage_data = usage_data.rename({'X1':'user_id', 'X2':'song_id', 'X3':'listen_count'})

SFrames can be saved as a csv file or in the SFrame binary format. If your SFrame is saved in binary format, loading it is nearly instantaneous, so we won't ever have to parse that file again. Binary is the default format when saving, and we supply the name of a directory to be created, which will hold the binary files:

usage_data.save('./music_usage_data.sframe')

Loading is then very fast:

same_usage_data = tc.load_sframe('./music_usage_data.sframe')

In addition to these functions, JSON imports and exports and SQL/ODBC imports are also supported. For further information, see the respective pages in the Turi Create API Documentation.

Data Types

An SFrame is made up of columns, each of which holds values of a single type. For instance, the songs SFrame is made up of 5 columns of the following types:

    song_id        str
    title          str
    release        str
    artist_name    str
    year           int

In this SFrame we see only string (str) and integer (int) columns, but a number of datatypes are supported:

  • int (signed 64-bit integer)
  • float (double-precision floating point)
  • str (string)
  • array.array (1-D array of doubles)
  • list (arbitrary list of elements)
  • dict (arbitrary dictionary of elements)
  • datetime.datetime (datetime with microsecond precision)
  • image (image)
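As a small illustration of the list above, the sketch below builds one Python value list per datatype and, if Turi Create is installed, turns them into an SFrame and prints the detected column types. The column names are made up for this example, the image type is left out because it needs an image file, and the import guard is only there so the sketch also runs where Turi Create is not available.

```python
import array
import datetime

# One example column per datatype, using the Python-side types listed
# above. The column names are hypothetical.
columns = {
    "an_int": [1, 2],
    "a_float": [0.5, 1.5],
    "a_str": ["a", "b"],
    "an_array": [array.array("d", [1.0, 2.0]), array.array("d", [3.0])],
    "a_list": [[1, "x"], [2.0]],
    "a_dict": [{"k": 1}, {"k": 2}],
    "a_datetime": [datetime.datetime(2020, 1, 1),
                   datetime.datetime(2020, 1, 2)],
}

try:
    import turicreate as tc
except ImportError:
    tc = None  # lets the sketch run without Turi Create installed

if tc is not None:
    # Build an SFrame from the dict of lists and show each column's type.
    sf = tc.SFrame(columns)
    print(sf.column_types())
```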

Random SFrames

generate_random_sframe: This function generates a random SFrame. It takes the number of observations (rows), a string of column type codes in which each character denotes the type of one column, and a seed for the random generation.
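A call to generate_random_sframe might look like the sketch below. It assumes the function is reachable as tc.util.generate_random_sframe and takes the number of rows, the column codes and the seed in that order, and that the codes 'n', 'z' and 's' stand for a numeric, an integer and a categorical string column; check the API reference for the exact signature and the full list of codes. The argument values are arbitrary, and the import guard only keeps the sketch runnable without Turi Create.

```python
# Arguments for the generator; the values are arbitrary examples.
num_rows = 5
column_codes = "nzs"  # one character per column (assumed codes)
random_seed = 42

try:
    import turicreate as tc
except ImportError:
    tc = None  # lets the sketch run without Turi Create installed

if tc is not None:
    # Assumed argument order: rows, column codes, seed.
    random_sf = tc.util.generate_random_sframe(num_rows, column_codes,
                                               random_seed)
    print(random_sf)
```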
