turicreate.SFrame.read_csv

classmethod SFrame.read_csv(url, delimiter=',', header=True, error_bad_lines=False, comment_char='', escape_char='\\', double_quote=True, quote_char='"', skip_initial_space=True, column_type_hints=None, na_values=['NA'], line_terminator='\n', usecols=[], nrows=None, skiprows=0, verbose=True, nrows_to_infer=100, true_values=[], false_values=[], _only_raw_string_substitutions=False, **kwargs)

Constructs an SFrame from a CSV file or a path to multiple CSVs.

Parameters:

url : string

Location of the CSV file or directory to load. If URL is a directory or a “glob” pattern, all matching files will be loaded.
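As a minimal sketch of the glob behavior described above, the snippet below writes two small CSV files to a temporary directory and passes a `*.csv` pattern as the url. The file names and contents are hypothetical, and the read itself only runs if Turi Create is installed.

```python
import glob
import os
import tempfile

# Hypothetical sample data: two small CSV files sharing the same header.
tmpdir = tempfile.mkdtemp()
for name, rows in [("part1.csv", ["1,10", "2,20"]), ("part2.csv", ["3,30"])]:
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write("id,value\n" + "\n".join(rows) + "\n")

pattern = os.path.join(tmpdir, "*.csv")
matching = sorted(glob.glob(pattern))  # files the pattern is expected to cover

try:
    import turicreate
except ImportError:
    turicreate = None  # Turi Create not installed; skip the actual read

if turicreate is not None:
    # Per the description above, all matching files are loaded into one SFrame.
    sf = turicreate.SFrame.read_csv(pattern, verbose=False)
    print(sf)
```

The standard-library `glob` call is only there to show which files the pattern covers; it is not part of the read_csv API.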

delimiter : string, optional

This describes the delimiter used for parsing csv files.

header : bool, optional

If True, the first row is used as the column names. Otherwise, the default column names ‘X1, X2, …’ are used.

error_bad_lines : bool

If True, parsing fails upon encountering a bad line. If False, parsing continues and lines that fail to parse correctly are skipped. A sample of the first 10 bad lines encountered will be printed.

comment_char : string, optional

The character which denotes that the remainder of the line is a comment.

escape_char : string, optional

Character which begins a C escape sequence. Defaults to backslash (\). Set to None to disable.

double_quote : bool, optional

If True, two consecutive quotes in a string are parsed to a single quote.
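The doubled-quote convention can be illustrated with a small sketch. Python's standard `csv` module follows the same convention and is used here only as a stand-in to show the expected field value; the sample content and file name are hypothetical, and the Turi Create read only runs if the package is installed.

```python
import csv
import io
import os
import tempfile

# Hypothetical sample: the second field contains embedded quotes, written
# using the doubled-quote convention ("" inside a quoted field).
content = 'id,comment\n1,"He said ""hi"""\n'

# Stand-in check with the standard library, which follows the same convention.
rows = list(csv.reader(io.StringIO(content), doublequote=True))
print(rows[1][1])  # He said "hi"

path = os.path.join(tempfile.mkdtemp(), "quotes.csv")
with open(path, "w") as f:
    f.write(content)

try:
    import turicreate
except ImportError:
    turicreate = None  # Turi Create not installed; skip the actual read

if turicreate is not None:
    # double_quote=True is the default; it is spelled out here for clarity.
    sf = turicreate.SFrame.read_csv(path, double_quote=True, verbose=False)
    print(sf["comment"][0])
```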

quote_char : string, optional

Character sequence that indicates a quote.

skip_initial_space : bool, optional

Ignore extra spaces at the start of a field.

column_type_hints : None, type, list[type], dict[string, type], optional

This provides type hints for each column. By default, this method attempts to detect the type of each column automatically.

Supported types are int, float, str, list, dict, and array.array.

If a single type is provided, the type will be applied to all columns. For instance, column_type_hints=float will force all columns to be parsed as float.
If a list of types is provided, the types apply to each column in order, e.g. [int, float, str] will parse the first column as int, the second as float and the third as string.
If a dictionary of column name to type is provided, each type value in the dictionary is applied to the key it belongs to. For instance {‘user’:int} will hint that the column called “user” should be parsed as an integer, and the rest will be type inferred.

na_values : str | list of str, optional

A string or list of strings to be interpreted as missing values.

true_values : str | list of str, optional

A string or list of strings to be interpreted as 1.

false_values : str | list of str, optional

A string or list of strings to be interpreted as 0.
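A minimal sketch of how na_values, true_values and false_values might be combined is shown below. The sample file, its column names, and the chosen marker strings ("N/A", "yes", "no") are hypothetical; the read only runs if Turi Create is installed.

```python
import os
import tempfile

# Hypothetical sample: "yes"/"no" flags and an "N/A" marker for a missing score.
content = "user,active,score\nann,yes,4\nbob,no,N/A\ncid,yes,5\n"
path = os.path.join(tempfile.mkdtemp(), "flags.csv")
with open(path, "w") as f:
    f.write(content)

# Keyword arguments taken from the signature above.
options = {
    "na_values": ["N/A"],     # treated as missing values
    "true_values": ["yes"],   # interpreted as 1
    "false_values": ["no"],   # interpreted as 0
    "verbose": False,
}

try:
    import turicreate
except ImportError:
    turicreate = None  # Turi Create not installed; skip the actual read

if turicreate is not None:
    sf = turicreate.SFrame.read_csv(path, **options)
    print(sf)
```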

line_terminator : str, optional

A string to be interpreted as the line terminator. Defaults to “\n”, which will also correctly match Mac, Linux and Windows line endings (“\r”, “\n” and “\r\n” respectively).
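As a small sketch of the default line_terminator handling, the snippet below writes a file with Windows-style line endings in binary mode, so the bytes on disk are exactly as shown, and reads it with default settings. The sample content and file name are hypothetical; the read only runs if Turi Create is installed.

```python
import os
import tempfile

# Hypothetical sample with Windows-style (\r\n) line endings, written in
# binary mode so that Python does not translate the newlines.
raw = b"a,b\r\n1,2\r\n3,4\r\n"
path = os.path.join(tempfile.mkdtemp(), "crlf.csv")
with open(path, "wb") as f:
    f.write(raw)

try:
    import turicreate
except ImportError:
    turicreate = None  # Turi Create not installed; skip the actual read

if turicreate is not None:
    # line_terminator is left at its default; per the description above,
    # the default is expected to handle \r\n endings as well.
    sf = turicreate.SFrame.read_csv(path, verbose=False)
    print(sf)
```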

usecols : list of str, optional

A subset of column names to output. If unspecified (default), all columns will be read. This can provide performance gains if the number of columns is large. If the input file has no headers, usecols=[‘X1’, ‘X3’] will read columns 1 and 3.

nrows : int, optional

If set, only this many rows will be read from the file.

skiprows : int, optional

If set, this number of rows at the start of the file is skipped.

verbose : bool, optional

If True, print the progress.

nrows_to_infer : integer

The number of rows used to infer column types.
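The following is a minimal sketch of limiting a read with usecols and nrows. The sample file and its column names are hypothetical; the read only runs if Turi Create is installed.

```python
import os
import tempfile

# Hypothetical sample file with four columns and five data rows.
path = os.path.join(tempfile.mkdtemp(), "wide.csv")
with open(path, "w") as f:
    f.write("a,b,c,d\n")
    for i in range(5):
        f.write("{0},{1},{2},{3}\n".format(i, i * 2, i * 3, i * 4))

# Keyword arguments taken from the signature above.
options = {
    "usecols": ["a", "c"],  # only output these two columns
    "nrows": 3,             # read at most three rows
    "verbose": False,
}

try:
    import turicreate
except ImportError:
    turicreate = None  # Turi Create not installed; skip the actual read

if turicreate is not None:
    sf = turicreate.SFrame.read_csv(path, **options)
    print(sf.column_names(), len(sf))
```

Restricting the columns this way is most useful on wide files, where skipping unused columns can reduce parsing work.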

Returns:

out : SFrame

Examples

Read a regular csv file, with all default options, automatically determine types:

>>> url = 'https://static.turi.com/datasets/rating_data_example.csv'
>>> sf = turicreate.SFrame.read_csv(url)
>>> sf
Columns:
  user_id int
  movie_id  int
  rating  int
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Read only the first 100 lines of the csv file:

>>> sf = turicreate.SFrame.read_csv(url, nrows=100)
>>> sf
Columns:
  user_id int
  movie_id  int
  rating  int
Rows: 100
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[100 rows x 3 columns]

Read all columns as str type:

>>> sf = turicreate.SFrame.read_csv(url, column_type_hints=str)
>>> sf
Columns:
  user_id  str
  movie_id  str
  rating  str
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Specify types for a subset of columns and let the rest be inferred:

>>> sf = turicreate.SFrame.read_csv(url,
...                               column_type_hints={
...                               'user_id':int, 'rating':float
...                               })
>>> sf
Columns:
  user_id int
  movie_id  int
  rating  float
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |  3.0   |
|  25907  |   1663   |  3.0   |
|  25923  |   1663   |  3.0   |
|  25924  |   1663   |  3.0   |
|  25928  |   1663   |  2.0   |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Do not treat the first line as a header:

>>> sf = turicreate.SFrame.read_csv(url, header=False)
>>> sf
Columns:
  X1  str
  X2  str
  X3  str
Rows: 10001
+---------+----------+--------+
|    X1   |    X2    |   X3   |
+---------+----------+--------+
| user_id | movie_id | rating |
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10001 rows x 3 columns]

Treat ‘3’ as missing value:

>>> sf = turicreate.SFrame.read_csv(url, na_values=['3'], column_type_hints=str)
>>> sf
Columns:
  user_id str
  movie_id  str
  rating  str
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |  None  |
|  25907  |   1663   |  None  |
|  25923  |   1663   |  None  |
|  25924  |   1663   |  None  |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Throw error on parse failure:

>>> bad_url = 'https://static.turi.com/datasets/bad_csv_example.csv'
>>> sf = turicreate.SFrame.read_csv(bad_url, error_bad_lines=True)
RuntimeError: Runtime Exception. Unable to parse line "x,y,z,a,b,c"
Set error_bad_lines=False to skip bad lines