turicreate.SFrame.read_csv
classmethod SFrame.read_csv(url, delimiter=',', header=True, error_bad_lines=False, comment_char='', escape_char='\\', double_quote=True, quote_char='"', skip_initial_space=True, column_type_hints=None, na_values=['NA'], line_terminator='\n', usecols=[], nrows=None, skiprows=0, verbose=True, nrows_to_infer=100, true_values=[], false_values=[], _only_raw_string_substitutions=False, **kwargs)
Constructs an SFrame from a CSV file or a path to multiple CSVs.
Parameters: - url : string
Location of the CSV file or directory to load. If url is a directory or a “glob” pattern, all matching files will be loaded.
- delimiter : string, optional
The delimiter used to separate fields when parsing CSV files.
- header : bool, optional
If True, uses the first row as the column names. Otherwise, uses the default column names ‘X1, X2, …’.
- error_bad_lines : bool
If True, fails upon encountering a bad line. If False, continues parsing and skips lines that fail to parse correctly. A sample of the first 10 bad lines encountered will be printed.
- comment_char : string, optional
The character which denotes that the remainder of the line is a comment.
- escape_char : string, optional
Character which begins a C escape sequence. Defaults to backslash (\). Set to None to disable.
- double_quote : bool, optional
If True, two consecutive quotes in a string are parsed to a single quote.
- quote_char : string, optional
Character sequence that indicates a quote.
- skip_initial_space : bool, optional
Ignore extra spaces at the start of a field.
- column_type_hints : None, type, list[type], dict[string, type], optional
This provides type hints for each column. By default, this method attempts to detect the type of each column automatically.
Supported types are int, float, str, list, dict, and array.array.
- If a single type is provided, the type will be applied to all columns. For instance, column_type_hints=float will force all columns to be parsed as float.
- If a list of types is provided, the types apply to each column in order, e.g. [int, float, str] will parse the first column as int, the second as float and the third as string.
- If a dictionary of column name to type is provided, each type value in the dictionary is applied to the key it belongs to. For instance {‘user’:int} will hint that the column called “user” should be parsed as an integer, and the rest will be type inferred.
- na_values : str | list of str, optional
A string or list of strings to be interpreted as missing values.
- true_values : str | list of str, optional
A string or list of strings to be interpreted as 1.
- false_values : str | list of str, optional
A string or list of strings to be interpreted as 0.
- line_terminator : str, optional
A string to be interpreted as the line terminator. Defaults to “\n”, which will also correctly match Mac, Linux and Windows line endings (“\r”, “\n” and “\r\n” respectively).
- usecols : list of str, optional
A subset of column names to output. If unspecified (default), all columns will be read. This can provide performance gains if the number of columns is large. If the input file has no headers, usecols=[‘X1’,’X3’] will read columns 1 and 3.
- nrows : int, optional
If set, only this many rows will be read from the file.
- skiprows : int, optional
If set, this number of rows at the start of the file are skipped.
- verbose : bool, optional
If True, print the progress.
- nrows_to_infer : integer
The number of rows used to infer column types.
Returns: - out : SFrame
Examples
Read a regular CSV file with all default options, determining types automatically:
>>> url = 'https://static.turi.com/datasets/rating_data_example.csv'
>>> sf = turicreate.SFrame.read_csv(url)
>>> sf
Columns:
  user_id int
  movie_id int
  rating int
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]
Read only the first 100 rows of the CSV file:
>>> sf = turicreate.SFrame.read_csv(url, nrows=100)
>>> sf
Columns:
  user_id int
  movie_id int
  rating int
Rows: 100
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[100 rows x 3 columns]
Read all columns as str type:
>>> sf = turicreate.SFrame.read_csv(url, column_type_hints=str)
>>> sf
Columns:
  user_id str
  movie_id str
  rating str
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]
Specify types for a subset of columns and let the rest be inferred:
>>> sf = turicreate.SFrame.read_csv(url,
...                                 column_type_hints={
...                                    'user_id':int, 'rating':float
...                                 })
>>> sf
Columns:
  user_id int
  movie_id int
  rating float
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |  3.0   |
|  25907  |   1663   |  3.0   |
|  25923  |   1663   |  3.0   |
|  25924  |   1663   |  3.0   |
|  25928  |   1663   |  2.0   |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]
Do not treat the first line as a header:
>>> sf = turicreate.SFrame.read_csv(url, header=False)
>>> sf
Columns:
  X1 str
  X2 str
  X3 str
Rows: 10001
+---------+----------+--------+
|    X1   |    X2    |   X3   |
+---------+----------+--------+
| user_id | movie_id | rating |
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10001 rows x 3 columns]
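For a file that really has no header row, the default names X1, X2, … can be combined with usecols, as the usecols description notes. The following is a minimal sketch, not an excerpt from the library's own examples: the sample rows and the temporary path are made up for illustration, and the turicreate import is guarded so the script still runs where the package is not installed.

```python
import csv
import os
import tempfile

# Hypothetical sample data: three columns, no header row.
rows = [[25904, 1663, 3], [25907, 1663, 3], [25928, 1663, 2]]

path = os.path.join(tempfile.mkdtemp(), "ratings_no_header.csv")
with open(path, "w", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)

# With header=False the columns are named X1, X2, X3, so usecols
# refers to those default names.
read_options = {"header": False, "usecols": ["X1", "X3"], "verbose": False}

try:
    import turicreate
except ImportError:  # assumption: turicreate may be absent on this machine
    turicreate = None

if turicreate is not None:
    sf = turicreate.SFrame.read_csv(path, **read_options)
    print(sf.column_names())
```

Keeping the options in a dictionary makes it easy to reuse the same settings for several files.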
Treat ‘3’ as missing value:
>>> sf = turicreate.SFrame.read_csv(url, na_values=['3'], column_type_hints=str)
>>> sf
Columns:
  user_id str
  movie_id str
  rating str
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |  None  |
|  25907  |   1663   |  None  |
|  25923  |   1663   |  None  |
|  25924  |   1663   |  None  |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]
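The true_values and false_values parameters work alongside na_values. The sketch below is illustrative only: the file contents, the column names and the 'yes'/'no'/'?' markers are assumptions chosen for the example, and the turicreate import is guarded so the script runs even without the package installed.

```python
import os
import tempfile

# Hypothetical sample file: 'yes'/'no' flags and '?' for unknown values.
text = "user_id,subscribed\n1,yes\n2,no\n3,?\n"

path = os.path.join(tempfile.mkdtemp(), "flags.csv")
with open(path, "w") as f:
    f.write(text)

read_options = {
    "true_values": ["yes"],   # strings to be interpreted as 1
    "false_values": ["no"],   # strings to be interpreted as 0
    "na_values": ["?"],       # strings to be interpreted as missing
    "verbose": False,
}

try:
    import turicreate
except ImportError:  # assumption: turicreate may be absent on this machine
    turicreate = None

if turicreate is not None:
    sf = turicreate.SFrame.read_csv(path, **read_options)
    print(sf)
```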
Throw error on parse failure:
>>> bad_url = 'https://static.turi.com/datasets/bad_csv_example.csv'
>>> sf = turicreate.SFrame.read_csv(bad_url, error_bad_lines=True)
RuntimeError: Runtime Exception. Unable to parse line "x,y,z,a,b,c"
Set error_bad_lines=False to skip bad lines
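Files that deviate from the default dialect can be read by setting delimiter and comment_char, optionally with a list of column_type_hints. This is a hedged sketch under stated assumptions: the semicolon-separated sample file and its comment line are invented for the example, and the turicreate import is guarded so the script still runs where the package is unavailable.

```python
import os
import tempfile

# Hypothetical sample file: semicolon-separated, with a '#' comment line.
text = (
    "user_id;movie_id;rating\n"
    "25904;1663;3\n"
    "# re-exported batch\n"
    "25907;1663;3\n"
)

path = os.path.join(tempfile.mkdtemp(), "ratings_semicolon.csv")
with open(path, "w") as f:
    f.write(text)

read_options = {
    "delimiter": ";",                        # field separator used by the file
    "comment_char": "#",                     # rest of the line is a comment
    "column_type_hints": [int, int, float],  # one type per column, in order
    "verbose": False,
}

try:
    import turicreate
except ImportError:  # assumption: turicreate may be absent on this machine
    turicreate = None

if turicreate is not None:
    sf = turicreate.SFrame.read_csv(path, **read_options)
    print(sf)
```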