turicreate.SFrame.read_csv_with_errors

classmethod SFrame.read_csv_with_errors(url, delimiter=',', header=True, comment_char='', escape_char='\\', double_quote=True, quote_char='"', skip_initial_space=True, column_type_hints=None, na_values=['NA'], line_terminator='\n', usecols=[], nrows=None, skiprows=0, verbose=True, nrows_to_infer=100, true_values=[], false_values=[], _only_raw_string_substitutions=False, **kwargs)

Constructs an SFrame from a CSV file or a path to multiple CSVs, and returns a pair containing the SFrame and a dict that maps each filename to an SArray of the lines in that file that could not be parsed.

Parameters:
url : string

Location of the CSV file or directory to load. If url is a directory or a “glob” pattern, all matching files will be loaded.

delimiter : string, optional

The delimiter used for parsing CSV files.

header : bool, optional

If True, uses the first row as the column names. Otherwise, uses the default column names: ‘X1, X2, …’.

comment_char : string, optional

The character which denotes that the remainder of the line is a comment.

escape_char : string, optional

Character which begins a C escape sequence. Defaults to backslash (\). Set to None to disable.

double_quote : bool, optional

If True, two consecutive quotes in a string are parsed to a single quote.
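The doubled-quote convention described above can be illustrated with Python's standard csv module, which follows the same rule when doublequote=True. This is only a stand-in to show the convention; it does not call turicreate, and the sample text is made up for illustration.

```python
import csv
import io

# Hypothetical two-line CSV: the second field of the data row contains
# doubled quotes inside a quoted string.
raw = 'id,text\n1,"a ""quoted"" word"\n'

# With doublequote=True, each pair of consecutive quote characters inside
# a quoted field is read as one literal quote character.
rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"', doublequote=True))

print(rows[1])  # ['1', 'a "quoted" word']
```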

quote_char : string, optional

Character sequence that indicates a quote.

skip_initial_space : bool, optional

Ignore extra spaces at the start of a field.

column_type_hints : None, type, list[type], dict[string, type], optional

This provides type hints for each column. By default, this method attempts to detect the type of each column automatically.

Supported types are int, float, str, list, dict, and array.array.

  • If a single type is provided, the type will be applied to all columns. For instance, column_type_hints=float will force all columns to be parsed as float.
  • If a list of types is provided, the types apply to each column in order, e.g. [int, float, str] will parse the first column as int, the second as float and the third as string.
  • If a dictionary of column name to type is provided, each type value in the dictionary is applied to the key it belongs to. For instance {‘user’:int} will hint that the column called “user” should be parsed as an integer, and the rest will be type inferred.
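The three forms of column_type_hints listed above can be sketched as a small pure-Python function. This is an illustration of the rules as described, not turicreate's implementation: resolve_hints is a hypothetical helper, None stands for "left to automatic type inference", and the sketch assumes a list of hints supplies exactly one type per column.

```python
def resolve_hints(column_names, column_type_hints=None):
    """Map each column name to its hinted type, or None if it is inferred.

    Hypothetical helper that mirrors the documented rules; it is not part
    of turicreate.
    """
    if column_type_hints is None:
        # No hints: every column is left to type inference.
        return {name: None for name in column_names}
    if isinstance(column_type_hints, type):
        # A single type applies to all columns.
        return {name: column_type_hints for name in column_names}
    if isinstance(column_type_hints, list):
        # Assumption of this sketch: one type per column, in order.
        if len(column_type_hints) != len(column_names):
            raise ValueError("expected one type hint per column")
        return dict(zip(column_names, column_type_hints))
    if isinstance(column_type_hints, dict):
        # Named columns get their hint; the rest are inferred.
        return {name: column_type_hints.get(name) for name in column_names}
    raise TypeError("unsupported column_type_hints value")


columns = ["user", "movie", "rating"]
print(resolve_hints(columns, float))
print(resolve_hints(columns, [int, float, str]))
print(resolve_hints(columns, {"user": int}))
```

In the last call only "user" is hinted as int, so "movie" and "rating" map to None, matching the statement that the remaining columns are type inferred.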
na_values : str | list of str, optional

A string or list of strings to be interpreted as missing values.

true_values : str | list of str, optional

A string or list of strings to be interpreted as 1.

false_values : str | list of str, optional

A string or list of strings to be interpreted as 0.

line_terminator : str, optional

A string to be interpreted as the line terminator. Defaults to “\n”, which will also correctly match Mac, Linux and Windows line endings (“\r”, “\n” and “\r\n” respectively).

usecols : list of str, optional

A subset of column names to output. If unspecified (default), all columns will be read. This can provide performance gains if the number of columns is large. If the input file has no headers, usecols=['X1','X3'] will read columns 1 and 3.

nrows : int, optional

If set, only this many rows will be read from the file.

skiprows : int, optional

If set, this many rows at the start of the file are skipped.

verbose : bool, optional

If True, print the progress.

Returns:
out : tuple

The first element is the SFrame containing the rows that parsed successfully. The second element is a dictionary that maps each filename to an SArray of the lines in that file that could not be parsed.

See also

read_csv, SFrame

Examples

>>> bad_url = 'https://static.turi.com/datasets/bad_csv_example.csv'
>>> (sf, bad_lines) = turicreate.SFrame.read_csv_with_errors(bad_url)
>>> sf
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[98 rows x 3 columns]
>>> bad_lines
{'https://static.turi.com/datasets/bad_csv_example.csv': dtype: str
 Rows: 1
 ['x,y,z,a,b,c']}
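
As a follow-up to the example above, the returned dictionary can be reduced to a per-file count of unparsed lines. The sketch below is a minimal illustration: summarize_bad_lines is a hypothetical helper, and a plain list stands in for the SArray so that the code runs without turicreate installed. It relies only on len(), which SArray also supports.

```python
def summarize_bad_lines(bad_lines):
    """Return {filename: number of lines that failed to parse}.

    Hypothetical helper; accepts any mapping of filename to a sized
    collection of raw lines.
    """
    return {filename: len(lines) for filename, lines in bad_lines.items()}


# Stand-in for the second element of the returned tuple. A plain list is
# used here in place of an SArray.
example_bad_lines = {"bad_csv_example.csv": ["x,y,z,a,b,c"]}

summary = summarize_bad_lines(example_bad_lines)
print(summary)  # {'bad_csv_example.csv': 1}
```

A count of zero for every file would indicate that all lines were parsed into the SFrame.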