#include <core/storage/sframe_data/csv_line_tokenizer.hpp>

Public Member Functions
	csv_line_tokenizer ()

void	init ()

bool	tokenize_line (const char *str, size_t len, std::vector< std::string > &output)

bool	tokenize_line (const char *str, size_t len, std::function< bool(std::string &, size_t)> fn)

size_t	tokenize_line (char *str, size_t len, std::vector< flexible_type > &output, bool permit_undefined, const std::vector< size_t > *output_order=nullptr)

bool	parse_as (char **buf, size_t len, const char *raw, size_t rawlen, flexible_type &out, bool recursive_parse=false)

const std::string &	get_last_parse_error_diagnosis () const

Public Attributes
bool	preserve_quoting = false

bool	use_escape_char = true

char	escape_char = '\\'

bool	skip_initial_space = true

std::string	delimiter = ","

std::string	line_terminator = "\n"

char	comment_char = '#'

bool	has_comment_char = true

bool	double_quote = false

char	quote_char = '\"'

std::vector< std::string >	na_values

std::unordered_set< std::string >	true_values

std::unordered_set< std::string >	false_values

bool	only_raw_string_substitutions = false

Detailed Description

CSV Line Tokenizer.

To use, simply set the appropriate options inside the struct, and use one of the tokenize_line functions to parse a line inside a CSV file.

Note: This parser currently handles only the case where each row of the CSV is on a single line. That assumption does not always hold. Pandas in particular permits line breaks inside quoted strings and vectors, and such input is quite problematic for this parser.

Definition at line 38 of file csv_line_tokenizer.hpp.

Constructor & Destructor Documentation

◆ csv_line_tokenizer()

turi::csv_line_tokenizer::csv_line_tokenizer ( )

Constructor. Does nothing but set up internal buffers.

Member Function Documentation

◆ get_last_parse_error_diagnosis()

const std::string& turi::csv_line_tokenizer::get_last_parse_error_diagnosis ( ) const

Returns a printable string describing the parse error. This is only filled when tokenize_line fails. The string is not cleared when tokenize_line succeeds, so it should not be used for flagging parse errors.

◆ init()

void turi::csv_line_tokenizer::init ( )

Call this before using any parsing function. Initializes the Spirit parser.

◆ parse_as()

bool turi::csv_line_tokenizer::parse_as	(	char **	buf,
		size_t	len,
		const char *	raw,
		size_t	rawlen,
		flexible_type &	out,
		bool	recursive_parse = `false`
	)

Parse the buf content into flexible_type. The type of the flexible_type is determined by the out variable.

If recursive_parse is set to true, values that parse to strings are parsed again. This allows, for instance, the quoted element "123" to be parsed as an integer instead of a string.

If recursive_parse is true, the contents of the buffer may be modified (the buffer itself is used to maintain the recursive parse state).
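
The effect of recursive_parse on a quoted integer can be illustrated with a small standalone sketch. It does not use the library's parser or flexible_type; try_parse_int and parse_int_field are hypothetical helpers invented for this illustration, and the rule "a quoted field is a string on the first pass" is an assumption based on the description above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: reads `text` as a base-10 integer and succeeds only
// if the whole string is consumed. `value` is left untouched on failure.
bool try_parse_int(const std::string& text, long long& value) {
  if (text.empty()) return false;
  char* end = nullptr;
  long long parsed = std::strtoll(text.c_str(), &end, 10);
  if (*end != '\0') return false;
  value = parsed;
  return true;
}

// Hypothetical helper: a quoted field is treated as a string on the first
// pass, so it only yields an integer when recursive_parse asks for the
// unquoted contents to be parsed again.
bool parse_int_field(const std::string& field, bool recursive_parse,
                     long long& value) {
  bool quoted = field.size() >= 2 && field.front() == '"' && field.back() == '"';
  if (!quoted) return try_parse_int(field, value);
  if (!recursive_parse) return false;
  return try_parse_int(field.substr(1, field.size() - 2), value);
}
```

With this sketch, the field "123" (including its quotes) is rejected as an integer when recursive_parse is false and accepted when it is true.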

◆ tokenize_line() [1/3]

bool turi::csv_line_tokenizer::tokenize_line	(	const char *	str,
		size_t	len,
		std::vector< std::string > &	output
	)

Tokenize a single CSV line into separate fields. The output vector will be cleared, and each field will be inserted into the output vector. Returns true on success and false on failure.

Parameters

str	Pointer to string to tokenize. Contents of string may be modified.
len	Length of string to tokenize
output	Output vector which will contain the result

Returns: true on success, false on failure.

◆ tokenize_line() [2/3]

bool turi::csv_line_tokenizer::tokenize_line	(	const char *	str,
		size_t	len,
		std::function< bool(std::string &, size_t)>	fn
	)

Tokenize a single CSV line into separate fields, calling a callback for each parsed token.

The function is of the form:

bool receive_token(std::string& buffer, size_t len) {
  // add the first len bytes of the buffer as the parsed token
  // return true on success and false on failure.
  // if this function returns false, the tokenize_line call will also
  // return false
  // The buffer may be modified
}

For instance, to insert the parsed tokens into an output vector, the following code could be used:

return tokenize_line(str, len,
                [&](std::string& buf, size_t token_len)->bool {
                  output.emplace_back(buf.data(), token_len);
                  return true;
                });

Parameters

str	Pointer to line to tokenize. Contents of string may be modified.
len	Length of line to tokenize
fn	Callback function which is called on every token

Returns: true on success, false on failure.
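
The callback contract can be tried out without the library using the standalone sketch below. split_with_callback and collect_tokens are hypothetical names invented here; the splitter only cuts on a single delimiter character and does none of the quoting, escaping or comment handling that the real tokenizer performs.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the callback overload: splits `line` on one
// delimiter character and hands each field to `fn`. Stops and returns false
// as soon as the callback reports failure.
bool split_with_callback(const std::string& line, char delimiter,
                         std::function<bool(std::string&, std::size_t)> fn) {
  std::string field;
  for (std::size_t i = 0; i <= line.size(); ++i) {
    if (i == line.size() || line[i] == delimiter) {
      // The callback receives the buffer and the number of valid bytes.
      if (!fn(field, field.size())) return false;
      field.clear();
    } else {
      field.push_back(line[i]);
    }
  }
  return true;
}

// Collects every token into `output`, in the same spirit as the lambda
// example above.
bool collect_tokens(const std::string& line, std::vector<std::string>& output) {
  output.clear();
  return split_with_callback(
      line, ',', [&](std::string& buf, std::size_t len) -> bool {
        output.emplace_back(buf.data(), len);
        return true;
      });
}
```

Passing the length separately, rather than relying on the buffer's own size, follows the callback form described above, where only the first len bytes of the buffer belong to the token.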

◆ tokenize_line() [3/3]

size_t turi::csv_line_tokenizer::tokenize_line	(	char *	str,
		size_t	len,
		std::vector< flexible_type > &	output,
		bool	permit_undefined,
		const std::vector< size_t > *	output_order = `nullptr`
	)

Tokenizes a line directly into array of flexible_type and type specifiers. This version of tokenize line is strict, requiring that the length of the output vector matches up exactly with the number of columns, and the types of the flexible_type be fully specified.

For instance, if the input line is

    1, hello world, 2.0

then the output vector must have 3 elements.

If the types of the 3 elements in the output vector are: [flex_type_enum::INTEGER, flex_type_enum::STRING, flex_type_enum::FLOAT] then, they will be parsed as such emitting an output of [1, "hello world", 2.0].

However, if the types of the 3 elements in the output vector are: [flex_type_enum::STRING, flex_type_enum::STRING, flex_type_enum::STRING] then the output will be ["1", "hello world", "2.0"].

Type interpretation failures will produce an error. For instance if the types are [flex_type_enum::STRING, flex_type_enum::INTEGER, flex_type_enum::STRING], since the second element cannot be correctly interpreted as an integer, the tokenization will fail.

The types currently supported are:

flex_type_enum::INTEGER
flex_type_enum::FLOAT
flex_type_enum::STRING
flex_type_enum::VECTOR (a vector of numbers specified like [1 2 3] but allowing separators to be spaces, commas(,) or semicolons(;). The separator should not match the CSV separator since the parsers are independent)

The tokenizer will not modify the types of the output vector. However, if permit_undefined is specified, the output type can be set to flex_type_enum::UNDEFINED for an empty non-string field. For instance:

If the input line is

    1, , 2.0

and the type specifiers are [flex_type_enum::INTEGER, flex_type_enum::STRING, flex_type_enum::FLOAT], this will be parsed as [1, "", 2.0] regardless of permit_undefined.

However, if the type specifiers are [flex_type_enum::INTEGER, flex_type_enum::INTEGER, flex_type_enum::FLOAT] and permit_undefined == false, this will be parsed as [1, 0, 2.0].

And if the type specifiers are [flex_type_enum::INTEGER, flex_type_enum::INTEGER, flex_type_enum::FLOAT] and permit_undefined == true, this will be parsed as [1, UNDEFINED, 2.0].
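
The typing rules above can be sketched in a few lines of standalone C++. This is not the library's implementation: cell, cell_type, parse_cell and tokenize_typed are hypothetical stand-ins for flexible_type and the strict overload, the splitter only cuts on commas, and returning 0 on any failure is an assumption made for this sketch.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical, much-reduced stand-in for flexible_type: a tagged value.
enum class cell_type { INTEGER, FLOAT, STRING, UNDEFINED };

struct cell {
  cell_type type = cell_type::UNDEFINED;
  long long i = 0;
  double f = 0.0;
  std::string s;
};

// Parses one field according to the type already stored in `out`.
// Returns false when the text cannot be interpreted as that type.
bool parse_cell(const std::string& raw, cell& out, bool permit_undefined) {
  // Drop leading spaces, as skip_initial_space would.
  std::size_t begin = raw.find_first_not_of(' ');
  std::string text = (begin == std::string::npos) ? std::string() : raw.substr(begin);
  if (out.type == cell_type::STRING) {
    out.s = text;  // an empty string field stays a string
    return true;
  }
  if (out.type == cell_type::UNDEFINED) return false;  // type must be preset
  if (text.empty()) {
    if (permit_undefined) {
      out.type = cell_type::UNDEFINED;  // empty non-string field
    } else {
      out.i = 0;
      out.f = 0.0;
    }
    return true;
  }
  char* end = nullptr;
  if (out.type == cell_type::INTEGER) {
    out.i = std::strtoll(text.c_str(), &end, 10);
  } else {
    out.f = std::strtod(text.c_str(), &end);
  }
  return *end == '\0';  // reject trailing garbage such as "12abc"
}

// Splits `line` on commas and fills `output`, whose element types are preset.
// Returns the number of entries filled, or 0 (an assumption of this sketch)
// when the column count differs or a field fails to parse.
std::size_t tokenize_typed(const std::string& line, std::vector<cell>& output,
                           bool permit_undefined) {
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (true) {
    std::size_t pos = line.find(',', start);
    if (pos == std::string::npos) {
      fields.push_back(line.substr(start));
      break;
    }
    fields.push_back(line.substr(start, pos - start));
    start = pos + 1;
  }
  if (fields.size() != output.size()) return 0;
  for (std::size_t k = 0; k < fields.size(); ++k) {
    if (!parse_cell(fields[k], output[k], permit_undefined)) return 0;
  }
  return output.size();
}
```

Presetting the types in the output vector, rather than passing them separately, mirrors the way the strict overload reads the expected type of each column from the output elements themselves.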

Parameters

str	Pointer to line to tokenize
len	Length of line to tokenize
output	The output vector which is of the same length as the number of columns, and has all the types specified.
permit_undefined	If true, an empty non-string field may be written to the output vector as flex_type_enum::UNDEFINED.
output_order	A pointer to an array of the same length as the output. Essentially, column 'i' will be written to output[output_order[i]]. If output_order[i] == -1, the column is ignored. If output_order == nullptr, this is equivalent to having output_order[i] == i.

Returns: the number of output entries filled.
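
The output_order mapping can be pictured with the standalone sketch below. scatter_columns and SKIP_COLUMN are hypothetical names invented for this illustration, strings stand in for flexible_type, and silently skipping an out-of-range target index is an assumption of the sketch rather than documented behaviour.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sentinel for "ignore this column", i.e. the size_t value of -1.
const std::size_t SKIP_COLUMN = static_cast<std::size_t>(-1);

// Writes column i of `columns` to output[(*output_order)[i]].
// A null `output_order` is treated as the identity mapping.
// Returns the number of output entries written.
std::size_t scatter_columns(const std::vector<std::string>& columns,
                            std::vector<std::string>& output,
                            const std::vector<std::size_t>* output_order = nullptr) {
  std::size_t filled = 0;
  for (std::size_t i = 0; i < columns.size(); ++i) {
    std::size_t target = i;
    if (output_order != nullptr) {
      if (i >= output_order->size()) break;  // no mapping for this column
      target = (*output_order)[i];
    }
    // Skip ignored columns and, defensively, out-of-range targets.
    if (target == SKIP_COLUMN || target >= output.size()) continue;
    output[target] = columns[i];
    ++filled;
  }
  return filled;
}
```

For example, with columns {"a", "b", "c"} and the order {1, SKIP_COLUMN, 0}, the first column lands in output[1], the second is dropped, and the third lands in output[0].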

Member Data Documentation

◆ comment_char

char turi::csv_line_tokenizer::comment_char = '#'

The character used to begin a comment (Default '#'). An occurrence of this character outside of quoted strings will cause the parser to ignore the remainder of the line.

    # this is a
    # comment
    user,name,rating
    123,hello,45
    312,chu, 21
    333,zzz, 3 # this is also a comment
    444,aaa, 51
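
The "outside of quoted strings" rule can be sketched as follows. strip_comment is a hypothetical helper written for this illustration and is not part of the library; it tracks quote state by toggling on every quote character and does not handle escape sequences.

```cpp
#include <cstddef>
#include <string>

// Returns `line` with everything from the first unquoted `comment_char`
// onward removed. Escape sequences are deliberately not handled here.
std::string strip_comment(const std::string& line, char comment_char = '#',
                          char quote_char = '"') {
  bool in_quotes = false;
  for (std::size_t i = 0; i < line.size(); ++i) {
    if (line[i] == quote_char) {
      in_quotes = !in_quotes;  // entering or leaving a quoted string
    } else if (line[i] == comment_char && !in_quotes) {
      return line.substr(0, i);  // drop the rest of the line
    }
  }
  return line;
}
```

Applied to the sample above, the first two lines become empty and the trailing comment on the 333 row is cut off, while a '#' inside a quoted field is kept.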

Definition at line 95 of file csv_line_tokenizer.hpp.

◆ delimiter

std::string turi::csv_line_tokenizer::delimiter = ","

The delimiter character to use to separate fields (Default ",")

Definition at line 71 of file csv_line_tokenizer.hpp.

◆ double_quote

bool turi::csv_line_tokenizer::double_quote = false

If set to true, pairs of quote characters in a quoted string are interpreted as a single quote (Default false). For instance, if set to true, the 2nd field of the 2nd line is read as "hello "world""

    user, message
    123, "hello ""world"""
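
A minimal standalone sketch of the pairing rule is shown below. unquote_body is a hypothetical helper, not a library function, and it assumes the enclosing quotes of the field have already been removed.

```cpp
#include <cstddef>
#include <string>

// Decodes the body of a quoted field (without its enclosing quotes).
// With double_quote enabled, each pair of quote characters becomes one.
std::string unquote_body(const std::string& body, bool double_quote,
                         char quote_char = '"') {
  if (!double_quote) return body;
  std::string out;
  for (std::size_t i = 0; i < body.size(); ++i) {
    out.push_back(body[i]);
    // Skip the second quote of a pair.
    if (body[i] == quote_char && i + 1 < body.size() &&
        body[i + 1] == quote_char) {
      ++i;
    }
  }
  return out;
}
```

For the sample row above, the body of the second field contains two doubled quotes, and with double_quote enabled each of them collapses to a single quote character.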

Definition at line 112 of file csv_line_tokenizer.hpp.

◆ escape_char

char turi::csv_line_tokenizer::escape_char = '\\'

The character used to identify the beginning of a C escape sequence (Default '\'). For example, "\n" will be converted to the newline character, "\\" will be converted to "\", etc. Note that only the single-character escapes are converted. Unicode, octal, and hexadecimal escapes are not interpreted.
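
Single-character escape decoding can be sketched as below. decode_escapes is a hypothetical helper; the particular set of escapes it recognizes (n, t, r, the quote character and the escape character itself) is an assumption made for this illustration and may differ from the set the library converts.

```cpp
#include <cstddef>
#include <string>

// Converts single-character escape sequences in `in`. Anything else that
// follows the escape character (for example hexadecimal or octal forms) is
// copied through unchanged.
std::string decode_escapes(const std::string& in, char escape_char = '\\') {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] != escape_char || i + 1 == in.size()) {
      out.push_back(in[i]);  // ordinary character, or a trailing escape char
      continue;
    }
    char next = in[++i];
    if (next == escape_char) {
      out.push_back(escape_char);
    } else if (next == 'n') {
      out.push_back('\n');
    } else if (next == 't') {
      out.push_back('\t');
    } else if (next == 'r') {
      out.push_back('\r');
    } else if (next == '"') {
      out.push_back('"');
    } else {
      // Not a single-character escape: leave the sequence as written.
      out.push_back(escape_char);
      out.push_back(next);
    }
  }
  return out;
}
```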

Definition at line 60 of file csv_line_tokenizer.hpp.

◆ false_values

std::unordered_set<std::string> turi::csv_line_tokenizer::false_values

String values which map to numeric 0.

Definition at line 134 of file csv_line_tokenizer.hpp.

◆ has_comment_char

bool turi::csv_line_tokenizer::has_comment_char = true

Whether comment_char is used.

Definition at line 100 of file csv_line_tokenizer.hpp.

◆ line_terminator

std::string turi::csv_line_tokenizer::line_terminator = "\n"

The string to use to separate lines. Defaults to "\n". Setting the new line string to "\n" has special effects in that it causes "\r", "\r\n" and "\n" to be all interpreted as new lines.
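
The special handling of "\n" can be illustrated with a standalone line splitter. split_lines is a hypothetical helper, not part of the tokenizer (which parses one line at a time), and it ignores the fact that the real parser would also need to respect quoting.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `text` into lines, treating "\r", "\r\n" and "\n" all as one
// line break. A trailing line break does not produce an empty last line.
std::vector<std::string> split_lines(const std::string& text) {
  std::vector<std::string> lines;
  std::string current;
  for (std::size_t i = 0; i < text.size(); ++i) {
    char c = text[i];
    if (c == '\r' || c == '\n') {
      // Consume the '\n' of a "\r\n" pair together with the '\r'.
      if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') ++i;
      lines.push_back(current);
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  if (!current.empty()) lines.push_back(current);
  return lines;
}
```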

Definition at line 79 of file csv_line_tokenizer.hpp.

◆ na_values

std::vector<std::string> turi::csv_line_tokenizer::na_values

The strings which will be parsed as missing values.

(also see empty_string_in_na_values)

Definition at line 124 of file csv_line_tokenizer.hpp.

◆ only_raw_string_substitutions

bool turi::csv_line_tokenizer::only_raw_string_substitutions = false

If this is set (defaults to false), then the true/false/na substitutions are only permitted on raw unparsed strings; that is strings before dequoting, de-escaping, etc.
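
One reading of this flag is sketched below: when it is false, a substitution may match either the raw field or its dequoted form, and when it is true only the raw field is considered. The substitutions struct, the subst enum, and the dequote and classify helpers are all hypothetical and exist only for this illustration; the library's actual matching order may differ.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical bundle of the relevant options.
struct substitutions {
  std::unordered_set<std::string> true_values;
  std::unordered_set<std::string> false_values;
  std::vector<std::string> na_values;
  bool only_raw_string_substitutions = false;
};

enum class subst { none, true_value, false_value, missing };

// Removes one pair of enclosing double quotes, if present.
std::string dequote(const std::string& field) {
  if (field.size() >= 2 && field.front() == '"' && field.back() == '"') {
    return field.substr(1, field.size() - 2);
  }
  return field;
}

// Decides which substitution, if any, applies to `raw`.
subst classify(const substitutions& opts, const std::string& raw) {
  std::vector<std::string> candidates;
  candidates.push_back(raw);
  // The dequoted form is only a candidate when raw-only matching is off.
  if (!opts.only_raw_string_substitutions) candidates.push_back(dequote(raw));
  for (const std::string& c : candidates) {
    if (opts.true_values.count(c)) return subst::true_value;
    if (opts.false_values.count(c)) return subst::false_value;
    if (std::find(opts.na_values.begin(), opts.na_values.end(), c) !=
        opts.na_values.end()) {
      return subst::missing;
    }
  }
  return subst::none;
}
```

Under this reading, a quoted "NA" counts as missing only while the flag is false, which keeps quoted text literal once raw-only matching is requested.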

Definition at line 141 of file csv_line_tokenizer.hpp.

◆ preserve_quoting

bool turi::csv_line_tokenizer::preserve_quoting = false

If set to true, quotes inside a field will be preserved (Default false). i.e. if set to true, the 2nd entry in the following row will be read as ""hello world"" with the quote characters.

    1,"hello world",5
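
The difference between the two settings can be sketched with a small standalone helper. read_quoted_field is a hypothetical name, and the sketch only looks at one pair of enclosing quote characters around an already isolated field.

```cpp
#include <string>

// Returns the field as it would be read: with its enclosing quotes kept
// when preserve_quoting is true, and stripped otherwise.
std::string read_quoted_field(const std::string& field, bool preserve_quoting,
                              char quote_char = '"') {
  bool quoted = field.size() >= 2 && field.front() == quote_char &&
                field.back() == quote_char;
  if (!quoted || preserve_quoting) return field;
  return field.substr(1, field.size() - 2);
}
```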

Definition at line 47 of file csv_line_tokenizer.hpp.

◆ quote_char

char turi::csv_line_tokenizer::quote_char = '\"'

The quote character to use (Default '"')

Definition at line 117 of file csv_line_tokenizer.hpp.

◆ skip_initial_space

bool turi::csv_line_tokenizer::skip_initial_space = true

If set to true, initial spaces before fields are ignored (Default true).

Definition at line 65 of file csv_line_tokenizer.hpp.

◆ true_values

std::unordered_set<std::string> turi::csv_line_tokenizer::true_values

String values which map to numeric 1.

Definition at line 129 of file csv_line_tokenizer.hpp.

◆ use_escape_char

bool turi::csv_line_tokenizer::use_escape_char = true

Whether escape_char is used.

Definition at line 52 of file csv_line_tokenizer.hpp.

The documentation for this struct was generated from the following file:

core/storage/sframe_data/csv_line_tokenizer.hpp
