Turi Create  4.0
turi::csv_line_tokenizer Struct Reference

#include <core/storage/sframe_data/csv_line_tokenizer.hpp>

Public Member Functions

 csv_line_tokenizer ()
 
void init ()
 
bool tokenize_line (const char *str, size_t len, std::vector< std::string > &output)
 
bool tokenize_line (const char *str, size_t len, std::function< bool(std::string &, size_t)> fn)
 
size_t tokenize_line (char *str, size_t len, std::vector< flexible_type > &output, bool permit_undefined, const std::vector< size_t > *output_order=nullptr)
 
bool parse_as (char **buf, size_t len, const char *raw, size_t rawlen, flexible_type &out, bool recursive_parse=false)
 
const std::string & get_last_parse_error_diagnosis () const
 

Public Attributes

bool preserve_quoting = false
 
bool use_escape_char = true
 
char escape_char = '\\'
 
bool skip_initial_space = true
 
std::string delimiter = ","
 
std::string line_terminator = "\n"
 
char comment_char = '#'
 
bool has_comment_char = true
 
bool double_quote = false
 
char quote_char = '\"'
 
std::vector< std::string > na_values
 
std::unordered_set< std::string > true_values
 
std::unordered_set< std::string > false_values
 
bool only_raw_string_substitutions = false
 

Detailed Description

CSV Line Tokenizer.

To use, simply set the appropriate options inside the struct, and use one of the tokenize_line functions to parse a line inside a CSV file.

Note
This parser currently handles only the case where each row of the CSV is on one line. That is not always the case: Pandas in particular permits line breaks inside quoted strings and vectors, which is problematic for this parser.

Definition at line 38 of file csv_line_tokenizer.hpp.

Constructor & Destructor Documentation

◆ csv_line_tokenizer()

turi::csv_line_tokenizer::csv_line_tokenizer ( )

Constructor. Does nothing but set up internal buffers.

Member Function Documentation

◆ get_last_parse_error_diagnosis()

const std::string& turi::csv_line_tokenizer::get_last_parse_error_diagnosis ( ) const

Returns a printable string describing the last parse error. The string is filled only when tokenize_line fails and is not cleared when tokenize_line succeeds, so it should not be used to detect parse errors.

◆ init()

void turi::csv_line_tokenizer::init ( )

To be called before any parsing function is used. Initializes the Spirit parser.

◆ parse_as()

bool turi::csv_line_tokenizer::parse_as ( char **  buf,
size_t  len,
const char *  raw,
size_t  rawlen,
flexible_type &  out,
bool  recursive_parse = false 
)

Parses the contents of buf into a flexible_type. The target type is determined by the type of the out variable.

If recursive_parse is set to true, values that parse as strings are reparsed. This allows, for instance, the quoted element "123" to be parsed as an integer instead of a string.

If recursive_parse is true, the contents of the buffer may be modified (the buffer itself is used to maintain the recursive parse state)
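
The effect of recursive_parse can be illustrated with a small standalone sketch. It does not call parse_as or use flexible_type: try_parse_integer and parse_integer_field are hypothetical helpers, and the quote handling is an assumption made only to show how a quoted "123" can be reparsed as an integer.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical helper: true only if all of `text` parses as an integer
// with nothing left over.
bool try_parse_integer(const std::string& text, long long& out) {
  try {
    std::size_t used = 0;
    out = std::stoll(text, &used);
    return used == text.size();
  } catch (const std::exception&) {
    return false;  // not a number, or out of range
  }
}

// Hypothetical stand-in for the recursive_parse idea: if the field is not
// directly an integer but is a quoted string, strip the quotes and retry.
bool parse_integer_field(const std::string& text, bool recursive_parse,
                         long long& out) {
  if (try_parse_integer(text, out)) return true;
  const bool quoted =
      text.size() >= 2 && text.front() == '"' && text.back() == '"';
  if (recursive_parse && quoted) {
    return try_parse_integer(text.substr(1, text.size() - 2), out);
  }
  return false;
}
```

In this sketch, the field "123" (with quotes) is accepted only when recursive_parse is true; an unquoted 123 is accepted either way.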

◆ tokenize_line() [1/3]

bool turi::csv_line_tokenizer::tokenize_line ( const char *  str,
size_t  len,
std::vector< std::string > &  output 
)

Tokenize a single CSV line into separate fields. The output vector will be cleared, and each field will be inserted into the output vector. Returns true on success and false on failure.

Parameters
str     Pointer to string to tokenize. Contents of string may be modified.
len     Length of string to tokenize
output  Output vector which will contain the result
Returns
true on success, false on failure.

◆ tokenize_line() [2/3]

bool turi::csv_line_tokenizer::tokenize_line ( const char *  str,
size_t  len,
std::function< bool(std::string &, size_t)>  fn 
)

Tokenize a single CSV line into separate fields, calling a callback for each parsed token.

The function is of the form:

bool receive_token(std::string& buffer, size_t len) {
  // add the first len bytes of the buffer as the parsed token
  // return true on success and false on failure.
  // if this function returns false, the tokenize_line call will also
  // return false
  // The buffer may be modified
  return true;
}

For instance, to insert the parsed tokens into an output vector, the following code could be used:

return tokenize_line(str, len,
                     [&](std::string& buf, size_t token_len)->bool {
                       output.emplace_back(buf.data(), token_len);
                       return true;
                     });
Parameters
str  Pointer to line to tokenize. Contents of string may be modified.
len  Length of line to tokenize
fn   Callback function which is called on every token
Returns
true on success, false on failure.
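
The callback contract can be tried out in isolation with the following sketch. It does not use Turi Create: for_each_field and collect_fields are hypothetical helpers that split on a single-character delimiter with no quoting, escaping, or comment handling. Only the bool(std::string&, size_t) callback shape and the rule that a false return aborts tokenization are taken from the description above.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the callback overload: splits `line` on `delim`
// and hands each field to `fn`. Returns false as soon as `fn` returns false.
bool for_each_field(
    const std::string& line, char delim,
    const std::function<bool(std::string&, std::size_t)>& fn) {
  std::size_t start = 0;
  while (true) {
    std::size_t end = line.find(delim, start);
    if (end == std::string::npos) end = line.size();
    std::string token = line.substr(start, end - start);
    if (!fn(token, token.size())) return false;  // propagate callback failure
    if (end == line.size()) break;               // last field consumed
    start = end + 1;
  }
  return true;
}

// Collects every field into a vector through the callback.
std::vector<std::string> collect_fields(const std::string& line, char delim) {
  std::vector<std::string> output;
  for_each_field(line, delim, [&](std::string& buf, std::size_t len) -> bool {
    output.emplace_back(buf.data(), len);
    return true;
  });
  return output;
}
```

A callback that returns false partway through the line stops the loop immediately, which is how a caller can reject a row without processing the remaining fields.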

◆ tokenize_line() [3/3]

size_t turi::csv_line_tokenizer::tokenize_line ( char *  str,
size_t  len,
std::vector< flexible_type > &  output,
bool  permit_undefined,
const std::vector< size_t > *  output_order = nullptr 
)

Tokenizes a line directly into array of flexible_type and type specifiers. This version of tokenize line is strict, requiring that the length of the output vector matches up exactly with the number of columns, and the types of the flexible_type be fully specified.

For instance: If my input line is

    1, hello world, 2.0

then output vector must have 3 elements.

If the types of the 3 elements in the output vector are: [flex_type_enum::INTEGER, flex_type_enum::STRING, flex_type_enum::FLOAT] then, they will be parsed as such emitting an output of [1, "hello world", 2.0].

However, if the types of the 3 elements in the output vector are: [flex_type_enum::STRING, flex_type_enum::STRING, flex_type_enum::STRING] then the output will be ["1", "hello world", "2.0"].

Type interpretation failures will produce an error. For instance if the types are [flex_type_enum::STRING, flex_type_enum::INTEGER, flex_type_enum::STRING], since the second element cannot be correctly interpreted as an integer, the tokenization will fail.

The types currently supported are:

The tokenizer will not modify the types of the output vector. However, if permit_undefined is specified, the output type can be set to flex_type_enum::UNDEFINED for an empty non-string field. For instance:

If my input line is

    1, , 2.0

If I have type specifiers [flex_type_enum::INTEGER, flex_type_enum::STRING, flex_type_enum::FLOAT], this will be parsed as [1, "", 2.0] regardless of permit_undefined.

However, if I have type specifiers [flex_type_enum::INTEGER, flex_type_enum::INTEGER, flex_type_enum::FLOAT] and permit_undefined == false, this will be parsed as [1, 0, 2.0].

And if I have type specifiers [flex_type_enum::INTEGER, flex_type_enum::INTEGER, flex_type_enum::FLOAT] and permit_undefined == true, this will be parsed as [1, UNDEFINED, 2.0].

Parameters
str               Pointer to line to tokenize
len               Length of line to tokenize
output            The output vector which is of the same length as the number of columns, and has all the types specified.
permit_undefined  Allows output vector to repr
output_order      A pointer to an array of the same length as the output. Essentially, column 'i' will be written to output_order[i]. If output_order[i] == -1, the column is ignored. If output_order == nullptr, this is equivalent to having output_order[i] == i
Returns
the number of output entries filled.
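
The typed behaviour described above can be modelled with a small standalone sketch. It does not use flexible_type or the real tokenizer: FieldType, Value, and fill_typed are hypothetical, the input is assumed to be already split into fields, and returning the count filled so far on a type failure is an assumption of the sketch rather than documented behaviour.

```cpp
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

enum class FieldType { INTEGER, FLOAT, STRING, UNDEFINED };

// Hypothetical stand-in for a typed cell; `type` is set by the caller.
struct Value {
  FieldType type = FieldType::STRING;
  long long i = 0;
  double d = 0.0;
  std::string s;
};

bool parse_integer(const std::string& text, long long& out) {
  try {
    std::size_t used = 0;
    out = std::stoll(text, &used);
    return used == text.size();
  } catch (const std::exception&) {
    return false;
  }
}

bool parse_float(const std::string& text, double& out) {
  try {
    std::size_t used = 0;
    out = std::stod(text, &used);
    return used == text.size();
  } catch (const std::exception&) {
    return false;
  }
}

// Fills `output` from pre-split `fields` using the types already stored in
// `output`. Returns the number of entries filled; a value smaller than the
// column count signals a failure in this sketch.
std::size_t fill_typed(const std::vector<std::string>& fields,
                       std::vector<Value>& output, bool permit_undefined) {
  if (fields.size() != output.size()) return 0;  // strict column-count match
  std::size_t filled = 0;
  for (std::size_t k = 0; k < fields.size(); ++k) {
    const std::string& text = fields[k];
    Value& v = output[k];
    if (v.type == FieldType::STRING) {
      v.s = text;  // strings are taken as-is, even when empty
    } else if (text.empty()) {
      // Empty non-string field: UNDEFINED if permitted, else a zero value.
      if (permit_undefined) v.type = FieldType::UNDEFINED;
      else { v.i = 0; v.d = 0.0; }
    } else if (v.type == FieldType::INTEGER) {
      if (!parse_integer(text, v.i)) return filled;
    } else if (v.type == FieldType::FLOAT) {
      if (!parse_float(text, v.d)) return filled;
    } else {
      return filled;  // type was not fully specified
    }
    ++filled;
  }
  return filled;
}
```

With fields {"1", "", "2.0"} and types [INTEGER, INTEGER, FLOAT], the second entry becomes UNDEFINED when permit_undefined is true and integer 0 otherwise, matching the cases listed above.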

Member Data Documentation

◆ comment_char

char turi::csv_line_tokenizer::comment_char = '#'

The character used to begin a comment (Default '#'). An occurrence of this character outside of quoted strings will cause the parser to ignore the remainder of the line.

    # this is a
    # comment
    user,name,rating
    123,hello,45
    312,chu, 21
    333,zzz, 3 # this is also a comment
    444,aaa, 51

Definition at line 95 of file csv_line_tokenizer.hpp.
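
The comment rule can be sketched as a single scan that tracks whether the current position is inside a quoted string. strip_comment below is a hypothetical standalone helper, not part of the tokenizer, and it ignores escape characters for brevity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns `line` truncated at the first comment_char
// that appears outside a quoted string. Escape characters are not handled.
std::string strip_comment(const std::string& line, char comment_char = '#',
                          char quote_char = '"') {
  bool in_quotes = false;
  for (std::size_t k = 0; k < line.size(); ++k) {
    if (line[k] == quote_char) {
      in_quotes = !in_quotes;  // toggle on every quote character
    } else if (!in_quotes && line[k] == comment_char) {
      return line.substr(0, k);  // drop the comment and everything after it
    }
  }
  return line;
}
```

Applied to the rows above, the two leading comment lines become empty and the trailing comment on the 333 row is removed, while a '#' inside a quoted field is kept.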

◆ delimiter

std::string turi::csv_line_tokenizer::delimiter = ","

The delimiter string used to separate fields (Default ",")

Definition at line 71 of file csv_line_tokenizer.hpp.

◆ double_quote

bool turi::csv_line_tokenizer::double_quote = false

If set to true, pairs of quote characters in a quoted string are interpreted as a single quote (Default false). For instance, if set to true, the 2nd field of the 2nd line is read as "hello "world""

    user, message
    123, "hello ""world"""

Definition at line 112 of file csv_line_tokenizer.hpp.
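
A minimal sketch of the doubled-quote rule, assuming the field has already been isolated and still carries its outer quotes. unquote is a hypothetical helper, not the tokenizer's code; with double_quote set to false it simply leaves the inner characters untouched, which is a simplification and not a statement about how the real parser behaves in that case.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strips the outer quotes of a quoted field and, when
// double_quote is true, collapses each pair of quote characters into one.
std::string unquote(const std::string& field, bool double_quote,
                    char quote_char = '"') {
  if (field.size() < 2 || field.front() != quote_char ||
      field.back() != quote_char) {
    return field;  // not a quoted field, return unchanged
  }
  std::string out;
  for (std::size_t k = 1; k + 1 < field.size(); ++k) {
    out.push_back(field[k]);
    // If this quote is followed by another quote (that is not the closing
    // quote), skip the second one.
    if (double_quote && field[k] == quote_char && k + 2 < field.size() &&
        field[k + 1] == quote_char) {
      ++k;
    }
  }
  return out;
}
```

For the second field of the example above, the sketch yields hello "world" when double_quote is true.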

◆ escape_char

char turi::csv_line_tokenizer::escape_char = '\\'

The character to use to identify the beginning of a C escape sequence (Default '\'). For example, "\n" will be converted to the newline character, "\\" will be converted to "\", etc. Note that only the single-character escapes are converted. Unicode, octal, and hexadecimal escapes are not interpreted.

Definition at line 60 of file csv_line_tokenizer.hpp.
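
The following standalone sketch shows single-character escape conversion. unescape is a hypothetical helper, and the exact set of escapes it recognises (n, t, r, the escape character itself, and the double quote) is an assumption; the set handled by the real tokenizer is not listed here. Any other sequence, including hexadecimal forms, is passed through unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts a few single-character escapes and leaves
// every other escape sequence exactly as written.
std::string unescape(const std::string& text, char escape_char = '\\') {
  std::string out;
  for (std::size_t k = 0; k < text.size(); ++k) {
    if (text[k] != escape_char || k + 1 == text.size()) {
      out.push_back(text[k]);  // ordinary character or trailing escape char
      continue;
    }
    const char next = text[++k];
    if (next == 'n') out.push_back('\n');
    else if (next == 't') out.push_back('\t');
    else if (next == 'r') out.push_back('\r');
    else if (next == escape_char) out.push_back(escape_char);
    else if (next == '"') out.push_back('"');
    else {
      // Not a recognised single-character escape: keep both characters.
      out.push_back(escape_char);
      out.push_back(next);
    }
  }
  return out;
}
```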

◆ false_values

std::unordered_set<std::string> turi::csv_line_tokenizer::false_values

String values which map to numeric 0.

Definition at line 134 of file csv_line_tokenizer.hpp.

◆ has_comment_char

bool turi::csv_line_tokenizer::has_comment_char = true

Whether comment_char is used.

Definition at line 100 of file csv_line_tokenizer.hpp.

◆ line_terminator

std::string turi::csv_line_tokenizer::line_terminator = "\n"

The string to use to separate lines. Defaults to "\n". Setting the new line string to "\n" has special effects in that it causes "\r", "\r\n" and "\n" to be all interpreted as new lines.

Definition at line 79 of file csv_line_tokenizer.hpp.
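
The special handling of "\n" can be illustrated with a sketch that treats "\r", "\r\n", and "\n" as equivalent line endings. split_lines is a hypothetical helper for illustration only; it ignores quoting and is not how the library reads files.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into lines, accepting "\r", "\r\n",
// and "\n" as terminators. A final line without a terminator is kept.
std::vector<std::string> split_lines(const std::string& text) {
  std::vector<std::string> lines;
  std::string current;
  for (std::size_t k = 0; k < text.size(); ++k) {
    const char c = text[k];
    if (c == '\n' || c == '\r') {
      // Treat "\r\n" as one terminator by consuming the '\n' as well.
      if (c == '\r' && k + 1 < text.size() && text[k + 1] == '\n') ++k;
      lines.push_back(current);
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  if (!current.empty()) lines.push_back(current);
  return lines;
}
```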

◆ na_values

std::vector<std::string> turi::csv_line_tokenizer::na_values

The strings which will be parsed as missing values.

(also see empty_string_in_na_values)

Definition at line 124 of file csv_line_tokenizer.hpp.

◆ only_raw_string_substitutions

bool turi::csv_line_tokenizer::only_raw_string_substitutions = false

If this is set (defaults to false), then the true/false/na substitutions are only permitted on raw unparsed strings; that is strings before dequoting, de-escaping, etc.

Definition at line 141 of file csv_line_tokenizer.hpp.
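
One way to read this flag is sketched below: substitutions are first tried on the raw field text, and only when the flag is off are they retried after removing the surrounding quotes. This is an interpretation, not the library's code. SubstitutionTables, lookup, and classify are hypothetical, the precedence of na over true over false is an assumption, and de-escaping is left out.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

enum class Substitution { NoMatch, TrueValue, FalseValue, Missing };

// Hypothetical container mirroring the three substitution attributes.
struct SubstitutionTables {
  std::unordered_set<std::string> true_values;
  std::unordered_set<std::string> false_values;
  std::vector<std::string> na_values;
};

// Looks `text` up in the tables; the lookup order is an assumption.
Substitution lookup(const std::string& text, const SubstitutionTables& t) {
  if (std::find(t.na_values.begin(), t.na_values.end(), text) !=
      t.na_values.end()) {
    return Substitution::Missing;
  }
  if (t.true_values.count(text)) return Substitution::TrueValue;
  if (t.false_values.count(text)) return Substitution::FalseValue;
  return Substitution::NoMatch;
}

// Tries the raw text first; retries on the dequoted text only when
// only_raw_string_substitutions is false.
Substitution classify(const std::string& raw, const SubstitutionTables& t,
                      bool only_raw_string_substitutions) {
  const Substitution direct = lookup(raw, t);
  if (direct != Substitution::NoMatch || only_raw_string_substitutions) {
    return direct;
  }
  if (raw.size() >= 2 && raw.front() == '"' && raw.back() == '"') {
    return lookup(raw.substr(1, raw.size() - 2), t);
  }
  return Substitution::NoMatch;
}
```

Under this reading, with "NA" in na_values, an unquoted NA is always treated as missing, while a quoted "NA" is treated as missing only when the flag is false.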

◆ preserve_quoting

bool turi::csv_line_tokenizer::preserve_quoting = false

If set to true, quotes inside a field will be preserved (Default false). For example, if set to true, the 2nd entry in the following row will be read as "hello world", including the quote characters, rather than as hello world.

    1,"hello world",5

Definition at line 47 of file csv_line_tokenizer.hpp.

◆ quote_char

char turi::csv_line_tokenizer::quote_char = '\"'

The quote character to use (Default '"')

Definition at line 117 of file csv_line_tokenizer.hpp.

◆ skip_initial_space

bool turi::csv_line_tokenizer::skip_initial_space = true

If set to true, initial spaces before fields are ignored (Default true).

Definition at line 65 of file csv_line_tokenizer.hpp.

◆ true_values

std::unordered_set<std::string> turi::csv_line_tokenizer::true_values

String values which map to numeric 1.

Definition at line 129 of file csv_line_tokenizer.hpp.

◆ use_escape_char

bool turi::csv_line_tokenizer::use_escape_char = true

Whether escape_char is used.

Definition at line 52 of file csv_line_tokenizer.hpp.

