Turi Create 4.0

turi::csv_line_tokenizer Struct Reference
#include <core/storage/sframe_data/csv_line_tokenizer.hpp>
Public Member Functions

    csv_line_tokenizer ()
    void init ()
    bool tokenize_line (const char *str, size_t len, std::vector< std::string > &output)
    bool tokenize_line (const char *str, size_t len, std::function< bool(std::string &, size_t)> fn)
    size_t tokenize_line (char *str, size_t len, std::vector< flexible_type > &output, bool permit_undefined, const std::vector< size_t > *output_order=nullptr)
    bool parse_as (char **buf, size_t len, const char *raw, size_t rawlen, flexible_type &out, bool recursive_parse=false)
    const std::string & get_last_parse_error_diagnosis () const

Public Attributes

    bool preserve_quoting = false
    bool use_escape_char = true
    char escape_char = '\\'
    bool skip_initial_space = true
    std::string delimiter = ","
    std::string line_terminator = "\n"
    char comment_char = '#'
    bool has_comment_char = true
    bool double_quote = false
    char quote_char = '\"'
    std::vector< std::string > na_values
    std::unordered_set< std::string > true_values
    std::unordered_set< std::string > false_values
    bool only_raw_string_substitutions = false
CSV Line Tokenizer.
To use, simply set the appropriate options inside the struct, and use one of the tokenize_line functions to parse a line inside a CSV file.
Definition at line 38 of file csv_line_tokenizer.hpp.
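For example, a minimal sketch of that flow (the option values and the helper function below are illustrative, not taken from this page):

    #include <cstring>
    #include <string>
    #include <vector>
    #include <core/storage/sframe_data/csv_line_tokenizer.hpp>

    // Sketch: configure a tokenizer and split one CSV line into string fields.
    std::vector<std::string> split_line(const char *line) {
      turi::csv_line_tokenizer tokenizer;
      tokenizer.delimiter = ",";             // illustrative option values
      tokenizer.na_values = {"NA", "N/A"};
      tokenizer.init();                      // initialize before tokenizing

      std::vector<std::string> fields;
      if (!tokenizer.tokenize_line(line, std::strlen(line), fields)) {
        // tokenization failed; fields may be incomplete
      }
      return fields;
    }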
turi::csv_line_tokenizer::csv_line_tokenizer ()
Constructor. Does nothing but set up internal buffers.
const std::string & turi::csv_line_tokenizer::get_last_parse_error_diagnosis () const
Returns a printable string describing the parse error. This is only filled in when tokenize_line fails. The string is not cleared when tokenize_line succeeds, so it should not be used for flagging parse errors.
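A short sketch of the intended use (assuming, per the description above, that a false return from tokenize_line signals the failure that fills this string):

    #include <cstring>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <core/storage/sframe_data/csv_line_tokenizer.hpp>

    // Sketch: report why a line failed to tokenize.
    bool try_tokenize(turi::csv_line_tokenizer &tokenizer, const char *line,
                      std::vector<std::string> &fields) {
      if (!tokenizer.tokenize_line(line, std::strlen(line), fields)) {
        std::cerr << "CSV parse error: "
                  << tokenizer.get_last_parse_error_diagnosis() << "\n";
        return false;
      }
      return true;
    }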
void turi::csv_line_tokenizer::init ()
Called before any parsing functions are used; initializes the spirit parser.
bool turi::csv_line_tokenizer::parse_as (char **buf, size_t len, const char *raw, size_t rawlen, flexible_type &out, bool recursive_parse = false)
Parse the buf content into flexible_type. The type of the flexible_type is determined by the out variable.
If recursive_parse is set to true, values which parse to strings will be reparsed if possible. This allows, for instance, the quoted element "123" to be parsed as an integer instead of a string.
If recursive_parse is true, the contents of the buffer may be modified (the buffer itself is used to maintain the recursive parse state)
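A heavily hedged sketch of a call (the use of the same buffer for buf and raw, and the flexible_type(flex_type_enum) construction, are assumptions not confirmed by this page):

    #include <cstddef>
    #include <core/storage/sframe_data/csv_line_tokenizer.hpp>

    // Sketch: parse one already-extracted field as an integer.
    turi::flexible_type parse_field_as_int(turi::csv_line_tokenizer &tokenizer,
                                           char *field, std::size_t len) {
      turi::flexible_type out(turi::flex_type_enum::INTEGER);
      char *buf = field;  // working copy pointer; may be modified during parsing
      if (!tokenizer.parse_as(&buf, len, /* raw */ field, /* rawlen */ len, out,
                              /* recursive_parse */ false)) {
        // parse failed; out may be left unset
      }
      return out;
    }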
bool turi::csv_line_tokenizer::tokenize_line (const char *str, size_t len, std::vector< std::string > &output)
Tokenize a single CSV line into separate fields. The output vector will be cleared, and each field will be inserted into the output vector. Returns true on success and false on failure.
Parameters
    str     Pointer to string to tokenize. Contents of string may be modified.
    len     Length of string to tokenize.
    output  Output vector which will contain the result.
bool turi::csv_line_tokenizer::tokenize_line (const char *str, size_t len, std::function< bool(std::string &, size_t)> fn)
Tokenize a single CSV line into separate fields, calling a callback for each parsed token.
The function is of the form:
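A sketch of the assumed shape, inferred from the std::function< bool(std::string &, size_t)> parameter type (treating the size_t argument as the field index and the bool return as "continue" is an assumption):

    // Assumed callback shape: called once per parsed field.
    bool on_token(std::string &field,   // the parsed token
                  size_t field_index);  // assumed: zero-based index of the token
    // Returning true is assumed to continue tokenizing the line.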
For instance, to insert the parsed tokens into an output vector, the following code could be used:
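A hedged sketch of such usage, under the same assumptions about the callback arguments:

    #include <cstring>
    #include <string>
    #include <vector>
    #include <core/storage/sframe_data/csv_line_tokenizer.hpp>

    // Sketch: collect every parsed field into a vector via the callback overload.
    std::vector<std::string> collect_fields(turi::csv_line_tokenizer &tokenizer,
                                            const char *line) {
      std::vector<std::string> output;
      tokenizer.tokenize_line(line, std::strlen(line),
          [&](std::string &field, size_t /*field_index*/) -> bool {
            output.push_back(std::move(field));
            return true;  // assumed: keep parsing the rest of the line
          });
      return output;
    }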
Parameters
    str  Pointer to line to tokenize. Contents of string may be modified.
    len  Length of line to tokenize.
    fn   Callback function which is called on every token.
size_t turi::csv_line_tokenizer::tokenize_line (char *str, size_t len, std::vector< flexible_type > &output, bool permit_undefined, const std::vector< size_t > *output_order = nullptr)
Tokenizes a line directly into an array of flexible_type values with type specifiers. This version of tokenize_line is strict: the length of the output vector must match the number of columns exactly, and the types of the flexible_type elements must be fully specified.
For instance, if the input line is

    1, hello world, 2.0

then the output vector must have 3 elements.
If the types of the 3 elements in the output vector are [flex_type_enum::INTEGER, flex_type_enum::STRING, flex_type_enum::FLOAT], then they will be parsed as such, emitting an output of [1, "hello world", 2.0].
However, if the types of the 3 elements in the output vector are [flex_type_enum::STRING, flex_type_enum::STRING, flex_type_enum::STRING], then the output will be ["1", "hello world", "2.0"].
Type interpretation failures will produce an error. For instance, if the types are [flex_type_enum::STRING, flex_type_enum::INTEGER, flex_type_enum::STRING], the tokenization will fail because the second element cannot be interpreted as an integer.
Only a subset of flex_type_enum types is currently supported.
The tokenizer will not modify the types of the output vector. However, if permit_undefined is specified, the output type can be set to flex_type_enum::UNDEFINED for an empty non-string field. For instance:
If the input line is

    1, , 2.0

and the type specifiers are [flex_type_enum::INTEGER, flex_type_enum::STRING, flex_type_enum::FLOAT], this will be parsed as [1, "", 2.0] regardless of permit_undefined.
However, with type specifiers [flex_type_enum::INTEGER, flex_type_enum::INTEGER, flex_type_enum::FLOAT] and permit_undefined == false, this will be parsed as [1, 0, 2.0].
And with type specifiers [flex_type_enum::INTEGER, flex_type_enum::INTEGER, flex_type_enum::FLOAT] and permit_undefined == true, this will be parsed as [1, UNDEFINED, 2.0].
Parameters
    str               Pointer to the line to tokenize.
    len               Length of the line to tokenize.
    output            Output vector, which must have the same length as the number of columns and have all of its types specified.
    permit_undefined  If true, allows the output vector to contain flex_type_enum::UNDEFINED for empty non-string fields (see above).
    output_order      Pointer to an array of the same length as the output. Column i will be written to output[output_order[i]]; if output_order[i] == (size_t)(-1), the column is ignored. If output_order == nullptr, this is equivalent to having output_order[i] == i.
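A sketch putting the above together (the flexible_type(flex_type_enum) construction and the reading of the size_t return value as the number of parsed fields are assumptions):

    #include <cstddef>
    #include <cstring>
    #include <vector>
    #include <core/storage/sframe_data/csv_line_tokenizer.hpp>

    // Sketch: parse a mutable line such as "1, hello world, 2.0" into
    // [INTEGER, STRING, FLOAT] columns.
    std::vector<turi::flexible_type> parse_typed(turi::csv_line_tokenizer &tokenizer,
                                                 char *line) {
      using turi::flex_type_enum;
      // One element per column, each carrying the desired output type.
      std::vector<turi::flexible_type> output {
          turi::flexible_type(flex_type_enum::INTEGER),
          turi::flexible_type(flex_type_enum::STRING),
          turi::flexible_type(flex_type_enum::FLOAT)};

      std::size_t parsed = tokenizer.tokenize_line(line, std::strlen(line), output,
                                                   /* permit_undefined */ true);
      if (parsed != output.size()) {
        // assumed: a short count indicates a failure;
        // see get_last_parse_error_diagnosis() for details
      }
      return output;
    }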
char turi::csv_line_tokenizer::comment_char = '#'
The character used to begin a comment (Default '#'). An occurrence of this character outside of quoted strings will cause the parser to ignore the remainder of the line.
    # this is a
    # comment
    user,name,rating
    123,hello,45
    312,chu, 21
    333,zzz, 3 # this is also a comment
    444,aaa, 51
Definition at line 95 of file csv_line_tokenizer.hpp.
std::string turi::csv_line_tokenizer::delimiter = ","
The delimiter character to use to separate fields (Default ",")
Definition at line 71 of file csv_line_tokenizer.hpp.
bool turi::csv_line_tokenizer::double_quote = false
If set to true, pairs of quote characters in a quoted string are interpreted as a single quote (Default false). For instance, if set to true, the 2nd field of the 2nd line is read as "hello "world""
    user, message
    123, "hello ""world"""
Definition at line 112 of file csv_line_tokenizer.hpp.
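A small illustrative sketch of enabling this option (the function name is hypothetical):

    #include <core/storage/sframe_data/csv_line_tokenizer.hpp>

    // Sketch: build a tokenizer that treats "" inside quoted fields as a literal ".
    turi::csv_line_tokenizer make_double_quote_tokenizer() {
      turi::csv_line_tokenizer tokenizer;
      tokenizer.double_quote = true;
      tokenizer.init();
      // With this setting, the second field of the line `123, "hello ""world"""`
      // above is read as: hello "world"
      return tokenizer;
    }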
char turi::csv_line_tokenizer::escape_char = '\\'
The character used to identify the beginning of a C escape sequence (Default '\'). For instance, "\n" will be converted to the newline character, "\\" will be converted to "\", etc. Note that only single-character escapes are converted; Unicode, octal, and hexadecimal escape sequences are not interpreted.
Definition at line 60 of file csv_line_tokenizer.hpp.
std::unordered_set<std::string> turi::csv_line_tokenizer::false_values
string values which map to numeric 0
Definition at line 134 of file csv_line_tokenizer.hpp.
bool turi::csv_line_tokenizer::has_comment_char = true
Whether comment char is used
Definition at line 100 of file csv_line_tokenizer.hpp.
std::string turi::csv_line_tokenizer::line_terminator = "\n"
The string used to separate lines. Defaults to "\n". Setting the line terminator to "\n" has a special effect: "\r", "\r\n", and "\n" are all interpreted as new lines.
Definition at line 79 of file csv_line_tokenizer.hpp.
std::vector<std::string> turi::csv_line_tokenizer::na_values
The strings which will be parsed as missing values.
(also see empty_string_in_na_values)
Definition at line 124 of file csv_line_tokenizer.hpp.
bool turi::csv_line_tokenizer::only_raw_string_substitutions = false
If this is set (defaults to false), the true/false/na substitutions are only permitted on raw unparsed strings, that is, strings before dequoting, de-escaping, etc.
Definition at line 141 of file csv_line_tokenizer.hpp.
bool turi::csv_line_tokenizer::preserve_quoting = false
If set to true, quotes inside a field will be preserved (Default false). i.e. if set to true, the 2nd entry in the following row will be read as ""hello world"" with the quote characters.
* 1,"hello world",5 *
Definition at line 47 of file csv_line_tokenizer.hpp.
char turi::csv_line_tokenizer::quote_char = '\"'
The quote character to use (Default '"')
Definition at line 117 of file csv_line_tokenizer.hpp.
bool turi::csv_line_tokenizer::skip_initial_space = true
If set to true, initial spaces before fields are ignored (Default true).
Definition at line 65 of file csv_line_tokenizer.hpp.
std::unordered_set<std::string> turi::csv_line_tokenizer::true_values
string values which map to numeric 1
Definition at line 129 of file csv_line_tokenizer.hpp.
bool turi::csv_line_tokenizer::use_escape_char = true
If escape_char is used.
Definition at line 52 of file csv_line_tokenizer.hpp.