turicreate.text_analytics.tokenize

turicreate.text_analytics.tokenize(text, to_lower=False, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'])

Tokenize the input SArray of text strings and return an SArray in which each element is the corresponding list of tokens.

Parameters:
text : SArray[str]

Input data of strings representing English text. This tokenizer is not intended to process XML, HTML, or other structured text formats.

to_lower : bool, optional

If True, all strings are converted to lower case before tokenization.

delimiters : list[str] or None, optional

Input strings are tokenized by splitting them on the delimiter characters in this list. Each entry must be a single character. If set to None, a Penn Treebank-style tokenization is used instead, which includes smart handling of punctuation.

Returns:
out : SArray[list]

Each text string in the input is mapped to a list of tokens.

Examples

>>> import turicreate

>>> docs = turicreate.SArray(['This is the first sentence.',
                              "This one, it's the second sentence."])

# Tokenization on space characters only; note that the default delimiter
# list shown in the signature also splits on punctuation.
>>> turicreate.text_analytics.tokenize(docs, delimiters=[' '])
dtype: list
Rows: 2
[['This', 'is', 'the', 'first', 'sentence.'],
 ['This', 'one,', "it's", 'the', 'second', 'sentence.']]
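
Setting to_lower=True lowercases each string before tokenization. The sketch below simply repeats the space-delimited example above with lowercased output, assuming no other behavior changes.

# Lowercase before tokenizing (same space delimiter as above)
>>> turicreate.text_analytics.tokenize(docs, to_lower=True, delimiters=[' '])
dtype: list
Rows: 2
[['this', 'is', 'the', 'first', 'sentence.'],
 ['this', 'one,', "it's", 'the', 'second', 'sentence.']]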

# Penn Treebank-style tokenization
>>> turicreate.text_analytics.tokenize(docs, delimiters=None)
dtype: list
Rows: 2
[['This', 'is', 'the', 'first', 'sentence', '.'],
 ['This', 'one', ',', 'it', "'s", 'the', 'second', 'sentence', '.']]
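
Any list of single-character delimiters can be supplied. The illustrative sketch below uses a hypothetical comma-separated SArray (csv_like is not part of the original example) to show splitting on a single custom delimiter.

# Custom single-character delimiter
>>> csv_like = turicreate.SArray(['a,b,c', 'd,e'])
>>> turicreate.text_analytics.tokenize(csv_like, delimiters=[','])
dtype: list
Rows: 2
[['a', 'b', 'c'],
 ['d', 'e']]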