turicreate.text_analytics.tokenize

turicreate.text_analytics.tokenize(text, to_lower=False, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'])

Tokenize the input SArray of text strings and return an SArray in which each element is the corresponding list of tokens.

Parameters:
text : SArray[str]

Input data of strings representing English text. This tokenizer is not intended to process XML, HTML, or other structured text formats.

to_lower : bool, optional

If True, all strings are converted to lower case before tokenization.

delimiters : list[str] or None, optional

Input strings are tokenized by splitting them on the delimiter characters in this list. Each entry must be a single character. If set to None, a Penn Treebank-style tokenization is used instead, which includes smart handling of punctuation.

Returns:
out : SArray[list]

Each text string in the input is mapped to a list of tokens.

Examples

>>> import turicreate

>>> docs = turicreate.SArray(['This is the first sentence.',
                              "This one, it's the second sentence."])

# Tokenization on space characters only; note that the default delimiter
# list shown in the signature also splits on punctuation.
>>> turicreate.text_analytics.tokenize(docs, delimiters=[' '])
dtype: list
Rows: 2
[['This', 'is', 'the', 'first', 'sentence.'],
 ['This', 'one,', "it's", 'the', 'second', 'sentence.']]
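
Setting to_lower=True lowercases each string before tokenization. The sketch below simply repeats the space-delimited example above with lowercased output, assuming no other behavior changes.

# Lowercase before tokenizing (same space delimiter as above)
>>> turicreate.text_analytics.tokenize(docs, to_lower=True, delimiters=[' '])
dtype: list
Rows: 2
[['this', 'is', 'the', 'first', 'sentence.'],
 ['this', 'one,', "it's", 'the', 'second', 'sentence.']]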

# Penn Treebank-style tokenization
>>> turicreate.text_analytics.tokenize(docs, delimiters=None)
dtype: list
Rows: 2
[['This', 'is', 'the', 'first', 'sentence', '.'],
 ['This', 'one', ',', 'it', "'s", 'the', 'second', 'sentence', '.']]
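
Any list of single-character delimiters can be supplied. The illustrative sketch below uses a hypothetical comma-separated SArray (csv_like is not part of the original example) to show splitting on a single custom delimiter.

# Custom single-character delimiter
>>> csv_like = turicreate.SArray(['a,b,c', 'd,e'])
>>> turicreate.text_analytics.tokenize(csv_like, delimiters=[','])
dtype: list
Rows: 2
[['a', 'b', 'c'],
 ['d', 'e']]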