turicreate.text_analytics.tokenize

turicreate.text_analytics.tokenize(text, to_lower=False, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'])

Tokenize the input SArray of text strings and return the list of tokens.
Parameters:
- text : SArray[str]
  Input data of strings representing English text. This tokenizer is not intended to process XML, HTML, or other structured text formats.
- to_lower : bool, optional
  If True, all strings are converted to lower case before tokenization.
- delimiters : list[str], None, optional
  Input strings are tokenized on the delimiter characters in this list. Each entry in the list must be a single character. If set to None, a Penn Treebank-style tokenization is used instead, which handles punctuation intelligently.

Returns:
- out : SArray[list]
  Each text string in the input is mapped to a list of tokens.
Examples
>>> import turicreate
>>> docs = turicreate.SArray(['This is the first sentence.',
...                           "This one, it's the second sentence."])

# Default tokenization by space characters
>>> turicreate.text_analytics.tokenize(docs)
dtype: list
Rows: 2
[['This', 'is', 'the', 'first', 'sentence.'],
 ['This', 'one,', "it's", 'the', 'second', 'sentence.']]

# Penn Treebank-style tokenization
>>> turicreate.text_analytics.tokenize(docs, delimiters=None)
dtype: list
Rows: 2
[['This', 'is', 'the', 'first', 'sentence', '.'],
 ['This', 'one', ',', 'it', "'s", 'the', 'second', 'sentence', '.']]
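The to_lower flag can be combined with either tokenization mode. The sketch below lowercases the same documents before tokenizing with the default delimiters; the output shown is what the documented to_lower behavior implies for these inputs, not a verified transcript.

# Lowercase before tokenizing (default delimiters)
>>> turicreate.text_analytics.tokenize(docs, to_lower=True)
dtype: list
Rows: 2
[['this', 'is', 'the', 'first', 'sentence.'],
 ['this', 'one,', "it's", 'the', 'second', 'sentence.']]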