turicreate.text_analytics.drop_words

turicreate.text_analytics.drop_words(text, threshold=2, to_lower=True, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'], stop_words=None)

Remove words that occur fewer than a threshold number of times in an SArray. This is a common method of cleaning text before it is used, and it can increase the quality and explainability of models learned on the transformed data.

drop_words can be applied to SArrays of string, dictionary, or list type. Each type is handled as follows (see the illustrative sketch after this list):

  • string : The string is first tokenized. By default, all letters are converted to lower case and the string is split on the default delimiters. Each token is taken to be a word; words occurring fewer than the threshold number of times across the entire column are removed, and the remaining tokens are concatenated back into a string.
  • list : Each element of the list must be a string and is taken to be a single token. Tokens occurring fewer than the threshold number of times are then removed.
  • dict : The keys of the dictionary are taken to be tokens, and the value of each key must be an integer, which is interpreted as the count of that token. Keys whose counts fall below the threshold are removed.
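
The sketch below is an illustrative pure-Python reimplementation of these semantics for the list- and dict-typed cases. It is not the library's actual code: it assumes counts are accumulated across the whole column, as the string case above describes, and it ignores keys that collide after lower-casing.

from collections import Counter

def drop_rare_tokens(rows, threshold=2, to_lower=True):
    # List-typed input: each row is a list of tokens; tokens occurring
    # fewer than `threshold` times across the whole column are dropped.
    if to_lower:
        rows = [[token.lower() for token in row] for row in rows]
    counts = Counter(token for row in rows for token in row)
    return [[token for token in row if counts[token] >= threshold]
            for row in rows]

def drop_rare_keys(rows, threshold=2, to_lower=True):
    # Dict-typed input: keys are tokens and integer values are their
    # counts; keys whose counts fall below `threshold` are removed.
    if to_lower:
        rows = [{key.lower(): count for key, count in row.items()}
                for row in rows]
    return [{key: count for key, count in row.items()
             if count >= threshold} for row in rows]

print(drop_rare_tokens([['one', 'bar bah', 'One'],
                        ['a dog', 'a dog cat', 'A DOG']]))
# [['one', 'one'], ['a dog', 'a dog']]
print(drop_rare_keys([{'alice bob': 1, 'Bob alice': 2},
                      {'a dog': 0, 'a dog cat': 5}]))
# [{'bob alice': 2}, {'a dog cat': 5}]

Both calls reproduce the list and dictionary examples shown under Examples below.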
Parameters:
text : SArray[str | dict | list]

The input text data.

threshold : int, optional

Words occurring fewer than this number of times are removed from the input.

to_lower : bool, optional

Indicates whether to map the input strings to lower case before counting.

delimiters : list[str], optional

A list of delimiter characters for tokenization. By default, the list consists of the whitespace and punctuation characters shown in the signature above. The user can define any custom list of single-character delimiters. Alternatively, setting delimiters=None will use Penn Treebank-style tokenization, which handles punctuation better.

stop_words : list[str], optional

A manually specified list of stop words, which are removed regardless of count.

Returns:
out : SArray

An SArray of the same type as the input, with words occurring fewer than threshold times removed.

Examples

>>> import turicreate

# Create input data
>>> sa = turicreate.SArray(["The quick brown fox jumps in a fox like way.",
...                         "Word word WORD, word!!!word"])

# Run drop_words
>>> turicreate.text_analytics.drop_words(sa)
dtype: str
Rows: 2
['fox fox', 'word word']

# Run drop_words with Penn Treebank-style tokenization to handle
# punctuation
>>> turicreate.text_analytics.drop_words(sa, delimiters=None)
dtype: str
Rows: 2
['fox fox', 'word word word']
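
# A higher threshold drops more words: with threshold=3, a word must
# occur at least three times to be kept (illustrative call; output
# omitted, since it depends on the tokenization details above)
>>> turicreate.text_analytics.drop_words(sa, threshold=3)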

# Run drop_words with dictionary input
>>> sa = turicreate.SArray([{'alice bob': 1, 'Bob alice': 2},
...                         {'a dog': 0, 'a dog cat': 5}])
>>> turicreate.text_analytics.drop_words(sa)
dtype: dict
Rows: 2
[{'bob alice': 2}, {'a dog cat': 5}]

# Run drop_words with list input
>>> sa = turicreate.SArray([['one', 'bar bah', 'One'],
...                         ['a dog', 'a dog cat', 'A DOG']])
>>> turicreate.text_analytics.drop_words(sa)
dtype: list
Rows: 2
[['one', 'one'], ['a dog', 'a dog']]
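
# stop_words removes the listed words regardless of how often they
# occur (illustrative call; output omitted)
>>> turicreate.text_analytics.drop_words(sa, stop_words=['a dog'])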