turicreate.text_analytics.count_words

turicreate.text_analytics.count_words(text, to_lower=True, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'])

If text is an SArray of strings or an SArray of lists of strings, the occurrences of each word are counted for each row in the SArray.

If text is an SArray of dictionaries, the keys are tokenized and each value is used as the count for every word in its key. Counts for the same word in the same row are added together.

This output is commonly known as the “bag-of-words” representation of text data.
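
The per-row behaviour described above can be approximated in plain Python. This is only a sketch under stated assumptions, not Turi Create's implementation: `tokens` and `bag_of_words_row` are hypothetical helper names, and the tokenizer simply splits on runs of ASCII whitespace and punctuation, which is the same character set as the default `delimiters` in the signature.

```python
import re
import string
from collections import Counter

# Assumption: the default delimiter set is ASCII whitespace plus ASCII punctuation.
_DELIMS = string.whitespace + string.punctuation
_SPLIT = re.compile("[" + re.escape(_DELIMS) + "]+")


def tokens(s, to_lower=True):
    """Split a string on runs of delimiter characters, dropping empty tokens."""
    if to_lower:
        s = s.lower()
    return [t for t in _SPLIT.split(s) if t]


def bag_of_words_row(row, to_lower=True):
    """Hypothetical per-row counter for str, list-of-str, or dict rows."""
    counts = Counter()
    if isinstance(row, str):
        counts.update(tokens(row, to_lower))
    elif isinstance(row, list):
        # Every string in the list contributes to the same row-level counts.
        for s in row:
            counts.update(tokens(s, to_lower))
    elif isinstance(row, dict):
        # Each key is tokenized; its value is added to every word in the key.
        for key, value in row.items():
            for t in tokens(key, to_lower):
                counts[t] += value
    else:
        raise TypeError("row must be a str, list, or dict")
    return dict(counts)
```

For instance, `bag_of_words_row({'alice bob': 1, 'Bob alice': 0.5})` gives 1.5 for both `'alice'` and `'bob'`, because the two keys tokenize to the same lowercased words and their values are summed.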

Parameters:
text : SArray[str | dict | list]

SArray of type: string, dict or list.

to_lower : bool, optional

If True, all strings are converted to lower case before counting.

delimiters : list[str], None, optional

Input strings are tokenized using the delimiter characters in this list. Each entry in this list must contain a single character. If set to None, Penn Treebank-style tokenization is used, which includes smart handling of punctuation.
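
To illustrate how the choice of `delimiters` changes the tokens, here is a small pure-Python sketch. `split_on_delimiters` is a hypothetical helper, not part of Turi Create, and raising `ValueError` for multi-character entries is an assumption about how the single-character rule might be enforced. It does not attempt to reproduce the Penn Treebank-style tokenization used when `delimiters=None`.

```python
def split_on_delimiters(text, delimiters, to_lower=True):
    """Split text wherever one of the single-character delimiters occurs."""
    for d in delimiters:
        if len(d) != 1:
            # Assumed behaviour: reject entries that are not exactly one character.
            raise ValueError("each delimiter must be a single character")
    if to_lower:
        text = text.lower()
    delim_set = set(delimiters)
    result, current = [], []
    for ch in text:
        if ch in delim_set:
            # A delimiter closes the current token, if there is one.
            if current:
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        result.append("".join(current))
    return result


# Whitespace-only delimiters leave punctuation attached to the word.
print(split_on_delimiters("The quick brown fox jumps.", [" "]))
# Adding '.' as a delimiter strips the trailing period.
print(split_on_delimiters("The quick brown fox jumps.", [" ", "."]))
```

With only `' '` as a delimiter the last token is `'jumps.'`; once `'.'` is added it becomes `'jumps'`, which is the behaviour to expect from the default delimiter list.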

Returns:
out : SArray[dict]

An SArray with the same length as the `text` input. For each row, the keys of the dictionary are the words and the values are the corresponding counts.

Examples

>>> import turicreate

# Create input data
>>> sa = turicreate.SArray(["The quick brown fox jumps.",
...                         "Word word WORD, word!!!word"])

# Run count_words
>>> turicreate.text_analytics.count_words(sa)
dtype: dict
Rows: 2
[{'quick': 1, 'brown': 1, 'the': 1, 'fox': 1, 'jumps': 1},
 {'word': 5}]

# Run count_words with Penn Treebank-style tokenization to handle
# punctuation
>>> turicreate.text_analytics.count_words(sa, delimiters=None)
dtype: dict
Rows: 2
[{'brown': 1, 'jumps': 1, 'fox': 1, '.': 1, 'quick': 1, 'the': 1},
 {'word': 3, 'word!!!word': 1, ',': 1}]

# Run count_words with dictionary input
>>> sa = turicreate.SArray([{'alice bob': 1, 'Bob alice': 0.5},
...                         {'a dog': 0, 'a dog cat': 5}])
>>> turicreate.text_analytics.count_words(sa)
dtype: dict
Rows: 2
[{'bob': 1.5, 'alice': 1.5}, {'a': 5, 'dog': 5, 'cat': 5}]

# Run count_words with list input
>>> sa = turicreate.SArray([['one', 'bar bah'], ['a dog', 'a dog cat']])
>>> turicreate.text_analytics.count_words(sa)
dtype: dict
Rows: 2
[{'bar': 1, 'bah': 1, 'one': 1}, {'a': 2, 'dog': 2, 'cat': 1}]