turicreate.text_analytics.count_ngrams

turicreate.text_analytics.count_ngrams(text, n=2, method='word', to_lower=True, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'], ignore_punct=True, ignore_space=True)

Return an SArray of dict type where each element contains the count for each of the n-grams that appear in the corresponding input element. The n-grams can be either character n-grams or word n-grams. The input SArray may contain strings, dicts with string keys and numeric values, or lists of strings.

Parameters:
text : SArray[str | dict | list]

Input text data.

n : int, optional

The number of words (or characters, if method is “character”) in each n-gram. With method set to “word”, an n value of 1 returns word counts.

method : {‘word’, ‘character’}, optional

If “word”, the function performs a count of word n-grams. If “character”, does a character n-gram count.

to_lower : bool, optional

If True, all words are converted to lower case before counting.

delimiters : list[str], None, optional

If method is “word”, input strings are tokenized using the delimiter characters in this list. Each entry in this list must contain a single character. If set to None, a Penn treebank-style tokenization is used, which includes smart handling of punctuation. If method is “character”, this option is ignored.

ignore_punct : bool, optional

If method is “character”, indicates whether punctuation between words is ignored when forming n-grams. For instance, for the input SArray element “fun.games”, setting this parameter to False yields the tri-gram ‘n.g’. If ignore_punct is set to True, there is no such tri-gram (although ‘nga’ is still produced). This parameter has no effect if method is set to “word”.

ignore_space : bool, optional

If method is “character”, indicates whether spaces between words are ignored when forming n-grams. For instance, for the input SArray element “fun games”, setting this parameter to False yields the tri-gram ‘n g’. If ignore_space is set to True, there is no such tri-gram (although ‘nga’ is still produced). This parameter has no effect if method is set to “word”.
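
The effect of ignore_punct and ignore_space on character n-grams can be sketched in plain Python. This is an illustrative stand-in, not Turi Create's implementation: the helper name char_ngrams is hypothetical, and treating string.punctuation and str.isspace() as the sets of ignored characters is an assumption.

```python
import string
from collections import Counter


def char_ngrams(text, n=3, to_lower=True, ignore_punct=True, ignore_space=True):
    """Count character n-grams (hypothetical helper, not the library code)."""
    if to_lower:
        text = text.lower()
    # Drop punctuation and/or whitespace before sliding the window.
    # Assumption: string.punctuation approximates the library's punctuation set.
    chars = [
        c for c in text
        if not (ignore_punct and c in string.punctuation)
        and not (ignore_space and c.isspace())
    ]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


# With ignore_punct=False the '.' stays in the stream, so 'n.g' is a tri-gram.
print(char_ngrams("fun.games", ignore_punct=False))
# With the default ignore_punct=True the '.' is dropped, so 'nga' appears instead.
print(char_ngrams("fun.games"))
```

Filtering happens before the window slides over the text, which is why ‘nga’ spans what used to be a word boundary once the separator is ignored.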

Returns:
out : SArray[dict]

An SArray of dictionary type, where each key is the n-gram string and each value is its count.

See also

count_words, tokenize

Notes

  • Ignoring case (with to_lower) involves a full string copy of the SArray data. To increase speed for large documents, set to_lower to False.
  • Punctuation and spaces are both delimiters by default when counting word n-grams. When counting character n-grams, one may choose to ignore punctuation, spaces, neither, or both.
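
To make the delimiter behavior for word n-grams concrete, here is a minimal pure-Python sketch. It is not the library's code: word_ngrams and DEFAULT_DELIMITERS are hypothetical names, the delimiter set is copied from the default in the signature above, and joining tokens with a single space is inferred from the example outputs on this page.

```python
from collections import Counter

# Default delimiter characters, copied from the documented signature.
DEFAULT_DELIMITERS = set("\r\x0b\n\x0c\t !#$%&'\"()*+,-./:;<=>?@[\\]^_`{|}~")


def word_ngrams(text, n=2, to_lower=True, delimiters=DEFAULT_DELIMITERS):
    """Count word n-grams (hypothetical helper, not the library code)."""
    if to_lower:
        text = text.lower()
    # Split on any single delimiter character, discarding empty tokens.
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    # Slide a window of n tokens and join each window with a space.
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


print(word_ngrams("I like big dogs. I LIKE BIG DOGS.", 3))
```

Because the period and the space are both delimiters, the sentence boundary disappears and n-grams such as ‘dogs i like’ span it.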

Examples

>>> import turicreate

# Counting word n-grams:
>>> sa = turicreate.SArray(['I like big dogs. I LIKE BIG DOGS.'])
>>> turicreate.text_analytics.count_ngrams(sa, 3)
dtype: dict
Rows: 1
[{'big dogs i': 1, 'like big dogs': 2, 'dogs i like': 1, 'i like big': 2}]

# Counting character n-grams:
>>> sa = turicreate.SArray(['Fun. Is. Fun'])
>>> turicreate.text_analytics.count_ngrams(sa, 3, "character")
dtype: dict
Rows: 1
[{'fun': 2, 'nis': 1, 'sfu': 1, 'isf': 1, 'uni': 1}]

# Run count_ngrams with dictionary input
>>> sa = turicreate.SArray([{'alice bob': 1, 'Bob alice': 0.5},
                            {'a dog': 0, 'a dog cat': 5}])
>>> turicreate.text_analytics.count_ngrams(sa)
dtype: dict
Rows: 2
[{'bob alice': 0.5, 'alice bob': 1}, {'dog cat': 5, 'a dog': 5}]

# Run count_ngrams with list input
>>> sa = turicreate.SArray([['one', 'bar bah'], ['a dog', 'a dog cat']])
>>> turicreate.text_analytics.count_ngrams(sa)
dtype: dict
Rows: 2
[{'bar bah': 1}, {'dog cat': 1, 'a dog': 2}]
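
The dictionary and list examples above suggest that each n-gram found in a dict key is incremented by that key's value, while each string in a list contributes a count of 1. The sketch below reproduces those outputs under that reading. It is an inference from the examples, not the library's implementation: count_element and ngrams are hypothetical helpers, and whitespace splitting stands in for the full delimiter handling.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Return the word n-grams of a string (simplified whitespace tokenizer)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_element(element, n=2):
    """Count n-grams for one SArray element (hypothetical helper)."""
    if isinstance(element, dict):
        # Assumption: each key's value acts as a weight for its n-grams.
        weighted = list(element.items())
    elif isinstance(element, list):
        weighted = [(s, 1) for s in element]
    else:
        weighted = [(element, 1)]
    counts = defaultdict(int)
    for text, weight in weighted:
        for gram in ngrams(text, n):
            counts[gram] += weight
    return dict(counts)


print(count_element({'a dog': 0, 'a dog cat': 5}))
print(count_element(['a dog', 'a dog cat']))
```

Under this reading, 'a dog' receives 0 from the first key and 5 from the second in the dict case, and 1 from each string in the list case.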