turicreate.text_analytics.count_ngrams
turicreate.text_analytics.count_ngrams(text, n=2, method='word', to_lower=True, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'], ignore_punct=True, ignore_space=True)
Return an SArray of dict type where each element contains the count for each of the n-grams that appear in the corresponding input element. The n-grams can be either character n-grams or word n-grams. The input SArray may contain strings, dicts with string keys and numeric values, or lists of strings.
Parameters:
- text : SArray[str | dict | list]
Input text data.
- n : int, optional
The number of words (or characters, when method is “character”) in each n-gram. An n value of 1 returns word counts.
- method : {‘word’, ‘character’}, optional
If “word”, the function counts word n-grams. If “character”, it counts character n-grams.
- to_lower : bool, optional
If True, all words are converted to lower case before counting.
- delimiters : list[str], None, optional
If method is “word”, input strings are tokenized using the delimiter characters in this list. Each entry in this list must contain a single character. If set to None, a Penn treebank-style tokenization is used, which handles punctuation more carefully. If method is “character”, this option is ignored.
- ignore_punct : bool, optional
If method is “character”, indicates whether punctuation between words is ignored (True) or counted as part of the n-gram (False). For instance, with the input SArray element “fun.games”, if this parameter is set to False one tri-gram would be ‘n.g’. If ignore_punct is set to True, there would be no such tri-gram; ‘nga’ would be counted instead. This parameter has no effect if method is “word”.
- ignore_space : bool, optional
If method is “character”, indicates whether spaces between words are ignored (True) or counted as part of the n-gram (False). For instance, with the input SArray element “fun games”, if this parameter is set to False one tri-gram would be ‘n g’. If ignore_space is set to True, there would be no such tri-gram; ‘nga’ would be counted instead. This parameter has no effect if method is “word”.
Returns:
- out : SArray[dict]
An SArray of dictionary type, where each key is the n-gram string and each value is its count.
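To make the word n-gram behaviour described above concrete, here is a minimal pure-Python sketch. It is not the turicreate implementation: `word_ngrams` is a hypothetical helper, its default delimiter tuple is a shortened stand-in for the full default list, and it handles a single string rather than an SArray.

```python
import re
from collections import Counter


# Hypothetical helper for illustration only; not part of turicreate.
def word_ngrams(text, n=2, to_lower=True,
                delimiters=(" ", ".", ",", "!", "?", "\t", "\n")):
    """Count word n-grams in one string, splitting on single-character delimiters."""
    if to_lower:
        text = text.lower()
    # Split on any one of the delimiter characters and drop empty tokens.
    pattern = "|".join(re.escape(d) for d in delimiters)
    tokens = [tok for tok in re.split(pattern, text) if tok]
    # Slide a window of n tokens across the token list.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return dict(Counter(grams))


counts = word_ngrams("I like big dogs. I LIKE BIG DOGS.", n=3)
print(counts)
```

Because the period is a delimiter here, n-grams such as 'big dogs i' span the sentence boundary, and lowercasing merges the two sentences into the same counts.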
Notes
- Ignoring case (with to_lower) involves a full string copy of the SArray data. To increase speed for large documents, set to_lower to False.
- Punctuation and spaces are both delimiters by default when counting word n-grams. When counting character n-grams, one may choose to ignore punctuation, spaces, neither, or both.
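The four combinations of ignore_punct and ignore_space for character n-grams can be illustrated with a small pure-Python sketch. This is an approximation under stated assumptions, not the library's code: `char_ngrams` is a hypothetical helper, and it treats Python's `string.punctuation` and `str.isspace` as the definitions of punctuation and space, which may differ from what turicreate uses.

```python
import string
from collections import Counter


# Hypothetical helper for illustration only; not part of turicreate.
def char_ngrams(text, n=3, to_lower=True, ignore_punct=True, ignore_space=True):
    """Count character n-grams in one string, optionally skipping punctuation and spaces."""
    if to_lower:
        text = text.lower()
    kept = []
    for ch in text:
        if ignore_punct and ch in string.punctuation:
            continue  # assumption: ASCII punctuation only
        if ignore_space and ch.isspace():
            continue
        kept.append(ch)
    chars = "".join(kept)
    # Slide a window of n characters across the filtered string.
    grams = (chars[i:i + n] for i in range(len(chars) - n + 1))
    return dict(Counter(grams))


print(char_ngrams("Fun. Is. Fun"))
print(char_ngrams("fun.games", ignore_punct=False))
print(char_ngrams("fun games", ignore_space=False))
```

With both flags left at True, the filtered string for "Fun. Is. Fun" is "funisfun", so tri-grams run across the original word boundaries.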
Examples
>>> import turicreate

# Counting word n-grams:
>>> sa = turicreate.SArray(['I like big dogs. I LIKE BIG DOGS.'])
>>> turicreate.text_analytics.count_ngrams(sa, 3)
dtype: dict
Rows: 1
[{'big dogs i': 1, 'like big dogs': 2, 'dogs i like': 1, 'i like big': 2}]

# Counting character n-grams:
>>> sa = turicreate.SArray(['Fun. Is. Fun'])
>>> turicreate.text_analytics.count_ngrams(sa, 3, "character")
dtype: dict
Rows: 1
[{'fun': 2, 'nis': 1, 'sfu': 1, 'isf': 1, 'uni': 1}]

# Run count_ngrams with dictionary input
>>> sa = turicreate.SArray([{'alice bob': 1, 'Bob alice': 0.5}, {'a dog': 0, 'a dog cat': 5}])
>>> turicreate.text_analytics.count_ngrams(sa)
dtype: dict
Rows: 2
[{'bob alice': 0.5, 'alice bob': 1}, {'dog cat': 5, 'a dog': 5}]

# Run count_ngrams with list input
>>> sa = turicreate.SArray([['one', 'bar bah'], ['a dog', 'a dog cat']])
>>> turicreate.text_analytics.count_ngrams(sa)
dtype: dict
Rows: 2
[{'bar bah': 1}, {'dog cat': 1, 'a dog': 2}]
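The dictionary and list outputs above are consistent with each key (or list entry) being tokenized on its own, with dictionary values acting as weights and list entries counting once each. The sketch below reproduces those outputs under that reading. It is an inference from the examples, not a statement of how turicreate is implemented; `ngrams_of` and `weighted_ngrams` are hypothetical helpers, and tokenization is simplified to whitespace splitting.

```python
from collections import Counter


# Hypothetical helpers for illustration only; not part of turicreate.
def ngrams_of(text, n=2):
    """Return the word n-grams of one string (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_ngrams(element, n=2):
    """Count n-grams for a dict (string -> weight) or a list of strings."""
    # Treat a list of strings as a dict in which every string has weight 1.
    if isinstance(element, dict):
        items = element.items()
    else:
        items = [(s, 1) for s in element]
    counts = Counter()
    for text, weight in items:
        # N-grams are taken within each string, never across two strings.
        for gram in ngrams_of(text, n):
            counts[gram] += weight
    return dict(counts)


print(weighted_ngrams({'a dog': 0, 'a dog cat': 5}))
print(weighted_ngrams(['a dog', 'a dog cat']))
```

Under this reading, 'a dog' in the second dictionary totals 5 because the key 'a dog' contributes 0 and the key 'a dog cat' contributes 5, and the list ['one', 'bar bah'] yields no 'one bar' bi-gram because the two entries are tokenized separately.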