This module provides utilities for text processing.

Note that standard SArray utilities can be used to transform text data into “bag of words” format, where a document is represented as a dictionary mapping each unique word to the number of times that word occurs in the document. See count_words() for more details. Also see pack_columns() and unstack() for ways of creating SArrays containing dictionary types.
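
To make the bag-of-words representation concrete, the following is a minimal sketch in plain Python. It does not use the SArray API: the helper name `bag_of_words` is hypothetical, and the lowercasing and whitespace tokenization are assumptions made for illustration, so count_words() may treat text differently.

```python
from collections import Counter

def bag_of_words(document):
    """Hypothetical helper: map each unique word to its number of occurrences."""
    # Assumption: lowercase the text and split on whitespace.
    return dict(Counter(document.lower().split()))

docs = ["the quick brown fox", "the dog and the fox"]
bags = [bag_of_words(d) for d in docs]
```

Each element of `bags` is a dictionary for one document, which is the shape of data the rest of this module expects.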

We provide methods for learning topic models, which can be useful for modeling large document collections. See topic_model.create() for more information.

term frequency transformations

bm25 For a given query and set of documents, compute the BM25 score for each document.
tf_idf Compute the TF-IDF scores for each word in each document.
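
As an illustration of what a TF-IDF transformation computes, here is a small sketch in plain Python over bag-of-words dictionaries. It assumes one common formulation, count times log(N / document frequency); the helper name `tf_idf_scores` is hypothetical, and the weighting used by tf_idf in this module may differ.

```python
import math
from collections import Counter

def tf_idf_scores(bags):
    """Hypothetical helper: bags is a list of {word: count} dictionaries."""
    n_docs = len(bags)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for bag in bags for word in bag)
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in bag.items()}
        for bag in bags
    ]

bags = [{"the": 1, "fox": 2}, {"the": 2, "dog": 1}]
scores = tf_idf_scores(bags)
```

Under this formulation a word that appears in every document, such as "the" here, receives a score of zero, while words concentrated in few documents are weighted up.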

topic models

topic_model.create Create a topic model from the given data set.
topic_model.TopicModel TopicModel objects can be used to predict the underlying topic of a document.


count_words If text is an SArray of strings or an SArray of lists of strings, the occurrences of each word are counted for each row in the SArray.
count_ngrams Return an SArray of dict type where each element contains the count for each of the n-grams that appear in the corresponding input element.
parse_sparse Parse a file that’s in libSVM format.
parse_docword Parse a file that’s in “docword” format.
random_split Utility for performing a random split for text data that is already in bag-of-words format.
stop_words Get common words that are often removed during preprocessing of text data.
tokenize Tokenize the input SArray of text strings and return the list of tokens.
drop_words Remove words that occur fewer than a certain number of times in an SArray.
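
Two of the operations listed above, counting n-grams and dropping rare words, can be sketched in plain Python as follows. The helper names `ngram_counts` and `drop_rare_words` are hypothetical, and the choices made here (word-level n-grams joined by a space, a threshold applied to each word's total count across all documents) are assumptions rather than a description of how count_ngrams or drop_words behave.

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Hypothetical helper: count word n-grams in a list of tokens."""
    # Slide a window of length n over the token list.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return dict(Counter(grams))

def drop_rare_words(bags, threshold=2):
    """Hypothetical helper: remove words whose total count is below threshold."""
    # Total count of each word across all documents.
    totals = Counter()
    for bag in bags:
        totals.update(bag)
    return [
        {word: count for word, count in bag.items() if totals[word] >= threshold}
        for bag in bags
    ]

bigrams = ngram_counts(["a", "b", "a", "b"], n=2)
trimmed = drop_rare_words([{"a": 1, "b": 1}, {"a": 1}], threshold=2)
```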