text_analytics
This module provides utilities for text processing.

Standard SArray utilities can be used to transform text data into “bag of words” format, where a document is represented as a dictionary mapping each unique word to the number of times it occurs in the document. See count_words() for more details. Also see pack_columns() and unstack() for ways of creating SArrays containing dictionary types.

We also provide methods for learning topic models, which can be useful for modeling large document collections. See create() for more info.
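For example, count_words() turns an SArray of raw strings into an SArray of word-count dictionaries. The following is a minimal sketch assuming the package is importable as turicreate (the same calls exist under the graphlab namespace); the sample documents are invented for illustration.

    import turicreate as tc

    # Three toy documents, invented for illustration.
    docs = tc.SArray(["the quick brown fox",
                      "the lazy dog",
                      "the quick dog"])

    # Bag-of-words: each element is a dict mapping word -> count.
    bow = tc.text_analytics.count_words(docs)
    print(bow[0])  # {'the': 1, 'quick': 1, 'brown': 1, 'fox': 1}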
term frequency transformations
bm25 | For a given query and set of documents, compute the BM25 score for each document.
tf_idf | Compute the TF-IDF scores for each word in each document.
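A rough sketch of calling these two transformations, under the same turicreate-style import assumption; the query terms are made up.

    import turicreate as tc

    docs = tc.SArray(["the quick brown fox",
                      "the lazy dog",
                      "the quick dog"])
    bow = tc.text_analytics.count_words(docs)

    # TF-IDF: an SArray of dicts mapping each word in a document
    # to its TF-IDF score.
    tfidf_scores = tc.text_analytics.tf_idf(bow)

    # BM25: score each document against a query given as a list of words;
    # the result is an SFrame with one row per matching document.
    bm25_scores = tc.text_analytics.bm25(docs, ['quick', 'dog'])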
topic models
topic_model.create | Create a topic model from the given data set.
topic_model.TopicModel | TopicModel objects can be used to predict the underlying topic of a document.
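A minimal sketch of fitting and applying a topic model on bag-of-words input, again assuming a turicreate-style import; the tiny corpus and the num_topics and num_iterations values are placeholders (real corpora should be far larger).

    import turicreate as tc

    docs = tc.SArray(["cats and dogs are popular pets",
                      "stock prices fell on the news",
                      "the match ended in a draw"])

    # Topic models expect bag-of-words input.
    bow = tc.text_analytics.count_words(docs)

    # Fit a small model; the parameter values are illustrative only.
    model = tc.topic_model.create(bow, num_topics=2, num_iterations=20)

    # Predict the most likely topic for each document.
    topics = model.predict(bow)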
utilities
count_words | If text is an SArray of strings or an SArray of lists of strings, the occurrences of each word are counted for each row in the SArray.
count_ngrams | Return an SArray of dict type where each element contains the count for each of the n-grams that appear in the corresponding input element.
parse_sparse | Parse a file that’s in libSVM format.
parse_docword | Parse a file that’s in “docword” format.
random_split | Utility for performing a random split for text data that is already in bag-of-words format.
stop_words | Get common words, i.e. stop words, that are often removed during preprocessing of text data.
tokenize | Tokenize the input SArray of text strings and return the list of tokens.
drop_words | Remove words that occur below a certain number of times in an SArray.
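The sketch below strings a few of these utilities together, under the same turicreate-style import assumption; the documents and the split probability are arbitrary.

    import turicreate as tc

    docs = tc.SArray(["The quick brown fox jumps over the lazy dog",
                      "A quick test of the text utilities"])

    # Tokenize into lists of words.
    tokens = tc.text_analytics.tokenize(docs)

    # Count bigrams for each document.
    bigrams = tc.text_analytics.count_ngrams(docs, n=2)

    # Build bag-of-words dictionaries and strip common stop words
    # using the standard SArray dict utilities.
    bow = tc.text_analytics.count_words(docs)
    bow = bow.dict_trim_by_keys(list(tc.text_analytics.stop_words()), exclude=True)

    # Randomly split the word counts into train/test portions.
    train, test = tc.text_analytics.random_split(bow, prob=0.8)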