turicreate.topic_model.create

turicreate.topic_model.create(dataset, num_topics=10, initial_topics=None, alpha=None, beta=0.1, num_iterations=10, num_burnin=5, associations=None, verbose=False, print_interval=10, validation_set=None, method='auto')

Create a topic model from the given data set. A topic model assumes each document is a mixture of a set of topics, where for each topic some words are more likely than others. This method learns such a model for the given document collection.

Parameters:
dataset : SArray of type dict or SFrame with a single column of type dict

A bag of words representation of a document corpus. Each element is a dictionary representing a single document, where the keys are words and the values are the number of times that word occurs in that document.
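
For illustration, a minimal input of this form can be built directly as an SArray of dictionaries (the two documents and the name tiny_docs below are made up for this sketch); raw text can also be converted to this format with turicreate.text_analytics.count_words.

>>> tiny_docs = turicreate.SArray([{'hurricane': 3, 'wind': 2, 'storm': 1},
                                   {'election': 4, 'vote': 2}])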

num_topics : int, optional

The number of topics to learn.

initial_topics : SFrame, optional

An SFrame with a column of unique words representing the vocabulary and a column of dense vectors representing probability of that word given each topic. When provided, these values are used to initialize the algorithm.

alpha : float, optional

Hyperparameter that controls the diversity of topics in a document. Smaller values encourage fewer topics per document. Provided value must be positive. Default value is 50/num_topics.

beta : float, optional

Hyperparameter that controls the diversity of words in a topic. Smaller values encourage fewer words per topic. Provided value must be positive.

num_iterations : int, optional

The number of iterations to perform.

num_burnin : int, optional

The number of iterations to perform when inferring the topics for documents at prediction time.
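
As a sketch, the burn-in used at prediction time can also be set per call, assuming a fitted model m and that predict() accepts a num_burnin argument (treat that argument as an assumption if your version differs).

>>> assignments = m.predict(docs, num_burnin=20)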

verbose : bool, optional

When True, print the most probable words for each topic along with the progress output.

print_interval : int, optional

The number of iterations to wait between progress reports.

associations : SFrame, optional

An SFrame with two columns named “word” and “topic” containing words and the topic id that the word should be associated with. These words are not considered during learning.

validation_set : SArray of type dict or SFrame with a single column of type dict, optional

A bag of words representation of a document corpus, in the same format required for dataset. This is used to monitor model performance during training. Each document in the validation set is randomly split: the first portion is used to estimate which topics the document belongs to, and the second portion is used to estimate the model’s performance at predicting that document’s held-out words.
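
A brief sketch of passing a validation set, where roughly 10% of the documents are held out (the 90/10 split and the column name 'bow' are arbitrary choices for illustration):

>>> sf = turicreate.SFrame({'bow': docs})
>>> train_sf, valid_sf = sf.random_split(0.9)
>>> m = turicreate.topic_model.create(train_sf['bow'],
                                      validation_set=valid_sf['bow'])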

method : {‘cgs’, ‘alias’}, optional

The algorithm used for learning the model.

  • cgs: Collapsed Gibbs sampling.
  • alias: AliasLDA method.
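
For instance, to force a particular sampler rather than relying on the default 'auto' setting:

>>> m = turicreate.topic_model.create(docs, num_topics=20, method='alias')
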
Returns:
out : TopicModel

A fitted topic model. This can be used with get_topics() and predict(). While fitting is in progress, several metrics are shown, including:

Field            Description
Elapsed Time     The number of elapsed seconds.
Tokens/second    The number of unique words processed per second.
Est. Perplexity  An estimate of the model’s ability to model the training data. See the documentation on evaluate.
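
A sketch of the typical follow-up calls mentioned above, assuming a fitted model m and the docs SArray from the Examples section:

>>> topics = m.get_topics()        # SFrame of the most probable words per topic
>>> assignments = m.predict(docs)  # predicted topic assignment for each document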

Examples

The following example loads an SArray of documents, where each element represents a document in “bag of words” representation, i.e. a dictionary whose keys are words and whose values are the number of times each word occurred in the document:

>>> docs = turicreate.SArray('https://static.turi.com/datasets/nytimes')

Once in this form, it is straightforward to learn a topic model.

>>> m = turicreate.topic_model.create(docs)

It is also easy to create a new topic model from an old one – whether it was created using Turi Create or another package.

>>> m2 = turicreate.topic_model.create(docs, initial_topics=m['topics'])

To manually fix several words to always be assigned to a topic, use the associations argument. The following will ensure that topic 0 has the most probability for each of the provided words:

>>> from turicreate import SFrame
>>> associations = SFrame({'word':['hurricane', 'wind', 'storm'],
                           'topic': [0, 0, 0]})
>>> m = turicreate.topic_model.create(docs,
                                    associations=associations)

More advanced usage allows you to control aspects of the model and the learning method.

>>> import turicreate as tc
>>> m = tc.topic_model.create(docs,
                              num_topics=20,       # number of topics
                              num_iterations=10,   # algorithm parameters
                              alpha=.01, beta=.1)  # hyperparameters

To evaluate the model’s ability to generalize, we can create a train/test split where a portion of the words in each document are held out from training.

>>> train, test = tc.text_analytics.random_split(docs, .8)
>>> m = tc.topic_model.create(train)
>>> results = m.evaluate(test)
>>> print(results['perplexity'])