turicreate.topic_model.create
turicreate.topic_model.create(dataset, num_topics=10, initial_topics=None, alpha=None, beta=0.1, num_iterations=10, num_burnin=5, associations=None, verbose=False, print_interval=10, validation_set=None, method='auto')
Create a topic model from the given data set. A topic model is a statistical model that assumes each document is a mixture of a set of topics, where for each topic some words are more likely than others. This method learns a topic model for the given document collection.
Parameters:
- dataset : SArray of type dict or SFrame with a single column of type dict
A bag of words representation of a document corpus. Each element is a dictionary representing a single document, where the keys are words and the values are the number of times that word occurs in that document.
- num_topics : int, optional
The number of topics to learn.
- initial_topics : SFrame, optional
An SFrame with a column of unique words representing the vocabulary and a column of dense vectors representing probability of that word given each topic. When provided, these values are used to initialize the algorithm.
- alpha : float, optional
Hyperparameter that controls the diversity of topics in a document. Smaller values encourage fewer topics per document. Provided value must be positive. Default value is 50/num_topics.
- beta : float, optional
Hyperparameter that controls the diversity of words in a topic. Smaller values encourage fewer words per topic. Provided value must be positive.
- num_iterations : int, optional
The number of iterations to perform.
- num_burnin : int, optional
The number of iterations to perform when inferring the topics for documents at prediction time.
- verbose : bool, optional
When True, print most probable words for each topic while printing progress.
- print_interval : int, optional
The number of iterations to wait between progress reports.
- associations : SFrame, optional
An SFrame with two columns named “word” and “topic” containing words and the topic id that the word should be associated with. These words are not considered during learning.
- validation_set : SArray of type dict or SFrame with a single column, optional
A bag of words representation of a document corpus, similar to the format required for dataset. This will be used to monitor model performance during training. Each document in the provided validation set is randomly split: the first portion is used to estimate which topic each document belongs to, and the second portion is used to estimate the model’s performance at predicting the unseen words in the test data.
- method : {‘cgs’, ‘alias’}, optional
The algorithm used for learning the model.
- cgs: Collapsed Gibbs sampling
- alias: AliasLDA method.
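The per-document split described for validation_set can be pictured with a small pure-Python sketch. It assumes each word occurrence is assigned independently to one of the two portions. The function name split_document, the 50/50 default, and this exact splitting rule are illustrative assumptions, not a description of the library's internals.

```python
import random

def split_document(bag, first_fraction=0.5, seed=None):
    """Randomly divide one bag-of-words dict into two disjoint bags.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    first, second = {}, {}
    for word, count in bag.items():
        for _ in range(count):
            # Each occurrence lands in the first portion with probability
            # first_fraction, otherwise in the second portion.
            target = first if rng.random() < first_fraction else second
            target[word] = target.get(word, 0) + 1
    return first, second

doc = {"storm": 3, "wind": 2, "rain": 1}
observed, held_out = split_document(doc, seed=0)
```

Under this sketch, the first bag would play the role of the portion used to estimate a document's topics, and the second bag the portion whose words the model is asked to predict.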
Returns:
- out : TopicModel
A fitted topic model. This can be used with get_topics() and predict(). While fitting is in progress, several metrics are shown, including:
- Elapsed Time: The number of elapsed seconds.
- Tokens/second: The number of unique words processed per second.
- Est. Perplexity: An estimate of the model’s ability to model the training data. See the documentation on evaluate.
References
- Wikipedia - Latent Dirichlet allocation
- Alias method: Li, A. et al. (2014). Reducing the Sampling Complexity of Topic Models. KDD 2014.
Examples
The following example loads an SArray of documents, where each element represents a document in “bag of words” representation: a dictionary whose keys are words and whose values are the number of times each word occurred in the document:
>>> docs = turicreate.SArray('https://static.turi.com/datasets/nytimes')
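For readers starting from raw text rather than a prepared dataset, the bag-of-words form can be pictured with a minimal pure-Python sketch. The helper name to_bag_of_words and the lowercase, whitespace-only tokenization are assumptions made for illustration; they are not part of Turi Create.

```python
from collections import Counter

def to_bag_of_words(text):
    """Return a {word: count} dict for one document.

    Hypothetical helper: lowercases the text and splits on whitespace only.
    """
    return dict(Counter(text.lower().split()))

raw_docs = [
    "the storm brought wind and rain",
    "wind wind everywhere",
]
# One dictionary per document, matching the format expected for `dataset`.
bags = [to_bag_of_words(doc) for doc in raw_docs]
```

A list of such dictionaries could then be wrapped in an SArray. Turi Create also ships turicreate.text_analytics.count_words, which is likely the more convenient route when the text is already in an SArray.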
Once in this form, it is straightforward to learn a topic model.
>>> m = turicreate.topic_model.create(docs)
It is also easy to create a new topic model from an old one – whether it was created using Turi Create or another package.
>>> m2 = turicreate.topic_model.create(docs, initial_topics=m['topics'])
To manually fix several words to always be assigned to a topic, use the associations argument. The following will ensure that topic 0 has the highest probability for each of the provided words:
>>> from turicreate import SFrame
>>> associations = SFrame({'word': ['hurricane', 'wind', 'storm'],
...                        'topic': [0, 0, 0]})
>>> m = turicreate.topic_model.create(docs, associations=associations)
More advanced usage allows you to control aspects of the model and the learning method.
>>> import turicreate as tc
>>> m = tc.topic_model.create(docs,
...                           num_topics=20,       # number of topics
...                           num_iterations=10,   # algorithm parameters
...                           alpha=.01, beta=.1)  # hyperparameters
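To build intuition for why a small alpha encourages fewer topics per document, the sketch below assumes alpha plays its usual role in latent Dirichlet allocation: the concentration parameter of a symmetric Dirichlet prior over each document's topic mixture. It uses only the Python standard library, all function names are hypothetical, and it illustrates the prior rather than Turi Create's training procedure.

```python
import random

def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one vector from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, size, trials=500, seed=0):
    """Average weight of the single largest component over many draws."""
    rng = random.Random(seed)
    largest = [max(sample_symmetric_dirichlet(concentration, size, rng))
               for _ in range(trials)]
    return sum(largest) / trials

num_topics = 20
# alpha=.01, the small value passed in the example above
sparse = mean_largest_share(0.01, num_topics)
# the documented default, 50/num_topics
dense = mean_largest_share(50.0 / num_topics, num_topics)
```

With the small concentration, a single topic tends to take almost all of a sampled mixture, whereas the default spreads weight across many topics. The same reasoning applies to beta and the words within a topic.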
To evaluate the model’s ability to generalize, we can create a train/test split where a portion of the words in each document are held out from training.
>>> train, test = tc.text_analytics.random_split(docs, .8)
>>> m = tc.topic_model.create(train)
>>> results = m.evaluate(test)
>>> print(results['perplexity'])
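For intuition about the perplexity value, perplexity is commonly defined as the exponential of the negative average log-likelihood per held-out word. The sketch below assumes that standard definition and uses made-up topic and document probabilities. It is not Turi Create's implementation, and every name in it is hypothetical.

```python
import math

def perplexity(doc_topic_probs, topic_word_probs, held_out_counts):
    """exp(-(sum of log p(word | doc)) / number of held-out tokens).

    doc_topic_probs:  list of P(topic | doc), one entry per topic
    topic_word_probs: list of {word: P(word | topic)}, one dict per topic
    held_out_counts:  {word: count} bag of words for the held-out portion
    """
    log_likelihood = 0.0
    num_tokens = 0
    for word, count in held_out_counts.items():
        # Mix the per-topic word probabilities by the document's topic weights.
        p_word = sum(theta * phi[word]
                     for theta, phi in zip(doc_topic_probs, topic_word_probs))
        log_likelihood += count * math.log(p_word)
        num_tokens += count
    return math.exp(-log_likelihood / num_tokens)

# Toy model (illustrative values): two topics over a four-word vocabulary.
topics = [
    {"storm": 0.4, "wind": 0.4, "vote": 0.1, "senate": 0.1},
    {"storm": 0.1, "wind": 0.1, "vote": 0.4, "senate": 0.4},
]
weather_doc = [0.9, 0.1]            # mostly topic 0
held_out = {"storm": 2, "wind": 1}

score = perplexity(weather_doc, topics, held_out)
```

Lower values mean the model assigns higher probability to the held-out words. Under this definition, a model that spreads probability uniformly over V words scores exactly V.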