turicreate.topic_model.TopicModel.get_topics
TopicModel.get_topics(topic_ids=None, num_words=5, cdf_cutoff=1.0, output_type='topic_probabilities')
Get the words associated with a given topic. The score column is the probability of choosing that word given that you have chosen a particular topic.
Parameters:
- topic_ids : list of int, optional
The topics for which to retrieve words. Topic ids are zero-based. An error is raised if an id is greater than or equal to m['num_topics'], or if the requested topic name is not present.
- num_words : int, optional
The number of words to show per topic.
- cdf_cutoff : float, optional
Show only the most probable words whose cumulative probability does not exceed this cutoff. For example, if there exist three words where
\[ \begin{aligned} p(word_1 \mid topic_k) &= 0.1 \\ p(word_2 \mid topic_k) &= 0.2 \\ p(word_3 \mid topic_k) &= 0.05 \end{aligned} \]
then setting \(cdf_{cutoff} = 0.3\) would return only \(word_1\) and \(word_2\), since \(p(word_1 \mid topic_k) + p(word_2 \mid topic_k) \le cdf_{cutoff}\).
- output_type : {'topic_probabilities' | 'topic_words'}, optional
Determines the type of output returned. See Returns below.
Returns:
- out : SFrame or SArray
If output_type is 'topic_probabilities', the returned value is an SFrame with a column of words ranked by a column of scores for each topic. Otherwise, the returned value is an SArray where each element is a list of the most probable words for each topic.
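As an illustration of the cdf_cutoff parameter described above, the following is a minimal pure-Python sketch of the selection rule. It is not Turi Create's implementation. The function name words_within_cutoff, the eps tolerance for floating-point sums, and the choice to rank words by descending probability before accumulating are all assumptions made for this example; the inclusive comparison follows the worked example given under cdf_cutoff.

```python
def words_within_cutoff(word_probs, num_words=5, cdf_cutoff=1.0, eps=1e-12):
    """Return (word, prob) pairs, most probable first, whose running
    total of probability stays at or below cdf_cutoff.

    Hypothetical helper for illustration only; not part of turicreate.
    """
    # Rank words from most to least probable (assumed ordering).
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)

    selected = []
    cumulative = 0.0
    # Consider at most num_words candidates.
    for word, prob in ranked[:num_words]:
        cumulative += prob
        # eps absorbs rounding error, e.g. 0.2 + 0.1 is slightly above 0.3.
        if cumulative > cdf_cutoff + eps:
            break
        selected.append((word, prob))
    return selected


# The three probabilities from the cdf_cutoff description.
probs = {"word_1": 0.1, "word_2": 0.2, "word_3": 0.05}
selected = words_within_cutoff(probs, num_words=5, cdf_cutoff=0.3)
print(selected)
```

With a cutoff of 0.3 the sketch keeps word_2 and word_1 and drops word_3, matching the set of words named in the parameter description. With the default cutoff of 1.0 only num_words limits the result.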
Examples
Get the highest ranked words for all topics.
>>> import turicreate
>>> docs = turicreate.SArray('https://static.turi.com/datasets/nips-text')
>>> m = turicreate.topic_model.create(docs, num_iterations=50)
>>> m.get_topics()
+-------+----------+-----------------+
| topic |   word   |      score      |
+-------+----------+-----------------+
|   0   |   cell   |  0.028974400831 |
|   0   |  input   | 0.0259470208503 |
|   0   |  image   | 0.0215721599763 |
|   0   |  visual  | 0.0173635081992 |
|   0   |  object  | 0.0172447874156 |
|   1   | function | 0.0482834508265 |
|   1   |  input   | 0.0456270024091 |
|   1   |  point   | 0.0302662839454 |
|   1   |  result  | 0.0239474934631 |
|   1   | problem  | 0.0231750116011 |
|  ...  |   ...    |       ...       |
+-------+----------+-----------------+
Get the highest ranked words for topics 0 and 1 and show 15 words per topic.
>>> m.get_topics([0, 1], num_words=15)
+-------+----------+------------------+
| topic |   word   |      score       |
+-------+----------+------------------+
|   0   |   cell   |  0.028974400831  |
|   0   |  input   | 0.0259470208503  |
|   0   |  image   | 0.0215721599763  |
|   0   |  visual  | 0.0173635081992  |
|   0   |  object  | 0.0172447874156  |
|   0   | response | 0.0139740298286  |
|   0   |  layer   | 0.0122585145062  |
|   0   | features | 0.0115343177265  |
|   0   | feature  | 0.0103530459301  |
|   0   | spatial  | 0.00823387994361 |
|  ...  |   ...    |       ...        |
+-------+----------+------------------+
To get just the top words per topic, change the format of the output as follows.
>>> topics = m.get_topics(output_type='topic_words')
>>> topics
dtype: list
Rows: 10
[['cell', 'image', 'input', 'object', 'visual'],
 ['algorithm', 'data', 'learning', 'method', 'set'],
 ['function', 'input', 'point', 'problem', 'result'],
 ['model', 'output', 'pattern', 'set', 'unit'],
 ['action', 'learning', 'net', 'problem', 'system'],
 ['error', 'function', 'network', 'parameter', 'weight'],
 ['information', 'level', 'neural', 'threshold', 'weight'],
 ['control', 'field', 'model', 'network', 'neuron'],
 ['hidden', 'layer', 'system', 'training', 'vector'],
 ['component', 'distribution', 'local', 'model', 'optimal']]
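To show how the two output types relate, here is a minimal pure-Python sketch that groups (topic, word, score) rows, such as those in the first example table above, into one word list per topic. The helper name rows_to_topic_words is hypothetical. Ordering each list by descending score is an assumption of this sketch and may differ from the order in which the library returns words.

```python
from collections import defaultdict

# Rows copied from the first example table above: (topic, word, score).
rows = [
    (0, "cell", 0.028974400831),
    (0, "input", 0.0259470208503),
    (0, "image", 0.0215721599763),
    (0, "visual", 0.0173635081992),
    (0, "object", 0.0172447874156),
    (1, "function", 0.0482834508265),
    (1, "input", 0.0456270024091),
    (1, "point", 0.0302662839454),
    (1, "result", 0.0239474934631),
    (1, "problem", 0.0231750116011),
]


def rows_to_topic_words(rows):
    """Group (topic, word, score) rows into one list of words per topic.

    Hypothetical helper for illustration only; not part of turicreate.
    """
    by_topic = defaultdict(list)
    for topic, word, score in rows:
        by_topic[topic].append((score, word))

    # Topics in ascending id order; words by descending score (assumed).
    return [
        [word for _, word in sorted(pairs, key=lambda p: p[0], reverse=True)]
        for _, pairs in sorted(by_topic.items())
    ]


topic_words = rows_to_topic_words(rows)
print(topic_words)
```

Each inner list corresponds to one topic id, so topic_words[0] holds the words for topic 0.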