turicreate.topic_model.TopicModel.get_topics

TopicModel.get_topics(topic_ids=None, num_words=5, cdf_cutoff=1.0, output_type='topic_probabilities')

Get the most probable words for one or more topics. The score column is the probability of choosing a word given that a particular topic has been chosen, i.e. p(word | topic).

Parameters:

topic_ids : list of int, optional

The topics for which to retrieve words. Topic ids are zero-based. If omitted, words for all topics are returned. An error is raised if any id is greater than or equal to m['num_topics'], or if the requested topic name is not present.

num_words : int, optional

The number of words to show per topic.

cdf_cutoff : float, optional

Restricts the output to the most probable words whose cumulative probability does not exceed this cutoff. For example, if there exist three words where

$\begin{align}\begin{aligned}p(word_1 | topic_k) = .1\\p(word_2 | topic_k) = .2\\p(word_3 | topic_k) = .05\end{aligned}\end{align}$

then setting $cdf_{cutoff}=.3$ would return only $word_1$ and $word_2$, since $p(word_1 | topic_k) + p(word_2 | topic_k) \le cdf_{cutoff}$.

output_type : {'topic_probabilities' | 'topic_words'}, optional

Determines the format of the returned value. See Returns below.
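The cdf_cutoff rule described above can be illustrated without a trained model. The following is a minimal pure-Python sketch, not Turi Create's implementation: the helper name `words_under_cutoff`, the descending sort, and the small floating-point tolerance `tol` are all assumptions made for illustration.

```python
def words_under_cutoff(word_probs, cdf_cutoff, tol=1e-12):
    """Hypothetical helper: keep the most probable words while the
    running total of their probabilities stays at or below cdf_cutoff."""
    # Rank words from most to least probable.
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    total = 0.0
    for word, prob in ranked:
        total += prob
        # tol is an assumption: it absorbs float rounding (0.2 + 0.1 != 0.3 exactly).
        if total > cdf_cutoff + tol:
            break
        kept.append(word)
    return kept

# The three probabilities from the cdf_cutoff description.
probs = {"word_1": 0.1, "word_2": 0.2, "word_3": 0.05}
selected = words_under_cutoff(probs, cdf_cutoff=0.3)
print(selected)
```

With a cutoff of 0.3 the sketch keeps word_1 and word_2 and drops word_3, matching the example in the parameter description. How the library treats the exact boundary case is not stated here, so the tolerance should be read as a convenience of this sketch only.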

Returns:

out : SFrame or SArray

If output_type is 'topic_probabilities', the returned value is an SFrame with one row per topic and word, containing the columns topic, word, and score, with words ranked by score within each topic. Otherwise, the returned value is an SArray in which each element is a list of the most probable words for one topic.
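To make the relationship between the two output shapes concrete, the sketch below collapses (topic, word, score) rows into one word list per topic using plain Python. It is an illustration only: `group_words_by_topic` is a hypothetical helper, the rows are a subset copied from the first example's output, and the ordering inside each list of the library's own 'topic_words' output may differ from the score ordering used here.

```python
from collections import defaultdict

# A few (topic, word, score) rows in the shape of the 'topic_probabilities' output.
rows = [
    (0, "cell", 0.028974400831),
    (0, "input", 0.0259470208503),
    (0, "image", 0.0215721599763),
    (1, "function", 0.0482834508265),
    (1, "input", 0.0456270024091),
    (1, "point", 0.0302662839454),
]

def group_words_by_topic(rows):
    """Hypothetical helper: one list of words per topic, ranked by score."""
    grouped = defaultdict(list)
    # Sort by topic id, then by descending score, so each list is ranked.
    for topic, word, _score in sorted(rows, key=lambda r: (r[0], -r[2])):
        grouped[topic].append(word)
    # Return the lists in topic-id order, mirroring a list-per-topic layout.
    return [grouped[t] for t in sorted(grouped)]

topic_words = group_words_by_topic(rows)
print(topic_words)
```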

Examples

Get the highest ranked words for all topics.

>>> docs = turicreate.SArray('https://static.turi.com/datasets/nips-text')
>>> m = turicreate.topic_model.create(docs,
                                    num_iterations=50)
>>> m.get_topics()
+-------+----------+-----------------+
| topic |   word   |      score      |
+-------+----------+-----------------+
|   0   |   cell   |  0.028974400831 |
|   0   |  input   | 0.0259470208503 |
|   0   |  image   | 0.0215721599763 |
|   0   |  visual  | 0.0173635081992 |
|   0   |  object  | 0.0172447874156 |
|   1   | function | 0.0482834508265 |
|   1   |  input   | 0.0456270024091 |
|   1   |  point   | 0.0302662839454 |
|   1   |  result  | 0.0239474934631 |
|   1   | problem  | 0.0231750116011 |
|  ...  |   ...    |       ...       |
+-------+----------+-----------------+

Get the highest ranked words for topics 0 and 1 and show 15 words per topic.

>>> m.get_topics([0, 1], num_words=15)
+-------+----------+------------------+
| topic |   word   |      score       |
+-------+----------+------------------+
|   0   |   cell   |  0.028974400831  |
|   0   |  input   | 0.0259470208503  |
|   0   |  image   | 0.0215721599763  |
|   0   |  visual  | 0.0173635081992  |
|   0   |  object  | 0.0172447874156  |
|   0   | response | 0.0139740298286  |
|   0   |  layer   | 0.0122585145062  |
|   0   | features | 0.0115343177265  |
|   0   | feature  | 0.0103530459301  |
|   0   | spatial  | 0.00823387994361 |
|  ...  |   ...    |       ...        |
+-------+----------+------------------+

To get only the top words per topic, change the output format as follows.

>>> m.get_topics(output_type='topic_words')
dtype: list
Rows: 10
[['cell', 'image', 'input', 'object', 'visual'],
 ['algorithm', 'data', 'learning', 'method', 'set'],
 ['function', 'input', 'point', 'problem', 'result'],
 ['model', 'output', 'pattern', 'set', 'unit'],
 ['action', 'learning', 'net', 'problem', 'system'],
 ['error', 'function', 'network', 'parameter', 'weight'],
 ['information', 'level', 'neural', 'threshold', 'weight'],
 ['control', 'field', 'model', 'network', 'neuron'],
 ['hidden', 'layer', 'system', 'training', 'vector'],
 ['component', 'distribution', 'local', 'model', 'optimal']]