turicreate.text_classifier.create

turicreate.text_classifier.create(dataset, target, features=None, drop_stop_words=True, word_count_threshold=2, method='auto', validation_set='auto', max_iterations=10)

Create a model that classifies text from a collection of documents. The model is a LogisticClassifier trained on a bag-of-words representation of the text dataset.
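As a rough illustration of what a bag-of-words representation looks like, the sketch below maps each document to a dictionary of word counts. It is plain Python and is not Turi Create's internal code; the helper name bag_of_words and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
from collections import Counter


def bag_of_words(text):
    """Map a document to {word: count} using lowercase whitespace tokens.

    Hypothetical helper for illustration; the tokenization used by the
    library may differ.
    """
    return dict(Counter(text.lower().split()))


docs = ["love it love it", "hate it"]

# One dictionary of word counts per document; word order is discarded.
bows = [bag_of_words(d) for d in docs]
print(bows)
```

A classifier trained on this representation sees only how often each word occurs in a document, not the order in which the words appear.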

Parameters:
dataset : SFrame

Contains one or more columns of text data. This can be an unstructured text dataset, such as text appearing in forums, user-generated reviews, etc.

target : str

The column name containing class labels for each document.

features : list[str], optional

The names of the columns containing text data. Each provided column must be of type str. Defaults to using all columns of type str.

drop_stop_words : bool, optional

Ignore very common words, e.g. “the”, “a”, “is”. For the complete list of stop words, see text_classifier.drop_words().

word_count_threshold : int, optional

Words that occur fewer than this many times in the entire dataset are ignored.

method : str, optional

Method to use for feature engineering and modeling. Currently, only bag-of-words with a logistic classifier (‘bow-logistic’) is available.

validation_set : SFrame, optional

A dataset for monitoring the model’s generalization performance. For each row of the progress table, the chosen metrics are computed for both the provided training dataset and the validation_set. The format of this SFrame must be the same as the training set. The default value is ‘auto’, in which case a validation set is automatically sampled and used for progress printing. If validation_set is set to None, no additional metrics are computed.

max_iterations : int, optional

The maximum number of allowed passes through the data. More passes over the data can result in a more accurately trained model. Consider increasing this (the default value is 10) if the training accuracy is low and the Grad-Norm in the display is large.

Returns:
out : TextClassifier

See also

text_classifier.stop_words, text_classifier.drop_words

Examples

>>> import turicreate as tc
>>> dataset = tc.SFrame({'rating': [1, 5], 'text': ['hate it', 'love it']})
>>> m = tc.text_classifier.create(dataset, 'rating', features=['text'])
>>> m.predict(dataset)

You may also evaluate predictions against the known class labels (here, the rating column).

>>> metrics = m.evaluate(dataset)
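
To make the effect of drop_stop_words and word_count_threshold more concrete, the following plain-Python sketch filters a vocabulary the way those two parameters are described above. It does not call Turi Create and is not the library's implementation: the STOP_WORDS set, the build_vocabulary helper, and the whitespace tokenization are all assumptions chosen for the illustration.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
# It is NOT the list that Turi Create uses.
STOP_WORDS = {"the", "a", "is", "it"}


def build_vocabulary(docs, drop_stop_words=True, word_count_threshold=2):
    """Return the set of words that survive both filters.

    Counts are taken over the entire collection of documents, matching
    the description of word_count_threshold.
    """
    counts = Counter(w for d in docs for w in d.lower().split())
    vocab = set()
    for word, n in counts.items():
        if drop_stop_words and word in STOP_WORDS:
            continue  # very common word, ignored
        if n < word_count_threshold:
            continue  # occurs fewer than the threshold number of times
        vocab.add(word)
    return vocab


docs = ["the plot is great", "great cast", "the ending is weak"]

vocab = build_vocabulary(docs)
vocab_with_stop_words = build_vocabulary(docs, drop_stop_words=False)
vocab_all_counts = build_vocabulary(docs, word_count_threshold=1)
print(sorted(vocab))
print(sorted(vocab_with_stop_words))
print(sorted(vocab_all_counts))
```

On a very small dataset, a threshold of 2 can remove most of the vocabulary, so it may be worth lowering word_count_threshold when experimenting with only a handful of documents.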