turicreate.text_classifier.create

turicreate.text_classifier.create(dataset, target, features=None, drop_stop_words=True, word_count_threshold=2, method='auto', validation_set='auto', max_iterations=10, l2_penalty=0.2)

Create a model that classifies text from a collection of documents. The returned model is a LogisticClassifier trained on a bag-of-words representation of the text data.
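
The bag-of-words features are essentially per-document word counts, similar to what turicreate.text_analytics.count_words produces. The following sketch only illustrates the representation; it is not the exact internal pipeline:

>>> import turicreate as tc
>>> docs = tc.SFrame({'text': ['hate it', 'love love it']})
>>> docs['word_counts'] = tc.text_analytics.count_words(docs['text'])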

Parameters:
dataset : SFrame

Contains one or more columns of text data. This can be unstructured text data, such as that found in forums, user-generated reviews, and so on.

target : str

The column name containing class labels for each document.

features : list[str], optional

The names of the columns containing text data. Each provided column must be of type str. Defaults to all columns of type str.

drop_stop_words : bool, optional

Ignore very common words, e.g. “the”, “a”, “is”. For the complete list of stop words, see text_classifier.drop_words().
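
For illustration, a set of common English stop words can be inspected through turicreate.text_analytics.stop_words; this sketch assumes that helper and may differ from the exact list used internally:

>>> import turicreate as tc
>>> stops = tc.text_analytics.stop_words()  # set of common English words
>>> 'the' in stops, 'love' in stops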

word_count_threshold : int, optional

Words that occur fewer than this many times in the entire dataset are ignored.
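
For example, to keep even words that appear only once in the corpus, lower the threshold. This is an illustrative sketch reusing the toy dataset from the Examples section below:

>>> m = tc.text_classifier.create(dataset, 'rating', features=['text'],
...                               word_count_threshold=1)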

method : str, optional

Method to use for feature engineering and modeling. Currently only a bag-of-words representation with a logistic classifier (‘bow-logistic’) is available.

validation_set : SFrame, optional

A dataset for monitoring the model’s generalization performance. For each row of the progress table, the chosen metrics are computed for both the provided training dataset and the validation_set. The format of this SFrame must be the same as that of the training set. By default this argument is set to ‘auto’, in which case a validation set is automatically sampled and used for progress printing. If validation_set is set to None, no additional metrics are computed.
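
For example, a fixed validation set can be carved out with SFrame.random_split. This sketch assumes dataset is an SFrame shaped like the training data in the Examples section below:

>>> train, valid = dataset.random_split(0.8, seed=1)
>>> m = tc.text_classifier.create(train, 'rating', features=['text'],
...                               validation_set=valid)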

max_iterations : int, optional

The maximum number of allowed passes through the data. More passes over the data can result in a more accurately trained model. Consider increasing this (the default value is 10) if the training accuracy is low and the Grad-Norm in the display is large.

l2_penalty : float, optional

Weight on l2 regularization of the model. The larger this weight, the more the model coefficients shrink toward 0. This introduces bias into the model but decreases variance, potentially leading to better predictions. The default value is 0.2; setting this parameter to 0 corresponds to unregularized logistic regression. See the ridge regression reference for more detail.
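
For example, to allow more passes over the data and turn off regularization entirely (purely illustrative values, reusing the toy dataset from the Examples section below):

>>> m = tc.text_classifier.create(dataset, 'rating', features=['text'],
...                               max_iterations=25, l2_penalty=0.0)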

Returns:
out : TextClassifier

See also

text_classifier.stop_words, text_classifier.drop_words

Examples

>>> import turicreate as tc
>>> dataset = tc.SFrame({'rating': [1, 5], 'text': ['hate it', 'love it']})
>>> m = tc.text_classifier.create(dataset, 'rating', features=['text'])
>>> m.predict(dataset)

You may also evaluate predictions against the known target labels.

>>> metrics = m.evaluate(dataset)
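
The result of evaluate is a dictionary of metrics; for a classifier it typically includes entries such as 'accuracy' and 'confusion_matrix' (exact keys may vary by version):

>>> metrics['accuracy']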