turicreate.recommender.util.compare_models(dataset, models, model_names=None, user_sample=1.0, metric='auto', target=None, exclude_known_for_precision_recall=True, make_plot=False, verbose=True, **kwargs)

Compare the prediction or recommendation performance of recommender models on a common test dataset.

Models that are trained to predict ratings are compared separately from models that are trained without target ratings. The ratings prediction models are compared on root-mean-squared error, and the rest are compared on precision-recall.
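
As a rough illustration of the two metrics named above, the following plain-Python sketch computes root-mean-squared error for rating predictions, and precision and recall for one user's top-k recommendation list. The helper names (`rmse`, `precision_recall_at_k`) and the sample values are hypothetical and are not part of Turi Create; the library's own evaluation may differ in details such as cutoffs and averaging across users.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommended items for one user."""
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical values, for illustration only.
print(rmse([3, 5, 4], [2.5, 4.0, 4.5]))
print(precision_recall_at_k(["b", "d", "f"], ["b", "d"], k=2))
```

Lower RMSE is better, while higher precision and recall are better, which is one reason the two groups of models are reported separately.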

dataset : SFrame

The dataset to use for model evaluation.

models : list[recommender models]

List of trained recommender models.

model_names : list[str], optional

List of model name strings for display.

user_sample : float, optional

Sampling proportion of unique users to use in estimating model performance. Defaults to 1.0, i.e. use all users in the dataset.

metric : str, {‘auto’, ‘rmse’, ‘precision_recall’}, optional

Metric for the evaluation. The default, ‘auto’, splits the models into two groups and evaluates each group with its default metric: ‘rmse’ for models trained with a target, and ‘precision_recall’ otherwise.

target : str, optional

The name of the target column for evaluating rmse. If the model was trained with a target column, the default is to use the same column. If the model was trained without a target column and metric='rmse', then this option must be provided by the user.

exclude_known_for_precision_recall : bool, optional

Applies when metric='precision_recall'. Recommender models automatically exclude items seen in the training data from the final recommendation list. If the input evaluation dataset is the same as the data used for training the models, set this option to False so that those items are not removed from the recommendations being evaluated.

verbose : bool, optional

If True, print progress.

out : list[SFrame]

A list of results, one per model, where each entry is an SFrame of evaluation results for the respective model on the given dataset.
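
The effect of user_sample can be pictured with the plain-Python sketch below: a proportion of the unique users is chosen, and every row belonging to a chosen user is kept. The `sample_users` helper, its `seed` argument, and the rounding rule are assumptions made for illustration; Turi Create's actual sampling procedure may differ.

```python
import random

def sample_users(rows, user_sample=1.0, seed=0):
    """Keep all rows belonging to a random subset of the unique users."""
    if not 0.0 < user_sample <= 1.0:
        raise ValueError("user_sample must be in (0, 1]")
    users = sorted({row["user_id"] for row in rows})
    if user_sample == 1.0 or not users:
        return list(rows)
    # Assumed rounding rule: keep at least one user.
    n_keep = max(1, round(user_sample * len(users)))
    kept = set(random.Random(seed).sample(users, n_keep))
    return [row for row in rows if row["user_id"] in kept]

# Hypothetical evaluation rows with three unique users.
rows = [
    {"user_id": "0", "item_id": "b"},
    {"user_id": "0", "item_id": "d"},
    {"user_id": "1", "item_id": "a"},
    {"user_id": "2", "item_id": "e"},
]
subset = sample_users(rows, user_sample=0.34)
print(sorted({row["user_id"] for row in subset}))
```

Sampling by user rather than by row keeps each sampled user's full interaction history, which per-user metrics such as precision and recall rely on.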


If you have created two ItemSimilarityRecommenders m1 and m2 and have an SFrame test_data, then you may compare the performance of the two models on test data using:

>>> import turicreate
>>> train_data = turicreate.SFrame({'user_id': ["0", "0", "0", "1", "1", "2", "2", "2"],
...                               'item_id': ["a", "c", "e", "b", "f", "b", "c", "d"]})
>>> test_data = turicreate.SFrame({'user_id': ["0", "0", "1", "1", "1", "2", "2"],
...                              'item_id': ["b", "d", "a", "c", "e", "a", "e"]})
>>> m1 = turicreate.item_similarity_recommender.create(train_data)
>>> m2 = turicreate.item_similarity_recommender.create(train_data, only_top_k=1)
>>> turicreate.recommender.util.compare_models(test_data, [m1, m2], model_names=["m1", "m2"])

The evaluation metric is automatically set to ‘precision_recall’, and the evaluation will be based on recommendations that exclude items seen in the training data.

If you want to evaluate on the original training set:

>>> turicreate.recommender.util.compare_models(
...     train_data, [m1, m2], exclude_known_for_precision_recall=False)
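
To see why the flag matters when the evaluation data is the training data itself, consider this plain-Python sketch for user "0" from train_data. The `top_k` and `precision` helpers and the item ranking are hypothetical; they only mimic the idea of filtering known items and are not Turi Create's implementation.

```python
def top_k(ranked_items, known_items, k, exclude_known):
    """Return the first k ranked items, optionally skipping known ones."""
    if exclude_known:
        ranked_items = [i for i in ranked_items if i not in known_items]
    return ranked_items[:k]

def precision(recommended, relevant):
    """Fraction of recommended items that are relevant."""
    hits = sum(1 for item in recommended if item in relevant)
    return hits / len(recommended) if recommended else 0.0

# Items user "0" interacted with in train_data.
train_items = {"a", "c", "e"}
# Hypothetical ranking of all items for user "0", best first.
ranking = ["a", "c", "b", "e", "d", "f"]

# Evaluating on the training data means the relevant items are train_items.
with_exclusion = top_k(ranking, train_items, 3, exclude_known=True)
without_exclusion = top_k(ranking, train_items, 3, exclude_known=False)
print(with_exclusion, precision(with_exclusion, train_items))
print(without_exclusion, precision(without_exclusion, train_items))
```

With exclusion enabled, none of the recommended items can appear in the training data, so precision against the training set is necessarily zero; disabling it gives a meaningful score.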

Suppose you have four models: two trained with a target rating column and two trained without a target. By default, the models are split into two groups, evaluated with ‘rmse’ and ‘precision_recall’ respectively.

>>> train_data2 = turicreate.SFrame({'user_id': ["0", "0", "0", "1", "1", "2", "2", "2"],
...                                'item_id': ["a", "c", "e", "b", "f", "b", "c", "d"],
...                                'rating': [1, 3, 4, 5, 3, 4, 2, 5]})
>>> test_data2 = turicreate.SFrame({'user_id': ["0", "0", "1", "1", "1", "2", "2"],
...                               'item_id': ["b", "d", "a", "c", "e", "a", "e"],
...                               'rating': [3, 5, 4, 4, 3, 5, 2]})
>>> m3 = turicreate.factorization_recommender.create(train_data2, target='rating')
>>> m4 = turicreate.factorization_recommender.create(train_data2, target='rating')
>>> turicreate.recommender.util.compare_models(test_data2, [m1, m2, m3, m4])
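
The default grouping can be pictured as a partition on whether each model was trained with a target. The sketch below uses a hypothetical mapping from model name to target column (None when there is no target); it illustrates the idea only and does not inspect real Turi Create model objects.

```python
def group_models_by_metric(model_targets):
    """Partition model names by the metric implied by metric='auto'.

    model_targets maps a model name to its target column name, or to
    None when the model was trained without a target (hypothetical input).
    """
    groups = {"rmse": [], "precision_recall": []}
    for name, target in model_targets.items():
        metric = "rmse" if target is not None else "precision_recall"
        groups[metric].append(name)
    return groups

# Hypothetical description of the four models discussed above.
model_targets = {"m1": None, "m2": None, "m3": "rating", "m4": "rating"}
groups = group_models_by_metric(model_targets)
print(groups)
```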

To compare all four models using the same ‘precision_recall’ metric, you can do:

>>> turicreate.recommender.util.compare_models(test_data2, [m1, m2, m3, m4],
...                                          metric='precision_recall')