turicreate.evaluation.f1_score

turicreate.evaluation.f1_score(targets, predictions, average='macro')

Compute the F1 score (sometimes known as the balanced F-score or F-measure). The F1 score is the harmonic mean of precision and recall. The score lies in the range [0, 1], with 1 being ideal and 0 being the worst.

The F1 score is defined as:

\[f_{1} = \frac{2 \times p \times r}{p + r}\]

Where \(p\) is the precision and \(r\) is the recall.
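The formula can be checked with a small standalone computation (plain Python, independent of turicreate; the precision and recall values below are hypothetical). Because the harmonic mean is dominated by the smaller of the two quantities, F1 is always at most the arithmetic mean of precision and recall:

```python
def f1(p, r):
    """Harmonic mean of precision p and recall r; defined as 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# With p = 0.25 and r = 0.5, F1 is 1/3 -- below the arithmetic mean of 0.375.
print(f1(0.25, 0.5))
```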

Parameters:
targets : SArray

An SArray of ground truth class labels. Can be of any type except float.

predictions : SArray

The class prediction that corresponds to each target value. This SArray must have the same length as targets and must be of the same type as the targets SArray.

average : string, [None, ‘macro’ (default), ‘micro’]

Metric averaging strategies for multiclass classification. Averaging strategies can be one of the following:

  • None: No averaging is performed, and a score is returned for each class.
  • ‘micro’: Calculate metrics globally by counting the total true positives, false negatives and false positives.
  • ‘macro’: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

For a more precise definition of micro and macro averaging refer to [1] below.
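The two averaging strategies can be sketched in plain Python (independent of turicreate), using the same example arrays as the Examples section below. Macro averaging computes per-class F1 scores and takes their unweighted mean; micro averaging pools the counts first, which for single-label multiclass problems reduces to overall accuracy:

```python
from collections import Counter

def f1_per_class(targets, predictions):
    """Per-class F1 from true-positive / false-positive / false-negative counts."""
    labels = sorted(set(targets) | set(predictions))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(targets, predictions):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = {}
    for c in labels:
        p_c = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        r_c = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        scores[c] = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
    return scores

targets     = [0, 1, 2, 3, 0, 1, 2, 3]
predictions = [1, 0, 2, 1, 3, 1, 0, 1]

per_class = f1_per_class(targets, predictions)
macro = sum(per_class.values()) / len(per_class)

# Micro averaging pools TP/FP/FN globally; since every error is one FP
# and one FN, micro precision = micro recall = micro F1 = accuracy.
micro = sum(t == p for t, p in zip(targets, predictions)) / len(targets)

print(per_class)      # class 1 scores 1/3, class 2 scores 2/3
print(macro, micro)   # both come out to 0.25 for this data
```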

Returns:
out : float (for binary classification) or dict[float] (multi-class, average=None)

Score for the positive class (for binary classification) or an average score for each class for multi-class classification. If average=None, then a dictionary is returned where the key is the class label and the value is the score for the corresponding class label.

Notes

  • For binary classification, when the target label is of type “string”, then the labels are sorted alphanumerically and the largest label is chosen as the “positive” label. For example, if the classifier labels are {“cat”, “dog”}, then “dog” is chosen as the positive label for the binary classification case.
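The positive-label convention described above can be illustrated in plain Python (independent of turicreate): the string labels are sorted and the largest one is treated as "positive".

```python
# With labels {"cat", "dog"}, alphanumeric sorting makes "dog" the positive class.
labels = {"cat", "dog"}
positive = sorted(labels)[-1]
print(positive)  # dog
```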

References

  • [1] Sokolova, Marina, and Guy Lapalme. “A systematic analysis of performance measures for classification tasks.” Information Processing & Management 45.4 (2009): 427-437.

Examples

# Targets and Predictions
>>> targets = turicreate.SArray([0, 1, 2, 3, 0, 1, 2, 3])
>>> predictions = turicreate.SArray([1, 0, 2, 1, 3, 1, 0, 1])

# Micro average of the F-1 score
>>> turicreate.evaluation.f1_score(targets, predictions,
...                              average = 'micro')
0.25

# Macro average of the F-1 score
>>> turicreate.evaluation.f1_score(targets, predictions,
...                              average = 'macro')
0.25

# F-1 score for each class.
>>> turicreate.evaluation.f1_score(targets, predictions,
...                              average = None)
{0: 0.0, 1: 0.3333333333333333, 2: 0.6666666666666666, 3: 0.0}

This metric also works for string classes.

# Targets and Predictions
>>> targets = turicreate.SArray(
...      ["cat", "dog", "foosa", "snake", "cat", "dog", "foosa", "snake"])
>>> predictions = turicreate.SArray(
...      ["dog", "cat", "foosa", "dog", "snake", "dog", "cat", "dog"])

# Micro average of the F-1 score
>>> turicreate.evaluation.f1_score(targets, predictions,
...                              average = 'micro')
0.25

# Macro average of the F-1 score
>>> turicreate.evaluation.f1_score(targets, predictions,
...                              average = 'macro')
0.25

# F-1 score for each class.
>>> turicreate.evaluation.f1_score(targets, predictions,
...                              average = None)
{'cat': 0.0, 'dog': 0.3333333333333333, 'foosa': 0.6666666666666666, 'snake': 0.0}