# F1-score
The F1-score, also known as the balanced F-score or F-measure, combines two competing metrics, precision and recall, with equal weight. It is the harmonic mean of precision and recall, representing both symmetrically in a single metric.
Guides: Precision and Recall
Read the precision and recall guides if you're not familiar with those metrics.
Precision and recall offer a trade-off: increasing precision often reduces recall, and vice versa. This is called the precision/recall trade-off.
Ideally, we want to maximize both precision and recall to obtain the perfect model. This is where the F1-score comes into play. Because the F1-score is the harmonic mean of precision and recall, a high F1-score is only attainable when both precision and recall are high. Thus, the F1-score has become a popular metric for evaluating many workflows, such as classification, object detection, semantic segmentation, and information retrieval.
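To see why the harmonic mean is used, consider an illustrative model with precision 1.0 but recall 0.1: an arithmetic mean would give a misleadingly rosy 0.55, whereas the harmonic mean heavily penalizes the imbalance:

$$
\text{F}_1 = 2 \times \frac{1.0 \times 0.1}{1.0 + 0.1} \approx 0.18
$$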
- API Reference: `f1_score`
- Example: To see an example of the F1-score, check out Object Detection (COCO 2014) on app.kolena.com/try.
## Implementation Details
Using TP / FP / FN / TN, we can define precision and recall. The F1-score is computed by taking the harmonic mean of precision and recall.
The F1-score is defined as:

$$
\text{F}_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

It can also be calculated directly from true positive (TP), false positive (FP), and false negative (FN) counts:

$$
\text{F}_1 = \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}
$$
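As a minimal Python sketch of these two formulations (the function names are illustrative, not part of any particular library):

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute the F1-score directly from TP / FP / FN counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # undefined case; commonly interpreted as 0.0 (see below)
    return 2 * tp / denominator


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Compute the F1-score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_from_counts(25, 75, 0))            # 0.4
print(f1_from_precision_recall(0.25, 1.0))  # 0.4
```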
## Examples
Perfect inferences:
| Metric | Value |
| --- | --- |
| TP | 20 |
| FP | 0 |
| FN | 0 |
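With no FP and no FN, precision and recall are both 1.0:

$$
\text{F}_1 = \frac{2 \times 20}{2 \times 20 + 0 + 0} = 1.0
$$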
Partially correct inferences, where every ground truth is recalled by an inference:
| Metric | Value |
| --- | --- |
| TP | 25 |
| FP | 75 |
| FN | 0 |
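Here recall is perfect (\(25 / 25 = 1.0\)) but precision is low (\(25 / 100 = 0.25\)):

$$
\text{F}_1 = \frac{2 \times 25}{2 \times 25 + 75 + 0} = \frac{50}{125} = 0.4
$$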
Perfect inferences but some ground truths are missed:
| Metric | Value |
| --- | --- |
| TP | 25 |
| FP | 0 |
| FN | 75 |
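This is the mirror image of the previous example: precision is perfect (\(25 / 25 = 1.0\)) but recall is low (\(25 / 100 = 0.25\)), and the F1-score is the same:

$$
\text{F}_1 = \frac{2 \times 25}{2 \times 25 + 0 + 75} = \frac{50}{125} = 0.4
$$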
Zero correct inferences, with non-zero false positive and false negative counts:
| Metric | Value |
| --- | --- |
| TP | 0 |
| FP | 15 |
| FN | 10 |
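With zero TP, both precision and recall are 0.0, and so is the F1-score:

$$
\text{F}_1 = \frac{2 \times 0}{2 \times 0 + 15 + 10} = 0.0
$$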
Zero correct inferences, with zero false positive and false negative counts:
| Metric | Value |
| --- | --- |
| TP | 0 |
| FP | 0 |
| FN | 0 |
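Here every denominator is zero, so neither precision nor recall can be computed:

$$
\text{Precision} = \frac{0}{0 + 0}, \qquad \text{Recall} = \frac{0}{0 + 0}
$$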
Undefined F1

This example shows an edge case where both precision and recall are undefined. When either metric is undefined, the F1-score is also undefined. In such cases, it is often interpreted as 0.0 instead.
## Multiple Classes
In workflows with multiple classes, the F1-score can be computed per class. In the TP / FP / FN / TN guide, we learned how to compute per-class metrics when there are multiple classes, using the one-vs-rest (OvR) strategy. Once you have TP, FP, and FN counts computed for each class, you can compute precision, recall, and F1-score for each class by treating each as a single-class problem.
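For illustration, a per-class computation might look like the following minimal Python sketch; the class names and label lists are made up:

```python
# Hypothetical ground-truth and predicted labels for a three-class problem.
y_true = ["cat", "dog", "dog", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "bird", "dog", "dog"]

for cls in sorted(set(y_true)):
    # One-vs-rest: treat `cls` as the positive class, everything else as negative.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    print(f"{cls}: TP={tp} FP={fp} FN={fn} F1={f1:.2f}")
```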
### Aggregating Per-class Metrics
If you are looking for a single F1-score that summarizes model performance across all classes, there are different ways to aggregate per-class F1-scores: macro, micro, and weighted. Read more about these methods in the Averaging Methods guide.
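As a rough sketch of how these aggregations differ, reusing per-class counts like those above (a simplified illustration, not the exact procedure from the Averaging Methods guide):

```python
# Per-class (TP, FP, FN) counts, e.g. collected from the loop above.
counts = {"cat": (1, 1, 1), "dog": (2, 1, 1), "bird": (1, 0, 0)}

def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

per_class_f1 = {cls: f1(*c) for cls, c in counts.items()}
support = {cls: tp + fn for cls, (tp, _, fn) in counts.items()}  # ground truths per class

# Macro: unweighted mean of per-class F1-scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Micro: pool TP / FP / FN across classes, then compute a single F1-score.
micro_f1 = f1(
    sum(tp for tp, _, _ in counts.values()),
    sum(fp for _, fp, _ in counts.values()),
    sum(fn for _, _, fn in counts.values()),
)

# Weighted: per-class F1-scores weighted by class support.
weighted_f1 = sum(per_class_f1[c] * support[c] for c in counts) / sum(support.values())
```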
## F\(_\beta\)-score
The F\(_\beta\)-score is a generalized form of the F1-score with a weight parameter, \(\beta\), where recall is considered \(\beta\) times as important as precision:

$$
\text{F}_\beta = \left(1 + \beta^2\right) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}
$$
The three most common values for the beta parameter are as follows:
- F0.5-score \(\left(\beta = 0.5\right)\): precision is weighted more heavily than recall, so the score focuses more on minimizing FPs than on minimizing FNs
- F1-score \(\left(\beta = 1\right)\): the true harmonic mean of precision and recall
- F2-score \(\left(\beta = 2\right)\): recall is weighted more heavily than precision, so the score focuses more on minimizing FNs than on minimizing FPs
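To illustrate how \(\beta\) shifts the balance, take a model with precision 0.5 and recall 1.0 (illustrative values):

$$
\text{F}_{0.5} = \frac{1.25 \times 0.5 \times 1.0}{0.25 \times 0.5 + 1.0} \approx 0.56, \qquad
\text{F}_1 = \frac{2 \times 0.5 \times 1.0}{0.5 + 1.0} \approx 0.67, \qquad
\text{F}_2 = \frac{5 \times 0.5 \times 1.0}{4 \times 0.5 + 1.0} \approx 0.83
$$

The larger \(\beta\) is, the more the score rewards the high recall despite the low precision.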
## Limitations and Biases
While the F1-score can be used to evaluate classification and object detection models with a single metric, it is not adequate for all applications. In some applications, such as identifying pedestrians from an autonomous vehicle, any false negative can be life-threatening. In these scenarios, accepting a few more false positives in exchange for reducing the chance of a life-threatening miss is preferred. Here, recall should be weighted much more heavily than precision, as it penalizes false negatives. To reflect the greater importance of recall, the \(\text{F}_\beta\)-score can be used as an alternative.
### Threshold-Dependence
Precision, recall, and F1-score are all threshold-dependent metrics. Threshold-dependent means that, before computing these metrics, a confidence score threshold must be applied to inferences to decide which should be used for metric computation and which should be ignored.

A small change to this confidence score threshold can have a large impact on threshold-dependent metrics. To evaluate a model across all thresholds, rather than at a single threshold, use threshold-independent metrics, like average precision.
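As a small illustration with synthetic scores (binary classification rather than object detection, for brevity), shifting the threshold changes which inferences count as positives, and the F1-score moves with them:

```python
# Synthetic binary ground truths and model confidence scores (illustrative only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.55, 0.52, 0.4, 0.35, 0.3, 0.2]

def f1_at_threshold(y_true: list, scores: list, threshold: float) -> float:
    y_pred = [int(s >= threshold) for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

print(f1_at_threshold(y_true, scores, 0.5))  # TP=3, FP=1, FN=2 -> ~0.67
print(f1_at_threshold(y_true, scores, 0.6))  # TP=2, FP=0, FN=3 -> ~0.57
```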