F1-score#

The F1-score, also known as the balanced F-score or F-measure, is a metric that combines two competing metrics, precision and recall, with equal weight. It is the harmonic mean of precision and recall, representing both symmetrically in a single metric.

Guides: Precision and Recall

Read the precision and recall guides if you're not familiar with those metrics.

Precision and recall offer a trade-off: increasing precision often reduces recall, and vice versa. This is called the precision/recall trade-off.

Ideally, we want to maximize both precision and recall to obtain the perfect model. This is where the F1-score comes into play. Because the F1-score is the harmonic mean of precision and recall, a high F1-score is only achievable when both precision and recall are high. Thus, the F1-score has become a popular evaluation metric for many workflows, such as classification, object detection, semantic segmentation, and information retrieval.

Example

To see an example of the F1-score, check out Object Detection (COCO 2014) on app.kolena.com/try.

Implementation Details#

Using TP / FP / FN / TN, we can define precision and recall. The F1-score is computed by taking the harmonic mean of precision and recall.

The F1-score is defined as:

\[ \begin{align} \text{F}_1 &= \frac {2} {\frac {1} {\text{Precision}} + \frac {1} {\text{Recall}}} \\[1em] &= \frac {2 \times \text{Precision} \times \text{Recall}} {\text{Precision} + \text{Recall}} \end{align} \]

It can also be calculated directly from true positive (TP) / false positive (FP) / false negative (FN) counts:

\[ \text{F}_1 = \frac {\text{TP}} {\text{TP} + \frac 1 2 \left( \text{FP} + \text{FN} \right)} \]
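
As a minimal illustration, both formulations can be written in a few lines of Python and give identical results (the function names below are our own, not part of any particular library):

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 computed directly from TP / FP / FN counts."""
    return tp / (tp + 0.5 * (fp + fn))


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Both formulations agree, e.g. with TP = 25, FP = 75, FN = 0
# (precision = 0.25, recall = 1.0):
print(f1_from_counts(25, 75, 0))            # 0.4
print(f1_from_precision_recall(0.25, 1.0))  # 0.4
```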

Examples#

Perfect inferences:

| Metric | Value |
| --- | --- |
| TP | 20 |
| FP | 0 |
| FN | 0 |

\[ \begin{align} \text{Precision} = \frac{20}{20 + 0} &= 1.0 \\[1em] \text{Recall} = \frac{20}{20 + 0} &= 1.0 \\[1em] \text{F}_1 = \frac{20}{20 + \frac 1 2 \left( 0 + 0 \right)} &= 1.0 \end{align} \]

Partially correct inferences, where every ground truth is recalled by an inference:

| Metric | Value |
| --- | --- |
| TP | 25 |
| FP | 75 |
| FN | 0 |

\[ \begin{align} \text{Precision} = \frac{25}{25 + 75} &= 0.25 \\[1em] \text{Recall} = \frac{25}{25 + 0} &= 1.0 \\[1em] \text{F}_1 = \frac{25}{25 + \frac 1 2 \left( 75 + 0 \right)} &= 0.4 \end{align} \]

All inferences are correct, but some ground truths are missed:

| Metric | Value |
| --- | --- |
| TP | 25 |
| FP | 0 |
| FN | 75 |

\[ \begin{align} \text{Precision} = \frac{25}{25 + 0} &= 1.0 \\[1em] \text{Recall} = \frac{25}{25 + 75} &= 0.25 \\[1em] \text{F}_1 = \frac{25}{25 + \frac 1 2 \left( 0 + 75 \right)} &= 0.4 \end{align} \]

Zero correct inferences, with nonzero false positive and false negative counts:

| Metric | Value |
| --- | --- |
| TP | 0 |
| FP | 15 |
| FN | 10 |

\[ \begin{align} \text{Precision} = \frac{0}{0 + 15} &= 0.0 \\[1em] \text{Recall} = \frac{0}{0 + 10} &= 0.0 \\[1em] \text{F}_1 = \frac{0}{0 + \frac 1 2 \left( 15 + 10\right)} &= 0.0 \end{align} \]

Zero correct inferences, with zero false positive and false negative counts:

| Metric | Value |
| --- | --- |
| TP | 0 |
| FP | 0 |
| FN | 0 |

Undefined F1

This example shows an edge case where both precision and recall are undefined. When either metric is undefined, the F1-score is also undefined. In such cases, it is often reported as 0.0 instead.

\[ \begin{align} \text{Precision} &= \frac{0}{0 + 0} \\[1em] &= \text{undefined} \\[1em] \text{Recall} &= \frac{0}{0 + 0} \\[1em] &= \text{undefined} \\[1em] \text{F}_1 &= \frac{0}{0 + \frac 1 2 \left( 0 + 0\right)} \\[1em] &= \text{undefined} \\[1em] \end{align} \]
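
As a sketch of how this edge case might be guarded in code (the helper below is illustrative, not a library function), a common convention is to report 0.0 whenever the denominator is zero:

```python
def f1_or_zero(tp: int, fp: int, fn: int) -> float:
    """F1 from counts, interpreting the undefined case (TP = FP = FN = 0) as 0.0."""
    denominator = tp + 0.5 * (fp + fn)
    if denominator == 0:
        # Precision and recall are both 0/0 here; report 0.0 by convention.
        return 0.0
    return tp / denominator


print(f1_or_zero(0, 0, 0))    # 0.0 (undefined, reported as 0.0 by convention)
print(f1_or_zero(0, 15, 10))  # 0.0 (a genuinely zero F1-score)
```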

Multiple Classes#

In workflows with multiple classes, the F1-score can be computed per class. In the TP / FP / FN / TN guide, we learned how to compute per-class metrics when there are multiple classes, using the one-vs-rest (OvR) strategy. Once you have TP, FP, and FN counts computed for each class, you can compute precision, recall, and F1-score for each class by treating each as a single-class problem.
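
As a small sketch, assuming per-class TP / FP / FN counts have already been derived with the one-vs-rest strategy (the class names and counts below are made up for illustration):

```python
# Hypothetical per-class counts obtained via one-vs-rest matching.
counts = {
    "car":        {"tp": 80, "fp": 20, "fn": 10},
    "pedestrian": {"tp": 30, "fp": 5,  "fn": 15},
    "bicycle":    {"tp": 10, "fp": 10, "fn": 20},
}

per_class_f1 = {}
for label, c in counts.items():
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    per_class_f1[label] = 2 * precision * recall / (precision + recall)

print(per_class_f1)  # {'car': 0.84..., 'pedestrian': 0.75, 'bicycle': 0.4}
```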

Aggregating Per-class Metrics#

If you are looking for a single F1-score that summarizes model performance across all classes, there are different ways to aggregate per-class F1-scores: macro, micro, and weighted. Read more about these methods in the Averaging Methods guide.
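
As a brief sketch of how the three strategies differ, reusing the hypothetical per-class counts from the previous example (the helper function and numbers are illustrative; see the Averaging Methods guide for the full definitions):

```python
# Hypothetical per-class (TP, FP, FN) counts from the previous example.
counts = {"car": (80, 20, 10), "pedestrian": (30, 5, 15), "bicycle": (10, 10, 20)}

def f1(tp: int, fp: int, fn: int) -> float:
    return tp / (tp + 0.5 * (fp + fn))

per_class_f1 = {label: f1(*c) for label, c in counts.items()}
supports = {label: tp + fn for label, (tp, _, fn) in counts.items()}  # ground truths per class

# Macro: unweighted mean of per-class F1-scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted: per-class F1-scores weighted by each class's support.
weighted_f1 = sum(per_class_f1[c] * supports[c] for c in counts) / sum(supports.values())

# Micro: pool TP / FP / FN across classes, then compute one F1-score.
total_tp, total_fp, total_fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(total_tp, total_fp, total_fn)

print(round(macro_f1, 3), round(weighted_f1, 3), round(micro_f1, 3))  # 0.664 0.737 0.75
```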

F\(_\beta\)-score#

The F\(_\beta\)-score is a generalized form of the F1-score with a weight parameter \(\beta\), where recall is considered \(\beta\) times as important as precision:

\[ \text{F}_{\beta} = \frac {(1 + \beta^2) \times \text{precision} \times \text{recall}} {(\beta^2 \times \text{precision}) + \text{recall}} \]

The three most common values for the \(\beta\) parameter are as follows:

  • F0.5-score \(\left(\beta = 0.5\right)\): precision is weighted more heavily than recall, so the metric focuses more on minimizing FPs than on minimizing FNs
  • F1-score \(\left(\beta = 1\right)\): the harmonic mean of precision and recall, weighting both equally
  • F2-score \(\left(\beta = 2\right)\): recall is weighted more heavily than precision, so the metric focuses more on minimizing FNs than on minimizing FPs
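
As a minimal sketch of the general formula, evaluated at the three \(\beta\) values above (the function name is illustrative, not a library API):

```python
def fbeta_score(precision: float, recall: float, beta: float) -> float:
    """F-beta score: recall is weighted beta times as heavily as precision."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# With precision = 0.25 and recall = 1.0 (the second worked example above):
for beta in (0.5, 1.0, 2.0):
    print(beta, round(fbeta_score(0.25, 1.0, beta), 3))  # 0.294, 0.4, 0.625
```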

Limitations and Biases#

While the F1-score summarizes classification or object detection performance in a single metric, it is not adequate for every application. In some applications, such as detecting pedestrians from an autonomous vehicle, any false negative can be life-threatening. In these scenarios, accepting a few more false positives in exchange for reducing the chance of missing a detection is preferred, so recall should be weighted much more heavily than precision. To reflect the greater importance of recall, the \(\text{F}_\beta\)-score can be used as an alternative.

Threshold-Dependence#

Precision, recall, and F1-score are all threshold-dependent metrics: before computing them, a confidence score threshold must be applied to the model's inferences to decide which are kept for metric computation and which are ignored.

A small change to this confidence score threshold can have a large impact on threshold-dependent metrics. To evaluate a model across all thresholds, rather than at a single threshold, use threshold-independent metrics such as average precision.
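
As a hedged sketch of that sensitivity (the confidence scores and match flags below are made up; in a real workflow they would come from model inferences matched against ground truths):

```python
# Hypothetical inferences: (confidence score, whether it matches a ground truth).
inferences = [(0.95, True), (0.90, True), (0.72, False), (0.65, True),
              (0.55, False), (0.40, True), (0.30, False)]
num_ground_truths = 5  # ground truths not recalled by a kept inference count as FN

for threshold in (0.3, 0.5, 0.7, 0.9):
    kept = [is_match for score, is_match in inferences if score >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = num_ground_truths - tp
    f1 = tp / (tp + 0.5 * (fp + fn)) if tp else 0.0
    print(f"threshold={threshold:.1f}  TP={tp} FP={fp} FN={fn}  F1={f1:.2f}")
# F1 moves from 0.67 to 0.60 to 0.50 to 0.57 as the threshold increases in this toy example.
```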