F_{1}-score#
The F_{1}-score, also known as the balanced F-score or F-measure, is a metric that combines two competing metrics, precision and recall, with equal weight. It is the harmonic mean of precision and recall, so it represents both symmetrically in one metric.
Guides: Precision and Recall
Read the precision and the recall guides if you're not familiar with those metrics.
Precision and recall offer a tradeoff: increasing precision often reduces recall, and vice versa. This is called the precision/recall tradeoff.
Ideally, we want to maximize both precision and recall to obtain the perfect model. This is where the F_{1}-score comes into play. Because the harmonic mean is pulled toward the smaller of its two inputs, the F_{1}-score is high only when both precision and recall are high, so maximizing it pushes both up together. As a result, the F_{1}-score has become a popular metric for evaluating many workflows, such as classification, object detection, semantic segmentation, and information retrieval.
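To illustrate why the harmonic mean rewards balance, the sketch below compares it with the arithmetic mean. The precision/recall pairs are hypothetical values chosen only for illustration, and the helper names are not part of any library.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers; 0.0 if either is zero."""
    if a == 0 or b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def arithmetic_mean(a: float, b: float) -> float:
    """Plain average, shown only for comparison."""
    return (a + b) / 2


# Hypothetical (precision, recall) pairs, not taken from a real model.
pairs = [(0.9, 0.9), (1.0, 0.1), (0.5, 0.5)]
for precision, recall in pairs:
    print(
        f"P={precision:.2f} R={recall:.2f} "
        f"arithmetic={arithmetic_mean(precision, recall):.3f} "
        f"harmonic={harmonic_mean(precision, recall):.3f}"
    )
```

For the lopsided pair (precision 1.0, recall 0.1), the arithmetic mean stays above 0.5 while the harmonic mean falls below 0.2, which is why a high F_{1}-score requires both metrics to be high.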

API Reference: f1_score
Implementation Details#
Using TP / FP / FN / TN counts, we can define precision and recall. The F_{1}-score is computed by taking the harmonic mean of precision and recall.
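The F_{1}-score is defined as:
\[
\text{F}_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]
It can also be calculated directly from true positive (TP) / false positive (FP) / false negative (FN) counts:
\[
\text{F}_1 = \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}
\]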
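The harmonic-mean form and the count form are equivalent. As a sketch of why, assume the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN) (taken here to match the precision and recall guides) and substitute them into the harmonic mean, assuming TP > 0 so that nothing divides by zero:

```latex
\begin{align*}
\text{F}_1
  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \\
  &= \frac{2 \times \dfrac{\text{TP}}{\text{TP} + \text{FP}} \times \dfrac{\text{TP}}{\text{TP} + \text{FN}}}
          {\dfrac{\text{TP}}{\text{TP} + \text{FP}} + \dfrac{\text{TP}}{\text{TP} + \text{FN}}} \\
  &= \frac{2 \times \text{TP}^2}{\text{TP} \times (\text{TP} + \text{FN}) + \text{TP} \times (\text{TP} + \text{FP})} \\
  &= \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}
\end{align*}
```

The third line multiplies the numerator and denominator by (TP + FP)(TP + FN); the last line cancels one factor of TP.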
Examples#
Perfect inferences:
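| Metric | Value |
| --- | --- |
| TP | 20 |
| FP | 0 |
| FN | 0 |

\[
\text{F}_1 = \frac{2 \times 20}{2 \times 20 + 0 + 0} = 1.0
\]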
Partially correct inferences, where every ground truth is recalled by an inference:
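| Metric | Value |
| --- | --- |
| TP | 25 |
| FP | 75 |
| FN | 0 |

\[
\text{F}_1 = \frac{2 \times 25}{2 \times 25 + 75 + 0} = 0.4
\]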
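Every inference is correct, but some ground truths are missed:
| Metric | Value |
| --- | --- |
| TP | 25 |
| FP | 0 |
| FN | 75 |

\[
\text{F}_1 = \frac{2 \times 25}{2 \times 25 + 0 + 75} = 0.4
\]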
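Zero correct inferences with nonzero false positives and false negatives:
| Metric | Value |
| --- | --- |
| TP | 0 |
| FP | 15 |
| FN | 10 |

\[
\text{F}_1 = \frac{2 \times 0}{2 \times 0 + 15 + 10} = 0.0
\]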
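Zero correct inferences with zero false positives and false negatives:
| Metric | Value |
| --- | --- |
| TP | 0 |
| FP | 0 |
| FN | 0 |

\[
\text{F}_1 = \frac{2 \times 0}{2 \times 0 + 0 + 0} = \frac{0}{0} \quad \text{(undefined)}
\]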
Undefined F_{1}
This example shows an edge case where both precision and recall are undefined. When either metric is undefined, F_{1} is also undefined. In such cases, it's often interpreted as 0.0 instead.
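The five examples above can be checked with a short sketch that computes F_{1} from counts and falls back to 0.0 when the denominator is zero. The function name and its `zero_division` parameter are hypothetical names used for illustration; they are not the `f1_score` API referenced earlier.

```python
def f1_from_counts(tp: int, fp: int, fn: int, zero_division: float = 0.0) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns `zero_division` if the denominator is 0."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # Undefined case (no inferences and no ground truths).
        return zero_division
    return 2 * tp / denominator


# (TP, FP, FN) counts copied from the five examples in this section.
examples = {
    "perfect inferences": (20, 0, 0),
    "every ground truth recalled": (25, 75, 0),
    "some ground truths missed": (25, 0, 75),
    "zero correct, nonzero FP and FN": (0, 15, 10),
    "zero correct, zero FP and FN": (0, 0, 0),
}

results = {name: f1_from_counts(*counts) for name, counts in examples.items()}
for name, value in results.items():
    print(f"{name}: F1 = {value:.2f}")
```

Returning a fixed fallback keeps downstream aggregation simple, but it hides the difference between a genuinely poor score and an undefined one, so it may be worth logging when the fallback is used.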
Multiple Classes#
In workflows with multiple classes, the F_{1}-score can be computed per class. In the TP / FP / FN / TN guide, we learned how to compute per-class metrics when there are multiple classes, using the one-vs-rest (OvR) strategy. Once you have TP, FP, and FN counts computed for each class, you can compute precision, recall, and F_{1}-score for each class by treating each as a single-class problem.
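A minimal sketch of the per-class computation follows. It assumes single-label multiclass classification, where each sample has exactly one ground truth label and one inferred label; the function name and the labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} using one-vs-rest TP / FP / FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the inferred class gains a false positive
            fn[truth] += 1  # the ground truth class gains a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denominator if denominator else 0.0
    return scores


# Hypothetical labels for illustration only.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

scores = per_class_f1(y_true, y_pred)
for label, value in scores.items():
    print(f"{label}: F1 = {value:.3f}")
```

Each misclassified sample counts twice: once as a false negative for its ground truth class and once as a false positive for the inferred class.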
Aggregating Per-class Metrics#
If you are looking for a single F_{1}-score that summarizes model performance across all classes, there are different ways to aggregate per-class F_{1}-scores: macro, micro, and weighted. Read more about these methods in the Averaging Methods guide.
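The sketch below shows the three aggregations under their commonly used definitions, which are assumed here to match the Averaging Methods guide: macro is the unweighted mean of per-class scores, micro pools the counts across classes before computing one score, and weighted averages per-class scores by support (TP + FN). The per-class counts are hypothetical.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts, with 0.0 as the fallback for a zero denominator."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical per-class (TP, FP, FN) counts.
counts = {"cat": (8, 2, 1), "dog": (3, 0, 2), "bird": (1, 1, 0)}

per_class = {label: f1(*c) for label, c in counts.items()}

# Macro: every class contributes equally, regardless of size.
macro = sum(per_class.values()) / len(per_class)

# Micro: pool TP / FP / FN over all classes, then compute one F1.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro = f1(total_tp, total_fp, total_fn)

# Weighted: per-class F1 weighted by support (number of ground truths).
support = {label: tp + fn for label, (tp, _, fn) in counts.items()}
weighted = sum(per_class[k] * support[k] for k in counts) / sum(support.values())

print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

With imbalanced classes, as in these counts, macro is pulled down by the small "bird" class, while micro and weighted are dominated by the larger "cat" class.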
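F\(_\beta\)-score#
The F\(_\beta\)-score is a generalization of the F_{1}-score with a weight parameter, \(\beta\), where recall is considered \(\beta\) times as important as precision:
\[
\text{F}_\beta = \left(1 + \beta^2\right) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}
\]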
The three most common values for the beta parameter are as follows:
 F_{0.5}-score \(\left(\beta = 0.5\right)\), where precision is more important than recall; it focuses more on minimizing FPs than on minimizing FNs
 F_{1}-score \(\left(\beta = 1\right)\), the harmonic mean of precision and recall, which weights both equally
 F_{2}-score \(\left(\beta = 2\right)\), where recall is more important than precision; it focuses more on minimizing FNs than on minimizing FPs
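To see how the choice of \(\beta\) shifts the score, here is a minimal sketch that computes the F\(_\beta\)-score from counts, using the count form obtained by substituting Precision = TP / (TP + FP) and Recall = TP / (TP + FN) into the standard F\(_\beta\) definition. The function name is hypothetical, and the two count sets reuse the high-recall and high-precision examples from earlier in this guide.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 1.0,
                      zero_division: float = 0.0) -> float:
    """F_beta = (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        return zero_division
    return (1 + b2) * tp / denominator


high_recall = (25, 75, 0)     # precision 0.25, recall 1.0
high_precision = (25, 0, 75)  # precision 1.0, recall 0.25

for beta in (0.5, 1.0, 2.0):
    print(
        f"beta={beta}: "
        f"high-recall model={fbeta_from_counts(*high_recall, beta=beta):.3f}, "
        f"high-precision model={fbeta_from_counts(*high_precision, beta=beta):.3f}"
    )
```

Both models have the same F_{1}-score, but F_{2} favors the high-recall model and F_{0.5} favors the high-precision one.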
Limitations and Biases#
While the F_{1}-score can be used to evaluate classification and object detection models with a single metric, it is not adequate for all applications. In some applications, such as identifying pedestrians from an autonomous vehicle, any false negative can be life-threatening. In these scenarios, accepting a few more false positives in exchange for reducing the chance of a life-threatening event is preferred. Here, recall should be weighted much more heavily than precision, because higher recall means fewer false negatives. To reflect the greater importance of recall, the F\(_\beta\)-score with \(\beta > 1\) can be used as an alternative.
Threshold-Dependence#
Precision, recall, and F_{1}-score are all threshold-dependent metrics. Threshold-dependent means that, before computing these metrics, a confidence score threshold must be applied to inferences to decide which should be used for metrics computation and which should be ignored.
A small change to this confidence score threshold can have a large impact on threshold-dependent metrics. To evaluate a model across all thresholds, rather than at a single threshold, use threshold-independent metrics, like average precision.
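The sensitivity to the threshold can be seen in a small sketch. It simplifies the setting to binary classification, where an inference counts as positive when its confidence score is at or above the threshold; the scores, labels, and function name are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold: float) -> float:
    """F1 when inferences with confidence >= threshold are treated as positive."""
    tp = fp = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1  # a ground truth positive that fell below the threshold
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical confidence scores and ground truth labels.
scores = [0.95, 0.90, 0.80, 0.55, 0.52, 0.48, 0.30, 0.10]
labels = [True, True, False, True, True, False, True, False]

for threshold in (0.25, 0.5, 0.6):
    print(f"threshold={threshold}: F1 = {f1_at_threshold(scores, labels, threshold):.3f}")
```

In this data, two true positives sit just above 0.5, so raising the threshold from 0.5 to 0.6 turns them into false negatives and the F_{1}-score drops sharply.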