Difficulty Score#
Difficulty scores are automatically computed within Kolena to surface datapoints that commonly contribute to poor model performance. They use a user's custom Quality Standard configuration and multiple performance indicators to assess which datapoints cause the greatest recurring problems across all models. Difficulty scores range from 0 to 1. A lower difficulty score indicates that models produce the ideal datapoint-level metrics (e.g. lower inference time, higher accuracy), and a higher difficulty score indicates that models consistently face problems or "difficulty" (e.g. longer inference time, lower BLEU scores, and/or lower recall).
Note
For Kolena to calculate the datapoint.difficulty_score, you must have:
- at least one Model Result uploaded
- at least one metric defined in your Quality Standard
- the direction of the metric set (Lower is better or Higher is better)

When models are selected in Studio, the overall difficulty score (datapoint.difficulty_score) is the average of the difficulty scores from each selected model's results. With a single model selected, it equals that model's difficulty score.
Using Difficulty Scores for Regression Testing
When two models, A and B, are selected in Studio, users can see two model-level difficulty scores and one overall difficulty score for any datapoint. With a filter for resultB.difficulty_score > resultA.difficulty_score, we find all the datapoints that performed worse for model B, which highlights the regressions.
With a filter for datapoint.difficulty_score > 0.9, we see all the datapoints that struggle significantly across both models. These are common failures that persist over different model iterations.
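The two filters above can also be reproduced offline. The following is a minimal sketch, assuming the model-level scores have been exported into a pandas DataFrame; the export itself, the column names, and the values are hypothetical and only illustrate the filter logic, not Kolena's API.

```python
import pandas as pd

# Hypothetical export: one row per datapoint with model-level scores for A and B.
# The values are made up for illustration.
scores = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "resultA.difficulty_score": [0.20, 0.95, 0.40, 0.10],
    "resultB.difficulty_score": [0.60, 0.97, 0.30, 0.10],
})

# The overall score is the average of the model-level scores.
model_cols = ["resultA.difficulty_score", "resultB.difficulty_score"]
scores["datapoint.difficulty_score"] = scores[model_cols].mean(axis=1)

# Regressions: datapoints that are harder for model B than for model A.
regressions = scores[
    scores["resultB.difficulty_score"] > scores["resultA.difficulty_score"]
]

# Common failures: datapoints that are hard for both models.
common_failures = scores[scores["datapoint.difficulty_score"] > 0.9]

regression_ids = regressions["id"].tolist()
common_failure_ids = common_failures["id"].tolist()
print(regression_ids, common_failure_ids)
```

In this made-up data, datapoints 1 and 2 regress from A to B, while only datapoint 2 is difficult for both models.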
Implementation Details#
Suppose we have defined quality standards composed of various metrics and performance indicators. Some of these will be metrics like ROUGE or accuracy, where higher values are better (HIB, higher is better), whereas others, like word_error_rate or cost, should be minimized (LIB, lower is better). Using the results of some model A, we can compute model-level difficulty scores for A, denoted as resultA.difficulty_score.
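To describe the computation of difficulty scores at a high level:
\[ \begin{align} \text{resultX.difficulty_score} &= \sum_{i \in QS} w_i \cdot \text{norm}(q_i) \\[1em] \text{datapoint.difficulty_score} &= \frac{1}{|M|} \sum_{X \in M} \text{resultX.difficulty_score} \end{align} \]
where:
- \( QS \) is the set of quality standards
- \( w_i \) represents the weight for each quality standard \( i \)
- \( q_i \) represents the value of the quality standard \( i \), inverted if necessary
- \( \text{norm}(q_i) \) indicates the normalized value of the quality standard \( q_i \)
- \( M \) is the set of chosen models, such as models A, B, and C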
Important Note
Note that datapoint.difficulty_score is the average of all relevant model-level resultX.difficulty_score values.
Below is a detailed example of how resultA.difficulty_score is computed using cost, recall, and accuracy:
import pandas as pd

# names of quality standards where "lower is better"
LIB = ['cost']
# names of quality standards where "higher is better"
HIB = ['recall', 'accuracy']
# the name of the column for unique identifiers to a datapoint
id_column = 'id'

quality_standards = LIB + HIB
# one weight per quality standard, in the same order as quality_standards
weights = [1/len(quality_standards)] * len(quality_standards)  # weighting can be customized

model_results_csv = "path-to-first-model-results.csv"
df = pd.read_csv(model_results_csv, usecols=[id_column] + quality_standards)

def add_model_difficulty_score(df, lib, hib, weights):
    """
    Adds difficulty scores to datapoints in dataframe provided a quality standard configuration.
    """
    qs = lib + hib
    for col in qs:
        df[col] = df[col].astype(float)

        # invert HIB values such that higher values yield lower difficulty scores
        if col in hib:
            df[col] = df[col] * -1.0

        # min-max normalize the column to the range [0, 1]
        # (assumes the column is not constant, otherwise max_val == min_val)
        min_val = df[col].min()
        max_val = df[col].max()
        df[col] = (df[col] - min_val) / (max_val - min_val)

    df["difficulty_score"] = df[qs].dot(weights)  # weighted sum
    return df

df = add_model_difficulty_score(df, LIB, HIB, weights)
Example#

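1. Begin with a CSV of model results.

    | id | recall (HIB) | cost (LIB) | accuracy (HIB) |
    | --- | --- | --- | --- |
    | 1 | 0.10 | 3.14 | 0.50 |
    | 2 | 0.50 | 0.90 | 0.80 |
    | 3 | 0.90 | 0.01 | 0.99 |
    | 4 | 0.60 | 0.50 | 0.55 |

2. Invert every HIB column.

    | id | recall (HIB) | cost (LIB) | accuracy (HIB) |
    | --- | --- | --- | --- |
    | 1 | -0.10 | 3.14 | -0.50 |
    | 2 | -0.50 | 0.90 | -0.80 |
    | 3 | -0.90 | 0.01 | -0.99 |
    | 4 | -0.60 | 0.50 | -0.55 |

3. Normalize every column.

    | id | recall (HIB) | cost (LIB) | accuracy (HIB) |
    | --- | --- | --- | --- |
    | 1 | 1.00 | 1.00 | 1.00 |
    | 2 | 0.50 | 0.284 | 0.387 |
    | 3 | 0.00 | 0.00 | 0.00 |
    | 4 | 0.375 | 0.156 | 0.898 |

4. Compute difficulty scores using weighted sums for each datapoint.

    | id | recall (HIB) | cost (LIB) | accuracy (HIB) | difficulty_score |
    | --- | --- | --- | --- | --- |
    | 1 | 1.00 | 1.00 | 1.00 | 1.00 |
    | 2 | 0.50 | 0.284 | 0.387 | 0.390 |
    | 3 | 0.00 | 0.00 | 0.00 | 0.00 |
    | 4 | 0.375 | 0.156 | 0.898 | 0.476 |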
Below is the math behind the 2nd datapoint's (id == 2) difficulty score assuming equal weighting:
\[ \begin{align} \text{difficulty_score} &= \frac{1}{3} \cdot 0.5 + \frac{1}{3} \cdot 0.284 + \frac{1}{3} \cdot 0.387 \\[1em] &= 0.390 \end{align} \]
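The arithmetic of the worked example can be checked end to end with a short script. This is a minimal sketch that builds the four example rows in memory instead of reading a CSV and applies the same invert, normalize, and weighted-sum steps; the variable names are illustrative.

```python
import pandas as pd

# The four datapoints from the worked example.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "recall": [0.10, 0.50, 0.90, 0.60],
    "cost": [3.14, 0.90, 0.01, 0.50],
    "accuracy": [0.50, 0.80, 0.99, 0.55],
})
hib = ["recall", "accuracy"]  # higher is better
lib = ["cost"]                # lower is better
qs = hib + lib
weights = [1 / len(qs)] * len(qs)  # equal weighting

norm = df[qs].astype(float)
norm[hib] = -norm[hib]                                   # invert HIB columns
norm = (norm - norm.min()) / (norm.max() - norm.min())   # min-max per column
df["difficulty_score"] = norm.dot(weights)               # weighted sum

scores = df.set_index("id")["difficulty_score"]
print(scores.round(2).to_dict())
```

At full precision the scores for datapoints 2 and 4 come out at roughly 0.39 and 0.48. The third decimal can differ slightly from the tabulated values because the table carries rounded intermediate values.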
We have computed a new column of difficulty scores for each datapoint based on the quality standards set by the user. If we were to add a new model, then the overall difficulty score would be the average of difficulty scores across each model.
| id | resultA.difficulty_score | resultB.difficulty_score | resultC.difficulty_score | datapoint.difficulty_score |
| --- | --- | --- | --- | --- |
| 1 | 0.3 | 0.3 | 0.3 | 0.30 |
| 2 | 0.1 | 0.9 | 0.1 | 0.37 |
| 3 | 0.4 | 0.2 | 0.6 | 0.40 |
From the table above, we see that the 2nd datapoint performs very poorly on the 2nd model (resultB) with a difficulty score of 0.9, while the 3rd datapoint scores only 0.2 on the same model. However, the overall datapoint.difficulty_score values indicate that the 3rd datapoint (0.40) is the most consistently challenging datapoint for the models based on the defined quality standard.
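The averaging step behind this comparison can be sketched in a few lines. The values are those of the three-model table, and the DataFrame layout is an assumption made for illustration.

```python
import pandas as pd

# Model-level difficulty scores for models A, B, and C.
model_scores = pd.DataFrame({
    "id": [1, 2, 3],
    "resultA.difficulty_score": [0.3, 0.1, 0.4],
    "resultB.difficulty_score": [0.3, 0.9, 0.2],
    "resultC.difficulty_score": [0.3, 0.1, 0.6],
}).set_index("id")

# The overall score is the plain average across the selected models.
overall = model_scores.mean(axis=1)
model_scores["datapoint.difficulty_score"] = overall

hardest_for_b = model_scores["resultB.difficulty_score"].idxmax()
hardest_overall = overall.idxmax()
print(overall.round(2).to_dict(), hardest_for_b, hardest_overall)
```

Datapoint 2 is the hardest for model B alone, but datapoint 3 has the highest average across all three models.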
Difficulty Scores for Task Metrics#
Task metrics are aggregate metrics that do not offer datapoint-level details of performance. However, the information provided to an aggregate metric is sufficient to establish difficulty scores at the datapoint level, similar to datapoint-level inference_time or cost.
Binary Classification and Regression#
The difficulty score is the absolute error between the model result and the ground truth value. Suppose that in a binary classification problem, a model's inference is binarized by a threshold of 0.5, so the positive class is defined by values from 0.5 to 1.0, and the negative class by values from 0.0 to 0.5.
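| id | ground_truth | inference | Δ | norm(Δ) |
| --- | --- | --- | --- | --- |
| 1 | 1 | 0.01 | 0.99 | 1.00 |
| 2 | 1 | 0.49 | 0.51 | 0.51 |
| 3 | 1 | 0.50 | 0.50 | 0.50 |
| 4 | 1 | 0.80 | 0.20 | 0.19 |
| 5 | 0 | 0.01 | 0.01 | 0.00 |
| 6 | 0 | 0.49 | 0.49 | 0.49 |
| 7 | 0 | 0.50 | 0.50 | 0.50 |
| 8 | 0 | 0.80 | 0.80 | 0.81 |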
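The Δ and norm(Δ) columns for the binary classification example can be recomputed as follows. This is a minimal sketch, assuming min-max normalization as in the implementation details; the column names delta and norm_delta are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 9),
    "ground_truth": [1, 1, 1, 1, 0, 0, 0, 0],
    "inference": [0.01, 0.49, 0.50, 0.80, 0.01, 0.49, 0.50, 0.80],
})

# Absolute error between the ground truth and the raw model confidence.
df["delta"] = (df["ground_truth"] - df["inference"]).abs()

# Min-max normalize so the easiest datapoint maps to 0 and the hardest to 1.
lo, hi = df["delta"].min(), df["delta"].max()
df["norm_delta"] = (df["delta"] - lo) / (hi - lo)

norm_delta = df["norm_delta"].round(2).tolist()
print(norm_delta)
```

Because the smallest Δ is 0.01 rather than 0, normalization shifts the values slightly, which is why a Δ of 0.20 maps to 0.19 and a Δ of 0.80 maps to 0.81.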
In the case of regression problems, difficulty can be measured in a similar way using the magnitude of the difference between the ground truth and the inference.
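| id | ground_truth | inference | Δ | norm(Δ) |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 0 | 0.00 |
| 2 | 2 | 1 | 1 | 0.08 |
| 3 | 3 | 2 | 1 | 0.08 |
| 4 | 4 | 3 | 1 | 0.08 |
| 5 | 5 | 5 | 0 | 0.00 |
| 6 | 6 | 8 | 2 | 0.15 |
| 7 | 7 | 13 | 6 | 0.46 |
| 8 | 8 | 21 | 13 | 1.00 |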
The greater the distance between the inference and the ground truth, the greater the difficulty of that datapoint. The normalized Δ column, called the norm(Δ) column, becomes another column in step 4 of the example above, parallel to cost or recall. It can then be included in the computation of the overall datapoint.difficulty_score.
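To illustrate how norm(Δ) sits alongside other columns, the sketch below recomputes norm(Δ) for the regression table and combines it with a second, already-normalized column in a weighted sum. The norm_cost values and the equal weights are hypothetical and are not taken from the tables above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 9),
    "ground_truth": [1, 2, 3, 4, 5, 6, 7, 8],
    "inference": [1, 1, 2, 3, 5, 8, 13, 21],
    # Hypothetical, already-normalized "lower is better" column.
    "norm_cost": [0.0, 0.2, 0.1, 0.3, 0.5, 0.4, 1.0, 0.8],
})

# Magnitude of the error, then min-max normalization.
delta = (df["ground_truth"] - df["inference"]).abs().astype(float)
df["norm_delta"] = (delta - delta.min()) / (delta.max() - delta.min())

# norm(Δ) is treated like any other normalized quality standard column.
columns = ["norm_delta", "norm_cost"]
weights = [0.5, 0.5]  # assumed equal weighting
df["difficulty_score"] = df[columns].dot(weights)

norm_delta = df["norm_delta"].round(2).tolist()
print(norm_delta)
print(df[["id", "difficulty_score"]])
```

With these assumed inputs, datapoint 8 has the largest error and a high cost, so it receives the highest combined score.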
Multiclass Classification#
The Δ column for a datapoint of a multiclass classification task is the count of misclassifications for the datapoint. For example, with three models the best case is a count of zero mistakes and the worst case sums to three mistakes. As in the binary classification and regression tasks, the norm(Δ) column normalizes Δ to be used in computing the overall datapoint.difficulty_score.
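A minimal sketch of the misclassification count, assuming each model's predicted label is stored in its own column. The labels, model column names, and data are hypothetical.

```python
import pandas as pd

# Hypothetical predictions from three models on four datapoints.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "ground_truth": ["cat", "dog", "bird", "cat"],
    "modelA": ["cat", "cat", "bird", "dog"],
    "modelB": ["cat", "dog", "bird", "dog"],
    "modelC": ["cat", "cat", "bird", "bird"],
})
model_cols = ["modelA", "modelB", "modelC"]

# Count, per datapoint, how many models disagree with the ground truth.
mistakes = df[model_cols].ne(df["ground_truth"], axis=0)
df["delta"] = mistakes.sum(axis=1)

# Min-max normalize the counts into the range [0, 1].
lo, hi = df["delta"].min(), df["delta"].max()
df["norm_delta"] = (df["delta"] - lo) / (hi - lo)

delta = df["delta"].tolist()
norm_delta = df["norm_delta"].round(2).tolist()
print(delta, norm_delta)
```

Here datapoint 4 is misclassified by all three models and gets a norm(Δ) of 1, while datapoints 1 and 3 are never misclassified and get 0.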
Object Detection#
The Δ column for an object detection task is the \( F_1 \)-score computed using the total number of TP / FP / FN counts. If recall is more important, it can become the default signal instead of \( F_1 \)-scores at the datapoint level. Like the other tasks, the norm(Δ) column normalizes Δ to be used in computing the overall datapoint.difficulty_score.
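A minimal sketch of a datapoint-level \( F_1 \) signal, assuming per-datapoint TP / FP / FN counts are available. The counts are hypothetical. Treating \( F_1 \) as a "higher is better" column that is inverted before normalization is an assumption carried over from the implementation details, not something this section states.

```python
import pandas as pd

# Hypothetical per-datapoint detection counts.
# Assumes every datapoint has at least one TP, FP, or FN (non-zero denominator).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "tp": [8, 3, 0],
    "fp": [1, 2, 4],
    "fn": [1, 5, 2],
})

# F1 = 2TP / (2TP + FP + FN), computed per datapoint.
df["f1"] = 2 * df["tp"] / (2 * df["tp"] + df["fp"] + df["fn"])

# Assumption: F1 is "higher is better", so invert it before min-max normalization.
inverted = -df["f1"]
df["norm_delta"] = (inverted - inverted.min()) / (inverted.max() - inverted.min())

f1 = df["f1"].round(2).tolist()
norm_delta = df["norm_delta"].round(2).tolist()
print(f1, norm_delta)
```

Under these assumptions the datapoint with no true positives receives the highest norm(Δ), and the datapoint with the best \( F_1 \) receives the lowest.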
Limitations and Biases#
Difficulty scores require well-configured quality standards and perform better when multiple metrics and indicators are involved. They are also less useful when only a single model is involved.
It is hard to interpret the value of a difficulty score directly, as it is an aggregate signal reflecting quality standards relative to all models of interest with all their results.