# kolena.workflow.Evaluator
## MetricsTestSample

Bases: `DataObject`

Test-sample-level metrics produced by an `Evaluator`.

This class should be subclassed with the relevant fields for a given workflow. Examples may include the number of true positive detections on an image, the mean IOU of inferred polygon(s) with ground truth polygon(s), etc.
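For illustration only, a per-sample metrics definition for a hypothetical object detection workflow might look like the following sketch (the class and field names are assumptions, not part of the SDK):

```python
from dataclasses import dataclass

from kolena.workflow import MetricsTestSample


@dataclass(frozen=True)
class PerImageMetrics(MetricsTestSample):
    num_true_positives: int  # detections matched to a ground truth
    num_false_positives: int  # detections with no matching ground truth
    num_false_negatives: int  # ground truths with no matching detection
    mean_iou: float  # mean IOU across matched detection/ground truth pairs
```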
## MetricsTestCase

Bases: `DataObject`

Test-case-level metrics produced by an `Evaluator`.

This class should be subclassed with the relevant fields for a given workflow.

Test-case-level metrics are aggregate metrics like `precision`, `recall`, and `f1_score`. Any and all aggregate metrics that fit a workflow should be defined here.
### Nesting Aggregate Metrics

`MetricsTestCase` supports nesting metrics objects, e.g. for reporting class-level metrics within a test case that contains multiple classes. Example usage:
```python
from dataclasses import dataclass
from typing import List

from kolena.workflow import MetricsTestCase


@dataclass(frozen=True)
class PerClassMetrics(MetricsTestCase):
    Class: str
    Precision: float
    Recall: float
    F1: float
    AP: float


@dataclass(frozen=True)
class TestCaseMetrics(MetricsTestCase):
    macro_Precision: float
    macro_Recall: float
    macro_F1: float
    mAP: float
    PerClass: List[PerClassMetrics]
```
Any `str`-type fields (e.g. `Class` in the above example) will be used as identifiers when displaying nested metrics on Kolena. For best results, include at least one `str`-type field in nested metrics definitions.

When comparing nested metrics from multiple models, an `int`-type column with any of the following names will be used for sample size in statistical significance calculations: `N`, `n`, `nTestSamples`, `n_test_samples`, `sampleSize`, `sample_size`, `SampleSize`.
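For example, the `PerClassMetrics` definition above could report a per-class sample size by adding an `int`-type field with one of these names (a sketch building on the definitions above):

```python
@dataclass(frozen=True)
class PerClassMetrics(MetricsTestCase):
    Class: str
    Precision: float
    Recall: float
    F1: float
    AP: float
    n: int  # per-class sample size, used in statistical significance calculations
```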
For a detailed overview of this feature, see the Nesting Test Case Metrics advanced usage guide.
## MetricsTestSuite

Bases: `DataObject`

Test-suite-level metrics produced by an `Evaluator`.

This class should be subclassed with the relevant fields for a given workflow.
Test-suite-level metrics typically measure performance across test cases, e.g. penalizing variance across different subsets of a benchmark.
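A minimal sketch of a suite-level metrics definition (the fields shown are assumptions for illustration):

```python
from dataclasses import dataclass

from kolena.workflow import MetricsTestSuite


@dataclass(frozen=True)
class TestSuiteMetrics(MetricsTestSuite):
    mean_f1: float  # mean F1 score across test cases in the suite
    variance_f1: float  # variance of F1 across test cases, penalizing inconsistency
```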
## EvaluatorConfiguration

Bases: `DataObject`

Configuration for an `Evaluator`.
Example evaluator configurations may specify:
- Fixed confidence thresholds at which detections are discarded.
- Different algorithms/strategies used to compute confidence thresholds (e.g. "accuracy optimal" for a classification-type workflow).
### display_name() *(abstractmethod)*

The name to display for this configuration in Kolena. Must be implemented when extending `EvaluatorConfiguration`.
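For example, a configuration carrying a fixed confidence threshold might be sketched as follows (the class and field names are assumptions, not part of the SDK):

```python
from dataclasses import dataclass

from kolena.workflow import EvaluatorConfiguration


@dataclass(frozen=True)
class ThresholdConfiguration(EvaluatorConfiguration):
    threshold: float  # confidence threshold below which detections are discarded

    def display_name(self) -> str:
        return f"confidence threshold: {self.threshold}"
```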
## Evaluator(configurations=None)

An `Evaluator` transforms inferences into metrics.

Metrics are computed at the individual test sample level (`MetricsTestSample`), in aggregate at the test case level (`MetricsTestCase`), and across populations at the test suite level (`MetricsTestSuite`).

Test-case-level plots (`Plot`) may also be computed.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `configurations` | `Optional[List[EvaluatorConfiguration]]` | The configurations at which to perform evaluation. Instance methods such as `compute_test_sample_metrics` are called once per configuration. | `None` |
### `configurations: List[EvaluatorConfiguration]` *(instance attribute)*

The configurations with which to perform evaluation, provided on instantiation.
### display_name()

The name to display for this evaluator in Kolena. Defaults to the name of this class.
### compute_test_sample_metrics(test_case, inferences, configuration=None) *(abstractmethod)*

Compute metrics for every test sample in a test case, i.e. one `MetricsTestSample` object for each of the provided test samples.

Must be implemented.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_case` | `TestCase` | The test case being evaluated. | required |
| `inferences` | `List[Tuple[TestSample, GroundTruth, Inference]]` | The test samples, ground truths, and inferences for all entries in a test case. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `List[Tuple[TestSample, MetricsTestSample]]` | One `MetricsTestSample` object for each provided test sample, paired with that test sample. |
### compute_test_case_metrics(test_case, inferences, metrics, configuration=None) *(abstractmethod)*

Compute aggregate metrics (`MetricsTestCase`) across a test case.

Must be implemented.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_case` | `TestCase` | The test case in question. | required |
| `inferences` | `List[Tuple[TestSample, GroundTruth, Inference]]` | The test samples, ground truths, and inferences for all entries in a test case. | required |
| `metrics` | `List[MetricsTestSample]` | The test-sample-level metrics computed by `compute_test_sample_metrics`, in corresponding order. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `MetricsTestCase` | The aggregate metrics computed across the test case. |
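Putting the two required methods together, a minimal `Evaluator` subclass might look like the sketch below. `TestSample`, `GroundTruth`, `Inference`, and the metrics classes refer to a workflow's own definitions, and `compute_per_sample`/`compute_aggregate` are hypothetical helpers implemented elsewhere:

```python
from typing import List, Optional, Tuple

from kolena.workflow import Evaluator, EvaluatorConfiguration, TestCase


class MyEvaluator(Evaluator):
    def compute_test_sample_metrics(
        self,
        test_case: TestCase,
        inferences: List[Tuple[TestSample, GroundTruth, Inference]],
        configuration: Optional[EvaluatorConfiguration] = None,
    ) -> List[Tuple[TestSample, MetricsTestSample]]:
        # one per-sample metrics object for each (test sample, ground truth, inference) entry
        return [(ts, compute_per_sample(gt, inf)) for ts, gt, inf in inferences]

    def compute_test_case_metrics(
        self,
        test_case: TestCase,
        inferences: List[Tuple[TestSample, GroundTruth, Inference]],
        metrics: List[MetricsTestSample],
        configuration: Optional[EvaluatorConfiguration] = None,
    ) -> MetricsTestCase:
        # aggregate the per-sample metrics into test-case-level metrics
        return compute_aggregate(metrics)
```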
### compute_test_case_plots(test_case, inferences, metrics, configuration=None)

Optionally compute any number of plots to visualize the results for a test case.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_case` | `TestCase` | The test case in question. | required |
| `inferences` | `List[Tuple[TestSample, GroundTruth, Inference]]` | The test samples, ground truths, and inferences for all entries in a test case. | required |
| `metrics` | `List[MetricsTestSample]` | The test-sample-level metrics computed by `compute_test_sample_metrics`, in corresponding order. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Optional[List[Plot]]` | Zero or more plots for this test case at this configuration. |
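As an illustration, an implementation could return a precision-recall curve for each test case. The sketch below assumes the `Curve` and `CurvePlot` plot types from `kolena.workflow.plot` and a hypothetical `compute_pr_curve` helper that returns lists of precision and recall values:

```python
from typing import List, Optional

from kolena.workflow.plot import Curve, CurvePlot, Plot


# method on an Evaluator subclass
def compute_test_case_plots(self, test_case, inferences, metrics, configuration=None) -> Optional[List[Plot]]:
    precisions, recalls = compute_pr_curve(metrics)  # hypothetical helper
    return [
        CurvePlot(
            title="Precision vs. Recall",
            x_label="Recall",
            y_label="Precision",
            curves=[Curve(x=recalls, y=precisions, label=test_case.name)],
        )
    ]
```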
### compute_test_suite_metrics(test_suite, metrics, configuration=None)

Optionally compute `TestSuite`-level metrics (`MetricsTestSuite`) across the provided `test_suite`.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_suite` | `TestSuite` | The test suite in question. | required |
| `metrics` | `List[Tuple[TestCase, MetricsTestCase]]` | The test-case-level metrics computed by `compute_test_case_metrics`, paired with their corresponding test cases. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Optional[MetricsTestSuite]` | The test-suite-level metrics for this test suite, if computed. |
## BasicEvaluatorFunction *(module attribute)*

`BasicEvaluatorFunction = Union[ConfiguredEvaluatorFunction, UnconfiguredEvaluatorFunction]`

Simplified interface for `Evaluator` implementations.

`BasicEvaluatorFunction` provides a function-based evaluator interface that takes the inferences for all test samples in a test suite and a `TestCases` instance as input and computes the corresponding test-sample-level, test-case-level, and test-suite-level metrics (and optionally plots) as output.
Example implementation, relying on `compute_per_sample` and `compute_aggregate` functions implemented elsewhere:
```python
def evaluate(
    test_samples: List[TestSample],
    ground_truths: List[GroundTruth],
    inferences: List[Inference],
    test_cases: TestCases,
    # configuration: EvaluatorConfiguration,  # uncomment when configuration is used
) -> EvaluationResults:
    # compute per-sample metrics for each test sample
    per_sample_metrics = [compute_per_sample(gt, inf) for gt, inf in zip(ground_truths, inferences)]

    # compute aggregate metrics across all test cases using `test_cases.iter(...)`
    aggregate_metrics: List[Tuple[TestCase, MetricsTestCase]] = []
    for test_case, *s in test_cases.iter(test_samples, ground_truths, inferences, per_sample_metrics):
        # subset of `test_samples`/`ground_truths`/`inferences`/`per_sample_metrics` in the given test case
        tc_test_samples, tc_ground_truths, tc_inferences, tc_per_sample_metrics = s
        aggregate_metrics.append((test_case, compute_aggregate(tc_per_sample_metrics)))

    # if desired, compute and add `plots_test_case` and `metrics_test_suite`
    return EvaluationResults(
        metrics_test_sample=list(zip(test_samples, per_sample_metrics)),
        metrics_test_case=aggregate_metrics,
    )
```
The control flow is in general more streamlined than with `Evaluator`, but requires a couple of assumptions to hold:

- Test-sample-level metrics do not vary by test case
- Ground truths corresponding to a given test sample do not vary by test case

This `BasicEvaluatorFunction` is provided to the test run at runtime and is expected to have the following signature:
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_samples` | `List[TestSample]` | A list of the distinct test samples in the test run. | required |
| `ground_truths` | `List[GroundTruth]` | A list of ground truths corresponding to `test_samples`, in the same order. | required |
| `inferences` | `List[Inference]` | A list of inferences corresponding to `test_samples`, in the same order. | required |
| `test_cases` | `TestCases` | An instance of `TestCases` used to group metrics by the test cases they belong to. | required |
| `evaluator_configuration` | `EvaluatorConfiguration` | The `EvaluatorConfiguration` to use; included in the signature only when the evaluator is configured. | required |

Returns:

| Type | Description |
| --- | --- |
| `EvaluationResults` | An `EvaluationResults` object bundling the computed test-sample-level, test-case-level, and (optionally) test-suite-level metrics and plots. |
## TestCases

Provides an iterator method for grouping test-sample-level metric results with the test cases that they belong to.
### iter(test_samples, ground_truths, inferences, metrics_test_sample) *(abstractmethod)*

Matches test sample metrics to the test cases that they belong to.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_samples` | `List[TestSample]` | All unique test samples within the test run, sequenced in the same order as the other parameters. | required |
| `ground_truths` | `List[GroundTruth]` | Ground truths corresponding to `test_samples`, in the same order. | required |
| `inferences` | `List[Inference]` | Inferences corresponding to `test_samples`, in the same order. | required |
| `metrics_test_sample` | `List[MetricsTestSample]` | Test-sample-level metrics corresponding to `test_samples`, in the same order. | required |
Returns:

| Type | Description |
| --- | --- |
| `Iterator[Tuple[TestCase, List[TestSample], List[GroundTruth], List[Inference], List[MetricsTestSample]]]` | Iterator grouping each test case in the test run with the lists of its member test samples, ground truths, inferences, and test-sample-level metrics. |
## EvaluationResults

A bundle of metrics computed for a test run, grouped at the test-sample, test-case, and test-suite level. Optionally includes `Plot`s at the test-case level.
### `metrics_test_sample: List[Tuple[BaseTestSample, BaseMetricsTestSample]]` *(instance attribute)*

Sample-level metrics, extending `MetricsTestSample`, for every provided test sample.
### `metrics_test_case: List[Tuple[TestCase, MetricsTestCase]]` *(instance attribute)*

Aggregate metrics, extending `MetricsTestCase`, computed across each test case yielded from `TestCases.iter`.
### `plots_test_case: List[Tuple[TestCase, List[Plot]]] = field(default_factory=list)` *(class attribute, instance attribute)*

Optional test-case-level plots.
### `metrics_test_suite: Optional[MetricsTestSuite] = None` *(class attribute, instance attribute)*

Optional test-suite-level metrics, extending `MetricsTestSuite`.