
kolena.workflow.Evaluator#

MetricsTestSample #

Bases: DataObject

Test-sample-level metrics produced by an Evaluator.

This class should be subclassed with the relevant fields for a given workflow.

Examples here may include the number of true positive detections on an image, the mean IOU of inferred polygon(s) with ground truth polygon(s), etc.
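
For illustration, a per-sample metrics class for an object detection workflow might be sketched as follows (the class and field names are hypothetical):

from dataclasses import dataclass

from kolena.workflow import MetricsTestSample

@dataclass(frozen=True)
class PerImageMetrics(MetricsTestSample):
    count_true_positives: int   # number of true positive detections on the image
    count_false_positives: int
    count_false_negatives: int
    mean_iou: float             # mean IOU of inferred polygons with ground truth polygons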

MetricsTestCase #

Bases: DataObject

Test-case-level metrics produced by an Evaluator.

This class should be subclassed with the relevant fields for a given workflow.

Test-case-level metrics are aggregate metrics like precision, recall, and f1_score. Any and all aggregate metrics that fit a workflow should be defined here.

Nesting Aggregate Metrics#

MetricsTestCase supports nesting metrics objects, e.g. for reporting class-level metrics within a test case that contains multiple classes. Example usage:

from dataclasses import dataclass
from typing import List

from kolena.workflow import MetricsTestCase

@dataclass(frozen=True)
class PerClassMetrics(MetricsTestCase):
    Class: str
    Precision: float
    Recall: float
    F1: float
    AP: float

@dataclass(frozen=True)
class TestCaseMetrics(MetricsTestCase):
    macro_Precision: float
    macro_Recall: float
    macro_F1: float
    mAP: float
    PerClass: List[PerClassMetrics]

Any str-type fields (e.g. Class in the above example) will be used as identifiers when displaying nested metrics on Kolena. For best results, include at least one str-type field in nested metrics definitions.

When comparing nested metrics from multiple models, an int-type column with any of the following names will be used for sample size in statistical significance calculations: N, n, nTestSamples, n_test_samples, sampleSize, sample_size, SampleSize.
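
For example, the PerClassMetrics definition above could carry a sample-size column by adding an n field (a sketch):

@dataclass(frozen=True)
class PerClassMetrics(MetricsTestCase):
    Class: str
    Precision: float
    Recall: float
    F1: float
    AP: float
    n: int  # number of test samples for this class, used in statistical significance calculations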

For a detailed overview of this feature, see the Nesting Test Case Metrics advanced usage guide.

MetricsTestSuite #

Bases: DataObject

Test-suite-level metrics produced by an Evaluator.

This class should be subclassed with the relevant fields for a given workflow.

Test-suite-level metrics typically measure performance across test cases, e.g. penalizing variance across different subsets of a benchmark.
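
For illustration, a test-suite-level metrics class might be sketched as follows (field names hypothetical):

from dataclasses import dataclass

from kolena.workflow import MetricsTestSuite

@dataclass(frozen=True)
class TestSuiteMetrics(MetricsTestSuite):
    mean_F1: float      # average macro_F1 across test cases
    variance_F1: float  # penalizes uneven performance across test cases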

EvaluatorConfiguration #

Bases: DataObject

Configuration for an Evaluator.

Example evaluator configurations may specify:

  • Fixed confidence thresholds at which detections are discarded.
  • Different algorithms/strategies used to compute confidence thresholds (e.g. "accuracy optimal" for a classification-type workflow).

display_name() abstractmethod #

The name to display for this configuration in Kolena. Must be implemented when extending EvaluatorConfiguration.
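
For example, a fixed-threshold configuration might be sketched as follows (the class and field names are hypothetical):

from dataclasses import dataclass

from kolena.workflow import EvaluatorConfiguration

@dataclass(frozen=True)
class ThresholdConfiguration(EvaluatorConfiguration):
    confidence_threshold: float  # detections below this confidence are discarded

    def display_name(self) -> str:
        return f"confidence > {self.confidence_threshold}"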

Evaluator(configurations=None) #

An Evaluator transforms inferences into metrics.

Metrics are computed at the individual test sample level (MetricsTestSample), in aggregate at the test case level (MetricsTestCase), and across populations at the test suite level (MetricsTestSuite).

Test-case-level plots (Plot) may also be computed.

Parameters:

  • configurations (Optional[List[EvaluatorConfiguration]], default None): The configurations at which to perform evaluation. Instance methods such as compute_test_sample_metrics are called once per test case per configuration.

configurations: List[EvaluatorConfiguration] = configurations or [] instance-attribute #

The configurations with which to perform evaluation, provided on instantiation.

display_name() #

The name to display for this evaluator in Kolena. Defaults to the name of this class.

compute_test_sample_metrics(test_case, inferences, configuration=None) abstractmethod #

Compute metrics for every test sample in a test case, i.e. one MetricsTestSample object for each of the provided test samples.

Must be implemented.

Parameters:

  • test_case (TestCase, required): The TestCase to which the provided test samples and ground truths belong.
  • inferences (List[Tuple[TestSample, GroundTruth, Inference]], required): The test samples, ground truths, and inferences for all entries in a test case.
  • configuration (Optional[EvaluatorConfiguration], default None): The evaluator configuration to use. Empty for implementations that are not configured.

Returns:

  • List[Tuple[TestSample, MetricsTestSample]]: TestSample-level metrics for each provided test sample.

compute_test_case_metrics(test_case, inferences, metrics, configuration=None) abstractmethod #

Compute aggregate metrics (MetricsTestCase) across a test case.

Must be implemented.

Parameters:

  • test_case (TestCase, required): The test case in question.
  • inferences (List[Tuple[TestSample, GroundTruth, Inference]], required): The test samples, ground truths, and inferences for all entries in a test case.
  • metrics (List[MetricsTestSample], required): The TestSample-level metrics computed by compute_test_sample_metrics, provided in the same order as inferences.
  • configuration (Optional[EvaluatorConfiguration], default None): The evaluator configuration to use. Empty for implementations that are not configured.

Returns:

  • MetricsTestCase: TestCase-level metrics for the provided test case.
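
Putting these methods together, a minimal Evaluator subclass might be sketched as follows (the compute_per_sample and compute_aggregate helpers are assumed to be defined elsewhere for the workflow):

from kolena.workflow import Evaluator

class MyEvaluator(Evaluator):
    def compute_test_sample_metrics(self, test_case, inferences, configuration=None):
        # one metrics object per provided test sample
        return [(ts, compute_per_sample(gt, inf)) for ts, gt, inf in inferences]

    def compute_test_case_metrics(self, test_case, inferences, metrics, configuration=None):
        # aggregate the per-sample metrics into a single test-case-level object
        return compute_aggregate(metrics)

An instance can then be constructed with zero or more configurations, e.g. MyEvaluator() or MyEvaluator(configurations=[ThresholdConfiguration(confidence_threshold=0.5)]) using the hypothetical configuration sketched above.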

compute_test_case_plots(test_case, inferences, metrics, configuration=None) #

Optionally compute any number of plots to visualize the results for a test case.

Parameters:

  • test_case (TestCase, required): The test case in question.
  • inferences (List[Tuple[TestSample, GroundTruth, Inference]], required): The test samples, ground truths, and inferences for all entries in a test case.
  • metrics (List[MetricsTestSample], required): The TestSample-level metrics computed by compute_test_sample_metrics, provided in the same order as inferences.
  • configuration (Optional[EvaluatorConfiguration], default None): The evaluator configuration to use. Empty for implementations that are not configured.

Returns:

  • Optional[List[Plot]]: Zero or more plots for this test case at this configuration.
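
As a sketch, assuming the Curve and CurvePlot plot types exported from kolena.workflow, a precision-recall curve could be returned as follows within the MyEvaluator sketch above (the values shown are placeholders):

from kolena.workflow import Curve, CurvePlot

def compute_test_case_plots(self, test_case, inferences, metrics, configuration=None):
    # placeholder values; a real implementation would compute the curve from `metrics`
    return [
        CurvePlot(
            title="Precision vs. Recall",
            x_label="Recall",
            y_label="Precision",
            curves=[Curve(x=[0.0, 0.5, 1.0], y=[1.0, 0.8, 0.4], label=test_case.name)],
        )
    ]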

compute_test_suite_metrics(test_suite, metrics, configuration=None) #

Optionally compute TestSuite-level metrics (MetricsTestSuite) across the provided test_suite.

Parameters:

  • test_suite (TestSuite, required): The test suite in question.
  • metrics (List[Tuple[TestCase, MetricsTestCase]], required): The TestCase-level metrics computed by compute_test_case_metrics.
  • configuration (Optional[EvaluatorConfiguration], default None): The evaluator configuration to use. Empty for implementations that are not configured.

Returns:

  • Optional[MetricsTestSuite]: The TestSuite-level metrics for this test suite.
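
For example, a sketch that penalizes variance across test cases, reusing the hypothetical TestSuiteMetrics above and the macro_F1 field from the TestCaseMetrics example:

import statistics

def compute_test_suite_metrics(self, test_suite, metrics, configuration=None):
    # aggregate the test-case-level F1 scores returned by compute_test_case_metrics
    f1_scores = [tc_metrics.macro_F1 for _, tc_metrics in metrics]
    return TestSuiteMetrics(
        mean_F1=statistics.mean(f1_scores),
        variance_F1=statistics.pvariance(f1_scores),
    )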

Simplified interface for Evaluator implementations.

BasicEvaluatorFunction = Union[ConfiguredEvaluatorFunction, UnconfiguredEvaluatorFunction] module-attribute #

BasicEvaluatorFunction provides a function-based evaluator interface that takes the inferences for all test samples in a test suite, along with a TestCases instance, as input and computes the corresponding test-sample-level, test-case-level, and test-suite-level metrics (and optionally plots) as output.

Example implementation, relying on compute_per_sample and compute_aggregate functions implemented elsewhere:

from typing import List, Tuple

from kolena.workflow import EvaluationResults, MetricsTestCase, TestCase, TestCases

# workflow-specific TestSample, GroundTruth, and Inference types, along with the
# compute_per_sample and compute_aggregate helpers, are assumed to be defined elsewhere

def evaluate(
    test_samples: List[TestSample],
    ground_truths: List[GroundTruth],
    inferences: List[Inference],
    test_cases: TestCases,
    # configuration: EvaluatorConfiguration,  # uncomment when configuration is used
) -> EvaluationResults:
    # compute per-sample metrics for each test sample
    per_sample_metrics = [compute_per_sample(gt, inf) for gt, inf in zip(ground_truths, inferences)]

    # compute aggregate metrics across all test cases using `test_cases.iter(...)`
    aggregate_metrics: List[Tuple[TestCase, MetricsTestCase]] = []
    for test_case, *s in test_cases.iter(test_samples, ground_truths, inferences, per_sample_metrics):
        # subset of `test_samples`/`ground_truths`/`inferences`/`per_sample_metrics` in the given test case
        tc_test_samples, tc_ground_truths, tc_inferences, tc_per_sample_metrics = s
        aggregate_metrics.append((test_case, compute_aggregate(tc_per_sample_metrics)))

    # if desired, compute and add `plots_test_case` and `metrics_test_suite`
    return EvaluationResults(
        metrics_test_sample=list(zip(test_samples, per_sample_metrics)),
        metrics_test_case=aggregate_metrics,
    )

The control flow is in general more streamlined than with Evaluator, but requires a couple of assumptions to hold:

  • Test-sample-level metrics do not vary by test case
  • Ground truths corresponding to a given test sample do not vary by test case
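
Like the no_op_evaluator shown at the end of this page, a BasicEvaluatorFunction such as evaluate above is passed directly to a test run (model and test_suite are assumed to be defined elsewhere):

from kolena.workflow import test

test(model, test_suite, evaluate)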

This BasicEvaluatorFunction is provided to the test run at runtime, and is expected to have the following signature:

Parameters:

  • test_samples (List[TestSample], required): A list of distinct TestSample values that correspond to all test samples in the test run.
  • ground_truths (List[GroundTruth], required): A list of GroundTruth values corresponding to and sequenced in the same order as test_samples.
  • inferences (List[Inference], required): A list of Inference values corresponding to and sequenced in the same order as test_samples.
  • test_cases (TestCases, required): An instance of TestCases, used to provide iteration groupings for evaluating test-case-level metrics.
  • evaluator_configuration (EvaluatorConfiguration, required): The EvaluatorConfiguration to use when performing the evaluation. This parameter may be omitted in the function definition for implementations that do not use any configuration object.

Returns:

  • EvaluationResults: An EvaluationResults object tracking the test-sample-level, test-case-level and test-suite-level metrics and plots for the input collection of test samples.

TestCases #

Provides an iterator method for grouping test-sample-level metric results with the test cases that they belong to.

iter(test_samples, ground_truths, inferences, metrics_test_sample) abstractmethod #

Matches test sample metrics to the corresponding test cases that they belong to.

Parameters:

  • test_samples (List[TestSample], required): All unique test samples within the test run, sequenced in the same order as the other parameters.
  • ground_truths (List[GroundTruth], required): Ground truths corresponding to test_samples, sequenced in the same order.
  • inferences (List[Inference], required): Inferences corresponding to test_samples, sequenced in the same order.
  • metrics_test_sample (List[MetricsTestSample], required): Test-sample-level metrics corresponding to test_samples, sequenced in the same order.

Returns:

  • Iterator[Tuple[TestCase, List[TestSample], List[GroundTruth], List[Inference], List[MetricsTestSample]]]: Iterator grouping each test case in the test run with the lists of its member test samples, ground truths, inferences, and test-sample-level metrics.

EvaluationResults #

A bundle of metrics computed for a test run grouped at the test-sample-level, test-case-level, and test-suite-level. Optionally includes Plots at the test-case-level.

metrics_test_sample: List[Tuple[BaseTestSample, BaseMetricsTestSample]] instance-attribute #

Sample-level metrics, extending MetricsTestSample, for every provided test sample.

metrics_test_case: List[Tuple[TestCase, MetricsTestCase]] instance-attribute #

Aggregate metrics, extending MetricsTestCase, computed across each test case yielded from TestCases.iter.

plots_test_case: List[Tuple[TestCase, List[Plot]]] = field(default_factory=list) class-attribute instance-attribute #

Optional test-case-level plots.

metrics_test_suite: Optional[MetricsTestSuite] = None class-attribute instance-attribute #

Optional test-suite-level metrics, extending MetricsTestSuite.

no_op_evaluator(test_samples, ground_truths, inferences, test_cases) #

A no-op implementation of the Kolena Evaluator that will bypass evaluation but make Inferences accessible in the platform.

from kolena.workflow import no_op_evaluator
from kolena.workflow import test

test(model, test_suite, no_op_evaluator)