# kolena.workflow.Evaluator
## MetricsTestSample

Bases: `DataObject`

Test-sample-level metrics produced by an `Evaluator`.

This class should be subclassed with the relevant fields for a given workflow. Examples may include the number of true positive detections on an image, the mean IOU of inferred polygon(s) with ground truth polygon(s), etc.
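For illustration only, a per-sample metrics definition for a hypothetical object detection workflow might look like the following sketch (the class and field names are assumptions, not part of the SDK):

```python
from dataclasses import dataclass

from kolena.workflow import MetricsTestSample


@dataclass(frozen=True)
class PerImageMetrics(MetricsTestSample):
    num_true_positives: int  # detections matched to a ground truth
    num_false_positives: int  # detections with no matching ground truth
    num_false_negatives: int  # ground truths with no matching detection
    mean_iou: float  # mean IOU across matched detection/ground truth pairs
```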
## MetricsTestCase

Bases: `DataObject`

Test-case-level metrics produced by an `Evaluator`.

This class should be subclassed with the relevant fields for a given workflow.

Test-case-level metrics are aggregate metrics like `precision`, `recall`, and `f1_score`. Any and all aggregate metrics that fit a workflow should be defined here.
### Nesting Aggregate Metrics

`MetricsTestCase` supports nesting metrics objects, e.g. for reporting class-level metrics within a test case that contains multiple classes. Example usage:
```python
from dataclasses import dataclass
from typing import List

from kolena.workflow import MetricsTestCase


@dataclass(frozen=True)
class PerClassMetrics(MetricsTestCase):
    Class: str
    Precision: float
    Recall: float
    F1: float
    AP: float


@dataclass(frozen=True)
class TestCaseMetrics(MetricsTestCase):
    macro_Precision: float
    macro_Recall: float
    macro_F1: float
    mAP: float
    PerClass: List[PerClassMetrics]
```
Any `str`-type fields (e.g. `Class` in the above example) will be used as identifiers when displaying nested metrics on Kolena. For best results, include at least one `str`-type field in nested metrics definitions.

When comparing nested metrics from multiple models, an `int`-type column with any of the following names will be used for sample size in statistical significance calculations: `N`, `n`, `nTestSamples`, `n_test_samples`, `sampleSize`, `sample_size`, `SampleSize`.
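For example, the `PerClassMetrics` definition above could report a per-class sample size by adding an `int`-type field with one of these names (a sketch building on the definitions above):

```python
@dataclass(frozen=True)
class PerClassMetrics(MetricsTestCase):
    Class: str
    Precision: float
    Recall: float
    F1: float
    AP: float
    n: int  # per-class sample size, used in statistical significance calculations
```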
For a detailed overview of this feature, see the Nesting Test Case Metrics advanced usage guide.
## MetricsTestSuite

Bases: `DataObject`

Test-suite-level metrics produced by an `Evaluator`.

This class should be subclassed with the relevant fields for a given workflow.
Test-suite-level metrics typically measure performance across test cases, e.g. penalizing variance across different subsets of a benchmark.
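A minimal sketch of a suite-level metrics definition (the fields shown are assumptions for illustration):

```python
from dataclasses import dataclass

from kolena.workflow import MetricsTestSuite


@dataclass(frozen=True)
class TestSuiteMetrics(MetricsTestSuite):
    mean_f1: float  # mean F1 score across test cases in the suite
    variance_f1: float  # variance of F1 across test cases, penalizing inconsistency
```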
## EvaluatorConfiguration

Bases: `DataObject`

Configuration for an `Evaluator`.
Example evaluator configurations may specify:
- Fixed confidence thresholds at which detections are discarded.
- Different algorithms/strategies used to compute confidence thresholds (e.g. "accuracy optimal" for a classification-type workflow).
### display_name() *(abstractmethod)*

The name to display for this configuration in Kolena. Must be implemented when extending `EvaluatorConfiguration`.
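For example, a configuration carrying a fixed confidence threshold might be sketched as follows (the class and field names are assumptions, not part of the SDK):

```python
from dataclasses import dataclass

from kolena.workflow import EvaluatorConfiguration


@dataclass(frozen=True)
class ThresholdConfiguration(EvaluatorConfiguration):
    threshold: float  # confidence threshold below which detections are discarded

    def display_name(self) -> str:
        return f"confidence threshold: {self.threshold}"
```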
## Evaluator(configurations=None)

An `Evaluator` transforms inferences into metrics.

Metrics are computed at the individual test sample level (`MetricsTestSample`), in aggregate at the test case level (`MetricsTestCase`), and across populations at the test suite level (`MetricsTestSuite`).

Test-case-level plots (`Plot`) may also be computed.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `configurations` | `Optional[List[EvaluatorConfiguration]]` | The configurations at which to perform evaluation. Instance methods such as `compute_test_sample_metrics` are called once per configuration. | `None` |
### `configurations: List[EvaluatorConfiguration]` *(instance attribute)*

The configurations with which to perform evaluation, provided on instantiation.
### display_name()

The name to display for this evaluator in Kolena. Defaults to the name of this class.
### compute_test_sample_metrics(test_case, inferences, configuration=None) *(abstractmethod)*

Compute metrics for every test sample in a test case, i.e. one `MetricsTestSample` object for each of the provided test samples.

Must be implemented.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_case` | `TestCase` | The test case being evaluated. | required |
| `inferences` | `List[Tuple[TestSample, GroundTruth, Inference]]` | The test samples, ground truths, and inferences for all entries in a test case. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `List[Tuple[TestSample, MetricsTestSample]]` | One `MetricsTestSample` object for each provided test sample, paired with that test sample. |
### compute_test_case_metrics(test_case, inferences, metrics, configuration=None) *(abstractmethod)*

Compute aggregate metrics (`MetricsTestCase`) across a test case.

Must be implemented.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_case` | `TestCase` | The test case in question. | required |
| `inferences` | `List[Tuple[TestSample, GroundTruth, Inference]]` | The test samples, ground truths, and inferences for all entries in a test case. | required |
| `metrics` | `List[MetricsTestSample]` | The test-sample-level metrics computed by `compute_test_sample_metrics`, in corresponding order. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `MetricsTestCase` | The aggregate metrics computed across the test case. |
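Putting the two required methods together, a minimal `Evaluator` subclass might look like the sketch below. `TestSample`, `GroundTruth`, `Inference`, and the metrics classes refer to a workflow's own definitions, and `compute_per_sample`/`compute_aggregate` are hypothetical helpers implemented elsewhere:

```python
from typing import List, Optional, Tuple

from kolena.workflow import Evaluator, EvaluatorConfiguration, TestCase


class MyEvaluator(Evaluator):
    def compute_test_sample_metrics(
        self,
        test_case: TestCase,
        inferences: List[Tuple[TestSample, GroundTruth, Inference]],
        configuration: Optional[EvaluatorConfiguration] = None,
    ) -> List[Tuple[TestSample, MetricsTestSample]]:
        # one per-sample metrics object for each (test sample, ground truth, inference) entry
        return [(ts, compute_per_sample(gt, inf)) for ts, gt, inf in inferences]

    def compute_test_case_metrics(
        self,
        test_case: TestCase,
        inferences: List[Tuple[TestSample, GroundTruth, Inference]],
        metrics: List[MetricsTestSample],
        configuration: Optional[EvaluatorConfiguration] = None,
    ) -> MetricsTestCase:
        # aggregate the per-sample metrics into test-case-level metrics
        return compute_aggregate(metrics)
```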
### compute_test_case_plots(test_case, inferences, metrics, configuration=None)

Optionally compute any number of plots to visualize the results for a test case.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_case` | `TestCase` | The test case in question. | required |
| `inferences` | `List[Tuple[TestSample, GroundTruth, Inference]]` | The test samples, ground truths, and inferences for all entries in a test case. | required |
| `metrics` | `List[MetricsTestSample]` | The test-sample-level metrics computed by `compute_test_sample_metrics`, in corresponding order. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Optional[List[Plot]]` | Zero or more plots for this test case at this configuration. |
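As an illustration, an implementation could return a precision-recall curve for each test case. The sketch below assumes the `Curve` and `CurvePlot` plot types from `kolena.workflow.plot` and a hypothetical `compute_pr_curve` helper that returns lists of precision and recall values:

```python
from typing import List, Optional

from kolena.workflow.plot import Curve, CurvePlot, Plot


# method on an Evaluator subclass
def compute_test_case_plots(self, test_case, inferences, metrics, configuration=None) -> Optional[List[Plot]]:
    precisions, recalls = compute_pr_curve(metrics)  # hypothetical helper
    return [
        CurvePlot(
            title="Precision vs. Recall",
            x_label="Recall",
            y_label="Precision",
            curves=[Curve(x=recalls, y=precisions, label=test_case.name)],
        )
    ]
```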
### compute_test_suite_metrics(test_suite, metrics, configuration=None)

Optionally compute `TestSuite`-level metrics (`MetricsTestSuite`) across the provided `test_suite`.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_suite` | `TestSuite` | The test suite in question. | required |
| `metrics` | `List[Tuple[TestCase, MetricsTestCase]]` | The test-case-level metrics computed by `compute_test_case_metrics`, paired with their corresponding test cases. | required |
| `configuration` | `Optional[EvaluatorConfiguration]` | The evaluator configuration to use. Empty for implementations that are not configured. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Optional[MetricsTestSuite]` | The test-suite-level metrics for this test suite, if computed. |
## BasicEvaluatorFunction *(module attribute)*

`BasicEvaluatorFunction = Union[ConfiguredEvaluatorFunction, UnconfiguredEvaluatorFunction]`

Simplified interface for `Evaluator` implementations.

`BasicEvaluatorFunction` provides a function-based evaluator interface that takes the inferences for all test samples in a test suite and a `TestCases` instance as input and computes the corresponding test-sample-level, test-case-level, and test-suite-level metrics (and optionally plots) as output.
Example implementation, relying on `compute_per_sample` and `compute_aggregate` functions implemented elsewhere:
```python
def evaluate(
    test_samples: List[TestSample],
    ground_truths: List[GroundTruth],
    inferences: List[Inference],
    test_cases: TestCases,
    # configuration: EvaluatorConfiguration,  # uncomment when configuration is used
) -> EvaluationResults:
    # compute per-sample metrics for each test sample
    per_sample_metrics = [compute_per_sample(gt, inf) for gt, inf in zip(ground_truths, inferences)]

    # compute aggregate metrics across all test cases using `test_cases.iter(...)`
    aggregate_metrics: List[Tuple[TestCase, MetricsTestCase]] = []
    for test_case, *s in test_cases.iter(test_samples, ground_truths, inferences, per_sample_metrics):
        # subset of `test_samples`/`ground_truths`/`inferences`/`per_sample_metrics` in the given test case
        tc_test_samples, tc_ground_truths, tc_inferences, tc_per_sample_metrics = s
        aggregate_metrics.append((test_case, compute_aggregate(tc_per_sample_metrics)))

    # if desired, compute and add `plots_test_case` and `metrics_test_suite`
    return EvaluationResults(
        metrics_test_sample=list(zip(test_samples, per_sample_metrics)),
        metrics_test_case=aggregate_metrics,
    )
```
The control flow is in general more streamlined than with `Evaluator`, but requires a couple of assumptions to hold:

- Test-sample-level metrics do not vary by test case
- Ground truths corresponding to a given test sample do not vary by test case

This `BasicEvaluatorFunction` is provided to the test run at runtime and is expected to have the following signature:
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_samples` | `List[TestSample]` | A list of the distinct test samples in the test run. | required |
| `ground_truths` | `List[GroundTruth]` | A list of ground truths corresponding to `test_samples`, in the same order. | required |
| `inferences` | `List[Inference]` | A list of inferences corresponding to `test_samples`, in the same order. | required |
| `test_cases` | `TestCases` | An instance of `TestCases` used to group metrics by the test cases they belong to. | required |
| `evaluator_configuration` | `EvaluatorConfiguration` | The `EvaluatorConfiguration` to use; included in the signature only when the evaluator is configured. | required |

Returns:

| Type | Description |
| --- | --- |
| `EvaluationResults` | An `EvaluationResults` object bundling the computed test-sample-level, test-case-level, and (optionally) test-suite-level metrics and plots. |
## TestCases

Provides an iterator method for grouping test-sample-level metric results with the test cases that they belong to.
### iter(test_samples, ground_truths, inferences, metrics_test_sample) *(abstractmethod)*

Matches test sample metrics to the test cases that they belong to.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_samples` | `List[TestSample]` | All unique test samples within the test run, sequenced in the same order as the other parameters. | required |
| `ground_truths` | `List[GroundTruth]` | Ground truths corresponding to `test_samples`, in the same order. | required |
| `inferences` | `List[Inference]` | Inferences corresponding to `test_samples`, in the same order. | required |
| `metrics_test_sample` | `List[MetricsTestSample]` | Test-sample-level metrics corresponding to `test_samples`, in the same order. | required |
Returns:

| Type | Description |
| --- | --- |
| `Iterator[Tuple[TestCase, List[TestSample], List[GroundTruth], List[Inference], List[MetricsTestSample]]]` | Iterator grouping each test case in the test run with the lists of its member test samples, ground truths, inferences, and test-sample-level metrics. |
## EvaluationResults

A bundle of metrics computed for a test run, grouped at the test-sample, test-case, and test-suite level. Optionally includes `Plot`s at the test-case level.
### `metrics_test_sample: List[Tuple[BaseTestSample, BaseMetricsTestSample]]` *(instance attribute)*

Sample-level metrics, extending `MetricsTestSample`, for every provided test sample.
### `metrics_test_case: List[Tuple[TestCase, MetricsTestCase]]` *(instance attribute)*

Aggregate metrics, extending `MetricsTestCase`, computed across each test case yielded from `TestCases.iter`.
### `plots_test_case: List[Tuple[TestCase, List[Plot]]] = field(default_factory=list)` *(class attribute, instance attribute)*

Optional test-case-level plots.
### `metrics_test_suite: Optional[MetricsTestSuite] = None` *(class attribute, instance attribute)*

Optional test-suite-level metrics, extending `MetricsTestSuite`.