Representation Score
The representation_score measures how well each data point in an unlabelled or training dataset is represented within the
validation or test dataset. This score is calculated by fitting a Kernel Density Estimation (KDE) model on the embeddings
of the validation/test data in the 2D UMAP space. The KDE model estimates the probability density function of the
validation/test data distribution. For each data point in the training dataset, the representation_score is obtained by
evaluating the KDE (of the validation/test data) at that point, yielding the log-probability density. This score
effectively quantifies the extent to which each training data point lies in regions of the embedding space populated by
validation/test data.
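The calculation described above can be sketched in a few lines. This is a minimal illustration, not Kolena's implementation: it assumes a Gaussian kernel, a hypothetical bandwidth of 0.5, and synthetic 2D points standing in for the UMAP-projected embeddings. The helper gaussian_kde_log_density and the variable names are made up for this example.

```python
import math
import random


def gaussian_kde_log_density(point, reference, bandwidth):
    """Log-density of a 2D Gaussian KDE fitted on `reference`, evaluated at `point`."""
    h2 = bandwidth ** 2
    # Log of each unnormalised kernel value: -||x - r||^2 / (2 h^2)
    log_kernels = [
        -((point[0] - rx) ** 2 + (point[1] - ry) ** 2) / (2.0 * h2)
        for rx, ry in reference
    ]
    # Log-sum-exp keeps the result finite for points far from the reference data.
    m = max(log_kernels)
    log_sum = m + math.log(sum(math.exp(v - m) for v in log_kernels))
    # Normalise by the number of reference points and the 2D Gaussian constant.
    return log_sum - math.log(len(reference)) - math.log(2.0 * math.pi * h2)


random.seed(0)

# Hypothetical 2D embeddings of the validation/test data (stand-in for UMAP output).
test_2d = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(500)]

# Two hypothetical training points: one inside the test cluster, one far away.
train_2d = [(0.0, 0.0), (8.0, 8.0)]

# The KDE is built from validation/test points only, then evaluated at training points.
representation_score = [
    gaussian_kde_log_density(p, test_2d, bandwidth=0.5) for p in train_2d
]
print(representation_score)
```

The first training point sits in a dense region of the validation/test data and receives a higher log-density than the second, which lies far from every validation/test point.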
Note
To use this score, provide metadata that indicates whether each datapoint belongs to the training set.
For example, you can add a split property with the values train or test and use it for this score.
Note
To further assist users with data curation tasks, Kolena automatically calculates a number of metrics from the embedding space. Enable automatic embedding extraction or upload your own embeddings to use these scores.
Interpretation
The representation_score indicates how closely the unlabelled/training data aligns with the validation/test data distribution. A higher representation_score for a training data point means that it occupies a region of the embedding space that is well represented in the validation/test set, suggesting that its features are relevant to the model's performance on unseen data. Conversely, a lower score marks a training data point in a region sparsely populated by validation/test data, highlighting where the two distributions diverge. The goal is to identify such gaps, which could impact model training. This metric is useful for guiding data collection, labelling, and augmentation tasks: by addressing underrepresented areas you can improve model robustness and accuracy on validation and test data.
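One way to act on this interpretation is to rank datapoints by score and review the lowest-scoring ones first. The sketch below assumes the scores have already been exported as a mapping from datapoint ID to representation_score. The IDs, values, and the lowest_scoring helper are hypothetical and are not part of Kolena's API.

```python
# Hypothetical exported scores: datapoint ID -> representation_score (log-density).
scores = {
    "img_001": -2.1,
    "img_002": -2.4,
    "img_003": -9.8,
    "img_004": -3.0,
    "img_005": -15.2,
}


def lowest_scoring(scores, k):
    """Return the IDs of the k datapoints with the lowest scores, lowest first."""
    return sorted(scores, key=scores.get)[:k]


# Datapoints to inspect first when looking for distribution gaps.
review_queue = lowest_scoring(scores, k=2)
print(review_queue)
```

How many datapoints to review is a judgment call. Because the score is a log-density, absolute values depend on the embedding and the KDE settings, so comparing scores within one dataset is safer than applying a fixed threshold.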