Formatting your Datasets#

What is a Dataset#

A dataset is a structured assembly of datapoints, designed for model evaluation. Each datapoint in a dataset is a comprehensive unit that combines data traditionally segmented into test samples, ground truth, and metadata.

What defines a Datapoint#

Conceptually, a datapoint is a set of inputs that you would want to test on your models. Consider a single row within the Classification (CIFAR-10) ↗ dataset with the following columns:

locator ground_truth image_brightness image_contrast
s3://kolena-public-examples/cifar10/data/horse0000.png horse 153.994 84.126

From this you can see that image horse0000.png has the ground_truth classification of horse, and has brightness and contrast data.

When uploading a dataset to Kolena, it is important to be able to differentiate between each datapoint. This is accomplished by configuring an id_field - a unique identifier for a datapoint. You can select any field that is unique across your data, or generate one if no unique identifiers exist for your dataset. Below are some common patterns for generating/selecting a unique identifier if your data does not have a natural ID field:

  • If your datapoints contain a locator field pointing to the external files representing your model inputs, the locator field is usually used as the ID field.
  • For datapoints with a text field for text-based models, we recommend either generating and saving a UUID for each datapoint or generating a hash of the text field to use as the ID field. You can also use the text field itself as the ID field.
  • For other kinds of datapoints, we recommend generating and saving a UUID for each datapoint to use as the ID field.
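As a minimal sketch of the patterns above (not Kolena's own tooling), the following uses only Python's standard library to derive an ID from either a random UUID or a SHA-256 hash of the text field. The field name `id` and the helper names are illustrative assumptions.

```python
import hashlib
import uuid


def text_hash_id(text: str) -> str:
    """Return a deterministic ID: the SHA-256 hex digest of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def random_id() -> str:
    """Return a random UUID4 string; save it so the ID stays stable across uploads."""
    return str(uuid.uuid4())


# Hypothetical text datapoints without a natural ID field.
datapoints = [
    {"text": "by harry quilter m a"},
    {"text": "a second example sentence"},
]

for datapoint in datapoints:
    datapoint["id"] = text_hash_id(datapoint["text"])  # or random_id()

# Hash-based IDs are unique only if the text values themselves are unique.
assert len({d["id"] for d in datapoints}) == len(datapoints)
```

A hash has the advantage of being reproducible from the data itself, while a UUID must be stored alongside the datapoint to remain stable.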

Kolena will attempt to infer common id_fields (e.g. locator, text) based on what is present in the dataset during import. This can be overridden by explicitly declaring ID fields when importing via the Web App from the Datasets page, or via the SDK using the upload_dataset function.

Kolena will look for the following fields when displaying datapoints:

Field Name Description
locator URL path to a file to be displayed, either a cloud storage URL or an HTTP URL that serves a file.
text Raw text input for text-based models.

A locator must have the correct extension for its file type. For example, an image locator should end in an extension such as .jpg or .png, whereas a locator for audio data should end in an extension such as .mp3 or .wav.

Locator Support Matrix#

Data Type Supported file formats
Image jpg, jpeg, png, gif, bmp and other web browser supported image types.
Audio flac, mp3, wav, aac, ogg, ra and other web browser supported audio types.
Video mov, mp4, mpeg, avi and other web browser supported video types.
Document txt and pdf files.
Point Cloud pcd files.
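To check locators before uploading, a small helper can map each extension to a data type. This is only a sketch: the mapping below covers a subset of the extensions named in the matrix, and the function name is an assumption rather than part of the Kolena SDK.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# A subset of the extensions listed in the support matrix above.
SUPPORTED_EXTENSIONS = {
    "Image": {"jpg", "jpeg", "png", "gif", "bmp"},
    "Audio": {"flac", "mp3", "wav", "ogg"},
    "Video": {"mov", "mp4", "mpeg", "avi"},
    "Document": {"txt", "pdf"},
    "Point Cloud": {"pcd"},
}


def locator_data_type(locator: str) -> Optional[str]:
    """Return the data type implied by the locator's extension, or None if unknown."""
    path = urlparse(locator).path  # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    for data_type, extensions in SUPPORTED_EXTENSIONS.items():
        if extension in extensions:
            return data_type
    return None


image_type = locator_data_type("s3://kolena-public-examples/cifar10/data/horse0000.png")
```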

Metadata and other additional fields can be added to datasets by adding a column to the .csv and providing values for datapoints where applicable. For example, image_height and image_width may be useful metadata for image datasets, and fields like word_count may be useful for text datasets.

How are Datasets viewed#

Kolena allows you to visualize your datasets in the Studio. The Studio experience depends on the type of data relevant to your problem.

The first experience is the Gallery view which allows you to view your data in a grid. This is useful as you can see chunks of your data (images, video, audio, text) and view results without having to view each datapoint individually.

The second experience is the Tabular view, used when your data is a set of columns and values. An example of this is the Rain Forecast ↗ dataset.

To use the Gallery view, the dataset must specify either the locator or the text field.

Enriching your Dataset experience#

Kolena Assets#

You can connect files to datapoints in Kolena through the use of assets, which can be visualized in the Studio when exploring datasets and results. Multiple assets can be attached to a single datapoint, allowing you to represent complex scenarios on Kolena. Assets are files stored in a cloud bucket or served at a URL.

Asset Type Description
ImageAsset Useful if your data is modeled as multiple related images.
BinaryAsset Useful if you want to attach any segmentation or bitmap masks.
AudioAsset Useful if you want to attach an audio file.
VideoAsset Useful if you want to attach a video file.
PointCloudAsset Useful for attaching 3D point cloud data.

The Automatic Speech Recognition ↗ example showcases how AudioAssets can be attached to datapoints.

The CSV file at s3://kolena-public-examples/LibriSpeech/raw/LibriSpeech.csv contains data in the following format:

id audio transcript word_count
1272-128104-0014 s3://kolena-public-examples/LibriSpeech/data/dev-clean/1272/128104/1272-128104-0014.flac by harry quilter m a 5

Here the audio column contains a locator, but if uploaded as is, it would be rendered as a plain text metadata field. To have the audio file rendered as an asset, wrap each value in an AudioAsset when preparing the upload.

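from kolena.asset import AudioAsset
from kolena.io import dataframe_to_csv
import pandas as pd

# Read the raw CSV anonymously from the public bucket.
df = pd.read_csv("s3://kolena-public-examples/LibriSpeech/raw/LibriSpeech.csv", storage_options={"anon": True})
# Wrap each locator string in an AudioAsset so it is rendered as an asset rather than as text.
df["audio"] = df["audio"].apply(AudioAsset)
# Save with Kolena's helper so the asset objects are serialized in a compatible form.
dataframe_to_csv(df, "audio-asset.csv")

Now the data in audio-asset.csv can be uploaded as a tabular dataset with audio assets attached to each row. Any name can be used for the audio column in this example.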

Kolena Annotations#

Kolena allows you to visualize overlays on top of datapoints through the use of annotations. These annotations are visible both in the Gallery view for groups of datapoints and for individual datapoints.

Annotation Type Description
BoundingBox Used to overlay bounding boxes (including confidence scores and labels) on top of images.
SegmentationMask Used to overlay raster segmentation maps on top of images.

Structured Data#

Consider a .csv file containing ground truth data in the form of bounding boxes for an Object Detection problem.

locator label min_x max_x min_y max_y
s3://kolena-public-examples/coco-2014-val/data/COCO_val2014_000000369763.jpg motorcycle 270.77 621.61 44.59 254.18
s3://kolena-public-examples/coco-2014-val/data/COCO_val2014_000000369763.jpg car 538.03 636.85 8.86 101.93
s3://kolena-public-examples/coco-2014-val/data/COCO_val2014_000000369763.jpg trunk 313.02 553.98 12.01 99.84

The first bounding box for the image has its top-left corner at (270.77, 44.59) and its bottom-right corner at (621.61, 254.18). To represent this within Kolena, use the LabeledBoundingBox annotation. If you want to ignore labels, the base BoundingBox can be used.

This looks like:

from kolena.annotation import LabeledBoundingBox
bbox = LabeledBoundingBox(top_left=(270.77, 44.59), bottom_right=(621.61, 254.18), label="motorcycle")

When viewing a bounding box within Python the format is:

LabeledBoundingBox(top_left=(270.77, 44.59), bottom_right=(621.61, 254.18), label="motorcycle", width=350.84,
 height=209.59, area=73532.5556, aspect_ratio=1.67)
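The printed form includes derived fields (width, height, area, aspect_ratio) that follow from the two corner points. The sketch below reproduces that arithmetic in plain Python without the Kolena SDK. The function name and the rounding precision are assumptions chosen to match the values printed above, not a description of how the SDK computes them.

```python
def derived_box_fields(top_left, bottom_right):
    """Compute the derived fields of an axis-aligned box from its corners."""
    width = bottom_right[0] - top_left[0]
    height = bottom_right[1] - top_left[1]
    return {
        "width": round(width, 2),
        "height": round(height, 2),
        "area": round(width * height, 4),
        "aspect_ratio": round(width / height, 2),  # width divided by height
    }


# Corner points of the motorcycle box from the example above.
fields = derived_box_fields((270.77, 44.59), (621.61, 254.18))
```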

A single bounding box would be serialized as the following JSON string within a .csv file:

"{""top_left"": [270.77, 44.59], ""bottom_right"": [621.61, 254.18], ""width"": 350.84, ""height"": 209.59,
 ""area"": 73532.5556, ""aspect_ratio"": 1.67, ""label"": ""motorcycle"", ""data_type"": ""ANNOTATION/BOUNDING_BOX""}"

The above example has multiple objects within a single image, which is represented in Kolena as a list of bounding boxes.

For example:

from kolena.annotation import LabeledBoundingBox
bboxes = [
    LabeledBoundingBox(top_left=(270.77, 44.59), bottom_right=(621.61, 254.18), label="motorcycle"),
    LabeledBoundingBox(top_left=(538.03, 8.86), bottom_right=(636.85, 101.93), label="car"),
    LabeledBoundingBox(top_left=(313.02, 12.01), bottom_right=(553.98, 99.84), label="trunk"),
]

This would be represented within a .csv file as shown below. Note this will be a single line, but is shown here as multiple lines for formatting.

"[{""top_left"": [270.77, 44.59], ""bottom_right"": [621.61, 254.18], ""width"": 350.84, ""height"": 209.59,
 ""area"": 73532.5556, ""aspect_ratio"": 1.67, ""label"": ""motorcycle"", ""data_type"": ""ANNOTATION/BOUNDING_BOX""},
  {""top_left"": [538.03, 8.86], ""bottom_right"": [636.85, 101.93], ""width"": 98.82, ""height"": 93.07,
   ""area"": 9197.1774, ""aspect_ratio"": 1.062, ""label"": ""car"", ""data_type"": ""ANNOTATION/BOUNDING_BOX""},
  {""top_left"": [313.02, 12.01], ""bottom_right"": [553.98, 99.84], ""width"": 240.96,
   ""height"": 87.83, ""area"": 21163.5168, ""aspect_ratio"": 2.743, ""label"": ""trunk"", ""data_type"": ""ANNOTATION/BOUNDING_BOX""}]"

When uploading .csv files for datasets that contain annotations, assets, or nested values in a column, use the dataframe_to_csv() function provided by Kolena to save the .csv file instead of pandas' DataFrame.to_csv(). DataFrame.to_csv() does not serialize Kolena annotation objects in a way that is compatible with the platform.
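The doubled quotes in the serialized example above are standard CSV escaping for a quoted field that itself contains quotes. The sketch below illustrates only that escaping, using Python's standard csv and json modules on plain dictionaries. It is not a substitute for dataframe_to_csv(), and the column name ground_truths is used purely for illustration.

```python
import csv
import io
import json

# Plain dictionaries standing in for nested values (not Kolena annotation objects).
boxes = [
    {"top_left": [270.77, 44.59], "bottom_right": [621.61, 254.18], "label": "motorcycle"},
    {"top_left": [538.03, 8.86], "bottom_right": [636.85, 101.93], "label": "car"},
]
locator = "s3://kolena-public-examples/coco-2014-val/data/COCO_val2014_000000369763.jpg"

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["locator", "ground_truths"])
# The JSON string contains commas and quotes, so the writer quotes the field
# and doubles every embedded quote character.
writer.writerow([locator, json.dumps(boxes)])
csv_text = buffer.getvalue()

# Reading the file back undoes the escaping and recovers the nested values.
rows = list(csv.reader(io.StringIO(csv_text)))
restored = json.loads(rows[1][1])
```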

The following snippet shows how to format COCO data as a dataset within Kolena. As the input .csv file contains one row for each bounding box within an image, we need to apply some transformations to the raw data. This is done by creating a list of all bounding boxes for an image and then merging it with the metadata. The produced .csv contains a column called ground_truths where the data is in the same format as the above bounding boxes.

from kolena.annotation import LabeledBoundingBox
from kolena.io import dataframe_to_csv
from collections import defaultdict
import pandas as pd

# The raw CSV has one row per bounding box, so a single image can span several rows.
df = pd.read_csv("s3://kolena-public-examples/coco-2014-val/transportation/raw/coco-2014-val.csv",
                 storage_options={"anon": True})
image_to_boxes = defaultdict(list)
image_to_metadata = defaultdict(dict)

for record in df.itertuples():
    coords = (float(record.min_x), float(record.min_y)), (float(record.max_x), float(record.max_y))
    bounding_box = LabeledBoundingBox(*coords, record.label)
    # Collect every box for an image under its locator.
    image_to_boxes[record.locator].append(bounding_box)
    metadata = {
        "locator": record.locator,
        "height": record.height,
        "width": record.width,
        "date_captured": record.date_captured,
        "brightness": record.brightness,
    }
    image_to_metadata[record.locator] = metadata

# One row per image: the list of boxes becomes the ground_truths column.
df_boxes = pd.DataFrame(list(image_to_boxes.items()), columns=["locator", "ground_truths"])
df_metadata = pd.DataFrame.from_dict(image_to_metadata, orient="index").reset_index(drop=True)
df_merged = df_metadata.merge(df_boxes, on="locator")

# Save with Kolena's helper so the annotation objects are serialized in a compatible form.
dataframe_to_csv(df_merged, "processed.csv")

The file processed.csv can be uploaded through the Datasets page.

Configuring Thumbnails#

To improve the loading performance of your image data, you can upload compressed versions of your images, with the same dimensions as the originals, to serve as thumbnails. This results in an improved Studio experience due to faster image loading when filtering, sorting, or using embedding sort.

Thumbnails are configured by adding a field called thumbnail_locator to the data, where the value points to a compressed version of the locator image.

If you wanted to add a thumbnail to the classification data shown above it would look like:

locator thumbnail_locator ground_truth image_brightness image_contrast
s3://kolena-public-examples/cifar10/data/horse0000.png s3://kolena-public-examples/cifar10/data/thumbnail/horse0000.png horse 153.994 84.126
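If thumbnails are stored in a subdirectory next to the original images, as in the row above, the thumbnail_locator value can be derived from the locator. The following is a sketch under that layout assumption. The helper name and the default subdirectory are illustrative, and the compressed files must already exist at the resulting paths.

```python
import posixpath


def thumbnail_locator_for(locator: str, subdirectory: str = "thumbnail") -> str:
    """Insert a thumbnail subdirectory between the directory and the file name."""
    directory, filename = posixpath.split(locator)
    return posixpath.join(directory, subdirectory, filename)


row = {"locator": "s3://kolena-public-examples/cifar10/data/horse0000.png"}
row["thumbnail_locator"] = thumbnail_locator_for(row["locator"])
```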

Formatting Results#

Formatting results for Object Detection#

For Object Detection problems, model results need to have the following columns for the best experience. The value of each column is a List[ScoredLabeledBoundingBox].

Column Name Description
matched_inference Inferences that were matched to a ground truth.
unmatched_inference Inferences that were not matched to a ground truth.
unmatched_ground_truth Ground truths with no matching inference.

These columns are used to determine True Positives, False Positives, and False Negatives. These results can be formatted for upload with a similar process as above. This is done by adding the relevant list of bounding boxes to the matched_inference, unmatched_inference, and unmatched_ground_truth columns for each image. The results.csv created can be uploaded by opening the corresponding dataset from the Datasets page and navigating to the Studio section.

We have provided an Object Detection (2D) ↗ example that shows how to take raw results and perform bounding box matching to produce the values mentioned above.
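To give a rough idea of how the three columns relate, here is a simplified greedy matcher in plain Python. It is a sketch only and not the matching used in the linked example: it ignores labels, represents boxes as (min_x, min_y, max_x, max_y) tuples instead of ScoredLabeledBoundingBox objects, and assumes an IoU threshold of 0.5.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
ScoredBox = Tuple[Box, float]  # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def match_boxes(ground_truths: List[Box], inferences: List[ScoredBox], threshold: float = 0.5):
    """Greedily match inferences (highest score first) to unused ground truths."""
    matched_inference: List[ScoredBox] = []
    unmatched_inference: List[ScoredBox] = []
    unmatched_ground_truth = list(ground_truths)
    for box, score in sorted(inferences, key=lambda item: item[1], reverse=True):
        # Pick the remaining ground truth that overlaps this inference the most.
        best = max(unmatched_ground_truth, key=lambda gt: iou(gt, box), default=None)
        if best is not None and iou(best, box) >= threshold:
            matched_inference.append((box, score))
            unmatched_ground_truth.remove(best)
        else:
            unmatched_inference.append((box, score))
    return matched_inference, unmatched_inference, unmatched_ground_truth


# Hypothetical boxes: one inference overlaps a ground truth, one does not.
ground_truths = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
inferences = [((1.0, 1.0, 10.0, 10.0), 0.9), ((50.0, 50.0, 60.0, 60.0), 0.8)]
matched, unmatched_inf, unmatched_gt = match_boxes(ground_truths, inferences)
```

Matched inferences correspond to True Positives, unmatched inferences to False Positives, and unmatched ground truths to False Negatives.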

To use compound metrics on the fly#

The Kolena web application currently supports precision, recall, f1_score, accuracy, false_positive_rate, and true_negative_rate.

To leverage these, add the following columns to your CSV: count_TP, count_FP, count_FN, count_TN.
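For a binary classification problem, the four count columns could be filled in per datapoint as sketched below. The metric formulas are the standard textbook definitions, and how Kolena aggregates the counts internally is an assumption here. The helper names are hypothetical.

```python
def count_columns(ground_truth: bool, prediction: bool) -> dict:
    """Per-datapoint confusion counts: exactly one of the four values is 1."""
    return {
        "count_TP": int(ground_truth and prediction),
        "count_FP": int(not ground_truth and prediction),
        "count_FN": int(ground_truth and not prediction),
        "count_TN": int(not ground_truth and not prediction),
    }


def safe_divide(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compound_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard definitions of the metrics listed above, computed from summed counts."""
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": safe_divide(2 * precision * recall, precision + recall),
        "accuracy": safe_divide(tp + tn, tp + fp + fn + tn),
        "false_positive_rate": safe_divide(fp, fp + tn),
        "true_negative_rate": safe_divide(tn, fp + tn),
    }


# Hypothetical (ground_truth, prediction) pairs for five datapoints.
pairs = [(True, True), (True, False), (False, True), (False, False), (True, True)]
rows = [count_columns(gt, pred) for gt, pred in pairs]
totals = {key: sum(row[key] for row in rows) for key in rows[0]}
metrics = compound_metrics(
    totals["count_TP"], totals["count_FP"], totals["count_FN"], totals["count_TN"]
)
```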

Supported File Data Formats#

The Kolena web application currently supports various file formats for both dataset uploads and model results processing. The following table lists the supported file formats:

Format Description
.csv Comma-separated values file, ideal for tabular data.
.parquet Apache Parquet format, efficient for columnar storage.
.jsonl JSON Lines format, suitable for handling nested data.

CSV Files: Widely used for simple tabular datasets, CSV files are easy to generate and manipulate, making them a popular choice for data scientists and developers.

Parquet Files: Offering efficient storage and fast retrieval, Parquet files are optimal for handling large datasets with a significant number of columns.

JSON Lines (JSONL) Files: Each line in a JSONL file is a complete JSON object, making this format ideal for datasets with complex or nested data structures.

When preparing your dataset or model results files for upload, ensure that they conform to one of these supported file formats to guarantee compatibility with Kolena's data processing capabilities.
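As an illustration of the JSONL layout, the sketch below writes one JSON object per line using only Python's standard library and reads it back. The records and the nested metadata field are hypothetical, and the exact structure Kolena expects for annotation objects in JSONL is not covered here.

```python
import io
import json

# Hypothetical datapoints with a nested field, which JSONL holds without extra escaping.
records = [
    {"locator": "s3://kolena-public-examples/cifar10/data/horse0000.png",
     "ground_truth": "horse",
     "metadata": {"image_brightness": 153.994, "image_contrast": 84.126}},
    {"text": "by harry quilter m a", "word_count": 5},
]

buffer = io.StringIO()
for record in records:
    # Each record becomes exactly one line of JSON.
    buffer.write(json.dumps(record) + "\n")
jsonl_text = buffer.getvalue()

# Each line parses independently back into the original object.
restored = [json.loads(line) for line in jsonl_text.splitlines()]
```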