Natural Language#

In this document we will review best practices when setting up Kolena datasets for NLP or LLM problems.

Basics#

Supported File Data Formats#

The Kolena SDK supports uploading data as a pandas DataFrame.

The Kolena web app supports the following file formats.

Format     Description
.csv       Comma-separated values file, ideal for tabular data.
.parquet   Apache Parquet format, efficient for columnar storage.
.jsonl     JSON Lines format, suitable for handling nested data.
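
As an illustration of why .jsonl suits nested data, here is a minimal sketch that serializes records with a nested field using only the Python standard library. The records and their field names (id, text, source) are hypothetical and are not required by Kolena.

```python
import json

# Hypothetical records; the field names are illustrative, not a Kolena requirement.
records = [
    {"id": 1, "text": "The cat sat on the mat.", "source": {"name": "demo", "split": "test"}},
    {"id": 2, "text": "Stocks fell sharply on Monday.", "source": {"name": "demo", "split": "test"}},
]


def to_jsonl(rows):
    """Serialize an iterable of dicts as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


jsonl_text = to_jsonl(records)

# Round trip: every line parses back into the original (nested) record.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```

Writing jsonl_text to a file with a .jsonl extension keeps the nested source object intact, whereas a .csv file would need it flattened into separate columns.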

Using the text column#

Text samples can be visualized on Kolena in one of two ways:

Gallery mode: visualizes each text value as a tile.

To enable this view, include your primary text values in a column named text in your .csv file.

Gallery View

Example

The Text Summarization example showcases how text can be uploaded in Gallery mode.
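
For a quick local illustration, the sketch below builds a small .csv with a column named text using the standard csv module. The id column and the sample sentences are hypothetical; only the text column name comes from the guidance above.

```python
import csv
import io

# Hypothetical rows; only the `text` column name is taken from the guidance above.
rows = [
    {"id": "doc-001", "text": "The cat sat on the mat."},
    {"id": "doc-002", "text": 'She said, "markets fell," then left.'},
]

# Write to an in-memory buffer; swap in open("dataset.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# The header row carries the column names, including `text`.
header = csv_text.splitlines()[0].split(",")
```

Using csv.DictWriter rather than manual string joins means commas and quotes inside the text values are escaped correctly.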

Tip

Use the TextSegment or LabeledTextSegment annotations to highlight the parts of your text that are of interest to you.

Tabular mode: visualizes each text field with its corresponding metadata in a table with standard table functionality.

To use this view, provide the text values in your .csv file in a column with any name other than text.

Tabular View
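
The column-naming rule for the two modes can be turned into a small pre-upload check. This is a local sketch of the rule as described here, not Kolena's own logic; the helper name expected_view is hypothetical, and it assumes the match on text is exact and case-sensitive.

```python
def expected_view(columns):
    """Return "gallery" if a column is named exactly `text`, else "tabular".

    Assumption: the column name match is exact and case-sensitive.
    """
    return "gallery" if "text" in columns else "tabular"


# Hypothetical column layouts for two datasets.
gallery_columns = ["id", "text", "word_count"]
tabular_columns = ["id", "question", "answer"]

gallery_mode = expected_view(gallery_columns)
tabular_mode = expected_view(tabular_columns)
```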

Using fields#

You can attach additional information to your text by adding columns to the .csv file: use the metadata name as the column header and put the corresponding value in each row.
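
A minimal sketch of adding such columns before upload follows. The metadata names used here (source, language, word_count) are hypothetical examples, and word_count is computed locally with a simple whitespace split.

```python
import csv
import io

# Hypothetical rows with two hand-supplied metadata columns.
rows = [
    {"id": "doc-001", "text": "The cat sat on the mat.", "source": "wiki", "language": "en"},
    {"id": "doc-002", "text": "Stocks fell sharply on Monday.", "source": "news", "language": "en"},
]

# Derive one more metadata column from the text itself (whitespace-separated tokens).
for row in rows:
    row["word_count"] = len(row["text"].split())

# Each metadata name becomes a column header; each row carries its own value.
fieldnames = ["id", "text", "source", "language", "word_count"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_with_fields = buffer.getvalue()
```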

Tip

Kolena can automatically extract multiple properties from your text values; see Extracting Metadata from Text Fields. You can use these values to create test cases and better understand your data.

Uploading Model Results#

Model results contain your model inferences as well as any custom metrics that you wish to monitor on Kolena. The data structure of model results is very similar to the structure of a dataset.

  • Make sure to link your inferences to the dataset using the same unique ID (for example, the text field) that you used when uploading the dataset.
  • Use ScoredTextSegment or ScoredLabeledTextSegment annotations to indicate the inference confidence score.
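
The ID-linking requirement can be checked locally before uploading. The sketch below assumes a hypothetical id column as the unique ID, a hypothetical inference column for model output, and a custom metric named length_ratio that is computed from the data; none of these names are prescribed by Kolena.

```python
# Hypothetical dataset rows, keyed by the same unique ID used at upload time.
dataset_rows = [
    {"id": "doc-001", "text": "The cat sat on the mat."},
    {"id": "doc-002", "text": "Stocks fell sharply on Monday."},
]

# Hypothetical model outputs; `inference` is an illustrative column name.
result_rows = [
    {"id": "doc-001", "inference": "Cat sat on mat."},
    {"id": "doc-002", "inference": "Stocks fell Monday."},
]

# Add a custom metric column: inference length relative to the source text.
text_by_id = {row["id"]: row["text"] for row in dataset_rows}
for row in result_rows:
    source_words = len(text_by_id[row["id"]].split())
    row["length_ratio"] = len(row["inference"].split()) / source_words


def unmatched_ids(dataset, results, id_field="id"):
    """Return result IDs that do not appear in the dataset, sorted."""
    dataset_ids = {row[id_field] for row in dataset}
    return sorted({row[id_field] for row in results} - dataset_ids)


missing = unmatched_ids(dataset_rows, result_rows)
```

An empty missing list means every result row can be joined back to a dataset row; any ID it reports should be fixed before the results are uploaded.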