kolena.dataset

upload_dataset(name, df, *, id_fields=None, commit_tags=None, append_only=False)

Create or update a dataset with the contents of the provided DataFrame df.

Updating id_fields

ID fields are used to associate model results (uploaded via upload_results) with datapoints in this dataset. When updating an existing dataset, update id_fields with caution.

Parameters:

Name Type Description Default
name str

The name of the dataset.

required
df Union[DataFrame, Iterator[DataFrame]]

A DataFrame or iterator of DataFrames. Provide an iterator to perform batch upload (example: csv_reader = pd.read_csv("PathToDataset.csv", chunksize=10)).

required
id_fields Optional[List[str]]

Optionally specify a list of ID fields used to link model results with the datapoints in the dataset. When unspecified, a suitable value is inferred from the columns of the provided df. Note that the values of the id_fields columns must be hashable.

None
commit_tags Optional[List[str]]

Optionally specify a list of tags to associate with the dataset commit.

None
append_only bool

If False, all datapoints in the dataset are replaced by the ones in the input DataFrame, and existing datapoints absent from the input DataFrame are removed from the dataset. If True, new datapoints from the input DataFrame are added and existing datapoints present in the input DataFrame are modified, but no datapoints are deleted from the dataset. This behaves like an UPSERT operation.

False
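The difference between the two append_only modes can be sketched on plain Python data. This is only a model of the documented semantics, not Kolena's implementation: apply_upload is a hypothetical helper, and matching existing datapoints by their id_fields values is an assumption.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def apply_upload(
    existing: List[Row],
    incoming: List[Row],
    id_fields: List[str],
    append_only: bool = False,
) -> List[Row]:
    """Hypothetical model of the replace/upsert behavior described above."""

    def key(row: Row) -> Tuple[object, ...]:
        # Assumption: a datapoint is identified by its id_fields values.
        return tuple(row[field] for field in id_fields)

    if not append_only:
        # Replace: only the incoming datapoints survive.
        return [dict(row) for row in incoming]

    # Upsert: keep existing rows, overwrite matches, append new ones.
    merged = {key(row): dict(row) for row in existing}
    for row in incoming:
        merged[key(row)] = dict(row)
    return list(merged.values())


existing = [
    {"locator": "a.png", "label": "cat"},
    {"locator": "b.png", "label": "dog"},
]
incoming = [
    {"locator": "b.png", "label": "wolf"},
    {"locator": "c.png", "label": "bird"},
]

replaced = apply_upload(existing, incoming, ["locator"], append_only=False)
upserted = apply_upload(existing, incoming, ["locator"], append_only=True)
```

With append_only=False the datapoint a.png disappears; with append_only=True it is kept, b.png is updated, and c.png is added.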

list_datasets()

List the names of all uploaded datasets.

Returns:

Type Description
List[str]

A list of the names of all uploaded datasets.

download_dataset(name, *, commit=None, include_extracted_properties=False)

Download an entire dataset given its name.

Parameters:

Name Type Description Default
name str

The name of the dataset.

required
commit Optional[str]

The commit hash of the dataset version to download. When None, the latest commit is used.

None
include_extracted_properties bool

If True, include the properties extracted by Kolena's automated extractions as separate columns in the dataset.

False

Returns:

Type Description
DataFrame

A DataFrame containing the specified dataset.

EvalConfig = Optional[Dict[str, Any]] module-attribute

User-defined configuration for evaluating results, for example {"threshold": 7}.

DataFrame = Union[pd.DataFrame, Iterator[pd.DataFrame]] module-attribute

A type alias representing a DataFrame, which can be either a pandas DataFrame or an iterator of pandas DataFrames.
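As a rough illustration of the two forms this alias covers, the sketch below builds a single pandas DataFrame and an iterator of DataFrames from the same CSV text. The CSV content and column names are made up for the example, and the upload_dataset call is left commented out because it needs a configured Kolena client.

```python
import io

import pandas as pd

# Illustrative CSV content; the column names are assumptions.
CSV_TEXT = """locator,label
a.png,cat
b.png,dog
c.png,cat
"""

# Form 1: one DataFrame holding every row.
single_df = pd.read_csv(io.StringIO(CSV_TEXT))

# Form 2: an iterator that yields DataFrames of at most 2 rows each.
chunked = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)
chunk_lengths = [len(chunk) for chunk in chunked]

# Either form could be passed as df, for example:
# upload_dataset("my-dataset", single_df, id_fields=["locator"])
```

Note that a chunked reader is consumed once it has been iterated, so a fresh reader would be needed for the actual upload.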

EvalConfigResults

Bases: NamedTuple

Named tuple where the first element (the eval_config field) is an evaluation configuration, and the second element (the results field) is the corresponding DataFrame of results.
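Because this is a named tuple, each entry can be read by field name or unpacked positionally. The sketch below uses a local stand-in class, EvalConfigResultsSketch, rather than the real type, and uses plain lists where the real results field holds a DataFrame.

```python
from typing import Any, Dict, NamedTuple, Optional


class EvalConfigResultsSketch(NamedTuple):
    """Local stand-in mirroring the two documented fields."""

    eval_config: Optional[Dict[str, Any]]
    results: Any  # a pandas DataFrame in the real type


runs = [
    EvalConfigResultsSketch({"threshold": 0.3}, [{"locator": "a.png", "score": 0.9}]),
    EvalConfigResultsSketch({"threshold": 0.7}, [{"locator": "a.png", "score": 0.9}]),
]

# Access by field name or by position; both refer to the same object.
first = runs[0]
same_object = first.eval_config is first[0]

# Positional unpacking, as in a loop over a list of such tuples.
thresholds = [eval_config["threshold"] for eval_config, _ in runs]
```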

ModelEntity

The descriptor of a model tested on Kolena.

name: str instance-attribute

Unique name of the model.

tags: List[str] instance-attribute

Tags associated with the model.

download_results(dataset, model, commit=None, include_extracted_properties=False)

Download results given a dataset name and a model name.

Concatenate the datapoints with each set of results:

import pandas as pd
from kolena.dataset import download_results

df_dp, results = download_results("dataset name", "model name")
for eval_config, df_result in results:
    df_combined = pd.concat([df_dp, df_result], axis=1)

Parameters:

Name Type Description Default
dataset str

The name of the dataset.

required
model str

The name of the model.

required
commit Optional[str]

The commit hash of the dataset version to download results for. When None, the latest commit is used.

None
include_extracted_properties bool

If True, include the properties extracted by Kolena's automated extractions as separate columns in the datapoints and results.

False

Returns:

Type Description
Tuple[DataFrame, List[EvalConfigResults]]

Tuple of DataFrame of datapoints and list of EvalConfigResults.
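The column-wise concatenation of datapoints and results can be tried offline with two small hand-built frames standing in for the returned values. The column names and values here are invented, and the sketch assumes the two frames share the same row index, since pd.concat with axis=1 aligns rows on the index.

```python
import pandas as pd

# Stand-ins for the datapoints frame and one results frame.
df_dp = pd.DataFrame({"locator": ["a.png", "b.png"], "label": ["cat", "dog"]})
df_result = pd.DataFrame({"prediction": ["cat", "cat"], "score": [0.9, 0.4]})

# Side-by-side join: rows are matched on the shared index.
df_combined = pd.concat([df_dp, df_result], axis=1)

# Example of a derived column that needs both datapoint and result fields.
df_combined["correct"] = df_combined["label"] == df_combined["prediction"]
```

If the indexes differed, rows would be matched by index label and unmatched rows filled with NaN, so it is worth checking the index before combining.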

upload_results(dataset, model, results, thresholded_fields=None, tags=[])

Upload the results of a specified model on a given dataset.

Parameters:

Name Type Description Default
dataset str

The name of the dataset.

required
model str

The name of the model.

required
results Union[DataFrame, List[EvalConfigResults]]

Either a DataFrame or a list of EvalConfigResults.

required
thresholded_fields Optional[List[str]]

Optional columns in result DataFrame containing data associated with different thresholds.

None
tags List[str]

Optional list of tags to be associated with the model.

[]

Returns:

Type Description
None

None

get_models(dataset)

Get all models with results on a given dataset.

Parameters:

Name Type Description Default
dataset str

The name of the dataset.

required

Returns:

Type Description
List[ModelEntity]

A list of models tested on the given dataset.