# kolena.dataset

Examples: `kolena/examples/dataset`
## `upload_dataset(name, df, *, id_fields=None, commit_tags=None, dataset_tags=None, append_only=False, description=None)`

Create or update a dataset with the contents of the provided DataFrame `df`.

**Updating `id_fields`**

ID fields are used to associate model results (uploaded via `upload_results`) with datapoints in this dataset. When updating an existing dataset, update `id_fields` with caution.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The name of the dataset. | required |
`df` | `Union[DataFrame, Iterator[DataFrame]]` | A DataFrame or an iterator of DataFrames. Provide an iterator to perform a batched upload. | required |
`id_fields` | `Optional[List[str]]` | Optionally specify a list of ID fields that will be used to link model results with the datapoints within a dataset. When unspecified, a suitable value is inferred from the columns of the provided DataFrame. | `None` |
`commit_tags` | `Optional[List[str]]` | Optionally specify a list of tags to associate with the dataset commit. | `None` |
`dataset_tags` | `Optional[List[str]]` | Optionally specify a list of tags to associate with the dataset. | `None` |
`append_only` | `bool` | If `True`, the upload only appends datapoints to the dataset. | `False` |
`description` | `Optional[str]` | Optionally specify the description of the dataset. | `None` |
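A minimal usage sketch based on the documented signature; the dataset name and the `locator` / `ground_truth` column names are illustrative:

```python
import pandas as pd

from kolena.dataset import upload_dataset

# Illustrative datapoints; "locator" and "ground_truth" are example column names.
df = pd.DataFrame(
    [
        {"locator": "s3://my-bucket/images/0001.jpg", "ground_truth": "cat"},
        {"locator": "s3://my-bucket/images/0002.jpg", "ground_truth": "dog"},
    ]
)

# Create or update the dataset; "locator" is used to link model results to datapoints.
upload_dataset("example-dataset", df, id_fields=["locator"])
```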
## `list_datasets()`

List the names of all uploaded datasets.

Returns:

Type | Description |
---|---|
`List[str]` | A list of the names of all uploaded datasets. |
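A short sketch of listing datasets:

```python
from kolena.dataset import list_datasets

# Print the name of every dataset uploaded to Kolena.
for name in list_datasets():
    print(name)
```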
## `download_dataset(name, *, commit=None, include_extracted_properties=False, filters=None)`

Download an entire dataset given its name.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The name of the dataset. | required |
`commit` | `Optional[str]` | The commit hash for version control. Get the latest commit when this value is `None`. | `None` |
`include_extracted_properties` | `bool` | If `True`, include Kolena-extracted properties from automated extractions in the dataset as separate columns. | `False` |
`filters` | `Optional[Filters]` | [Experimental] Optional filter to specify which datapoints should be downloaded. | `None` |

Returns:

Type | Description |
---|---|
`DataFrame` | A DataFrame containing the specified dataset. |
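A minimal sketch of downloading a dataset; the dataset name and commit hash are illustrative:

```python
from kolena.dataset import download_dataset

# Download the latest commit of the dataset as a pandas DataFrame.
df_dataset = download_dataset("example-dataset")

# Or pin to a specific commit hash (value is illustrative).
df_pinned = download_dataset("example-dataset", commit="abc123")
```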
## `EvalConfig = Optional[Dict[str, Any]]` (module attribute)

User-defined configuration for evaluating results, for example `{"threshold": 7}`.
## `DataFrame = Union[pd.DataFrame, Iterator[pd.DataFrame]]` (module attribute)

A type alias representing a DataFrame, which can be either a pandas DataFrame or an iterator of pandas DataFrames.
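A sketch of the iterator form, useful for batched uploads with `upload_dataset`; the CSV path, chunk size, and `locator` ID field are illustrative:

```python
import pandas as pd

from kolena.dataset import upload_dataset

# pd.read_csv with chunksize yields an iterator of DataFrames,
# which upload_dataset accepts for a batched upload.
df_iterator = pd.read_csv("datapoints.csv", chunksize=10_000)
upload_dataset("example-dataset", df_iterator, id_fields=["locator"])
```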
## `EvalConfigResults`

Bases: `NamedTuple`

Named tuple where the first element (the `eval_config` field) is an evaluation configuration, and the second element (the `results` field) is the corresponding DataFrame of results.
## `ModelEntity`

The descriptor of a model tested on Kolena.
## `download_results(dataset, model, commit=None, include_extracted_properties=False)`

Download results given dataset name and model name.

Concatenate the dataset with results:

```python
import pandas as pd

from kolena.dataset import download_results

df_dp, results = download_results("dataset name", "model name")
for eval_config, df_result in results:
    df_combined = pd.concat([df_dp, df_result], axis=1)
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset` | `str` | The name of the dataset. | required |
`model` | `str` | The name of the model. | required |
`commit` | `Optional[str]` | The commit hash for version control. Get the latest commit when this value is `None`. | `None` |
`include_extracted_properties` | `bool` | If `True`, include Kolena-extracted properties from automated extractions in the datapoints and results as separate columns. | `False` |

Returns:

Type | Description |
---|---|
`Tuple[DataFrame, List[EvalConfigResults]]` | Tuple of the DataFrame of datapoints and a list of `EvalConfigResults`. |
## `upload_results(dataset, model, results, thresholded_fields=None, tags=[], metadata=None)`

Upload the results from a specified model on a given dataset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset` | `str` | The name of the dataset. | required |
`model` | `str` | The name of the model. | required |
`results` | `Union[DataFrame, List[EvalConfigResults]]` | Either a DataFrame or a list of `EvalConfigResults`. | required |
`thresholded_fields` | `Optional[List[str]]` | Optional columns in the result DataFrame containing data associated with different thresholds. | `None` |
`tags` | `List[str]` | Optional list of tags to be associated with the model. | `[]` |
`metadata` | `Optional[Dict[str, Union[StrictInt, StrictFloat, StrictStr, None]]]` | Optional dictionary of string keys to values to be associated with the model. | `None` |

Returns:

Type | Description |
---|---|
`None` | None |
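A minimal sketch of uploading results, first as a plain DataFrame and then under multiple evaluation configurations; the dataset, model, and column names as well as the threshold values are illustrative:

```python
import pandas as pd

from kolena.dataset import EvalConfigResults, upload_results

# Per-datapoint results; "locator" matches the dataset's ID field in this example.
df_results = pd.DataFrame(
    [
        {"locator": "s3://my-bucket/images/0001.jpg", "prediction": "cat", "confidence": 0.93},
        {"locator": "s3://my-bucket/images/0002.jpg", "prediction": "cat", "confidence": 0.41},
    ]
)

# Simplest form: a single DataFrame of results.
upload_results("example-dataset", "example-model", df_results)

# Or upload results under multiple evaluation configurations in one call.
upload_results(
    "example-dataset",
    "example-model-thresholded",
    [
        EvalConfigResults(eval_config={"threshold": 0.5}, results=df_results),
        EvalConfigResults(eval_config={"threshold": 0.75}, results=df_results),
    ],
)
```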
## `get_models(dataset)`

Get all models with results on a given dataset.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset` | `str` | The name of the dataset. | required |

Returns:

Type | Description |
---|---|
`List[ModelEntity]` | A list of models tested on the given dataset. |
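A short sketch of retrieving model descriptors; the dataset name is illustrative:

```python
from kolena.dataset import get_models

# Fetch the descriptor of every model with results on this dataset.
models = get_models("example-dataset")
print(models)
```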
## `upload_dataset_embeddings(dataset_name, key, df_embedding)`

Upload a list of search embeddings for a dataset.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset_name` | `str` | String value indicating the name of the dataset for which the embeddings will be uploaded. | required |
`key` | `str` | String value uniquely corresponding to the embedding vectors. For example, this can be the name of the embedding model along with the column from which the embedding was extracted. | required |
`df_embedding` | `DataFrame` | DataFrame containing ID fields for identifying datapoints in the dataset and the associated embeddings. | required |

Raises:

Type | Description |
---|---|
`NotFoundError` | The given dataset does not exist. |
`InputValidationError` | The provided input is not valid. |
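A hedged sketch of uploading embeddings; the `embedding` column name, vector dimensionality, and key value are all illustrative assumptions:

```python
import numpy as np
import pandas as pd

from kolena.dataset import upload_dataset_embeddings

# One embedding per datapoint; the "embedding" column and 512-dim vectors are illustrative.
df_embedding = pd.DataFrame(
    [
        {"locator": "s3://my-bucket/images/0001.jpg", "embedding": np.random.rand(512)},
        {"locator": "s3://my-bucket/images/0002.jpg", "embedding": np.random.rand(512)},
    ]
)

# The key names the embedding model and the source column (illustrative value).
upload_dataset_embeddings("example-dataset", "clip-vit-b32-locator", df_embedding)
```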
## `Filters`

Filters to be applied on the dataset during the operation. Currently only used as an optional argument in `download_dataset`.

### `datapoint: Dict[str, GeneralFieldFilter] = field(default_factory=dict)` (class attribute, instance attribute)

Dictionary of a field name of the datapoint to the `GeneralFieldFilter` to be applied on the field. In case of nested objects, use `.` as the delimiter to separate the keys. For example, if you have a `ground_truth` column of `Label` type, you can use `ground_truth.label` as the key to query for the class label.
## `GeneralFieldFilter`

Generic representation of a filter on Kolena.

### `value_in: Optional[List[Union[StrictStr, StrictBool]]] = None` (class attribute, instance attribute)

A list of desired categorical values.

### `null_value: Optional[Literal[True]] = None` (class attribute, instance attribute)

Whether to filter for cases where the field has a null value or the field name does not exist.
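A hedged sketch of filtering a download by a categorical field; the import of `Filters` and `GeneralFieldFilter` from this module, the keyword-argument construction, and the `ground_truth.label` field name are assumptions:

```python
from kolena.dataset import Filters, GeneralFieldFilter, download_dataset

# Keep only datapoints whose ground_truth.label is "cat" or "dog" (field name is illustrative).
filters = Filters(
    datapoint={"ground_truth.label": GeneralFieldFilter(value_in=["cat", "dog"])},
)

df_filtered = download_dataset("example-dataset", filters=filters)
```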