# kolena.dataset

Examples: `kolena/examples/dataset`
## `upload_dataset(name, df, *, id_fields=None, commit_tags=None, dataset_tags=None, append_only=False, description=None)`

Create or update a dataset with the contents of the provided DataFrame `df`.

**Updating `id_fields`**

ID fields are used to associate model results (uploaded via `upload_results`) with datapoints in this dataset. When updating an existing dataset, update `id_fields` with caution.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The name of the dataset. | required |
`df` | `Union[DataFrame, Iterator[DataFrame]]` | A DataFrame or an iterator of DataFrames. Provide an iterator to perform a batched upload. | required |
`id_fields` | `Optional[List[str]]` | Optionally specify a list of ID fields that will be used to link model results with the datapoints within a dataset. When unspecified, a suitable value is inferred from the columns of the provided DataFrame. | `None` |
`commit_tags` | `Optional[List[str]]` | Optionally specify a list of tags to associate with the dataset commit. | `None` |
`dataset_tags` | `Optional[List[str]]` | Optionally specify a list of tags to associate with the dataset. | `None` |
`append_only` | `bool` | If `True`, the upload only appends datapoints to the dataset. | `False` |
`description` | `Optional[str]` | Optionally specify the description of the dataset. | `None` |
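A minimal usage sketch based on the documented signature; the dataset name and the `locator` / `ground_truth` column names are illustrative:

```python
import pandas as pd

from kolena.dataset import upload_dataset

# Illustrative datapoints; "locator" and "ground_truth" are example column names.
df = pd.DataFrame(
    [
        {"locator": "s3://my-bucket/images/0001.jpg", "ground_truth": "cat"},
        {"locator": "s3://my-bucket/images/0002.jpg", "ground_truth": "dog"},
    ]
)

# Create or update the dataset; "locator" is used to link model results to datapoints.
upload_dataset("example-dataset", df, id_fields=["locator"])
```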
## `list_datasets()`

List the names of all uploaded datasets.

Returns:

Type | Description |
---|---|
`List[str]` | A list of the names of all uploaded datasets. |
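A short sketch of listing datasets:

```python
from kolena.dataset import list_datasets

# Print the name of every dataset uploaded to Kolena.
for name in list_datasets():
    print(name)
```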
## `download_dataset(name, *, commit=None, include_extracted_properties=False, filters=None)`

Download an entire dataset given its name.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The name of the dataset. | required |
`commit` | `Optional[str]` | The commit hash for version control. Get the latest commit when this value is `None`. | `None` |
`include_extracted_properties` | `bool` | If `True`, include Kolena-extracted properties from automated extractions in the dataset as separate columns. | `False` |
`filters` | `Optional[Filters]` | [Experimental] Optional filter to specify which datapoints should be downloaded. | `None` |

Returns:

Type | Description |
---|---|
`DataFrame` | A DataFrame containing the specified dataset. |
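A minimal sketch of downloading a dataset; the dataset name and commit hash are illustrative:

```python
from kolena.dataset import download_dataset

# Download the latest commit of the dataset as a pandas DataFrame.
df_dataset = download_dataset("example-dataset")

# Or pin to a specific commit hash (value is illustrative).
df_pinned = download_dataset("example-dataset", commit="abc123")
```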
## `EvalConfig = Optional[Dict[str, Any]]` (module attribute)

User-defined configuration for evaluating results, for example `{"threshold": 7}`.
## `DataFrame = Union[pd.DataFrame, Iterator[pd.DataFrame]]` (module attribute)

A type alias representing a DataFrame, which can be either a pandas DataFrame or an iterator of pandas DataFrames.
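A sketch of the iterator form, useful for batched uploads with `upload_dataset`; the CSV path, chunk size, and `locator` ID field are illustrative:

```python
import pandas as pd

from kolena.dataset import upload_dataset

# pd.read_csv with chunksize yields an iterator of DataFrames,
# which upload_dataset accepts for a batched upload.
df_iterator = pd.read_csv("datapoints.csv", chunksize=10_000)
upload_dataset("example-dataset", df_iterator, id_fields=["locator"])
```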
## `EvalConfigResults`

Bases: `NamedTuple`

Named tuple where the first element (the `eval_config` field) is an evaluation configuration, and the second element (the `results` field) is the corresponding DataFrame of results.
## `ModelEntity`

The descriptor of a model tested on Kolena.
## `download_results(dataset, model, commit=None, include_extracted_properties=False)`

Download results given dataset name and model name.

Concatenate the dataset with results:

```python
import pandas as pd

from kolena.dataset import download_results

df_dp, results = download_results("dataset name", "model name")
for eval_config, df_result in results:
    df_combined = pd.concat([df_dp, df_result], axis=1)
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset` | `str` | The name of the dataset. | required |
`model` | `str` | The name of the model. | required |
`commit` | `Optional[str]` | The commit hash for version control. Get the latest commit when this value is `None`. | `None` |
`include_extracted_properties` | `bool` | If `True`, include Kolena-extracted properties from automated extractions in the datapoints and results as separate columns. | `False` |

Returns:

Type | Description |
---|---|
`Tuple[DataFrame, List[EvalConfigResults]]` | Tuple of the DataFrame of datapoints and a list of `EvalConfigResults`. |
## `upload_results(dataset, model, results, thresholded_fields=None, tags=[], metadata=None)`

Upload the results from a specified model on a given dataset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset` | `str` | The name of the dataset. | required |
`model` | `str` | The name of the model. | required |
`results` | `Union[DataFrame, List[EvalConfigResults]]` | Either a DataFrame or a list of `EvalConfigResults`. | required |
`thresholded_fields` | `Optional[List[str]]` | Optional columns in the result DataFrame containing data associated with different thresholds. | `None` |
`tags` | `List[str]` | Optional list of tags to be associated with the model. | `[]` |
`metadata` | `Optional[Dict[str, Union[StrictInt, StrictFloat, StrictStr, None]]]` | Optional dictionary of string keys to values to be associated with the model. | `None` |

Returns:

Type | Description |
---|---|
`None` | None |
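A minimal sketch of uploading results, first as a plain DataFrame and then under multiple evaluation configurations; the dataset, model, and column names as well as the threshold values are illustrative:

```python
import pandas as pd

from kolena.dataset import EvalConfigResults, upload_results

# Per-datapoint results; "locator" matches the dataset's ID field in this example.
df_results = pd.DataFrame(
    [
        {"locator": "s3://my-bucket/images/0001.jpg", "prediction": "cat", "confidence": 0.93},
        {"locator": "s3://my-bucket/images/0002.jpg", "prediction": "cat", "confidence": 0.41},
    ]
)

# Simplest form: a single DataFrame of results.
upload_results("example-dataset", "example-model", df_results)

# Or upload results under multiple evaluation configurations in one call.
upload_results(
    "example-dataset",
    "example-model-thresholded",
    [
        EvalConfigResults(eval_config={"threshold": 0.5}, results=df_results),
        EvalConfigResults(eval_config={"threshold": 0.75}, results=df_results),
    ],
)
```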
## `get_models(dataset)`

Get all models with results on a given dataset.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset` | `str` | The name of the dataset. | required |

Returns:

Type | Description |
---|---|
`List[ModelEntity]` | A list of models tested on the given dataset. |
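A short sketch of retrieving model descriptors; the dataset name is illustrative:

```python
from kolena.dataset import get_models

# Fetch the descriptor of every model with results on this dataset.
models = get_models("example-dataset")
print(models)
```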
## `upload_dataset_embeddings(dataset_name, key, df_embedding)`

Upload a list of search embeddings for a dataset.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset_name` | `str` | String value indicating the name of the dataset for which the embeddings will be uploaded. | required |
`key` | `str` | String value uniquely corresponding to the embedding vectors. For example, this can be the name of the embedding model along with the column from which the embedding was extracted. | required |
`df_embedding` | `DataFrame` | DataFrame containing ID fields for identifying datapoints in the dataset and the associated embeddings. | required |

Raises:

Type | Description |
---|---|
`NotFoundError` | The given dataset does not exist. |
`InputValidationError` | The provided input is not valid. |
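A hedged sketch of uploading embeddings; the `embedding` column name, vector dimensionality, and key value are all illustrative assumptions:

```python
import numpy as np
import pandas as pd

from kolena.dataset import upload_dataset_embeddings

# One embedding per datapoint; the "embedding" column and 512-dim vectors are illustrative.
df_embedding = pd.DataFrame(
    [
        {"locator": "s3://my-bucket/images/0001.jpg", "embedding": np.random.rand(512)},
        {"locator": "s3://my-bucket/images/0002.jpg", "embedding": np.random.rand(512)},
    ]
)

# The key names the embedding model and the source column (illustrative value).
upload_dataset_embeddings("example-dataset", "clip-vit-b32-locator", df_embedding)
```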
## `Filters`

Filters to be applied on the dataset during the operation. Currently only used as an optional argument in `download_dataset`.

### `datapoint: Dict[str, GeneralFieldFilter] = field(default_factory=dict)` (class attribute, instance attribute)

Dictionary of a field name of the datapoint to the `GeneralFieldFilter` to be applied on the field. In case of nested objects, use `.` as the delimiter to separate the keys. For example, if you have a `ground_truth` column of `Label` type, you can use `ground_truth.label` as the key to query for the class label.
## `GeneralFieldFilter`

Generic representation of a filter on Kolena.

### `value_in: Optional[List[Union[StrictStr, StrictBool]]] = None` (class attribute, instance attribute)

A list of desired categorical values.

### `null_value: Optional[Literal[True]] = None` (class attribute, instance attribute)

Whether to filter for cases where the field has a null value or the field name does not exist.
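A hedged sketch of filtering a download by a categorical field; the import of `Filters` and `GeneralFieldFilter` from this module, the keyword-argument construction, and the `ground_truth.label` field name are assumptions:

```python
from kolena.dataset import Filters, GeneralFieldFilter, download_dataset

# Keep only datapoints whose ground_truth.label is "cat" or "dog" (field name is illustrative).
filters = Filters(
    datapoint={"ground_truth.label": GeneralFieldFilter(value_in=["cat", "dog"])},
)

df_filtered = download_dataset("example-dataset", filters=filters)
```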