Setting up Natural Language Search

Kolena supports natural language and similar image search across image data registered to the platform. You can set up this functionality either by enabling automated embedding extraction or by manually extracting and uploading search embeddings using a Kolena-provided package.

Setting up Automated Embedding Extraction

Requirements

This feature is currently supported for Amazon S3 integrations.
Kolena requires access to the content of your images. Read Connecting Cloud Storage: Amazon S3 for more details.
Only account administrators are able to change this setting.

Extracted embeddings allow you to find datapoints using natural language or by similarity to selected datapoints. To enable automated embedding extraction, navigate to "Organization Settings", available from your profile menu at the top right of the screen. Under the "Automations" tab, enable the Automated Embeddings Extraction by Kolena option.

[Figure: Automated Embeddings Extraction setting]

Once this setting is enabled, embeddings for new and edited datapoints in your datasets will be automatically extracted.
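For intuition on how embeddings support similarity search, the sketch below ranks a few toy vectors by cosine similarity to a query vector. This is a generic illustration only: Kolena's internal search implementation is not described in this document, and the function names and the two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (locator, score) pairs sorted from most to least similar to the query."""
    scored = [(locator, cosine_similarity(query, vector)) for locator, vector in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical locators mapped to made-up 2-D embeddings.
toy_embeddings = {
    "s3://bucket/a.jpg": [0.0, 1.0],
    "s3://bucket/b.jpg": [1.0, 0.0],
    "s3://bucket/c.jpg": [1.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.0], toy_embeddings)
```

Real embeddings have many more dimensions, but the principle is the same: datapoints whose vectors point in a similar direction are treated as similar.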

Uploading embeddings manually

If your organization does not allow Kolena access to the images, or you have custom embedding extraction logic, you may upload those embeddings manually to enable Natural Language and Similar Image search on Kolena.

This section walks through the main components of the example below and the steps needed to tailor it to your application.

Example

The Kolena repository includes a code example for extracting and uploading embeddings. It builds on data from the semantic_segmentation example dataset, so ensure the dataset is uploaded to your Kolena environment before running the code example.

Uploading embeddings to Kolena can be done in four simple steps:

Step 1: installing dependency package
Step 2: loading dataset and model to run embedding extraction
Step 3: loading images for input to extraction library
Step 4: extracting and uploading search embeddings

Step 1: Install `kolena-embeddings` Package

The package can be installed via pip or uv and requires your Kolena token, which can be created on the Developer page.

We first retrieve and set our KOLENA_TOKEN environment variable. This is used by the uploader for authentication against your Kolena instance.

export KOLENA_TOKEN="********"
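Before running any upload script, it can help to confirm that the variable is actually visible to Python. The helper below is a minimal sketch; `require_kolena_token` is a hypothetical name and is not part of the Kolena packages.

```python
import os
from typing import Mapping


def require_kolena_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Kolena token from the environment, or raise if it is missing or blank."""
    token = env.get("KOLENA_TOKEN", "").strip()
    if not token:
        raise RuntimeError("KOLENA_TOKEN is not set; export it before installing or uploading.")
    return token
```

Calling `require_kolena_token()` at the top of a script fails fast with a clear message instead of surfacing an authentication error later.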

With pip, run the following command, replacing <KOLENA_TOKEN> with the token retrieved from the Developer page:

pip install --extra-index-url="https://<KOLENA_TOKEN>@gateway.kolena.cloud/repositories" kolena-embeddings

With uv, run the following command, replacing <KOLENA_TOKEN> with the token retrieved from the Developer page:

uv add --extra-index-url="https://<KOLENA_TOKEN>@gateway.kolena.cloud/repositories" kolena-embeddings

This package provides the kembed.util.extract_embeddings method that generates embeddings as a numpy array for a given PIL.Image.Image object.

Step 2: Load Dataset and Model

Before extracting embeddings on a dataset, we need to load the dataset. The dataset seeded in the semantic_segmentation example contains image assets referenced by the locator column, and we load the dataset into a dataframe.

The embedding model and its key are obtained via the load_embedding_model() method.

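import kolena
from kolena.dataset import download_dataset
from kembed.util import load_embedding_model  # import path assumed to match extract_embeddings
kolena.initialize(verbose=True)
df_dataset = download_dataset("coco-stuff-10k")
model, model_key = load_embedding_model()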

Step 3: Load Images for Extraction

To extract embeddings from image data, we must load each image file into a PIL.Image.Image object. In this section, we load these images from an S3 bucket. For other cloud storage services, please refer to your cloud storage's API docs.

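from typing import Iterator, List, Tuple

import boto3
from PIL import Image

# Create the S3 resource once and reuse it for every image.
s3 = boto3.resource("s3")

def load_image_from_accessor(accessor: str) -> Image.Image:
    # Strip the leading "s3://", then split into the bucket name and object key parts.
    bucket_name, *parts = accessor[5:].split("/")
    file_stream = s3.Bucket(bucket_name).Object("/".join(parts)).get()["Body"]
    return Image.open(file_stream)

def iter_image_paths(image_accessors: List[str]) -> Iterator[Tuple[str, Image.Image]]:
    # Yield (locator, image) pairs lazily, loading one image at a time.
    for locator in image_accessors:
        image = load_image_from_accessor(locator)
        yield locator, image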
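The loader above assumes every locator begins with "s3://" and slices off the first five characters. A slightly more defensive way to split a locator is sketched below; `split_s3_locator` is a hypothetical helper, not part of boto3 or the Kolena packages.

```python
from typing import Tuple


def split_s3_locator(locator: str) -> Tuple[str, str]:
    """Split an "s3://bucket/key" locator into (bucket, key), validating the format."""
    prefix = "s3://"
    if not locator.startswith(prefix):
        raise ValueError(f"not an S3 locator: {locator!r}")
    # Everything before the first "/" is the bucket; the remainder is the object key.
    bucket, _, key = locator[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"locator is missing a bucket or key: {locator!r}")
    return bucket, key
```

Raising early on a malformed locator makes it easier to spot bad rows in the dataset before any network request is made.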

Tip

When processing a large number of images, we recommend using an Iterator to limit the number of images loaded into memory at once.
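One common way to bound memory use with an iterator is to consume it in fixed-size batches. The sketch below shows the general pattern using only the standard library; `batched` is a hypothetical helper and does not describe how the Kolena package batches internally.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, pulling from the source lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items, so only one batch is in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the source is only advanced when the next batch is requested, wrapping a generator of images with `batched(..., 50)` keeps at most 50 images loaded at once.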

Step 4: Extract and Upload Embeddings

Once embeddings are extracted for each locator in the dataset, we create a dataframe with embedding and locator columns, and use the upload_dataset_embeddings method to upload the embeddings.

The uploaded dataframe must contain the ID columns of the dataset so that each embedding can be matched against a datapoint in the dataset. In this example, the ID column of the dataset is locator.
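Since matching relies on the ID column, it can be useful to check for embedding rows whose locator does not appear in the dataset before uploading. The following is a minimal sketch using plain Python collections; `find_unmatched_locators` is a hypothetical helper, and how Kolena handles unmatched rows is not covered here.

```python
from typing import Iterable, List


def find_unmatched_locators(
    dataset_locators: Iterable[str],
    embedding_locators: Iterable[str],
) -> List[str]:
    """Return embedding locators missing from the dataset, deduplicated, in first-seen order."""
    known = set(dataset_locators)
    seen = set()
    unmatched = []
    for locator in embedding_locators:
        if locator not in known and locator not in seen:
            unmatched.append(locator)
        seen.add(locator)
    return unmatched
```

With dataframes, the two arguments would typically be the locator columns of the dataset and embeddings dataframes; an empty result means every embedding row has a corresponding datapoint.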

def extract_image_embeddings(
    model: StudioModel,
    locators_and_filepaths: List[Tuple[str, Optional[str]]],
    batch_size: int = 50,
) -> List[Tuple[str, np.ndarray]]:
    """
    Extract a list of search embeddings corresponding to sample locators.
    """
    ...  # signature shown for reference; the body is not reproduced in this document

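# Name of the dataset loaded in Step 2 and the locators of its datapoints.
dataset_name = "coco-stuff-10k"
locators = df_dataset["locator"].tolist()

locator_and_image_iterator = iter_image_paths(locators)
locator_and_embeddings = extract_image_embeddings(model, locator_and_image_iterator)

df_embeddings = pd.DataFrame(locator_and_embeddings, columns=["locator", "embedding"])
upload_dataset_embeddings(dataset_name, model_key, df_embeddings)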

Once the upload completes, visit Datasets, open the dataset, and navigate to the Studio tab to search by natural language or similar images over the corresponding image data.

Conclusion

In this tutorial, we learned how to extract and upload vector embeddings for image data, both automatically and manually.

FAQ

Can I share embeddings with Kolena even if I do not share the underlying images?

Yes!

Embedding extraction is a one-way mapping, and the resulting embeddings are used only for natural language search and similarity comparisons. Uploading these embeddings to Kolena does not allow for any reconstruction of the images, nor does it involve sharing the images with Kolena.