diff --git a/docs/blog/art/clip_index.png b/docs/blog/art/clip_index.png
new file mode 100644
index 00000000..eb34138c
Binary files /dev/null and b/docs/blog/art/clip_index.png differ
diff --git a/docs/blog/posts/2024-03-06|Datacomp_index.md b/docs/blog/posts/2024-03-06|Datacomp_index.md
new file mode 100644
index 00000000..616a5f35
--- /dev/null
+++ b/docs/blog/posts/2024-03-06|Datacomp_index.md
@@ -0,0 +1,115 @@
+---
+date:
+  created: 2024-03-05
+authors:
+  - RobbeSneyders
+  - mrchtr
+---
+
+# Building a Datacomp CLIP index with Fondant
+
+Large (image) datasets are often unwieldy to use due to their sheer size. Assume for instance
+that we would like to extract all the cat images from such a dataset. We would have to look at
+every image to classify whether it's a cat image or not. And if we want to extract all the dog
+images next, we again need to look at every image.
+
+Instead, we can look at every image once and calculate a (CLIP) embedding representing its
+content. By combining these embeddings into an index, we can efficiently search through the
+dataset with a query and find specific images without having to look at each one.
+
+![CLIP index](../art/clip_index.png)
+
+This is what LAION did for their [LAION-5b dataset](https://laion.ai/blog/laion-5b/), which made
+it possible to use the dataset efficiently, as we did in our
+[ControlNet example](https://github.com/ml6team/fondant-usecase-controlnet).
+Unfortunately, the LAION-5b dataset and index have been (temporarily)
+[taken offline](https://laion.ai/notes/laion-maintanence/) and there
+[aren't any alternatives](https://github.com/rom1504/clip-retrieval/issues/324). This is
+why we built an index for the Datacomp-12.8M dataset. While it is a lot smaller than LAION-5b,
+it should already enable a lot of use cases again, and can hopefully be the start of building
+indices for more and larger datasets.
+
+You can access the index directly on the Hugging Face Hub
+[here](https://huggingface.co/datasets/fondant-ai/datacomp-small-clip/blob/main/faiss), or read
+on below to see how to use it with Fondant.
+
+## Using the index
+
+### With Fondant
+
+The easiest way to use the index is with Fondant. Fondant offers reusable operations which
+allow you to query the index with your data, either prompts or embeddings:
+
+- [By prompt](https://fondant.ai/en/latest/components/hub/#retrieve_from_faiss_by_prompt#description)
+- [By embedding](https://fondant.ai/en/latest/components/hub/#retrieve_from_faiss_by_embedding#description)
+
+A minimal sketch of such a pipeline is shown below. To see how it can be used in an end-to-end
+example, check our [ControlNet example](https://github.com/ml6team/fondant-usecase-controlnet),
+which uses the index to create a dataset for fine-tuning a ControlNet model on a specific
+domain.
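+
+The sketch assumes the component and argument names of the current hub components; the exact
+interface may differ between Fondant versions, so check the component documentation linked
+above.
+
+```python
+from fondant.pipeline import Pipeline
+
+pipeline = Pipeline(
+    name="index_retrieval",
+    base_path="./data",  # local working directory for intermediate data
+)
+
+# Read a column of text prompts; the input file is hypothetical.
+prompts = pipeline.read(
+    "load_from_parquet",
+    arguments={"dataset_uri": "path/to/prompts.parquet"},
+)
+
+# Retrieve the urls of matching images from the published CLIP index.
+# The argument names and values are illustrative.
+images = prompts.apply(
+    "retrieve_from_faiss_by_prompt",
+    arguments={
+        "url_mapping_path": "hf://datasets/fondant-ai/datacomp-small-clip/id_mapping",
+        "faiss_index_path": "hf://datasets/fondant-ai/datacomp-small-clip/faiss",
+        "num_images": 2,
+    },
+)
+```
+
+If this is saved as `pipeline.py`, it can then be executed with the Fondant CLI, for example
+with the local runner: `fondant run local pipeline.py`.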
+
+### With Clip-Retrieval
+
+There are other open source tools which allow you to leverage a CLIP index. We recommend
+[clip-retrieval](https://github.com/rom1504/clip-retrieval), which lets you set up a service
+hosting the index, accessible by API.
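+
+Querying such a service from Python could look roughly like the sketch below; the endpoint URL
+and index name are placeholders for your own deployment.
+
+```python
+from clip_retrieval.clip_client import ClipClient
+
+# Connect to a running clip-retrieval service; url and indice_name are
+# placeholders for your own deployment.
+client = ClipClient(
+    url="http://localhost:1234/knn-service",
+    indice_name="datacomp-small",
+    num_images=10,
+)
+
+# Retrieve the most similar images for a text query.
+results = client.query(text="a photo of a cat")
+for result in results:
+    print(result["url"], result["similarity"])
+```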
+
+## Creating the index
+
+We leveraged Fondant to generate the CLIP index and published the pipeline as a
+[git repository](https://github.com/ml6team/fondant-clip-index). The pipeline consists of four
+steps:
+
+- A [`load_from_hf_hub`](https://fondant.ai/en/stable/components/hub/#load_from_hf_hub#description)
+  operation that loads the
+  [datacomp_small](https://huggingface.co/datasets/mlfoundations/datacomp_small) dataset from
+  the Hugging Face Hub into the Fondant workspace and format.
+- A [`download_images`](https://fondant.ai/en/stable/components/hub/#download_images#description)
+  operation which downloads the actual images from the urls in the dataset.
+- An [`embed_images`](https://fondant.ai/en/stable/components/hub/#embed_images#description)
+  operation which embeds the downloaded images using a CLIP model.
+- A [`write_to_file`](https://fondant.ai/en/stable/components/hub/#write_to_file#description)
+  operation which writes the original urls and generated embeddings to the chosen destination.
+
+After running the pipeline, we used [`autofaiss`](https://github.com/criteo/autofaiss) to build
+the CLIP index, as sketched below.
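+
+For reference, a minimal `autofaiss` invocation could look as follows; the paths are
+placeholders and the memory budgets should be tuned to your machine.
+
+```python
+from autofaiss import build_index
+
+# Build a FAISS index from the embeddings written by the pipeline.
+# All paths and memory budgets below are placeholders.
+build_index(
+    embeddings="output/embeddings",  # directory of .npy embedding files
+    index_path="output/knn.index",  # where the built index is written
+    index_infos_path="output/index_infos.json",  # stats about the index
+    max_index_memory_usage="4G",
+    current_memory_available="16G",
+)
+```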
+
+## Execution details
+
+### Download images
+
+We downloaded the images with 32 cores in parallel, each opening up to 25 concurrent
+connections, and achieved a success rate of 72%, resulting in 9,251,172 images.
+
+The downloading was executed on a VM on GCP using the Fondant Docker runner. We originally
+planned to run it on Vertex AI, but moved to a VM after noticing lower network bandwidth on
+Vertex.
+
+The success rate can probably be improved further by setting up a faster DNS resolver.
+
+### Embed images
+
+We leveraged the
+[`laion/CLIP-ViT-B-32-laion2B-s34B-b79K`](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K)
+CLIP model. We chose this model for a couple of reasons: it is popular, which makes it easy to
+use with existing embeddings; it is small, which makes it cheap to run; and it is an open model
+trained on open data.
+
+We appreciate any feedback on our choice of model, so we can take it into account if we
+generate indices for larger datasets in the future.
+
+The embedding was executed on 4 T4 GPUs on Google Cloud using our Vertex AI runner, with a
+batch size of 32. The execution took 8 hours and 15 minutes.
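+
+Since the index stores embeddings from this model, any query embedded with the same model can
+be searched against it directly. The sketch below queries the index with `faiss` and
+`transformers`; it assumes the index file has been downloaded locally as `faiss`, mirroring
+the file name in the dataset repository.
+
+```python
+import faiss
+import torch
+from transformers import CLIPModel, CLIPTokenizerFast
+
+MODEL_ID = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
+model = CLIPModel.from_pretrained(MODEL_ID)
+tokenizer = CLIPTokenizerFast.from_pretrained(MODEL_ID)
+
+# Load the index downloaded from the Hugging Face Hub.
+index = faiss.read_index("faiss")
+
+# Embed a text query with the same CLIP model and normalize it, then
+# search the index for the nearest image embeddings.
+inputs = tokenizer(["a photo of a cat"], return_tensors="pt")
+with torch.no_grad():
+    text_embedding = model.get_text_features(**inputs)
+text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
+
+distances, ids = index.search(text_embedding.numpy(), 10)
+print(ids)  # positions of the 10 nearest images in the dataset
+```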
+
+## What's next
+
+### Making data building collaborative
+
+With Fondant we aim to make data building collaborative, and we will share more features built
+on top of the Datacomp datasets to showcase this in the future. To stay up to date, join our
+[Discord](https://discord.gg/HnTdWhydGp).
+
+### Larger datasets
+
+Based on the popularity of this 12.8M index and the feedback we receive, we might generate a
+CLIP index for the datacomp-128M dataset. If there are other datasets you are interested in,
+or you want to generate an index for a different dataset yourself, please let us know in our
+[Discord](https://discord.gg/HnTdWhydGp).
diff --git a/docs/overrides/main.html b/docs/overrides/main.html
index aadbc5d1..00f6dbf5 100644
--- a/docs/overrides/main.html
+++ b/docs/overrides/main.html
@@ -3,8 +3,8 @@
 {% block announce %}
 
-    Let's tune RAG pipelines with Fondant.
+    We generated a CLIP index for the datacomp-12.8M dataset. Check out our recent blog post!
+    style="color: white; text-decoration: underline">Learn how you can use it!
 
 {% endblock %}
\ No newline at end of file