Cross-Lingual Image Search Engine

This repo contains an end-to-end implementation of an image search engine application. It is a rough simulation of a real world implementation and contains various modules, which are as follows

image-producer
image-consumer
FastAPI server
React web app

Architecture diagram

Images from Flickr are extracted using an API key and the search API provided by the Flickr team. These images along with some additional metadata (also extracted from Flickr) are converted to producer records and pushed to Kafka.

These records are then processed by a spark streaming application (at a 1-min interval) to extract the embeddings of the images. This embedding alongside the metadata are then written into elasticsearch. ElasticSearch provides vector KNN search capabilities, along with pagination and regular text based search.

Kafka allows for decoupling of the image data extraction from Flickr and the spark processing, this allows for handling Flickr APIs' ratelimiting and spark's resource related issues. This allows us to limit our spark cluster to be in a cost-efficient size.

The CLIP model's artifact files are managed using mlflow, this allows for better versioning of the models and enables model registry capabilities. Allowing for efficient model usage, versioning and management, across the consumer and server applications.

The backend server is a lightweight FastAPI based server, which exposes 2 endpoints to its clients.

/text_search - takes a text phrase
/image_search - takes an image url or base64 image string

and returns the top K matching responses stored in elasticsearch. These are then displayed in the KNN score order on the frontend.

Demo

Below is an example of the results when searching with phrase playing dogs.

OpenAI's CLIP model has been trained on multiple languages and hence it has the capability to infer text from various languages. Below is an example when using the german translated phrase for playing dogs which is spielende Hunde.

As you can see the results are very similar to when an english phrase was used. This demonstrates the multilingual capability of CLIP model.

The results as a whole display the capability of the model to carry out cross domain tasks across text and images.

Tech stack used

Confluent Kafka
ElasticSearch
Pyspark streaming
FastAPI
HuggingFace transformers
Reactjs

Please consider staring the repo if you find this repo to be useful.

Feel free to connect with me on LinkedIn.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Cross-Lingual Image Search Engine

Architecture diagram

Demo

Tech stack used

Files

README.md

Latest commit

History

README.md

File metadata and controls

Cross-Lingual Image Search Engine

Architecture diagram

Demo

Tech stack used