Commit 75f0f40 — Add vector search tutorial (#1236)
Authored by oliverhowell, Aug 2, 2024 (1 parent: fc497ea)
Showing 3 changed files with 190 additions and 0 deletions.
189 changes: 189 additions & 0 deletions docs/modules/data-structures/pages/vector-search-tutorial.adoc
= Vector search tutorial
:description: This tutorial guides you through using Hazelcast Enterprise to build an image search system.
:page-enterprise: true
:page-beta: true

This tutorial shows you how to use {enterprise-product-name} to build an image search system. This solution uses the https://huggingface.co/sentence-transformers/clip-ViT-B-32[CLIP sentence transformer] to map images
and text onto a shared 512-dimensional vector space.

This tutorial uses:

* A Hazelcast pipeline that consumes unstructured data (images), computes
embeddings using Python, and stores them as vectors in a Hazelcast Enterprise `VectorCollection` data structure.
* A Jupyter notebook that implements text-based image searching using
a Hazelcast Python client.

The ingestion pipeline has the following high-level components:

. A directory watcher, which detects the arrival of new images and creates an event
containing the name of each new image.
. A `mapUsingPython` stage in which images are retrieved and converted into
vectors using the previously mentioned CLIP sentence transformer.
. A sink which stores the image vectors, along with their URLs, in
a Hazelcast `VectorCollection`.

The diagram below shows you how the components fit together and the processing steps each component performs.

image:TutorialBlueprint.gif[Tutorial Blueprint]
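As an illustrative sketch only (in Python rather than the tutorial's Java Pipeline API), each stage above can be thought of as a function over events. The `embed` callable and the URL prefix below are placeholders, not part of the tutorial code:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ImageEvent:
    """Event emitted by the directory watcher: just the image's name."""
    name: str

@dataclass
class ImageVector:
    """What the sink stores: the image URL plus its embedding."""
    url: str
    vector: List[float]

def process_event(event: ImageEvent,
                  embed: Callable[[str], List[float]],
                  base_url: str = "http://web-server/") -> ImageVector:
    # Mirrors the mapUsingPython stage: turn an image event into a vector
    # paired with the URL where the image can be retrieved.
    return ImageVector(url=base_url + event.name, vector=embed(event.name))

# Stub embedder standing in for CLIP; the real model emits 512 dimensions.
fake_embed = lambda name: [0.0] * 512
result = process_event(ImageEvent(name="cat.jpg"), fake_embed)
```

In the real pipeline, `embed` is the CLIP model invoked from the `mapUsingPython` stage, and the result is written to the `VectorCollection` by the sink.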

== Prerequisites

To complete this tutorial, you will need the following:

* https://www.oracle.com/java/technologies/downloads/[Java Developer Kit] 17 or later
* A Java IDE (we suggest https://www.jetbrains.com/idea/[IntelliJ IDEA])
* https://www.docker.com/products/docker-desktop/[Docker Desktop]
* A Hazelcast Enterprise license key with "Advanced AI" enabled
** https://hazelcast.com/get-started/?utm_source=docs-website[Get a Hazelcast Enterprise trial license.]

You will also need basic knowledge of both Java and Python to complete the
hands-on sections in this tutorial.

[NOTE]
====
This tutorial environment downloads several Python packages and Docker
images. You will need a good internet connection to run it.
====


== Pipeline References

This tutorial makes use of the Hazelcast Pipeline API. If you are not familiar with the structure of a pipeline, refer to the links below.

* https://docs.hazelcast.com/hazelcast/latest/pipelines/overview
* https://docs.hazelcast.org/docs/latest/javadoc/com/hazelcast/jet/pipeline/StreamStage.html

== Tutorial Setup

. Download the GitHub repo for this tutorial: https://github.com/hazelcast-guides/hazelcast-image-search

. Download the CLIP model
+
```sh
docker compose run download-model
```
+
The model used to perform the embedding is almost 500 MB. Downloading it
ahead of time speeds up everything that uses it.

. Verify that the _models_ folder of the project has been populated.

. Install Hazelcast license
+
This Docker Compose project is configured to read the license from
the default Docker Compose property file, _.env_.
+
Create _.env_ (note the file name begins with a _dot_) in the project base
directory. Set the _HZ_LICENSEKEY_ variable to your license, as shown below.
+
```sh
HZ_LICENSEKEY=Your-License-Here
```

== Create a `VectorCollection`

. Review the `VectorCollection` configuration in the file `hazelcast.yaml`.
+
```yaml
hazelcast:
properties:
hazelcast.logging.type: log4j2
hazelcast.partition.count: 13

jet:
enabled: True
resource-upload-enabled: True

vector-collection:
images:
indexes:
- name: semantic-search
dimension: 512
metric: COSINE


```
+
* `hazelcast.partition.count`: Vector search performs better with fewer partitions. On the other hand, fewer partitions mean larger partitions, which can cause problems during migration. For a discussion of the tradeoffs, see https://docs.hazelcast.com/hazelcast/latest/data-structures/vector-search-overview#partition-count-impact.
* `jet`: This is the Hazelcast stream processing engine. Hazelcast pipelines are a scalable way to rapidly ingest or process large amounts of data. This example uses a pipeline to compute embeddings and load them into a vector collection, so stream processing must be enabled.
* `vector-collection`: If you are using a vector collection, you must configure the index settings; there are no defaults. In this case, the collection is named `images` and it has one index, called `semantic-search`. The `dimension` and `metric` depend on the embedding being used: the `dimension` must match the size of the vectors produced by the embedding, and the `metric` defines the algorithm used to compute the distance between two vectors, so it must match the one used to train the embedding. This tutorial uses the CLIP sentence transformer, which uses a dimension of 512 and the cosine distance metric (literally the cosine of the angle between two vectors, adjusted to be non-negative). For more detail on supported options, see xref:data-structures:vector-collections.adoc[].
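To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity and one common way of shifting it into a non-negative range. This is an illustration only; the exact scaling Hazelcast applies internally may differ:

```python
import math

def cosine(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_score(a, b):
    # Shift the raw cosine from [-1, 1] into [0, 1] so it is non-negative.
    return (1.0 + cosine(a, b)) / 2.0
```

Identical directions score 1.0, orthogonal vectors 0.5, and opposite directions 0.0 under this normalization.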

. Start the tutorial environment.
+
```sh
docker compose up -d
```
+
This launches Hazelcast Platform, Hazelcast Management Center, and the web server. Hazelcast Management Center is accessible at http://localhost:8080.

. Using your Java IDE, open `ImagesIngestPipeline.java` in the `image-ingest-pipeline` module. Follow the guidance and instructions in the file.

. Deploy the pipeline
+
.. Build the project: `mvn clean package`
.. Deploy the pipeline: `docker compose run submit-image-loader`
.. Monitor the logs: `docker compose logs --follow hz`
.. Check the job status: open Hazelcast Management Center, navigate
to *Stream Processing > Jobs*, and select the image ingestion job.
+
[NOTE]
====
Once you have deployed the pipeline, it can take a while (up to 5 minutes) for the status to change from *Starting* to *Running* because Hazelcast has to download and install many Python packages to support the embedding. You will see something like the following in the Hazelcast logs when the Python stream stage has initialized.
```bash
hazelcast-image-search-hz-1 | 2024-07-17 19:18:41,881 [ INFO] [hz.magical_joliot.cached.thread-7] [c.h.j.python]: [172.25.0.3]:5701 [dev] [5.5.0] Started Python process: 246
hazelcast-image-search-hz-1 | 2024-07-17 19:18:41,881 [ INFO] [hz.magical_joliot.cached.thread-3] [c.h.j.python]: [172.25.0.3]:5701 [dev] [5.5.0] Started Python process: 245
hazelcast-image-search-hz-1 | 2024-07-17 19:18:43,786 [ INFO] [hz.magical_joliot.cached.thread-7] [c.h.j.python]: [172.25.0.3]:5701 [dev] [5.5.0] Python process 246 listening on port 39819
hazelcast-image-search-hz-1 | 2024-07-17 19:18:43,819 [ INFO] [hz.magical_joliot.cached.thread-3] [c.h.j.python]: [172.25.0.3]:5701 [dev] [5.5.0] Python process 245 listening on port 39459
```
====
. Copy some images from the `images` folder into the `www` folder. Check the job status in Management Center. You will see a new pipeline event for each image.
+
[NOTE]
====
A solution pipeline is available in the
`hazelcast.platform.labs.image.similarity.solution` package. You can also choose to bypass building the pipeline and directly deploy the solution by running
`docker compose run submit-image-loader-solution`
====



== Perform a Nearest Neighbor Search

You need to use a Jupyter notebook for the remaining steps.

. The Jupyter server is already running inside Docker. Retrieve its access URL from the container logs.
+
```sh
docker compose logs jupyter
```
+
You will see output similar to the following:
+
```sh
hazelcast-image-search-jupyter-1 | [C 2024-07-17 19:57:47.478 ServerApp]
hazelcast-image-search-jupyter-1 |
hazelcast-image-search-jupyter-1 | To access the server, open this file in a browser:
hazelcast-image-search-jupyter-1 | file:///root/.local/share/jupyter/runtime/jpserver-1-open.html
hazelcast-image-search-jupyter-1 | Or copy and paste one of these URLs:
hazelcast-image-search-jupyter-1 | http://localhost:8888/tree?token=7a4d2794d4135eaa88ee9e9642e80e7044cb5c213717e2be
hazelcast-image-search-jupyter-1 | http://127.0.0.1:8888/tree?token=7a4d2794d4135eaa88ee9e9642e80e7044cb5c213717e2be
```

. Copy the URL from the output and paste it into a browser window. This will bring up a Jupyter notebook. Double-click on the "Hazelcast Image Similarity" notebook to open it and follow the directions there.
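Conceptually, the notebook embeds the query text with the same CLIP model and asks the collection for the vectors nearest to it. A brute-force sketch of that ranking is shown below (a `VectorCollection` answers this with its configured index rather than a full scan; the item names and 2-dimensional vectors are made up for illustration):

```python
import math

def top_k(query, items, k=3):
    """Rank (url, vector) pairs by cosine similarity to the query vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)
    # Sort all items by similarity, most similar first, and keep the top k.
    return sorted(items, key=lambda item: cos(query, item[1]), reverse=True)[:k]

items = [("a.jpg", [1.0, 0.0]), ("b.jpg", [0.0, 1.0]), ("c.jpg", [0.7, 0.7])]
best = top_k([1.0, 0.1], items, k=2)
```

The query vector points almost exactly along `a.jpg`, so that image ranks first, with the diagonal `c.jpg` second.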

== Summary

You should now be able to load unstructured data into a Hazelcast vector
collection and perform similarity searches.

== Known Issues

. If an image is removed from the `www` directory, it is not removed from the vector collection. This is because the underlying Java `WatchService` is not detecting the delete events.
. If too many images are dumped into `www` at the same time, the pipeline will break with a 'grpc max message size exceeded' message. The solution can safely handle 200-250 images at the same time. This is a known issue with the Python integration that will be addressed in a future release.
. Deploying the pipeline can take 2-10 minutes depending on your internet connection. This is due to the need to download many Python packages.
. Check the xref:release-notes:5.5.0.adoc[] for any additional known issues with Vector Search.
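Until the message-size issue above is resolved, one workaround is to copy images into `www` in batches. A minimal sketch of a batching helper (the 200 figure comes from the limit noted in the known issues; the helper itself is not part of the tutorial code):

```python
def batches(paths, batch_size=200):
    # Yield successive slices so no more than batch_size images
    # land in www/ at once, staying under the gRPC message limit.
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

# 450 hypothetical image names split into chunks of at most 200.
chunks = list(batches([f"img{i}.jpg" for i in range(450)], batch_size=200))
```

Each chunk can then be copied and given time to drain through the pipeline before the next one is copied.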
1 change: 1 addition & 0 deletions docs/modules/data-structures/partials/nav.adoc
*** xref:data-structures:cardinality-estimator-service.adoc[]
*** xref:data-structures:vector-collections.adoc[Vector Collection]
**** xref:data-structures:vector-search-overview.adoc[Data Structure Design]
**** xref:data-structures:vector-search-tutorial.adoc[Vector search tutorial]
