This is a simple example of vector similarity search using DataStax Astra DB. This repo aims to walk the line between providing a simplified example that is not overwhelmingly complex and still illustrating the key steps you'll need to take to solve real-world vector similarity use cases. There are four key use cases that this repo will illustrate:
- How to create a vector-enabled collection using AstraPy
- How to generate real embeddings using the HuggingFace transformers library and the jinaai/jina-embeddings-v2-base-en model
- How to insert vectorized embeddings and relevant text into your vectorized collection
- How to perform a similarity search and work with the fields that are returned from your query.
While this repo does not use the OpenAI libraries to call an LLM, the patterns here are applicable for building RAG use cases. As such, the content that we are chunking and building embeddings for is contained in a text document located in towns/shadowfen.txt. This text file was generated by ChatGPT and describes many aspects of a fictional town in a fantasy setting. We selected this content because it's fictional and not something that ChatGPT has been trained on. If you ask ChatGPT about the fictional town of Shadowfen, it will tell you it's not a real place and that it doesn't have any information about it. Therefore, it is easy for you to leverage the content in this repository and extend it to build a RAG application if that is your goal. The output from astra_query.py is a set of questions about Shadowfen and the most relevant chunks of content from the text file, so you can copy and paste the output directly into ChatGPT to get an idea of how well the content helps answer questions, then move on to an API-based implementation if you so choose.
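If you would rather assemble the text you paste into ChatGPT programmatically, the idea can be sketched as below. This is a minimal illustration and not code from this repo: the function name build_prompt and the wording of the instruction line are assumptions made for the example.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine a question and its retrieved chunks into one prompt string.

    build_prompt is a hypothetical helper, not part of this repo.
    """
    # Number each chunk so an answer can refer back to its source.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt(
        "Who is Eldermarsh Thorne?",
        ["The current High Druid of the Marshbinders."],
    ))
```

The same string could later be sent through an API call instead of being pasted by hand.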
- Create a DataStax Astra account - https://astra.datastax.com
- Create a vector database within Astra
- Get a database access token for your database using the Astra UI (see the connect tab for your database)
- Get the API endpoint for your database (it should have a form like https://{uuid}-{region}.apps.astra.datastax.com/api/json)
Clone the repo and create a Python virtual environment:
python -m venv myenv
Then activate it (Mac):
source myenv/bin/activate
Or activate it on Windows:
myenv\Scripts\activate
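Use pip to install the dependencies we need (if one is missing, please open a PR). The json and uuid modules are part of the Python standard library, so they do not need to be installed:
pip install astrapy transformers torch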
Set the following three environment variables (macOS syntax shown; adjust for your environment):
export ASTRA_DB_API_ENDPOINT={Replace with your API endpoint}
export ASTRA_DB_APPLICATION_TOKEN={Replace with your token}
export ASTRA_DB_KEYSPACE={your selected keyspace or don't set this to use the default keyspace}
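A script can read these variables and fail early when a required one is missing. The sketch below uses only the standard library; the function name load_astra_settings and the returned dictionary keys are assumptions for illustration, not necessarily how the scripts in this repo do it.

```python
import os
from typing import Mapping, Optional

# The endpoint and token are required; the keyspace is optional.
REQUIRED = ("ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def load_astra_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the Astra settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_endpoint": env["ASTRA_DB_API_ENDPOINT"],
        "token": env["ASTRA_DB_APPLICATION_TOKEN"],
        # None means "use the default keyspace".
        "keyspace": env.get("ASTRA_DB_KEYSPACE") or None,
    }
```

Passing a plain dictionary instead of relying on os.environ makes the helper easy to try out without touching your shell.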
From the root directory of the repo, start by executing:
python ./astra_create.py
You can use the CQL Console in the Astra DB UI to verify that the table was created.
If you used the default keyspace, you can simply issue the statement:
desc town_content;
If you used a different keyspace, you will need to execute the use command first. For example, if you used a keyspace called dnd, you would execute the statements:
use dnd;
desc town_content;
or fully qualify the table name:
desc dnd.town_content;
You should get a response back on the CQL console that looks something like this:
token@cqlsh> desc dnd.town_content;
CREATE TABLE dnd.town_content (
key frozen<tuple<tinyint, text>> PRIMARY KEY,
array_contains set<text>,
array_size map<text, int>,
doc_json text,
exist_keys set<text>,
query_bool_values map<text, tinyint>,
query_dbl_values map<text, decimal>,
query_null_values set<text>,
query_text_values map<text, text>,
query_timestamp_values map<text, timestamp>,
query_vector_value vector<float, 768>,
tx_id timeuuid
) WITH additional_write_policy = '99p'
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.UnifiedCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair = 'BLOCKING'
AND speculative_retry = '99p';
CREATE CUSTOM INDEX town_content_array_contains ON dnd.town_content (values(array_contains)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_array_size ON dnd.town_content (entries(array_size)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_exists_keys ON dnd.town_content (values(exist_keys)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_query_bool_values ON dnd.town_content (entries(query_bool_values)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_query_dbl_values ON dnd.town_content (entries(query_dbl_values)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_query_null_values ON dnd.town_content (values(query_null_values)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_query_text_values ON dnd.town_content (entries(query_text_values)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_query_timestamp_values ON dnd.town_content (entries(query_timestamp_values)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX town_content_query_vector_value ON dnd.town_content (query_vector_value) USING 'StorageAttachedIndex' WITH OPTIONS = {'similarity_function': 'cosine'};
At this point we have a collection called town_content. We can now run the astra_insert.py script to chunk up our content text file, generate embeddings of the chunks using HuggingFace, and insert them into our collection.
To do this, execute the astra_insert.py script from the root directory of the repo:
python ./astra_insert.py
If this script executes successfully, you will see several batches of documents successfully inserted and the UUIDs for each one returned to the console.
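To make the batching idea concrete, here is a minimal sketch of pairing chunks with their embeddings and splitting the resulting documents into batches. It is an illustration under assumptions, not the code in astra_insert.py: the helper names, the content field name, and the batch size of 20 are hypothetical. The $vector field is where the Astra Data API expects a document's embedding.

```python
import uuid
from typing import Iterator


def build_documents(chunks: list[str], vectors: list[list[float]]) -> list[dict]:
    """Pair each chunk with its embedding and a fresh UUID (hypothetical helper)."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    return [
        # "content" is an assumed field name; "$vector" holds the embedding.
        {"_id": str(uuid.uuid4()), "content": chunk, "$vector": list(vector)}
        for chunk, vector in zip(chunks, vectors)
    ]


def batched(items: list[dict], size: int = 20) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items (size is an assumption)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch would then be handed to whatever insert call your version of AstraPy provides.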
You can verify this on the CQL Console in Astra as well. Since our collection is small, you can count the rows and ignore the warning that you will see:
token@cqlsh> select count(*) from dnd.town_content;
count
-------
121
(1 rows)
Warnings :
Aggregation query used without partition key
Again, you can replace dnd above with your keyspace or leave it out if you're using the default keyspace.
At this point, we have the chunks of our document stored in the collection along with their embeddings. The astra_query.py script has an array of several queries about Shadowfen and will retrieve the most relevant results based on a similarity search of each query. You can modify this script to ask different questions to see which chunks of the document are returned. Note that the chunking algorithm used here is fairly naive: it simply chunks at the paragraph level. As a next step, you may want to switch to a recursive chunking algorithm or an adjacent-sentences algorithm to see how your results change. These implementations are outside the scope of this project, but keep them in mind when assessing the quality of the results you retrieve; there are better chunking algorithms out there.
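For reference, paragraph-level chunking can be sketched in a few lines. This is an assumed implementation that splits on blank lines, which may differ in detail from what astra_insert.py does; the function name chunk_by_paragraph is hypothetical.

```python
import re


def chunk_by_paragraph(text: str) -> list[str]:
    """Split text into paragraphs on blank lines, dropping empty chunks."""
    # A paragraph break is a newline, optional whitespace, then another newline.
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if p.strip()]


if __name__ == "__main__":
    with open("towns/shadowfen.txt", encoding="utf-8") as handle:
        chunks = chunk_by_paragraph(handle.read())
    print(f"{len(chunks)} chunks")
```

Swapping in a different chunker only requires replacing this one function, which makes it easy to compare retrieval results.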
To run the similarity search, execute the astra_query.py script from the root directory of the repo:
python ./astra_query.py
If this runs successfully, you'll see each query printed to the console, followed by the two most similar chunks that we inserted in the previous step, each with its similarity score. This will look something like this:
Who is Eldermarsh Thorne?
Eldermarsh Thorne**: The current High Druid of the Marshbinders, Thorne is a figure of awe and a little fear, given his reputed ability to converse with the swamp itself. With his staff carved from a blackened root and his cloak of woven reeds, he is the embodiment of the swamp's enigmatic nature.
0.9189216
High Druid of the Marshbinders**: The leader of the druid circle, currently Eldermarsh Thorne, holds a permanent seat on the council. They represent the interests of the natural world and ensure that the town's actions are in harmony with the swamp.
0.89715797
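To build intuition for how chunks get ranked, the sketch below computes cosine similarity locally and keeps the two best matches. It is only an illustration: Astra performs the search server-side using its vector index, and the scores it reports may be scaled differently, so do not expect identical numbers. The function names and the content field are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], documents: list[dict], k: int = 2) -> list[tuple]:
    """Rank documents by similarity to the query and keep the best k."""
    scored = [
        (cosine_similarity(query_vector, doc["$vector"]), doc["content"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is fine for a hundred or so chunks, but it does not scale the way an indexed search does.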
Again, you can play with the queries array inside of astra_query.py to see what results you get for any queries you wish.
# Preparing a list of queries about the town Shadowfen
queries = [
"What are the locations within Shadowfen?",
"Who is Eldermarsh Thorne?",
"Who is Brom Stoutfist?",
"What is The Gloomwater Brewery?",
"What is the terrain like surrounding Shadowfen?",
"Who created Shadowfen?",
"What is the climate of Shadowfen?",
"What is the population of Shadowfen?",
"What is the history of Shadowfen?",
"Where can I get a drink in Shadowfen?",
]
You may want to skim through towns/shadowfen.txt as well to get ideas for questions and to check whether the chunks you expect are actually returned.
Please open an issue with any feedback or bugs you find.