
Introduction of Metadata filtering & additional minor requested features #4

Open · wants to merge 18 commits into main
Conversation

@marcel-vesely-kinit (Collaborator) commented Dec 30, 2024

This is a belated Christmas present by yours truly :)

Main changes

This is a rather large PR containing various modifications to the Semantic Search service, namely:

  • Incorporation of Metadata filtering logic
    • Extraction of metadata from assets (for now we only support extracting data from HuggingFace datasets)
    • User query parsing: retrieving filters that can be used to narrow down the list of retrieved assets
      • Automatic user query parsing utilizing an LLM (LLM inference is run via the Ollama service)
      • Manually defined filters (filters are defined in the POST request, so LLM inference is skipped)
  • Small changes to the API endpoints
    • Expanded the endpoints with pagination support and an option to retrieve entire AIoD assets directly (not only asset IDs)
    • Added blocking endpoints that wait until the results of a user query are retrieved
    • Endpoints also return a num_hits field representing the number of assets in the entire Milvus DB that match a specific set of user criteria
  • Simplified deployment process
    • Instead of maintaining a ton of docker-compose files that differ only in how they combine services, we utilize a Jinja2 template to build, on the fly, a docker-compose file containing exactly the services we wish to include (see the sketch after this list)
  • Augmented garbage collector logic
    • Recurring (monthly) Milvus DB garbage collector logic -> delete embeddings associated with old, outdated assets (this functionality has already been implemented)
    • Recurring (daily) TinyDB garbage collector logic -> delete expired queries
      • User queries have an expiration date: one hour after being resolved
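Here is a minimal sketch of the Jinja2 templating approach mentioned above; the template path and variable names (use_llm, use_gpu) are illustrative assumptions, and the real logic lives in api/scripts/build_compose.py.

```python
# Minimal sketch of rendering a docker-compose file from a Jinja2 template.
# Template path and variables are assumed for illustration only.
from jinja2 import Environment, FileSystemLoader


def build_compose(use_llm: bool, use_gpu: bool, out: str = "docker-compose.yml") -> None:
    env = Environment(loader=FileSystemLoader("."), trim_blocks=True, lstrip_blocks=True)
    template = env.get_template("docker-compose.template.yml")
    # Conditional blocks in the template decide which services (e.g. Ollama)
    # end up in the generated compose file.
    with open(out, "w") as f:
        f.write(template.render(use_llm=use_llm, use_gpu=use_gpu))


if __name__ == "__main__":
    build_compose(use_llm=True, use_gpu=False)
```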

Metadata filtering

Metadata filtering functionality consists of two main processes that need to be taken care of:
a) extracting and storing metadata associated with individual assets
b) extracting filters/conditions found in the user query (user query parsing)

Extracting of metadata from assets

Currently, this process is applicable only to a specific subset of assets whose metadata we can extract manually, HuggingFace datasets to be exact. Since LLM-based extraction would be very time-demanding, we have opted to perform the metadata extraction manually, without any use of LLMs, for now.

Since datasets are the most prevalent asset type on the AIoD platform, we have decided to apply metadata filtering to this asset type first, restricting ourselves to HuggingFace datasets, as they share a common metadata structure from which a handful of metadata fields can be retrieved and stored in the Milvus database.
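As a rough illustration of this manual extraction, the sketch below pulls a few filterable fields out of an HF dataset card dict; the chosen fields are assumptions based on the common HF card layout, and the real logic lives in api/services/inference/text_operations.py.

```python
# Hedged sketch of manual (no-LLM) metadata extraction from a HuggingFace
# dataset card. Card keys follow the common HF card YAML convention; the
# selected fields are an assumption, not necessarily what the service stores.
from typing import Any


def extract_dataset_metadata(card_data: dict[str, Any]) -> dict[str, Any]:
    def as_list(value: Any) -> list:
        # HF cards sometimes use a scalar where a list is expected.
        return value if isinstance(value, list) else [value] if value else []

    return {
        "languages": as_list(card_data.get("language")),
        "licenses": as_list(card_data.get("license")),
        "task_types": as_list(card_data.get("task_categories")),
        "size_categories": as_list(card_data.get("size_categories")),
    }
```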

Extracting of filters from user queries

Since we wish to minimize the costs and increase the computational/time efficiency of serving an LLM, we have opted for a rather small LLM (Llama 3.1 8B). Due to its size, performing more advanced tasks or multiple tasks at once can often lead to incorrect results. To mitigate this to a certain degree, we have divided the user query parsing process into 2-3 LLM steps that dissect and scrutinize the user query at different levels of granularity. The LLM steps are:

  • STEP 1: Extraction of natural language conditions from the query (extraction of spans, each representing a condition/filter associated with a specific metadata field)
  • STEP 2: Analysis and transformation of each natural language condition (a span from the user query) into a structured representation of the condition; each condition is processed separately
  • [Optional STEP 3]: Further validation of the transformed value against a list of permitted values for a particular metadata field
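To illustrate, the output schemas for the first two steps might look roughly like this; the actual Pydantic classes in api/services/inference/llm_query_parsing.py may be named and shaped differently.

```python
# Illustrative output schemas for the first two LLM steps; names and fields
# are assumptions, not the actual classes from llm_query_parsing.py.
from pydantic import BaseModel


class NaturalLanguageCondition(BaseModel):
    """STEP 1: one extracted span per condition found in the user query."""
    metadata_field: str  # e.g. "languages"
    condition_span: str  # e.g. "datasets that are in English or French"


class StructuredCondition(BaseModel):
    """STEP 2: one span transformed into a machine-usable filter expression."""
    metadata_field: str  # e.g. "languages"
    operator: str        # e.g. "in"
    values: list[str]    # e.g. ["en", "fr"]; STEP 3 would validate these
                         # against the permitted values for the field
```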

Unlike the former process (extraction of metadata from assets), which can be performed manually without an LLM only to a limited degree, defining filters explicitly by hand is a full-fledged alternative to automated user query parsing by an LLM. To this end, users can define the filters themselves, which eliminates the possibility of an LLM misinterpreting or omitting some conditions they wanted to apply in the first place.
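For instance, a hypothetical POST body with manually defined filters might look like this (the endpoint path and field names are illustrative, not the actual API):

```python
# Hypothetical request that supplies filters manually, skipping LLM parsing.
# The endpoint path and the filter field names are illustrative only.
import requests

body = {
    "search_query": "image classification datasets",
    "filters": [
        {"field": "languages", "operator": "in", "values": ["en"]},
        {"field": "licenses", "operator": "eq", "values": ["mit"]},
    ],
}
response = requests.post("http://localhost:8000/query", json=body)
print(response.json())
```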

Brief description of noteworthy files

Here I briefly describe the contents of some files whose purpose may not be clear or that are simply too big to be easily understood:

  • api/models/filter.py: Represents structured filters either extracted using an LLM or defined in the body of an HTTP request
    • validate_filter_or_raise function: Checks the types/values associated with each expression tied to a particular metadata field against the restrictions the value should adhere to. This function is particularly important for checking the validity of manually user-defined filters
  • api/schemas/asset_metadata/base.py: Contains various annotation and schema operations that can be used for validation or for creating dynamic types at runtime.
  • api/schemas/asset_metadata/dataset_metadata.py: Contains the Pydantic model representing the fields we wish to extract. This Pydantic model is then passed to the LLM, functioning as the output schema the LLM is supposed to conform to. Each field may also have field validators associated with it that further restrict the values permitted for said field (a simplified sketch follows this list).
  • api/services/inference/text_operations.py: This file has been extended with additional functions that perform the manual (no LLM) extraction of metadata from Huggingface datasets
  • api/services/inference/llm_query_parsing.py: This file contains all the logic regarding user query parsing using an LLM. For now, I’d suggest not delving too much into this particular file, as it is quite a mess right now, but functional nevertheless. In any case, this file contains:
    • Pydantic classes used as output schemas for individual LLM steps
    • Llama_ManualFunctionCalling: Class implementing function calling through prompt engineering only, rather than relying on Ollama/LangChain tool calls
    • UserQueryParsingStages: Class containing functions for performing the individual LLM steps. Each LLM step is a variation, an instance, of the Llama_ManualFunctionCalling class
    • UserQueryParsing: Class encapsulating all the LLM steps performed for user query parsing purposes. This wrapper class is then used in other parts of our application to perform user query parsing with an LLM.
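As for the dataset metadata model mentioned above, a simplified sketch of what it might look like; the field names, the validator, and the permitted-values set are illustrative assumptions, not the actual implementation.

```python
# Simplified sketch in the spirit of api/schemas/asset_metadata/dataset_metadata.py.
from pydantic import BaseModel, field_validator

PERMITTED_LICENSES = {"mit", "apache-2.0", "cc-by-4.0"}  # assumed example set


class DatasetMetadata(BaseModel):
    languages: list[str] = []
    licenses: list[str] = []

    @field_validator("licenses")
    @classmethod
    def check_licenses(cls, values: list[str]) -> list[str]:
        # Per-field restriction: reject values outside the permitted set,
        # mirroring what the optional STEP 3 validates LLM output against.
        unknown = [v for v in values if v.lower() not in PERMITTED_LICENSES]
        if unknown:
            raise ValueError(f"Unpermitted licenses: {unknown}")
        return values
```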

Potential problems

Asset pagination and the total number of assets fulfilling a set of criteria

Implementing these two features for Milvus embeddings is rather trivial, but there are two main obstacles that could potentially be surmounted, albeit at great cost. For now, we have chosen to implement only approximate pagination and an approximate count of the total number of assets tied to a particular query.

Example 1: I wish to retrieve a page (offset=5000, limit=100) associated with a particular user query
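In HTTP terms, such a request might look like this (the endpoint path and parameter names are hypothetical):

```python
# Hypothetical request for the page from Example 1; the endpoint path and
# parameter names are illustrative, not the service's actual API.
import requests

query_id = "some-previously-submitted-query-id"
response = requests.get(
    f"http://localhost:8000/query/{query_id}/results",
    params={"offset": 5000, "limit": 100},
)
page = response.json()
print(page["num_hits"])  # approximate total, for the reasons discussed below
```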

Obstacle 1: The underlying AIoD assets of embeddings are constantly changing

  • Since AIoD assets are constantly changing, we cannot guarantee that the first 5000 assets (Example 1) are valid and up-to-date. Not to mention that there could be an overlap of assets between pages if we were to retrieve the page with an offset of 5000 before moving to the page with an offset of 4900 (if there were any outdated assets, that is...)
  • This also applies to retrieving the total number of assets that comply with user filters -> we don't know the exact number of assets that are valid at the time of the user request.
  • Solution: Always check all the preceding assets up to the page we're interested in (or even check all the assets in the Milvus DB, in the case of determining the total number of assets compliant with user queries that have no filters). This is obviously a ludicrously expensive approach, which is why we settle for the approximate variant.

Obstacle 2: One AIoD asset may be divided into N separate chunks each of them represented by its own Milvus embedding

  • So far, we have assumed that each AIoD asset corresponds to one and only one embedding. However, this is not the case
  • There's actually a hidden layer of abstraction that we have never wished to delve into: Milvus operations are performed on embeddings rather than on assets, yet the user defines pagination parameters in terms of assets. We conceal this fact by requesting more embeddings than the number of assets required and then keeping only the embeddings associated with distinct assets
  • This further degrades the precision of the page offset, as the Milvus offset itself is applied to embeddings rather than to assets
  • This leads to overlaps of assets between pages
  • Solution: Yet again, the solution is to check all the assets preceding our page. For instance, to truly get an asset offset of 5000 (Example 1), we would need to retrieve the top 5100 embeddings, actually more than that to account for assets tied to multiple embeddings. Then we would identify the distinct, still-existing assets among those embeddings and return only the specific window the user requested. This approach is straightforward yet very expensive. The over-fetching we perform instead is sketched below.
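Assuming pymilvus and an asset_id output field (both the overfetch factor and the field names are illustrative assumptions, not the service's actual code):

```python
# Sketch of over-fetching embeddings and deduplicating by asset, assuming a
# pymilvus Collection with an "asset_id" field.
OVERFETCH_FACTOR = 2  # assumed safety margin for multi-chunk assets


def search_distinct_assets(collection, query_emb, offset: int, limit: int) -> list[str]:
    hits = collection.search(
        data=[query_emb],
        anns_field="vector",
        param={"metric_type": "COSINE"},
        limit=(offset + limit) * OVERFETCH_FACTOR,
        output_fields=["asset_id"],
    )[0]
    distinct: list[str] = []
    for hit in hits:
        asset_id = hit.entity.get("asset_id")
        if asset_id not in distinct:  # keep only the first chunk per asset
            distinct.append(asset_id)
    # Apply the asset-level offset only after deduplication; this remains
    # approximate, since the factor may not cover every multi-chunk asset.
    return distinct[offset : offset + limit]
```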

Preserving AIoD assets locally (for a limited time)

In order for us to be able to return entire assets instead of only their IDs, we need to temporarily store them in our database (this would not necessarily be the case with blocking endpoints, but that is a discussion for another day). Two problems arise: 1) we store asset-related data that is not immediately deleted once it gets removed from the original platform (HuggingFace, ...), and 2) we risk serving outdated AIoD assets to our users.

Query expiration date: We have introduced the concept of an expiration date for queries, which states up to what time a specific query remains accessible to the user

  • Still, even if a query has not expired yet, we DO NOT GUARANTEE the validity of its results, as changes to AIoD assets can happen whenever... I suppose we could potentially check the results' validity each time a user requests them, but even that would not make the results impervious to being invalidated in the meantime...
  • For now, we have set the expiration date to one hour after the user query is resolved

Problems:

  • To alleviate the first problem, we have introduced an additional daily job that gets rid of all the expired queries. This job keeps the TinyDB size from getting out of hand while also removing asset metadata that might pertain to old or deleted AIoD assets (a sketch of this logic follows this list)
  • The second problem cannot be easily addressed, I suppose. By further reducing the expiration duration, we could avoid some situations in which an AIoD asset is no longer up-to-date, but this problem cannot be nullified completely. Actually, even if we were to provide only the asset IDs, said IDs would still be susceptible to being out-of-date. The silver lining that makes serving invalid asset IDs somewhat acceptable, compared to serving invalid AIoD assets, is the fact that the user needs to additionally prompt the AIoD platform to retrieve the corresponding assets and is thus subsequently informed of the invalidity of our results, which would not be the case with a response that returns entire assets.
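That cleanup, together with the expiration stamping, might look roughly like this in TinyDB (table and field names are assumptions, not necessarily those used by the service):

```python
# Hedged sketch of query expiration in TinyDB: stamp an expiration time when
# a query is resolved, and purge expired entries in the daily job.
from datetime import datetime, timedelta, timezone

from tinydb import Query, TinyDB

EXPIRATION = timedelta(hours=1)  # queries expire one hour after resolution


def mark_resolved(db: TinyDB, query_id: str) -> None:
    expires_at = (datetime.now(timezone.utc) + EXPIRATION).isoformat()
    db.table("queries").update({"expires_at": expires_at}, Query().id == query_id)


def purge_expired_queries(db: TinyDB) -> int:
    # ISO-8601 UTC strings compare correctly as plain strings.
    now = datetime.now(timezone.utc).isoformat()
    expired = Query().expires_at.exists() & (Query().expires_at < now)
    return len(db.table("queries").remove(expired))
```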

Additional planned features

  • Extend the list of assets we can apply metadata filtering on. This entails:
    • creating a separate class, a Pydantic model, defining metadata fields to extract from a specific asset
    • utilizing an LLM to extract said fields automatically instead of relying on a common yet fragile metadata structure that could potentially change in the future
  • Make the DB updating job more time-efficient -> the DB updating job should contain multiple tasks run in parallel to speed up the whole process (a speculative sketch follows this list). The job should contain tasks associated with:
      1. fetching AIoD assets
      2. computing embeddings
      3. LLM metadata extraction
  • Distinguish between manual and automated filter extraction in the service config/settings => so that, in case we don't have access to an LLM, we could still potentially perform metadata filtering manually (if there's support for manually extracted asset metadata, that is)
    • Currently, we either support all the forms of metadata filtering (an LLM is necessary) or none
  • Add DEBUG logs recording the time spent performing various processes (fetching AIoD assets, computing/storing embeddings, invoking an LLM, ...) so that we can determine the weak link once the processes become unbearably slow
  • Perform changes to the non-blocking endpoints to appease AR (issue Resolve inconsistencies with standard asset search #3)
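For the parallelized DB-update job in particular, a speculative sketch (all helper names are placeholders; this illustrates the intended task overlap, not an actual implementation):

```python
# Speculative sketch: prefetch the next batch of AIoD assets while embeddings
# and LLM metadata extraction for the current batch run concurrently.
from concurrent.futures import ThreadPoolExecutor


def fetch_aiod_assets(page: int) -> list[dict]:  # task 1 (placeholder)
    return []


def compute_embeddings(assets: list[dict]) -> list[list[float]]:  # task 2
    return []


def extract_metadata_llm(assets: list[dict]) -> list[dict]:  # task 3
    return []


def update_db(num_pages: int) -> None:
    with ThreadPoolExecutor(max_workers=3) as pool:
        next_batch = pool.submit(fetch_aiod_assets, 0)
        for page in range(num_pages):
            assets = next_batch.result()
            if page + 1 < num_pages:  # overlap fetching with processing
                next_batch = pool.submit(fetch_aiod_assets, page + 1)
            embeddings = pool.submit(compute_embeddings, assets)
            metadata = pool.submit(extract_metadata_llm, assets)
            embeddings.result(), metadata.result()  # store into Milvus/TinyDB here
```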

TODO for deploying on AIoD

  • We should repopulate the Milvus and TinyDB databases from scratch, otherwise some unexpected behavior may be encountered, as I have not implemented any countermeasures for dealing with old, outdated schemas in the databases, etc.
    • I have precomputed embeddings and metadata on our cluster
  • I have yet to test the api/scripts/populate_milvus.py script

marcel-vesely-kinit and others added 10 commits December 17, 2024 15:22
…ervice: TODO: support for manual filtering; prepare deployment setup
…ed whether they adhere to value constraints tied to individual fields we wish to filter by
…on the environment variables - it yet to be tested though
…ire assets instead of doc IDs only; 3) created blocking endpoints that wait till the query is processed
…alid value to a list of permitted values for a particular metadata field
@marcel-vesely-kinit marcel-vesely-kinit self-assigned this Dec 30, 2024
@andrejridzik (Collaborator) left a comment:

Reviewing a first batch of modified files (35/49)

api/Dockerfile.template (resolved)
api/deploy.sh (outdated, resolved)
@@ -0,0 +1,11 @@
FROM python:3.11
@andrejridzik (Collaborator): It seems to me that this file is obsolete, or is it not?

@marcel-vesely-kinit (Collaborator, Author): What do you mean by obsolete?

api/app/models/filter.py (outdated, resolved)
api/app/schemas/asset_metadata/base.py (outdated, resolved)
api/app/schemas/enums.py (outdated, resolved)
api/app/schemas/query.py (resolved)
api/app/schemas/search_results.py (resolved)
fi

# What operation we wish to perform
COMPOSE_COMMAND="up -d --build"
if [ "$1" == "--stop" ]; then
if [ "$#" -eq 0 ]; then
@andrejridzik (Collaborator):
This is a rather strange way of populating values in a Jinja template... Can't we just run a simple jinja-cli command directly from this script and potentially add it to requirements/requirements-dev? Running a separate Docker container for this seems like overkill to me, but perhaps I am missing something.

@marcel-vesely-kinit (Collaborator, Author):
I suppose the idea is to have as few requirements as possible on the machine we wish to use for deployment. In other words, I don't expect said machine to have Python installed, for instance, hence the need to run a new Python container that executes a script for building a new Dockerfile/docker-compose file.

api/scripts/build_compose.py (resolved)