Introduction of Metadata filtering & additional minor requested features #4
base: main
Conversation
…incorporate LLM into this service
…ervice: TODO: support for manual filtering; prepare deployment setup
…ed whether they adhere to value constraints tied to individual fields we wish to filter by
…on the environment variables - it is yet to be tested though
…ire assets instead of doc IDs only; 3) created blocking endpoints that wait till the query is processed
…alid value to a list of permitted values for a particular metadata field
…ta into JSON files for later expedited deployment on another machine
… subsequently resolved
…nal user query even if we parse some filters from it; checked and updated populate_milvus.py script
Reviewing a first batch of modified files (35/49)
@@ -0,0 +1,11 @@
FROM python:3.11
It seems to me that this file is obsolete, is it not?
What do you mean by obsolete?
fi

# What operation we wish to perform
COMPOSE_COMMAND="up -d --build"
if [ "$1" == "--stop" ]; then
if [ "$#" -eq 0 ]; then
This is a rather strange way of populating values in a Jinja template. Can't we just run a simple jinja-cli command directly from this script and potentially add it to requirements/requirements-dev? Running a separate Docker container for this seems like overkill to me, but perhaps I am missing something.
I suppose the idea is to have as few requirements as possible on the machine we wish to use for deployment. In other words, I don't expect said machine to have Python installed, for instance, hence the need to run a new Python container that executes a script for building the new Dockerfile/docker-compose file.
…RTAINING TO METADATA FILTERING
This is a belated Christmas present by yours truly :)
Main changes
This is a rather large PR containing various modifications to the Semantic Search service, namely:
num_hits argument representing the number of assets in the entire Milvus DB that match a specific set of user criteria
Metadata filtering
Metadata filtering functionality consists of two main processes that need to be taken care of:
a) Extracting and storing of metadata associated with individual assets
b) Extracting of filters/conditions found in the user query /user query parsing/
Extracting of metadata from assets
Currently, the applicability of this process is restricted to a specific subset of assets we can manually extract metadata from, namely Huggingface datasets. Since this extraction process is very time-demanding, we have opted for now to perform the metadata extraction manually, without any use of LLMs.
Since datasets are the most prevalent asset type on the AIoD platform, we have decided to apply metadata filtering to that asset type, restricting ourselves to Huggingface datasets that share a common metadata structure, as that structure can be used to retrieve a handful of metadata fields to be stored in the Milvus database.
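The manual extraction step described above might look roughly like the following sketch. The field names and the card structure are illustrative assumptions, not the service's actual schema:

```python
from typing import Any

def extract_dataset_metadata(card: dict[str, Any]) -> dict[str, Any]:
    """Pull a handful of filterable fields out of a Huggingface-style
    dataset card dict; fields missing from the card simply stay absent."""
    metadata: dict[str, Any] = {}
    # Common keys on Huggingface dataset cards (an illustrative subset).
    for field in ("license", "language", "task_categories", "size_categories"):
        value = card.get(field)
        if value is not None:
            # Normalize scalars to lists so every field filters uniformly.
            metadata[field] = value if isinstance(value, list) else [value]
    return metadata

card = {"license": "mit", "language": ["en", "de"]}
print(extract_dataset_metadata(card))  # {'license': ['mit'], 'language': ['en', 'de']}
```

The actual extraction logic lives in api/services/inference/text_operations.py; this sketch only conveys the shape of the task.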
Extracting of filters from user queries
Since we wish to minimize costs and increase computational/time efficiency when serving an LLM, we have opted for a rather small LLM (Llama 3.1 8B). Due to its size, performing more advanced tasks or multiple tasks at once can often lead to incorrect results. To mitigate this to a certain degree, we have divided the user query parsing process into 2-3 LLM steps that dissect and scrutinize a user query at different levels of granularity. The LLM steps are:
Unlike the former process (extraction of metadata from assets), which can be performed manually without an LLM only to a limited degree, explicitly defining filters by hand is a full-fledged alternative to automated user query parsing by an LLM. To this end, users can define the filters themselves, which eliminates the possibility of an LLM misinterpreting or omitting conditions the user wanted to apply in the first place.
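A manually supplied filter might then appear in a request body along these lines. The exact field and operator names here are guesses for illustration; the real model is defined in api/models/filter.py:

```python
# Hypothetical request body with explicit filters, bypassing LLM parsing.
# The shape is illustrative only; see api/models/filter.py for the real schema.
request_body = {
    "query": "datasets about sentiment analysis",
    "filters": [
        {"field": "language", "operator": "in", "value": ["en"]},
        {"field": "license", "operator": "==", "value": "mit"},
    ],
}

# Every condition names a metadata field, a comparison operator,
# and a value that must satisfy that field's constraints.
for f in request_body["filters"]:
    print(f["field"], f["operator"], f["value"])
```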
Brief description of noteworthy files
Here I briefly describe the contents of some files whose purpose may not be clear or that are simply too big to be easily understood
api/models/filter.py: Represents structured filters, either extracted using an LLM or defined in the body of an HTTP request
validate_filter_or_raise function: Checks the type/value of the values associated with each expression tied to a particular metadata field, based on the restrictions defined for that field. This function is particularly important for checking the validity of manually user-defined filters
api/schemas/asset_metadata/base.py: Contains various annotation and schema operations that can be used for validation or for creation of dynamic types at runtime
api/schemas/asset_metadata/dataset_metadata.py: Contains the Pydantic model representing the fields we wish to extract. This Pydantic model is then passed to an LLM, functioning as the output schema the LLM is supposed to conform to. Each field may also have field validators associated with it that can further restrict the values permitted by said field
api/services/inference/text_operations.py: This file has been extended with additional functions that perform the manual (no LLM) extraction of metadata from Huggingface datasets
api/services/inference/llm_query_parsing.py: This file contains all the logic regarding user query parsing using an LLM. For now I'd suggest you not delve too much into this particular file, as it is quite a mess right now, but functional nevertheless. In any case, this file contains:
Llama_ManualFunctionCalling: Class representing function calling performed through prompt engineering only, rather than relying on Ollama/Langchain tool calls
UserQueryParsingStages: Class containing functions for performing the individual LLM steps. Each LLM step is a variation, an instance, of the Llama_ManualFunctionCalling class
UserQueryParsing: Class encapsulating all the LLM steps to be performed for user query parsing purposes. This wrapper class is then used in other parts of our application to perform the user query parsing functionality with an LLM
Potential problems
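As a rough sketch of the kind of check a validate_filter_or_raise-style function performs: each value in a filter expression is compared against the constraints defined for its metadata field. The permitted-value table below is invented purely for illustration:

```python
# Hypothetical constraint table: metadata field -> set of permitted values.
# The real restrictions are defined on the Pydantic models in
# api/schemas/asset_metadata/, not in a flat dict like this.
PERMITTED = {
    "license": {"mit", "apache-2.0", "cc-by-4.0"},
    "language": {"en", "de", "fr"},
}

def validate_filter_sketch(field: str, values: list[str]) -> None:
    """Raise ValueError if a filter references an unknown metadata field
    or a value outside that field's permitted set."""
    if field not in PERMITTED:
        raise ValueError(f"Unknown metadata field: {field!r}")
    bad = [v for v in values if v not in PERMITTED[field]]
    if bad:
        raise ValueError(f"Values {bad} not permitted for field {field!r}")

validate_filter_sketch("license", ["mit"])  # passes silently
```

This is exactly the kind of check that matters most for manually defined filters, where no LLM step has pre-screened the values.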
Asset pagination and the total number of assets fulfilling a set of criteria
Implementing these two features for Milvus embeddings is rather trivial, but there are two main obstacles that can potentially be surmounted, albeit at great cost. For now we have chosen to stick to implementing only an approximate pagination and an approximate retrieval of the total number of assets tied to a particular query.
Example 1: I wish to retrieve a page (offset=5000, limit=100) associated with a particular user query
Obstacle 1: The underlying AIoD assets of embeddings are constantly changing
Obstacle 2: One AIoD asset may be divided into N separate chunks each of them represented by its own Milvus embedding
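Obstacle 2 can be illustrated with a small sketch: chunk-level hits coming back from Milvus must first be collapsed to asset-level results before a page can be cut, which means serving a deep page exactly would require fetching every preceding chunk hit first. The function name and the in-memory dedup are illustrative, not the service's implementation:

```python
def asset_page(chunk_hits: list[str], offset: int, limit: int) -> list[str]:
    """Collapse an ordered stream of chunk-level hits (asset IDs, possibly
    repeated because one asset is split into N chunks) into unique assets,
    then cut the requested page. Note that cutting page (offset=5000,
    limit=100) exactly would force us to process all 5000 preceding
    asset-level results first."""
    seen: set[str] = set()
    assets: list[str] = []
    for asset_id in chunk_hits:
        if asset_id not in seen:
            seen.add(asset_id)
            assets.append(asset_id)
    return assets[offset:offset + limit]

hits = ["a", "a", "b", "c", "b", "d"]  # 6 chunk hits covering 4 assets
print(asset_page(hits, offset=1, limit=2))  # ['b', 'c']
```

Combined with Obstacle 1 (the underlying assets keep changing between requests), this is why only an approximate pagination is offered for now.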
Preserving AIoD assets locally (for a limited time)
In order for us to be able to return entire assets instead of their IDs only, we need to temporarily store them in our database (this would not necessarily be the case with blocking endpoints, but that is a discussion for another day). The problem is that 1) we store asset-related data that is not immediately deleted once it gets removed on the original platform (Huggingface, ...), and 2) we may also risk serving outdated AIoD assets to our users.
Query expiration date: We have introduced the concept of a query expiration date, which states until what time a specific query remains accessible to the user
Problems:
Additional planned features
TODO for deploying on AIoD
api/scripts/populate_milvus.py script