
Introduction of Metadata filtering & additional minor requested features #4

Open · wants to merge 18 commits into main
Conversation

@marcel-vesely-kinit (Collaborator) commented Dec 30, 2024

This is a belated Christmas present by yours truly :)

Main changes

This is a rather large PR containing various modifications to the Semantic Search service, namely:

  • Incorporation of Metadata filtering logic
    • Extraction of metadata from assets (for now we only support extracting data from HuggingFace datasets)
    • User query parsing: retrieving filters that can be used to narrow down the list of retrieved assets
      • Automatic user query parsing utilizing an LLM (LLM inference is run via the Ollama service)
      • Manually defined filters (filters are defined in the POST request, so LLM inference is skipped)
  • Small changes to the API endpoints
    • Expanded the endpoints with pagination support and an option to retrieve entire AIoD assets directly (not only asset IDs)
    • Added blocking endpoints that wait until the results of a user query are retrieved
    • Endpoints also return a num_hits field representing the number of assets in the entire Milvus DB that match a specific set of user criteria
  • Simplified deployment process
    • Instead of maintaining a ton of docker-compose files that differ only in how they combine services, we utilize a Jinja2 template to build, on the fly, a docker-compose file containing exactly the services we wish to include (see the sketch after this list)
  • Augmented garbage collector logic
    • Recurring (monthly) Milvus DB garbage collector logic -> delete embeddings associated with old, outdated assets (this functionality has already been implemented)
    • Recurring (daily) TinyDB garbage collector logic -> delete expired queries
      • User queries have an expiration date: one hour after being resolved
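Here is a minimal sketch of the Jinja2 templating approach mentioned above; the template path and variable names (use_llm, use_gpu) are illustrative assumptions, and the real logic lives in api/scripts/build_compose.py.

```python
# Minimal sketch of rendering a docker-compose file from a Jinja2 template.
# Template path and variables are assumed for illustration only.
from jinja2 import Environment, FileSystemLoader


def build_compose(use_llm: bool, use_gpu: bool, out: str = "docker-compose.yml") -> None:
    env = Environment(loader=FileSystemLoader("."), trim_blocks=True, lstrip_blocks=True)
    template = env.get_template("docker-compose.template.yml")
    # Conditional blocks in the template decide which services (e.g. Ollama)
    # end up in the generated compose file.
    with open(out, "w") as f:
        f.write(template.render(use_llm=use_llm, use_gpu=use_gpu))


if __name__ == "__main__":
    build_compose(use_llm=True, use_gpu=False)
```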

Metadata filtering

Metadata filtering functionality consists of two main processes that need to be taken care of:
a) extracting and storing metadata associated with individual assets
b) extracting filters/conditions found in the user query (user query parsing)

Extracting of metadata from assets

Currently, this process is applicable only to a specific subset of assets whose metadata we can extract manually, HuggingFace datasets to be exact. Since LLM-based extraction would be very time-demanding, we have opted to perform the metadata extraction manually, without any use of LLMs, for now.

Since datasets are the most prevalent asset type on the AIoD platform, we have decided to apply metadata filtering to this asset type first, restricting ourselves to HuggingFace datasets, as they share a common metadata structure from which a handful of metadata fields can be retrieved and stored in the Milvus database.
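As a rough illustration of this manual extraction, the sketch below pulls a few filterable fields out of an HF dataset card dict; the chosen fields are assumptions based on the common HF card layout, and the real logic lives in api/services/inference/text_operations.py.

```python
# Hedged sketch of manual (no-LLM) metadata extraction from a HuggingFace
# dataset card. Card keys follow the common HF card YAML convention; the
# selected fields are an assumption, not necessarily what the service stores.
from typing import Any


def extract_dataset_metadata(card_data: dict[str, Any]) -> dict[str, Any]:
    def as_list(value: Any) -> list:
        # HF cards sometimes use a scalar where a list is expected.
        return value if isinstance(value, list) else [value] if value else []

    return {
        "languages": as_list(card_data.get("language")),
        "licenses": as_list(card_data.get("license")),
        "task_types": as_list(card_data.get("task_categories")),
        "size_categories": as_list(card_data.get("size_categories")),
    }
```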

Extracting of filters from user queries

Since we wish to minimize the costs and increase the computational/time efficiency of serving an LLM, we have opted for a rather small LLM (Llama 3.1 8B). Due to its size, performing more advanced tasks or multiple tasks at once can often lead to incorrect results. To mitigate this to a certain degree, we have divided the user query parsing process into 2-3 LLM steps that dissect and scrutinize the user query at different levels of granularity. The LLM steps are:

  • STEP 1: Extraction of natural language conditions from the query (extraction of spans, each representing a condition/filter associated with a specific metadata field)
  • STEP 2: Analysis and transformation of each natural language condition (a span from the user query) into a structured representation of the condition; each condition is processed separately
  • [Optional STEP 3]: Further validation of the transformed value against a list of permitted values for a particular metadata field
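To illustrate, the output schemas for the first two steps might look roughly like this; the actual Pydantic classes in api/services/inference/llm_query_parsing.py may be named and shaped differently.

```python
# Illustrative output schemas for the first two LLM steps; names and fields
# are assumptions, not the actual classes from llm_query_parsing.py.
from pydantic import BaseModel


class NaturalLanguageCondition(BaseModel):
    """STEP 1: one extracted span per condition found in the user query."""
    metadata_field: str  # e.g. "languages"
    condition_span: str  # e.g. "datasets that are in English or French"


class StructuredCondition(BaseModel):
    """STEP 2: one span transformed into a machine-usable filter expression."""
    metadata_field: str  # e.g. "languages"
    operator: str        # e.g. "in"
    values: list[str]    # e.g. ["en", "fr"]; STEP 3 would validate these
                         # against the permitted values for the field
```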

Unlike the former process (extraction of metadata from assets), which can be performed manually without an LLM only to a limited degree, defining filters explicitly by hand is a full-fledged alternative to automated user query parsing by an LLM. To this end, users can define the filters themselves, which eliminates the possibility of an LLM misinterpreting or omitting some conditions they wanted to apply in the first place.
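For instance, a hypothetical POST body with manually defined filters might look like this (the endpoint path and field names are illustrative, not the actual API):

```python
# Hypothetical request that supplies filters manually, skipping LLM parsing.
# The endpoint path and the filter field names are illustrative only.
import requests

body = {
    "search_query": "image classification datasets",
    "filters": [
        {"field": "languages", "operator": "in", "values": ["en"]},
        {"field": "licenses", "operator": "eq", "values": ["mit"]},
    ],
}
response = requests.post("http://localhost:8000/query", json=body)
print(response.json())
```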

Brief description of noteworthy files

Here I briefly describe the contents of some files whose purpose may not be clear or that are simply too big to be easily understood:

  • api/models/filter.py: Represents structured filters either extracted using an LLM or defined in the body of an HTTP request
    • validate_filter_or_raise function: Checks the types/values associated with each expression tied to a particular metadata field against the restrictions the value should adhere to. This function is particularly important for checking the validity of manually user-defined filters
  • api/schemas/asset_metadata/base.py: Contains various annotation and schema operations that can be used for validation or for creating dynamic types at runtime.
  • api/schemas/asset_metadata/dataset_metadata.py: Contains the Pydantic model representing the fields we wish to extract. This Pydantic model is then passed to the LLM, functioning as the output schema the LLM is supposed to conform to. Each field may also have field validators associated with it that further restrict the values permitted for said field (a simplified sketch follows this list).
  • api/services/inference/text_operations.py: This file has been extended with additional functions that perform the manual (no LLM) extraction of metadata from Huggingface datasets
  • api/services/inference/llm_query_parsing.py: This file contains all the logic regarding user query parsing using an LLM. For now, I’d suggest not delving too much into this particular file, as it is quite a mess right now, but functional nevertheless. In any case, this file contains:
    • Pydantic classes used as output schemas for individual LLM steps
    • Llama_ManualFunctionCalling: Class implementing function calling through prompt engineering only, rather than relying on Ollama/LangChain tool calls
    • UserQueryParsingStages: Class containing functions for performing the individual LLM steps. Each LLM step is a variation, an instance, of the Llama_ManualFunctionCalling class
    • UserQueryParsing: Class encapsulating all the LLM steps performed for user query parsing purposes. This wrapper class is then used in other parts of our application to perform user query parsing with an LLM.
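As for the dataset metadata model mentioned above, a simplified sketch of what it might look like; the field names, the validator, and the permitted-values set are illustrative assumptions, not the actual implementation.

```python
# Simplified sketch in the spirit of api/schemas/asset_metadata/dataset_metadata.py.
from pydantic import BaseModel, field_validator

PERMITTED_LICENSES = {"mit", "apache-2.0", "cc-by-4.0"}  # assumed example set


class DatasetMetadata(BaseModel):
    languages: list[str] = []
    licenses: list[str] = []

    @field_validator("licenses")
    @classmethod
    def check_licenses(cls, values: list[str]) -> list[str]:
        # Per-field restriction: reject values outside the permitted set,
        # mirroring what the optional STEP 3 validates LLM output against.
        unknown = [v for v in values if v.lower() not in PERMITTED_LICENSES]
        if unknown:
            raise ValueError(f"Unpermitted licenses: {unknown}")
        return values
```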

Potential problems

Asset pagination and the total number of assets fulfilling a set of criteria

Implementing these two features for Milvus embeddings is rather trivial, but there are two main obstacles that could potentially be surmounted, albeit at great cost. For now, we have chosen to implement only approximate pagination and an approximate count of the total number of assets tied to a particular query.

Example 1: I wish to retrieve a page (offset=5000, limit=100) associated with a particular user query
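In HTTP terms, such a request might look like this (the endpoint path and parameter names are hypothetical):

```python
# Hypothetical request for the page from Example 1; the endpoint path and
# parameter names are illustrative, not the service's actual API.
import requests

query_id = "some-previously-submitted-query-id"
response = requests.get(
    f"http://localhost:8000/query/{query_id}/results",
    params={"offset": 5000, "limit": 100},
)
page = response.json()
print(page["num_hits"])  # approximate total, for the reasons discussed below
```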

Obstacle 1: The underlying AIoD assets of embeddings are constantly changing

  • Since AIoD assets are constantly changing, we cannot guarantee that the first 5000 assets (Example 1) are valid and up-to-date. Not to mention that there could be an overlap of assets between pages if we were to retrieve the page with an offset of 5000 before moving to the page with an offset of 4900 (if there were any outdated assets, that is...)
  • This also applies to retrieving the total number of assets that comply with user filters -> we don't know the exact number of assets that are valid at the time of the user request.
  • Solution: Always check all the preceding assets up to the page we're interested in (or even check all the assets in the Milvus DB, in the case of determining the total number of assets compliant with user queries that have no filters). This is obviously a ludicrously expensive approach, which is why we settle for the approximate variant.

Obstacle 2: One AIoD asset may be divided into N separate chunks each of them represented by its own Milvus embedding

  • So far, we have assumed that each AIoD asset corresponds to one and only one embedding. However, this is not the case
  • There's actually a hidden layer of abstraction that we have never wished to delve into: Milvus operations are performed on embeddings rather than on assets, yet the user defines pagination parameters in terms of assets. We conceal this fact by requesting more embeddings than the number of assets required and then keeping only the embeddings associated with distinct assets
  • This further degrades the precision of the page offset, as the Milvus offset itself is applied to embeddings rather than to assets
  • This leads to overlaps of assets between pages
  • Solution: Yet again, the solution is to check all the assets preceding our page. For instance, to truly get an asset offset of 5000 (Example 1), we would need to retrieve the top 5100 embeddings, actually more than that to account for assets tied to multiple embeddings. Then we would identify the distinct, still-existing assets among those embeddings and return only the specific window the user requested. This approach is straightforward yet very expensive. The over-fetching we perform instead is sketched below.
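Assuming pymilvus and an asset_id output field (both the overfetch factor and the field names are illustrative assumptions, not the service's actual code):

```python
# Sketch of over-fetching embeddings and deduplicating by asset, assuming a
# pymilvus Collection with an "asset_id" field.
OVERFETCH_FACTOR = 2  # assumed safety margin for multi-chunk assets


def search_distinct_assets(collection, query_emb, offset: int, limit: int) -> list[str]:
    hits = collection.search(
        data=[query_emb],
        anns_field="vector",
        param={"metric_type": "COSINE"},
        limit=(offset + limit) * OVERFETCH_FACTOR,
        output_fields=["asset_id"],
    )[0]
    distinct: list[str] = []
    for hit in hits:
        asset_id = hit.entity.get("asset_id")
        if asset_id not in distinct:  # keep only the first chunk per asset
            distinct.append(asset_id)
    # Apply the asset-level offset only after deduplication; this remains
    # approximate, since the factor may not cover every multi-chunk asset.
    return distinct[offset : offset + limit]
```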

Preserving AIoD assets locally (for a limited time)

In order for us to be able to return entire assets instead of only their IDs, we need to temporarily store them in our database (this would not necessarily be the case with blocking endpoints, but that is a discussion for another day). Two problems arise: 1) we store asset-related data that is not immediately deleted once it gets removed from the original platform (HuggingFace, ...), and 2) we risk serving outdated AIoD assets to our users.

Query expiration date: We have introduced the concept of an expiration date for queries, which states up to what time a specific query remains accessible to the user

  • Still, even if a query has not expired yet, we DO NOT GUARANTEE the validity of its results, as changes to AIoD assets can happen whenever... I suppose we could potentially check the results' validity each time a user requests them, but even that would not make the results impervious to being invalidated in the meantime...
  • For now, we have set the expiration date to one hour after the user query is resolved

Problems:

  • To alleviate the first problem, we have introduced an additional daily job that gets rid of all the expired queries. This job keeps the TinyDB size from getting out of hand while also removing asset metadata that might pertain to old or deleted AIoD assets (a sketch of this logic follows this list)
  • The second problem cannot be easily addressed, I suppose. By further reducing the expiration duration, we could avoid some situations in which an AIoD asset is no longer up-to-date, but this problem cannot be nullified completely. Actually, even if we were to provide only the asset IDs, said IDs would still be susceptible to being out-of-date. The silver lining that makes serving invalid asset IDs somewhat acceptable, compared to serving invalid AIoD assets, is the fact that the user needs to additionally prompt the AIoD platform to retrieve the corresponding assets and is thus subsequently informed of the invalidity of our results, which would not be the case with a response that returns entire assets.
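That cleanup, together with the expiration stamping, might look roughly like this in TinyDB (table and field names are assumptions, not necessarily those used by the service):

```python
# Hedged sketch of query expiration in TinyDB: stamp an expiration time when
# a query is resolved, and purge expired entries in the daily job.
from datetime import datetime, timedelta, timezone

from tinydb import Query, TinyDB

EXPIRATION = timedelta(hours=1)  # queries expire one hour after resolution


def mark_resolved(db: TinyDB, query_id: str) -> None:
    expires_at = (datetime.now(timezone.utc) + EXPIRATION).isoformat()
    db.table("queries").update({"expires_at": expires_at}, Query().id == query_id)


def purge_expired_queries(db: TinyDB) -> int:
    # ISO-8601 UTC strings compare correctly as plain strings.
    now = datetime.now(timezone.utc).isoformat()
    expired = Query().expires_at.exists() & (Query().expires_at < now)
    return len(db.table("queries").remove(expired))
```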

Additional planned features

  • Extend the list of assets we can apply metadata filtering on. This entails:
    • creating a separate class, a Pydantic model, defining metadata fields to extract from a specific asset
    • utilizing an LLM to extract said fields automatically instead of relying on a common yet fragile metadata structure that could potentially change in the future
  • Make the DB updating job more time-efficient -> the DB updating job should contain multiple tasks run in parallel to speed up the whole process (a speculative sketch follows this list). The job should contain tasks associated with:
      1. fetching AIoD assets
      2. computing embeddings
      3. LLM metadata extraction
  • Distinguish between manual and automated filter extraction in the service config/settings => so that, in case we don't have access to an LLM, we could still potentially perform metadata filtering manually (if there's support for manually extracted asset metadata, that is)
    • Currently, we either support all the forms of metadata filtering (an LLM is necessary) or none
  • Add DEBUG logs recording the time spent performing various processes (fetching AIoD assets, computing/storing embeddings, invoking an LLM, ...) so that we can determine the weak link once the processes become unbearably slow
  • Perform changes to the non-blocking endpoints to appease AR (issue Resolve inconsistencies with standard asset search #3)
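For the parallelized DB-update job in particular, a speculative sketch (all helper names are placeholders; this illustrates the intended task overlap, not an actual implementation):

```python
# Speculative sketch: prefetch the next batch of AIoD assets while embeddings
# and LLM metadata extraction for the current batch run concurrently.
from concurrent.futures import ThreadPoolExecutor


def fetch_aiod_assets(page: int) -> list[dict]:  # task 1 (placeholder)
    return []


def compute_embeddings(assets: list[dict]) -> list[list[float]]:  # task 2
    return []


def extract_metadata_llm(assets: list[dict]) -> list[dict]:  # task 3
    return []


def update_db(num_pages: int) -> None:
    with ThreadPoolExecutor(max_workers=3) as pool:
        next_batch = pool.submit(fetch_aiod_assets, 0)
        for page in range(num_pages):
            assets = next_batch.result()
            if page + 1 < num_pages:  # overlap fetching with processing
                next_batch = pool.submit(fetch_aiod_assets, page + 1)
            embeddings = pool.submit(compute_embeddings, assets)
            metadata = pool.submit(extract_metadata_llm, assets)
            embeddings.result(), metadata.result()  # store into Milvus/TinyDB here
```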

TODO for deploying on AIoD

  • We should repopulate the Milvus and TinyDB databases from scratch, otherwise some unexpected behavior may be encountered, as I have not implemented any countermeasures for dealing with old, outdated schemas in the databases, etc.
    • I have precomputed embeddings and metadata on our cluster
  • I have yet to test the api/scripts/populate_milvus.py script

marcel-vesely-kinit and others added 10 commits December 17, 2024 15:22
…ervice: TODO: support for manual filtering; prepare deployment setup
…ed whether they adhere to value constraints tied to individual fields we wish to filter by
…on the environment variables - it yet to be tested though
…ire assets instead of doc IDs only; 3) created blocking endpoints that wait till the query is processed
…alid value to a list of permitted values for a particular metadata field
@marcel-vesely-kinit marcel-vesely-kinit self-assigned this Dec 30, 2024
@andrejridzik (Collaborator) left a comment:

Reviewing a first batch of modified files (35/49)

api/Dockerfile.template (resolved)
api/deploy.sh (outdated, resolved)
@@ -0,0 +1,11 @@
FROM python:3.11
@andrejridzik (Collaborator): It seems to me that this file is obsolete, or is it not?

@marcel-vesely-kinit (Collaborator, Author): What do you mean by obsolete?

api/app/models/filter.py (outdated, resolved)
api/app/schemas/asset_metadata/base.py (outdated, resolved)
api/app/schemas/enums.py (outdated, resolved)
api/app/schemas/query.py (resolved)
api/app/schemas/search_results.py (resolved)
fi

# What operation we wish to perform
COMPOSE_COMMAND="up -d --build"
if [ "$1" == "--stop" ]; then
if [ "$#" -eq 0 ]; then
@andrejridzik (Collaborator):
This is a rather strange way of populating values in a Jinja template... Can't we just run a simple jinja-cli command directly from this script and potentially add it to requirements/requirements-dev? Running a separate Docker container for this seems like overkill to me, but perhaps I am missing something.

@marcel-vesely-kinit (Collaborator, Author):
I suppose the idea is to have as few requirements as possible on the machine we wish to use for deployment. In other words, I don't expect said machine to have Python installed, for instance, hence the need to run a new Python container that executes a script for building a new Dockerfile/docker-compose file.

api/scripts/build_compose.py (resolved)