Embeddings Model and Chunking Engine (Preliminary PR for evaluation purposes only) #354
Conversation
@pmeier See this draft pull request.
Thanks for the PR @Tengal-Teemo! I'll have a look soon.
Well, #217 🥲
IMO, we should split this PR into two parts: one that adds the embedding models and one that adds chunking. Otherwise this will be really hard to review. Assuming we want to do embedding models first, I would be totally ok with `EmbeddingModel` receiving documents and performing the chunking internally, as is currently done in the source storages. In the follow-up PR we can factor it out into the chunking model.
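For illustration, an interim interface along those lines might look like the sketch below. The method names and the internal `_chunk_documents` hook are assumptions, not part of this PR:

```python
from abc import ABC, abstractmethod

from ragna.core import Component, Document


class EmbeddingModel(Component, ABC):
    def embed_documents(self, documents: list[Document]) -> list["Embedding"]:
        # For now, chunking happens internally, mirroring the source storages;
        # a follow-up PR would factor it out into a dedicated chunking model.
        chunks = self._chunk_documents(documents)
        return self.embed_chunks(chunks)

    @abstractmethod
    def _chunk_documents(self, documents: list[Document]) -> list["Chunk"]: ...

    @abstractmethod
    def embed_chunks(self, chunks: list["Chunk"]) -> list["Embedding"]: ...
```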
```
@@ -188,7 +202,14 @@ async def prepare(self) -> Message:
                detail=RagnaException.EVENT,
            )

        await self._run(self.source_storage.store, self.documents)
        if list[Document] in inspect.signature(self.source_storage.store).parameters.values():
```
- We should have this kind of logic as a class attribute on the `SourceStorage` itself. Otherwise, how are we going to communicate this to the REST API / web UI? This needs to be known, because it makes no sense to force the user to select an embedding model when it is unused by the backend.
- This check needs to be more strict. We should only check the first argument rather than the whole signature.
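For the second point, a minimal sketch of what a stricter check could look like (the helper name is hypothetical):

```python
import inspect


def storage_input_type(storage) -> type:
    # Inspect only the first parameter of the bound store() method, i.e. the
    # documents/chunks/embeddings argument, instead of scanning the whole
    # signature for a matching annotation.
    first_param = next(iter(inspect.signature(storage.store).parameters.values()))
    return first_param.annotation  # e.g. list[Document] or list[Embedding]
```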
```python
# Here we need to generate the list of embeddings
chunks = self.chunking_model.chunk_documents(self.documents)
embeddings = self.embedding_model.embed_chunks(chunks)
```
The source storage should also be able to just request chunks. Meaning, we have three distinct cases and cannot group these two. However, if we split this PR as suggested above, this distinction will only come in the follow-up PR.
So, something like this?
```python
if type(self.source_storage).__ragna_input_type__ == Document:
    await self._run(self.source_storage.store, self.documents)
else:
    chunks = self.chunking_model.chunk_documents(documents=self.documents)
    if type(self.source_storage).__ragna_input_type__ == Chunk:
        await self._run(self.source_storage.store, chunks)
    else:
        embeddings = self.embedding_model.embed_chunks(chunks)
        await self._run(self.source_storage.store, embeddings)
```
```
@@ -218,7 +239,10 @@ async def answer(self, prompt: str, *, stream: bool = False) -> Message:

        self._messages.append(Message(content=prompt, role=MessageRole.USER))

        sources = await self._run(self.source_storage.retrieve, self.documents, prompt)
        if list[Document] in inspect.signature(self.source_storage.store).parameters.values():
```
This hits a point that I didn't consider before: we are currently passing the documents again to the `retrieve` function. See the part about BC in #256 (comment) for the reason why. This will likely change when we implement #256. However, in the meantime we need to decide if we want the same "input switching" here as for `store`. I think this is ok, but I want to hear your thoughts.
Ooh, I only understand the logic here in hindsight: of course we need to be able to embed the prompt. So this is correct.
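To make that concrete, the retrieve-side switching could look roughly like the sketch below. `embed_text` is a hypothetical method for embedding a single prompt and is not part of this PR:

```python
if type(self.source_storage).__ragna_input_type__ == Document:
    sources = await self._run(self.source_storage.retrieve, self.documents, prompt)
else:
    # The prompt has to be embedded with the same model as the stored chunks;
    # otherwise the nearest-neighbor search in the source storage is meaningless.
    prompt_embedding = self.embedding_model.embed_text(prompt)
    sources = await self._run(self.source_storage.retrieve, prompt_embedding, prompt)
```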
```python
self._embedding_model = MiniLML6v2()
self._chunking_model = NLTKChunkingModel()
```
Why would the source storage need any of these?
It doesn't, and these have been removed.
```python
self._tokens = 0
self._embeddings = 0
```
These will change with every `.store` call. Why are they instance attributes rather than local variables?
These variables represent the average length of a chunk within the source storage; the value is aggregated across calls to `store`. I'm not sure what scope you're referring to, but I don't think they can be local variables and still work.
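If I understand correctly, the intent is something like the following sketch, where the running totals have to outlive a single call (the class and field names here are assumptions):

```python
class SomeVectorStorage:
    def __init__(self) -> None:
        # Aggregated across every store() call, so they cannot be locals.
        self._tokens = 0
        self._embeddings = 0

    def store(self, embeddings: list["Embedding"]) -> None:
        for embedding in embeddings:
            self._tokens += embedding.chunk.num_tokens  # num_tokens is assumed
            self._embeddings += 1

    @property
    def _avg_chunk_tokens(self) -> float:
        # Average chunk length across everything stored so far.
        return self._tokens / self._embeddings if self._embeddings else 0.0
```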
```python
from sentence_transformers import SentenceTransformer
import torch
```
PyTorch is a massive dependency that we cannot pull in by default. This has to be optional.
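One way to make it optional is the common lazy-import pattern sketched below; this is an illustration, not necessarily how ragna's `PackageRequirement` machinery should handle it:

```python
class MiniLML6v2(GenericEmbeddingModel):
    def __init__(self) -> None:
        try:
            # Imported lazily so ragna itself does not depend on
            # sentence-transformers / PyTorch at import time.
            from sentence_transformers import SentenceTransformer
        except ImportError:
            raise RuntimeError(
                "MiniLML6v2 requires the optional dependency "
                "'sentence-transformers' (which pulls in PyTorch)."
            )
        self._model = SentenceTransformer("all-MiniLM-L6-v2")
```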
```python
import ragna
from ragna.core import Document, PackageRequirement, Requirement, Source

from ._vector_database import VectorDatabaseSourceStorage

from ._embedding_model import Embedding
import pyarrow as pa
```
Import this locally where needed to avoid a hard dependency.
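That is, something along these lines (a sketch; the surrounding method and the `values` attribute are assumptions):

```python
def _build_table(self, embeddings):
    # Local import: pyarrow only becomes a requirement when this code path
    # actually runs, not when the module is imported.
    import pyarrow as pa

    return pa.table({"embedding": [e.values for e in embeddings]})
```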
```python
        self.chunk = chunk


class GenericEmbeddingModel(Component, ABC):
```
This class, as well as the `Embedding` class above, should be moved into `ragna.core._components`, given that we want to separate them from the source storages.
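So, roughly the following layout; the signatures are inferred from how this PR uses the classes, and `values` is an assumption:

```python
# ragna/core/_components.py
from abc import ABC, abstractmethod


class Embedding:
    def __init__(self, values: list[float], chunk: "Chunk") -> None:
        self.values = values
        self.chunk = chunk


class GenericEmbeddingModel(Component, ABC):
    @abstractmethod
    def embed_chunks(self, chunks: list["Chunk"]) -> list[Embedding]: ...
```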
```
# Ignore my entrypoint
test.py
doctest/
```
Flagging these so they get reverted before merge.
@pmeier I've read all your insights, and I'll be updating the PR accordingly.
@Tengal-Teemo I took the liberty of implementing the automatic type discovery that I pitched in #354 (comment) in 34fd3ee. Now every …
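For reference, automatic type discovery could look something like the sketch below. This is an assumption about the approach, not a quote of 34fd3ee:

```python
import inspect
import typing


class SourceStorage(Component):
    __ragna_input_type__: type

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Derive the input type from the annotation of store()'s first
        # parameter after self, e.g. list[Document] -> Document.
        params = list(inspect.signature(cls.store).parameters.values())
        (cls.__ragna_input_type__,) = typing.get_args(params[1].annotation)
```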
@pmeier I've created two new pull requests, one for embedding and one for chunking. This PR shall be left open for now as a record of the comments you made.
We don't need to keep it open to still have access to the comments. Since we are not planning to add anything here, I'm going to close it. We can always re-open later if needs be.
Sorry, yeah, I meant I won't delete the branch.
This attempts to add an `EmbeddingModel` and a `ChunkingModel` with the same overall architecture (using components) as the rest of ragna, hopefully allowing for automated generation of UI components. This PR is for preliminary evaluation only.