
Embeddings Model and Chunking Engine (Preliminary PR for evaluation purposes only) #354

Closed
Tengal-Teemo wants to merge 7 commits

Conversation

@Tengal-Teemo (Contributor)

This attempts to add an EmbeddingModel and a ChunkingModel with the same overall architecture (using components) as the rest of ragna, hopefully allowing for automated generation of UI components.

This PR is for preliminary evaluation only.

Tengal-Teemo marked this pull request as ready for review March 12, 2024 03:24
Tengal-Teemo marked this pull request as draft March 12, 2024 03:24
@Tengal-Teemo (Contributor, Author)

@pmeier See this draft pull request.

@pmeier (Member) commented Mar 12, 2024

Thanks for the PR @Tengal-Teemo! I'll have a look soon.

> as the rest of ragna allowing for, hopefully, automated generation of UI components.

Well, #217 🥲

@pmeier (Member) left a comment

IMO, we should split this PR into two parts: one that adds the embedding models and one that adds chunking. Otherwise this will be really hard to review. Assuming we want to do embedding models first, I would be totally ok with EmbeddingModel receiving documents and performing the chunking internally, as is currently done in the source storages. In the follow-up PR we can factor the chunking out into its own component.

@@ -188,7 +202,14 @@ async def prepare(self) -> Message:
            detail=RagnaException.EVENT,
        )

-       await self._run(self.source_storage.store, self.documents)
+       if list[Document] in inspect.signature(self.source_storage.store).parameters.values():
@pmeier (Member)

  1. We should have this kind of logic as a class attribute on the SourceStorage itself. Otherwise, how are we going to communicate it to the REST API / web UI? This needs to be known, because it makes no sense to force the user to select an embedding model when the backend does not use one.
  2. This check needs to be stricter. We should only check the first argument rather than the whole signature (see the sketch below).
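For illustration, a stricter check along the lines of point 2 might look like this minimal sketch (the takes_documents helper is hypothetical, not code from the PR); it relies on inspect.signature excluding self on bound methods:

    import inspect

    from ragna.core import Document

    def takes_documents(source_storage) -> bool:
        # inspect.signature on a bound method already excludes "self",
        # so the first parameter is the actual input of .store.
        first_param = next(
            iter(inspect.signature(source_storage.store).parameters.values())
        )
        return first_param.annotation == list[Document]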

Comment on lines +208 to +210
# Here we need to generate the list of embeddings
chunks = self.chunking_model.chunk_documents(self.documents)
embeddings = self.embedding_model.embed_chunks(chunks)
@pmeier (Member)

The source storage should also be able to request just chunks. That means we have three distinct cases, so we cannot group these two together. However, if we split this PR as suggested above, this distinction will only come up in the follow-up PR.

@Tengal-Teemo (Contributor, Author)

So, something like this?

        if type(self.source_storage).__ragna_input_type__ == Document:
            await self._run(self.source_storage.store, self.documents)
        else:
            chunks = self.chunking_model.chunk_documents(documents=self.documents)
            if type(self.source_storage).__ragna_input_type__ == Chunk:
                await self._run(self.source_storage.store, chunks)
            else:
                embeddings = self.embedding_model.embed_chunks(chunks)
                await self._run(self.source_storage.store, embeddings)

@@ -218,7 +239,10 @@ async def answer(self, prompt: str, *, stream: bool = False) -> Message:

        self._messages.append(Message(content=prompt, role=MessageRole.USER))

-       sources = await self._run(self.source_storage.retrieve, self.documents, prompt)
+       if list[Document] in inspect.signature(self.source_storage.store).parameters.values():
@pmeier (Member)

This hits a point that I didn't consider before: we are currently passing the documents again to the retrieve function. See the part about BC in #256 (comment) for the reason why. This will likely change when we implement #256. In the meantime, however, we need to decide whether we want the same "input switching" here as for store. I think this is ok, but I want to hear your thoughts.

@pmeier (Member)

Ooh, I only understand the logic here in hindsight. Of course we need to be able to embed the prompt, so this is correct.
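To spell the logic out, the input switching for retrieve could look roughly like the following sketch, mirroring the store proposal above; embed_text is a hypothetical helper for embedding the raw prompt, and the Chunk case is omitted for brevity:

    if type(self.source_storage).__ragna_input_type__ == Document:
        sources = await self._run(
            self.source_storage.retrieve, self.documents, prompt
        )
    else:
        # The prompt must be embedded so it can be compared against the
        # stored chunk embeddings. embed_text is a hypothetical helper.
        prompt_embedding = self.embedding_model.embed_text(prompt)
        sources = await self._run(
            self.source_storage.retrieve, self.documents, prompt_embedding
        )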

Comment on lines +31 to +32
self._embedding_model = MiniLML6v2()
self._chunking_model = NLTKChunkingModel()
@pmeier (Member)

Why would the source storage need any of these?

@Tengal-Teemo (Contributor, Author)

It doesn't, and these have been removed.

Comment on lines +42 to +43
self._tokens = 0
self._embeddings = 0
@pmeier (Member)

These will change with every .store call. Why are they instance attributes rather than local variables?

@Tengal-Teemo (Contributor, Author)

These variables represent the average length of a chunk within the source storage; they are aggregated across calls to store. I'm not sure what scope you're referring to, but I don't think they can be local variables and still work.
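As a sketch of the aggregation being described (ChunkStatsMixin and the chunk.text field are assumptions, and whitespace splitting stands in for real tokenization), the counters accumulate across store calls so that an average can be derived at any point:

    class ChunkStatsMixin:
        def __init__(self) -> None:
            # Running totals, aggregated over every .store call.
            self._tokens = 0
            self._embeddings = 0

        def _record(self, embeddings: list) -> None:
            for embedding in embeddings:
                # Whitespace split as a stand-in for real token counting.
                self._tokens += len(embedding.chunk.text.split())
                self._embeddings += 1

        @property
        def _average_chunk_length(self) -> float:
            return self._tokens / self._embeddings if self._embeddings else 0.0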

Comment on lines +3 to +4
from sentence_transformers import SentenceTransformer
import torch
@pmeier (Member)

PyTorch is a massive dependency that we cannot pull in by default. This has to be optional.
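One way to make it optional, sketched under the assumption that the embedding model follows the PackageRequirement pattern that ragna components already use for declaring dependencies:

    from ragna.core import PackageRequirement, Requirement

    class MiniLML6v2(GenericEmbeddingModel):
        @classmethod
        def requirements(cls) -> list[Requirement]:
            # Declare the heavy dependencies instead of importing them at
            # module level, so ragna can report availability to the user.
            return [
                PackageRequirement("sentence-transformers"),
                PackageRequirement("torch"),
            ]

        def __init__(self) -> None:
            # Deferred import: merely importing this module no longer
            # requires PyTorch to be installed.
            from sentence_transformers import SentenceTransformer

            self._model = SentenceTransformer("all-MiniLM-L6-v2")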

ragna/source_storages/_embedding_model.py (review thread resolved)
import ragna
from ragna.core import Document, PackageRequirement, Requirement, Source

from ._vector_database import VectorDatabaseSourceStorage

from ._embedding_model import Embedding
import pyarrow as pa
@pmeier (Member)

Import this locally where needed to avoid a hard dependency.
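For example, a minimal sketch of the deferred-import pattern being requested (the schema is illustrative only, and the class name assumes this is the LanceDB storage module):

    class LanceDB(VectorDatabaseSourceStorage):
        def store(self, embeddings: list[Embedding]) -> None:
            # Deferred import: pyarrow becomes a requirement of using this
            # storage rather than of importing the module.
            import pyarrow as pa

            schema = pa.schema(
                [
                    ("text", pa.string()),
                    ("embedding", pa.list_(pa.float32())),
                ]
            )
            ...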

self.chunk = chunk


class GenericEmbeddingModel(Component, ABC):
@pmeier (Member)

This class, as well as the Embedding class above, should be moved into ragna.core._components, given that we want to separate them from the source storages.

Comment on lines +9 to +11
# Ignore my entrypoint
test.py
doctest/
@pmeier (Member)

Flagging this so these entries get reverted before merge.

@Tengal-Teemo (Contributor, Author)

@pmeier I've read all your insights, and I'll be updating the PR accordingly.

@pmeier (Member) commented Mar 18, 2024

@Tengal-Teemo I took the liberty of implementing the automatic type discovery that I pitched in #354 (comment) in 34fd3ee. Now every SourceStorage has a __ragna_input_type__ class attribute that holds either Document or Embedding. We can use this to switch the input type in the Chat object.
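For readers following along, a minimal sketch of how such discovery might work (the actual implementation is in 34fd3ee and may differ) reads the annotation of store's first real parameter when a subclass is defined:

    import inspect
    from typing import get_args

    from ragna.core import Component

    class SourceStorage(Component):
        __ragna_input_type__: type

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            if inspect.isabstract(cls):
                # Abstract intermediates may not implement store yet.
                return
            params = list(inspect.signature(cls.store).parameters.values())
            # params[0] is self; the next parameter carries the input
            # annotation, e.g. list[Document] or list[Embedding], whose
            # single type argument becomes __ragna_input_type__.
            (cls.__ragna_input_type__,) = get_args(params[1].annotation)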

@Tengal-Teemo (Contributor, Author)

@pmeier I've created two new pull requests, one for embedding and one for chunking. This PR will be left open for now as a record of the comments you made.

@pmeier (Member) commented Mar 19, 2024

We don't need to keep it open to retain access to the comments. Since we are not planning to add anything here, I'm going to close it. We can always re-open it later if need be.

pmeier closed this Mar 19, 2024
@Tengal-Teemo (Contributor, Author)

Sorry, yeah, I meant that I won't delete the branch.

pmeier mentioned this pull request Mar 21, 2024