Embeddings Model and Chunking Engine (Preliminary PR for evaluation purposes only) #354
Conversation
@pmeier See this draft pull request.
Thanks for the PR @Tengal-Teemo! I'll have a look soon.
Well, #217 🥲
IMO, we should split this PR into two parts: one that adds the embedding models and one that adds chunking. Otherwise this will be really hard to review. Assuming we want to do embedding models first, I would be totally ok with `EmbeddingModel` receiving documents and performing the chunking internally, as is currently done in the source storages. In the follow-up PR we can factor it out into the chunking model.
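For illustration, an interim interface along those lines might look like the sketch below. The method names and the internal `_chunk_documents` hook are assumptions, not part of this PR:

```python
from abc import ABC, abstractmethod

from ragna.core import Component, Document


class EmbeddingModel(Component, ABC):
    def embed_documents(self, documents: list[Document]) -> list["Embedding"]:
        # For now, chunking happens internally, mirroring the source storages;
        # a follow-up PR would factor it out into a dedicated chunking model.
        chunks = self._chunk_documents(documents)
        return self.embed_chunks(chunks)

    @abstractmethod
    def _chunk_documents(self, documents: list[Document]) -> list["Chunk"]: ...

    @abstractmethod
    def embed_chunks(self, chunks: list["Chunk"]) -> list["Embedding"]: ...
```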
```
@@ -188,7 +202,14 @@ async def prepare(self) -> Message:
                detail=RagnaException.EVENT,
            )

        await self._run(self.source_storage.store, self.documents)
        if list[Document] in inspect.signature(self.source_storage.store).parameters.values():
```
- We should have this kind of logic as a class attribute on the `SourceStorage` itself. Otherwise, how are we going to communicate this to the REST API / web UI? This needs to be known, because it makes no sense to force the user to select an embedding model when it is unused by the backend.
- This check needs to be more strict. We should only check the first argument rather than the whole signature.
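For the second point, a minimal sketch of what a stricter check could look like (the helper name is hypothetical):

```python
import inspect


def storage_input_type(storage) -> type:
    # Inspect only the first parameter of the bound store() method, i.e. the
    # documents/chunks/embeddings argument, instead of scanning the whole
    # signature for a matching annotation.
    first_param = next(iter(inspect.signature(storage.store).parameters.values()))
    return first_param.annotation  # e.g. list[Document] or list[Embedding]
```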
```python
# Here we need to generate the list of embeddings
chunks = self.chunking_model.chunk_documents(self.documents)
embeddings = self.embedding_model.embed_chunks(chunks)
```
The source storage should also be able to just request chunks. Meaning, we have three distinct cases and cannot group these two. However, if we split this PR as suggested above, this distinction will only come in the follow-up PR.
So, something like this?
```python
if type(self.source_storage).__ragna_input_type__ == Document:
    await self._run(self.source_storage.store, self.documents)
else:
    chunks = self.chunking_model.chunk_documents(documents=self.documents)
    if type(self.source_storage).__ragna_input_type__ == Chunk:
        await self._run(self.source_storage.store, chunks)
    else:
        embeddings = self.embedding_model.embed_chunks(chunks)
        await self._run(self.source_storage.store, embeddings)
```
```
@@ -218,7 +239,10 @@ async def answer(self, prompt: str, *, stream: bool = False) -> Message:

        self._messages.append(Message(content=prompt, role=MessageRole.USER))

        sources = await self._run(self.source_storage.retrieve, self.documents, prompt)
        if list[Document] in inspect.signature(self.source_storage.store).parameters.values():
```
This hits a point that I didn't consider before: we are currently passing the documents again to the `retrieve` function. See the part about BC in #256 (comment) for the reason why. This will likely change when we implement #256. However, in the meantime we need to decide if we want the same "input switching" here as for `store`. I think this is ok, but I want to hear your thoughts.
Ooh, I only understand the logic here in hindsight: of course we need to be able to embed the prompt. So this is correct.
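To make that concrete, the retrieve-side switching could look roughly like the sketch below. `embed_text` is a hypothetical method for embedding a single prompt and is not part of this PR:

```python
if type(self.source_storage).__ragna_input_type__ == Document:
    sources = await self._run(self.source_storage.retrieve, self.documents, prompt)
else:
    # The prompt has to be embedded with the same model as the stored chunks;
    # otherwise the nearest-neighbor search in the source storage is meaningless.
    prompt_embedding = self.embedding_model.embed_text(prompt)
    sources = await self._run(self.source_storage.retrieve, prompt_embedding, prompt)
```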
```python
self._embedding_model = MiniLML6v2()
self._chunking_model = NLTKChunkingModel()
```
Why would the source storage need any of these?
It doesn't, and these have been removed.
```python
self._tokens = 0
self._embeddings = 0
```
These will change with every `.store` call. Why are they instance attributes rather than local variables?
These variables represent the average length of a chunk within the source storage; the value is aggregated across calls to `store`. I'm not sure what scope you're referring to, but I don't think they can be local variables and still work.
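If I understand correctly, the intent is something like the following sketch, where the running totals have to outlive a single call (the class and field names here are assumptions):

```python
class SomeVectorStorage:
    def __init__(self) -> None:
        # Aggregated across every store() call, so they cannot be locals.
        self._tokens = 0
        self._embeddings = 0

    def store(self, embeddings: list["Embedding"]) -> None:
        for embedding in embeddings:
            self._tokens += embedding.chunk.num_tokens  # num_tokens is assumed
            self._embeddings += 1

    @property
    def _avg_chunk_tokens(self) -> float:
        # Average chunk length across everything stored so far.
        return self._tokens / self._embeddings if self._embeddings else 0.0
```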
```python
from sentence_transformers import SentenceTransformer
import torch
```
PyTorch is a massive dependency that we cannot pull in by default. This has to be optional.
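One way to make it optional is the common lazy-import pattern sketched below; this is an illustration, not necessarily how ragna's `PackageRequirement` machinery should handle it:

```python
class MiniLML6v2(GenericEmbeddingModel):
    def __init__(self) -> None:
        try:
            # Imported lazily so ragna itself does not depend on
            # sentence-transformers / PyTorch at import time.
            from sentence_transformers import SentenceTransformer
        except ImportError:
            raise RuntimeError(
                "MiniLML6v2 requires the optional dependency "
                "'sentence-transformers' (which pulls in PyTorch)."
            )
        self._model = SentenceTransformer("all-MiniLM-L6-v2")
```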
```python
import ragna
from ragna.core import Document, PackageRequirement, Requirement, Source

from ._vector_database import VectorDatabaseSourceStorage

from ._embedding_model import Embedding
import pyarrow as pa
```
Import this locally where needed to avoid a hard dependency.
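That is, something along these lines (a sketch; the surrounding method and the `values` attribute are assumptions):

```python
def _build_table(self, embeddings):
    # Local import: pyarrow only becomes a requirement when this code path
    # actually runs, not when the module is imported.
    import pyarrow as pa

    return pa.table({"embedding": [e.values for e in embeddings]})
```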
```python
        self.chunk = chunk


class GenericEmbeddingModel(Component, ABC):
```
This class, as well as the `Embedding` class above, should be moved into `ragna.core._components`, given that we want to separate them from the source storages.
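So, roughly the following layout; the signatures are inferred from how this PR uses the classes, and `values` is an assumption:

```python
# ragna/core/_components.py
from abc import ABC, abstractmethod


class Embedding:
    def __init__(self, values: list[float], chunk: "Chunk") -> None:
        self.values = values
        self.chunk = chunk


class GenericEmbeddingModel(Component, ABC):
    @abstractmethod
    def embed_chunks(self, chunks: list["Chunk"]) -> list[Embedding]: ...
```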
```
# Ignore my entrypoint
test.py
doctest/
```
Flagging these so they get reverted before merge.
@pmeier I've read all your insights, and I'll be updating the PR accordingly.
@Tengal-Teemo I took the liberty of implementing the automatic type discovery that I pitched in #354 (comment) in 34fd3ee. Now every …
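For reference, automatic type discovery could look something like the sketch below. This is an assumption about the approach, not a quote of 34fd3ee:

```python
import inspect
import typing


class SourceStorage(Component):
    __ragna_input_type__: type

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Derive the input type from the annotation of store()'s first
        # parameter after self, e.g. list[Document] -> Document.
        params = list(inspect.signature(cls.store).parameters.values())
        (cls.__ragna_input_type__,) = typing.get_args(params[1].annotation)
```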
@pmeier I've created two new pull requests, one for embedding and one for chunking. This PR shall be left open for now as a record of the comments you made.
We don't need to keep it open to still have access to the comments. Since we are not planning to add anything here, I'm going to close it. We can always re-open later if needs be.
Sorry, yeah, I meant I won't delete the branch.
This attempts to add an `EmbeddingModel` and a `ChunkingModel` with the same overall architecture (using components) as the rest of ragna, hopefully allowing for automated generation of UI components. This PR is for preliminary evaluation only.