Allow adding metadata for chunks of files #318
Comments
@drale2k Do you think this functionality should go into https://github.com/moekidev/baran which is the gem we're using to do chunking?
Good question given that baran is a Text Splitter specifically for LLMs, but even if baran were to accept metadata, langchainrb still needs to take it as input. You still want people to interact with the langchainrb APIs and not baran directly, right?
@drale2k Correct, I'm just saying that those changes would need to happen in the baran gem itself first and then Langchain.rb would make the corresponding changes to accept metadata. I think instead of returning the plain chunks array, baran should return a different data structure that would hold all that metadata as well. Do you want to suggest those changes to @moekidev?
@drale2k Which vectorsearch DB are you using btw? And what kind of files are you looking to upload?
Currently mostly pinecone but have been looking into open source ones as well. PDFs, MS Office docx and ppt mostly. Starting to look into audio transcriptions as well using https://github.com/guillaumekln/faster-whisper
This would help support the ability to add metadata such as source document names / source URLs for the text. I can see this being useful in `add_data` by optionally being able to pass an array of objects instead of just string paths: check the class of the "path" object before passing it to the chunker, so that an object carrying both the path and its metadata could be sent to the chunker. Then when we ask the vectorsearch database for similarities we should also get the metadata back to use for source links.
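A minimal sketch of that idea: `add_data` accepting either a plain path string or a hash carrying a path plus metadata. The hash keys and helper name here are assumptions for illustration, not the actual langchainrb API.

```ruby
# Hypothetical helper: normalize a "path or { path:, metadata: }" entry
# into a [path, metadata] pair before it reaches the chunker.
# (Keys and method name are illustrative, not langchainrb's API.)
def normalize_entry(entry)
  if entry.is_a?(Hash)
    [entry[:path], entry[:metadata] || {}]
  else
    [entry, {}]
  end
end

# Plain strings and object forms could then be mixed in one call:
entries = [
  "docs/report.pdf",
  { path: "docs/handbook.docx", metadata: { source_url: "https://example.com/handbook" } }
]
normalized = entries.map { |e| normalize_entry(e) }
```

Each `[path, metadata]` pair would then travel through the chunker so every chunk inherits its source's metadata.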
I'm still reading code... It looks like Langchain::Loader actually will take a url! That is nice. I'll have to give that a try. It would be nice if the url was passed into the vectorsearch database as metadata directly. I'm wishing out loud and should definitely consider making a pull request. Thanks for making this a lot easier!
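For context, the path-vs-URL distinction a loader has to make before deciding how to fetch the input can be sketched like this (the helper is illustrative, not Langchain::Loader's actual code):

```ruby
require "uri"

# Treat the input as a URL only when it parses with an http(s) scheme;
# everything else is assumed to be a local file path.
def url?(input)
  %w[http https].include?(URI.parse(input).scheme)
rescue URI::InvalidURIError
  false
end
```

When the input is a URL, that same string is a natural candidate to store as source metadata alongside the chunks.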
@jjimenez Take a look at this draft branch I'm working on: https://github.com/andreibondarev/langchainrb/pull/538/files. The rest of the vectorsearch DBs still need to be fixed to accept the metadata argument.
Any news on this feature?
I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch DBs don't all currently have a schema to support storing this new metadata. Is this something you would like help with?
@sean-dickinson Yes! Any help here would be extremely appreciated! Do you have a DSL in mind we'd implement?
@andreibondarev I'm not sure it's so much a DSL as a standardized schema for storing the data. That said, I'm not very knowledgeable about the different vector databases here, but I'm assuming that in an ideal schema we would be able to store an object that represents the original data source (with a unique identifier) and then a collection of objects with the actual text splits that reference the original source. The metadata field could live on either the parent or the chunks (or both, I suppose, if you wanted?), but the parent probably makes the most sense. Then when you do a search you are searching the chunks, and you can also grab the parent record that the chunks reference to get the metadata.

I'm thinking this schema allows for the easiest updates if your sources change (for instance, if your source is a URL and the content updates, the URL is the same but you want to clear out the old chunks and add new ones). Note I'm taking these ideas from the LangChain Python pgvector implementation. In terms of a DSL, maybe it makes sense to give the concept of a data source its own named abstraction.
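The parent/chunk layout described above can be sketched as plain Ruby data; the names (`Source`, `Chunk`) and fields are assumptions for illustration, not an existing schema:

```ruby
# Hypothetical parent/chunk layout: a Source row carries the unique
# identifier and metadata; Chunk rows carry the text splits and point
# back at their source via source_id. (All names are illustrative.)
Source = Struct.new(:id, :uri, :metadata)
Chunk  = Struct.new(:source_id, :text)

source = Source.new(1, "https://example.com/page", { title: "Example" })
chunks = [
  Chunk.new(source.id, "first split ..."),
  Chunk.new(source.id, "second split ...")
]

# Re-ingesting the same URI: drop the stale chunks, keep the Source row
# (and its metadata), since the URL itself hasn't changed.
chunks.reject! { |c| c.source_id == source.id }
```

A similarity search would then hit the chunks and join back to the parent to recover the source URL for citation links.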
@sean-dickinson I think I would be more in favor of an iterative approach here: enhancing and standardizing the current schema across all the different vectorsearch DBs as opposed to overhauling it. We can slowly iterate towards the ideal state eventually.
@andreibondarev totally fair. I think we can essentially achieve the same functionality with just adding a metadata field for each of the vector dbs like you said, then it could always be improved upon in the future should the need arise. Regardless, what's the state of your branch referenced here where you added the metadata as part of the parsing process? Do you want to build on that and update all the vector db adapters to expect this new field on that branch? Or do you want a separate PR to update all the vector db schemas? |
@sean-dickinson I think maybe we first standardize the schema across all the current vectorsearch DBs. We could probably do Pgvector first; I think you'd need to add some sort of a metadata column there. What're your thoughts?
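Following that iterative approach, the Pgvector change could be as small as one new JSONB column; the table and column names below are assumptions, not langchainrb's actual schema:

```ruby
# Builds the ALTER TABLE statement for a hypothetical items table.
# The column name "metadata" and the JSONB type are assumptions for
# a Postgres/pgvector setup; run the result via your DB connection.
def add_metadata_column_sql(table)
  "ALTER TABLE #{table} ADD COLUMN IF NOT EXISTS metadata JSONB DEFAULT '{}'::jsonb;"
end

puts add_metadata_column_sql("langchain_items")
```

`IF NOT EXISTS` keeps the migration idempotent, so re-running it against an already-upgraded table is a no-op.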
I took a stab at this for pgvector following the guidance of the above comment: #859
Currently only `add_texts` takes a `metadata` argument but `add_data` does not. Since `add_data` takes an array of files it would be clunky to extend it to allow metadata directly. Adding metadata needs to happen on a chunk level. The use case I have for this is to add the page number a chunk was found on and reference that as the source of the information.

To work around it I am currently reading and chunking files manually and then calling `add_texts` to supply the metadata. It's not too difficult but it would be nice if this was easier.
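The workaround described above can be sketched like this; the page data and the commented-out `add_texts` call are illustrative (the exact client setup depends on your vectorsearch DB):

```ruby
# Chunk the document yourself, keeping the page number with each chunk.
pdf_pages = [
  { number: 1, text: "Introduction ..." },
  { number: 2, text: "Details ..." }
]

texts    = pdf_pages.map { |p| p[:text] }
metadata = pdf_pages.map { |p| { page: p[:number], source: "report.pdf" } }

# Then hand both arrays to the vectorsearch client, e.g.:
# client.add_texts(texts: texts, metadata: metadata)
```

Search results carrying that metadata back can then cite "report.pdf, page 1" as the source of an answer.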