Checked other resources
I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
Commit to Help
I commit to help with one of those options 👆
Example Code
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import LanceDB
from langchain_core.documents import Document

vectorstore = LanceDB(
    embedding=GPT4AllEmbeddings(model_name='all-MiniLM-L6-v2.gguf2.f16.gguf'),
)

doc1 = Document(page_content='Hello, world!')
vectorstore.add_documents([doc1])

doc2 = Document(page_content='Is there anybody out there?')
vectorstore.add_documents([doc2])

row_count = vectorstore.get_table().count_rows()
print(f'Added 2 docs, but the row count is {row_count}')
# Output is 'Added 2 docs, but the row count is 1'
Description
Opening a discussion, because I'm not sure if the current LanceDB implementation is incorrect, or there are reasons for it working this way.
Current Situation
I'm creating a LanceDB vector store and adding documents to it. I want to overwrite any existing LanceDB vector store at that location. The default mode in the LanceDB constructor is "overwrite", so I just do vectorstore = LanceDB(embedding=...).
I then make multiple calls to add_documents(), but I find that each call overwrites the data written in previous calls.
Examining the code reveals that I have to use mode="append" in the LanceDB constructor to have add_documents() actually add documents without deleting existing data.
This is quite unintuitive. With the default mode of "overwrite", I would expect any existing database to be overwritten, then successive add_documents() calls to actually add documents, rather than removing existing documents then adding the new ones.
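To make the reported behavior concrete, here is a toy in-memory model of it. This is only a sketch under assumptions: `ToyStore` and its `_disk` dictionary are hypothetical stand-ins, not LangChain's or LanceDB's actual code, and the model simply assumes the constructor's mode is re-applied on every write, which is consistent with what I observe.

```python
class ToyStore:
    """Toy in-memory model of the behavior reported above (not LangChain's code)."""

    # Shared "disk": table name -> rows. Stands in for the LanceDB directory.
    _disk: dict[str, list[str]] = {}

    def __init__(self, table: str = "vectorstore", mode: str = "overwrite") -> None:
        self.table = table
        self.mode = mode  # only remembered here; nothing is deleted yet

    def add_documents(self, docs: list[str]) -> None:
        rows = self._disk.get(self.table)
        if rows is None:
            # First write ever: the table is created from this batch.
            self._disk[self.table] = list(docs)
        elif self.mode == "overwrite":
            # The constructor's mode is applied on every call,
            # so each batch replaces everything written before it.
            self._disk[self.table] = list(docs)
        else:  # "append"
            rows.extend(docs)

    def count_rows(self) -> int:
        return len(self._disk.get(self.table, []))


store = ToyStore()  # default mode="overwrite"
store.add_documents(["Hello, world!"])
store.add_documents(["Is there anybody out there?"])
print(store.count_rows())  # 1: the second batch replaced the first

appending = ToyStore(mode="append")
appending.add_documents(["Hello again"])
print(appending.count_rows())  # 2: the batch was added to the existing row
```

In this model, "overwrite" describes every write rather than the act of opening the store, which is the part I find surprising.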
To achieve my goal, I have to either:
Manually delete the existing vector store, then create my LanceDB with mode="append", or
Create my LanceDB with mode="overwrite", call add_documents() once, then create another LanceDB instance with mode="append" for subsequent add_documents() calls.
Further, if I later create another LanceDB instance in "overwrite" mode and then call vectorstore.get_table().count_rows(), the old rows are still there. Existing data is not deleted until the first add_documents() call.
How I think it should work
The default mode on creating a LanceDB object should be "append", rather than "overwrite", or perhaps there should be no default at all. It's better to unexpectedly append to existing data than unexpectedly replace it.
Creating a LanceDB object with mode="overwrite" should delete the existing database.
add_documents() should always add documents, rather than replacing the existing data.
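The three points above could be sketched as the following toy model. Everything here is hypothetical: `ProposedStore` and its in-memory `_disk` dictionary only illustrate the semantics I have in mind and do not correspond to an existing LangChain or LanceDB API.

```python
class ProposedStore:
    """Toy model of the semantics proposed above (hypothetical, not an existing API)."""

    # Shared "disk": table name -> rows. Stands in for the database on disk.
    _disk: dict[str, list[str]] = {}

    def __init__(self, table: str = "vectorstore", mode: str = "append") -> None:
        if mode not in ("append", "overwrite"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.table = table
        if mode == "overwrite":
            # Existing data is dropped once, at construction time.
            self._disk[self.table] = []
        else:
            # "append" keeps whatever is already there.
            self._disk.setdefault(self.table, [])

    def add_documents(self, docs: list[str]) -> None:
        # Always additive, whatever mode the store was opened with.
        self._disk[self.table].extend(docs)

    def count_rows(self) -> int:
        return len(self._disk[self.table])


store = ProposedStore(mode="overwrite")
store.add_documents(["Hello, world!"])
store.add_documents(["Is there anybody out there?"])
print(store.count_rows())  # 2

reopened = ProposedStore()  # default "append": existing rows are kept
print(reopened.count_rows())  # 2

wiped = ProposedStore(mode="overwrite")
print(wiped.count_rows())  # 0: cleared immediately, before any add_documents() call
```

The design choice is that the mode describes what happens when the store is opened, and never changes what add_documents() does afterwards.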
System Info
System Information
OS: Darwin
OS Version: Darwin Kernel Version 23.6.0: Thu Sep 12 23:35:29 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_ARM64_T6000
Python Version: 3.12.3 (main, Sep 14 2024, 15:06:05) [Clang 15.0.0 (clang-1500.3.9.4)]