Index Loading not working if dumped from different process #154
Comments
Hi, thank you for filing the issue. I can't see your code, so I'm assuming you are using the defaults from the example. The issue seems to be that we do not have a persistent DocumentStore implementation; we only have an InMemoryDocumentStore. In effect, the HNSW index itself (the embedded vectors) does get saved to file, but the documents (the contents) only ever lived in your original process and are not saved when the index is dumped. For now, if you want persistence quickly, I would recommend the QdrantVectorStore.
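To make the failure mode concrete, here is a minimal, library-neutral sketch in Python. `ToyVectorStore` and all of its methods are hypothetical stand-ins, not this project's API; the sketch only assumes what is described above, namely that the vectors are written to disk while the documents stay in process memory.

```python
import json
import os
import tempfile


class ToyVectorStore:
    """Hypothetical stand-in: vectors are dumped to disk, documents are not."""

    def __init__(self):
        self.vectors = {}    # doc id -> embedding (persisted by dump_index)
        self.documents = {}  # doc id -> text (lives only in this process)

    def add(self, doc_id, text, vector):
        self.vectors[doc_id] = vector
        self.documents[doc_id] = text

    def dump_index(self, path):
        # Only the vectors are written; the documents are silently left behind.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.vectors, f)

    @classmethod
    def load_index(cls, path):
        store = cls()
        with open(path, encoding="utf-8") as f:
            store.vectors = json.load(f)
        return store

    def nearest(self, query):
        # Brute-force nearest neighbour by squared Euclidean distance.
        best_id = min(
            self.vectors,
            key=lambda i: sum((a - b) ** 2 for a, b in zip(self.vectors[i], query)),
        )
        # After a reload the id is found, but its document is missing (None).
        return best_id, self.documents.get(best_id)


def simulate_restart():
    """Query the original store, then a store reloaded as a new process would."""
    path = os.path.join(tempfile.mkdtemp(), "index.json")
    original = ToyVectorStore()
    original.add("doc-1", "hello world", [0.0, 1.0])
    original.dump_index(path)
    reloaded = ToyVectorStore.load_index(path)
    query = [0.0, 0.9]
    return original.nearest(query), reloaded.nearest(query)
```

In this toy model the reloaded store still finds the right id, but has no content to return for it, which mirrors the missing-document errors reported in this issue.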
Thanks for the response. Given this, would it be possible (and would you be open to a PR) to load the documents into a vector store without embedding them? This assumes, of course, that the vector store already has the embeddings/index from the loaded .hnsw files. HNSW indexes are great for a POC and are lightweight, with no need to explore full vector DB solutions, so I'm hesitant to move to Qdrant at this time.
Sure, we are always open to new PRs, and this is definitely a blind spot. I'm not entirely sure that a method which simply adds a doc to the vector store without embedding it would be sound. This could lead to invalid states if the user misuses the API. I think it might be better to have some dump_docs()/load_docs() type of methods implemented specifically for HNSWVectorStore with InMemoryDocumentStore. This is my first idea, but it is definitely not the only solution. I think the most important thing is that it should never be possible to have a VectorStore with a Document without an embedded vector, or vice versa - a vector without a corresponding Document.
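A rough sketch of what a dump_docs()/load_docs() pair could look like, written in Python purely for illustration. The function signatures, the JSON file format, and the `indexed_ids` parameter are all assumptions made for this example and are not part of the library. The point it tries to show is the invariant above: loading fails loudly if a document has no vector or a vector has no document.

```python
import json
import os
import tempfile


def dump_docs(documents, path):
    """Write the id -> text mapping alongside the dumped index (hypothetical)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(documents, f)


def load_docs(path, indexed_ids):
    """Load documents, rejecting any mismatch with the ids found in the index."""
    with open(path, encoding="utf-8") as f:
        documents = json.load(f)
    doc_ids = set(documents)
    index_ids = set(indexed_ids)
    vectors_without_docs = index_ids - doc_ids
    docs_without_vectors = doc_ids - index_ids
    if vectors_without_docs or docs_without_vectors:
        raise ValueError(
            f"index/document mismatch: vectors without documents="
            f"{sorted(vectors_without_docs)}, documents without vectors="
            f"{sorted(docs_without_vectors)}"
        )
    return documents


def demo_round_trip(indexed_ids):
    """Dump two documents, then load them against the given index ids."""
    path = os.path.join(tempfile.mkdtemp(), "docs.json")
    dump_docs({"doc-1": "hello world", "doc-2": "goodbye"}, path)
    return load_docs(path, indexed_ids)
```

With this shape, documents are restored without any call to the embedding model, and a partially matching pair of files is rejected instead of producing a store in an invalid state.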
Running the example works fine if you generate, dump, and then load the index in the same process. However, if you generate and dump the index, you cannot reload it in a new process without adding the documents again. Running a query on a loaded index leads to missing-document errors.
Do you have to call add_documents again after load? I believe the add_documents method generates the embeddings itself, so doesn't this lead to redundant calls to OpenAI, regenerating the embeddings a second time on load?