-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Vespa and rework Document Indices #317
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
be1ad24
to
858ccda
Compare
|
||
def upgrade() -> None: | ||
op.drop_table("chunk") | ||
op.drop_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, why is this here?
|
||
|
||
def downgrade() -> None: | ||
op.create_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
DOCUMENT_INDEX_NAME = "danswer_index" # Shared by vector/keyword indices | ||
# Vespa is now the default document index store for both keyword and vector | ||
DOCUMENT_INDEX_TYPE = os.environ.get( | ||
"DOCUMENT_INDEX_TYPE", DocumentIndexType.SPLIT.value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this intended? Seems to contradict the comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally this PR was suppose to swap to vespa but its separate now
chunk_ids = _get_points_from_document_ids( | ||
doc_id_batch, self.collection, self.client | ||
) | ||
self.client.set_payload( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: we will be potentially updating quite a lot more than _BATCH_SIZE
points here. Is that intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_BATCH_SIZE is quite conservative, and this is a tiny request in terms of packet size so I figure it shouldn't be an issue. Plus this won't really be used anymore anyway...
chunk_ids = _get_points_from_document_ids( | ||
doc_id_batch, self.collection, self.client | ||
) | ||
self.client.delete( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
if should_delete_doc: | ||
# Processing the first chunk of the doc and the doc exists | ||
deletion_success = _delete_vespa_doc_chunks(document.id) | ||
if not deletion_success: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: prefer to fail loudly in case that we fail to delete (plus add retries to delete)
|
||
for update_request in update_requests: | ||
if update_request.boost is None and update_request.allowed_users is None: | ||
logger.error("Update request received but nothing to update") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should continue
here?
url = f"{DOCUMENT_ID_ENDPOINT}/{doc_chunk_id}" | ||
res = requests.put(url, headers=json_header, json=update_dict) | ||
|
||
if res.status_code != 200: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also fail loudly here if update fails to avoid marking the delete as successful without actually updating all permissions
|
||
def delete(self, doc_ids: list[str]) -> None: | ||
logger.info(f"Deleting {len(doc_ids)} documents from Vespa") | ||
for doc_id in doc_ids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should probably also use _delete_vespa_doc_chunks
?
9e2477f
to
8a83f83
Compare
No description provided.