Add Vespa and rework Document Indices #317

Merged
merged 11 commits into from
Aug 24, 2023

Conversation

yuhongsun96
Contributor

No description provided.



def upgrade() -> None:
    op.drop_table("chunk")
    op.drop_index(
Contributor

Hmm, why is this here?



def downgrade() -> None:
    op.create_index(
Contributor

Same as above

DOCUMENT_INDEX_NAME = "danswer_index"  # Shared by vector/keyword indices
# Vespa is now the default document index store for both keyword and vector
DOCUMENT_INDEX_TYPE = os.environ.get(
    "DOCUMENT_INDEX_TYPE", DocumentIndexType.SPLIT.value
Contributor

Is this intended? It seems to contradict the comment above.

Contributor Author

Originally this PR was supposed to swap to Vespa, but that is a separate change now.
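For readers following the thread, here is a minimal sketch of how the default resolves when the environment variable is unset. The enum members and their string values are assumptions for illustration (the diff only shows `DocumentIndexType.SPLIT.value`), and `resolve_index_type` is a hypothetical helper, not code from this PR.

```python
from collections.abc import Mapping
from enum import Enum


class DocumentIndexType(str, Enum):
    # Assumed members and values; only SPLIT appears in the diff.
    COMBINED = "combined"  # assumption: one store for keyword and vector
    SPLIT = "split"  # assumption: separate keyword and vector stores


def resolve_index_type(environ: Mapping[str, str]) -> DocumentIndexType:
    """Return the configured index type, falling back to SPLIT when unset."""
    raw = environ.get("DOCUMENT_INDEX_TYPE", DocumentIndexType.SPLIT.value)
    # Enum lookup by value raises ValueError on an unknown setting,
    # so a typo in the environment fails at startup rather than later.
    return DocumentIndexType(raw)
```

With this shape, the code comment and the fallback value can be checked against each other in one place: an empty environment yields `SPLIT`, which is the mismatch the reviewer pointed out.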

chunk_ids = _get_points_from_document_ids(
    doc_id_batch, self.collection, self.client
)
self.client.set_payload(
Contributor

Note: we will be potentially updating quite a lot more than _BATCH_SIZE points here. Is that intended?

Contributor Author

_BATCH_SIZE is quite conservative, and this is a tiny request in terms of packet size so I figure it shouldn't be an issue. Plus this won't really be used anymore anyway...
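If the point count per request ever did need a cap, one option would be to re-batch the expanded chunk IDs before sending them. This is only a sketch: `iter_batches` is a hypothetical helper, and the value of `_BATCH_SIZE` below is made up because the thread does not state it.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")

_BATCH_SIZE = 200  # hypothetical value; the PR does not show the real one


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# A batch of documents expands to more points than documents, because
# each document is stored as several chunks. Re-batching bounds the
# number of points per request regardless of that expansion.
chunk_ids = [f"point-{i}" for i in range(450)]
batch_lengths = [len(batch) for batch in iter_batches(chunk_ids, _BATCH_SIZE)]
```

Each slice could then be passed to the client call in place of the full `chunk_ids` list.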

chunk_ids = _get_points_from_document_ids(
    doc_id_batch, self.collection, self.client
)
self.client.delete(
Contributor

same comment as above

backend/danswer/datastores/vespa/store.py
if should_delete_doc:
    # Processing the first chunk of the doc and the doc exists
    deletion_success = _delete_vespa_doc_chunks(document.id)
    if not deletion_success:
Contributor

nit: prefer to fail loudly if the delete fails (and add retries to the delete)
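A rough sketch of what "retry, then fail loudly" could look like. It assumes the delete helper returns a truthy value on success, as `deletion_success` in the diff suggests; `delete_with_retries`, its parameters, and the choice of `RuntimeError` are all hypothetical.

```python
import time
from collections.abc import Callable


def delete_with_retries(
    delete_fn: Callable[[str], bool],
    doc_id: str,
    attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> None:
    """Call delete_fn until it reports success, raising once attempts run out."""
    for attempt in range(1, attempts + 1):
        if delete_fn(doc_id):
            return
        if attempt < attempts:
            # Linear backoff between attempts; zero by default.
            time.sleep(backoff_seconds * attempt)
    # Raising keeps stale chunks from silently surviving a re-index.
    raise RuntimeError(
        f"Failed to delete chunks for document {doc_id!r} after {attempts} attempts"
    )
```

In the snippet above, `delete_fn` would be `_delete_vespa_doc_chunks`, and the `if not deletion_success:` branch would no longer be needed because failure surfaces as an exception.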


for update_request in update_requests:
    if update_request.boost is None and update_request.allowed_users is None:
        logger.error("Update request received but nothing to update")
Contributor

Should there be a `continue` here?
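To illustrate the question, here is a sketch of the loop with the early `continue`. `UpdateRequest` is a hypothetical stand-in for the real request type, and `build_update_dicts` is invented for the example. The `{"fields": {name: {"assign": value}}}` shape follows Vespa's partial-update JSON format, but whether the PR builds its payload this way is an assumption.

```python
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger(__name__)


@dataclass
class UpdateRequest:  # hypothetical stand-in for the real request type
    document_ids: list[str]
    boost: Optional[float] = None
    allowed_users: Optional[list[str]] = None


def build_update_dicts(update_requests: list[UpdateRequest]) -> list[dict[str, Any]]:
    """Build one partial-update payload per request that has something to update."""
    updates: list[dict[str, Any]] = []
    for update_request in update_requests:
        if update_request.boost is None and update_request.allowed_users is None:
            logger.error("Update request received but nothing to update")
            continue  # skip: without this, an empty update would still be sent
        fields: dict[str, Any] = {}
        if update_request.boost is not None:
            fields["boost"] = {"assign": update_request.boost}
        if update_request.allowed_users is not None:
            fields["allowed_users"] = {"assign": update_request.allowed_users}
        updates.append({"fields": fields})
    return updates
```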

url = f"{DOCUMENT_ID_ENDPOINT}/{doc_chunk_id}"
res = requests.put(url, headers=json_header, json=update_dict)

if res.status_code != 200:
Contributor

Also fail loudly here if the update fails, to avoid marking the delete as successful without actually updating all permissions.
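A sketch of the stricter variant, assuming the per-chunk PUT shown in the diff. The HTTP call is injected as `put` so the example runs without a network; in the real code this would be `requests.put`. `update_all_chunks` and the exception type are hypothetical.

```python
from collections.abc import Callable, Iterable
from typing import Any


def update_all_chunks(
    endpoint: str,
    doc_chunk_ids: Iterable[str],
    update_dict: dict[str, Any],
    put: Callable[..., Any],
) -> None:
    """PUT the update to every chunk and stop at the first non-200 response."""
    json_header = {"Content-Type": "application/json"}
    for doc_chunk_id in doc_chunk_ids:
        url = f"{endpoint}/{doc_chunk_id}"
        res = put(url, headers=json_header, json=update_dict)
        if res.status_code != 200:
            # Raising here means the caller cannot record the operation as
            # successful while some chunks still carry the old permissions.
            raise RuntimeError(
                f"Update failed for chunk {doc_chunk_id!r}: "
                f"HTTP {res.status_code} {res.text}"
            )
```

With `requests`, calling `res.raise_for_status()` would be a close alternative, though it only raises for 4xx and 5xx responses rather than for every status other than 200.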


def delete(self, doc_ids: list[str]) -> None:
    logger.info(f"Deleting {len(doc_ids)} documents from Vespa")
    for doc_id in doc_ids:
Contributor

this should probably also use _delete_vespa_doc_chunks?

@yuhongsun96 yuhongsun96 merged commit 8159fdc into main Aug 24, 2023
4 of 5 checks passed
@yuhongsun96 yuhongsun96 deleted the index-consolidation branch August 24, 2023 15:46
2 participants