Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proportional staging index dag #3763

Merged
merged 22 commits into from
Feb 28, 2024
Merged

Conversation

stacimc
Copy link
Collaborator

@stacimc stacimc commented Feb 7, 2024

Fixes

Fixes #3488 by @stacimc

Description

This PR adds a new create_proportional_by_source_staging_index DAG as described in the IP here:
Screenshot 2024-02-07 at 12 53 35 PM

It can be used to create a new index in staging which is a subset of a staging index, but maintains the same proportion of records per source.

Note: reindexing is not parallelized with slicing, as it is for create_new_production_es_index and other DAGs (see comment thread for explanation). However it is parallelized in the sense that the individual reindexing tasks for each source can run in parallel. Both of these are acceptable because the DAG only touches staging indices, not production.

Testing Instructions

Tests assume you're starting from fresh local testing data, ie after running just api/init.

  • Run the create_proportional_by_source_staging_index DAG locally with default conf options. Then verify in elasticvue that you have a new es index called audio-50-percent-proportional-20240214t181238 (the final suffix is a timestamp and so will differ), with the alias audio-subset-by-source.
    • By default it will be based on the audio-filtered index and should have 50% of the total documents, with the same source proportions. Since the filtered index has 4912 documents, the new index should have 2456.
    • Inspect the logs for get_staging_source_counts to see the counts for the source index, and then the wait_for_reindex tasks to see how many were reindexed for each source in the destination index. The proportions should remain the same. For example wikimedia_audio has 3992 records in the source index, and 1996 in the new one.
    • Note that the remove_existing_alias step was skipped in this dagrun
  • Run the DAG again, but this time change the percentage_of_prod to 0.25 and the source_index to audio.
    • You should get a new index named something like audio-25-percent-proportional-20240214t182518
    • No steps should be skipped this time. The audio-subset-by-source alias should have been removed from the previous index and is now applied to this one.
    • The index should have 1250 documents (1/4 of 5000 from the main audio index)
  • Run the DAG with media type image and ensure this also works. You should get a new index with image-subset-by-source alias, a name like image-50-percent-proportional-20240214t181238, and 2328 documents (half of the 4656 that are in the filtered image index). Check the proportions are correct.
  • Run the staging_elasticsearch_cluster_healthcheck and create_new_staging_es_index DAGs to ensure that refactoring out shared elasticsearch utilities did not break them.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc self-assigned this Feb 7, 2024
@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation labels Feb 7, 2024
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Feb 7, 2024
Copy link

github-actions bot commented Feb 7, 2024

Full-stack documentation: https://docs.openverse.org/_preview/3763

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

@stacimc
Copy link
Collaborator Author

stacimc commented Feb 7, 2024

I think I am going to hold off on this and implement the fix for #3761 in this branch, because otherwise we have to add a new Airflow connection for the API which will just get removed later.

@stacimc
Copy link
Collaborator Author

stacimc commented Feb 14, 2024

The fix for #3761 was added in this branch.

I've encountered another issue with auto-slicing to parallelize the reindexing step. This works fine in DAGs that don't use max_docs (like create_new_es_index), and for dagruns targeting the audio indices (which only have one primary shard). For dagruns targeting the image indices, the reindexing tasks would consistently fail with the error:

ApiError: ApiError(500, 'search_phase_execution_exception', '[slice] can only be used with [scroll] or [point-in-time] requests')

The docs state that max_docs When set to a value less then or equal to scroll_size then a scroll will not be used to retrieve the results for the operation. This disabling of scroll then causes slicing to break. Scroll size defaults to 1000. This can be addressed in a few ways:

  • Reduce the value of size (the scroll size). This may have a negative effect on reindexing efficiency.
  • Set slices to None. Reindexing will not be parallelized but the problem is avoided.
  • Set conflicts='proceed' (instead of the default 'abort').
    • After looking into the code, I realized scroll is only disabled if reindex is set to abort on version conflicts. I think we can safely set this to 'proceed', since we're reindexing into an empty index and should not expect version conflicts. This prevents the scroll issue.

I implemented the last of these options in this commit, but while this fixed the errors I encountered another issue. As noted in the docs:

using max_docs with slices might not result in exactly max_docs documents being reindexed.

And in fact you can see, if you check out this branch at the commit mentioned above, that if you run it against the image-filtered index with percentage_of_prod=0.25, we end up with a new index that has fewer records than expected because max_docs is not respected when using slices. In my test run, I got 852 records instead of the expected 1164. This is in my opinion not acceptable because it results in the proportions by source being off.

As a result, I do not think it is possible to use slicing to parallelize reindexing for this DAG, and still achieve exact record proportions. The most recent commit on this PR handles it by disabling slicing. This will necessarily be less efficient, although I'm not sure how much.

@stacimc stacimc force-pushed the add/proportional-staging-index-dag branch from e1bc98a to a5014aa Compare February 14, 2024 01:29
@stacimc stacimc marked this pull request as ready for review February 14, 2024 19:01
@stacimc stacimc requested review from a team as code owners February 14, 2024 19:01
@stacimc
Copy link
Collaborator Author

stacimc commented Feb 17, 2024

Noting this will have to be rebased when #3805 is merged.

@stacimc stacimc force-pushed the add/proportional-staging-index-dag branch from 512c2fb to 5a053be Compare February 21, 2024 19:17
@stacimc
Copy link
Collaborator Author

stacimc commented Feb 21, 2024

#3805 was merged, so this has been rebased.

Copy link
Member

@krysal krysal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So exciting to see this complete! I tried creating several indexes as suggested, and everything worked wonderfully 👏 Good eye on avoiding related ES/DB DAGs from conflicting with each other. Everything is carefully thought out, as always.

This is excellent. Thank you, @stacimc!

Comment on lines 60 to 68
response = es_conn.search(
index=source_index,
size=0,
aggregations={
"unique_sources": {
"terms": {"field": "source", "size": 100, "order": {"_key": "desc"}}
}
},
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to count the items of each source by directly querying ES 💯

To clarify, what does the "size":100 part?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question -- I added a comment for context. By default an aggregations query will return the 10 buckets with the largest number of documents, but we want aggregations for all sources.size is used to specify the max number of buckets you'd like to return; we want to set it to something that is definitely greater than the number of sources we have, but relatively low (issue, ie rather than just setting it to the maximum).

We have 56 total sources (across both media types) right now. This may be another cause for concern with adding thousands of new sources for Europeana, though.

@openverse-bot
Copy link
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@AetherUnbound
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 7 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)2.

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic! Great work on implementing this Staci, it's very clear and works flawlessly locally. I have a few nits/thoughts, but nothing to block a merge 🚀

@stacimc stacimc merged commit a930ee0 into main Feb 28, 2024
39 checks passed
@stacimc stacimc deleted the add/proportional-staging-index-dag branch February 28, 2024 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create a DAG for creating es indices proportional by provider
4 participants