Add proportional staging index dag #3763
Full-stack documentation: https://docs.openverse.org/_preview/3763
Please note that GitHub Pages takes a little time to deploy newly pushed code. If the link above doesn't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
I think I am going to hold off on this and implement the fix for #3761 in this branch, because otherwise we have to add a new Airflow connection for the API which will just get removed later.
The fix for #3761 was added in this branch. I've encountered another issue with auto-slicing to parallelize the reindexing step. This works fine in DAGs that don't use
The docs state that max_docs
I implemented the last of these options in this commit, but while this fixed the errors I encountered another issue. As noted in the docs:
And in fact you can see, if you check out this branch at the commit mentioned above, that if you run it against the
As a result, I do not think it is possible to use slicing to parallelize reindexing for this DAG and still achieve exact record proportions. The most recent commit on this PR handles it by disabling slicing. This will necessarily be less efficient, although I'm not sure by how much.
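One way to picture the problem is the deliberately simplified model below. It is not Elasticsearch's actual algorithm: it assumes the `max_docs` budget is split evenly across slices and that a slice can never copy more documents than it holds. The function names and the sample distribution are hypothetical.

```python
def reindexed_without_slicing(docs_per_slice: list[int], max_docs: int) -> int:
    """A single request copies up to max_docs of all matching documents."""
    return min(sum(docs_per_slice), max_docs)


def reindexed_with_slicing(docs_per_slice: list[int], max_docs: int) -> int:
    """Simplified model (assumption): each slice gets an equal share of max_docs."""
    share = max_docs // len(docs_per_slice)
    # A slice holding fewer documents than its share cannot use the remainder.
    return sum(min(held, share) for held in docs_per_slice)


# Hypothetical uneven distribution of one source's documents across 3 slices.
uneven = [10, 2, 0]
print(reindexed_without_slicing(uneven, 6))  # 6
print(reindexed_with_slicing(uneven, 6))     # 4
```

Under these assumptions the sliced run copies fewer documents than requested whenever a slice is short, which would break exact per-source proportions.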
e1bc98a to a5014aa
Noting this will have to be rebased when #3805 is merged.
- Since we're reindexing from a staging index, it's also okay to run the reindexing tasks in parallel
- Update broken import
512c2fb to 5a053be
#3805 was merged, so this has been rebased.
So exciting to see this complete! I tried creating several indexes as suggested, and everything worked wonderfully 👏 Good eye on preventing related ES/DB DAGs from conflicting with each other. Everything is carefully thought out, as always.
This is excellent. Thank you, @stacimc!
...eate_proportional_by_source_staging_index/create_proportional_by_source_staging_index_dag.py
response = es_conn.search(
    index=source_index,
    size=0,
    aggregations={
        "unique_sources": {
            "terms": {"field": "source", "size": 100, "order": {"_key": "desc"}}
        }
    },
)
Good idea to count the items of each source by directly querying ES 💯
To clarify, what does the `"size": 100` part do?
Good question -- I added a comment for context. By default, an aggregations query returns the 10 buckets with the largest number of documents, but we want aggregations for all sources. `size` specifies the maximum number of buckets to return; we want to set it to something that is definitely greater than the number of sources we have, but relatively low (issue, i.e. rather than just setting it to the maximum).
We have 56 total sources (across both media types) right now. This may be another cause for concern with adding thousands of new sources for Europeana, though.
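As a rough sketch of how the per-source counts could be read from such a response: a terms aggregation returns a list of buckets, each with a `key` and a `doc_count`, plus a `sum_other_doc_count` field covering documents that did not fit in the returned buckets. The helper name and the sample response below are hypothetical, and this is not necessarily how the DAG itself processes the result.

```python
def source_counts_from_response(response: dict) -> dict[str, int]:
    """Map each source name to its document count (hypothetical helper)."""
    agg = response["aggregations"]["unique_sources"]
    # A non-zero value means some sources were cut off by the bucket `size`.
    if agg.get("sum_other_doc_count", 0) > 0:
        raise ValueError("bucket size too small: some sources were omitted")
    return {bucket["key"]: bucket["doc_count"] for bucket in agg["buckets"]}


# Illustrative response only; "other_source" and its count are made up.
sample_response = {
    "aggregations": {
        "unique_sources": {
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "wikimedia_audio", "doc_count": 3992},
                {"key": "other_source", "doc_count": 920},
            ],
        }
    }
}

counts = source_counts_from_response(sample_response)
print(counts)
```

Checking `sum_other_doc_count` is one way to notice when the number of sources has outgrown the configured bucket `size`.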
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @AetherUnbound
Excluding weekend days, this PR was ready for review 7 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s).
@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
This is fantastic! Great work on implementing this Staci, it's very clear and works flawlessly locally. I have a few nits/thoughts, but nothing to block a merge 🚀
Fixes
Fixes #3488 by @stacimc
Description
This PR adds a new `create_proportional_by_source_staging_index` DAG as described in the IP here:
It can be used to create a new index in staging which is a subset of a staging index, but maintains the same proportion of records per source.
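The proportional split can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the counts for `other_source` are made up, and truncating with `int()` is an assumed rounding rule that may differ from what the DAG actually does.

```python
def proportional_counts(source_counts: dict[str, int], percentage: float) -> dict[str, int]:
    """Scale each source's document count by the same fraction.

    Assumption: fractional results are truncated toward zero with int().
    """
    return {
        source: int(count * percentage)
        for source, count in source_counts.items()
    }


# wikimedia_audio's count comes from the testing notes; other_source is made up.
source_counts = {"wikimedia_audio": 3992, "other_source": 920}

half = proportional_counts(source_counts, 0.5)
print(half["wikimedia_audio"])  # 1996
print(sum(half.values()))       # 2456
```

Each per-source number would then serve as the document limit for that source's reindexing task, so the destination index keeps the same source proportions as the source index.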
Note: reindexing is not parallelized with slicing, as it is for `create_new_production_es_index` and other DAGs (see comment thread for explanation). However, it is parallelized in the sense that the individual reindexing tasks for each source can run in parallel. Both of these are acceptable because the DAG only touches staging indices, not production.
Testing Instructions
Tests assume you're starting from fresh local testing data, i.e. after running `just api/init`.
- Run the `create_proportional_by_source_staging_index` DAG locally with default conf options. Then verify in elasticvue that you have a new ES index called `audio-50-percent-proportional-20240214t181238` (the final suffix is a timestamp and so will differ), with the alias `audio-subset-by-source`.
- The new index should be based on the `audio-filtered` index and should have 50% of the total documents, with the same source proportions. Since the filtered index has 4912 documents, the new index should have 2456.
- Check `get_staging_source_counts` to see the counts for the source index, and then the `wait_for_reindex` tasks to see how many were reindexed for each source in the destination index. The proportions should remain the same. For example, `wikimedia_audio` has 3992 records in the source index, and 1996 in the new one.
- Verify that the `remove_existing_alias` step was skipped in this dagrun.
- Run the DAG again, setting `percentage_of_prod` to 0.25 and the `source_index` to `audio`.
- Verify that you have a new index with a name like `audio-25-percent-proportional-20240214t182518`.
- The `audio-subset-by-source` alias should have been removed from the previous index and is now applied to this one.
- Run the DAG for `image` and ensure this also works. You should get a new index with the `image-subset-by-source` alias, a name like `image-50-percent-proportional-20240214t181238`, and 2328 documents (half of the 4656 that are in the filtered image index). Check the proportions are correct.
- Run the `staging_elasticsearch_cluster_healthcheck` and `create_new_staging_es_index` DAGs to ensure that refactoring out shared elasticsearch utilities did not break them.