Add proportional staging index dag #3763

stacimc · 2024-02-07T00:31:42Z

Fixes

Description

This PR adds a new create_proportional_by_source_staging_index DAG as described in the IP here:

It can be used to create a new index in staging which is a subset of a staging index, but maintains the same proportion of records per source.

Note: reindexing is not parallelized with slicing, as it is for create_new_production_es_index and other DAGs (see comment thread for explanation). However it is parallelized in the sense that the individual reindexing tasks for each source can run in parallel. Both of these are acceptable because the DAG only touches staging indices, not production.

Testing Instructions

Tests assume you're starting from fresh local testing data, ie after running just api/init.

Run the create_proportional_by_source_staging_index DAG locally with default conf options. Then verify in elasticvue that you have a new es index called audio-50-percent-proportional-20240214t181238 (the final suffix is a timestamp and so will differ), with the alias audio-subset-by-source.
- By default it will be based on the audio-filtered index and should have 50% of the total documents, with the same source proportions. Since the filtered index has 4912 documents, the new index should have 2456.
- Inspect the logs for get_staging_source_counts to see the counts for the source index, and then the wait_for_reindex tasks to see how many were reindexed for each source in the destination index. The proportions should remain the same. For example wikimedia_audio has 3992 records in the source index, and 1996 in the new one.
- Note that the remove_existing_alias step was skipped in this dagrun
Run the DAG again, but this time change the percentage_of_prod to 0.25 and the source_index to audio.
- You should get a new index named something like audio-25-percent-proportional-20240214t182518
- No steps should be skipped this time. The audio-subset-by-source alias should have been removed from the previous index and is now applied to this one.
- The index should have 1250 documents (1/4 of 5000 from the main audio index)
Run the DAG with media type image and ensure this also works. You should get a new index with image-subset-by-source alias, a name like image-50-percent-proportional-20240214t181238, and 2328 documents (half of the 4656 that are in the filtered image index). Check the proportions are correct.
Run the staging_elasticsearch_cluster_healthcheck and create_new_staging_es_index DAGs to ensure that refactoring out shared elasticsearch utilities did not break them.

Checklist

My pull request has a descriptive title (not a vague title likeUpdate index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

github-actions · 2024-02-07T00:43:06Z

Full-stack documentation: https://docs.openverse.org/_preview/3763

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

https://docs.openverse.org/_preview/3763/catalog/reference/DAGs.html

stacimc · 2024-02-07T20:57:20Z

I think I am going to hold off on this and implement the fix for #3761 in this branch, because otherwise we have to add a new Airflow connection for the API which will just get removed later.

stacimc · 2024-02-14T01:24:21Z

The fix for #3761 was added in this branch.

I've encountered another issue with auto-slicing to parallelize the reindexing step. This works fine in DAGs that don't use max_docs (like create_new_es_index), and for dagruns targeting the audio indices (which only have one primary shard). For dagruns targeting the image indices, the reindexing tasks would consistently fail with the error:

ApiError: ApiError(500, 'search_phase_execution_exception', '[slice] can only be used with [scroll] or [point-in-time] requests')

The docs state that max_docs When set to a value less then or equal to scroll_size then a scroll will not be used to retrieve the results for the operation. This disabling of scroll then causes slicing to break. Scroll size defaults to 1000. This can be addressed in a few ways:

Reduce the value of size (the scroll size). This may have a negative effect on reindexing efficiency.
Set slices to None. Reindexing will not be parallelized but the problem is avoided.
Set conflicts='proceed' (instead of the default 'abort').
- After looking into the code, I realized scroll is only disabled if reindex is set to abort on version conflicts. I think we can safely set this to 'proceed', since we're reindexing into an empty index and should not expect version conflicts. This prevents the scroll issue.

I implemented the last of these options in this commit, but while this fixed the errors I encountered another issue. As noted in the docs:

using max_docs with slices might not result in exactly max_docs documents being reindexed.

And in fact you can see, if you check out this branch at the commit mentioned above, that if you run it against the image-filtered index with percentage_of_prod=0.25, we end up with a new index that has fewer records than expected because max_docs is not respected when using slices. In my test run, I got 852 records instead of the expected 1164. This is in my opinion not acceptable because it results in the proportions by source being off.

As a result, I do not think it is possible to use slicing to parallelize reindexing for this DAG, and still achieve exact record proportions. The most recent commit on this PR handles it by disabling slicing. This will necessarily be less efficient, although I'm not sure how much.

stacimc · 2024-02-17T00:48:01Z

Noting this will have to be rebased when #3805 is merged.

- Since we're reindexing from a staging index, it's also okay to run the reindexing tasks in parallel - Update broken import

stacimc · 2024-02-21T19:17:42Z

#3805 was merged, so this has been rebased.

krysal

So exciting to see this complete! I tried creating several indexes as suggested, and everything worked wonderfully 👏 Good eye on avoiding related ES/DB DAGs from conflicting with each other. Everything is carefully thought out, as always.

This is excellent. Thank you, @stacimc!

...eate_proportional_by_source_staging_index/create_proportional_by_source_staging_index_dag.py

krysal · 2024-02-24T04:35:25Z

...r/create_proportional_by_source_staging_index/create_proportional_by_source_staging_index.py

+    response = es_conn.search(
+        index=source_index,
+        size=0,
+        aggregations={
+            "unique_sources": {
+                "terms": {"field": "source", "size": 100, "order": {"_key": "desc"}}
+            }
+        },
+    )


Good idea to count the items of each source by directly querying ES 💯

To clarify, what does the "size":100 part?

Good question -- I added a comment for context. By default an aggregations query will return the 10 buckets with the largest number of documents, but we want aggregations for all sources.size is used to specify the max number of buckets you'd like to return; we want to set it to something that is definitely greater than the number of sources we have, but relatively low (issue, ie rather than just setting it to the maximum).

We have 56 total sources (across both media types) right now. This may be another cause for concern with adding thousands of new sources for Europeana, though.

openverse-bot · 2024-02-26T00:00:09Z

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@AetherUnbound
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 7 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)².

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Specifically, Saturday and Sunday. ↩
For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range. ↩

AetherUnbound

This is fantastic! Great work on implementing this Staci, it's very clear and works flawlessly locally. I have a few nits/thoughts, but nothing to block a merge 🚀

catalog/dags/common/elasticsearch.py

...r/create_proportional_by_source_staging_index/create_proportional_by_source_staging_index.py

...eate_proportional_by_source_staging_index/create_proportional_by_source_staging_index_dag.py

stacimc self-assigned this Feb 7, 2024

github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation labels Feb 7, 2024

openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Feb 7, 2024

stacimc force-pushed the add/proportional-staging-index-dag branch from e1bc98a to a5014aa Compare February 14, 2024 01:29

stacimc marked this pull request as ready for review February 14, 2024 19:01

stacimc requested review from a team as code owners February 14, 2024 19:01

stacimc requested review from krysal and AetherUnbound February 14, 2024 19:01

stacimc added 13 commits February 20, 2024 14:22

Add API connection

47f70b6

Pull out common elasticsearch functions

2441d2a

Remove duplicated tasks

9439559

Add new dag

b60807b

Tests

716426c

Add tasks for point alias

25d023c

Update dag docs

078b9f1

Fix tests

bdbcf38

Fix es query for reindexing

61551c6

Get stats for source index rather than using stats endpoint

e0a3ea3

Raise error when incorrect number of documents are reindexed

4941e31

Prevent scroll from being disabled and causing error with auto-slicing

d90cc1e

Turn off slicing for reindex with max_docs instead

4da2e8c

stacimc added 5 commits February 20, 2024 14:29

Do not perform multiple source reindexes simultaneously

625e4b6

Update docs to correctly describe source index as a staging index

8c7bbe8

- Since we're reindexing from a staging index, it's also okay to run the reindexing tasks in parallel - Update broken import

Lint

aa9ce98

Log reindexed count

c80eb1c

Clarify language to 'staging', simplify math

5a053be

stacimc force-pushed the add/proportional-staging-index-dag branch from 512c2fb to 5a053be Compare February 21, 2024 19:17

krysal approved these changes Feb 24, 2024

View reviewed changes

stacimc added 2 commits February 26, 2024 10:37

Add clarifying comment about aggregations query

1344209

Make minimum of 0 exclusive

72a922b

AetherUnbound approved these changes Feb 26, 2024

View reviewed changes

stacimc added 2 commits February 27, 2024 16:50

Only refresh index once, clean up es task params

f4613ba

Clarify arguments to get_staging_source_counts

47efec8

stacimc merged commit a930ee0 into main Feb 28, 2024
39 checks passed

stacimc deleted the add/proportional-staging-index-dag branch February 28, 2024 21:47

stacimc mentioned this pull request Mar 5, 2024

create_proportional_by_source_staging_index DAG does not base proportions off source index #3761

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add proportional staging index dag #3763

Add proportional staging index dag #3763

stacimc commented Feb 7, 2024 •

edited

Loading

github-actions bot commented Feb 7, 2024

stacimc commented Feb 7, 2024

stacimc commented Feb 14, 2024

stacimc commented Feb 17, 2024

stacimc commented Feb 21, 2024

krysal left a comment

krysal Feb 24, 2024

stacimc Feb 26, 2024

openverse-bot commented Feb 26, 2024

AetherUnbound left a comment

Add proportional staging index dag #3763

Add proportional staging index dag #3763

Conversation

stacimc commented Feb 7, 2024 • edited Loading

Fixes

Description

Testing Instructions

Checklist

Developer Certificate of Origin

github-actions bot commented Feb 7, 2024

stacimc commented Feb 7, 2024

stacimc commented Feb 14, 2024

stacimc commented Feb 17, 2024

stacimc commented Feb 21, 2024

krysal left a comment

Choose a reason for hiding this comment

krysal Feb 24, 2024

Choose a reason for hiding this comment

stacimc Feb 26, 2024

Choose a reason for hiding this comment

openverse-bot commented Feb 26, 2024

Footnotes

AetherUnbound left a comment

Choose a reason for hiding this comment

stacimc commented Feb 7, 2024 •

edited

Loading