Implementation plan: Catalog Data Cleaning #3848
Full-stack documentation: https://docs.openverse.org/_preview/3848
Please note that GitHub Pages takes a little time to deploy newly pushed code. If the link above doesn't work or you see an old version, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
Thanks for drafting this @krysal! I have a number of editorial suggestions and questions as part of the clarification round.
If confirmed the time is reduced to zero, optionally the cleaning steps can be
removed, or leave them in case we want to perform a similar cleaning effort
later.
I think we can maybe leave a cleaning "shim" in for the future, but the current cleaning logic that gets called should be removed.
What would a cleaning shim look like, in your opinion?
I was thinking we could perhaps leave this section in, even if clean_image_data is a no-op:
openverse/ingestion_server/ingestion_server/ingest.py
Lines 326 to 333 in 9b4f727
    if downstream_table == "image":
        # Step 5: Clean the data
        log.info("Cleaning data...")
        clean_image_data(downstream_table)
        log.info("Cleaning completed!")
        slack.status(
            downstream_table, "Data cleaning complete | _Next: Elasticsearch reindex_"
        )
Now that I say it though, it seems silly 😅 we probably don't need to keep that either!
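For illustration, a no-op shim along these lines might look like the sketch below. It assumes `clean_image_data` keeps its current one-argument call shape; the function body and the `run_cleaning_step` wrapper are hypothetical stand-ins, not the actual ingestion server code, and the Slack status call is left out.

```python
import logging

log = logging.getLogger(__name__)


def clean_image_data(table: str) -> None:
    """Hypothetical no-op shim: keeps the call site stable while doing no work."""
    log.info("No cleaning steps configured for %s; skipping.", table)


def run_cleaning_step(downstream_table: str) -> bool:
    """Simplified mirror of the quoted ingest.py branch (Slack reporting omitted).

    Returns True when the cleaning branch ran, so callers and tests can check it.
    """
    if downstream_table != "image":
        return False
    log.info("Cleaning data...")
    clean_image_data(downstream_table)
    log.info("Cleaning completed!")
    return True
```

With this shape, a future cleaning effort would only need to fill in the body of `clean_image_data`; the surrounding refresh flow would stay untouched.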
I kind of think we do need to have this shim, because I'm pretty sure we will find something to clean, at least in the short run. Would be nice to have some criteria to determine when to remove the shim. Maybe, when the second stage of the data normalization project is done?
With this shim in place, the functions that clean tags and URLs in cleanup.py would be removed, and this dictionary would have empty values:
openverse/ingestion_server/ingestion_server/cleanup.py
Lines 151 to 167 in 9b4f727
_cleanup_config = {
    "tables": {
        "image": {
            "sources": {
                # Applies to all sources.
                "*": {
                    "fields": {
                        "tags": CleanupFunctions.cleanup_tags,
                        "url": CleanupFunctions.cleanup_url,
                        "creator_url": CleanupFunctions.cleanup_url,
                        "foreign_landing_url": CleanupFunctions.cleanup_url,
                    }
                }
            }
        }
    }
}
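As a rough sketch of what "empty values" could mean here, the config below keeps the same nesting but registers no field cleaners. The `fields_to_clean` helper is hypothetical, and merging a source-specific entry over the `"*"` wildcard is an assumption made for the example, not a description of how cleanup.py resolves sources.

```python
from typing import Any, Callable, Dict

# Hypothetical emptied config: same structure as the quoted one, no cleaners registered.
_cleanup_config: Dict[str, Any] = {
    "tables": {
        "image": {
            "sources": {
                # Applies to all sources.
                "*": {"fields": {}},
            }
        }
    }
}


def fields_to_clean(
    table: str, source: str, config: Dict[str, Any] = _cleanup_config
) -> Dict[str, Callable]:
    """Return the field -> cleanup function mapping for a table and source.

    Assumption: source-specific entries are merged over the "*" wildcard entry.
    """
    sources = config["tables"].get(table, {}).get("sources", {})
    merged: Dict[str, Callable] = {}
    merged.update(sources.get("*", {}).get("fields", {}))
    merged.update(sources.get(source, {}).get("fields", {}))
    return merged
```

An empty mapping then signals that there is nothing to clean, while adding a cleaner back later is a one-line change to the config.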
I was debating between the two options, too.
> Would be nice to have some criteria to determine when to remove the shim. Maybe, when the second stage of the data normalization project is done?
That's a great idea! I agree with it 💯
Great plan, @krysal !
Could we also add some notes on how we keep track of the cleaned up data, and how we report on it? This will probably be useful for the person who runs the DAGs, and maybe the MSR runner?
@obulat Good note! I see you're already thinking about re-using the DAG outside of cleaning tasks 😄 I added that step to the DAG specification. Thanks for bringing it up and for the quick review!
@AetherUnbound @obulat I made the corrections from your excellent suggestions. This is ready for another review 📑
Just a couple more comments, no further input after this from me on the clarification round! 🙂
I think we can go on to the next stage.
Co-authored-by: Madison Swain-Bowden
Co-authored-by: Olga Bulat
Fix and add links
Add suggested extra issue and adjust the Expected Outcomes
Include smart_open in the Tools & packages section
Apply editorial suggestions
Document updated. I moved the discussion to the Decision round 🥁
No blocking objections, the plan looks good! 😄
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @obulat
Excluding weekend days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s).
@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
Excited to get this project started!
Co-authored-by: Olga Bulat
Co-authored-by: Madison Swain-Bowden
Thank you folks! ✨
Fixes
Related to #430 by @obulat
Description
The project proposal is already clear about what we want to achieve, and in a past attempt the doubts were about the technical details rather than the proposal itself. So here I go straight to an implementation plan that follows the most straightforward path identified.
This proposal also builds on work that @obulat pushed last year; it's possible in part thanks to her effort and to @stacimc's precedent in the batched_update DAG. Kudos to both!
This discussion follows the Openverse decision-making process. Information about this process can be found on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarize yourself with and follow the process.
Current round
This discussion is currently in the Decision round.
The deadline for review of this round is March 12, 2024.
Developer Certificate of Origin