From 376d883f17e5acd3ffd72f436760354bc15041dc Mon Sep 17 00:00:00 2001 From: Krystle Salazar Date: Tue, 27 Feb 2024 22:32:39 -0400 Subject: [PATCH 1/4] Add Data Cleaning implementation plan --- ...plementation_plan_catalog_data_cleaning.md | 198 ++++++++++++++++++ .../proposals/data_normalization/index.md | 8 + 2 files changed, 206 insertions(+) create mode 100644 documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md create mode 100644 documentation/projects/proposals/data_normalization/index.md diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md new file mode 100644 index 00000000000..0494fda9aae --- /dev/null +++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md @@ -0,0 +1,198 @@ +# 2024-02-27 Implementation Plan: Catalog Data Cleaning + +**Author**: @krysal + +## Reviewers + +- [ ] TBD +- [ ] TBD + +## Project links + + + +- [Project Thread](https://github.com/WordPress/openverse/issues/430) + +This project does not have a project proposal because the scope and rationale of +the project are clear, as defined in the project thread. In doubt, check the +[Expected Outcomes](#expected-outcomes) section below. + +## Overview + +This document describes a solution for incorrect data in the catalog database +(DB) that has to be cleaned up every time a data refresh is run, avoiding wasted +resources. + +## Background + +One of the steps of the [data refresh process for images][img-data-refresh] is +cleaning the data that is not fit for production. This process is triggered +weekly by an Airflow DAG, and then runs in the Ingestion Server, taking +approximately just over **20 hours** to complete, according to a inspection of +latest executions. The cleaned data is only saved to the API database, which is +replaced each time during the same data refresh, causing it to have to be +repeated each time to make the _same_ corrections. + +This cleaning process was designed this way to speed the rows update up since +the relevant part was to provide the correct data to users via the API. Most of +the rows affected were added previous to the creation of the `MediaStore` class +in the Catalog (possibly by the discontinued CommonCrawl ingestion) which is +nowadays responsible for validating the provider data. However, it entails a +problem of wasting resources both in time, which continues to increase, and in +the machines (CPU) it uses, which could easily be avoided making the changes +permanent by saving them in the upstream database. + +[img-data-refresh]: + https://github.com/WordPress/openverse-catalog/blob/main/DAGs.md#image_data_refresh + +## Expected Outcomes + + + +- The catalog database (upstream) preserves the cleaned data results of the + current Ingestion Server's cleaning steps +- The image Data Refresh process is simplified by reducing the cleaning steps + time to nearly zero (and optionally removing them). + +## Step-by-step plan + +The cleaning functions that the Ingestion Server applies are already implemented +in the Catalog in the `MediaStore` class: see its `_tag_blacklisted` method +(which probably should be renamed) and the [url utilities][url_utils] file. The +only part that it's not there and can't be ported is the filtering of +low-confidence tags, since provider scripts don't save an "accuracy" by tag. 
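For context, the low-confidence tag filtering that stays behind in the Ingestion
Server looks roughly like the sketch below. This is only an illustration of the
idea: the threshold value, the denylist entries, and the function name are
assumptions, not the actual code in the server's `cleanup.py`.

```python
# Rough sketch of the low-confidence tag filtering that remains in the
# Ingestion Server. The constants and names here are illustrative assumptions.
MIN_TAG_ACCURACY = 0.9
DENYLISTED_TAGS = {"cc0", "no person"}  # hypothetical entries


def filter_tags(tags: list[dict] | None) -> list[dict]:
    """Drop denylisted tags and machine-generated tags below the accuracy cutoff."""
    cleaned = []
    for tag in tags or []:
        name = (tag.get("name") or "").strip().lower()
        accuracy = tag.get("accuracy")
        if not name or name in DENYLISTED_TAGS:
            continue
        if accuracy is not None and accuracy < MIN_TAG_ACCURACY:
            continue
        cleaned.append(tag)
    return cleaned
```

This step can't move to the catalog because provider scripts don't record a
per-tag accuracy to filter on, so it stays in the Ingestion Server regardless.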
+ +With this the plan then starts in the Ingestion Server with the following steps: + +1. [Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3) +1. [Make and run a batched update DAG for one-time cleanup](#make-and-run-a-batched-update-dag-for-one-time-cleanup) +1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced) + +[url_utils]: + https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py + +## Step details + +### Save TSV files of cleaned data to AWS S3 + +In a previous exploration, it was set to store TSV files of the cleaned data in +the form of ` `, which can be used later to perform +the updates efficiently in the catalog DB, which only had indexes for the +`identifier` field. These files are saved to the disk of the Ingestion Server +EC2 instances, and worked fine for files with URL corrections since this type of +fields is relatively short, but became a problem when trying to save tags, as +the file turned too large and filled up the disk, causing problems to the data +refresh execution. + +The alternative is to upload TSV files to the Amazon Simple Storage Service +(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The +benefit of using S3 buckets is that they have streaming capabilities and will +allow us to read the files in chunks later if necessary for performance. The +downside is that objects in S3 don't allow appending, so it may require to +upload files with different part numbers or evaluate if the [multipart upload +process][aws_mpu] will serve us here. + +[aws_mpu]: + https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html + +| timestamp (UTC) | 'url' | 'creator_url' | 'foreign_landing_url' | 'tags' | +| ------------------- | :---: | :-----------: | :-------------------: | :----: | +| 2024-02-27 04:05:26 | 22156 | 9035458 | 8809213 | 0 | +| 2024-02-20 04:06:56 | 22157 | 9035456 | 8809209 | 0 | +| 2024-02-13 04:41:22 | 22155 | 9035451 | 8809204 | 0 | + +To have some numbers of the problem we are delaing with, the previous table +shows the number of records cleaned by field for last runs at the moment of +writing this IP, except for tags, which we don't have accurate registries since +file saving was disabled. + +### Make and run a batched update DAG for one-time cleanup + +A batched catalog cleaner DAG (or potentially a `batched_update_from_file`) +should take the files of the previous step to perform an batched update on the +catalog's image table, while handling deadlocking and timeout concerns, similar +to the [batched_update][batched_update]. This table is constantly in use by +other DAGs, such as those from API providers or the data refresh process, and +ideally can't be singly blocked by any DAG. + +[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update + +A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601) +consisted of uploading each file to temporary `UNLOGGED` DB tables (which +provides huge gains in writing performance while their disadventages are not +relevant since they won't be permanent), and including a `row_id` serial number +used later to query it in batches. Adding an index in this last column after +filling up the table could improve the query performance. An adaptation will be +needed to handle the column type of tags (`jsonb`). 
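For illustration, a batched pass over such a staging table could look like the
sketch below. This is an assumption-heavy sketch rather than the actual DAG: the
table and column names are placeholders, and the real implementation would go
through Airflow's Postgres hooks and the existing batched update machinery
instead of a bare `psycopg2` connection.

```python
"""Sketch of the one-off batched update described above.

Assumes a TSV of identifier/corrected-URL pairs has already been loaded into an
UNLOGGED staging table named `cleaned_urls` with a serial `row_id` column.
"""
import psycopg2

BATCH_SIZE = 10_000

UPDATE_BATCH_SQL = """
    UPDATE image AS img
    SET url = fixes.url, updated_on = NOW()
    FROM cleaned_urls AS fixes
    WHERE img.identifier = fixes.identifier
      AND fixes.row_id > %(low)s
      AND fixes.row_id <= %(high)s;
"""


def run_batched_update(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute("SELECT max(row_id) FROM cleaned_urls;")
        total = cur.fetchone()[0] or 0
    for low in range(0, total, BATCH_SIZE):
        with conn.cursor() as cur:
            # Each batch commits on its own, so locks on the image table are
            # held briefly and other DAGs are not blocked for long periods.
            cur.execute(UPDATE_BATCH_SQL, {"low": low, "high": low + BATCH_SIZE})
        conn.commit()
    conn.close()
```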
### Run an image data refresh to confirm cleaning time is reduced

Finally, after the previous steps are done, running a data refresh will confirm
that no more updates are applied at ingestion. If the time isn't significantly
reduced, then it will be necessary to check what was missed in the previous
steps.

If the time is confirmed to be reduced to zero, the cleaning steps can
optionally be removed, or left in place in case we want to perform a similar
cleaning effort later.

## Dependencies

### Infrastructure

No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

### Other projects or work

Once the steps have been completed and the method has proven to work, we could
make additional similar corrections following the same procedure. Some
potentially related issues are:

- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)

This will also open up space for more structural changes to the Openverse DB
schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)
of the Data Normalization endeavor.

## Alternatives

A previous proposal was to use the `ImageStore` to re-evaluate every image in
the catalog DB. While this could theoretically be performed in a batched way
too, and presented the advantage of having future validations easily
incorporated in a single place, it also came with significant shortcomings and
complexities. The class would have to adapt to validations for images ingested
by the CommonCrawl process, for which it was not designed, and that could open a
can of extra problems. It would also have to go through the entire database to
update the bad rows, unless a mix of both proposals is implemented, but
ultimately the process described in this IP is considered more direct and
simpler for the goal.

## Rollback

In the rare case we need the old data back, we can resort to DB backups, which
are performed [weekly][db_snapshots].
+ +[db_snapshots]: ./../../../catalog/reference/DAGs.md#rotate_db_snapshots + + + +## Prior art + +- Previous attempt from cc-archive: [Clean preexisting data using ImageStore + #517][mathemancer_pr] +- @obulat's PR to + [add logging and save cleaned up data in the Ingestion Server](https://github.com/WordPress/openverse/pull/904) + +[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517 diff --git a/documentation/projects/proposals/data_normalization/index.md b/documentation/projects/proposals/data_normalization/index.md new file mode 100644 index 00000000000..2d2f7d8e966 --- /dev/null +++ b/documentation/projects/proposals/data_normalization/index.md @@ -0,0 +1,8 @@ +# Data Normalization + +```{toctree} +:titlesonly: +:glob: + +* +``` From e39a2cf3f0027aa80db6c5feb0b0dd3bfbfaf631 Mon Sep 17 00:00:00 2001 From: Krystle Salazar Date: Thu, 29 Feb 2024 17:35:07 -0400 Subject: [PATCH 2/4] Add reviewers --- .../20240227-implementation_plan_catalog_data_cleaning.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md index 0494fda9aae..bd6de7ac301 100644 --- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md +++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md @@ -4,8 +4,8 @@ ## Reviewers -- [ ] TBD -- [ ] TBD +- [ ] @obulat +- [ ] @AetherUnbound ## Project links From 9e3d2cb4c967c2ad48315d12997cdb9d76336e30 Mon Sep 17 00:00:00 2001 From: Krystle Salazar Date: Mon, 4 Mar 2024 14:20:18 -0400 Subject: [PATCH 3/4] Apply editorial suggestions Co-authored-by: Madison Swain-Bowden Co-authored-by: Olga Bulat Fix and add links Add suggested extra issue and adjust the Expected Outcomes Include smart_open in the Tools & packages section Apply editorial suggestions --- ...plementation_plan_catalog_data_cleaning.md | 140 ++++++++++-------- 1 file changed, 81 insertions(+), 59 deletions(-) diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md index bd6de7ac301..33dad87d34c 100644 --- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md +++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md @@ -19,45 +19,50 @@ the project are clear, as defined in the project thread. In doubt, check the ## Overview -This document describes a solution for incorrect data in the catalog database -(DB) that has to be cleaned up every time a data refresh is run, avoiding wasted -resources. +This document describes a mechanism for rectifying incorrect data in the catalog +database (DB) that currently has to be cleaned up every time a data refresh is +run. This one-time fix is an effort to avoid wasting resources and data refresh +runtime. ## Background One of the steps of the [data refresh process for images][img-data-refresh] is cleaning the data that is not fit for production. 
This process is triggered
weekly by an Airflow DAG, which then runs in the Ingestion Server, taking
approximately just over **20 hours** to complete, according to an inspection of
recent executions as of the time of drafting this document. The cleaned data is
only saved to the API database, which is replaced each time during the same data
refresh, meaning this process has to be repeated each time to make the _same_
corrections.

This cleaning process was designed this way to optimize writes to the API
database, since the most important factor was to provide the correct data to
users via the API. Most of the rows affected were added prior to the creation of
the `MediaStore` class in the Catalog (possibly by the discontinued CommonCrawl
ingestion), which is nowadays responsible for validating the provider data prior
to upserting the records into the upstream database. However, the current
approach entails a waste of resources, both in time, which continues to
increase, and in the machines (CPU) it uses, which could easily be avoided by
making the changes permanent, saving them in the upstream database.

[img-data-refresh]: ./../../../catalog/reference/DAGs.md#image_data_refresh

## Expected Outcomes

- The catalog database (upstream) contains the cleaned data outputs of the
  current Ingestion Server's cleaning steps
- The image Data Refresh process is simplified by significantly reducing the
  cleaning time.

## Step-by-step plan

The cleaning functions that the Ingestion Server applies (see the
[cleanup][ing_server_cleanup] file) are already implemented in the Catalog in
the `MediaStore` class: see its [`_tag_blacklisted` method][tag_blacklisted]
(which probably should be renamed) and the [url utilities][url_utils] file. The
only part that isn't there and can't be ported is the filtering of
low-confidence tags, since provider scripts don't save an "accuracy" by tag.

With this, the plan then starts in the Ingestion Server with the following steps:

1. 
[Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3)
1. [Make and run a batched update DAG for one-time cleanups](#make-and-run-a-batched-update-dag-for-one-time-cleanups)
1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced)

[ing_server_cleanup]:
  https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/ingestion_server/ingestion_server/cleanup.py#L79-L168
[tag_blacklisted]:
  https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/catalog/dags/common/storage/media.py#L245-L259
[url_utils]:
  https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py

## Step details

### Save TSV files of cleaned data to AWS S3

In a previous exploration, the Ingestion Server was set to [store TSV files of
the cleaned data][pr-saving-tsv] in the form of `<identifier> <cleaned_field>`
pairs, which can be used later to perform the updates efficiently in the catalog
DB, which only has indexes on the `identifier` field. These files are saved to
the disk of the Ingestion Server EC2 instances, and worked fine for files with
URL corrections since this type of field is relatively short, but became a
problem when trying to save tags, as the file grew too large and filled up the
disk, causing issues for the data refresh execution.

[aws_mpu]:
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html

To put some numbers on the problem we are dealing with, the following table
shows the number of records cleaned by field for the last runs at the moment of
writing this IP, except for tags, for which we don't have accurate records since
file saving was disabled.
| timestamp (UTC)     | `url` | `creator_url` | `foreign_landing_url` | `tags` |
| ------------------- | :---: | :-----------: | :-------------------: | :----: |
| 2024-02-27 04:05:26 | 22156 |    9035458    |        8809213        |   0    |
| 2024-02-20 04:06:56 | 22157 |    9035456    |        8809209        |   0    |
| 2024-02-13 04:41:22 | 22155 |    9035451    |        8809204        |   0    |

The alternative is to upload the TSV files to the Amazon Simple Storage Service
(S3), creating a new bucket or using a subfolder within `openverse-catalog`. The
benefit of using S3 buckets is that they have streaming capabilities and will
allow us to read the files in chunks later if necessary for performance. The
downside is that objects in S3 don't natively allow appending, so it may require
uploading files with different part numbers, or evaluating whether the
[multipart upload process][aws_mpu] or, more easily, the
[`smart_open`][smart_open] package could serve us here.

[smart_open]: https://github.com/piskvorky/smart_open

### Make and run a batched update DAG for one-time cleanups

A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
should take the files of the previous step to perform a batched update on the
catalog's image table, while handling deadlocking and timeout concerns, similar
to the existing [batched_update][batched_update] DAG. This table is constantly
in use by other DAGs, such as those for provider ingestion or the data refresh
process, and ideally can't be blocked by any single DAG.

[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update

A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
consisted of uploading each file to temporary `UNLOGGED` DB tables (which
provide huge gains in write performance, while their disadvantages are not
relevant to us since they won't be permanent), and including a `row_id` serial
number used later to query it in batches. The following must be included:

- Add an index for the `identifier` column in the temporary table after filling
  it up, to improve the query performance
- An adaptation to handle the column type of tags (`jsonb`) and modify the
  `metadata`
- Include an DAG task for reporting the number of rows affected by column to
  Slack

### Run an image data refresh to confirm cleaning time is reduced

Finally, after the previous steps are done, running a data refresh will confirm
that no more updates are applied at ingestion. If the time isn't significantly
reduced, then it will be necessary to check what was missed in the previous
steps. Looking at the files generated in
[step 1](#save-tsv-files-of-cleaned-data-to-aws-s3) may yield clues.
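As an example of the kind of check that could help here, the saved TSVs could be
tallied straight from S3 with `smart_open`. The bucket name and key layout in
the sketch below are assumptions, not a settled convention; whatever structure
the Ingestion Server ends up writing would be used instead.

```python
"""Sketch of a sanity check on the cleaned-data TSVs uploaded in step 1.

Counts close to zero would mean the catalog now holds the corrected values and
the cleaning steps can be retired.
"""
from collections import Counter

from smart_open import open as s3_open

CLEANED_FIELDS = ("url", "creator_url", "foreign_landing_url", "tags")


def count_remaining_corrections(bucket: str, prefix: str) -> Counter:
    counts: Counter = Counter()
    for field in CLEANED_FIELDS:
        # One TSV per cleaned field, e.g. s3://<bucket>/<prefix>/url.tsv
        with s3_open(f"s3://{bucket}/{prefix}/{field}.tsv", "r") as tsv:
            counts[field] = sum(1 for _ in tsv)
    return counts


if __name__ == "__main__":
    print(count_remaining_corrections("openverse-catalog", "cleaned-data"))
```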
If the time is confirmed to be reduced to zero, the cleaning steps can
optionally be removed, or left in place in case we want to perform a similar
cleaning effort later, e.g. see the
[Other projects or work](#other-projects-or-work) section.

## Dependencies

### Infrastructure

No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

### Tools & packages

Requires installing and becoming familiar with the [smart_open][smart_open]
utility.

### Other projects or work

Once the steps have been completed and the method has proven to work, we could
make additional similar corrections following the same procedure. Some
potentially related issues are:

- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)
- [Add filetype to all images in the catalog DB #1560](https://github.com/WordPress/openverse/issues/1560),
  for when the file type can be derived from the URL.

This will also open up space for more structural changes to the Openverse DB
schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)
of the Data Normalization endeavor.

- Previous attempt from cc-archive: [Clean preexisting data using ImageStore
  #517][mathemancer_pr]
- @obulat's PR to [add logging and save cleaned up data in the Ingestion
  Server][pr-saving-tsv]

[pr-saving-tsv]: https://github.com/WordPress/openverse/pull/904
[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517

From 5c5e7711049a58a7622092d704747ebd76338b79 Mon Sep 17 00:00:00 2001
From: Krystle Salazar
Date: Tue, 12 Mar 2024 12:46:54 -0400
Subject: [PATCH 4/4] Apply suggestions from code review

Co-authored-by: Olga Bulat
Co-authored-by: Madison Swain-Bowden
---
 .../20240227-implementation_plan_catalog_data_cleaning.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
index 33dad87d34c..1a7db196b93 100644
--- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
+++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -4,8 +4,8 @@

 ## Reviewers

-- [ ] @obulat
-- [ ] @AetherUnbound
+- [x] @obulat
+- [x] @AetherUnbound

 ## Project links

@@ -139,7 +139,7 @@ used later to query it in batches. The following must be included:
   it up, to improve the query performance
 - An adaptation to handle the column type of tags (`jsonb`) and modify the
   `metadata`
-- Include an DAG task for reporting the number of rows affected by column to
+- Include a DAG task for reporting the number of rows affected by column to
   Slack

 ### Run an image data refresh to confirm cleaning time is reduced