From 376d883f17e5acd3ffd72f436760354bc15041dc Mon Sep 17 00:00:00 2001 From: Krystle Salazar Date: Tue, 27 Feb 2024 22:32:39 -0400 Subject: [PATCH 1/4] Add Data Cleaning implementation plan --- ...plementation_plan_catalog_data_cleaning.md | 198 ++++++++++++++++++ .../proposals/data_normalization/index.md | 8 + 2 files changed, 206 insertions(+) create mode 100644 documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md create mode 100644 documentation/projects/proposals/data_normalization/index.md diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md new file mode 100644 index 00000000000..0494fda9aae --- /dev/null +++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md @@ -0,0 +1,198 @@ +# 2024-02-27 Implementation Plan: Catalog Data Cleaning + +**Author**: @krysal + +## Reviewers + +- [ ] TBD +- [ ] TBD + +## Project links + + + +- [Project Thread](https://github.com/WordPress/openverse/issues/430) + +This project does not have a project proposal because the scope and rationale of +the project are clear, as defined in the project thread. In doubt, check the +[Expected Outcomes](#expected-outcomes) section below. + +## Overview + +This document describes a solution for incorrect data in the catalog database +(DB) that has to be cleaned up every time a data refresh is run, avoiding wasted +resources. + +## Background + +One of the steps of the [data refresh process for images][img-data-refresh] is +cleaning the data that is not fit for production. This process is triggered +weekly by an Airflow DAG, and then runs in the Ingestion Server, taking +approximately just over **20 hours** to complete, according to a inspection of +latest executions. The cleaned data is only saved to the API database, which is +replaced each time during the same data refresh, causing it to have to be +repeated each time to make the _same_ corrections. + +This cleaning process was designed this way to speed the rows update up since +the relevant part was to provide the correct data to users via the API. Most of +the rows affected were added previous to the creation of the `MediaStore` class +in the Catalog (possibly by the discontinued CommonCrawl ingestion) which is +nowadays responsible for validating the provider data. However, it entails a +problem of wasting resources both in time, which continues to increase, and in +the machines (CPU) it uses, which could easily be avoided making the changes +permanent by saving them in the upstream database. + +[img-data-refresh]: + https://github.com/WordPress/openverse-catalog/blob/main/DAGs.md#image_data_refresh + +## Expected Outcomes + + + +- The catalog database (upstream) preserves the cleaned data results of the + current Ingestion Server's cleaning steps +- The image Data Refresh process is simplified by reducing the cleaning steps + time to nearly zero (and optionally removing them). + +## Step-by-step plan + +The cleaning functions that the Ingestion Server applies are already implemented +in the Catalog in the `MediaStore` class: see its `_tag_blacklisted` method +(which probably should be renamed) and the [url utilities][url_utils] file. The +only part that it's not there and can't be ported is the filtering of +low-confidence tags, since provider scripts don't save an "accuracy" by tag. 
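For context, the low-confidence tag filtering that stays behind in the Ingestion
Server looks roughly like the sketch below. This is only an illustration of the
idea: the threshold value, the denylist entries, and the function name are
assumptions, not the actual code in the server's `cleanup.py`.

```python
# Rough sketch of the low-confidence tag filtering that remains in the
# Ingestion Server. The constants and names here are illustrative assumptions.
MIN_TAG_ACCURACY = 0.9
DENYLISTED_TAGS = {"cc0", "no person"}  # hypothetical entries


def filter_tags(tags: list[dict] | None) -> list[dict]:
    """Drop denylisted tags and machine-generated tags below the accuracy cutoff."""
    cleaned = []
    for tag in tags or []:
        name = (tag.get("name") or "").strip().lower()
        accuracy = tag.get("accuracy")
        if not name or name in DENYLISTED_TAGS:
            continue
        if accuracy is not None and accuracy < MIN_TAG_ACCURACY:
            continue
        cleaned.append(tag)
    return cleaned
```

This step can't move to the catalog because provider scripts don't record a
per-tag accuracy to filter on, so it stays in the Ingestion Server regardless.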
+ +With this the plan then starts in the Ingestion Server with the following steps: + +1. [Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3) +1. [Make and run a batched update DAG for one-time cleanup](#make-and-run-a-batched-update-dag-for-one-time-cleanup) +1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced) + +[url_utils]: + https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py + +## Step details + +### Save TSV files of cleaned data to AWS S3 + +In a previous exploration, it was set to store TSV files of the cleaned data in +the form of ` `, which can be used later to perform +the updates efficiently in the catalog DB, which only had indexes for the +`identifier` field. These files are saved to the disk of the Ingestion Server +EC2 instances, and worked fine for files with URL corrections since this type of +fields is relatively short, but became a problem when trying to save tags, as +the file turned too large and filled up the disk, causing problems to the data +refresh execution. + +The alternative is to upload TSV files to the Amazon Simple Storage Service +(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The +benefit of using S3 buckets is that they have streaming capabilities and will +allow us to read the files in chunks later if necessary for performance. The +downside is that objects in S3 don't allow appending, so it may require to +upload files with different part numbers or evaluate if the [multipart upload +process][aws_mpu] will serve us here. + +[aws_mpu]: + https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html + +| timestamp (UTC) | 'url' | 'creator_url' | 'foreign_landing_url' | 'tags' | +| ------------------- | :---: | :-----------: | :-------------------: | :----: | +| 2024-02-27 04:05:26 | 22156 | 9035458 | 8809213 | 0 | +| 2024-02-20 04:06:56 | 22157 | 9035456 | 8809209 | 0 | +| 2024-02-13 04:41:22 | 22155 | 9035451 | 8809204 | 0 | + +To have some numbers of the problem we are delaing with, the previous table +shows the number of records cleaned by field for last runs at the moment of +writing this IP, except for tags, which we don't have accurate registries since +file saving was disabled. + +### Make and run a batched update DAG for one-time cleanup + +A batched catalog cleaner DAG (or potentially a `batched_update_from_file`) +should take the files of the previous step to perform an batched update on the +catalog's image table, while handling deadlocking and timeout concerns, similar +to the [batched_update][batched_update]. This table is constantly in use by +other DAGs, such as those from API providers or the data refresh process, and +ideally can't be singly blocked by any DAG. + +[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update + +A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601) +consisted of uploading each file to temporary `UNLOGGED` DB tables (which +provides huge gains in writing performance while their disadventages are not +relevant since they won't be permanent), and including a `row_id` serial number +used later to query it in batches. Adding an index in this last column after +filling up the table could improve the query performance. An adaptation will be +needed to handle the column type of tags (`jsonb`). 
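For illustration, a batched pass over such a staging table could look like the
sketch below. This is an assumption-heavy sketch rather than the actual DAG: the
table and column names are placeholders, and the real implementation would go
through Airflow's Postgres hooks and the existing batched update machinery
instead of a bare `psycopg2` connection.

```python
"""Sketch of the one-off batched update described above.

Assumes a TSV of identifier/corrected-URL pairs has already been loaded into an
UNLOGGED staging table named `cleaned_urls` with a serial `row_id` column.
"""
import psycopg2

BATCH_SIZE = 10_000

UPDATE_BATCH_SQL = """
    UPDATE image AS img
    SET url = fixes.url, updated_on = NOW()
    FROM cleaned_urls AS fixes
    WHERE img.identifier = fixes.identifier
      AND fixes.row_id > %(low)s
      AND fixes.row_id <= %(high)s;
"""


def run_batched_update(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute("SELECT max(row_id) FROM cleaned_urls;")
        total = cur.fetchone()[0] or 0
    for low in range(0, total, BATCH_SIZE):
        with conn.cursor() as cur:
            # Each batch commits on its own, so locks on the image table are
            # held briefly and other DAGs are not blocked for long periods.
            cur.execute(UPDATE_BATCH_SQL, {"low": low, "high": low + BATCH_SIZE})
        conn.commit()
    conn.close()
```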
### Run an image data refresh to confirm cleaning time is reduced

Finally, after the previous steps are done, running a data refresh will confirm
that no more updates are applied at ingestion. If the time isn't significantly
reduced, then it will be necessary to check what was missed in the previous
steps.

If the time is confirmed to be reduced to zero, the cleaning steps can
optionally be removed, or left in place in case we want to perform a similar
cleaning effort later.

## Dependencies

### Infrastructure

No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

### Other projects or work

Once the steps have been completed and the method has proven to work, we could
make additional similar corrections following the same procedure. Some
potentially related issues are:

- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)

This will also open up space for more structural changes to the Openverse DB
schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)
of the Data Normalization endeavor.

## Alternatives

A previous proposal was to use the `ImageStore` to re-evaluate every image in
the catalog DB. While this could theoretically be performed in a batched way
too, and presented the advantage of having future validations easily
incorporated in a single place, it also came with significant shortcomings and
complexities. The class would have to adapt to validations for images ingested
by the CommonCrawl process, for which it was not designed, and that could open a
can of extra problems. It would also have to go through the entire database to
update the bad rows, unless a mix of both proposals is implemented, but
ultimately the process described in this IP is considered more direct and
simpler for the goal.

## Rollback

In the rare case we need the old data back, we can resort to DB backups, which
are performed [weekly][db_snapshots].
+ +[db_snapshots]: ./../../../catalog/reference/DAGs.md#rotate_db_snapshots + + + +## Prior art + +- Previous attempt from cc-archive: [Clean preexisting data using ImageStore + #517][mathemancer_pr] +- @obulat's PR to + [add logging and save cleaned up data in the Ingestion Server](https://github.com/WordPress/openverse/pull/904) + +[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517 diff --git a/documentation/projects/proposals/data_normalization/index.md b/documentation/projects/proposals/data_normalization/index.md new file mode 100644 index 00000000000..2d2f7d8e966 --- /dev/null +++ b/documentation/projects/proposals/data_normalization/index.md @@ -0,0 +1,8 @@ +# Data Normalization + +```{toctree} +:titlesonly: +:glob: + +* +``` From e39a2cf3f0027aa80db6c5feb0b0dd3bfbfaf631 Mon Sep 17 00:00:00 2001 From: Krystle Salazar Date: Thu, 29 Feb 2024 17:35:07 -0400 Subject: [PATCH 2/4] Add reviewers --- .../20240227-implementation_plan_catalog_data_cleaning.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md index 0494fda9aae..bd6de7ac301 100644 --- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md +++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md @@ -4,8 +4,8 @@ ## Reviewers -- [ ] TBD -- [ ] TBD +- [ ] @obulat +- [ ] @AetherUnbound ## Project links From 9e3d2cb4c967c2ad48315d12997cdb9d76336e30 Mon Sep 17 00:00:00 2001 From: Krystle Salazar Date: Mon, 4 Mar 2024 14:20:18 -0400 Subject: [PATCH 3/4] Apply editorial suggestions Co-authored-by: Madison Swain-Bowden Co-authored-by: Olga Bulat Fix and add links Add suggested extra issue and adjust the Expected Outcomes Include smart_open in the Tools & packages section Apply editorial suggestions --- ...plementation_plan_catalog_data_cleaning.md | 140 ++++++++++-------- 1 file changed, 81 insertions(+), 59 deletions(-) diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md index bd6de7ac301..33dad87d34c 100644 --- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md +++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md @@ -19,45 +19,50 @@ the project are clear, as defined in the project thread. In doubt, check the ## Overview -This document describes a solution for incorrect data in the catalog database -(DB) that has to be cleaned up every time a data refresh is run, avoiding wasted -resources. +This document describes a mechanism for rectifying incorrect data in the catalog +database (DB) that currently has to be cleaned up every time a data refresh is +run. This one-time fix is an effort to avoid wasting resources and data refresh +runtime. ## Background One of the steps of the [data refresh process for images][img-data-refresh] is cleaning the data that is not fit for production. 
This process is triggered
weekly by an Airflow DAG, which then runs in the Ingestion Server, taking
approximately just over **20 hours** to complete, according to an inspection of
recent executions as of the time of drafting this document. The cleaned data is
only saved to the API database, which is replaced each time during the same data
refresh, meaning this process has to be repeated each time to make the _same_
corrections.

This cleaning process was designed this way to optimize writes to the API
database, since the most important factor was to provide the correct data to
users via the API. Most of the rows affected were added prior to the creation of
the `MediaStore` class in the Catalog (possibly by the discontinued CommonCrawl
ingestion), which is nowadays responsible for validating the provider data prior
to upserting the records into the upstream database. However, the current
approach entails a waste of resources, both in time, which continues to
increase, and in the machines (CPU) it uses, which could easily be avoided by
making the changes permanent, saving them in the upstream database.

[img-data-refresh]: ./../../../catalog/reference/DAGs.md#image_data_refresh

## Expected Outcomes

- The catalog database (upstream) contains the cleaned data outputs of the
  current Ingestion Server's cleaning steps
- The image Data Refresh process is simplified by significantly reducing the
  cleaning time.

## Step-by-step plan

The cleaning functions that the Ingestion Server applies (see the
[cleanup][ing_server_cleanup] file) are already implemented in the Catalog in
the `MediaStore` class: see its [`_tag_blacklisted` method][tag_blacklisted]
(which probably should be renamed) and the [url utilities][url_utils] file. The
only part that isn't there and can't be ported is the filtering of
low-confidence tags, since provider scripts don't save an "accuracy" by tag.

With this, the plan then starts in the Ingestion Server with the following steps:

1. 
[Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3)
1. [Make and run a batched update DAG for one-time cleanups](#make-and-run-a-batched-update-dag-for-one-time-cleanups)
1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced)

[ing_server_cleanup]:
  https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/ingestion_server/ingestion_server/cleanup.py#L79-L168
[tag_blacklisted]:
  https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/catalog/dags/common/storage/media.py#L245-L259
[url_utils]:
  https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py

## Step details

### Save TSV files of cleaned data to AWS S3

In a previous exploration, the Ingestion Server was set to [store TSV files of
the cleaned data][pr-saving-tsv] in the form of `<identifier> <cleaned_field>`
pairs, which can be used later to perform the updates efficiently in the catalog
DB, which only has indexes on the `identifier` field. These files are saved to
the disk of the Ingestion Server EC2 instances, and worked fine for files with
URL corrections since this type of field is relatively short, but became a
problem when trying to save tags, as the file grew too large and filled up the
disk, causing issues for the data refresh execution.

[aws_mpu]:
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html

To put some numbers on the problem we are dealing with, the following table
shows the number of records cleaned by field for the last runs at the moment of
writing this IP, except for tags, for which we don't have accurate records since
file saving was disabled.
| timestamp (UTC)     | `url` | `creator_url` | `foreign_landing_url` | `tags` |
| ------------------- | :---: | :-----------: | :-------------------: | :----: |
| 2024-02-27 04:05:26 | 22156 |    9035458    |        8809213        |   0    |
| 2024-02-20 04:06:56 | 22157 |    9035456    |        8809209        |   0    |
| 2024-02-13 04:41:22 | 22155 |    9035451    |        8809204        |   0    |

The alternative is to upload the TSV files to the Amazon Simple Storage Service
(S3), creating a new bucket or using a subfolder within `openverse-catalog`. The
benefit of using S3 buckets is that they have streaming capabilities and will
allow us to read the files in chunks later if necessary for performance. The
downside is that objects in S3 don't natively allow appending, so it may require
uploading files with different part numbers, or evaluating whether the
[multipart upload process][aws_mpu] or, more easily, the
[`smart_open`][smart_open] package could serve us here.

[smart_open]: https://github.com/piskvorky/smart_open

### Make and run a batched update DAG for one-time cleanups

A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
should take the files of the previous step to perform a batched update on the
catalog's image table, while handling deadlocking and timeout concerns, similar
to the existing [batched_update][batched_update] DAG. This table is constantly
in use by other DAGs, such as those for provider ingestion or the data refresh
process, and ideally can't be blocked by any single DAG.

[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update

A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
consisted of uploading each file to temporary `UNLOGGED` DB tables (which
provide huge gains in write performance, while their disadvantages are not
relevant to us since they won't be permanent), and including a `row_id` serial
number used later to query it in batches. The following must be included:

- Add an index for the `identifier` column in the temporary table after filling
  it up, to improve the query performance
- An adaptation to handle the column type of tags (`jsonb`) and modify the
  `metadata`
- Include an DAG task for reporting the number of rows affected by column to
  Slack

### Run an image data refresh to confirm cleaning time is reduced

Finally, after the previous steps are done, running a data refresh will confirm
that no more updates are applied at ingestion. If the time isn't significantly
reduced, then it will be necessary to check what was missed in the previous
steps. Looking at the files generated in
[step 1](#save-tsv-files-of-cleaned-data-to-aws-s3) may yield clues.
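As an example of the kind of check that could help here, the saved TSVs could be
tallied straight from S3 with `smart_open`. The bucket name and key layout in
the sketch below are assumptions, not a settled convention; whatever structure
the Ingestion Server ends up writing would be used instead.

```python
"""Sketch of a sanity check on the cleaned-data TSVs uploaded in step 1.

Counts close to zero would mean the catalog now holds the corrected values and
the cleaning steps can be retired.
"""
from collections import Counter

from smart_open import open as s3_open

CLEANED_FIELDS = ("url", "creator_url", "foreign_landing_url", "tags")


def count_remaining_corrections(bucket: str, prefix: str) -> Counter:
    counts: Counter = Counter()
    for field in CLEANED_FIELDS:
        # One TSV per cleaned field, e.g. s3://<bucket>/<prefix>/url.tsv
        with s3_open(f"s3://{bucket}/{prefix}/{field}.tsv", "r") as tsv:
            counts[field] = sum(1 for _ in tsv)
    return counts


if __name__ == "__main__":
    print(count_remaining_corrections("openverse-catalog", "cleaned-data"))
```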
If the time is confirmed to be reduced to zero, the cleaning steps can
optionally be removed, or left in place in case we want to perform a similar
cleaning effort later, e.g. see the
[Other projects or work](#other-projects-or-work) section.

## Dependencies

### Infrastructure

No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

### Tools & packages

Requires installing and becoming familiar with the [smart_open][smart_open]
utility.

### Other projects or work

Once the steps have been completed and the method has proven to work, we could
make additional similar corrections following the same procedure. Some
potentially related issues are:

- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)
- [Add filetype to all images in the catalog DB #1560](https://github.com/WordPress/openverse/issues/1560),
  for when the file type can be derived from the URL.

This will also open up space for more structural changes to the Openverse DB
schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)
of the Data Normalization endeavor.

- Previous attempt from cc-archive: [Clean preexisting data using ImageStore
  #517][mathemancer_pr]
- @obulat's PR to [add logging and save cleaned up data in the Ingestion
  Server][pr-saving-tsv]

[pr-saving-tsv]: https://github.com/WordPress/openverse/pull/904
[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517

From 5c5e7711049a58a7622092d704747ebd76338b79 Mon Sep 17 00:00:00 2001
From: Krystle Salazar
Date: Tue, 12 Mar 2024 12:46:54 -0400
Subject: [PATCH 4/4] Apply suggestions from code review

Co-authored-by: Olga Bulat
Co-authored-by: Madison Swain-Bowden
---
 .../20240227-implementation_plan_catalog_data_cleaning.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
index 33dad87d34c..1a7db196b93 100644
--- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
+++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -4,8 +4,8 @@

 ## Reviewers

-- [ ] @obulat
-- [ ] @AetherUnbound
+- [x] @obulat
+- [x] @AetherUnbound

 ## Project links

@@ -139,7 +139,7 @@ used later to query it in batches. The following must be included:
   it up, to improve the query performance
 - An adaptation to handle the column type of tags (`jsonb`) and modify the
   `metadata`
-- Include an DAG task for reporting the number of rows affected by column to
+- Include a DAG task for reporting the number of rows affected by column to
   Slack

 ### Run an image data refresh to confirm cleaning time is reduced