From 483be63ee42940b277f344eb083d7375eb80e959 Mon Sep 17 00:00:00 2001
From: Krystle Salazar
Date: Tue, 5 Mar 2024 18:07:14 -0400
Subject: [PATCH] Include smart_open in the Tools & packages section

---
 ...plementation_plan_catalog_data_cleaning.md | 35 ++++++++++++-------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
index 03f9db56026..e88fb64ffde 100644
--- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
+++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -88,7 +88,7 @@ which only had indexes for the `identifier` field.
 These files are saved to the disk of the Ingestion Server EC2 instances, and
 worked fine for files with URL corrections since this type of field is
 relatively short, but became a problem when trying to save tags, as the file
 turned too large and filled up the disk,
-causing problems to the data refresh execution.
+causing issues to the data refresh execution.
 
 [aws_mpu]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
@@ -110,22 +110,25 @@ benefit of using S3 buckets is that they have streaming capabilities and will
 allow us to read the files in chunks later if necessary for performance. The
 downside is that objects in S3 don't allow appending natively, so it may require
 uploading files with different part numbers or evaluating if the [multipart upload
-process][aws_mpu] or more easily, the `smart_open` package could serve us here.
+process][aws_mpu] or, more easily, the [`smart_open`][smart_open] package could
+serve us here.
 
 [smart_open]: https://github.com/piskvorky/smart_open
 
 The alternative is to upload TSV files to the Amazon Simple Storage Service
-(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
+(S3), creating a new bucket or using a subfolder within `openverse-catalog`. The
 benefit of using S3 buckets is that they have streaming capabilities and will
 allow us to read the files in chunks later if necessary for performance. The
-downside is that objects in S3 don't allow appending, so it may require to
-upload files with different part numbers or evaluate if the [multipart upload
-process][aws_mpu] will serve us here.
+downside is that objects in S3 don't allow appending natively, so it may require
+uploading files with different part numbers or evaluating if the [multipart
+upload process][aws_mpu] or, more easily, the `smart_open` package could serve us here.
+
+[smart_open]: https://github.com/piskvorky/smart_open
 
 ### Make and run a batched update DAG for one-time cleanup
 
 A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
-should take the files of the previous step to perform an batched update on the
+should take the files of the previous step to perform a batched update on the
 catalog's image table, while handling deadlocking and timeout concerns, similar
 to the [batched_update][batched_update]. This table is constantly in use by
 other DAGs, such as those from providers ingestion or the data refresh process,
@@ -136,10 +139,15 @@ and ideally can't be singly blocked by any DAG.
 A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
 consisted of uploading each file to temporary `UNLOGGED` DB tables (which
 provide huge gains in writing performance, while their disadvantages are not
-relevant to us, they won't be permanent), and including a `row_id` serial number
-used later to query it in batches. Adding an index in this last column after
-filling up the table could improve the query performance. An adaptation will be
-needed to handle the column type of tags (`jsonb`).
+relevant to us, as they won't be permanent), and included a `row_id` serial number
+used later to query it in batches. The following must be included:
+
+- Add an index for the `identifier` column in the temporary table after
+  filling it up, to improve the query performance
+- Adapt the update to handle the column type of tags (`jsonb`) and to modify
+  the `metadata`
+- Include a DAG task for reporting the number of rows affected per column to
+  Slack
 
 ### Run an image data refresh to confirm cleaning time is reduced
 
@@ -159,10 +167,11 @@ later.
 No changes needed. The Ingestion Server already has the credentials required to
 [connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).
 
 ### Tools & packages
 
-
+
+
+Requires installing and familiarizing with the [smart_open][smart_open] utility.
 
 ### Other projects or work
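For illustration, here is a minimal sketch (not code from this commit) of the `smart_open` approach referenced in the hunks above. It assumes the `smart_open[s3]` extra is installed and AWS credentials are already configured; the bucket comes from the plan, but the key name is made up for the example.

```python
# Minimal sketch, not part of the PR: stream a cleaned-data TSV to S3 with
# smart_open. The bucket name comes from the plan; the key is hypothetical.
from smart_open import open as s3_open  # pip install smart_open[s3]


def upload_cleaned_tags(rows):
    # Writing to an s3:// URI in "w" mode streams parts to S3 as the buffer
    # fills (multipart upload under the hood), so the file never needs to fit
    # on the instance's disk and no object "appending" is required.
    with s3_open("s3://openverse-catalog/cleaned-data/tags.tsv", "w") as fout:
        for identifier, tags in rows:
            fout.write(f"{identifier}\t{tags}\n")
```

Reading the object back in chunks works the same way with mode `"r"`, which is the streaming benefit the plan calls out.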
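Similarly, a sketch of the batched update described in the `UNLOGGED`-table hunk, under stated assumptions: the table name `temp_cleaned_tags`, its column set, psycopg2 as the client, and the batch size are all illustrative, not the actual DAG implementation.

```python
# Sketch only: apply cleaned values to the catalog in row_id-keyed batches.
# Table and column names here are assumptions for illustration.
import psycopg2

BATCH_SIZE = 10_000

UPDATE_SQL = """
UPDATE image AS img
SET tags = tmp.tags
FROM temp_cleaned_tags AS tmp
WHERE img.identifier = tmp.identifier
  AND tmp.row_id > %(lo)s
  AND tmp.row_id <= %(hi)s;
"""


def run_batched_update(dsn):
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        # Index the join key only after the UNLOGGED table is filled, as the
        # plan suggests, so the bulk load stays fast.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS tmp_identifier_idx "
            "ON temp_cleaned_tags (identifier);"
        )
        cur.execute("SELECT coalesce(max(row_id), 0) FROM temp_cleaned_tags;")
        max_row = cur.fetchone()[0]
        for lo in range(0, max_row, BATCH_SIZE):
            cur.execute(UPDATE_SQL, {"lo": lo, "hi": lo + BATCH_SIZE})
            conn.commit()  # commit per batch to keep locks short
    conn.close()
```

Accumulating `cur.rowcount` across batches would also yield the affected-row counts that the Slack reporting task described above needs.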