
## Overview

This document describes a mechanism for rectifying incorrect data in the catalog
database (DB) that currently has to be cleaned up every time a data refresh is
run. This one-time fix is an effort to avoid wasting resources and data refresh
runtime.

## Background

One of the steps of the [data refresh process for images][img-data-refresh] is
cleaning the data that is not fit for production. This process is triggered
weekly by an Airflow DAG, which then runs in the Ingestion Server, taking
approximately just over **20 hours** to complete, according to an inspection of
recent executions as of the time of drafting this document. The cleaned data is
only saved to the API database, which is replaced each time during the same data
refresh, meaning this process has to be repeated each time to make the _same_
corrections.

This cleaning process was designed this way to speed up the row updates, since
the priority was to provide correct data to users via the API. Most of
the rows affected were added prior to the creation of the `MediaStore` class in
the Catalog (possibly by the discontinued CommonCrawl ingestion) which is
nowadays responsible for validating the provider data. However, this approach
wastes resources both in time, which continues to increase, and in the machines
(CPU) it uses, which could easily be avoided by making the changes permanent,
saving them in the upstream database.

<!-- List any succinct expected products from this implementation plan. -->

- The catalog database (upstream) contains the cleaned data outputs of the
current Ingestion Server's cleaning steps
- The image Data Refresh process is simplified by reducing the cleaning steps'
  time to nearly zero (and optionally removing them).
With this, the plan then starts in the Ingestion Server with the following steps:

### Save TSV files of cleaned data to AWS S3

In a previous exploration, the Ingestion Server was set to store TSV files of
the cleaned data in the form of `<identifier> <cleaned_field>`, which can be
used later to perform the updates efficiently in the catalog DB, which only had
indexes for the `identifier` field. These files were saved to the disk of the
Ingestion Server EC2 instances and worked fine for URL corrections, since those
fields are relatively short, but became a problem when saving tags, as the file
grew too large and filled up the disk, causing problems for the data refresh
execution.
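
As a rough illustration only, the following Python sketch shows how a cleaning
step could append `<identifier> <cleaned_field>` pairs to such a TSV file; the
function name, output directory, and the shape of the cleaned rows are
assumptions made for this example, not the Ingestion Server's actual code.

```python
import csv
from pathlib import Path


def write_cleaned_field_tsv(
    cleaned_rows: list[tuple[str, str]],  # (identifier, cleaned value) pairs
    field: str,
    output_dir: Path = Path("/tmp/cleaned_data"),  # hypothetical location
) -> Path:
    """Append tab-separated `<identifier> <cleaned_value>` rows for one field."""
    output_dir.mkdir(parents=True, exist_ok=True)
    tsv_path = output_dir / f"{field}.tsv"
    with tsv_path.open("a", newline="") as tsv_file:
        writer = csv.writer(tsv_file, delimiter="\t")
        writer.writerows(cleaned_rows)
    return tsv_path


# e.g. write_cleaned_field_tsv([("<uuid>", "https://example.com/fixed")], "url")
```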

The alternative is to upload TSV files to the Amazon Simple Storage Service
(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
[multipart upload process][aws_mpu] will serve us here.
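
A minimal sketch of that upload step, assuming `boto3` is available on the
Ingestion Server; the bucket name follows the `openverse-catalog` option above,
while the key prefix and threshold values are placeholders. `upload_file`
switches to a multipart upload automatically once the file size exceeds the
configured threshold.

```python
import boto3
from boto3.s3.transfer import TransferConfig


def upload_cleaned_tsv(tsv_path: str, field: str) -> None:
    """Upload one cleaned-data TSV to S3, using multipart upload for large files."""
    s3 = boto3.client("s3")
    # Files above the threshold are uploaded in parts automatically.
    transfer_config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # 64 MB, illustrative
        multipart_chunksize=64 * 1024 * 1024,
    )
    s3.upload_file(
        Filename=tsv_path,
        Bucket="openverse-catalog",  # or a dedicated new bucket
        Key=f"data-refresh-cleaned-data/{field}.tsv",  # hypothetical subfolder
        Config=transfer_config,
    )
```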
| 2024-02-20 04:06:56 | 22157 | 9035456 | 8809209 | 0 |
| 2024-02-13 04:41:22 | 22155 | 9035451 | 8809204 | 0 |

To give a sense of the scale of the problem we are dealing with, the previous
table shows the number of records cleaned by field for the last runs at the
moment of writing this IP, except for tags, for which we don't have accurate
counts since file saving was disabled.
A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
should take the files from the previous step to perform a batched update on the
catalog's image table, while handling deadlocking and timeout concerns, similar
to the [batched_update][batched_update]. This table is constantly in use by
other DAGs, such as the provider ingestion DAGs or the data refresh process,
and ideally shouldn't be blocked by any single DAG.

[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update
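
To make the batching concrete, here is one possible shape of the update logic
such a DAG could run, written as a standalone Python sketch; the temp-table
approach, batch size, and the assumption that each TSV maps to a single text
column of `image` are illustrative choices, not the actual `batched_update`
implementation.

```python
import psycopg2

BATCH_SIZE = 10_000  # illustrative; the real DAG would tune this and add retries


def apply_cleaned_field(conn_params: dict, tsv_path: str, field: str = "url") -> None:
    """Load a cleaned-data TSV into a temp table and update `image` in batches.

    `field` is assumed to come from a fixed allow-list of cleaned columns
    (never user input), since it is interpolated directly into the SQL.
    """
    conn = psycopg2.connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "CREATE TEMP TABLE cleaned_data "
                "(identifier uuid PRIMARY KEY, value text)"
            )
            with open(tsv_path) as tsv_file:
                cur.copy_expert(
                    "COPY cleaned_data (identifier, value) "
                    "FROM STDIN WITH (FORMAT text)",
                    tsv_file,
                )
            conn.commit()
            while True:
                # Move one batch out of the temp table and apply it to `image`,
                # committing per batch so no long-lived lock blocks other DAGs.
                cur.execute(
                    f"""
                    WITH batch AS (
                        DELETE FROM cleaned_data
                        WHERE identifier IN (
                            SELECT identifier FROM cleaned_data LIMIT %s
                        )
                        RETURNING identifier, value
                    )
                    UPDATE image
                    SET {field} = batch.value
                    FROM batch
                    WHERE image.identifier = batch.identifier
                    """,
                    (BATCH_SIZE,),
                )
                conn.commit()
                cur.execute("SELECT EXISTS (SELECT 1 FROM cleaned_data)")
                if not cur.fetchone()[0]:
                    break
    finally:
        conn.close()
```

Committing after each batch keeps locks on `image` short-lived, which addresses
the concern above about not blocking other DAGs for extended periods.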
