From 483be63ee42940b277f344eb083d7375eb80e959 Mon Sep 17 00:00:00 2001
From: Krystle Salazar
Date: Tue, 5 Mar 2024 18:07:14 -0400
Subject: [PATCH] Include smart_open in the Tools & packages section

---
 ...plementation_plan_catalog_data_cleaning.md | 35 ++++++++++++-------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
index 03f9db56026..e88fb64ffde 100644
--- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
+++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -88,7 +88,7 @@ which only had indexes for the `identifier` field.
 These files are saved to the disk of the Ingestion Server EC2 instances, and
 worked fine for files with URL corrections since this type of field is
 relatively short, but became a problem when trying to save tags, as the file
 turned too large and filled up the disk,
-causing problems to the data refresh execution.
+causing issues to the data refresh execution.
 
 [aws_mpu]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
@@ -110,22 +110,25 @@ benefit of using S3 buckets is that they have streaming capabilities and will
 allow us to read the files in chunks later if necessary for performance. The
 downside is that objects in S3 don't allow appending natively, so it may require
 uploading files with different part numbers or evaluating if the [multipart upload
-process][aws_mpu] or more easily, the `smart_open` package could serve us here.
+process][aws_mpu] or, more easily, the [`smart_open`][smart_open] package could
+serve us here.
 
 [smart_open]: https://github.com/piskvorky/smart_open
 
 The alternative is to upload TSV files to the Amazon Simple Storage Service
-(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
+(S3), creating a new bucket or using a subfolder within `openverse-catalog`. The
 benefit of using S3 buckets is that they have streaming capabilities and will
 allow us to read the files in chunks later if necessary for performance. The
-downside is that objects in S3 don't allow appending, so it may require to
-upload files with different part numbers or evaluate if the [multipart upload
-process][aws_mpu] will serve us here.
+downside is that objects in S3 don't allow appending natively, so it may require
+uploading files with different part numbers or evaluating if the [multipart
+upload process][aws_mpu] or, more easily, the `smart_open` package could serve us here.
+
+[smart_open]: https://github.com/piskvorky/smart_open
 
 ### Make and run a batched update DAG for one-time cleanup
 
 A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
-should take the files of the previous step to perform an batched update on the
+should take the files of the previous step to perform a batched update on the
 catalog's image table, while handling deadlocking and timeout concerns, similar
 to the [batched_update][batched_update]. This table is constantly in use by
 other DAGs, such as those from providers ingestion or the data refresh process,
@@ -136,10 +139,15 @@ and ideally can't be singly blocked by any DAG.
 A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
 consisted of uploading each file to temporary `UNLOGGED` DB tables (which
 provide huge gains in writing performance, while their disadvantages are not
-relevant to us, they won't be permanent), and including a `row_id` serial number
-used later to query it in batches. Adding an index in this last column after
-filling up the table could improve the query performance. An adaptation will be
-needed to handle the column type of tags (`jsonb`).
+relevant to us, as they won't be permanent), and included a `row_id` serial number
+used later to query it in batches. The following must be included:
+
+- Add an index for the `identifier` column in the temporary table after
+  filling it up, to improve the query performance
+- Adapt the update to handle the column type of tags (`jsonb`) and to modify
+  the `metadata`
+- Include a DAG task for reporting the number of rows affected per column to
+  Slack
 
 ### Run an image data refresh to confirm cleaning time is reduced
 
@@ -159,10 +167,11 @@ later.
 No changes needed. The Ingestion Server already has the credentials required to
 [connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).
 
 ### Tools & packages
 
-
+
+
+Requires installing and familiarizing with the [smart_open][smart_open] utility.
 
 ### Other projects or work
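For illustration, here is a minimal sketch (not code from this commit) of the `smart_open` approach referenced in the hunks above. It assumes the `smart_open[s3]` extra is installed and AWS credentials are already configured; the bucket comes from the plan, but the key name is made up for the example.

```python
# Minimal sketch, not part of the PR: stream a cleaned-data TSV to S3 with
# smart_open. The bucket name comes from the plan; the key is hypothetical.
from smart_open import open as s3_open  # pip install smart_open[s3]


def upload_cleaned_tags(rows):
    # Writing to an s3:// URI in "w" mode streams parts to S3 as the buffer
    # fills (multipart upload under the hood), so the file never needs to fit
    # on the instance's disk and no object "appending" is required.
    with s3_open("s3://openverse-catalog/cleaned-data/tags.tsv", "w") as fout:
        for identifier, tags in rows:
            fout.write(f"{identifier}\t{tags}\n")
```

Reading the object back in chunks works the same way with mode `"r"`, which is the streaming benefit the plan calls out.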
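Similarly, a sketch of the batched update described in the `UNLOGGED`-table hunk, under stated assumptions: the table name `temp_cleaned_tags`, its column set, psycopg2 as the client, and the batch size are all illustrative, not the actual DAG implementation.

```python
# Sketch only: apply cleaned values to the catalog in row_id-keyed batches.
# Table and column names here are assumptions for illustration.
import psycopg2

BATCH_SIZE = 10_000

UPDATE_SQL = """
UPDATE image AS img
SET tags = tmp.tags
FROM temp_cleaned_tags AS tmp
WHERE img.identifier = tmp.identifier
  AND tmp.row_id > %(lo)s
  AND tmp.row_id <= %(hi)s;
"""


def run_batched_update(dsn):
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        # Index the join key only after the UNLOGGED table is filled, as the
        # plan suggests, so the bulk load stays fast.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS tmp_identifier_idx "
            "ON temp_cleaned_tags (identifier);"
        )
        cur.execute("SELECT coalesce(max(row_id), 0) FROM temp_cleaned_tags;")
        max_row = cur.fetchone()[0]
        for lo in range(0, max_row, BATCH_SIZE):
            cur.execute(UPDATE_SQL, {"lo": lo, "hi": lo + BATCH_SIZE})
            conn.commit()  # commit per batch to keep locks short
    conn.close()
```

Accumulating `cur.rowcount` across batches would also yield the affected-row counts that the Slack reporting task described above needs.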