Commit

Include smart_open in the Tools & packages section
krysal committed Mar 7, 2024
1 parent a4ecc4c commit 483be63
Showing 1 changed file with 22 additions and 13 deletions.
@@ -88,7 +88,7 @@
which only had indexes for the `identifier` field. These files are saved to the
disk of the Ingestion Server EC2 instances, and worked fine for files with URL
corrections since this type of field is relatively short, but became a problem
when trying to save tags, as the file grew too large and filled up the disk,
causing issues for the data refresh execution.

[aws_mpu]:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
@@ -110,22 +110,25 @@
The alternative is to upload TSV files to the Amazon Simple Storage Service
(S3), creating a new bucket or using a subfolder within `openverse-catalog`. The
benefit of using S3 buckets is that they have streaming capabilities and will
allow us to read the files in chunks later if necessary for performance. The
downside is that objects in S3 don't allow appending natively, so it may require
uploading files with different part numbers or evaluating whether the
[multipart upload process][aws_mpu] or, more easily, the
[`smart_open`][smart_open] package could serve us here.

[smart_open]: https://github.com/piskvorky/smart_open
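
If `smart_open` is chosen, a minimal sketch of the upload step could look like
the following. The bucket, key, and TSV layout here are illustrative
assumptions rather than settled paths, and `smart_open` performs the multipart
upload transparently for large writes:

```python
# Illustrative sketch: the bucket/key and row format are assumptions, not the
# final Openverse layout.
from smart_open import open as s3_open


def upload_cleaned_tags(rows, uri="s3://openverse-catalog/cleaned-data/tags.tsv"):
    """Stream (identifier, tags) pairs to S3 as TSV without buffering the whole file."""
    # smart_open opens s3:// URIs directly and switches to a multipart upload
    # for large writes, so there is no need to append to an existing object.
    with s3_open(uri, "w") as fout:
        for identifier, tags in rows:
            fout.write(f"{identifier}\t{tags}\n")
```

If the defaults turn out to be insufficient, `smart_open` also accepts
`transport_params` for tweaking credentials or part size.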

### Make and run a batched update DAG for one-time cleanup

A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
should take the files of the previous step to perform a batched update on the
catalog's image table, while handling deadlocking and timeout concerns, similar
to the [batched_update][batched_update]. This table is constantly in use by
other DAGs, such as those for provider ingestion or the data refresh process,
@@ -136,10 +139,15 @@
and ideally can't be singly blocked by any DAG.
A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
consisted of uploading each file to temporary `UNLOGGED` DB tables (which
provide huge gains in writing performance, while their disadvantages are not
relevant to us since the tables won't be permanent), and including a `row_id`
serial number used later to query them in batches. The following must be
included (see the sketch after this list):

- Add an index on the `identifier` column of the temporary table after filling
  it up, to improve the query performance
- Adapt the process to handle the column type of tags (`jsonb`) and to modify
  the `metadata`
- Include a DAG task that reports to Slack the number of rows affected per
  column
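
A minimal sketch of how these pieces could fit together, assuming an
illustrative staging table name and a plain `psycopg2`-style connection in
place of the actual Airflow operators; batching over `row_id` keeps each
transaction short so the `image` table isn't locked for long stretches:

```python
# Illustrative sketch: the staging table name and TSV columns are assumptions
# based on the proof of concept; the real DAG would run these as separate tasks.
BATCH_SIZE = 10_000

SETUP_SQL = """
CREATE UNLOGGED TABLE temp_cleaned_tags (
    row_id     serial,
    identifier uuid,
    tags       text
);
-- Load the TSV here (e.g. with COPY), then add the index afterwards: building
-- it once after the bulk load is cheaper than maintaining it during the load.
CREATE INDEX ON temp_cleaned_tags (identifier);
"""

BATCH_UPDATE_SQL = """
UPDATE image
SET tags = s.tags::jsonb, updated_on = now()
FROM temp_cleaned_tags AS s
WHERE s.row_id BETWEEN %(start)s AND %(end)s
  AND image.identifier = s.identifier;
"""


def run_batched_update(conn, total_rows, batch_size=BATCH_SIZE):
    """Apply the staged corrections in batches; conn is a psycopg2 connection."""
    affected = 0
    for start in range(1, total_rows + 1, batch_size):
        with conn.cursor() as cur:
            cur.execute(
                BATCH_UPDATE_SQL, {"start": start, "end": start + batch_size - 1}
            )
            affected += cur.rowcount  # accumulate for the Slack report
        conn.commit()  # commit per batch so locks are released quickly
    return affected
```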

### Run an image data refresh to confirm cleaning time is reduced

@@ -159,10 +167,11 @@
No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

### Tools & packages

<!-- Describe any tools or packages which this work might be dependent on. If multiple options are available, try to list as many as are reasonable with your own recommendation. -->

Requires installing and becoming familiar with the [smart_open][smart_open]
utility.
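
As a complement to the upload sketch above, the streaming read mentioned
earlier (reading the staged file in chunks) could look roughly like this, again
with an assumed S3 URI:

```python
# Illustrative sketch: the S3 URI is an assumption.
from smart_open import open as s3_open


def iter_cleaned_rows(uri="s3://openverse-catalog/cleaned-data/tags.tsv"):
    """Yield (identifier, tags) pairs from the staged TSV, streamed from S3."""
    with s3_open(uri, "r") as fin:  # the object is fetched lazily, not downloaded whole
        for line in fin:
            identifier, tags = line.rstrip("\n").split("\t", maxsplit=1)
            yield identifier, tags
```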

### Other projects or work

