
## Overview

This document describes a mechanism for rectifying incorrect data in the catalog
database (DB) that currently has to be cleaned up every time a data refresh is
run. This one-time fix is an effort to avoid wasting resources and data refresh
runtime.

## Background

One of the steps of the [data refresh process for images][img-data-refresh] is
cleaning the data that is not fit for production. This process is triggered
weekly by an Airflow DAG, which then runs in the Ingestion Server, taking
approximately just over **20 hours** to complete, according to an inspection of
recent executions as of the time of drafting this document. The cleaned data is
only saved to the API database, which is replaced each time during the same data
refresh, meaning this process has to be repeated each time to make the _same_
corrections.

This cleaning process was designed this way to speed up the row updates, since
the priority was to provide correct data to users via the API. Most of
the rows affected were added prior to the creation of the `MediaStore` class in
the Catalog (possibly by the discontinued CommonCrawl ingestion) which is
nowadays responsible for validating the provider data. However, this approach
wastes resources both in time, which continues to increase, and in the machines
(CPU) it uses, which could easily be avoided by making the changes permanent,
saving them in the upstream database.

<!-- List any succinct expected products from this implementation plan. -->

- The catalog database (upstream) contains the cleaned data outputs of the
current Ingestion Server's cleaning steps
- The image Data Refresh process is simplified by reducing the cleaning steps'
  time to nearly zero (and optionally removing them).
With this, the plan then starts in the Ingestion Server with the following steps:

### Save TSV files of cleaned data to AWS S3

In a previous exploration, the Ingestion Server was set to store TSV files of
the cleaned data in the form of `<identifier> <cleaned_field>`, which can be
used later to perform the updates efficiently in the catalog DB, which only had
indexes for the `identifier` field. These files were saved to the disk of the
Ingestion Server EC2 instances and worked fine for URL corrections, since those
fields are relatively short, but became a problem when saving tags, as the file
grew too large and filled up the disk, causing problems for the data refresh
execution.
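
As a rough illustration only, the following Python sketch shows how a cleaning
step could append `<identifier> <cleaned_field>` pairs to such a TSV file; the
function name, output directory, and the shape of the cleaned rows are
assumptions made for this example, not the Ingestion Server's actual code.

```python
import csv
from pathlib import Path


def write_cleaned_field_tsv(
    cleaned_rows: list[tuple[str, str]],  # (identifier, cleaned value) pairs
    field: str,
    output_dir: Path = Path("/tmp/cleaned_data"),  # hypothetical location
) -> Path:
    """Append tab-separated `<identifier> <cleaned_value>` rows for one field."""
    output_dir.mkdir(parents=True, exist_ok=True)
    tsv_path = output_dir / f"{field}.tsv"
    with tsv_path.open("a", newline="") as tsv_file:
        writer = csv.writer(tsv_file, delimiter="\t")
        writer.writerows(cleaned_rows)
    return tsv_path


# e.g. write_cleaned_field_tsv([("<uuid>", "https://example.com/fixed")], "url")
```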

The alternative is to upload TSV files to the Amazon Simple Storage Service
(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
[multipart upload process][aws_mpu] will serve us here.
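
A minimal sketch of that upload step, assuming `boto3` is available on the
Ingestion Server; the bucket name follows the `openverse-catalog` option above,
while the key prefix and threshold values are placeholders. `upload_file`
switches to a multipart upload automatically once the file size exceeds the
configured threshold.

```python
import boto3
from boto3.s3.transfer import TransferConfig


def upload_cleaned_tsv(tsv_path: str, field: str) -> None:
    """Upload one cleaned-data TSV to S3, using multipart upload for large files."""
    s3 = boto3.client("s3")
    # Files above the threshold are uploaded in parts automatically.
    transfer_config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # 64 MB, illustrative
        multipart_chunksize=64 * 1024 * 1024,
    )
    s3.upload_file(
        Filename=tsv_path,
        Bucket="openverse-catalog",  # or a dedicated new bucket
        Key=f"data-refresh-cleaned-data/{field}.tsv",  # hypothetical subfolder
        Config=transfer_config,
    )
```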
| 2024-02-20 04:06:56 | 22157 | 9035456 | 8809209 | 0 |
| 2024-02-13 04:41:22 | 22155 | 9035451 | 8809204 | 0 |

To give a sense of the scale of the problem we are dealing with, the previous
table shows the number of records cleaned by field for the last runs at the
moment of writing this IP, except for tags, for which we don't have accurate
counts since file saving was disabled.
A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
should take the files from the previous step to perform a batched update on the
catalog's image table, while handling deadlocking and timeout concerns, similar
to the [batched_update][batched_update]. This table is constantly in use by
other DAGs, such as the provider ingestion DAGs or the data refresh process,
and ideally shouldn't be blocked by any single DAG.

[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update
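
To make the batching concrete, here is one possible shape of the update logic
such a DAG could run, written as a standalone Python sketch; the temp-table
approach, batch size, and the assumption that each TSV maps to a single text
column of `image` are illustrative choices, not the actual `batched_update`
implementation.

```python
import psycopg2

BATCH_SIZE = 10_000  # illustrative; the real DAG would tune this and add retries


def apply_cleaned_field(conn_params: dict, tsv_path: str, field: str = "url") -> None:
    """Load a cleaned-data TSV into a temp table and update `image` in batches.

    `field` is assumed to come from a fixed allow-list of cleaned columns
    (never user input), since it is interpolated directly into the SQL.
    """
    conn = psycopg2.connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "CREATE TEMP TABLE cleaned_data "
                "(identifier uuid PRIMARY KEY, value text)"
            )
            with open(tsv_path) as tsv_file:
                cur.copy_expert(
                    "COPY cleaned_data (identifier, value) "
                    "FROM STDIN WITH (FORMAT text)",
                    tsv_file,
                )
            conn.commit()
            while True:
                # Move one batch out of the temp table and apply it to `image`,
                # committing per batch so no long-lived lock blocks other DAGs.
                cur.execute(
                    f"""
                    WITH batch AS (
                        DELETE FROM cleaned_data
                        WHERE identifier IN (
                            SELECT identifier FROM cleaned_data LIMIT %s
                        )
                        RETURNING identifier, value
                    )
                    UPDATE image
                    SET {field} = batch.value
                    FROM batch
                    WHERE image.identifier = batch.identifier
                    """,
                    (BATCH_SIZE,),
                )
                conn.commit()
                cur.execute("SELECT EXISTS (SELECT 1 FROM cleaned_data)")
                if not cur.fetchone()[0]:
                    break
    finally:
        conn.close()
```

Committing after each batch keeps locks on `image` short-lived, which addresses
the concern above about not blocking other DAGs for extended periods.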
