`insert_or_update()` (which is really both `insert_or_update_batch()` and `insert_or_update_bulk()` right now; they should be resolved into one in #989) can have problems when given a set of rows that includes more than one `issue`. Multiple `issue` values are not problematic in themselves, but the desired result may not be achieved when there are multiple rows for the same source/signal/time/geo that have different issues.
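To make the failure mode concrete, here is a hedged illustration; the table and column names below are assumptions for the sketch, not taken from the actual schema. Both rows share a source/signal/geo/time tuple but carry different issues, and both arrive in the load table with the default `is_latest_issue = 1`:

```sql
-- Hypothetical load-table batch; names are illustrative, not the real schema.
INSERT INTO epimetric_load
  (source, `signal`, geo_type, geo_value, time_type, time_value, issue, value)
VALUES
  -- same key tuple, two different issues, both defaulting to is_latest_issue=1
  ('src', 'sig', 'county', '42003', 'day', 20210101, 20210102, 1.0),
  ('src', 'sig', 'county', '42003', 'day', 20210101, 20210105, 1.5);
```

Nothing in the single-issue logic described below is guaranteed to demote the older issue (20210102) before the batch is promoted to the latest table.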
The problem stems from the fact that the SQL in `fix_is_latest_issue_sql` naively assumes only one `issue` will be present for each `(source, signal, geo_type, geo_value, time_type, time_value)`. Note that it presumes most inserted rows will be a latest issue, which is why the `is_latest_issue` column defaults to 1 in `insert_into_loader_sql`. Note also that "overwrites" happen when `load.issue > latest.issue`, but ALSO when they are equal, which lets us update/patch previously written values that may have been incorrect (the SQL that moves the load table data to the latest and history tables then has an ON DUPLICATE clause that preserves the unique key `epimetric_id` for the same unique composite key tuple).
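As a rough sketch of the promotion step described above (table and column names are assumed; this is not the actual SQL in the repo), the ON DUPLICATE behavior looks something like:

```sql
-- Hedged sketch of moving load rows into the latest table; assumed names.
-- Rows already demoted by fix_is_latest_issue_sql (is_latest_issue = 0) are
-- skipped; for an existing composite key tuple, the ON DUPLICATE KEY UPDATE
-- clause refreshes the payload but leaves epimetric_id untouched, preserving
-- the original unique id.
INSERT INTO epimetric_latest
  (epimetric_id, source, `signal`, geo_type, geo_value,
   time_type, time_value, issue, value)
SELECT epimetric_id, source, `signal`, geo_type, geo_value,
       time_type, time_value, issue, value
FROM epimetric_load
WHERE is_latest_issue = 1
ON DUPLICATE KEY UPDATE
  issue = VALUES(issue),  -- load.issue >= latest.issue by this point
  value = VALUES(value);  -- epimetric_id intentionally not updated
```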
In practice, the problematic behavior does not occur because the only place `insert_or_update()` is called (in `csv_to_database.py:upload_archive()`), it is handed rows returned by `CsvImporter.load_csv()`, which only uses a single `issue` value at a time. However, we might write something else in the future that does not account for this expected invariant, so we should take steps to resolve it now.
Non-compliant calls to `insert_or_update()` can still happen in test code via calls to `_insert_rows()`. This GitHub search provides a list of them.
Possible ways to address this (both options are sketched after the list):

- add a constraint to the load table to enforce that no multiple-issue collisions happen (for example, by removing `issue` from its UNIQUE INDEX)
- fix the `fix_is_latest_issue_sql` calculation to work with multiple `issue`s, setting `is_latest_issue=0` for non-maximal `issue` rows

Doing unit/integration testing for these conditions should be ~easy, and some work on this appears to have already been done in #925.
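Hedged sketches of the two options above, again with assumed table, column, and index names rather than the real schema. For the constraint approach, dropping `issue` from the load table's UNIQUE INDEX makes a second issue for the same key tuple fail fast with a duplicate-key error:

```sql
-- Option 1 sketch: assumed index/table names.
ALTER TABLE epimetric_load
  DROP INDEX load_unique,  -- hypothetical current index including `issue`
  ADD UNIQUE INDEX load_unique
    (source, `signal`, geo_type, geo_value, time_type, time_value);
```

For the recalculation approach, a window function (MySQL 8+) can demote every non-maximal issue within the load table before the comparison against the latest table, assuming the load table has a unique row id to join on:

```sql
-- Option 2 sketch: rank issues within each key tuple and demote all but the max.
UPDATE epimetric_load AS l
JOIN (
  SELECT epimetric_id,
         RANK() OVER (
           PARTITION BY source, `signal`, geo_type, geo_value,
                        time_type, time_value
           ORDER BY issue DESC
         ) AS issue_rank
  FROM epimetric_load
) AS ranked
  ON l.epimetric_id = ranked.epimetric_id
SET l.is_latest_issue = 0
WHERE ranked.issue_rank > 1;
```

Note the trade-off: option 1 rejects multi-issue batches outright, while option 2 accepts them and resolves `is_latest_issue` within the batch.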