Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add min_issue to covidcast meta data #236

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/api/covidcast_meta.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ None required.
| `epidata[].max_value` | maximum value | float |
| `epidata[].mean_value` | mean of value | float |
| `epidata[].stdev_value` | standard deviation of value | float |
| `epidata[].min_issue` | earliest date data was issued (e.g., 20200710) | integer |
| `epidata[].max_issue` | most recent date data was issued (e.g., 20200710) | integer |
| `epidata[].min_lag` | smallest lag from observation to issue, in `time_type` units | integer |
| `epidata[].max_lag` | largest lag from observation to issue, in `time_type` units | integer |
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ def test_caching(self):
'max_value': 1,
'mean_value': 1,
'stdev_value': 0,
'min_issue': 20200423,
'max_issue': 20200423,
'min_lag': 0,
'max_lag': 1,
Expand Down
5 changes: 3 additions & 2 deletions integrations/client/test_delphi_epidata.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ def tearDown(self):
def test_covidcast(self):
"""Test that the covidcast endpoint returns expected data."""
self.maxDiff=None

# insert dummy data
self.cur.execute('''
insert into covidcast values
Expand Down Expand Up @@ -123,7 +123,7 @@ def test_covidcast(self):
}],
'message': 'success',
})

# fetch data, without specifying issue or lag
response_1 = Epidata.covidcast(
'src', 'sig', 'day', 'county', 20200414, '01234')
Expand Down Expand Up @@ -258,6 +258,7 @@ def test_covidcast_meta(self):
'mean_value': 6.5,
'stdev_value': 0.5,
'last_update': 345,
'min_issue': 20200416,
'max_issue': 20200416,
'min_lag': 1,
'max_lag': 2,
Expand Down
3 changes: 3 additions & 0 deletions integrations/server/test_covidcast_meta.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,7 @@ def test_round_trip(self):
'mean_value': 15,
'stdev_value': 5,
'last_update': 123,
'min_issue': 2,
'max_issue': 2,
'min_lag': 0,
'max_lag': 0,
Expand Down Expand Up @@ -116,6 +117,7 @@ def test_filter(self):
'mean_value': 15,
'stdev_value': 5,
'last_update': 123,
'min_issue': 2,
'max_issue': 2,
'min_lag': 0,
'max_lag': 0,
Expand Down Expand Up @@ -235,6 +237,7 @@ def test_suppress_work_in_progress(self):
'mean_value': 15,
'stdev_value': 5,
'last_update': 123,
'min_issue': 2,
'max_issue': 2,
'min_lag': 0,
'max_lag': 0,
Expand Down
11 changes: 6 additions & 5 deletions src/acquisition/covidcast/database.py
Original file line number Diff line number Diff line change
Expand Up @@ -162,12 +162,12 @@ def insert_or_update_batch(self, cc_rows, batch_size=2**20, commit_partial=False
SET `is_latest_issue`=0
'''
set_is_latest_issue_sql = f'''
UPDATE
UPDATE
(
SELECT `source`, `signal`, `time_type`, `geo_type`, `time_value`, `geo_value`, MAX(`issue`) AS `issue`
FROM
(
SELECT DISTINCT `source`, `signal`, `time_type`, `geo_type`, `time_value`, `geo_value`
SELECT DISTINCT `source`, `signal`, `time_type`, `geo_type`, `time_value`, `geo_value`
FROM `{tmp_table_name}`
) AS TMP
LEFT JOIN `covidcast`
Expand All @@ -176,7 +176,7 @@ def insert_or_update_batch(self, cc_rows, batch_size=2**20, commit_partial=False
) AS TMP
LEFT JOIN `covidcast`
USING (`source`, `signal`, `time_type`, `geo_type`, `time_value`, `geo_value`, `issue`)
SET `is_latest_issue`=1
SET `is_latest_issue`=1
'''

# TODO: ^ do we want to reset `direction_updated_timestamp` and `direction` in the duplicate key case?
Expand Down Expand Up @@ -333,7 +333,7 @@ def get_all_record_values_of_timeseries_with_potentially_stale_direction(self, t
# A query that selects all rows from `covidcast` that have latest issue-date
# for any (time-series-key, time_value) with time_type='day'.
latest_issues_sql = f'''
SELECT
SELECT
`id`,
`source`,
`signal`,
Expand Down Expand Up @@ -576,6 +576,7 @@ def get_covidcast_meta(self):
ROUND(AVG(`value`),7) AS `mean_value`,
ROUND(STD(`value`),7) AS `stdev_value`,
MAX(`value_updated_timestamp`) AS `last_update`,
MIN(`issue`) as `min_issue`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this is not going to work as intended, and instead will show the issue date of the least-recently-updated of the issues in a group, which i would argue is meaningless (plz correct me if im wrong). i think it will require a MIN(issue) inside the sub-SELECT and appropriate handling outside of that.

also, some non-trivial test cases would be nice, with different issue values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, I will think about it how to fix it. Moreover, we can think of using the is_latest_issue to find the max issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@melange396 can you take a look at the new statements. I split it into two, since I couldn't think of a fast way to handle both cases at the same time. Moreover, I switched the nested statement to the is_latest_issue flag. It is not that easy to test for me, since I don't have access to the real database but just local test data

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suppose this will work, but the old SQL query would have done the trick if you added min(issue) min_issue`` to the nested select... i think that would be preferable to adding a bunch of extra lines unless you can think of another benefit that justifies the extra complexity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but I think this version should be more efficient using is_latest_issue instead of a join

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after wrestling with and waiting for long-duration queries for the past week, i think it might be wise to test the performance of the one-query vs two-query approaches on the staging server. i can certainly help do this if necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for being so late to this conversation.

The use case for min_issue is briefly described in the source issue for this PR, but the jist is this: A researcher is constructing as_of queries and wants to know the earliest date they can pass in and still expect a non-nil result. This means that min_issue must not be restricted to rows with is_latest_issue=1, since that would hide issues which are not the most recent but nevertheless accessible using as_of.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will go ahead and try this and other variants out on the staging copy of the database to check performance and runtimes

MAX(`issue`) as `max_issue`,
MIN(`lag`) as `min_lag`,
MAX(`lag`) as `max_lag`
Expand Down Expand Up @@ -621,7 +622,7 @@ def get_covidcast_meta(self):
meta.extend(list(dict(zip(self._cursor.column_names,x)) for x in self._cursor))

return meta

def update_covidcast_meta_cache(self, metadata):
"""Updates the `covidcast_meta_cache` table."""

Expand Down