Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add min_issue to covidcast meta data #236

Closed
wants to merge 7 commits into from
Closed

feat: add min_issue to covidcast meta data #236

wants to merge 7 commits into from

Conversation

sgratzl
Copy link
Member

@sgratzl sgratzl commented Oct 7, 2020

closes #232

  • removes some unneeded whitespace

@krivard
Copy link
Contributor

krivard commented Oct 8, 2020

Fails integration tests:

$ docker run --rm --network delphi-net delphi_python python3 -m undefx.py3tester.py3tester repos/delphi/delphi-epidata/integrations
[...]
======================================================================
FAIL: test_round_trip (repos.delphi.delphi-epidata.integrations.server.test_covidcast_meta.CovidcastMetaTests)
Make a simple round-trip with some sample data.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/app/repos/delphi/delphi-epidata/integrations/server/test_covidcast_meta.py", line 87, in test_round_trip
    self.assertEqual(response, {
AssertionError: {'res[250 chars]ue': 1, 'max_issue': 2, 'min_lag': 0, 'max_lag[4262 chars]ess'} != {'res[250 chars]ue': 2, 'max_issue': 2, 'min_lag': 0, 'max_lag[4262 chars]ess'}
Diff is 22152 characters long. Set self.maxDiff to None to see it.

======================================================================
FAIL: test_suppress_work_in_progress (repos.delphi.delphi-epidata.integrations.server.test_covidcast_meta.CovidcastMetaTests)
Don't surface signals that are a work-in-progress.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/app/repos/delphi/delphi-epidata/integrations/server/test_covidcast_meta.py", line 257, in test_suppress_work_in_progress
    self.assertEqual(response, {
AssertionError: {'res[250 chars]ue': 1, 'max_issue': 2, 'min_lag': 0, 'max_lag[4262 chars]ess'} != {'res[250 chars]ue': 2, 'max_issue': 2, 'min_lag': 0, 'max_lag[4262 chars]ess'}
Diff is 22152 characters long. Set self.maxDiff to None to see it.

Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this is not going to work as intended, and instead will show the issue date of the least-recently-updated of the issues in a group, which i would argue is meaningless (plz correct me if im wrong). i think it will require a MIN(issue) inside the sub-SELECT and appropriate handling outside of that.

also, some non-trivial test cases would be nice, with different issue values.

@@ -576,6 +576,7 @@ def get_covidcast_meta(self):
ROUND(AVG(`value`),7) AS `mean_value`,
ROUND(STD(`value`),7) AS `stdev_value`,
MAX(`value_updated_timestamp`) AS `last_update`,
MIN(`issue`) as `min_issue`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this is not going to work as intended, and instead will show the issue date of the least-recently-updated of the issues in a group, which i would argue is meaningless (plz correct me if im wrong). i think it will require a MIN(issue) inside the sub-SELECT and appropriate handling outside of that.

also, some non-trivial test cases would be nice, with different issue values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, I will think about it how to fix it. Moreover, we can think of using the is_latest_issue to find the max issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@melange396 can you take a look at the new statements. I split it into two, since I couldn't think of a fast way to handle both cases at the same time. Moreover, I switched the nested statement to the is_latest_issue flag. It is not that easy to test for me, since I don't have access to the real database but just local test data

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suppose this will work, but the old SQL query would have done the trick if you added min(issue) min_issue`` to the nested select... i think that would be preferable to adding a bunch of extra lines unless you can think of another benefit that justifies the extra complexity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but I think this version should be more efficient using is_latest_issue instead of a join

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after wrestling with and waiting for long-duration queries for the past week, i think it might be wise to test the performance of the one-query vs two-query approaches on the staging server. i can certainly help do this if necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for being so late to this conversation.

The use case for min_issue is briefly described in the source issue for this PR, but the jist is this: A researcher is constructing as_of queries and wants to know the earliest date they can pass in and still expect a non-nil result. This means that min_issue must not be restricted to rows with is_latest_issue=1, since that would hide issues which are not the most recent but nevertheless accessible using as_of.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will go ahead and try this and other variants out on the staging copy of the database to check performance and runtimes

@sgratzl sgratzl marked this pull request as draft October 12, 2020 12:28
@@ -576,6 +576,7 @@ def get_covidcast_meta(self):
ROUND(AVG(`value`),7) AS `mean_value`,
ROUND(STD(`value`),7) AS `stdev_value`,
MAX(`value_updated_timestamp`) AS `last_update`,
MIN(`issue`) as `min_issue`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suppose this will work, but the old SQL query would have done the trick if you added min(issue) min_issue`` to the nested select... i think that would be preferable to adding a bunch of extra lines unless you can think of another benefit that justifies the extra complexity

src/acquisition/covidcast/database.py Outdated Show resolved Hide resolved
@sgratzl sgratzl marked this pull request as ready for review October 21, 2020 17:13
@krivard
Copy link
Contributor

krivard commented Nov 3, 2020

Putting this back on your radar @melange396 now that the JHU fires are out

@sgratzl
Copy link
Member Author

sgratzl commented May 5, 2021

closing this PR since meta generation has changed since then

@sgratzl sgratzl closed this May 5, 2021
@sgratzl sgratzl deleted the sgratzl/min_issue branch May 5, 2021 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add min_issue to covidcast metadata endpoint
3 participants