[WIP] Port R docs vignettes to epidatpy #40
Because the output data is a standard Pandas DataFrame, we can easily plot
it using any of the available Python libraries:

>>> data.plot(x="time_value", y="value", title="Smoothed CLI from Facebook Survey", xlabel="Date", ylabel="CLI")
What other visualization would we like?
This looks pretty good, but let's change this to a plot of multiple lines, one for each of pa, ca, and fl. Hopefully the scale is comparable (otherwise we might need to choose a normalized signal).
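A minimal sketch of the multi-line plot requested here, assuming the query returns long-format rows with `geo_value`, `time_value`, and `value` columns. The data below is synthetic, and `to_wide` and `plot_states` are hypothetical helper names, not part of epidatpy:

```python
import pandas as pd

# Synthetic stand-in for the API result (assumption: long format, one row
# per state and date). Real values would come from the query in the docs.
data = pd.DataFrame(
    {
        "geo_value": ["pa", "ca", "fl"] * 3,
        "time_value": pd.to_datetime(
            ["2021-01-01"] * 3 + ["2021-01-02"] * 3 + ["2021-01-03"] * 3
        ),
        "value": [1.0, 1.4, 1.2, 1.1, 1.5, 1.3, 1.2, 1.6, 1.1],
    }
)


def to_wide(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape long-format rows into one column per state, indexed by date."""
    return df.pivot(index="time_value", columns="geo_value", values="value")


def plot_states(df: pd.DataFrame):
    """Draw one line per state; calling this requires matplotlib."""
    return to_wide(df).plot(
        title="Smoothed CLI from Facebook Survey", xlabel="Date", ylabel="CLI"
    )
```

Pivoting first lets `DataFrame.plot` draw one line per column, so the three states share an axis and any scale mismatch is visible right away.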
Also, in versioned_data.rst, it would be nice to have a plot that demonstrates data revisions, where a time series as_of an older version is plotted against a more up-to-date snapshot. There are some plots that demonstrate this in the epipredict vignettes (though they do it in the context of forecasting). We don't need to do anything quite so fancy; you might be able to use the signals used there to generate a simpler plot. Just one signal at three different as_of snapshots (maybe a week apart?), with a date and location chosen so that the revision is visible, would be good enough.
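One way the revision plot could be assembled, sketched with synthetic snapshots: each snapshot is a DataFrame with `time_value` and `value` columns, keyed by its as_of date. The helper name `stack_snapshots` and all values are assumptions for illustration; the real snapshots would come from separate as_of queries.

```python
import pandas as pd


def stack_snapshots(snapshots: dict) -> pd.DataFrame:
    """Combine {as_of: DataFrame} into a wide table, one column per snapshot."""
    frames = [df.assign(as_of=as_of) for as_of, df in snapshots.items()]
    long = pd.concat(frames, ignore_index=True)
    return long.pivot(index="time_value", columns="as_of", values="value")


# Synthetic snapshots a week apart (made-up numbers): later versions cover
# more dates and revise earlier values.
snapshots = {
    "2021-06-08": pd.DataFrame(
        {"time_value": pd.to_datetime(["2021-06-01"]), "value": [10.0]}
    ),
    "2021-06-15": pd.DataFrame(
        {
            "time_value": pd.to_datetime(["2021-06-01", "2021-06-08"]),
            "value": [12.0, 9.0],
        }
    ),
    "2021-06-22": pd.DataFrame(
        {
            "time_value": pd.to_datetime(
                ["2021-06-01", "2021-06-08", "2021-06-15"]
            ),
            "value": [12.5, 11.0, 8.0],
        }
    ),
}

wide = stack_snapshots(snapshots)
# wide.plot(xlabel="Date", ylabel="value") would draw one line per snapshot.
```

Dates that a snapshot had not yet reported show up as NaN, so each line simply ends where that version's data ends.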
Added both of these :)
epidatpy/_covidcast.py
Outdated
@@ -72,7 +72,7 @@ def define_covidcast_fields() -> List[EpidataFieldInfo]:
     EpidataFieldInfo("lag", EpidataFieldType.int),
     EpidataFieldInfo("value", EpidataFieldType.float),
     EpidataFieldInfo("stderr", EpidataFieldType.float),
-    EpidataFieldInfo("sample_size", EpidataFieldType.int),
+    EpidataFieldInfo("sample_size", EpidataFieldType.text),
Without this change the second query of docs_smoke_test fails, because in the dataset there's a sample size of O (the letter O, not zero).
😱
Could you isolate that row, get a reproducible API request, and make an issue about it in delphi-epidata? We definitely need to fix it in the database (and wherever in acquisition that O got introduced).
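A rough sketch of how such a row could be isolated once the response is loaded with `sample_size` kept as text. The helper `non_numeric_rows` and the sample rows are hypothetical, not taken from the dataset:

```python
import pandas as pd


def non_numeric_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return rows whose `column` is present but does not parse as a number."""
    parsed = pd.to_numeric(df[column], errors="coerce")
    # Keep only values that were not missing to begin with but failed to parse.
    return df[parsed.isna() & df[column].notna()]


# Made-up rows for illustration only.
rows = pd.DataFrame(
    {
        "geo_value": ["pa", "ca", "fl"],
        "sample_size": ["120.5", "O", None],
    }
)

bad = non_numeric_rows(rows, "sample_size")
```

The `geo_value` and `time_value` of any row that comes back would be enough to build a reproducible API request.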
Ah, turns out I was dealing with a slightly misleading error message :) This column is supposed to have a float type, as documented here - that was the actual issue.
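To illustrate why an int field type can fail on a float-valued column, here is a small sketch using Python's built-in parsers. How epidatpy actually parses fields is not shown in this thread, so `parse_as` is a hypothetical helper and the value is made up:

```python
def parse_as(type_, raw: str):
    """Try to parse `raw` with `type_`; return None if parsing fails."""
    try:
        return type_(raw)
    except ValueError:
        return None


fractional = "120.5"  # illustrative value, not taken from the dataset

as_int = parse_as(int, fractional)      # int() rejects a decimal string
as_float = parse_as(float, fractional)  # float() accepts it
```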
Thanks Rostyslav, this looks great. I made a few requests to fix a few examples, but overall this is excellent. We might just need to let sample_size be type text for now (wow). A better getting_started plot would be good (request in a comment) and a simple plot for revisions would be good (request in a comment). Everything else is looking great though.
A few more things:
- please remove the old unfinished docs: signals_covid.rst and covidcast_examples.rst
- move the epidatpy reference section below the intro guides in the left bar
- the signal-discovery page in epidatr has a table of available endpoints that produces a table like what is now in getting_started.rst ... I would like to combine those; could you write a simple available_endpoints() function that pretty prints a table similar to what is there in R and use that function to produce a table in signal_discovery.rst?
- let's move the information about epiweeks and dates from getting_started.rst to the bottom of getting_started_with_epidatpy.rst
- after that we can remove getting_started.rst
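One possible shape for the requested `available_endpoints()` function, sketched under the assumption that each endpoint is a public method with a docstring. `DemoEndpoints` is a hypothetical stand-in for the real client class, and its method names and descriptions are placeholders:

```python
import inspect

import pandas as pd


class DemoEndpoints:
    """Hypothetical stand-in for the real client class."""

    def pub_covidcast(self):
        """COVID-19 surveillance signals."""

    def pub_fluview(self):
        """Influenza-like illness from outpatient doctor visits."""

    def _helper(self):
        """Not an endpoint, so it is skipped."""


def available_endpoints(cls=DemoEndpoints) -> pd.DataFrame:
    """Tabulate public methods and the first line of each docstring."""
    rows = []
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # private helpers are not endpoints
        doc = inspect.getdoc(func) or ""
        rows.append({"Endpoint": name, "Description": doc.split("\n")[0]})
    return pd.DataFrame(rows, columns=["Endpoint", "Description"])


table = available_endpoints()
print(table.to_string(index=False))
```

Reading the descriptions from docstrings keeps the table in sync with the code, at the cost of requiring every endpoint docstring to start with a one-line summary.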
docs/signal_discovery.rst
Outdated
This DataFrame contains the following columns:

- ``source`` - Data source name.
- ``signal`` - Signal name.
- ``description`` - Description of the signal.
- ``reference_signal`` - Geographic level for which this signal is available, such as county, state, msa, hss, hrr, or nation. Most signals are available at multiple geographic levels and will hence be listed in multiple rows with their own metadata.
- ``license`` - The license
- ``dua`` - Link to the Data Use Agreement.
fix: could you update the columns listed? They don't match the dataframe above.
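A mismatch like this could be caught with a small check along the following lines. This is only a sketch: `undocumented_columns` is a hypothetical helper, and both column lists are illustrative rather than the actual DataFrame's columns.

```python
import pandas as pd


def undocumented_columns(df: pd.DataFrame, documented: list) -> dict:
    """Report columns missing from the docs and documented names absent from df."""
    actual = list(df.columns)
    return {
        "missing_from_docs": [c for c in actual if c not in documented],
        "not_in_dataframe": [c for c in documented if c not in actual],
    }


# Illustrative column names only.
frame = pd.DataFrame(columns=["data_source", "signal", "name"])
report = undocumented_columns(frame, ["source", "signal", "description"])
```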
docs/signal_discovery.rst
Outdated

- ``data_source`` - Data source name.
- ``signal`` - Signal name.
- ``name`` - Name of signal.
Suggested change:
-- ``name`` - Name of signal.
+- ``name`` - Human-readable signal name.
@dshemetov finished updating this, please take a look again! One thing that's still a WIP:
Here I used a somewhat similar approach to R, but a bit more complex as Python doesn't have a nice native "table" format. My solution instead is to:
Let me know if this approach is OK, and I can clean up the docstrings themselves. Currently they have a couple of missing values and inconsistent formatting, and don't specify links to the original endpoints either.
From a comment above: maybe a lot of the work here can be simplified by using Jupyter notebooks directly using https://github.com/spatialaudio/nbsphinx/. Pandas offers quite a lot of configuration options for how to print its tables, see here, maybe you could use that and
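As a sketch of the pandas display options mentioned here, `pd.option_context` can scope printing settings to a single table so long descriptions are not truncated. The table contents are placeholders:

```python
import pandas as pd

# Placeholder table with a description longer than the default column width.
table = pd.DataFrame(
    {"Endpoint": ["pub_covidcast"], "Description": ["x" * 80]}
)

# Inside the context, cells are printed in full; outside it, the default
# display.max_colwidth of 50 characters truncates them with "...".
with pd.option_context("display.max_colwidth", None, "display.width", 200):
    rendered = repr(table)

print(rendered)
```

Scoping the options with a context manager avoids changing global display settings for the rest of the docs build.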
* switch default venv from env to .venv for uv * update gitignore
5294fe3 to b092dad
This looks great, thanks for setting this all up Rostyslav! I took another pass through all the documentation, made sure all the language flows, and that we covered all the major features (like caching and signal discovery).
Original vignettes can be found here: https://github.com/cmu-delphi/epidatr/tree/dev/vignettes