Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Port R docs vignettes to epidatpy #40

Merged
merged 8 commits into from
Oct 2, 2024
Merged

Conversation

rzats
Copy link
Contributor

@rzats rzats commented Aug 16, 2024

Original vignettes can be found here: https://github.com/cmu-delphi/epidatr/tree/dev/vignettes

Makefile Show resolved Hide resolved
Because the output data is a standard Pandas DataFrame, we can easily plot
it using any of the available Python libraries:

>>> data.plot(x="time_value", y="value", title="Smoothed CLI from Facebook Survey", xlabel="Date", ylabel="CLI")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other visualization would we like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, but let's change this to a plot of multiple lines, one for each of pa, ca, and fl. Hopefully the scale is comparable (otherwise we might need to choose a normalized signal).

Copy link
Contributor

@dshemetov dshemetov Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in the versioned_data.rst, it would be nice to have a plot that demonstrates data revisions where a time series as_of an older versioned is plotted against a more up to date snapshot. There are some plots that demonstrate this in epipredict vignettes (though they do it in the context of forecasting). We don't need to do anything quite so fancy, you just might be able to use the signals used there to generate a simpler plot. Just one signal but at three different as_of_snapshots (maybe a week apart?) and a choice of date and location so that the revision is visible would be good enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added both of these :)

@@ -72,7 +72,7 @@ def define_covidcast_fields() -> List[EpidataFieldInfo]:
EpidataFieldInfo("lag", EpidataFieldType.int),
EpidataFieldInfo("value", EpidataFieldType.float),
EpidataFieldInfo("stderr", EpidataFieldType.float),
EpidataFieldInfo("sample_size", EpidataFieldType.int),
EpidataFieldInfo("sample_size", EpidataFieldType.text),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change the second query of docs_smoke_test fails, because in the dataset there's a sample size of O (the letter O, not zero).

Copy link
Contributor

@dshemetov dshemetov Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱

Could you isolate that row, get a reproducible API request, and make an issue about it in delphi-epidata? We definitely need to fix it in the database (and where-ever in acquisition that that O got introduced).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, turns out I was dealing with a slightly misleading error message :) This column is supposed to have a float type, as documented here - that was the actual issue.

docs_smoke_test.py Outdated Show resolved Hide resolved
@rzats rzats requested a review from dshemetov August 16, 2024 15:45
Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rostyslav, this looks great. I made a few requests to fix a few examples, but overall this is excellent. We might just need to let sample_size be type text for now (wow). A better getting_started plot would be good (request in a comment) and a simple plot for revisions would be good (request in a comment). Everything else is looking great though.

A few more things:

  • please remove the old unfinished docs: signals_covid.rst and covidcast_examples.rst
  • move the epidatpy reference section below the intro guides in the left bar
  • the signal-discovery page in epidatr has a table of available endpoints that produces a table like what is now in getting_started.rst... I would like to combine those; could you write a simple available_endpoints() function that pretty prints a table similar to what is there in R and use that function to produce a table in signal_discovery.rst?
  • let's move the information about epiweeks and dates from getting_started.rst to the bottom of getting_started_with_epidatpy.rst
  • after that we can remove getting_started.rst

docs/getting_started_with_epidatpy.rst Outdated Show resolved Hide resolved
docs/getting_started_with_epidatpy.rst Outdated Show resolved Hide resolved
Because the output data is a standard Pandas DataFrame, we can easily plot
it using any of the available Python libraries:

>>> data.plot(x="time_value", y="value", title="Smoothed CLI from Facebook Survey", xlabel="Date", ylabel="CLI")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, but let's change this to a plot of multiple lines, one for each of pa, ca, and fl. Hopefully the scale is comparable (otherwise we might need to choose a normalized signal).

Comment on lines 42 to 49
This DataFrame contains the following columns:

- ``source`` - Data source name.
- ``signal`` - Signal name.
- ``description`` - Description of the signal.
- ``reference_signal`` - Geographic level for which this signal is available, such as county, state, msa, hss, hrr, or nation. Most signals are available at multiple geographic levels and will hence be listed in multiple rows with their own metadata.
- ``license`` - The license
- ``dua`` - Link to the Data Use Agreement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: could you update the columns listed? They don't match the dataframe above.

docs/signal_discovery.rst Outdated Show resolved Hide resolved

- ``data_source`` - Data source name.
- ``signal`` - Signal name.
- ``name`` - Name of signal.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- ``name`` - Name of signal.
- ``name`` - Human-readable signal name.

Because the output data is a standard Pandas DataFrame, we can easily plot
it using any of the available Python libraries:

>>> data.plot(x="time_value", y="value", title="Smoothed CLI from Facebook Survey", xlabel="Date", ylabel="CLI")
Copy link
Contributor

@dshemetov dshemetov Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in the versioned_data.rst, it would be nice to have a plot that demonstrates data revisions where a time series as_of an older versioned is plotted against a more up to date snapshot. There are some plots that demonstrate this in epipredict vignettes (though they do it in the context of forecasting). We don't need to do anything quite so fancy, you just might be able to use the signals used there to generate a simpler plot. Just one signal but at three different as_of_snapshots (maybe a week apart?) and a choice of date and location so that the revision is visible would be good enough.

docs_smoke_test.py Outdated Show resolved Hide resolved
@@ -72,7 +72,7 @@ def define_covidcast_fields() -> List[EpidataFieldInfo]:
EpidataFieldInfo("lag", EpidataFieldType.int),
EpidataFieldInfo("value", EpidataFieldType.float),
EpidataFieldInfo("stderr", EpidataFieldType.float),
EpidataFieldInfo("sample_size", EpidataFieldType.int),
EpidataFieldInfo("sample_size", EpidataFieldType.text),
Copy link
Contributor

@dshemetov dshemetov Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱

Could you isolate that row, get a reproducible API request, and make an issue about it in delphi-epidata? We definitely need to fix it in the database (and where-ever in acquisition that that O got introduced).

@rzats
Copy link
Contributor Author

rzats commented Sep 10, 2024

@dshemetov finished updating this, please take a look again!

One thing that's still a WIP:

the signal-discovery page in epidatr has a table of available endpoints that produces a table like what is now in getting_started.rst... I would like to combine those; could you write a simple available_endpoints() function that pretty prints a table similar to what is there in R and use that function to produce a table in signal_discovery.rst?

Here I used a somewhat similar approach to R, but a bit more complex as Python doesn't have a nice native "table" format.

My solution instead is to:

  • grab all methods that start with pub_ or pvt_ from AEpiDataEndpoints;
  • get their docstrings;
  • combine the methods and the docstrings into a CSV file. I updated make doc so this file is regenerated before the docs;
  • display this CSV file as a pretty-printed table in the signal discovery doc.

Let me know if this approach is OK, and I can clean up the docstrings themselves. Currently they have a couple of missing values and inconsistent formatting, and don't specify links to the original endpoints either.

@dshemetov
Copy link
Contributor

From a comment above: maybe a lot of the work here can be simplified by using Jupyter notebooks directly using https://github.com/spatialaudio/nbsphinx/. Pandas offers quite a lot of configuration options for how to print its tables, see here, maybe you could use that and available_endpoints()?

rzats and others added 3 commits September 16, 2024 18:40
* switch default venv from env to .venv for uv
* update gitignore
Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks for setting this all up Rostyslav! I took another pass through all the documentation, made sure all the language flows, and that we we covered all the major features (like caching and signal discovery).

@dshemetov dshemetov merged commit 8d43b5e into dev Oct 2, 2024
1 check passed
@dshemetov dshemetov deleted the rzatserkovnyi/docs branch October 2, 2024 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants