Updated wastewater docs #56

Open
wants to merge 61 commits into
base: main

Conversation

mindoftea
Collaborator

Merging in Sarah's work

Collaborator Author

@mindoftea left a comment

This is looking really good. Things for us to do, besides these comments:

  • remove the build files from the repo and get our GitHub Action working
  • merge in my coverage-handling code (I'll update figE to match)
  • move the fig notebooks into the docs
  • publish it as version 2.0.0

@@ -89,11 +97,13 @@ def get_descendants(node):
"""Get the set of all descendants of some node."""
return set(node['children']) | set.union(*[get_descendants(c) for c in node['children']]) if len(node['children']) > 0 else set([])

- def gather_groups(clusters, prevalences, count_scores = tuple([0.1, 4, 4, 4, 0.1] + [0] * 256)):
+ def (clusters, prevalences, count_scores = tuple([0.1, 4, 4, 4, 0.1] + [0] * 256)):

Collaborator Author

I think the function name got accidentally deleted here


In contrast, wastewater samples have been highly useful for tracking regional infection dynamics while providing less biased abundance estimates than clinical testing. Data collected by tracking viral genomic sequences in wastewater has also improved community prevalence estimates and detected emerging variants earlier.

The Andersen Lab has developed improved virus concentration protocols and deconvolution software that fully resolve multiple virus strains from wastewater. The resulting data is now deployed by Python-outbreak-info. In short, SARS-CoV-2 analysis can be done using both clinical and wastewater tools, yet data from the wastewater analysis tools may be more accurate in some situations.

Collaborator Author

This looks great. The only thing that I think needs a little more explanation is that with clinical genomics, each sample is one sequence, so we can see whether two mutations frequently occur together, etc. With wastewater, each sample is a mix of sequences, so we don't know exactly which mutations go with which variants. Basically, we're just reminding people that they might need some clinical data to answer co-occurrence questions.
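
As a rough illustration of this point, here is a small sketch on toy data. The mutation labels, sample sets, and helper names (`cooccurrence_counts`, `mutation_frequencies`) are all made up for the example and are not part of the package. It shows that per-sequence data lets us count co-occurring pairs directly, while two different mixtures can produce identical pooled per-mutation frequencies.

```python
from collections import Counter
from itertools import combinations

# Hypothetical clinical data: each sample is one genome, i.e. one set of mutations.
clinical_samples = [
    {"S:N501Y", "S:E484K"},
    {"S:N501Y", "S:E484K"},
    {"S:N501Y"},
    {"S:L452R"},
]

def cooccurrence_counts(samples):
    """Count how often each pair of mutations appears in the same sequence."""
    counts = Counter()
    for muts in samples:
        for pair in combinations(sorted(muts), 2):
            counts[pair] += 1
    return counts

def mutation_frequencies(samples):
    """Pooled view: the fraction of sequences carrying each mutation."""
    totals = Counter(m for muts in samples for m in muts)
    return {m: n / len(samples) for m, n in totals.items()}

pairs = cooccurrence_counts(clinical_samples)

# Two hypothetical mixtures with identical per-mutation frequencies:
# in the first, A and B always travel together; in the second, never.
mix_linked = [{"A", "B"}, {"A", "B"}, set(), set()]
mix_unlinked = [{"A"}, {"A"}, {"B"}, {"B"}]

same_pooled_signal = mutation_frequencies(mix_linked) == mutation_frequencies(mix_unlinked)
```

Here `same_pooled_signal` is true even though the underlying linkage differs, which is the reason some clinical data may still be needed for co-occurrence questions.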

from outbreak_data import authenticate_user
authenticate_user.authenticate_new_user()
from outbreak_data.authenticate_user import authenticate_new_user
authenticate_new_user()

and then you should be able to access all of the functionality of the package. Most of the rest of the tools are available within the ``outbreak_data`` component of the package. For example:

Collaborator Author

It would be good to add a note here that authentication is not required for wastewater data.


This project is under active development.
Table of Contents:

Collaborator Author

Here, in the sidebar, and on the title of each page, it would be good to make clear what submodule each function is in.

@@ -1,4 +1,4 @@
authenticate_new_user()
authenticate_new_user
----------------------------------------------------

Collaborator Author

Some text about this being only needed for clinical, that access to a web browser is needed, and that the token is saved locally between runs would be useful.

"source": [
"ww_prevalences = outbreak_tools.datebin_and_agg(ww_lineages, weights=outbreak_tools.get_ww_weights(ww_lineages), startdate=startdate, enddate=enddate, freq='7D', rolling=[1,4,1], log=False)\n",
"ww_prevalences_daily_unsmoothed = outbreak_tools.datebin_and_agg(ww_lineages, weights=outbreak_tools.get_ww_weights(ww_lineages), startdate=startdate, enddate=enddate, freq='D', rolling=1, log=False)\n",
"ww_prevalences_daily, ww_prevalences_daily_varis = outbreak_tools.datebin_and_agg(ww_lineages, weights=outbreak_tools.get_ww_weights(ww_lineages), startdate=startdate, enddate=enddate, freq='D', rolling=smooth, log=False, variance=True)"

Collaborator Author

We don't actually use the variances for these plots, so we can set variance=False and delete the "_varis" parts of this line and the others to simplify.

tests/figC.ipynb Outdated
"ww_prev_data = ww_prevalences.mul(viral_load_weekly, axis=0).sum()\n",
"clinical_prev_data = clinical_prevalences.mul(viral_load_weekly, axis=0).sum()\n",
"\n",
"ww_clusters = outbreak_clustering.cluster_lineages(ww_prev_data, tree, lineage_key=lineage_key, n=10, alpha=0.25)\n",

Collaborator Author

Similarly to figAB, let's simplify by getting rid of the viral load info from this notebook and just clustering on ww_prevalences.sum(). We can use one set of clusters for both ww and clinical
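
For reference, a toy sketch of the difference between the two aggregations. The lineage names, values, and weights below are invented for illustration; only the `.mul(..., axis=0).sum()` and `.sum()` patterns come from the notebook.

```python
import pandas as pd

# Toy weekly prevalence table: rows are weeks, columns are lineages (made-up values).
weeks = pd.date_range("2024-01-01", periods=3, freq="7D")
ww_prevalences = pd.DataFrame(
    {"BA.2": [0.6, 0.3, 0.1], "XBB.1.5": [0.4, 0.7, 0.9]},
    index=weeks,
)
# Hypothetical weekly viral load weights, aligned to the same index.
viral_load_weekly = pd.Series([1.0, 2.0, 4.0], index=weeks)

# Current notebook pattern: weight each week by viral load, then total per lineage.
weighted_totals = ww_prevalences.mul(viral_load_weekly, axis=0).sum()

# Suggested simplification: total per lineage with every week weighted equally.
simple_totals = ww_prevalences.sum()
```

Either result is a per-lineage Series, so the simplified version should drop into the existing clustering call in place of `ww_prev_data`.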

tests/figC.ipynb Outdated
"metadata": {},
"outputs": [
{
"data": {

Collaborator Author

Let's add a scatter plot of the daily unsmoothed data on top of this
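
Something along these lines might work, assuming the existing figure is a smoothed prevalence curve on a matplotlib axis. The data here is synthetic and the variable names are placeholders rather than the notebook's own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=60, freq="D")

# Synthetic daily prevalence values standing in for the unsmoothed data.
noisy = 0.5 + 0.3 * np.sin(np.arange(60) / 10) + rng.normal(0, 0.08, 60)
daily_unsmoothed = pd.Series(np.clip(noisy, 0, 1), index=dates)

# A simple centered rolling mean standing in for the smoothed curve.
daily_smoothed = daily_unsmoothed.rolling(7, center=True, min_periods=1).mean()

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(daily_smoothed.index, daily_smoothed.values, label="smoothed", zorder=1)
# Draw the raw daily points on top of the smoothed line.
ax.scatter(daily_unsmoothed.index, daily_unsmoothed.values,
           s=8, alpha=0.5, color="k", label="daily (unsmoothed)", zorder=2)
ax.set_ylabel("prevalence")
ax.legend()
```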

tests/figD.ipynb Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll need to fetch and aggregate the viral load sample data to get our prevalence data. "

Collaborator Author

We can drop the viral load code from this notebook too

tests/figE.ipynb Outdated
"id": "5e2ed603-3042-49ca-865e-a823287bdeb8",
"metadata": {},
"source": [
"Now we go ahead and query for wastewater data using our defined specifications. After this, we'll need to organize our retrieved sample data by date and site_id within our specified region to get the viral load smaple data."

Collaborator Author

Small typo here. Also good to explain that we're filtering out viral load information from sites with few samples, and then normalizing each site's viral load signals to have a variance of 1.
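
A minimal sketch of the two steps the explanation should cover, on made-up data. The column names, the `MIN_SAMPLES` threshold, and the use of the sample standard deviation are assumptions for illustration, not necessarily what the notebook does.

```python
import pandas as pd

# Made-up viral load samples; column names are assumptions for illustration.
samples = pd.DataFrame({
    "site_id": ["A", "A", "A", "A", "B", "B", "B", "C"],
    "viral_load": [1.0, 2.0, 3.0, 4.0, 10.0, 30.0, 50.0, 7.0],
})

MIN_SAMPLES = 3  # hypothetical threshold for "few samples"

# Step 1: drop sites with too few samples.
site_counts = samples.groupby("site_id")["viral_load"].transform("count")
kept = samples[site_counts >= MIN_SAMPLES].copy()

# Step 2: rescale each remaining site's signal to unit variance.
site_std = kept.groupby("site_id")["viral_load"].transform("std")
kept["normed_viral_load"] = kept["viral_load"] / site_std

# Check: every remaining site now has variance 1 (up to floating point error).
site_variances = kept.groupby("site_id")["normed_viral_load"].var()
```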

@mindoftea mindoftea changed the base branch from wastewater_sprint_2 to main July 31, 2024 20:47