add serum_titers rule (#7)
* add `sera_by_plate`

* set up `serum_titers` skeleton rule

* make `viral_strain_plot_order` overall variable not plate-specific

* get per-rep and median titers

* add titer chart

* save pickled `CurveFits`

* plot all serum curves

* update README and `.gitignore`

* check QC for serum titers

* update `neutcurve` to 0.10.0

* implement `virus_replicates_to_drop` in `serum_titers`

* fully implement QC thresholds for `serum_titers`

* save pickle file for serum curves

* document `serum_titers` and QC filters

* code formatting
jbloom authored Dec 14, 2023
1 parent 55b9e66 commit 7098442
Showing 23 changed files with 2,239 additions and 64 deletions.
70 changes: 61 additions & 9 deletions README.md
@@ -63,6 +63,8 @@ rule all:

In addition, you need to create the configuration file `config.yml` and ensure it includes the appropriate configuration for `seqneut-pipeline` as described below.

To track the correct files in the created results, we suggest you copy the [./test_example/.gitignore](test_example/.gitignore) file to be the `.gitignore` for your main repo.

Finally, you need to create a `conda` environment that minimally includes the packages needed to run the pipeline, which are a recent version of [snakemake](https://snakemake.readthedocs.io/) and [pandas](https://pandas.pydata.org/).
You can either create your own environment containing these, or build and use the one specified in the [environment.yml](environment.yml) file of `seqneut-pipeline`, which is named `seqneut-pipeline`. If you are using that environment, you can run the pipeline with:
```
@@ -108,16 +110,15 @@ GCATGGATCCTTTACT,A/Togo/845/2020
```

### viral_strain_plot_order
An optional dictionary (mapping) of viral library names (as specified in `viral_libraries`) to a CSV with a column titled "strain" that lists the strains in the order they should be plotted.
If not specified (or set to "null"), plotting is just alphabetical.
So in general, this key will look like:
A CSV with a column named "strain" that lists the strains in the order they should be plotted.
If not specified or set to "null", then plotting is just alphabetical.
Must include all strains being used if specified.
So it should look like this:
```
viral_strain_plot_order:
  pdmH1N1_lib2023_loes: data/viral_libraries/pdmH1N1_lib2023_loes_strain_order.csv
  <potentially more viral libraries specified as name: CSV pairs>
viral_strain_plot_order: data/viral_strain_plot_order.csv
```

The CSV files themselves will just have a column named "strain" specifying the order, such as:
The CSV file itself will just have a column named "strain" specifying the order, such as:
```
strain
A/California/07/2009
@@ -328,6 +329,44 @@ Typically you might either set to 1, or "false" if you want to let the top be a
Fix the bottom plateau of the neutralization curve to this value.
Typically you might either set to 0, or "false" if you want to let the bottom be a free parameter.
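
For intuition, the curves fit by `neutcurve` are Hill-style sigmoids with a top and a bottom plateau; the sketch below (a simplification, not necessarily the exact parameterization `neutcurve` uses) shows what `fixtop` and `fixbottom` pin:
```
def sketch_neut_curve(concentration, midpoint, slope, top=1.0, bottom=0.0):
    """Hill-style neutralization curve (illustrative sketch only).

    `fixtop` pins `top` and `fixbottom` pins `bottom`; setting either to
    "false" in the config instead leaves it as a free parameter of the fit.
    """
    return bottom + (top - bottom) / (1 + (concentration / midpoint) ** slope)
```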

### serum_titers_qc_thresholds
This key defines quality-control thresholds to apply in `serum_titers` when aggregating the replicate titers for each virus against each serum.
These thresholds are designed to ensure that the serum titers reported by the pipeline are robust (sufficient replicates and sufficiently similar to the median).
It defines two variables, `min_replicates` and `max_fold_change_from_median`, as follows:
```
serum_titers_qc_thresholds:
  min_replicates: 2
  max_fold_change_from_median: 3
```

The `min_replicates` key defines the minimum number of replicates that must be measured for each virus against each serum, and `max_fold_change_from_median` specifies the maximum fold-change from the serum-virus median titer that any replicate can have.
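
To make these thresholds concrete, here is a minimal sketch (not the pipeline's actual code) of how they could be checked for one serum, assuming a per-replicate data frame with columns `virus` and `nt50` like the `titers_per_replicate.csv` files described below:
```
import pandas as pd


def qc_failures(per_replicate_titers, min_replicates, max_fold_change_from_median):
    """Illustrative check of the serum_titers QC thresholds for one serum."""
    failures = []
    for virus, group in per_replicate_titers.groupby("virus"):
        if len(group) < min_replicates:
            failures.append((virus, "too few replicates"))
        median = group["nt50"].median()
        # fold-change of each replicate from the median, always expressed as >= 1
        fold_changes = (group["nt50"] / median).map(lambda r: max(r, 1 / r))
        if fold_changes.max() > max_fold_change_from_median:
            failures.append((virus, "replicate too far from serum-virus median"))
    return failures
```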

### serum_titers_qc_exclusions
This key specifies replicates to exclude and QC checks to ignore in `serum_titers` so that the retained titers pass `serum_titers_qc_thresholds`.
Typically, you would set this if you have serum titers failing `serum_titers_qc_thresholds`.
It is keyed by each serum, then by any viruses for that serum for which replicates need to be dropped or QC checks ignored.
Specifically, it should look like this:
```
serum_titers_qc_exclusions:
  M099d0:
    A/Bangladesh/8002/2021:
      ignore_qc: true # many replicates, so ignore the extra variation around median
    A/Brisbane/02/2018:
      ignore_qc: true # many replicates, so ignore the extra variation around median
    A/Norway/25089/2022:
      replicates_to_drop:
        - plate11-CGGATAAAAATGATAT # NT50 is an outlier
    A/Wisconsin/588/2019:
      replicates_to_drop:
        - plate11-AGTCCTATCCTCAAAT # NT50 is an outlier
  <keys for additional sera>
```
Under each virus, you can set `ignore_qc: true` if you simply want to ignore any QC failures for that virus for that serum.

If you want to exclude specific replicates, instead set `replicates_to_drop` under the virus key to a list of the replicate names for that serum-virus, as they are named in `serum_titers`.
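
Note that the replicate names used in `replicates_to_drop` (such as `plate11-CGGATAAAAATGATAT` above) follow the `<plate_replicate>-<barcode>` pattern constructed in the curve-fitting notebook updated by this commit, roughly as in this sketch (the example values are just taken from the YAML above):
```
import pandas as pd

# each replicate is named as the plate replicate joined to the viral barcode
# with a hyphen (see the notebooks/curvefits.py.ipynb diff below)
df = pd.DataFrame({"plate_replicate": ["plate11"], "barcode": ["CGGATAAAAATGATAT"]})
df = df.assign(replicate=lambda x: x["plate_replicate"] + "-" + x["barcode"])
print(df["replicate"].tolist())  # ['plate11-CGGATAAAAATGATAT']
```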

## Output of the pipeline
The results of running the pipeline are put in the `./results/` subdirectory of your main repo.
We recommend using the `.gitignore` file in [./test_example/.gitignore](test_example/.gitignore) in your main repo to only track key results in your GitHub repo.
@@ -348,18 +387,27 @@ The set of full created outputs are as follows (note only some will be tracked d
- Outputs related to fitting the neutralization curves for each plate:
  - `./results/plates/{plate}/curvefits.csv`: the neutralization curve fits to each serum on each plate, including the NT50s. You should track this in the repo.
  - `./results/plates/{plate}/curvefits.pdf`: PDF rendering of the neutralization curves for the plate. You do not need to track this in the repo as an HTML version of a notebook containing the plot is tracked in `./docs/`.
  - `./results/plates/{plate}/curvefits.pickle`: pickle file with the `neutcurve.CurveFits` object for the plate. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere.
  - `./results/plates/{plate}/curvefits_{plate}.ipynb`: Jupyter notebook that does the curve fitting. You do not need to track this in the repo as an HTML version of the notebook is tracked in `./docs/`.
  - `./results/plates/{plate}/curvefits_{plate}.html`: HTML rendering of the Jupyter notebook that does the curve fitting. You do not need to track this in the repo as it will be rendered in `./docs/` when the pipeline runs successfully.

- Outputs related to analyzing neutralization titers on a per-serum basis, aggregating across plates:
  - `./results/sera/sera_by_plate.csv`: summarizes which plate(s) each serum was run on.
  - `./results/sera/{serum}/titers_median.csv`: titer for each virus against the serum, reported as the median across replicates. You should track this file in the repo.
  - `./results/sera/{serum}/titers_per_replicate.csv`: titers for each replicate of each virus against the serum. You should track this file in the repo.
  - `./results/sera/{serum}/curves.pdf`: PDF rendering of the neutralization curves for the serum. You do not need to track this in the repo as an HTML version of a notebook containing the plots is tracked in `./docs/`.
  - `./results/sera/{serum}/curvefits.pickle`: pickle file with the `neutcurve.CurveFits` object for this serum, after applying QC filters. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere (see the loading sketch after this list).
  - `./results/sera/{serum}/serum_titers_{serum}.ipynb`: Jupyter notebook that aggregates titers for a serum across all plates. You do not need to track this in the repo as an HTML version of the notebook is tracked in `./docs/`.
  - `./results/sera/{serum}/serum_titers_{serum}.html`: HTML rendering of the Jupyter notebook that aggregates titers for a serum across all plates. You do not need to track this in the repo as it will be rendered in `./docs/` when the pipeline runs successfully.

- `./logs/`: logs from `snakemake` rules; you may want to look at these if there are rule failures. They do not need to be tracked in the repo.
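
As a quick sketch of how these per-serum outputs could be loaded for downstream analysis (the file paths are those listed above, the serum name `M099d0` is just the example from the QC-exclusions section, and this snippet is not part of the pipeline itself):
```
import pickle

import pandas as pd

serum = "M099d0"  # example serum name from the serum_titers_qc_exclusions example

# the median and per-replicate titers are plain CSV files
median_titers = pd.read_csv(f"results/sera/{serum}/titers_median.csv")
per_replicate_titers = pd.read_csv(f"results/sera/{serum}/titers_per_replicate.csv")

# the pickled neutcurve.CurveFits object can be re-used for custom plots
# (neutcurve must be installed for unpickling to work)
with open(f"results/sera/{serum}/curvefits.pickle", "rb") as f:
    fits = pickle.load(f)
fig, _ = fits.plotReplicates(ncol=6)  # plotReplicates as used in the pipeline's notebooks
```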

## Running pipeline to identify QC failures and fixing them
If you run the pipeline via `snakemake` with the `--keep-going` flag as recommended above, the pipeline will run as far as possible.
However, any QC failures will keep it from running to completion.
You will then need to manually look at the results, identify the problem (typically problematic barcodes or samples / wells), and decide how to fix it by using the YAML configuration file to exclude the offending barcodes / samples.

For the processing of counts to fraction infectivity, the file `./results/plates/qc_process_counts_summary.txt` will summarize the QC failures and tell you which HTML notebooks to look at for details.
For the processing of counts to fraction infectivity (`process_counts`), the file `./results/plates/qc_process_counts_summary.txt` will summarize the QC failures and tell you which HTML notebooks to look at for details.
You then need to address these QC failures by doing one of the following:

- Removing the offending barcodes by adding them to `barcodes_to_drop` for that plate. (If the barcode is missing in all plates, you might remove it from `viral_barcodes` or `neut_standard_sets`.)
@@ -368,7 +416,11 @@ You then need to address these QC failures by doing one of the following:

- Adjusting the `process_counts_qc_thresholds` for that plate to be more lenient.

For the computation of serum titers against specific viruses, the file `./results/sera/qc_serum_titers_summary.txt` will summarize the QC failures and tell you which HTML notebooks to look at for details.
You will need to address these QC failures by adjusting `serum_titers_qc_exclusions` to either ignore the QC failures for a serum-virus pair or drop specific serum-virus-replicate measurements.

It is expected that you may have to perform several iterations of running the pipeline and fixing QC failures.
The pipeline will only run to completion when all QC filters are passed.

## Test example and testing via GitHub Actions
The [./test_example](test_example) subdirectory contains a small test example that illustrates use of the pipeline.
2 changes: 1 addition & 1 deletion environment.yml
@@ -21,4 +21,4 @@ dependencies:
- ruff
- pip:
- dms_variants==1.4.3
- neutcurve==0.8.0
- neutcurve==0.10.0
42 changes: 15 additions & 27 deletions funcs.smk
@@ -5,33 +5,9 @@ Designed to be included in another ``Snakefile`` that specifies the config.
"""


def get_viral_strain_plot_order(viral_libs, config):
    """Get the viral strain plot order."""
    viral_strain_plot_order = {viral_library: None for viral_library in viral_libs}
    if "viral_strain_plot_order" in config:
        viral_strain_plot_order = (
            viral_strain_plot_order | config["viral_strain_plot_order"]
        )
        if set(viral_strain_plot_order) != set(viral_libs):
            raise ValueError(
                f"{viral_strain_plot_order.keys()=} != {viral_lib.keys()=}"
            )
    for viral_library, csv in viral_strain_plot_order.items():
        viral_library_strains = sorted(
            set(pd.read_csv(viral_libs[viral_library])["strain"])
        )
        if csv:
            viral_order = pd.read_csv(csv)["strain"].tolist()
            if len(viral_order) != len(set(viral_order)):
                raise ValueError(f"duplicate strains in viral_strain_order CSV {csv}")
            if set(viral_order) != set(viral_library_strains):
                raise ValueError(
                    f"viral_strain_order does not have correct strains for {viral_library}"
                )
            viral_strain_plot_order[viral_library] = viral_order
        else:
            viral_strain_plot_order[viral_library] = viral_library_strains
    return viral_strain_plot_order
import copy
import functools
import os


def process_plate(plate, plate_params):
@@ -162,3 +138,15 @@ def process_plate(plate, plate_params):
plate_d["samples"] = samples_df

return plate_d


@functools.lru_cache
def sera_plates():
    """Get dict keyed by serum with values lists of plates with titers for serum."""
    csv_file = checkpoints.sera_by_plate.get().output.csv
    return (
        pd.read_csv(csv_file)
        .assign(plates=lambda x: x["plates"].str.split(";"))
        .set_index("serum")["plates"]
        .to_dict()
    )
40 changes: 24 additions & 16 deletions notebooks/curvefits.py.ipynb
@@ -16,6 +16,8 @@
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"import neutcurve\n",
"\n",
"import pandas as pd"
@@ -40,6 +42,7 @@
"frac_infectivity_csv = snakemake.input.frac_infectivity_csv\n",
"output_csv = snakemake.output.csv\n",
"output_pdf = snakemake.output.pdf\n",
"output_pickle = snakemake.output.pickle\n",
"curvefit_params = snakemake.params.curvefit_params\n",
"plate = snakemake.wildcards.plate"
]
@@ -91,10 +94,7 @@
"print(f\"Clipping with {frac_infectivity_ceiling=}\")\n",
"\n",
"frac_infectivity = pd.read_csv(frac_infectivity_csv).assign(\n",
" serum_replicate=lambda x: x[\"serum\"].where(\n",
" x[\"plate_replicate\"] == plate,\n",
" x[\"serum\"] + \" (\" + x[\"plate_replicate\"].replace(plate, \"\") + \")\",\n",
" ),\n",
" replicate=lambda x: x[\"plate_replicate\"] + \"-\" + x[\"barcode\"],\n",
" serum_concentration=lambda x: 1 / x[\"dilution_factor\"],\n",
" frac_infectivity=lambda x: x[\"frac_infectivity\"].clip(\n",
" upper=frac_infectivity_ceiling,\n",
@@ -136,9 +136,9 @@
" ),\n",
" conc_col=\"serum concentration\",\n",
" fracinf_col=\"fraction infectivity\",\n",
" serum_col=\"serum_replicate\",\n",
" serum_col=\"serum\",\n",
" virus_col=\"strain\",\n",
" replicate_col=\"barcode\",\n",
" replicate_col=\"replicate\",\n",
" fixtop=curvefit_params[\"fixtop\"],\n",
" fixbottom=curvefit_params[\"fixbottom\"],\n",
")"
@@ -161,7 +161,7 @@
"source": [
"fig, _ = fits.plotReplicates(\n",
" attempt_shared_legend=False,\n",
" legendfontsize=9,\n",
" legendfontsize=8,\n",
" titlesize=10,\n",
" ticksize=10,\n",
" ncol=6,\n",
@@ -204,32 +204,40 @@
"source": [
"fit_params = (\n",
" fits.fitParams(average_only=False, no_average=True)\n",
" .rename(columns={\"serum\": \"serum_replicate\", \"replicate\": \"barcode\"})\n",
" .assign(nt50=lambda x: 1 / x[\"ic50\"])\n",
" .merge(\n",
" frac_infectivity[\n",
" [\"serum\", \"serum_replicate\", \"plate_replicate\"]\n",
" ].drop_duplicates(),\n",
" frac_infectivity[[\"serum\", \"replicate\"]].drop_duplicates(),\n",
" validate=\"many_to_one\",\n",
" )\n",
" .drop(columns=[\"ic50_str\", \"nreplicates\", \"serum_replicate\"])\n",
" .sort_values([\"serum\", \"virus\", \"plate_replicate\", \"barcode\"])\n",
" .drop(columns=[\"ic50_str\", \"nreplicates\"])\n",
" .sort_values([\"serum\", \"virus\", \"replicate\"])\n",
")\n",
"\n",
"assert len(fit_params) == len(frac_infectivity.groupby([\"barcode\", \"serum_replicate\"]))\n",
"assert len(fit_params) == len(frac_infectivity.groupby([\"serum\", \"replicate\"]))\n",
"\n",
"print(f\"Saving to {output_csv}\")\n",
"\n",
"fit_params.to_csv(output_csv, index=False, float_format=\"%.4g\")"
]
},
{
"cell_type": "markdown",
"id": "79ff668d-e39c-4acb-8ef9-e6fe15a3f8f3",
"metadata": {},
"source": [
"Save the pickled `CurveFits` object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dcf7dad2-dcc4-4a5a-8b3d-9e8fa101ad1e",
"id": "c181d160-5c8f-4c8e-ae5b-32e241823ee3",
"metadata": {},
"outputs": [],
"source": []
"source": [
"with open(output_pickle, \"wb\") as f:\n",
" pickle.dump(fits, f)"
]
}
],
"metadata": {
