aggregate titers across sera (#8)

* upgrade to `neutcurve 0.10.1 * all serum titers adjusted to pass QC * mostly complete `aggregate_titers` * complete `aggregate_titers` * document aggregate titers * lint and format * save all results in test Github Actions artifact * update config to pass tests and plot QC failure in `serum_titers` in a clearer-to-look-at order
jbloomlab · Dec 14, 2023 · 2475a95 · 2475a95
1 parent 7098442
commit 2475a95
Show file tree

Hide file tree

Showing 12 changed files with 466 additions and 25 deletions.
diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml
@@ -64,5 +64,5 @@ jobs:
         if: failure()
         with:
           name: log-files
-          path: test_example/results/logs
+          path: test_example/results
           retention-days: 7
diff --git a/README.md b/README.md
@@ -17,6 +17,9 @@ The titers are computed by fitting Hill-curve style neutralization curves using
 The titers are summarized by the neutralization titer 50% (NT50), which is the serum dilution factor at which the serum neutralizations half of the viral infectivity.
 So a NT50 of 200 means that at a 1:200 dilution of the serum, half the viral infectivity is neutralized.
 Note that the NT50 is the reciprocal of the IC50 (concentration of serum at which 50% of viral infectivity is neutralized).
+Typically there will be multiple replicates, and the reported titer is the median among replicates.
+
+Note also there are several quality control steps with thresholds and exclusions specified in the configuration YAML (see below), and the pipeline will only run to completion once all quality-control is passed.
 
 ## Using this pipeline
 This pipeline is designed to be included as a modular portion of a larger [snakemake](https://snakemake.readthedocs.io/) analysis.
@@ -367,9 +370,10 @@ Under each virus, you can set `ignore_qc: true` if you simply want to ignore any
 
 If you want to exclude specific replicates, instead under the virus key set `replicates_to_drop` to be a name of the replicate for that serum-virus as named in `serum_titers`.
 
-## Output of the pipeline
+## Results of running the pipeline
 The results of running the pipeline are put in the `./results/` subdirectory of your main repo.
 We recommend using the `.gitignore` file in [./test_example/.gitignore] in your main repo to only track key results in your GitHub repo.
+The key results file if the pipeline runs to completion in `./results/aggregated_titers/titers.csv`.
 The set of full created outputs are as follows (note only some will be tracked depending on your `.gitignore`):
 
   - Outputs related to barcode counting:
@@ -387,7 +391,7 @@ The set of full created outputs are as follows (note only some will be tracked d
   - Outputs related to fitting the neutralization curves for each plate:
     - `./results/plates/{plate}/curvefits.csv`: the neutralization curve fits to each serum on each plate, including the NT50s. You should track this in repo.
     - `./results/plates/{plate}/curvefits.pdf`: PDF rendering the neutralization curves for the plate. You do not need to track this in the repo as a HTML version of a notebook containing the plot is tracked in `./docs/`.
-    - `./results/plates/{plate}/curvefits.pickle`: pickle files with the `neutcurve.CurveFits` object for the plate. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere.
+    - `./results/plates/{plate}/curvefits.pickle`: pickle file with the `neutcurve.CurveFits` object for the plate. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere.
     - `./results/plates/{plate}/curvefits_{plate}.ipynb`: Jupyter notebook that does the curve fitting. You do not need to track this in the repo as a HTML version of the notebook is tracked in `./docs/`.
     - `./results/plates/{plate}/curvefits_{plate}.html`: HTML rendering of Jupyter notebook that does the curve fitting. You do not need to track this in the repo as it will be rendered in `./docs/` when the pipeline runs successfully.
 
@@ -396,10 +400,16 @@ The set of full created outputs are as follows (note only some will be tracked d
     - `./results/sera/{serum}/titers_median.csv`: titer for each virus against the serum, reported as the median across replicates. You should track this file in the repo.
     - `./results/sera/{serum}/titers_per_replicate.csv`: titers for each replicate of each virus against the serum. You should track this file in the repo.
     - `./results/sera/{serum}/curves.pdf`: PDF rendering of the neutralization curves for the serum. You do not need to track this in the repo as a HTML version of a notebook containing the plots is tracked in `./docs/`.
-    - `./results/sera/{serum}/curvefits.pickle`: pickle files with the `neutcurve.CurveFits` object for this serum, after applying QC filters. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere.
+    - `./results/sera/{serum}/curvefits.pickle`: pickle file with the `neutcurve.CurveFits` object for this serum, after applying QC filters. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere.
     - `./results/sera/{serum}/serum_titers_{serum}.ipynb`: Jupyter notebook that aggregates titers for a serum across all plates. You do not need to track this in the repo as a HTML version of the notebook is tracked in `./docs/`.
     - `./results/sera/{serum}/serum_titers_{serum}.html`: HTML rendering of the Jupyter notebook that aggregates titers for a serum across all plates. You do not need to track this in the repo as it will be rendered in `./docs/` when the pipeline runs successfully.
 
+  - Results related to aggregated titers across all sera after applying all quality control:
+    - `./results/aggregated_titers/titers.csv`: titers for all sera / virus (median of replicates). You should track this file as it has the final processed results.
+    - `./results/aggregated_titers/curvefits.pickle`: pickle file with the `neutcurve.CurveFits` object holding all final curves. You do not need to track this in the repo, but if you have further code that makes specific plots you may want to use this.
+    - `./results/aggregated_titers/titers.html`: interactive plot of titers for all sera. You do not need to track this in the repo as it is rendered in `./docs/` when the pipeline runs successfully.
+    - `./results/aggregated_titers/aggregate_titers.ipynb`: Jupyter notebook that aggregates all the titers. You do not need to track this in the repo.
+
   - `./logs/`: logs from `snakemake` rules, you may want to look at these if there are rule failures. They do not need to be tracked in the repo.
 
 ## Running pipeline to identify QC failures and fixing them

diff --git a/environment.yml b/environment.yml
@@ -21,4 +21,4 @@ dependencies:
   - ruff
   - pip:
     - dms_variants==1.4.3
-    - neutcurve==0.10.0
+    - neutcurve==0.10.1
diff --git a/notebooks/aggregate_titers.py.ipynb b/notebooks/aggregate_titers.py.ipynb
@@ -0,0 +1,232 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "abe395dc-8a57-4ee8-b22e-8a1acacf1619",
+   "metadata": {},
+   "source": [
+    "# Aggregate titers across all sera\n",
+    "Aggregate the titers across all sera, failing if there are QC failures for any individual sera."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f404cc02-6d9a-4a3a-892e-fd73f1e00b52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "\n",
+    "import altair as alt\n",
+    "\n",
+    "import neutcurve\n",
+    "\n",
+    "import pandas as pd\n",
+    "\n",
+    "_ = alt.data_transformers.disable_max_rows()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "be4112bc-7d79-4b38-8e76-ec81588f40e9",
+   "metadata": {},
+   "source": [
+    "Get variables from `snakemake`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "83425f55-9905-4cc0-b42c-440926e9a81e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "qc_failures_file = snakemake.input.qc_serum_titer_failures\n",
+    "input_pickles = snakemake.input.pickles\n",
+    "input_titers = snakemake.input.titers\n",
+    "output_pickle = snakemake.output.pickle\n",
+    "output_titers = snakemake.output.titers\n",
+    "titers_chart_html = snakemake.output.titers_chart\n",
+    "viral_strain_plot_order = snakemake.params.viral_strain_plot_order"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d3668ae9-ad6d-436e-8e7b-9a0cd3b30548",
+   "metadata": {},
+   "source": [
+    "Check for quality control failures for any individual sera:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1fbc6445-bbd0-4d7e-b50e-a861e03c06b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(qc_failures_file) as f:\n",
+    "    qc_failures = f.readlines()\n",
+    "if not all(line.strip().endswith(\"serum passed all QC\") for line in qc_failures):\n",
+    "    raise ValueError(\n",
+    "        f\"QC failures for some serum titers. See {qc_failures_file}:\\n\\n\"\n",
+    "        + \"\".join(qc_failures)\n",
+    "    )\n",
+    "else:\n",
+    "    print(\"All serum titers pass the QC filters.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7a89d9f5-a789-454b-9ec6-3186a5d55c7b",
+   "metadata": {},
+   "source": [
+    "Get the merged titers and merged `CurveFits` object:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dda8dec6-8981-4f10-9fe2-afc35066929e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert len(input_titers) == len(input_pickles)\n",
+    "\n",
+    "titers = pd.concat([pd.read_csv(f) for f in input_titers], ignore_index=True)\n",
+    "assert len(titers) == len(titers.groupby([\"serum\", \"virus\"]))\n",
+    "print(f\"Writing aggregated titers to {output_titers}\")\n",
+    "titers.to_csv(output_titers, index=False, float_format=\"%.4g\")\n",
+    "\n",
+    "fits_list = []\n",
+    "for fname in input_pickles:\n",
+    "    with open(fname, \"rb\") as f:\n",
+    "        fits_list.append(pickle.load(f))\n",
+    "curvefits = neutcurve.CurveFits.combineCurveFits(fits_list)\n",
+    "print(f\"Pickling aggregated `CurveFits` to {output_pickle}\")\n",
+    "with open(output_pickle, \"wb\") as f:\n",
+    "    pickle.dump(curvefits, f)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e194786d-c227-403f-bbf9-d8ab3b19b502",
+   "metadata": {},
+   "source": [
+    "Plot all the titers:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eb2c2a97-e54a-4875-bc75-d57975017565",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "viruses = [v for v in viral_strain_plot_order if v in curvefits.allviruses]\n",
+    "\n",
+    "sera = curvefits.sera\n",
+    "\n",
+    "virus_selection = alt.selection_point(fields=[\"virus\"], on=\"mouseover\", empty=False)\n",
+    "\n",
+    "serum_selection = alt.selection_point(\n",
+    "    fields=[\"serum\"],\n",
+    "    bind=\"legend\",\n",
+    "    toggle=\"true\",\n",
+    ")\n",
+    "\n",
+    "ncols = 8\n",
+    "\n",
+    "titers_chart = (\n",
+    "    alt.Chart(titers)\n",
+    "    .add_params(virus_selection, serum_selection)\n",
+    "    .transform_filter(serum_selection)\n",
+    "    .encode(\n",
+    "        alt.X(\n",
+    "            \"nt50\",\n",
+    "            title=\"neutralization titer\",\n",
+    "            scale=alt.Scale(nice=False, padding=4, type=\"log\"),\n",
+    "            axis=alt.Axis(labelOverlap=True),\n",
+    "        ),\n",
+    "        alt.Y(\"virus\", sort=viruses),\n",
+    "        alt.Facet(\n",
+    "            \"serum\",\n",
+    "            header=alt.Header(\n",
+    "                title=None, labelFontSize=11, labelFontStyle=\"bold\", labelPadding=0\n",
+    "            ),\n",
+    "            spacing=3,\n",
+    "            columns=ncols,\n",
+    "        ),\n",
+    "        alt.StrokeWidth(\n",
+    "            \"serum:N\",\n",
+    "            scale=alt.Scale(domain=sera, range=[1] * len(sera)),\n",
+    "            legend=alt.Legend(\n",
+    "                orient=\"bottom\",\n",
+    "                columns=ncols,\n",
+    "                symbolFillColor=\"black\",\n",
+    "                title=\"serum (click to select)\",\n",
+    "            ),\n",
+    "        ),\n",
+    "        color=alt.condition(virus_selection, alt.value(\"red\"), alt.value(\"black\")),\n",
+    "        tooltip=[\n",
+    "            \"serum\",\n",
+    "            \"virus\",\n",
+    "            alt.Tooltip(\"nt50\", title=\"NT50\", format=\".3g\"),\n",
+    "            \"n_replicates\",\n",
+    "        ],\n",
+    "    )\n",
+    "    .mark_line(point=True)\n",
+    "    .configure_axis(grid=False)\n",
+    "    .configure_point(size=45)\n",
+    "    .properties(\n",
+    "        height=alt.Step(11),\n",
+    "        width=100,\n",
+    "        title=alt.TitleParams(\n",
+    "            \"Interactive chart of serum neutralization titers\",\n",
+    "            subtitle=\"Mouseover points for details, click serum legend at bottom to select sera to show\",\n",
+    "            fontSize=15,\n",
+    "            dx=100,\n",
+    "            dy=-5,\n",
+    "        ),\n",
+    "        autosize=alt.AutoSizeParams(resize=True),\n",
+    "    )\n",
+    ")\n",
+    "\n",
+    "print(f\"Saving chart to {titers_chart_html}\")\n",
+    "titers_chart.save(titers_chart_html)\n",
+    "\n",
+    "titers_chart"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "27ff41ab-c967-42a9-adb1-5856810c7f34",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/serum_titers.py.ipynb b/notebooks/serum_titers.py.ipynb
@@ -239,12 +239,12 @@
     "per_rep_chart = (\n",
     "    alt.Chart(per_rep_titers)\n",
     "    .encode(\n",
-    "        alt.X(\"virus\", sort=viruses),\n",
-    "        alt.Y(\n",
+    "        alt.X(\n",
     "            \"nt50\",\n",
     "            title=\"neutralization titer\",\n",
     "            scale=alt.Scale(nice=False, padding=5, type=\"log\"),\n",
     "        ),\n",
+    "        alt.Y(\"virus\", sort=viruses),\n",
     "        alt.Shape(\"nt50_bound\", title=\"is titer bound?\"),\n",
     "        strokeWidth=alt.condition(virus_selection, alt.value(2), alt.value(0)),\n",
     "        tooltip=[\n",
@@ -266,12 +266,12 @@
     "median_chart = (\n",
     "    alt.Chart(median_titers)\n",
     "    .encode(\n",
-    "        alt.X(\"virus\", sort=viruses),\n",
-    "        alt.Y(\n",
+    "        alt.X(\n",
     "            \"nt50\",\n",
     "            title=\"neutralization titer\",\n",
     "            scale=alt.Scale(nice=False, padding=5, type=\"log\"),\n",
     "        ),\n",
+    "        alt.Y(\"virus\", sort=viruses),\n",
     "        alt.Shape(\"nt50_bound\", title=\"is titer bound?\"),\n",
     "        strokeWidth=alt.condition(virus_selection, alt.value(2), alt.value(0)),\n",
     "        tooltip=[\n",
@@ -294,8 +294,8 @@
     "    (per_rep_chart + median_chart)\n",
     "    .add_params(virus_selection)\n",
     "    .properties(\n",
-    "        width=alt.Step(14),\n",
-    "        height=200,\n",
+    "        height=alt.Step(14),\n",
+    "        width=250,\n",
     "        title=f\"{serum} median (orange) and per-replicate (blue) titers\",\n",
     "    )\n",
     "    .configure_axis(grid=False)\n",
@@ -321,9 +321,14 @@
    "source": [
     "qc_failures = []\n",
     "\n",
-    "insufficient_replicates = median_titers[\n",
-    "    median_titers[\"n_replicates\"] < qc_thresholds[\"min_replicates\"]\n",
-    "].query(\"virus not in @viruses_ignore_qc\")\n",
+    "insufficient_replicates = (\n",
+    "    median_titers.sort_values(\"virus\", key=lambda s: s.map(lambda v: viruses.index(v)))\n",
+    "    .reset_index(drop=True)[\n",
+    "        median_titers[\"n_replicates\"] < qc_thresholds[\"min_replicates\"]\n",
+    "    ]\n",
+    "    .query(\"virus not in @viruses_ignore_qc\")\n",
+    "    .drop(columns=\"serum\")\n",
+    ")\n",
     "if len(insufficient_replicates):\n",
     "    print(f\"The following viruses fail {qc_thresholds['min_replicates']=}\")\n",
     "    display(insufficient_replicates)\n",
@@ -343,7 +348,11 @@
     "            | (x[\"fold_change_from_median\"] < 1 / max_fold_change_from_median)\n",
     "        ),\n",
     "    )\n",
-    "    .query(\"excess_fold_change and (virus not in @viruses_ignore_qc)\")\n",
+    "    .sort_values(\"virus\", key=lambda s: s.map(lambda v: viruses.index(v)))\n",
+    "    .reset_index(drop=True)\n",
+    "    .query(\"excess_fold_change and (virus not in @viruses_ignore_qc)\")[\n",
+    "        [\"virus\", \"replicate\", \"median_nt50\", \"nt50\", \"fold_change_from_median\"]\n",
+    "    ]\n",
     ")\n",
     "if len(excess_fold_change_from_median):\n",
     "    print(f\"\\nThese replicates fail {qc_thresholds['max_fold_change_from_median']=}\")\n",