This is a modular analysis pipeline for analyzing high-throughput sequencing-based neutralization assays of the type developed in the Bloom lab. See Loes et al (2024) for a description of these assays.
Please cite Loes et al (2024) if you use this pipeline for your scientific study.
See here for a list of the contributors to this pipeline.
This pipeline goes from the FASTQ files that represent the counts of each barcoded viral variant to the computed neutralization titers for each serum. The titers are computed by fitting Hill-curve style neutralization curves using the neutcurve package; see the documentation for the details of these curves. The titers represent the reciprocal serum dilutions at which half the viral infectivity is neutralized. The pipeline provides options to compute these titers as either:
- the reciprocal of the inhibitory concentration 50% (IC50), namely as the neutralization titer 50% (NT50)
- the reciprocal of the midpoint of the neutralization curve
When the curves are fit with a top plateau of 1 and a bottom plateau of zero, these two different ways of calculating the titers are identical, and represent the serum dilution factor at which the serum neutralizes half of the viral infectivity. Typically there will be multiple replicates, and the final reported titer is the median among replicates.
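To make the equivalence concrete, here is a minimal sketch in plain Python (not pipeline code; the function name is illustrative) of the Hill-style curve and why the NT50 and midpoint titers coincide when the plateaus are fixed at one and zero:

```python
def frac_infectivity(c, midpoint, slope, top=1.0, bottom=0.0):
    """Hill-style curve: fraction infectivity at serum concentration c.

    c is the serum concentration (1 / dilution factor); midpoint is the
    concentration at which the curve is halfway between top and bottom.
    """
    return bottom + (top - bottom) / (1 + (c / midpoint) ** slope)

# With top=1 and bottom=0, infectivity at the midpoint is exactly 0.5,
# so the IC50 equals the midpoint and NT50 == 1 / midpoint.
mid = 1 / 250  # hypothetical midpoint at a 1:250 serum dilution
assert abs(frac_infectivity(mid, mid, slope=2.0) - 0.5) < 1e-12

# With a top plateau below 1 (e.g., 0.8), infectivity at the midpoint is
# (top + bottom) / 2 = 0.4, not 0.5, so the concentration where
# infectivity = 0.5 (the IC50) no longer equals the midpoint, and the
# "nt50" and "midpoint" titer definitions diverge.
assert abs(frac_infectivity(mid, mid, slope=2.0, top=0.8) - 0.4) < 1e-12
```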
The pipeline also performs extensive quality control at different steps using configurable options described below.
This pipeline is designed to be included as a modular portion of a larger snakemake analysis. This pipeline processes FASTQ files to get the counts of each barcode, analyzes those to determine the fraction infectivity at each serum concentration, and then fits and plots neutralization curves.
To use the pipeline, first create another repository specific to your project. Then include this pipeline as a git submodule in your repository, where it will be present in a subdirectory named `seqneut-pipeline`.
So the overall structure will look like this:
```
<your_project_repo>
├── seqneut-pipeline [added as git submodule]
│   ├── seqneut-pipeline.smk [snakemake rules for pipeline]
│   ├── environment.yml [conda environment for pipeline]
│   └── ... [other contents of seqneut-pipeline]
├── README.md [README for your project]
├── config.yml [YAML configuration for project]
├── Snakefile [top-level snakemake file that includes `seqneut-pipeline.smk`]
├── data [subdirectory with input data for your project]
├── results [subdirectory with results created by pipeline]
├── docs [HTML summary of results created by pipeline]
└── <other files / subdirectories that are part of project>
```
So after you have created your project repo, add this pipeline as a git submodule with:

```
git submodule add https://github.com/jbloomlab/seqneut-pipeline
```

This creates a file called `.gitmodules` and the `seqneut-pipeline` submodule, which can then be committed to the repo. If at some point you want to update the version of the pipeline, simply `cd` into the `seqneut-pipeline` subdirectory and pull or checkout the version you want.
To use the pipeline, you then need to add a few things to your main-level repo. The first is a top-level `Snakefile` that includes your configuration, `seqneut-pipeline`, and the outputs of `seqneut-pipeline` as targets of the `all` rule. So minimally that top-level `Snakefile` should contain the following lines (it can also contain additional stuff if you also have it running project-specific analyses on the output of the pipeline):

```
import os

configfile: "config.yml"

include: os.path.join(config["seqneut-pipeline"], "seqneut-pipeline.smk")

rule all:
    input:
        seqneut_pipeline_outputs
```
In this `Snakefile`, the `seqneut_pipeline_outputs` specify files created by the pipeline. Several of these may be of special interest for you to use in additional rules you define in `Snakefile`:

- `./results/aggregated_titers/titers_{group}.csv`: CSV with the final (median across replicates) titers for each serum-virus pair in a group after applying quality-control filters.
- `./results/aggregated_titers/curvefits_{group}.pickle`: a pickled `neutcurve.CurveFits` object with all of the curve fits for all serum-virus-replicates in a group after applying the QC filters. You can use the methods of this object to make plots of neutralization curves for specific sera / viruses / replicates.
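As a sketch of how you might consume these outputs in your own downstream rules (the CSV rows below are made-up illustrative data, and the column names are assumptions based on the description above, so check them against your actual output before relying on them):

```python
import io

import pandas as pd

# In your project this would be the real pipeline output, e.g.:
#   titers = pd.read_csv("results/aggregated_titers/titers_serum.csv")
# Here we parse made-up rows; the real CSV's column names may differ.
titers = pd.read_csv(io.StringIO(
    "serum,virus,titer\n"
    "M099d0,A/Washington/23/2020,820\n"
    "M099d0,A/Togo/845/2020,145\n"
))

# e.g., reshape to a serum-by-virus matrix of titers
wide = titers.pivot(index="serum", columns="virus", values="titer")

# The pickled curve fits can be loaded with the standard library, e.g.:
#   import pickle
#   with open("results/aggregated_titers/curvefits_serum.pickle", "rb") as f:
#       fits = pickle.load(f)
#   # then use the plotting methods of the neutcurve.CurveFits object
```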
In addition, you need to create the configuration file `config.yml` and ensure it includes the appropriate configuration for `seqneut-pipeline` as described below.

To track the correct files in the created results, we suggest you copy the ./test_example/.gitignore file to be the `.gitignore` for your main repo. This will track key results files, but not an excessive number of non-essential files.
Finally, you need to create a conda environment that minimally includes the packages needed to run the pipeline, which are a recent version of snakemake and pandas. You can either create your own environment containing these, or simply build and use the one specified in the environment.yml file of `seqneut-pipeline`, which is named `seqneut-pipeline`. So if you are using that environment, you can simply run the pipeline with:

```
conda activate seqneut-pipeline
snakemake -j <n_jobs> --software-deployment-method conda
```

Note also that a few rules have rule-specific conda environments in ./envs/.
The configuration for the pipeline is in a file called `config.yml`. An example configuration file is in ./test_example/config.yml (although some of the QC thresholds are set more leniently to make the test example work for small data, as described in the comments in that YAML). Here we describe the required keys in this YAML file (you can also add additional information specific to your repo, but we advise using comments to keep that in a separate part of the YAML from the `seqneut-pipeline` configuration). For background on YAML format, including the anchor (`&`) and merge (`<<: *`) operators that can be helpful for simplifying the YAML file, see here and here.
The top-level keys in the YAML are:
Location of `seqneut-pipeline` relative to the top-level repo. This will almost always be a subdirectory of the same name, so this key will be as shown below unless you have a good reason to do otherwise:

```yaml
seqneut-pipeline: seqneut-pipeline
```
Location where we create the `./docs/` subdirectory with HTMLs for rendering on GitHub Pages. This will almost always be `docs`, so this key will be as shown below unless you have a good reason to do otherwise:

```yaml
docs: docs
```
Description of the pipeline, used in the HTML docs rendering. Should include a title (as a markdown `#` heading), authors and/or citation, and a link to the GitHub repo. For instance:

```yaml
description: |
  # <title>
  <short description>
  <authors and/or link to citation>
  See <GitHub repo link> for code and numerical data.
```
A dictionary (mapping) of viral library names to CSV files holding these libraries. So in general this key will look like:

```yaml
viral_libraries:
  pdmH1N1_lib2023_loes: data/viral_libraries/pdmH1N1_lib2023_loes.csv
  <potentially more viral libraries specified as name: CSV pairs>
```

The recommended way to organize the viral libraries (as indicated above) is to put them in a ./data/viral_libraries/ subdirectory. The CSV files themselves will have columns specifying the viral barcode and the strain it corresponds to, such as:

```
barcode,strain
ACGGAATCCCCTGAGA,A/Washington/23/2020
GCATGGATCCTTTACT,A/Togo/845/2020
<additional lines>
```
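Barcode typos in these CSVs are easy to make, so a short sanity check like the following can catch duplicated barcodes or inconsistent barcode lengths before you run anything. This is plain pandas, not part of the pipeline, and the example rows are made up:

```python
import io

import pandas as pd

# In practice: pd.read_csv("data/viral_libraries/pdmH1N1_lib2023_loes.csv")
lib = pd.read_csv(io.StringIO(
    "barcode,strain\n"
    "ACGGAATCCCCTGAGA,A/Washington/23/2020\n"
    "GCATGGATCCTTTACT,A/Togo/845/2020\n"
))

assert lib["barcode"].is_unique, "duplicated barcodes in library"
assert lib["barcode"].str.len().nunique() == 1, "barcodes differ in length"
assert lib["barcode"].str.fullmatch("[ACGT]+").all(), "non-ACGT characters"
```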
A CSV with a column named "strain" that lists the strains in the order they should be plotted. If not specified or set to "null", then plotting is just alphabetical. If specified, it must include all strains being used. So it should look like this:

```yaml
viral_strain_plot_order: data/viral_strain_plot_order.csv
```

The CSV file itself will just have a column named "strain" specifying the order, such as:

```
strain
A/California/07/2009
A/Michigan/45/2015
<additional lines>
```
A dictionary (mapping) of neutralization-standard set names to CSV files holding the barcodes for the neutralization-standard set. So in general, this key will look like:

```yaml
neut_standard_sets:
  loes2023: data/neut_standard_sets/loes2023_neut_standards.csv
  <potentially more neut standard sets specified as name: CSV pairs>
```

The recommended way to organize the neutralization-standard sets (as indicated above) is to put them in a ./data/neut_standard_sets/ subdirectory. The CSV files need just a single column specifying the neutralization-standard barcode, such as:

```
barcode
CTTTAAATTATAGTCT
CATACAGAGTTTGTTG
<additional lines>
```
A dictionary (mapping) specifying how to parse the Illumina FASTQ files to barcode counts. This is a global dictionary that is applied to all plates, but it can be augmented or overridden on a per-plate basis by specifying plate-specific `illumina_barcode_parser_params` as described in the plate configuration below. This mapping should just specify keyword arguments that can be passed to dms_variants.illuminabarcodeparser.IlluminaBarcodeParser; note that the barcode-length (`bclen`) parameter should not be specified, as it is inferred from the length of the barcodes. So in general, this key will look like this:

```yaml
illumina_barcode_parser_params:
  upstream: CTCCCTACAATGTCGGATTTGTATTTAATAG
  downstream: ''
  minq: 20
  upstream_mismatch: 4
  bc_orientation: R2
```
This dictionary (mapping) contains the heart of the configuration, and may be quite large. Essentially, it specifies what samples are contained in each plate, how those samples should be processed, QC thresholds, and any specific barcodes or samples that should be dropped. In addition, each plate is assigned to a group, which might be "serum" or "pilot" (if you are mixing analyses of your sera with pilot experiments), or could be additional groups if you have two distinct sets of sera.
The basic structure is that `plates` maps plate names to configurations for the plates.
Specifically, it should look like this:
```yaml
plates:

  plate1:
    group: serum
    date: 2023-08-01
    viral_library: pdmH1N1_lib2023_loes
    neut_standard_set: loes2023
    samples_csv: data/plates/plate1_samples.csv
    manual_drops: {}
    qc_thresholds:
      <<: *default_process_plate_qc_thresholds
    curvefit_params:
      <<: *default_process_plate_curvefit_params
    curvefit_qc:
      <<: *default_process_plate_curvefit_qc
    illumina_barcode_parser_params:  # optional key
      upstream2: GCTACA

  <additional_plates>
```
The above example shows the configuration of a plate called `plate1`, and there may be many additional plates.
The elements under each plate-mapping are in turn as follows:
The `group` key gives the group that this plate is assigned to (the group name cannot contain any underscores). Typically this might be "serum" or "pilot", or however you are categorizing the runs.
The `date` key specifies the date on which the plate was processed, in `YYYY-MM-DD` format.

The `viral_library` key gives the name of a key in `viral_libraries` that specifies the barcodes / strains for the viral library used for this plate.

The `neut_standard_set` key gives the name of a key in `neut_standard_sets` that specifies the neutralization-standard set barcodes used for this plate.

The `samples_csv` key gives the name of a CSV file specifying the samples for that plate. The recommended way to organize these sample CSVs is to put them in a ./data/plates/ subdirectory.
The CSV file must have the following columns:

- `well`: well in the plate in which the sample was run, typically with names like "A1", "B1", etc.
- `serum`: name of the serum in this well, or "none" if it is a no-serum sample.
- `dilution_factor`: dilution factor of the serum (should be a number > 1); leave blank for the no-serum samples (serum of "none").
- `replicate`: the replicate of this serum, which you only need to specify if there are multiple different samples with the same serum and dilution_factor in the plate.
- `fastq`: path to the FASTQ file, which can be gzipped.
- Other columns (e.g., notes) are allowed but are ignored by the pipeline.

Here are a few lines of an example CSV file:

```
well,serum,dilution_factor,replicate,fastq
A1,none,,1,/fh/fast/bloom_j/SR/ngs/illumina/aloes/230801_VH00319_391_AACWKHTM5/Unaligned/Project_aloes/Plate1_Noserum1_S1_R1_001.fastq.gz
A2,Y106d182,20.0,1,/fh/fast/bloom_j/SR/ngs/illumina/aloes/230801_VH00319_391_AACWKHTM5/Unaligned/Project_aloes/Y106_d182_conc1_S9_R1_001.fastq.gz
<additional lines>
```
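A quick consistency check on a samples CSV can save a failed pipeline run. This sketch (plain pandas, not pipeline code, with made-up rows and hypothetical FASTQ paths) verifies the column rules described above:

```python
import io

import pandas as pd

# In practice: pd.read_csv("data/plates/plate1_samples.csv")
samples = pd.read_csv(io.StringIO(
    "well,serum,dilution_factor,replicate,fastq\n"
    "A1,none,,1,reads/plate1_A1_R1.fastq.gz\n"
    "A2,Y106d182,20.0,1,reads/plate1_A2_R1.fastq.gz\n"
))

no_serum = samples["serum"] == "none"

# no-serum wells must leave dilution_factor blank
assert samples.loc[no_serum, "dilution_factor"].isna().all()

# serum wells must have a dilution factor > 1
assert (samples.loc[~no_serum, "dilution_factor"] > 1).all()

# each well should appear only once per plate
assert samples["well"].is_unique
```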
As you analyze plates, you may find specific barcodes, wells, etc that you want to drop even if they don't fail the QC, because they appear problematic to you for some reason. If so, specify them using this key (if you don't want to manually drop any data for the plate, then just set this key to an empty dictionary, `{}`). The manual drops can have the following keys:

- `wells`: list of wells to drop
- `barcodes`: list of barcodes to drop from all wells
- `barcode_wells`: list of `[barcode, well]` lists to drop specific barcodes in specific wells
- `barcode_serum_replicates`: list of `[barcode, serum_replicate]` lists to drop specific barcodes for specific serum-replicates
- `serum_replicates`: list of serum-replicates to drop

So for instance, you could have this specification if you wanted to drop barcode `AGTCCTATCCTCAAAT` for all wells of serum-replicate `M099d0`:

```yaml
manual_drops:
  barcode_serum_replicates:
    - [AGTCCTATCCTCAAAT, M099d0]
```
This key defines a mapping of the quality-control thresholds for processing the sequencing counts to get fraction infectivities. These thresholds are used for the QC in the `process_count` rule (see the section below on quality control for more details). Since it is a bit complex, you typically will want to use the YAML anchor / merge syntax to define a default that you then merge into specific plates. The default can be defined like this:

```yaml
default_process_plate_qc_thresholds: &default_process_plate_qc_thresholds
  avg_barcode_counts_per_well: 500
  min_neut_standard_frac_per_well: 0.005
  no_serum_per_viral_barcode_filters:
    min_frac: 0.0005
    max_fold_change: 3
    max_wells: 2
  per_neut_standard_barcode_filters:
    min_frac: 0.005
    max_fold_change: 3
    max_wells: 2
  min_neut_standard_count_per_well: 1000
  min_no_serum_count_per_viral_barcode_well: 30
  max_frac_infectivity_per_viral_barcode_well: 5
  min_dilutions_per_barcode_serum_replicate: 6
```

Then for specific plates you can merge this default in and overwrite any specific keys if needed. For instance, the below would merge the above defaults but then overwrite `avg_barcode_counts_per_well` to a different value:

```yaml
plates:
  plate1:
    <other keys>
    qc_thresholds:
      <<: *default_process_plate_qc_thresholds  # merge in defaults
      avg_barcode_counts_per_well: 1000  # overwrite default for this key for this plate
  <other plates>
```
The QC thresholds defined here are applied in order to drop data (wells, barcodes, etc) when processing the plates. Specifically:

- `avg_barcode_counts_per_well`: drop any well that does not have at least this many average counts per barcode.
- `min_neut_standard_frac_per_well`: drop any well where the neutralization standard does not account for at least this fraction of the counts in the well.
- `no_serum_per_viral_barcode_filters`: has subkeys `min_frac`, `max_fold_change`, and `max_wells`. The QC analyzes the fraction of all viral-barcode counts in the no-serum samples (wells) that are attributable to each viral barcode, and checks that this fraction is at least `min_frac` and is not more than `max_fold_change` different from the median fraction for this viral barcode across no-serum samples. If a given viral barcode fails either of these filters in at least `max_wells` wells, it is dropped entirely from that plate.
- `per_neut_standard_barcode_filters`: has subkeys `min_frac`, `max_fold_change`, and `max_wells`. The QC analyzes the fraction of all neutralization-standard barcode counts in all samples (wells) that are attributable to each neutralization-standard barcode, and checks that this fraction is at least `min_frac` and is not more than `max_fold_change` different from the median fraction for this barcode across all wells. If a given neutralization-standard barcode fails either of these filters in at least `max_wells` wells, it is dropped entirely from that plate.
- `min_neut_standard_count_per_well`: drop any well where the total counts for neutralization-standard barcodes is not at least this large.
- `min_no_serum_count_per_viral_barcode_well`: drop any viral-barcode / no-serum well combination where the viral barcode does not have at least this many counts.
- `max_frac_infectivity_per_viral_barcode_well`: drop any viral-barcode / well combination where the viral barcode in that well has a computed fraction infectivity exceeding this value.
- `min_dilutions_per_barcode_serum_replicate`: drop any viral-barcode / serum-replicate combination where the serum-replicate does not have at least this many dilutions for the viral barcode.
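As an illustration of how the first of these thresholds operates, here is a simplified sketch in pandas (made-up counts, not the pipeline's actual implementation):

```python
import io

import pandas as pd

avg_barcode_counts_per_well = 500  # threshold from the YAML above

# made-up per-barcode counts for two wells of a plate
counts = pd.read_csv(io.StringIO(
    "well,barcode,count\n"
    "A1,ACGGAATCCCCTGAGA,900\n"
    "A1,GCATGGATCCTTTACT,700\n"
    "A2,ACGGAATCCCCTGAGA,30\n"
    "A2,GCATGGATCCTTTACT,10\n"
))

# average counts per barcode in each well; drop wells below threshold
avg_per_well = counts.groupby("well")["count"].mean()
wells_to_drop = avg_per_well[avg_per_well < avg_barcode_counts_per_well].index
counts = counts[~counts["well"].isin(wells_to_drop)]
# well A1 (average 800) is kept; well A2 (average 20) is dropped
```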
This key defines some parameters specifying how the neutralization curves are fit, which is done using the Hill curves defined in the neutcurve package. You typically want to use the YAML anchor/merge syntax to define a default that you then merge into specific plates. The default can be defined like this:

```yaml
default_process_plate_curvefit_params: &default_process_plate_curvefit_params
  frac_infectivity_ceiling: 1
  fixtop: [0.75, 1.0]
  fixbottom: 0
  fixslope: [0.8, 10]
```

The specific meanings of these curve-fitting parameters are as follows:

- `frac_infectivity_ceiling`: ceiling to apply to all fraction infectivities before fitting curves. You may want to set this to one to put a ceiling on all values >1; in principle, no values should be >1 in the absence of experimental noise.
- `fixtop`: how to set the top plateau of the neutralization curve. You can set it to:
  - A list of length two specifying a reasonable range, such as `[0.75, 1.0]`, in which case the top is optimized within that range. This is typically the recommended setting.
  - A fixed value, typically 1, to fix the top to a specific number. This is recommended if you want to force all curves to have a top plateau of one.
  - The value `false` if you want it to be a totally free parameter. This is not recommended, as you can sometimes get spurious fits with a very large top when the data don't fully capture neutralization.
- `fixbottom`: how to set the bottom plateau of the neutralization curve. Like `fixtop`, it can be a length-two list, a fixed value, or `false`. Typically you should set it to 0 unless you have a good reason otherwise.
- `fixslope`: how to set the slope of the neutralization curve. Like `fixtop`, it can be a length-two list, a fixed value, or `false`. If you don't know the "slope" of the neutralization curve, setting it to `false` is a reasonable choice. However, in many cases it is preferable to set a range that encompasses "reasonable" slopes. What counts as "reasonable" depends on the units of the concentration, but when they are serum dilutions a "reasonable" range is often `[0.8, 10]`.
This key defines some parameters for quality control performed after the curve fitting; viral-barcode / serum-replicate combinations that fail this QC are dropped. You typically want to use the YAML anchor/merge syntax to define a default that you then merge into specific plates. The default can be defined like this:

```yaml
default_process_plate_curvefit_qc: &default_process_plate_curvefit_qc
  max_frac_infectivity_at_least: 0
  goodness_of_fit:
    min_R2: 0.75
    max_RMSD: 0.05
  serum_replicates_ignore_curvefit_qc: []
  barcode_serum_replicates_ignore_curvefit_qc: []
```

The specific meanings of these QC parameters are:

- `max_frac_infectivity_at_least`: drop any viral-barcode / serum-replicate combination that does not have a maximum fraction infectivity across all concentrations of at least this value. Typically, if you want to allow curves where the serum neutralizes at all tested concentrations, you should set a value of 0; set a value >0.5 if you want to require all sera to have a midpoint within the dilution range.
- `goodness_of_fit`: drop any viral-barcode / serum-replicate combination where the curve fit does not have a reasonable goodness of fit. A curve is dropped only if it fails both of `min_R2` and `max_RMSD` (passing one is enough). The reason for using both is that when the data have more variation, we can tolerate a higher RMSD if the R2 is still good. There are two keys specified under `goodness_of_fit`:
  - `min_R2`: does the curve fit have a coefficient of determination at least this large (a coefficient of determination of 1 is a perfect fit)? Used to drop very poorly fitting curves. Reasonable values might be in the 0.6 to 0.8 range, although you should also just look at the curves being dropped to see if they look good.
  - `max_RMSD`: does the curve fit have a root-mean-square deviation (square root of the mean squared residuals) no larger than this? Reasonable values might be in the 0.05 to 0.1 range.
- `serum_replicates_ignore_curvefit_qc`: list of any serum-replicates for which we ignore the curve-fitting QC for all viral barcodes.
- `barcode_serum_replicates_ignore_curvefit_qc`: list (as `[barcode, serum_replicate]`) of viral-barcode / serum-replicate combinations where we ignore the curve-fitting QC.
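To make the "fails both" logic concrete, here is a hedged sketch of how R2 and RMSD relate for a fitted curve. This is plain Python with made-up observed/fitted values; the pipeline computes these quantities internally with its own code:

```python
def r2_and_rmsd(observed, predicted):
    """Coefficient of determination and root-mean-square deviation."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    rmsd = (ss_res / n) ** 0.5
    return r2, rmsd

def passes_goodness_of_fit(r2, rmsd, min_R2=0.75, max_RMSD=0.05):
    # a curve is dropped only if it fails BOTH criteria
    return (r2 >= min_R2) or (rmsd <= max_RMSD)

observed = [1.0, 0.9, 0.55, 0.2, 0.05]     # made-up fraction infectivities
predicted = [0.98, 0.88, 0.5, 0.22, 0.04]  # made-up fitted values

r2, rmsd = r2_and_rmsd(observed, predicted)
assert passes_goodness_of_fit(r2, rmsd)

# a curve failing both criteria would be dropped
assert not passes_goodness_of_fit(0.5, 0.2)
```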
This key defines parameters for the Illumina barcode parser that override anything set in the global `illumina_barcode_parser_params` above. It is optional; if not defined, just the global params are used. If it is defined, it is used to update the global params (adding new params and overriding any shared ones). The main anticipated use case is if you add plate-specific indices in the round-1 PCR and want to specify those indices here using `upstream2` and `upstream2_mismatch`.
Specifies how we compute the final titers in `serum_titers`. Can be either `nt50` (report the reciprocal of the dilution at which the fraction infectivity reaches 50%) or `midpoint` (report the reciprocal of the curve midpoint), so it should be either

```yaml
default_serum_titer_as: midpoint
```

or

```yaml
default_serum_titer_as: nt50
```

The difference only becomes relevant if some of your curves have plateaus substantially different from zero and one. If you want to handle specific sera differently, see `sera_override_defaults`.
Default QC we apply to each serum-virus pair when reporting the final titers (medians across replicates) in `serum_titers`. Any serum-virus pair that fails this QC does not have a titer reported unless it is specified in `sera_override_defaults`. Should look like this:

```yaml
default_serum_qc_thresholds: &default_serum_qc_thresholds
  min_replicates: 2
  max_fold_change_from_median: 3
  viruses_ignore_qc: []
```

where:

- `min_replicates`: drop any virus-serum titer that is not supported by at least this many replicates.
- `max_fold_change_from_median`: drop any virus-serum titer where any replicate differs by more than this fold-change from the median across replicates.
- `viruses_ignore_qc`: list of viruses for which you want to ignore the above QC. Specifying a virus here ignores the QC for all sera; if you want to make a virus-serum-specific exclusion, then instead specify it in `sera_override_defaults`.
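A simplified sketch of how these two thresholds act on per-replicate titers (made-up numbers; plain Python, not the pipeline's implementation):

```python
import statistics

min_replicates = 2
max_fold_change_from_median = 3

def titer_passes_qc(replicate_titers):
    """Median titer if the serum-virus pair passes QC, else None."""
    if len(replicate_titers) < min_replicates:
        return None  # too few replicates
    median = statistics.median(replicate_titers)
    for titer in replicate_titers:
        # fold-change of this replicate from the median, in either direction
        fold_change = max(titer / median, median / titer)
        if fold_change > max_fold_change_from_median:
            return None  # a replicate is too discordant
    return median

assert titer_passes_qc([400, 500, 600]) == 500   # consistent replicates
assert titer_passes_qc([100, 500, 600]) is None  # 100 is 5-fold below median
assert titer_passes_qc([500]) is None            # too few replicates
```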
Override `default_serum_titer_as` or `default_serum_qc_thresholds` for specific sera in each group (recall that groups are assigned per-plate). For instance, this could look like:

```yaml
sera_override_defaults:
  serum:
    M099d30:
      qc_thresholds:
        <<: *default_serum_qc_thresholds
        viruses_ignore_qc:
          - A/Belgium/H0017/2022
    Y044d30:
      qc_thresholds:
        <<: *default_serum_qc_thresholds
        max_fold_change_from_median: 4
      titer_as: nt50
```

The above means that in the group called `serum`, for serum `M099d30` we override the `default_serum_qc_thresholds` to exclude virus A/Belgium/H0017/2022, and for serum `Y044d30` we override the defaults to allow a greater fold-change from the median for individual replicates and to compute the titer as `nt50`. Anything not listed here gets handled by the defaults in `default_serum_titer_as` and `default_serum_qc_thresholds`.
This is an optional key that can be used to specify plates for which you just want to count barcodes, and then analyze those counts outside the main pipeline. This might be useful for library pooling or QC, for instance, or if you want to look at some failed plates that you don't actually want to fit curves for. If you do not want to specify any miscellaneous plates, either leave this key out or set it to an empty dictionary (`{}`). The key should look like this:
```yaml
miscellaneous_plates:

  <plate_name_1>:
    date: <date>
    viral_library: <viral library>
    neut_standard_set: <standard set>
    samples_csv: <filename>
    illumina_barcode_parser_params:  # optional key
      <parser params to override global>

  <plate_name_2>:
    ...
```
The plate name is just the name assigned to the plate. The `date`, `viral_library`, `neut_standard_set`, and `illumina_barcode_parser_params` keys have the same meaning as for the plates specified under `plates`. The `samples_csv` should specify the samples to analyze in a CSV that has columns named "well" and "fastq", and optionally other columns as well.
The output is that for each plate, the following files are created:

- `results/miscellaneous_plates/<plate_name>/<well>_counts.csv`: counts of each viral barcode in that well of that plate.
- `results/miscellaneous_plates/<plate_name>/<well>_invalid.csv`: counts of each invalid barcode in that well of that plate.
- `results/miscellaneous_plates/<plate_name>/<well>_fates.csv`: summarizes the number of reads that are valid, and the various types of invalid reads, for each well of that plate.
The results of running the pipeline are put in the `./results/` subdirectory of your main repo. We recommend using the `.gitignore` file in ./test_example/.gitignore in your main repo to only track key results in your GitHub repo. The key results, if the pipeline runs to completion, are in `./results/aggregated_titers/titers_{group}.csv` for each group of sera. The full set of created outputs is as follows (note that only some will be tracked, depending on your `.gitignore`):
- Outputs related to barcode counting:
  - `./results/barcode_counts/`: files giving the barcode counts for each sample. You should track this in the repo.
  - `./results/barcode_fates/`: files giving the statistics (fates) of reads in the barcode counting for each sample. You do not need to track this in the repo as the results are plotted.
  - `./results/barcode_invalid/`: files giving counts of invalid barcodes for each sample. You do not need to track this in the repo, but it could be helpful to look at these identities and counts if QC shows you are getting many invalid barcodes.
- Outputs related to processing each plate:
  - `./results/plates/{plate}/frac_infectivity.csv`: fraction infectivity for viral barcodes for a plate. You should track this in the repo.
  - `./results/plates/{plate}/process_{plate}.ipynb`: Jupyter notebook processing counts for a plate. You do not need to track this as an HTML version will be rendered in `./docs/` when the pipeline runs successfully.
  - `./results/plates/{plate}/process_{plate}.html`: HTML version of the Jupyter notebook processing counts for a plate. You do not need to track this as it will be rendered in `./docs/` when the pipeline runs successfully.
  - `./results/plates/{plate}/qc_drops.yml`: details on data (barcodes, wells, etc) dropped for failing QC when processing this plate.
  - `./results/plates/{plate}/curvefits.csv`: the neutralization curve fits for each serum on each plate. You should track this in the repo.
  - `./results/plates/{plate}/curvefits.pickle`: pickle file with the `neutcurve.CurveFits` object for the plate. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere.
- Outputs related to per-serum titers (aggregated across replicates potentially run on different plates); note that sera are organized per-group as specified in the plates:
  - `./results/sera/groups_sera_by_plate.csv`: summarizes which plate(s) each group/serum was run on.
  - `./results/sera/{group}_{serum}/titers.csv`: titer for each virus against the group/serum, reported as the median across replicates, and only keeping those that pass QC. You should track this file in the repo.
  - `./results/sera/{group}_{serum}/titers_per_replicate.csv`: titers for each replicate of each virus against the group/serum. You should track this file in the repo.
  - `./results/sera/{group}_{serum}/curves.pdf`: PDF rendering of the neutralization curves for the group/serum. You do not need to track this in the repo as an HTML version of a notebook containing the plots is tracked in `./docs/`.
  - `./results/sera/{group}_{serum}/curvefits.pickle`: pickle file with the `neutcurve.CurveFits` object for this group/serum, after applying QC filters. You do not need to track this in the repo as both the plots and numerical data are rendered elsewhere.
  - `./results/sera/{group}_{serum}/{group}_{serum}_titers.ipynb`: Jupyter notebook that aggregates titers for a group/serum across all plates. You do not need to track this in the repo as an HTML version of the notebook is tracked in `./docs/`.
  - `./results/sera/{group}_{serum}/{group}_{serum}_titers.html`: HTML rendering of the Jupyter notebook that aggregates titers for a group/serum across all plates. You do not need to track this in the repo as it will be rendered in `./docs/` when the pipeline runs successfully.
  - `./results/sera/{group}_{serum}/qc_drops.yml`: virus-group/serum titers dropped due to QC when processing this serum's titers.
- Results related to aggregated titers across all sera in a group after applying all quality control:
  - `./results/aggregated_titers/titers_{group}.csv`: titers for all sera / viruses in a group (median of replicates). You should track this file as it has the final processed results.
  - `./results/aggregated_titers/curvefits_{group}.pickle`: pickle file with the `neutcurve.CurveFits` object holding all final curves for a group. You do not need to track this in the repo, but if you have further code that makes specific plots you may want to use this.
  - `./results/aggregated_titers/titers.html`: interactive plot of titers for all sera. You do not need to track this in the repo as it is rendered in `./docs/` when the pipeline runs successfully.
  - `./results/aggregated_titers/aggregate_titers.ipynb`: Jupyter notebook that aggregates all the titers. You do not need to track this in the repo.
- Results summarizing data dropped due to QC:
  - `./results/plate_qc_drops.yml`: YAML file summarizing all data (barcodes, wells, etc) dropped during the plate-processing QC. You should track this in the repo.
  - `./results/groups_sera_qc_drops.yml`: YAML file summarizing all group/serum-virus titers dropped during the serum-titers QC. You should track this in the repo.
  - `./results/aggregate_qc_drops.ipynb`: Jupyter notebook summarizing the QC drops. You do not need to track this as an HTML version is rendered in `./docs/`.
  - `./results/aggregate_qc_drops.html`: HTML version of the Jupyter notebook summarizing the QC drops. You do not need to track this as it is rendered in `./docs/`.
When you run the pipeline, the QC values in the configuration will be automatically applied, and HTML notebooks summarizing the processing of each plate and serum are rendered in `./docs`, alongside a summary of all QC across all plates / sera. YAML summaries of the QC are also created. While the QC is designed to make reasonable default choices, you should always carefully look through these notebooks after adding new data, and potentially adjust the QC in the configuration and re-run.

If the pipeline runs to completion, it will create HTML documentation with plots of the overall titers, per-serum titer analyses, per-plate analyses, and an overall QC summary in a docs subdirectory, which will typically be named `./docs/` (if you use the suggested key in the configuration YAML). This HTML documentation can be rendered via GitHub Pages from the `./docs/` directory.
Looking at this documentation is a good way to QC the data and understand the results.
The documentation for the test example for this pipeline is at https://jbloomlab.github.io/seqneut-pipeline/.
If you want to add additional HTML files to the docs, specify a dict in the top-level `Snakefile` with the name `add_htmls_to_docs`, like this:

```
add_htmls_to_docs = {
    "Additional files": {
        "Example HTML file": "results/extra_htmls/example_html.html",
        <other keys specifying file names and their paths>
    },
    <other nested dicts with a heading and then name: file key-value pairs>
}
```
The ./test_example subdirectory contains a small test example that illustrates use of the pipeline.
The code is tested by running this example, as well as formatted with black and snakefmt and linted with ruff and snakemake --lint via the GitHub Action specified in .github/workflows/test.yaml.