Skip to content

Commit

Permalink
enable plate-level indices as upstream2 in barcode parsing (#44)
Browse files Browse the repository at this point in the history
Configured to enable plate-level indices to be embedded in the round-1 PCR primers (see [this issue](#40)). Essentially, this amounts to allowing a per-plate flanking sequence to be specified for each plate, and only FASTQ reads with that flanking sequence are read for that plate. Typically this index would be specified as `upstream2` in the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html). To enable this change, altered the configuration from the previous setup of just having a single global `illumina_barcode_parser_params` applied to all plates. Now such a global parser is still specified that has default values that you want to apply to all plates. But in addition, in the per-plate configuration you can specify `illumina_barcode_parser_params` that are added to (and override) anything in the global parser params, and can contain plate specific `upstream2` and other relevant setting (eg, `upstream2_mismatch`). The test example was modified to use this option for plate2 and plate11.
  • Loading branch information
jbloom authored Apr 12, 2024
1 parent 3c762ad commit e692286
Show file tree
Hide file tree
Showing 130 changed files with 515 additions and 483 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# CHANGELOG

### version 3.1.0
- Configured to enable plate-level indices to be embedded in the round-1 PCR primers (see [this issue](https://github.com/jbloomlab/seqneut-pipeline/issues/40)). Essentially, this amounts to allowing a per-plate flanking sequence to be specified for each plate, and only FASTQ reads with that flanking sequence are read for that plate. Typically this index would be specified as `upstream2` in the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html). To enable this change, altered the configuration from the previous setup of just having a single global `illumina_barcode_parser_params` applied to all plates. Now such a global parser is still specified that has default values that you want to apply to all plates. But in addition, in the per-plate configuration you can specify `illumina_barcode_parser_params` that are added to (and override) anything in the global parser params, and can contain plate specific `upstream2` and other relevant setting (eg, `upstream2_mismatch`). The test example was modified to use this option for plate2 and plate11.

- Update software versions:
- `dms_variants` to 1.6.0
- `neutcurve` to 2.1.0
Expand Down
13 changes: 12 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -185,6 +185,7 @@ CATACAGAGTTTGTTG

### illumina_barcode_parser_params
A dictionary (mapping) specifying how to parse the Illumina FASTQ files to barcode counts.
This is a global dictionary that is applied to all plates, but can be augmented or overriden on a per-plate basis by specifying plate specific `illumina_barcode_parser_params` as described in the plate configuration below.
This mapping should just specify key-word arguments that can be passed to [dms_variants.illuminabarcodeparser.IlluminaBarcodeParser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html#dms_variants.illuminabarcodeparser.IlluminaBarcodeParser); note that the barcode-length (`bclen`) parameter should not be specified as it is inferred from the length of the barcodes.

So in general, this key will look like this:
Expand Down Expand Up @@ -220,6 +221,8 @@ plates:
<<: *default_process_plate_curvefit_params
curvefit_qc:
<<: *default_process_plate_curvefit_qc
illumina_barcode_parser_params: # optional argument
upstream2: upstream2: GCTACA
<additional_plates>
```
Expand Down Expand Up @@ -390,6 +393,12 @@ The specific meanings of these QC parameters are:

- `barcode_serum_replicates_ignore_curvefit_qc`: list (as `[barcode, serum_replicate]`) of viral-barcodes / serum-replicates where we ignore the curve-fitting QC.

#### illumina_barcode_parser_params
This key defines parameters for the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html) that override anything set in the global `illumina_barcode_parser_params` above.
It is optional, and if not defined just the global params are used.
If this is defined, it is used to update the global params (adding new params and overriding any shared ones).
The main anticipated use case is if you add plate-specific indices in the round 1 PCR and want to specify those indices here using `upstream2` and `upstream2_mismatch`.

### default_serum_titer_as
Specifies how we compute the final titers in `serum_titers`.
Can be either `midpoint` or `nt50` depending on whether you want to report the value where the fraction infectivity gets to 50%, or the midpoint of the curve, so should be either
Expand Down Expand Up @@ -458,13 +467,15 @@ miscellaneous_plates:
viral_library: <viral library>
neut_standard_set: <standard set>
samples_csv: <filename>
illumina_barcode_parser_params: # optional key
<parser params to override global>
<plate_name_2>:
...
```

The plate name is just the name assigned to the plate.
The `date`, `viral_library`, and `neut_standard_set` keys have the same meaning as for the plates specified under `plates`.
The `date`, `viral_library`, `neut_standard_set`, and `illumina_barcode_parser_params` keys have the same meaning as for the plates specified under `plates`.

The `samples_csv` should specify the samples to analyze in a CSV that has columns named "well" and "fastq", and optionally other columns as well.

Expand Down
52 changes: 26 additions & 26 deletions docs/aggregate_qc_drops.html

Large diffs are not rendered by default.

28 changes: 14 additions & 14 deletions docs/pilot_A23038d0_r32_75K_titers.html

Large diffs are not rendered by default.

161 changes: 81 additions & 80 deletions docs/process_H3N2_plate.html

Large diffs are not rendered by default.

175 changes: 88 additions & 87 deletions docs/process_plate11.html

Large diffs are not rendered by default.

175 changes: 88 additions & 87 deletions docs/process_plate2.html

Large diffs are not rendered by default.

28 changes: 14 additions & 14 deletions docs/serum_M099d0_titers.html

Large diffs are not rendered by default.

28 changes: 14 additions & 14 deletions docs/serum_M099d30_titers.html

Large diffs are not rendered by default.

28 changes: 14 additions & 14 deletions docs/serum_Y044d30_titers.html

Large diffs are not rendered by default.

28 changes: 14 additions & 14 deletions docs/serum_Y154d182_titers.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions envs/count_barcodes.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,6 @@ channels:
dependencies:
- pandas=2.2
- pip
- python=3.11
- python=3.12
- pip:
- dms_variants==1.5.0
- dms_variants==1.6.0
19 changes: 19 additions & 0 deletions funcs.smk
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,16 @@ def process_miscellaneous_plates(misc_plates_d):
f"{plate_dict['samples_csv']} has non-unique entries in 'well' column"
)
misc_plates[plate]["wells"] = samples.set_index("well")["fastq"].to_dict()

# get default barcode parser params, update if specified per plate
misc_plates[plate]["illumina_barcode_parser_params"] = copy.deepcopy(
config["illumina_barcode_parser_params"]
)
if "illumina_barcode_parser_params" in plate_dict:
misc_plates[plate]["illumina_barcode_parser_params"].update(
plate_dict["illumina_barcode_parser_params"]
)

return misc_plates


Expand Down Expand Up @@ -63,6 +73,15 @@ def process_plate(plate, plate_params):
if not re.fullmatch(r"\d{4}\-\d{2}\-\d{2}", str(plate_d["date"])):
raise ValueError(f"{plate =} {plate_d['date'] =} not in YYYY-MM-DD format")

# get default barcode parser params, update if specified per plate
plate_d["illumina_barcode_parser_params"] = copy.deepcopy(
config["illumina_barcode_parser_params"]
)
if "illumina_barcode_parser_params" in plate_params:
plate_d["illumina_barcode_parser_params"].update(
plate_params["illumina_barcode_parser_params"]
)

# Process samples_csv to create the sample data frame
req_sample_cols = ["well", "serum", "dilution_factor", "replicate", "fastq"]
samples_df = pd.read_csv(plate_params["samples_csv"], comment="#")
Expand Down
3 changes: 2 additions & 1 deletion notebooks/process_plate.py.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,7 @@
" - *valid barcode*: barcode that matches a known virus or neutralization standard, we hope most reads are this.\n",
" - *invalid barcode*: a barcode with proper flanking sequences, but does not match a known virus or neutralization standard. If you have a lot of reads of this type, it is probably a good idea to look at the invalid barcode CSVs (in the `./results/barcode_invalid/` subdirectory created by the pipeline) to see what these invalid barcodes are.\n",
" - *unparseable barcode*: could not parse a barcode from this read as there was not a sequence of the correct length with the appropriate flanking sequence.\n",
" - *invalid outer flank*: if using an outer upstream or downstream region (`upstream2` or `downstream2` for the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html)), reads that are otherwise valid except for this outer flank. Typically you would be using `upstream2` if you have a plate index embedded in your primer, and reads with this classification correspond to a different index than the one for this plate.\n",
" - *low quality barcode*: low-quality or `N` nucleotides in barcode, could indicate problem with sequencing.\n",
" - *failed chastity filter*: reads that failed the Illumina chastity filter, if these are reported in the FASTQ (they may not be).\n",
"\n",
Expand Down Expand Up @@ -1500,7 +1501,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
"version": "3.12.2"
}
},
"nbformat": 4,
Expand Down
2 changes: 1 addition & 1 deletion scripts/count_barcodes.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
**snakemake.params.illumina_barcode_parser_params,
)

counts, fates = parser.parse(snakemake.input.fastq)
counts, fates = parser.parse(snakemake.input.fastq, outer_flank_fates=True)

# write valide barcode counts including a 0 count for missing ones
(
Expand Down
10 changes: 7 additions & 3 deletions seqneut-pipeline.smk
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ samples = pd.concat(
[plate_d["samples"] for plate_d in plates.values()],
ignore_index=True,
)
assert samples["sample"].nunique() == samples["fastq"].nunique() == len(samples)
assert samples["sample"].nunique() == len(samples)
samples = samples.set_index("sample").to_dict(orient="index")

if "miscellaneous_plates" in config:
Expand Down Expand Up @@ -89,7 +89,9 @@ rule count_barcodes:
invalid="results/barcode_invalid/{sample}.csv",
fates="results/barcode_fates/{sample}.csv",
params:
illumina_barcode_parser_params=config["illumina_barcode_parser_params"],
illumina_barcode_parser_params=lambda wc: plates[samples[wc.sample]["plate"]][
"illumina_barcode_parser_params"
],
conda:
"envs/count_barcodes.yml"
log:
Expand Down Expand Up @@ -309,7 +311,9 @@ rule miscellaneous_plate_count_barcodes:
invalid="results/miscellaneous_plates/{misc_plate}/{well}_invalid.csv",
fates="results/miscellaneous_plates/{misc_plate}/{well}_fates.csv",
params:
illumina_barcode_parser_params=config["illumina_barcode_parser_params"],
illumina_barcode_parser_params=lambda wc: miscellaneous_plates[wc.misc_plate][
"illumina_barcode_parser_params"
],
conda:
"envs/count_barcodes.yml"
log:
Expand Down
9 changes: 9 additions & 0 deletions test_example/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,9 @@ plates:
<<: *default_process_plate_curvefit_params
curvefit_qc:
<<: *default_process_plate_curvefit_qc
illumina_barcode_parser_params:
upstream2: GCTACA
upstream2_mismatch: 1

plate11:
group: serum
Expand All @@ -101,6 +104,9 @@ plates:
<<: *default_process_plate_curvefit_qc
barcode_serum_replicates_ignore_curvefit_qc:
- [AGGTCAAGACCACAGG, M099d0]
illumina_barcode_parser_params:
upstream2: ATCGAT
upstream2_mismatch: 1

H3N2_plate:
group: pilot
Expand Down Expand Up @@ -142,3 +148,6 @@ miscellaneous_plates:
viral_library: pdmH1N1_lib2023_loes
neut_standard_set: loes2023
samples_csv: data/miscellaneous_plates/random_plate_1.csv
illumina_barcode_parser_params:
upstream2: GCTACA
upstream2_mismatch: 1
72 changes: 36 additions & 36 deletions test_example/data/miscellaneous_plates/random_plate_1.csv
Original file line number Diff line number Diff line change
@@ -1,37 +1,37 @@
well,fastq
B1,fastqs/Plate2_Noserum2_S98_R1_001.fastq.gz
B2,fastqs/M099_d30_conc1_S106_R1_001.fastq.gz
B3,fastqs/M099_d30_conc2_S114_R1_001.fastq.gz
B4,fastqs/M099_d30_conc3_S122_R1_001.fastq.gz
B5,fastqs/M099_d30_conc4_S130_R1_001.fastq.gz
B6,fastqs/M099_d30_conc5_S138_R1_001.fastq.gz
B7,fastqs/M099_d30_conc6_S146_R1_001.fastq.gz
B8,fastqs/M099_d30_conc7_S154_R1_001.fastq.gz
B9,fastqs/M099_d30_conc8_S162_R1_001.fastq.gz
B10,fastqs/M099_d30_conc9_S170_R1_001.fastq.gz
B11,fastqs/M099_d30_conc10_S178_R1_001.fastq.gz
B12,fastqs/Plate2_Noserum10_S186_R1_001.fastq.gz
C1,fastqs/Plate2_Noserum3_S99_R1_001.fastq.gz
C2,fastqs/M099_d0_conc1_S107_R1_001.fastq.gz
C3,fastqs/M099_d0_conc2_S115_R1_001.fastq.gz
C4,fastqs/M099_d0_conc3_S123_R1_001.fastq.gz
C5,fastqs/M099_d0_conc4_S131_R1_001.fastq.gz
C6,fastqs/M099_d0_conc5_S139_R1_001.fastq.gz
C7,fastqs/M099_d0_conc6_S147_R1_001.fastq.gz
C8,fastqs/M099_d0_conc7_S155_R1_001.fastq.gz
C9,fastqs/M099_d0_conc8_S163_R1_001.fastq.gz
C10,fastqs/M099_d0_conc9_S171_R1_001.fastq.gz
C11,fastqs/M099_d0_conc10_S179_R1_001.fastq.gz
C12,fastqs/Plate2_Noserum11_S187_R1_001.fastq.gz
D1,fastqs/Plate2_Noserum4_S100_R1_001.fastq.gz
D2,fastqs/Y154_d182_conc1_S108_R1_001.fastq.gz
D3,fastqs/Y154_d182_conc2_S116_R1_001.fastq.gz
D4,fastqs/Y154_d182_conc3_S124_R1_001.fastq.gz
D5,fastqs/Y154_d182_conc4_S132_R1_001.fastq.gz
D6,fastqs/Y154_d182_conc5_S140_R1_001.fastq.gz
D7,fastqs/Y154_d182_conc6_S148_R1_001.fastq.gz
D8,fastqs/Y154_d182_conc7_S156_R1_001.fastq.gz
D9,fastqs/Y154_d182_conc8_S164_R1_001.fastq.gz
D10,fastqs/Y154_d182_conc9_S172_R1_001.fastq.gz
D11,fastqs/Y154_d182_conc10_S180_R1_001.fastq.gz
D12,fastqs/Plate2_Noserum12_S188_R1_001.fastq.gz
B1,fastqs/well_B1.fastq.gz
B2,fastqs/well_B2.fastq.gz
B3,fastqs/well_B3.fastq.gz
B4,fastqs/well_B4.fastq.gz
B5,fastqs/well_B5.fastq.gz
B6,fastqs/well_B6.fastq.gz
B7,fastqs/well_B7.fastq.gz
B8,fastqs/well_B8.fastq.gz
B9,fastqs/well_B9.fastq.gz
B10,fastqs/well_B10.fastq.gz
B11,fastqs/well_B11.fastq.gz
B12,fastqs/well_B12.fastq.gz
C1,fastqs/well_C1.fastq.gz
C2,fastqs/well_C2.fastq.gz
C3,fastqs/well_C3.fastq.gz
C4,fastqs/well_C4.fastq.gz
C5,fastqs/well_C5.fastq.gz
C6,fastqs/well_C6.fastq.gz
C7,fastqs/well_C7.fastq.gz
C8,fastqs/well_C8.fastq.gz
C9,fastqs/well_C9.fastq.gz
C10,fastqs/well_C10.fastq.gz
C11,fastqs/well_C11.fastq.gz
C12,fastqs/well_C12.fastq.gz
D1,fastqs/well_D1.fastq.gz
D2,fastqs/well_D2.fastq.gz
D3,fastqs/well_D3.fastq.gz
D4,fastqs/well_D4.fastq.gz
D5,fastqs/well_D5.fastq.gz
D6,fastqs/well_D6.fastq.gz
D7,fastqs/well_D7.fastq.gz
D8,fastqs/well_D8.fastq.gz
D9,fastqs/well_D9.fastq.gz
D10,fastqs/well_D10.fastq.gz
D11,fastqs/well_D11.fastq.gz
D12,fastqs/well_D12.fastq.gz
17 changes: 0 additions & 17 deletions test_example/data/plates/H3N2_sample_plate.csv

This file was deleted.

72 changes: 36 additions & 36 deletions test_example/data/plates/plate11_samples.csv
Original file line number Diff line number Diff line change
@@ -1,37 +1,37 @@
well,serum,dilution_factor,replicate,fastq
B1,none,,10,fastqs/Plate11_Noserum2_S2_R1_001.fastq.gz
B2,M099d30,20.0,2,fastqs/M099_d30_conc1_S10_R1_001.fastq.gz
B3,M099d30,60.0,2,fastqs/M099_d30_conc2_S18_R1_001.fastq.gz
B4,M099d30,180.0,2,fastqs/M099_d30_conc3_S26_R1_001.fastq.gz
B5,M099d30,540.0,2,fastqs/M099_d30_conc4_S34_R1_001.fastq.gz
B6,M099d30,1620.0,2,fastqs/M099_d30_conc5_S42_R1_001.fastq.gz
B7,M099d30,4860.0,2,fastqs/M099_d30_conc6_S50_R1_001.fastq.gz
B8,M099d30,14580.0,2,fastqs/M099_d30_conc7_S58_R1_001.fastq.gz
B9,M099d30,43740.0,2,fastqs/M099_d30_conc8_S66_R1_001.fastq.gz
B10,M099d30,131220.0,2,fastqs/M099_d30_conc9_S74_R1_001.fastq.gz
B11,M099d30,393660.0,2,fastqs/M099_d30_conc10_S82_R1_001.fastq.gz
B12,none,,2,fastqs/Plate11_Noserum10_S90_R1_001.fastq.gz
C1,none,,11,fastqs/Plate11_Noserum3_S3_R1_001.fastq.gz
C2,M099d0,20.0,2,fastqs/M099_d0_conc1_S11_R1_001.fastq.gz
C3,M099d0,60.0,2,fastqs/M099_d0_conc2_S19_R1_001.fastq.gz
C4,M099d0,180.0,2,fastqs/M099_d0_conc3_S27_R1_001.fastq.gz
C5,M099d0,540.0,2,fastqs/M099_d0_conc4_S35_R1_001.fastq.gz
C6,M099d0,1620.0,2,fastqs/M099_d0_conc5_S43_R1_001.fastq.gz
C7,M099d0,4860.0,2,fastqs/M099_d0_conc6_S51_R1_001.fastq.gz
C8,M099d0,14580.0,2,fastqs/M099_d0_conc7_S59_R1_001.fastq.gz
C9,M099d0,43740.0,2,fastqs/M099_d0_conc8_S67_R1_001.fastq.gz
C10,M099d0,131220.0,2,fastqs/M099_d0_conc9_S75_R1_001.fastq.gz
C11,M099d0,393660.0,2,fastqs/M099_d0_conc10_S83_R1_001.fastq.gz
C12,none,,3,fastqs/Plate11_Noserum11_S91_R1_001.fastq.gz
D1,none,,12,fastqs/Plate11_Noserum4_S4_R1_001.fastq.gz
D2,Y044d30,20.0,2,fastqs/Y044_d30_conc1_S12_R1_001.fastq.gz
D3,Y044d30,60.0,2,fastqs/Y044_d30_conc2_S20_R1_001.fastq.gz
D4,Y044d30,180.0,2,fastqs/Y044_d30_conc3_S28_R1_001.fastq.gz
D5,Y044d30,540.0,2,fastqs/Y044_d30_conc4_S36_R1_001.fastq.gz
D6,Y044d30,1620.0,2,fastqs/Y044_d30_conc5_S44_R1_001.fastq.gz
D7,Y044d30,4860.0,2,fastqs/Y044_d30_conc6_S52_R1_001.fastq.gz
D8,Y044d30,14580.0,2,fastqs/Y044_d30_conc7_S60_R1_001.fastq.gz
D9,Y044d30,43740.0,2,fastqs/Y044_d30_conc8_S68_R1_001.fastq.gz
D10,Y044d30,131220.0,2,fastqs/Y044_d30_conc9_S76_R1_001.fastq.gz
D11,Y044d30,393660.0,2,fastqs/Y044_d30_conc10_S84_R1_001.fastq.gz
D12,none,,4,fastqs/Plate11_Noserum12_S92_R1_001.fastq.gz
B1,none,,10,fastqs/well_B1.fastq.gz
B2,M099d30,20.0,2,fastqs/well_B2.fastq.gz
B3,M099d30,60.0,2,fastqs/well_B3.fastq.gz
B4,M099d30,180.0,2,fastqs/well_B4.fastq.gz
B5,M099d30,540.0,2,fastqs/well_B5.fastq.gz
B6,M099d30,1620.0,2,fastqs/well_B6.fastq.gz
B7,M099d30,4860.0,2,fastqs/well_B7.fastq.gz
B8,M099d30,14580.0,2,fastqs/well_B8.fastq.gz
B9,M099d30,43740.0,2,fastqs/well_B9.fastq.gz
B10,M099d30,131220.0,2,fastqs/well_B10.fastq.gz
B11,M099d30,393660.0,2,fastqs/well_B11.fastq.gz
B12,none,,2,fastqs/well_B12.fastq.gz
C1,none,,11,fastqs/well_C1.fastq.gz
C2,M099d0,20.0,2,fastqs/well_C2.fastq.gz
C3,M099d0,60.0,2,fastqs/well_C3.fastq.gz
C4,M099d0,180.0,2,fastqs/well_C4.fastq.gz
C5,M099d0,540.0,2,fastqs/well_C5.fastq.gz
C6,M099d0,1620.0,2,fastqs/well_C6.fastq.gz
C7,M099d0,4860.0,2,fastqs/well_C7.fastq.gz
C8,M099d0,14580.0,2,fastqs/well_C8.fastq.gz
C9,M099d0,43740.0,2,fastqs/well_C9.fastq.gz
C10,M099d0,131220.0,2,fastqs/well_C10.fastq.gz
C11,M099d0,393660.0,2,fastqs/well_C11.fastq.gz
C12,none,,3,fastqs/well_C12.fastq.gz
D1,none,,12,fastqs/well_D1.fastq.gz
D2,Y044d30,20.0,2,fastqs/well_D2.fastq.gz
D3,Y044d30,60.0,2,fastqs/well_D3.fastq.gz
D4,Y044d30,180.0,2,fastqs/well_D4.fastq.gz
D5,Y044d30,540.0,2,fastqs/well_D5.fastq.gz
D6,Y044d30,1620.0,2,fastqs/well_D6.fastq.gz
D7,Y044d30,4860.0,2,fastqs/well_D7.fastq.gz
D8,Y044d30,14580.0,2,fastqs/well_D8.fastq.gz
D9,Y044d30,43740.0,2,fastqs/well_D9.fastq.gz
D10,Y044d30,131220.0,2,fastqs/well_D10.fastq.gz
D11,Y044d30,393660.0,2,fastqs/well_D11.fastq.gz
D12,none,,4,fastqs/well_D12.fastq.gz
Loading

0 comments on commit e692286

Please sign in to comment.