enable plate-level indices as upstream2 in barcode parsing (#44)

Configured to enable plate-level indices to be embedded in the round-1 PCR primers (see [this issue](#40)). Essentially, this amounts to allowing a per-plate flanking sequence to be specified for each plate, and only FASTQ reads with that flanking sequence are read for that plate. Typically this index would be specified as `upstream2` in the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html). To enable this change, altered the configuration from the previous setup of just having a single global `illumina_barcode_parser_params` applied to all plates. Now such a global parser is still specified that has default values that you want to apply to all plates. But in addition, in the per-plate configuration you can specify `illumina_barcode_parser_params` that are added to (and override) anything in the global parser params, and can contain plate specific `upstream2` and other relevant setting (eg, `upstream2_mismatch`). The test example was modified to use this option for plate2 and plate11.
jbloomlab · Apr 12, 2024 · e692286 · e692286
1 parent 3c762ad
commit e692286
Show file tree

Hide file tree

Showing 130 changed files with 515 additions and 483 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,8 @@
 # CHANGELOG
 
 ### version 3.1.0
+- Configured to enable plate-level indices to be embedded in the round-1 PCR primers (see [this issue](https://github.com/jbloomlab/seqneut-pipeline/issues/40)). Essentially, this amounts to allowing a per-plate flanking sequence to be specified for each plate, and only FASTQ reads with that flanking sequence are read for that plate. Typically this index would be specified as `upstream2` in the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html). To enable this change, altered the configuration from the previous setup of just having a single global `illumina_barcode_parser_params` applied to all plates. Now such a global parser is still specified that has default values that you want to apply to all plates. But in addition, in the per-plate configuration you can specify `illumina_barcode_parser_params` that are added to (and override) anything in the global parser params, and can contain plate specific `upstream2` and other relevant setting (eg, `upstream2_mismatch`). The test example was modified to use this option for plate2 and plate11.
+
 - Update software versions:
   - `dms_variants` to 1.6.0
   - `neutcurve` to 2.1.0

diff --git a/README.md b/README.md
@@ -185,6 +185,7 @@ CATACAGAGTTTGTTG
 
 ### illumina_barcode_parser_params
 A dictionary (mapping) specifying how to parse the Illumina FASTQ files to barcode counts.
+This is a global dictionary that is applied to all plates, but can be augmented or overriden on a per-plate basis by specifying plate specific `illumina_barcode_parser_params` as described in the plate configuration below.
 This mapping should just specify key-word arguments that can be passed to [dms_variants.illuminabarcodeparser.IlluminaBarcodeParser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html#dms_variants.illuminabarcodeparser.IlluminaBarcodeParser); note that the barcode-length (`bclen`) parameter should not be specified as it is inferred from the length of the barcodes.
 
 So in general, this key will look like this:
@@ -220,6 +221,8 @@ plates:
       <<: *default_process_plate_curvefit_params
     curvefit_qc:
       <<: *default_process_plate_curvefit_qc
+    illumina_barcode_parser_params:  # optional argument
+        upstream2: upstream2: GCTACA
 
   <additional_plates>
 ```
@@ -390,6 +393,12 @@ The specific meanings of these QC parameters are:
 
  - `barcode_serum_replicates_ignore_curvefit_qc`: list (as `[barcode, serum_replicate]`) of viral-barcodes / serum-replicates where we ignore the curve-fitting QC.
 
+#### illumina_barcode_parser_params
+This key defines parameters for the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html) that override anything set in the global `illumina_barcode_parser_params` above.
+It is optional, and if not defined just the global params are used.
+If this is defined, it is used to update the global params (adding new params and overriding any shared ones).
+The main anticipated use case is if you add plate-specific indices in the round 1 PCR and want to specify those indices here using `upstream2` and `upstream2_mismatch`.
+
 ### default_serum_titer_as
 Specifies how we compute the final titers in `serum_titers`.
 Can be either `midpoint` or `nt50` depending on whether you want to report the value where the fraction infectivity gets to 50%, or the midpoint of the curve, so should be either
@@ -458,13 +467,15 @@ miscellaneous_plates:
     viral_library: <viral library>
     neut_standard_set: <standard set>
     samples_csv: <filename>
+    illumina_barcode_parser_params:  # optional key
+        <parser params to override global>
 
   <plate_name_2>:
     ...
 ```
 
 The plate name is just the name assigned to the plate.
-The `date`, `viral_library`, and `neut_standard_set` keys have the same meaning as for the plates specified under `plates`.
+The `date`, `viral_library`, `neut_standard_set`, and `illumina_barcode_parser_params` keys have the same meaning as for the plates specified under `plates`.
 
 The `samples_csv` should specify the samples to analyze in a CSV that has columns named "well" and "fastq", and optionally other columns as well.
 

diff --git a/docs/aggregate_qc_drops.html b/docs/aggregate_qc_drops.html
diff --git a/docs/pilot_A23038d0_r32_75K_titers.html b/docs/pilot_A23038d0_r32_75K_titers.html
diff --git a/docs/process_H3N2_plate.html b/docs/process_H3N2_plate.html
diff --git a/docs/process_plate11.html b/docs/process_plate11.html
diff --git a/docs/process_plate2.html b/docs/process_plate2.html
diff --git a/docs/serum_M099d0_titers.html b/docs/serum_M099d0_titers.html
diff --git a/docs/serum_M099d30_titers.html b/docs/serum_M099d30_titers.html
diff --git a/docs/serum_Y044d30_titers.html b/docs/serum_Y044d30_titers.html
diff --git a/docs/serum_Y154d182_titers.html b/docs/serum_Y154d182_titers.html
diff --git a/envs/count_barcodes.yml b/envs/count_barcodes.yml
@@ -8,6 +8,6 @@ channels:
 dependencies:
   - pandas=2.2
   - pip
-  - python=3.11
+  - python=3.12
   - pip:
-    - dms_variants==1.5.0
+    - dms_variants==1.6.0
diff --git a/funcs.smk b/funcs.smk
@@ -29,6 +29,16 @@ def process_miscellaneous_plates(misc_plates_d):
                 f"{plate_dict['samples_csv']} has non-unique entries in 'well' column"
             )
         misc_plates[plate]["wells"] = samples.set_index("well")["fastq"].to_dict()
+
+        # get default barcode parser params, update if specified per plate
+        misc_plates[plate]["illumina_barcode_parser_params"] = copy.deepcopy(
+            config["illumina_barcode_parser_params"]
+        )
+        if "illumina_barcode_parser_params" in plate_dict:
+            misc_plates[plate]["illumina_barcode_parser_params"].update(
+                plate_dict["illumina_barcode_parser_params"]
+            )
+
     return misc_plates
 
 
@@ -63,6 +73,15 @@ def process_plate(plate, plate_params):
     if not re.fullmatch(r"\d{4}\-\d{2}\-\d{2}", str(plate_d["date"])):
         raise ValueError(f"{plate =} {plate_d['date'] =} not in YYYY-MM-DD format")
 
+    # get default barcode parser params, update if specified per plate
+    plate_d["illumina_barcode_parser_params"] = copy.deepcopy(
+        config["illumina_barcode_parser_params"]
+    )
+    if "illumina_barcode_parser_params" in plate_params:
+        plate_d["illumina_barcode_parser_params"].update(
+            plate_params["illumina_barcode_parser_params"]
+        )
+
     # Process samples_csv to create the sample data frame
     req_sample_cols = ["well", "serum", "dilution_factor", "replicate", "fastq"]
     samples_df = pd.read_csv(plate_params["samples_csv"], comment="#")

diff --git a/notebooks/process_plate.py.ipynb b/notebooks/process_plate.py.ipynb
@@ -144,6 +144,7 @@
     " - *valid barcode*: barcode that matches a known virus or neutralization standard, we hope most reads are this.\n",
     " - *invalid barcode*: a barcode with proper flanking sequences, but does not match a known virus or neutralization standard. If you  have a lot of reads of this type, it is probably a good idea to look at the invalid barcode CSVs (in the `./results/barcode_invalid/` subdirectory created by the pipeline) to see what these invalid barcodes are.\n",
     " - *unparseable barcode*: could not parse a barcode from this read as there was not a sequence of the correct length with the appropriate flanking sequence.\n",
+    " - *invalid outer flank*: if using an outer upstream or downstream region (`upstream2` or `downstream2` for the [illuminabarcodeparser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html)), reads that are otherwise valid except for this outer flank. Typically you would be using `upstream2` if you have a plate index embedded in your primer, and reads with this classification correspond to a different index than the one for this plate.\n",
     " - *low quality barcode*: low-quality or `N` nucleotides in barcode, could indicate problem with sequencing.\n",
     " - *failed chastity filter*: reads that failed the Illumina chastity filter, if these are reported in the FASTQ (they may not be).\n",
     "\n",
@@ -1500,7 +1501,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.8"
+   "version": "3.12.2"
   }
  },
  "nbformat": 4,

diff --git a/scripts/count_barcodes.py b/scripts/count_barcodes.py
@@ -28,7 +28,7 @@
     **snakemake.params.illumina_barcode_parser_params,
 )
 
-counts, fates = parser.parse(snakemake.input.fastq)
+counts, fates = parser.parse(snakemake.input.fastq, outer_flank_fates=True)
 
 # write valide barcode counts including a 0 count for missing ones
 (

diff --git a/seqneut-pipeline.smk b/seqneut-pipeline.smk
@@ -54,7 +54,7 @@ samples = pd.concat(
     [plate_d["samples"] for plate_d in plates.values()],
     ignore_index=True,
 )
-assert samples["sample"].nunique() == samples["fastq"].nunique() == len(samples)
+assert samples["sample"].nunique() == len(samples)
 samples = samples.set_index("sample").to_dict(orient="index")
 
 if "miscellaneous_plates" in config:
@@ -89,7 +89,9 @@ rule count_barcodes:
         invalid="results/barcode_invalid/{sample}.csv",
         fates="results/barcode_fates/{sample}.csv",
     params:
-        illumina_barcode_parser_params=config["illumina_barcode_parser_params"],
+        illumina_barcode_parser_params=lambda wc: plates[samples[wc.sample]["plate"]][
+            "illumina_barcode_parser_params"
+        ],
     conda:
         "envs/count_barcodes.yml"
     log:
@@ -309,7 +311,9 @@ rule miscellaneous_plate_count_barcodes:
         invalid="results/miscellaneous_plates/{misc_plate}/{well}_invalid.csv",
         fates="results/miscellaneous_plates/{misc_plate}/{well}_fates.csv",
     params:
-        illumina_barcode_parser_params=config["illumina_barcode_parser_params"],
+        illumina_barcode_parser_params=lambda wc: miscellaneous_plates[wc.misc_plate][
+            "illumina_barcode_parser_params"
+        ],
     conda:
         "envs/count_barcodes.yml"
     log:

diff --git a/test_example/config.yml b/test_example/config.yml
@@ -83,6 +83,9 @@ plates:
       <<: *default_process_plate_curvefit_params
     curvefit_qc:
       <<: *default_process_plate_curvefit_qc
+    illumina_barcode_parser_params:
+      upstream2: GCTACA
+      upstream2_mismatch: 1
 
   plate11:
     group: serum
@@ -101,6 +104,9 @@ plates:
       <<: *default_process_plate_curvefit_qc
       barcode_serum_replicates_ignore_curvefit_qc:
         - [AGGTCAAGACCACAGG, M099d0]
+    illumina_barcode_parser_params:
+      upstream2: ATCGAT
+      upstream2_mismatch: 1
 
   H3N2_plate:
     group: pilot
@@ -142,3 +148,6 @@ miscellaneous_plates:
     viral_library: pdmH1N1_lib2023_loes
     neut_standard_set: loes2023
     samples_csv: data/miscellaneous_plates/random_plate_1.csv
+    illumina_barcode_parser_params:
+      upstream2: GCTACA
+      upstream2_mismatch: 1
diff --git a/test_example/data/miscellaneous_plates/random_plate_1.csv b/test_example/data/miscellaneous_plates/random_plate_1.csv
@@ -1,37 +1,37 @@
 well,fastq
-B1,fastqs/Plate2_Noserum2_S98_R1_001.fastq.gz
-B2,fastqs/M099_d30_conc1_S106_R1_001.fastq.gz
-B3,fastqs/M099_d30_conc2_S114_R1_001.fastq.gz
-B4,fastqs/M099_d30_conc3_S122_R1_001.fastq.gz
-B5,fastqs/M099_d30_conc4_S130_R1_001.fastq.gz
-B6,fastqs/M099_d30_conc5_S138_R1_001.fastq.gz
-B7,fastqs/M099_d30_conc6_S146_R1_001.fastq.gz
-B8,fastqs/M099_d30_conc7_S154_R1_001.fastq.gz
-B9,fastqs/M099_d30_conc8_S162_R1_001.fastq.gz
-B10,fastqs/M099_d30_conc9_S170_R1_001.fastq.gz
-B11,fastqs/M099_d30_conc10_S178_R1_001.fastq.gz
-B12,fastqs/Plate2_Noserum10_S186_R1_001.fastq.gz
-C1,fastqs/Plate2_Noserum3_S99_R1_001.fastq.gz
-C2,fastqs/M099_d0_conc1_S107_R1_001.fastq.gz
-C3,fastqs/M099_d0_conc2_S115_R1_001.fastq.gz
-C4,fastqs/M099_d0_conc3_S123_R1_001.fastq.gz
-C5,fastqs/M099_d0_conc4_S131_R1_001.fastq.gz
-C6,fastqs/M099_d0_conc5_S139_R1_001.fastq.gz
-C7,fastqs/M099_d0_conc6_S147_R1_001.fastq.gz
-C8,fastqs/M099_d0_conc7_S155_R1_001.fastq.gz
-C9,fastqs/M099_d0_conc8_S163_R1_001.fastq.gz
-C10,fastqs/M099_d0_conc9_S171_R1_001.fastq.gz
-C11,fastqs/M099_d0_conc10_S179_R1_001.fastq.gz
-C12,fastqs/Plate2_Noserum11_S187_R1_001.fastq.gz
-D1,fastqs/Plate2_Noserum4_S100_R1_001.fastq.gz
-D2,fastqs/Y154_d182_conc1_S108_R1_001.fastq.gz
-D3,fastqs/Y154_d182_conc2_S116_R1_001.fastq.gz
-D4,fastqs/Y154_d182_conc3_S124_R1_001.fastq.gz
-D5,fastqs/Y154_d182_conc4_S132_R1_001.fastq.gz
-D6,fastqs/Y154_d182_conc5_S140_R1_001.fastq.gz
-D7,fastqs/Y154_d182_conc6_S148_R1_001.fastq.gz
-D8,fastqs/Y154_d182_conc7_S156_R1_001.fastq.gz
-D9,fastqs/Y154_d182_conc8_S164_R1_001.fastq.gz
-D10,fastqs/Y154_d182_conc9_S172_R1_001.fastq.gz
-D11,fastqs/Y154_d182_conc10_S180_R1_001.fastq.gz
-D12,fastqs/Plate2_Noserum12_S188_R1_001.fastq.gz
+B1,fastqs/well_B1.fastq.gz
+B2,fastqs/well_B2.fastq.gz
+B3,fastqs/well_B3.fastq.gz
+B4,fastqs/well_B4.fastq.gz
+B5,fastqs/well_B5.fastq.gz
+B6,fastqs/well_B6.fastq.gz
+B7,fastqs/well_B7.fastq.gz
+B8,fastqs/well_B8.fastq.gz
+B9,fastqs/well_B9.fastq.gz
+B10,fastqs/well_B10.fastq.gz
+B11,fastqs/well_B11.fastq.gz
+B12,fastqs/well_B12.fastq.gz
+C1,fastqs/well_C1.fastq.gz
+C2,fastqs/well_C2.fastq.gz
+C3,fastqs/well_C3.fastq.gz
+C4,fastqs/well_C4.fastq.gz
+C5,fastqs/well_C5.fastq.gz
+C6,fastqs/well_C6.fastq.gz
+C7,fastqs/well_C7.fastq.gz
+C8,fastqs/well_C8.fastq.gz
+C9,fastqs/well_C9.fastq.gz
+C10,fastqs/well_C10.fastq.gz
+C11,fastqs/well_C11.fastq.gz
+C12,fastqs/well_C12.fastq.gz
+D1,fastqs/well_D1.fastq.gz
+D2,fastqs/well_D2.fastq.gz
+D3,fastqs/well_D3.fastq.gz
+D4,fastqs/well_D4.fastq.gz
+D5,fastqs/well_D5.fastq.gz
+D6,fastqs/well_D6.fastq.gz
+D7,fastqs/well_D7.fastq.gz
+D8,fastqs/well_D8.fastq.gz
+D9,fastqs/well_D9.fastq.gz
+D10,fastqs/well_D10.fastq.gz
+D11,fastqs/well_D11.fastq.gz
+D12,fastqs/well_D12.fastq.gz
diff --git a/test_example/data/plates/H3N2_sample_plate.csv b/test_example/data/plates/H3N2_sample_plate.csv
diff --git a/test_example/data/plates/plate11_samples.csv b/test_example/data/plates/plate11_samples.csv
@@ -1,37 +1,37 @@
 well,serum,dilution_factor,replicate,fastq
-B1,none,,10,fastqs/Plate11_Noserum2_S2_R1_001.fastq.gz
-B2,M099d30,20.0,2,fastqs/M099_d30_conc1_S10_R1_001.fastq.gz
-B3,M099d30,60.0,2,fastqs/M099_d30_conc2_S18_R1_001.fastq.gz
-B4,M099d30,180.0,2,fastqs/M099_d30_conc3_S26_R1_001.fastq.gz
-B5,M099d30,540.0,2,fastqs/M099_d30_conc4_S34_R1_001.fastq.gz
-B6,M099d30,1620.0,2,fastqs/M099_d30_conc5_S42_R1_001.fastq.gz
-B7,M099d30,4860.0,2,fastqs/M099_d30_conc6_S50_R1_001.fastq.gz
-B8,M099d30,14580.0,2,fastqs/M099_d30_conc7_S58_R1_001.fastq.gz
-B9,M099d30,43740.0,2,fastqs/M099_d30_conc8_S66_R1_001.fastq.gz
-B10,M099d30,131220.0,2,fastqs/M099_d30_conc9_S74_R1_001.fastq.gz
-B11,M099d30,393660.0,2,fastqs/M099_d30_conc10_S82_R1_001.fastq.gz
-B12,none,,2,fastqs/Plate11_Noserum10_S90_R1_001.fastq.gz
-C1,none,,11,fastqs/Plate11_Noserum3_S3_R1_001.fastq.gz
-C2,M099d0,20.0,2,fastqs/M099_d0_conc1_S11_R1_001.fastq.gz
-C3,M099d0,60.0,2,fastqs/M099_d0_conc2_S19_R1_001.fastq.gz
-C4,M099d0,180.0,2,fastqs/M099_d0_conc3_S27_R1_001.fastq.gz
-C5,M099d0,540.0,2,fastqs/M099_d0_conc4_S35_R1_001.fastq.gz
-C6,M099d0,1620.0,2,fastqs/M099_d0_conc5_S43_R1_001.fastq.gz
-C7,M099d0,4860.0,2,fastqs/M099_d0_conc6_S51_R1_001.fastq.gz
-C8,M099d0,14580.0,2,fastqs/M099_d0_conc7_S59_R1_001.fastq.gz
-C9,M099d0,43740.0,2,fastqs/M099_d0_conc8_S67_R1_001.fastq.gz
-C10,M099d0,131220.0,2,fastqs/M099_d0_conc9_S75_R1_001.fastq.gz
-C11,M099d0,393660.0,2,fastqs/M099_d0_conc10_S83_R1_001.fastq.gz
-C12,none,,3,fastqs/Plate11_Noserum11_S91_R1_001.fastq.gz
-D1,none,,12,fastqs/Plate11_Noserum4_S4_R1_001.fastq.gz
-D2,Y044d30,20.0,2,fastqs/Y044_d30_conc1_S12_R1_001.fastq.gz
-D3,Y044d30,60.0,2,fastqs/Y044_d30_conc2_S20_R1_001.fastq.gz
-D4,Y044d30,180.0,2,fastqs/Y044_d30_conc3_S28_R1_001.fastq.gz
-D5,Y044d30,540.0,2,fastqs/Y044_d30_conc4_S36_R1_001.fastq.gz
-D6,Y044d30,1620.0,2,fastqs/Y044_d30_conc5_S44_R1_001.fastq.gz
-D7,Y044d30,4860.0,2,fastqs/Y044_d30_conc6_S52_R1_001.fastq.gz
-D8,Y044d30,14580.0,2,fastqs/Y044_d30_conc7_S60_R1_001.fastq.gz
-D9,Y044d30,43740.0,2,fastqs/Y044_d30_conc8_S68_R1_001.fastq.gz
-D10,Y044d30,131220.0,2,fastqs/Y044_d30_conc9_S76_R1_001.fastq.gz
-D11,Y044d30,393660.0,2,fastqs/Y044_d30_conc10_S84_R1_001.fastq.gz
-D12,none,,4,fastqs/Plate11_Noserum12_S92_R1_001.fastq.gz
+B1,none,,10,fastqs/well_B1.fastq.gz
+B2,M099d30,20.0,2,fastqs/well_B2.fastq.gz
+B3,M099d30,60.0,2,fastqs/well_B3.fastq.gz
+B4,M099d30,180.0,2,fastqs/well_B4.fastq.gz
+B5,M099d30,540.0,2,fastqs/well_B5.fastq.gz
+B6,M099d30,1620.0,2,fastqs/well_B6.fastq.gz
+B7,M099d30,4860.0,2,fastqs/well_B7.fastq.gz
+B8,M099d30,14580.0,2,fastqs/well_B8.fastq.gz
+B9,M099d30,43740.0,2,fastqs/well_B9.fastq.gz
+B10,M099d30,131220.0,2,fastqs/well_B10.fastq.gz
+B11,M099d30,393660.0,2,fastqs/well_B11.fastq.gz
+B12,none,,2,fastqs/well_B12.fastq.gz
+C1,none,,11,fastqs/well_C1.fastq.gz
+C2,M099d0,20.0,2,fastqs/well_C2.fastq.gz
+C3,M099d0,60.0,2,fastqs/well_C3.fastq.gz
+C4,M099d0,180.0,2,fastqs/well_C4.fastq.gz
+C5,M099d0,540.0,2,fastqs/well_C5.fastq.gz
+C6,M099d0,1620.0,2,fastqs/well_C6.fastq.gz
+C7,M099d0,4860.0,2,fastqs/well_C7.fastq.gz
+C8,M099d0,14580.0,2,fastqs/well_C8.fastq.gz
+C9,M099d0,43740.0,2,fastqs/well_C9.fastq.gz
+C10,M099d0,131220.0,2,fastqs/well_C10.fastq.gz
+C11,M099d0,393660.0,2,fastqs/well_C11.fastq.gz
+C12,none,,3,fastqs/well_C12.fastq.gz
+D1,none,,12,fastqs/well_D1.fastq.gz
+D2,Y044d30,20.0,2,fastqs/well_D2.fastq.gz
+D3,Y044d30,60.0,2,fastqs/well_D3.fastq.gz
+D4,Y044d30,180.0,2,fastqs/well_D4.fastq.gz
+D5,Y044d30,540.0,2,fastqs/well_D5.fastq.gz
+D6,Y044d30,1620.0,2,fastqs/well_D6.fastq.gz
+D7,Y044d30,4860.0,2,fastqs/well_D7.fastq.gz
+D8,Y044d30,14580.0,2,fastqs/well_D8.fastq.gz
+D9,Y044d30,43740.0,2,fastqs/well_D9.fastq.gz
+D10,Y044d30,131220.0,2,fastqs/well_D10.fastq.gz
+D11,Y044d30,393660.0,2,fastqs/well_D11.fastq.gz
+D12,none,,4,fastqs/well_D12.fastq.gz