From d4ac20187b22b61f12c81d58bf733f84e2f0e2af Mon Sep 17 00:00:00 2001 From: Will Bradshaw Date: Mon, 1 Jul 2024 09:19:08 -0400 Subject: [PATCH] Added HV identification notebook entry --- docs/index.html | 100 +-- docs/listings.json | 2 +- docs/notebooks/2024-07-01_partial-hv.html | 717 ++++++++++++++++++++++ docs/search.json | 16 +- notebooks/2024-07-01_partial-hv.qmd | 42 ++ 5 files changed, 825 insertions(+), 52 deletions(-) create mode 100644 docs/notebooks/2024-07-01_partial-hv.html create mode 100644 notebooks/2024-07-01_partial-hv.qmd diff --git a/docs/index.html b/docs/index.html index 5300f74..4a65300 100644 --- a/docs/index.html +++ b/docs/index.html @@ -165,7 +165,29 @@
-
+
+
+

+
 
+

+
+ + +
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
@@ -715,7 +737,7 @@

-
+
-
-
-

-

-

-
- - -
-
+

diff --git a/docs/listings.json b/docs/listings.json index b6ca44f..c5e3eb1 100644 --- a/docs/listings.json +++ b/docs/listings.json @@ -2,6 +2,7 @@ { "listing": "/index.html", "items": [ + "/notebooks/2024-07-01_partial-hv.html", "/notebooks/2024-06-11_batch.html", "/notebooks/2024-05-07_munk.html", "/notebooks/2024-05-01_ng.html", @@ -29,7 +30,6 @@ "/notebooks/2023-10-31_project-runway-initial.html", "/notebooks/2023-10-19_deduplication.html", "/notebooks/2023-10-13_rrna-removal.html", - "/notebooks/2023-10-13_rrna-removal.html", "/notebooks/2023-10-12_fastp-vs-adapterremoval.html", "/notebooks/2023-10-12_how-does-element-sequencing-work.html", "/notebooks/2023-09-12_settled-solids-extraction-test.html" diff --git a/docs/notebooks/2024-07-01_partial-hv.html b/docs/notebooks/2024-07-01_partial-hv.html new file mode 100644 index 0000000..835582b --- /dev/null +++ b/docs/notebooks/2024-07-01_partial-hv.html @@ -0,0 +1,717 @@ + + + + + + + + + + + +Will’s Public NAO Notebook - Investigating the v2 pipeline’s human-virus assignment behavior + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+
+
+

Investigating the v2 pipeline’s human-virus assignment behavior

+

Checking treatment of partially-human-infecting virus taxa

+
+
+ + +
+ +
+
Author
+
+

Will Bradshaw

+
+
+ +
+
Published
+
+

July 1, 2024

+
+
+ + +
+ + +
+ + + + +
+ + + + + +

One question that came up in a recent team meeting about the refactored v2 pipeline is how higher-level taxa are handled during human-virus identification.

+

As a reminder, the process for HV identification currently looks like this:

+
    +
  1. Preprocessed read pairs are mapped to a database of HV genomes with Bowtie2, retaining any read that meets a fairly low alignment-score threshold.
  2. +
  3. Surviving read pairs are mapped to human, cow, and assorted other contaminant sequences with Bowtie2 and BBMap, discarding any read that maps to any contaminant.
  4. +
  5. Surviving read pairs are merged into single sequences with BBMerge, then passed to Kraken2, which performs taxonomic assignment using the Standard database.
  6. +
  7. Based on the Kraken2 assignments, reads are classified into those that are (1) assigned to a human-infecting virus taxon with Kraken, (2) assigned to a non-HV taxon with Kraken, or (3) not assigned to any taxon. All reads in category (2) are filtered out.
  8. +
  9. Surviving reads are assigned HV status if they (a) are given HV assignments by both Bowtie2 and Kraken2, or (b) are unassigned by Kraken and align to an HV taxon with Bowtie2 with an alignment score above a user-specified threshold (typically 20).
  10. +
+

This all works well for reads that are assigned to a taxon that is entirely composed of human-infecting viruses. The question is, what about reads that Kraken assigns to taxa that are only partially human-infecting? For example, the virus family Coronaviridae (taxid 11118) contains several coronaviruses that infect humans (e.g. SARS-CoV-2) and many that do not (e.g. assorted bat coronaviruses). What would happen to a read that was assigned by Kraken2 to this taxid?

+

The key process here is PROCESS_KRAKEN_HV (modules/local/processKrakenHV/main.nf), which calls bin/process_kraken_hv.py on the output of Kraken2. This script:

+
    +
  1. Imports a TSV of human-infecting virus taxa (generated by FINALIZE_HUMAN_VIRUS_DB in the index workflow) as well as the NCBI taxonomy tree structure (generated by EXTRACT_NCBI_TAXONOMY in the index workflow). Importantly, this initial HV TSV does not include higher taxa.
  2. +
  3. Iterates line-by-line over the readwise Kraken2 output as follows: +
      +
    1. Extract the name and taxid assignment for the read.
    2. +
    3. Check if the taxid is in the list of HV taxids; if so, return True.
    4. +
    5. If not, get the taxid’s parent taxid from the tree structure and go to (b).
    6. +
    7. Repeat (b) and (c) until an HV taxid is found or the taxid being screened is 0 (unassigned), 1 (root) or 2 (Bacteria); in the latter case, return False.
    8. +
  4. +
  5. Save the read’s HV status, along with other information, to an output file.
  6. +
+

As such, the script checks whether the assigned taxid or any of its ancestors are HV taxa, but not whether any of its descendents are. Reads that are assigned to higher-level taxa will thus be treated as though they were assigned to a non-HV taxa, and filtered out during HV read identification.

+

This seems suboptimal. That said, it’s not obvious what the correct behavior is here; treating reads assigned to partially-HV taxa as HV comes with its own problems1. Someone should think more about what the right approach is here.

+ + + + + + +

Footnotes

+ +
    +
  1. In particular, since the Bowtie2 database used for initial screening only contains HV genomes, this approach would likely lead to closely-related non-human-infecting virus reads being classified as HV.↩︎

  2. +
+
+ +
+ + + + + \ No newline at end of file diff --git a/docs/search.json b/docs/search.json index 9dda6e9..cef8b03 100644 --- a/docs/search.json +++ b/docs/search.json @@ -32,7 +32,7 @@ "href": "index.html", "title": "Will's Public NAO Notebook", "section": "", - "text": "Setting up AWS Batch to work with the NAO’s MGS workflow\n\n\nA hopefully-simple guide to an unfortunately-complicated service.\n\n\n\n\n\nJun 11, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Munk et al. (2022)\n\n\nA global wastewater study.\n\n\n\n\n\nMay 7, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Ng et al. (2019)\n\n\nWastewater from Singapore.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Bengtsson-Palme et al. (2016)\n\n\nWastewater grab samples from Sweden.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Maritz et al. (2019)\n\n\nWastewater from NYC.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brinch et al. (2020)\n\n\nWastewater from Copenhagen.\n\n\n\n\n\nApr 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Leung et al. (2021)\n\n\nAir sampling from urban public transit systems.\n\n\n\n\n\nApr 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rosario et al. (2018)\n\n\nAir sampling from a student dorm in Colorado.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Prussin et al. (2019)\n\n\nAir filters from a daycare in Virginia.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brumfield et al. (2022)\n\n\nWastewater from a manhole in Maryland.\n\n\n\n\n\nApr 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Spurbeck et al. (2023)\n\n\nCave carpa.\n\n\n\n\n\nApr 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nFollowup analysis of Yang et al. (2020)\n\n\nDigging into deduplication.\n\n\n\n\n\nMar 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Yang et al. (2020)\n\n\nWastewater from Xinjiang.\n\n\n\n\n\nMar 16, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nImproving read deduplication in the MGS workflow\n\n\nRemoving reverse-complement duplicates of human-viral reads.\n\n\n\n\n\nMar 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 2\n\n\nPanel-enriched samples.\n\n\n\n\n\nFeb 29, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 1\n\n\nUnenriched samples.\n\n\n\n\n\nFeb 27, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 3\n\n\nFixing the virus pipeline.\n\n\n\n\n\nFeb 15, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 2\n\n\nAbundance and composition of human-infecting viruses.\n\n\n\n\n\nFeb 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 1\n\n\nPreprocessing and composition.\n\n\n\n\n\nFeb 4, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nAutomating BLAST validation of human viral read assignment\n\n\nExperiments with BLASTN remote mode\n\n\n\n\n\nJan 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nProject Runway RNA-seq testing data: removing livestock reads\n\n\n\n\n\n\n\n\nDec 22, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Project Runway RNA-seq testing data\n\n\nApplying a new workflow to some oldish data.\n\n\n\n\n\nDec 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\nNov 8, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items" + "text": "Investigating the v2 pipeline’s human-virus assignment behavior\n\n\nChecking treatment of partially-human-infecting virus taxa\n\n\n\n\n\nJul 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nSetting up AWS Batch to work with the NAO’s MGS workflow\n\n\nA hopefully-simple guide to an unfortunately-complicated service.\n\n\n\n\n\nJun 11, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Munk et al. (2022)\n\n\nA global wastewater study.\n\n\n\n\n\nMay 7, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Ng et al. (2019)\n\n\nWastewater from Singapore.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Bengtsson-Palme et al. (2016)\n\n\nWastewater grab samples from Sweden.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Maritz et al. (2019)\n\n\nWastewater from NYC.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brinch et al. (2020)\n\n\nWastewater from Copenhagen.\n\n\n\n\n\nApr 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Leung et al. (2021)\n\n\nAir sampling from urban public transit systems.\n\n\n\n\n\nApr 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rosario et al. (2018)\n\n\nAir sampling from a student dorm in Colorado.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Prussin et al. (2019)\n\n\nAir filters from a daycare in Virginia.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brumfield et al. (2022)\n\n\nWastewater from a manhole in Maryland.\n\n\n\n\n\nApr 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Spurbeck et al. (2023)\n\n\nCave carpa.\n\n\n\n\n\nApr 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nFollowup analysis of Yang et al. (2020)\n\n\nDigging into deduplication.\n\n\n\n\n\nMar 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Yang et al. (2020)\n\n\nWastewater from Xinjiang.\n\n\n\n\n\nMar 16, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nImproving read deduplication in the MGS workflow\n\n\nRemoving reverse-complement duplicates of human-viral reads.\n\n\n\n\n\nMar 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 2\n\n\nPanel-enriched samples.\n\n\n\n\n\nFeb 29, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 1\n\n\nUnenriched samples.\n\n\n\n\n\nFeb 27, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 3\n\n\nFixing the virus pipeline.\n\n\n\n\n\nFeb 15, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 2\n\n\nAbundance and composition of human-infecting viruses.\n\n\n\n\n\nFeb 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 1\n\n\nPreprocessing and composition.\n\n\n\n\n\nFeb 4, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nAutomating BLAST validation of human viral read assignment\n\n\nExperiments with BLASTN remote mode\n\n\n\n\n\nJan 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nProject Runway RNA-seq testing data: removing livestock reads\n\n\n\n\n\n\n\n\nDec 22, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Project Runway RNA-seq testing data\n\n\nApplying a new workflow to some oldish data.\n\n\n\n\n\nDec 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\nNov 8, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items" }, { "objectID": "notebooks/2023-10-12_fastp-vs-adapterremoval.html", @@ -362,5 +362,19 @@ "title": "Setting up AWS Batch to work with the NAO’s MGS workflow", "section": "Footnotes", "text": "Footnotes\n\n\nIn more depth, you need the following actions to be enabled for the bucket in question for your IAM user or role: s3:ListBucket, s3:GetBucketLocation, s3:GetObject, s3:GetObjectAcl, s3:PutObject, s3:PutObjectAcl, s3:PutObjectTagging, and s3:DeleteObject. If you’re using a bucket specific to your user, all this is easier if you first have your administrator enable s3:GetBucketPolicy and s3:PutBucketPolicy for your user.↩︎\nIf you want even more IOPS, you can provision an io2 volume instead of gp3. However, that’s beyond the scope of this guide.↩︎\nIn the future, I’ll investigate running Batch with Fargate for Nextflow workflows. For now, using EC2 gives us greater control over configuration than Fargate, at the cost of additional setup complexity and occasional startup delays.↩︎\nOn the latest version of the v2 pipeline, most of these options are already configured appropriately when running the pipeline with -profile batch (which is also the default profile). In this case, you only need to change process.queue to point to the name of your job queue.↩︎" + }, + { + "objectID": "notebooks/2024-07-01_partial-hv.html", + "href": "notebooks/2024-07-01_partial-hv.html", + "title": "Investigating the v2 pipeline’s human-virus assignment behavior", + "section": "", + "text": "One question that came up in a recent team meeting about the refactored v2 pipeline is how higher-level taxa are handled during human-virus identification.\nAs a reminder, the process for HV identification currently looks like this:\nThis all works well for reads that are assigned to a taxon that is entirely composed of human-infecting viruses. The question is, what about reads that Kraken assigns to taxa that are only partially human-infecting? For example, the virus family Coronaviridae (taxid 11118) contains several coronaviruses that infect humans (e.g. SARS-CoV-2) and many that do not (e.g. assorted bat coronaviruses). What would happen to a read that was assigned by Kraken2 to this taxid?\nThe key process here is PROCESS_KRAKEN_HV (modules/local/processKrakenHV/main.nf), which calls bin/process_kraken_hv.py on the output of Kraken2. This script:\nAs such, the script checks whether the assigned taxid or any of its ancestors are HV taxa, but not whether any of its descendents are. Reads that are assigned to higher-level taxa will thus be treated as though they were assigned to a non-HV taxa, and filtered out during HV read identification.\nThis seems suboptimal. That said, it’s not obvious what the correct behavior is here; treating reads assigned to partially-HV taxa as HV comes with its own problems1. Someone should think more about what the right approach is here." + }, + { + "objectID": "notebooks/2024-07-01_partial-hv.html#footnotes", + "href": "notebooks/2024-07-01_partial-hv.html#footnotes", + "title": "Investigating the v2 pipeline’s human-virus assignment behavior", + "section": "Footnotes", + "text": "Footnotes\n\n\nIn particular, since the Bowtie2 database used for initial screening only contains HV genomes, this approach would likely lead to closely-related non-human-infecting virus reads being classified as HV.↩︎" } ] \ No newline at end of file diff --git a/notebooks/2024-07-01_partial-hv.qmd b/notebooks/2024-07-01_partial-hv.qmd new file mode 100644 index 0000000..2edd683 --- /dev/null +++ b/notebooks/2024-07-01_partial-hv.qmd @@ -0,0 +1,42 @@ +--- +title: "Investigating the v2 pipeline's human-virus assignment behavior" +subtitle: "Checking treatment of partially-human-infecting virus taxa" +author: "Will Bradshaw" +date: 2024-07-01 +format: + html: + code-fold: true + code-tools: true + code-link: true + df-print: paged +editor: visual +title-block-banner: black +--- + +One question that came up in a recent team meeting about the [refactored v2 pipeline](https://github.com/naobservatory/mgs-workflow) is how higher-level taxa are handled during human-virus identification. + +As a reminder, the process for HV identification currently looks like this: + +1. Preprocessed read pairs are mapped to a database of HV genomes with Bowtie2, retaining any read that meets a fairly low alignment-score threshold. +2. Surviving read pairs are mapped to human, cow, and assorted other contaminant sequences with Bowtie2 and BBMap, discarding any read that maps to any contaminant. +3. Surviving read pairs are merged into single sequences with BBMerge, then passed to Kraken2, which performs taxonomic assignment using the Standard database. +4. Based on the Kraken2 assignments, reads are classified into those that are (1) assigned to a human-infecting virus taxon with Kraken, (2) assigned to a non-HV taxon with Kraken, or (3) not assigned to any taxon. All reads in category (2) are filtered out. +5. Surviving reads are assigned HV status if they (a) are given HV assignments by both Bowtie2 and Kraken2, or (b) are unassigned by Kraken and align to an HV taxon with Bowtie2 with an alignment score above a user-specified threshold (typically 20). + +This all works well for reads that are assigned to a taxon that is entirely composed of human-infecting viruses. The question is, what about reads that Kraken assigns to taxa that are only partially human-infecting? For example, the virus family Coronaviridae (taxid 11118) contains several coronaviruses that infect humans (e.g. SARS-CoV-2) and many that do not (e.g. assorted bat coronaviruses). What would happen to a read that was assigned by Kraken2 to this taxid? + +The key process here is `PROCESS_KRAKEN_HV` ([`modules/local/processKrakenHV/main.nf`](https://github.com/naobservatory/mgs-workflow/blob/master/modules/local/processKrakenHV/main.nf)), which calls [`bin/process_kraken_hv.py`](https://github.com/naobservatory/mgs-workflow/blob/master/bin/process_kraken_hv.py) on the output of Kraken2. This script: + +1. Imports a TSV of human-infecting virus taxa (generated by `FINALIZE_HUMAN_VIRUS_DB` in the index workflow) as well as the NCBI taxonomy tree structure (generated by `EXTRACT_NCBI_TAXONOMY` in the index workflow). Importantly, this initial HV TSV does *not* include higher taxa. +2. Iterates line-by-line over the readwise Kraken2 output as follows: + a. Extract the name and taxid assignment for the read. + b. Check if the taxid is in the list of HV taxids; if so, return `True`. + c. If not, get the taxid's parent taxid from the tree structure and go to (b). + d. Repeat (b) and (c) until an HV taxid is found or the taxid being screened is 0 (unassigned), 1 (root) or 2 (Bacteria); in the latter case, return `False`. +3. Save the read's HV status, along with other information, to an output file. + +As such, the script checks whether the assigned taxid *or any of its ancestors* are HV taxa, but not whether any of its descendents are. Reads that are assigned to higher-level taxa will thus be treated as though they were assigned to a non-HV taxa, and filtered out during HV read identification. + +This seems suboptimal. That said, it's not obvious what the correct behavior is here; treating reads assigned to partially-HV taxa as HV comes with its own problems[^1]. Someone should think more about what the right approach is here. + +[^1]: In particular, since the Bowtie2 database used for initial screening only contains HV genomes, this approach would likely lead to closely-related non-human-infecting virus reads being classified as HV.