Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change krona wrapper to read from summary report instead of read report #777

Open
dpark01 opened this issue Jan 31, 2018 · 8 comments
Open

Comments

@dpark01
Copy link
Member

dpark01 commented Jan 31, 2018

In our current invocations in the WDL workflow, we run kraken, and then krona, on a bunch of samples to save on DB staging time and such. Currently, the krona portion is run in a for-loop after kraken completes.

I'm observing that the krona portion sometimes takes longer than the kraken portion (e.g. multiple hours for krona vs. 1 hr for kraken on a lane of a 2500 high output run). This is both dumb and very costly: krona is only utilizing a single core on an instance that is sized very large for kraken's sake.

Edit: this issue will now focus on changing the krona entry point to run off of the kraken summary report file exclusively (with no other inputs, see comments below for details).

@yesimon
Copy link
Contributor

yesimon commented Jan 31, 2018

Actually, we can just change krona to intepret the kraken report instead and run almost instantaneously. The only slight drawback is that it will report each taxon as having 1 read with weight = no of reads instead. (Krona has a feature to multiply the read count * additional weight variable for each taxon).

@dpark01
Copy link
Member Author

dpark01 commented Jan 31, 2018

Wait, I've always wondered why krona couldn't run directly off the summary txt file, but you're saying it can. I don't understand the drawback then -- what would the difference be from the current behavior?

@yesimon
Copy link
Contributor

yesimon commented Jan 31, 2018

The only difference would be krona reporting n_i reads with weight 1 versus 1 read with n_i weight.

@dpark01
Copy link
Member Author

dpark01 commented Jan 31, 2018

Visually, the pies would look the same... that seems worthwhile... does it even need a taxonomy db anymore at that point?

@yesimon
Copy link
Contributor

yesimon commented Jan 31, 2018

It still does but krona only needs its taxonomy.tab file which is <100 Mb

@dpark01
Copy link
Member Author

dpark01 commented Jan 31, 2018

What, informationally, is in the tab file that isn’t already in the summary report file?

@dpark01
Copy link
Member Author

dpark01 commented Feb 1, 2018

Hm. I kind of like the idea of omitting the taxonomy.tab file. Partly so that the user doesn't have to provide one more file (even if we default it), but mostly because it always bothers me that all these tools are potentially using different taxonomy databases that can get out of sync with each other.

According to this, it looks like the taxonomy.tab file is just:

  1. taxid
  2. tree distance from root
  3. parent taxid
  4. taxonomic rank in lowercase English ("superkingdom", "genus", "species")
  5. name

According to the Kraken manual, the report file is:

  1. Percentage of reads covered by the clade rooted at this taxon
  2. Number of reads covered by the clade rooted at this taxon
  3. Number of reads assigned directly to this taxon
  4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
  5. NCBI taxonomy ID
  6. indented scientific name

So to create krona's taxonomy.tab file on the fly from kraken's summary report:

  1. column 5
  2. count the number of leading spaces in column 6
  3. this will require a little stateful parsing, but can be done
  4. this is just mapping the single letter from column 4 into a full lowercase word
  5. strip leading spaces from column 6

@dpark01 dpark01 changed the title WDL: break krona from kraken Change krona wrapper to read from summary report instead of read report Feb 2, 2018
@dpark01
Copy link
Member Author

dpark01 commented Feb 2, 2018

I'm changing this particular Issue to focus on the new direction discussed here in the comments and will backlog it for some future time.

For reference, here was the original thought:

I propose splitting kraken from krona in the WDL tasks (they're already split on the Snakemake side) and invoking the krona task within a scatter (one sample per task) after kraken. The krona task would need no more than 4-5GB RAM and 1 CPU core. I'd set dx_instance_type to mem2_hdd2_x2 (most cost effective for this).

I will separately implement a quick and dirty change that invokes krona from GNU parallel within the same WDL task (instead of a bash for loop) just to improve things for now until we have time for this larger issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants