Proposed Pipeline #7

Open · 6 of 25 tasks
averagehat opened this issue on Jan 22, 2016 · 14 comments
@averagehat:

Inputs:

  • 0-x unpaired fastq/sff files
  • 0-y pairs of illumina paired fastq files
  • [0-z pacbio files]

Transformation:

  • sff -> fastq

Filter:

  • drop reads with Ns (see the Biopython sketch after this list)
  • drop Illumina reads whose index's lowest quality score falls below some minimum
  • De novo: run host_filter
  • Note: individual files may be lost or empty after this process
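
A minimal Biopython sketch of the N-filter, since Biopython is already in the stack (see convert_format.jip below); the file names are placeholders and the index-quality filter is omitted:

```python
# Hypothetical N-filter: drop any read whose sequence contains an N.
# File names are placeholders, not the pipeline's actual interface.
from Bio import SeqIO

def drop_n_reads(in_fastq, out_fastq):
    kept = (rec for rec in SeqIO.parse(in_fastq, "fastq")
            if "N" not in rec.seq.upper())
    return SeqIO.write(kept, out_fastq, "fastq")  # returns number of reads kept

drop_n_reads("in.fastq", "filtered.fastq")
```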

Transformation (trimming):

  • cut 3' and 5' adapters: -a, -A, -g, -G (cutadapt flags; a sketch follows this list)
  • cut 3' and 5' primers
  • trim all N's from end of reads --trim-n
  • trim low quality bases from 5' and 3' -q <five>,<three> or -q <five>
  • remove X bases from beginning/end of reads -u <X>
  • [run something on pacbio reads separately]
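
These flags match cutadapt's interface, so the trimming step might look like the sketch below; the adapter sequences, cutoffs, and file names are placeholders, not the project's chosen values:

```python
# Hypothetical cutadapt invocation for one Illumina pair; all values are placeholders.
import subprocess

subprocess.run([
    "cutadapt",
    "-a", "AGATCGGAAGAGC", "-g", "ACACTCTTTCCCT",  # 3'/5' adapters, read 1
    "-A", "AGATCGGAAGAGC", "-G", "ACACTCTTTCCCT",  # 3'/5' adapters, read 2
    "-q", "25,25",      # trim low-quality bases from the 5' and 3' ends
    "--trim-n",         # trim N's from the ends of reads
    "-u", "5",          # remove 5 bases from the start of read 1 (-U for read 2)
    "-o", "trimmed.R1.fastq", "-p", "trimmed.R2.fastq",
    "in.R1.fastq", "in.R2.fastq",
], check=True)
```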

Reduction (mapping/assembly) -- this is where the de novo and mapping paths branch (a command-level sketch of the mapping branch follows the list):

  • Mapping:
    • compile paired, unpaired reads
    • run bwa on the compiled paired and unpaired reads separately
    • [run bwa/other mapper on pacbio reads, then merge bam files]
    • merge paired and unpaired bam files, if they exist
    • sort, index bam file
    • tag the bam file via tagreads
    • run freebayes on the bam file
    • create a consensus fasta
  • De novo:
    • compile paired and unpaired reads (?)
    • run a de novo assembler (sga and spades are both in bioconda, but not ray)
    • try to figure out the number of reads for each contig
    • maybe simulate iterative_blast
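
A command-level sketch of the mapping branch, assuming the standard bwa/samtools/freebayes CLIs; tagreads is this project's own tool, so its invocation is a guess, and every file name is a placeholder:

```python
# Hypothetical end-to-end mapping branch; paths and the tagreads call are assumptions.
import subprocess

def sh(cmd):
    subprocess.run(cmd, shell=True, check=True)

ref = "reference.fasta"
sh(f"bwa index {ref}")
# Map paired and unpaired reads separately, then merge the BAMs.
sh(f"bwa mem {ref} paired.R1.fastq paired.R2.fastq | samtools view -b -o paired.bam -")
sh(f"bwa mem {ref} unpaired.fastq | samtools view -b -o unpaired.bam -")
sh("samtools merge -f merged.bam paired.bam unpaired.bam")
sh("samtools sort -o sorted.bam merged.bam")
sh("samtools index sorted.bam")
sh("tagreads sorted.bam")  # project tool; exact interface assumed
sh(f"freebayes -f {ref} sorted.bam > variants.vcf")
# The consensus step (VCF + reference -> fasta) is discussed below.
```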
@necrolyte2:

sff -> fastq should already be handled by convert_format.jip.
The tool supports any Biopython format for input and output.
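
For reference, the underlying Biopython call for the sff -> fastq case is a one-liner (file names illustrative):

```python
from Bio import SeqIO

# Convert sff to fastq in one call; returns the number of records written.
count = SeqIO.convert("reads.sff", "sff", "reads.fastq", "fastq")
```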

@necrolyte2:

So it seems like the filter and transform steps will likely produce both paired and unpaired output: any time you filter (or cut) an entire read out of one of the paired files, it orphans the mate read in the other file.

@necrolyte2:

I guess the important thing about that is that we will have to accept paired + unpaired for every step that takes in reads
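
A toy illustration of why: if each file of a pair is filtered independently, the survivors have to be re-paired and the leftovers become unpaired reads. This sketch assumes mates share identical record IDs, which is not true of every FASTQ naming scheme:

```python
# Hypothetical re-pairing after independent filtering; assumes mates share
# identical record IDs (not true for /1 and /2 style naming).
from Bio import SeqIO

def repartition(filtered_r1, filtered_r2):
    r1 = SeqIO.to_dict(SeqIO.parse(filtered_r1, "fastq"))
    r2 = SeqIO.to_dict(SeqIO.parse(filtered_r2, "fastq"))
    still_paired = r1.keys() & r2.keys()
    pairs = [(r1[k], r2[k]) for k in still_paired]
    orphans = [r1[k] for k in r1.keys() - still_paired]
    orphans += [r2[k] for k in r2.keys() - still_paired]
    return pairs, orphans
```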

@averagehat:

I welcome ideas about how to test the consensus program (it creates a consensus from a VCF and a reference), especially using hypothesis. See this presentation on property-based testing.

Also worth checking out "Metamorphic Testing" (link specific to bioinformatics).

@necrolyte2:

Haha, this is just so much YES

@necrolyte2:

Seems like the consensus would be fairly straightforward.

Strategy to generate a fasta for the reference
Strategy to generate a list of VCF lines (doesn't have to have the same number of items as the length of the reference)

So then the question becomes: what is the correct consensus?
The consensus should always start out equal to the reference.
Then, at any location where there is a VCF line, you make a decision about whether the reference is correct, based on parameters such as our SOP of 80/20, which also checks for low-depth regions and makes a different call there.
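
A minimal hypothesis sketch of that first property, where make_consensus is a hypothetical stand-in for the real consensus function (not the project's actual API): with no VCF lines, the consensus must equal the reference.

```python
# Property sketch: a consensus built from zero variants equals the reference.
from hypothesis import given
from hypothesis import strategies as st

reference_strategy = st.text(alphabet="ACGT", min_size=1, max_size=500)

@given(reference_strategy)
def test_consensus_without_variants_equals_reference(reference):
    # make_consensus is hypothetical; substitute the real consensus function.
    assert make_consensus(reference, vcf_records=[]) == reference
```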

@averagehat:

Here is a cached link to the paper above.

@averagehat:

Here's what I'm planning (sketched in code after the list):

  1. If the total number of reads is less than 10, call it an N.
  2. If the alternate base accounts for >= 80% of the depth, call that alternate base.
  3. [optional] If there are two alternate bases that together account for >= 80% of the depth, call their ambiguous base. If there are more than two, call an N.
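
A rough sketch of those three rules in code; the function name, the input shape, and the final fall-through to the reference base are all assumptions, not settled behavior:

```python
# Hypothetical sketch of the three proposed calling rules at one position.
AMBIGUITY = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
             frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W"}

def call_base(ref, alt_depths, total_depth):
    """alt_depths: {alternate base: supporting read depth} at one position."""
    if total_depth < 10:                                   # rule 1: too shallow
        return "N"
    ranked = sorted(alt_depths.items(), key=lambda kv: kv[1], reverse=True)
    if ranked and ranked[0][1] >= 0.8 * total_depth:       # rule 2: dominant alternate
        return ranked[0][0]
    if len(ranked) == 2 and sum(d for _, d in ranked) >= 0.8 * total_depth:
        return AMBIGUITY.get(frozenset(b for b, _ in ranked), "N")  # rule 3: two alts
    if len(ranked) > 2:                                    # rule 3: more than two alts
        return "N"
    return ref  # assumption: no dominant alternate, keep the reference base
```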

@necrolyte2:

I feel like there should be a majority (>50%) rule in there somewhere.
Like when depth is below 10, the logic falls back to majority or something...

@averagehat:

Regarding the <10 depth case:

Indeed there is an inconsistency; from here:

N is called if the depth was < 10 and the proportion of N is > 20%.
Note that in that implementation, bases with quality < 25 are assigned as N.

We don't have the quality for each individual base, but we do have the total quality for the ALTS and ref-supporting bases (recorded as QA and QR respectively).
So you could say something like:

(25 comes from the minimum quality; 8 comes from applying the 80% majority threshold to the minimum depth of 10)

  1. "if the total quality of the alternates is < 200 (25 * 8), don't call it an alternate."
  2. "if the total quality of the references is < 200, don't call the reference."
  3. "if neither the reference nor the alternate can be called, call an N."

This seems reasonable (sketched in code below).

edit: I realized that this method doesn't bias at all towards the reference, which I think we want.
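
A sketch of those three quality rules, assuming QA and QR are the summed qualities freebayes reports; the case where both sides are callable is not specified above, so its handling here is a placeholder:

```python
# Hypothetical low-depth (total depth < 10) call based on summed VCF qualities.
MIN_TOTAL_QUALITY = 25 * 8  # min base quality (25) times 80% of the 10-read floor

def low_depth_call(ref, alt, qr, qa):
    alt_callable = qa >= MIN_TOTAL_QUALITY   # rule 1
    ref_callable = qr >= MIN_TOTAL_QUALITY   # rule 2
    if not alt_callable and not ref_callable:
        return "N"                           # rule 3
    if alt_callable and not ref_callable:
        return alt
    if ref_callable and not alt_callable:
        return ref
    # Both callable: unspecified above; deferring to the 80/20 depth rules
    # would be one option. Returning the alternate here is just a placeholder.
    return alt
```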

@averagehat:

@mmelendrez, we could use your input. Please review the discussion starting here:

We are discussing the logic used to call bases from a freebayes VCF file. I am ready to add special logic for the case where the total depth of reads at a point is < 10. Currently, the script always outputs an N in this case. I'm not sure what behavior we want, or how important these extremely low-depth regions are.

It's not possible to exactly duplicate ngs_mapper's current behavior (which is difficult to pin down and requires enumerating the quality of individual bases, which freebayes doesn't provide--it just provides a total) but it could be approximated as described above.

You could argue that this approximation would be misleading: ALT or REF can span multiple bases, so the QR and QA scores are the sums of the qualities of all those bases (I assume), and therefore you can't determine the exact quality score of a given base, just the average. Now, the REF and ALTs span multiple bases for a reason, mainly that those bases occur together in the reads. The situation is closely related to the way we're handling depth (which is also not pinned down to individual bases at individual positions). The difference is that we assume depth means the total number of reads supporting all those positions (assuming all bases have approximately equal depth), whereas quality is the sum over multiple bases (which can have widely varying quality).

All of this, in my opinion, suggests that something simple should be done, and either:
a) it's up to the PI to look at it more closely if they want, or
b) we do some more complex processing with base-by-base information to complement freebayes at these low-depth regions.

@averagehat:

@InaMBerry may also like to contribute to the discussion starting here.

@mmelendrez:

Let me get my head around this. A convo might work better for me, to make sure I understand what freebayes can and cannot do when implemented in our pipeline, since it cannot emulate what ngs_mapper does right now (taking per-base quality into account when calling bases). Let's chat tomorrow. Then I can summarize our thoughts here, and @InaMBerry and @necrolyte2 can weigh in.

@averagehat:

Here's a listing of all the information freebayes provides by default: https://gist.github.com/averagehat/29c4ad1f3a7837fb2f82#file-gistfile1-txt
