Proposed Pipeline #7

Open · 6 of 25 tasks
averagehat opened this issue on Jan 22, 2016 · 14 comments
@averagehat:

Inputs:

  • 0-x unpaired fastq/sff files
  • 0-y pairs of illumina paired fastq files
  • [0-z pacbio files]

Transformation:

  • sff -> fastq

Filter:

  • drop reads with Ns (see the Biopython sketch after this list)
  • drop Illumina reads whose index's lowest quality score falls below some minimum
  • De novo: run host_filter
  • Note: individual files may be lost or empty after this process
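
A minimal Biopython sketch of the N-filter, since Biopython is already in the stack (see convert_format.jip below); the file names are placeholders and the index-quality filter is omitted:

```python
# Hypothetical N-filter: drop any read whose sequence contains an N.
# File names are placeholders, not the pipeline's actual interface.
from Bio import SeqIO

def drop_n_reads(in_fastq, out_fastq):
    kept = (rec for rec in SeqIO.parse(in_fastq, "fastq")
            if "N" not in rec.seq.upper())
    return SeqIO.write(kept, out_fastq, "fastq")  # returns number of reads kept

drop_n_reads("in.fastq", "filtered.fastq")
```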

Transformation (trimming):

  • cut 3' and 5' adapters: -a, -A, -g, -G (cutadapt flags; a sketch follows this list)
  • cut 3' and 5' primers
  • trim all N's from end of reads --trim-n
  • trim low quality bases from 5' and 3' -q <five>,<three> or -q <five>
  • remove X bases from beginning/end of reads -u <X>
  • [run something on pacbio reads separately]
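
These flags match cutadapt's interface, so the trimming step might look like the sketch below; the adapter sequences, cutoffs, and file names are placeholders, not the project's chosen values:

```python
# Hypothetical cutadapt invocation for one Illumina pair; all values are placeholders.
import subprocess

subprocess.run([
    "cutadapt",
    "-a", "AGATCGGAAGAGC", "-g", "ACACTCTTTCCCT",  # 3'/5' adapters, read 1
    "-A", "AGATCGGAAGAGC", "-G", "ACACTCTTTCCCT",  # 3'/5' adapters, read 2
    "-q", "25,25",      # trim low-quality bases from the 5' and 3' ends
    "--trim-n",         # trim N's from the ends of reads
    "-u", "5",          # remove 5 bases from the start of read 1 (-U for read 2)
    "-o", "trimmed.R1.fastq", "-p", "trimmed.R2.fastq",
    "in.R1.fastq", "in.R2.fastq",
], check=True)
```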

Reduction (mapping/assembly) -- this is where the de novo and mapping paths branch (a command-level sketch of the mapping branch follows the list):

  • Mapping:
    • compile paired, unpaired reads
    • run bwa on the compiled paired and unpaired reads separately
    • [run bwa/other mapper on pacbio reads, then merge bam files]
    • merge paired and unpaired bam files, if they exist
    • sort, index bam file
    • tag the bam file via tagreads
    • run freebayes on the bam file
    • create a consensus fasta
  • De novo:
    • compile paired and unpaired reads (?)
    • run a de novo assembler (sga and spades are both in bioconda, but not ray)
    • try to figure out the number of reads for each contig
    • maybe simulate iterative_blast
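
A command-level sketch of the mapping branch, assuming the standard bwa/samtools/freebayes CLIs; tagreads is this project's own tool, so its invocation is a guess, and every file name is a placeholder:

```python
# Hypothetical end-to-end mapping branch; paths and the tagreads call are assumptions.
import subprocess

def sh(cmd):
    subprocess.run(cmd, shell=True, check=True)

ref = "reference.fasta"
sh(f"bwa index {ref}")
# Map paired and unpaired reads separately, then merge the BAMs.
sh(f"bwa mem {ref} paired.R1.fastq paired.R2.fastq | samtools view -b -o paired.bam -")
sh(f"bwa mem {ref} unpaired.fastq | samtools view -b -o unpaired.bam -")
sh("samtools merge -f merged.bam paired.bam unpaired.bam")
sh("samtools sort -o sorted.bam merged.bam")
sh("samtools index sorted.bam")
sh("tagreads sorted.bam")  # project tool; exact interface assumed
sh(f"freebayes -f {ref} sorted.bam > variants.vcf")
# The consensus step (VCF + reference -> fasta) is discussed below.
```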
@necrolyte2:

sff -> fastq should already be handled by convert_format.jip.
The tool supports any Biopython format for input and output.
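
For reference, the underlying Biopython call for the sff -> fastq case is a one-liner (file names illustrative):

```python
from Bio import SeqIO

# Convert sff to fastq in one call; returns the number of records written.
count = SeqIO.convert("reads.sff", "sff", "reads.fastq", "fastq")
```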

@necrolyte2:

So it seems like the filter and transform steps will likely produce both paired and unpaired output: any time you filter (or cut) an entire read out of one of the paired files, it orphans the mate read in the other file.

@necrolyte2:

I guess the important thing about that is that we will have to accept paired + unpaired for every step that takes in reads
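
A toy illustration of why: if each file of a pair is filtered independently, the survivors have to be re-paired and the leftovers become unpaired reads. This sketch assumes mates share identical record IDs, which is not true of every FASTQ naming scheme:

```python
# Hypothetical re-pairing after independent filtering; assumes mates share
# identical record IDs (not true for /1 and /2 style naming).
from Bio import SeqIO

def repartition(filtered_r1, filtered_r2):
    r1 = SeqIO.to_dict(SeqIO.parse(filtered_r1, "fastq"))
    r2 = SeqIO.to_dict(SeqIO.parse(filtered_r2, "fastq"))
    still_paired = r1.keys() & r2.keys()
    pairs = [(r1[k], r2[k]) for k in still_paired]
    orphans = [r1[k] for k in r1.keys() - still_paired]
    orphans += [r2[k] for k in r2.keys() - still_paired]
    return pairs, orphans
```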

@averagehat:

I welcome ideas about how to test the consensus program (it creates a consensus from a VCF and a reference), especially using hypothesis. See this presentation on property-based testing.

Also worth checking out "Metamorphic Testing" (link specific to bioinformatics).

@necrolyte2:

Haha, this is just so much YES

@necrolyte2:

Seems like the consensus would be fairly straightforward.

Strategy to generate a fasta for the reference
Strategy to generate a list of VCF lines (doesn't have to have the same number of items as the length of the reference)

So then the question becomes: what is the correct consensus?
The consensus should always start out equal to the reference.
Then, at any location where there is a VCF line, you make a decision about whether the reference is correct, based on parameters such as our SOP of 80/20, which also checks for low-depth regions and makes a different call there.
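
A minimal hypothesis sketch of that first property, where make_consensus is a hypothetical stand-in for the real consensus function (not the project's actual API): with no VCF lines, the consensus must equal the reference.

```python
# Property sketch: a consensus built from zero variants equals the reference.
from hypothesis import given
from hypothesis import strategies as st

reference_strategy = st.text(alphabet="ACGT", min_size=1, max_size=500)

@given(reference_strategy)
def test_consensus_without_variants_equals_reference(reference):
    # make_consensus is hypothetical; substitute the real consensus function.
    assert make_consensus(reference, vcf_records=[]) == reference
```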

@averagehat:

Here is a cached link to the paper above.

@averagehat:

Here's what I'm planning (sketched in code after the list):

  1. If the total number of reads is less than 10, call it an N.
  2. If the alternate base accounts for >= 80% of the depth, call that alternate base.
  3. [optional] If there are two alternate bases that together account for >= 80% of the depth, call their ambiguous base. If there are more than two, call an N.
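
A rough sketch of those three rules in code; the function name, the input shape, and the final fall-through to the reference base are all assumptions, not settled behavior:

```python
# Hypothetical sketch of the three proposed calling rules at one position.
AMBIGUITY = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
             frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W"}

def call_base(ref, alt_depths, total_depth):
    """alt_depths: {alternate base: supporting read depth} at one position."""
    if total_depth < 10:                                   # rule 1: too shallow
        return "N"
    ranked = sorted(alt_depths.items(), key=lambda kv: kv[1], reverse=True)
    if ranked and ranked[0][1] >= 0.8 * total_depth:       # rule 2: dominant alternate
        return ranked[0][0]
    if len(ranked) == 2 and sum(d for _, d in ranked) >= 0.8 * total_depth:
        return AMBIGUITY.get(frozenset(b for b, _ in ranked), "N")  # rule 3: two alts
    if len(ranked) > 2:                                    # rule 3: more than two alts
        return "N"
    return ref  # assumption: no dominant alternate, keep the reference base
```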

@necrolyte2:

I feel like there should be a majority (>50%) rule in there somewhere.
Like when depth is below 10, the logic falls back to majority or something...

@averagehat:

Regarding the <10 depth case:

Indeed there is an inconsistency; from here:

N is called if the depth was < 10 and the proportion of N is > 20%.
Note that in that implementation, bases with quality < 25 are assigned as N.

We don't have the quality for each individual base, but we do have the total quality for the ALTS and ref-supporting bases (recorded as QA and QR respectively).
So you could say something like:

(25 comes from the minimum quality; 8 comes from applying the 80% majority threshold to the minimum depth of 10)

  1. "if the total quality of the alternates is < 200 (25 * 8), don't call it an alternate."
  2. "if the total quality of the references is < 200, don't call the reference."
  3. "if neither the reference nor the alternate can be called, call an N."

This seems reasonable (sketched in code below).

edit: I realized that this method doesn't bias at all towards the reference, which I think we want.
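
A sketch of those three quality rules, assuming QA and QR are the summed qualities freebayes reports; the case where both sides are callable is not specified above, so its handling here is a placeholder:

```python
# Hypothetical low-depth (total depth < 10) call based on summed VCF qualities.
MIN_TOTAL_QUALITY = 25 * 8  # min base quality (25) times 80% of the 10-read floor

def low_depth_call(ref, alt, qr, qa):
    alt_callable = qa >= MIN_TOTAL_QUALITY   # rule 1
    ref_callable = qr >= MIN_TOTAL_QUALITY   # rule 2
    if not alt_callable and not ref_callable:
        return "N"                           # rule 3
    if alt_callable and not ref_callable:
        return alt
    if ref_callable and not alt_callable:
        return ref
    # Both callable: unspecified above; deferring to the 80/20 depth rules
    # would be one option. Returning the alternate here is just a placeholder.
    return alt
```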

@averagehat:

@mmelendrez, we could use your input. Please review the discussion starting here:

We are discussing the logic used to call bases from a freebayes VCF file. I am ready to add special logic for the case where the total depth of reads at a point is < 10. Currently, the script always outputs an N in this case. I'm not sure what behavior we want, or how important these extremely low-depth regions are.

It's not possible to exactly duplicate ngs_mapper's current behavior (which is difficult to pin down and requires enumerating the quality of individual bases, which freebayes doesn't provide--it just provides a total) but it could be approximated as described above.

You could argue that this approximation would be misleading: ALT or REF can span multiple bases, so the QR and QA scores are the sums of the qualities of all those bases (I assume), and therefore you can't determine the exact quality score of a given base, just the average. Now, the REF and ALTs span multiple bases for a reason, mainly that those bases occur together in the reads. The situation is closely related to the way we're handling depth (which is also not pinned down to individual bases at individual positions). The difference is that we assume depth means the total number of reads supporting all those positions (assuming all bases have approximately equal depth), whereas quality is the sum over multiple bases (which can have widely varying quality).

All of this, in my opinion, suggests that something simple should be done, and either:
a) it's up to the PI to look at it more closely if they want, or
b) we do some more complex processing with base-by-base information to complement freebayes at these low-depth regions.

@averagehat:

@InaMBerry may also like to contribute to the discussion starting here.

@mmelendrez:

Let me get my head around this. A convo might work better for me, to make sure I understand what freebayes can and cannot do when implemented in our pipeline, since it cannot emulate what ngs_mapper does right now (taking per-base quality into account when calling bases). Let's chat tomorrow. Then I can summarize our thoughts here, and @InaMBerry and @necrolyte2 can weigh in.

@averagehat:

Here's a listing of all the information freebayes provides by default: https://gist.github.com/averagehat/29c4ad1f3a7837fb2f82#file-gistfile1-txt
