Recommendation for normalization methods when making comparisons across samples #9
Hi Meng-Chun,
Sorry for the late reply -- I was away from all internet access for over a week and am still catching up on emails. Can I ask how many reads you're using for your samples? The pseudocount number is so low on Pair 1 and 2 (and even 4) that I worry that your coverage is not sufficient. Did you use a MiSeq? (For Drosophila, I'd recommend aiming for ~10M+ mapped reads per sample -- see my 2015 paper for a rough comparison of the effect of read depth on signal reproducibility). There must have been a fairly significant multiplexing error to give Pair 3 so many more reads, also, unless this came from a different sequencing run?
Another cause for concern is the norm factors you're seeing -- it is very odd for a Dam-fusion protein to require a norm factor less than 1 (it's not impossible, but it's unusual, all other things being equal). Again, what did the multiplexing / numbers of mapped reads look like?
The correlation is calculated between all GATC fragments. You're correct about the rationale behind the normalisation, too. Ideally coverage should be similar and pseudocounts should be similar, and yes, your numbers will make a comparison between samples difficult!
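For intuition only, here's a minimal pure-Python sketch of a Spearman rank correlation computed over per-GATC-fragment read counts (toy numbers and my own helper names; this is not the pipeline's actual code):

```python
# Illustrative sketch (not the pipeline's implementation): Spearman
# correlation between Dam-only and fusion read counts per GATC fragment.

def ranks(xs):
    # Assign ranks 1..n; the toy data has no ties, so no tie-averaging.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(a, b):
    # Pearson correlation computed on the ranks of each series.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

dam    = [5, 12, 0, 33, 7, 2]   # toy reads per GATC fragment, Dam-only
fusion = [9, 20, 1, 60, 11, 3]  # toy reads per GATC fragment, fusion
rho = spearman(dam, fusion)     # monotone toy data, so rho is 1.0
```

With real data, a low value here means the Dam-only and fusion samples rank fragments very differently, which is what triggers the pipeline's low-correlation warning.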
In short, I'd be more than a little worried that a lack of sequencing coverage is playing havoc here. Can you provide some more details / numbers?
Best wishes,
Owen
Owen Marshall, PhD
Menzies Institute for Medical Research
University of Tasmania
17 Liverpool St
Hobart TAS 7000
Australia
ph: +61 3 6226 4248
www.marshall-lab.org
@OMarshall_lab (https://twitter.com/OMarshall_lab)
On Wed, 24 Jul 2019 at 02:02, mctseng2 wrote:
Hi Owen,
I am analyzing some targeted DamID samples for colleagues. We have 4 pairs of PolII-fusion/Dam-only samples from Drosophila and would like to compare the binding patterns between them on targeted genes (and possibly mine novel genes). I ran the default kernel density normalization method and found that the 4 samples have very different normalization parameters:
pair    correlation    normFactor    pseudocounts    warning (recommend rpm normalization)
pair1   0.16           0.08          0.62            Y
pair2   0.15           0.03          0.59            Y
pair3   0.55           0.13          63.07           N
pair4   0.46           0.34          6.52            N
When we looked at the results for a gene, we compared them to the plain log2(RPKM_Fusion/RPKM_Dam) that I calculated manually. I found that the signal of the 3rd and 4th pairs has been wiped out quite a bit (possibly due to the large pseudocounts added), while the first 2 pairs look normal. If I understand the idea of kernel normalization correctly, the method should correct the negatively biased signal and make it more positive, thus increasing experimental power. However, will the huge variation in the pseudocounts added make the comparison of processed values across samples difficult? If so, what normalization method would you recommend?
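To illustrate the worry with toy numbers (these RPKM values are made up, not taken from our data): a large pseudocount pulls the log2 ratio toward zero, flattening real signal.

```python
import math

def log2_ratio(fusion_rpkm, dam_rpkm, pseudocount):
    # Pseudocounted occupancy score for a single GATC fragment.
    return math.log2((fusion_rpkm + pseudocount) / (dam_rpkm + pseudocount))

# The same hypothetical fragment (fusion=8, dam=2 RPKM) under two regimes:
small_pc = log2_ratio(8, 2, 0.59)   # pseudocount as reported for pairs 1-2
large_pc = log2_ratio(8, 2, 63.07)  # pseudocount as reported for pair 3

# small_pc is ~1.7, but large_pc shrinks to ~0.13: the same 4-fold
# enrichment nearly disappears under the large pseudocount.
```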
Also, I have a question about how the correlation is calculated. Does it only consider GATC fragments? The first two pairs seem to have very low correlation values, and I got a warning about them when running the pipeline. When I ran the correlation analysis manually using deepTools, I got higher values.
Thanks in advance for looking at my questions. I've also included an IGV shot of these 4 samples for reference. The first two tracks are the RPKM coverage tracks for the Fusion and Dam-only samples. The third track (in red) is the pipeline output. The fourth track shows the manually calculated log2((RPKM_Fusion+1)/(RPKM_Dam+1)) values. The pipeline-generated values of the bottom 2 pairs are smaller compared to the manually calculated log2 RPKM ratio, which is concerning to us.
igv_snapshot: https://user-images.githubusercontent.com/32713330/61727577-38407400-ad39-11e9-841c-8fcab50ae366.png
Regards,
Meng-Chun
Hi Owen,

Thanks for your reply; it is really helpful. It could be a coverage issue, but I couldn't figure out which step went wrong. The sequencing was all done together, and the library sizes are ample, ranging between 30M and 60M reads. The mapping rates are also good, between 70% and 95%, and the reads mostly align to the main chromosomes.

However, I found something odd in the log file. When the software tried to generate the new bam files, one sample in the pair lost a significant number of reads (from an initial 42,011,068 aligned reads down to 109,444). I am aware that the GATC file is only built on the main chromosomes, but that shouldn't be the issue, as the alignments are mostly on the main chromosomes. How does the new bam file get created? Maybe the PCR duplicates got removed?

I've attached the log file from pair 2 below. Thanks in advance for helping me troubleshoot.

Meng-Chun

damidseq_pipeline v1.4
Command-line options: --gatc_frag_file=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/gatc_file/dmel-r6.28.GATC.gff --bowtie2_genome_dir=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/bowtie2-dmel-r6.28/dmel-r6.28 --threads=7
*** Reading data files ***
*** Aligning files with bowtie2 ***
Now working on Dam ...
53989947 reads; of these:
53989947 (100.00%) were unpaired; of these:
1988718 (3.68%) aligned 0 times
42011068 (77.81%) aligned exactly 1 time
9990161 (18.50%) aligned >1 times
96.32% overall alignment rate
Now working on PolII ...
63228299 reads; of these:
63228299 (100.00%) were unpaired; of these:
2337292 (3.70%) aligned 0 times
57348566 (90.70%) aligned exactly 1 time
3542441 (5.60%) aligned >1 times
96.30% overall alignment rate
*** Reading GATC file ***
*** Extending reads up to 300 bases ***
Reading input file: Dam ...
Processing data ...
Warning: alignment contains chromosome identities not found in GATC file:
211000022278132
211000022278151
211000022278611
......
Y_mapped_Scaffold_30_D1720
Y_mapped_Scaffold_34_D1584
Y_mapped_Scaffold_5_D1748_D1610
Y_mapped_Scaffold_9_D1573
mitochondrion_genome
Seqs extended (>q30) = 38778735
Reading input file: PolII ...
Processing data ...
Warning: alignment contains chromosome identities not found in GATC file:
211000022278164
211000022279116
...
Seqs extended (>q30) = 54358922
*** Calculating bins ***
Now working on Dam-ext300 ...
Generating .bam file ...
sorting ...
109444 reads <<----this went low
Generating bins from Dam-ext300.bam ...
Converting to GATC resolution ...
Now working on PolII-ext300 ...
Generating .bam file ...
sorting ...
54358922 reads
Generating bins from PolII-ext300.bam ...
Converting to GATC resolution ...
*** Calculating quantiles ***
Now working on Dam ...
Sorting ...
Quantile 0.1: 0.17
Quantile 0.2: 0.33
Quantile 0.3: 0.44
Quantile 0.4: 0.50
Quantile 0.5: 0.71
Quantile 0.6: 1.00
Quantile 0.7: 1.00
Quantile 0.8: 1.43
Quantile 0.9: 2.00
Quantile 1.0: 13.00
Now working on PolII ...
Sorting ...
Quantile 0.1: 1.45
Quantile 0.2: 4.00
Quantile 0.3: 9.67
Quantile 0.4: 20.00
Quantile 0.5: 38.33
Quantile 0.6: 71.67
Quantile 0.7: 132.83
Quantile 0.8: 248.00
Quantile 0.9: 515.00
Quantile 1.0: 9381.50
*** Calculating normalisation factor ***
Now working on PolII ...
Spearman's correlation: 0.15
*** Warning: low correlation -- kernel density estimation may not be the best normalisation method for this dataset. Consider using readcount normalisation instead (using --norm_method=rpm) ...
Norm factor = 0.03 based off 48163 frags (total 386533)
*** Normalising ***
Processing sample: PolII ...
... normalising by 0.03
*** Generating ratios ***
Now working on PolII ...
Reading Dam ...
Reading PolII ...
... adding 0.59 pseudocounts to each sample
All done.
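To put the drop in perspective, here is a tiny sketch using the counts from the log above (the helper function is mine, not part of the pipeline):

```python
def read_retention(extended_reads, bam_reads):
    # Fraction of extended (>q30) reads that survive BAM regeneration.
    return bam_reads / extended_reads

dam_frac   = read_retention(38_778_735, 109_444)     # Dam sample, from log
polii_frac = read_retention(54_358_922, 54_358_922)  # PolII sample, from log

# dam_frac is ~0.003: over 99.7% of the Dam reads vanished at this step,
# while the PolII sample came through intact under the very same run.
```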
That's very strange. Would you be able to email me directly (owen.marshall at utas.edu.au) regarding this? I'd ideally like to get a copy of your Dam-only file and see what on earth is going on with it. It definitely doesn't make sense that only 100K reads should come back from 38.8M reads >q30 -- especially when your PolII sample has processed just fine under the same run.
Thanks and best wishes,
Owen
Hello both, Thank you very much in advance.
Hi Laura, I think there are two causes for this bug. The simple fix is to upgrade samtools to at least version 1.9, which should (hopefully) let it process fine, although you're only the second person to report this, so it's difficult to tell. There may also be a bug in the pipeline code that affects rare and unusual alignments and is in turn causing old versions of samtools to exit early; I need to check this further. But please let me know if upgrading samtools fixes the problem? Thanks,
Hi Owen, Thanks a lot,
Great to hear, Laura. Yes, we've switched over to PE reads completely now, and I can't imagine going back to SE (the only thing I'd suggest is that you don't mix SE and PE samples, as it will create some odd results in repetitive regions given the better mapping of PE data). (Would still like to know exactly what's going on here, so will leave this open in case it crops up again.) Cheers,