Recommendation for normalization methods when making comparison across samples #9

mctseng2 · 2019-07-23T16:02:08Z

Hi Owen,

I am analyzing some targeted DamID samples for colleagues. We have 4 pairs of polII Fusion/DamOnly samples from drosophila and would like to compare the binding pattern between them on targeted genes (and possible mining novel genes). I ran through the default kernel density normalization methods and find that the 4 samples have very different normalization parameters:

	correlation	normFactor	pseudocounts	warning(recommend rpm normalizatiom)
pair1	0.16	0.08	0.62	Y
pair2	0.15	0.03	0.59	Y
pair3	0.55	0.13	63.07	N
pair4	0.46	0.34	6.52	N

When we looked at the results of a gene. We compared those to the plain log2(RPKM-Fusion/RPKM_Dam) which I calculated manually. I found that the signal of 3rd and 4th pairs have been wiped out quite a bit (possibly due to large pseudo counts added) while the first 2 pairs look normal. If I understand the idea of kernel normalization correctly, the method should correct the negative-biased signal and makes it more positive thus increasing experimental power. However, will the huge variation of the pseudo counts added making the comparison of processed values across sample difficult? If so, what normalization method would you recommend?

Also, I have a question of how the correlation calculated. Are they only focusing on GATC region? The first two pairs seem to have very low correlation values and I got a warning for it when running the pipeline. I ran the correlation analysis manually using deeptools and got higher values.

Thanks in advance for viewing my questions. I've also included a IGV shot of these 4 samples for the reference. The first two tracks were the RPKM coverage tracks for Fusion and DamOnly samples. The third track (in red) is the pipeline output. The fourth track is the manually calculated log(RPKM_Fusion+1/RPKM_Dam=1) vales. The pipeline-generated values of the bottom 2 pairs are smaller compare to manually calculated log2 RPMK ratio which is concerning to us.

Regards,

Meng-Chun

owenjm · 2019-08-05T05:11:08Z

Hi Meng-Chun, Sorry for the late reply -- I was away from all internet access for over a week and am still catching up on emails. Can I ask how many reads you're using for your samples? The pseudocount number is so low on Pair 1 and 2 (and even 4) that I worry that your coverage is not sufficient. Did you use a MiSeq? (For Drosophila, I'd recommend aiming for ~10M+ mapped reads per sample -- see my 2015 paper for a rough comparison of the effect of read depth on signal reproducibility). There must have been a fairly significant multiplexing error to give Pair 3 so many more reads, also, unless this came from a different sequencing run? Another cause for concern is the norm factors you're seeing -- it is very odd for a Dam-fusion protein to require a norm factor less than 1 (it's not impossible, but it's unusual, all other things being equal). Again, what did the multiplexing / numbers of mapped reads look like? Correlation is the correlation between all GATC fragments. You're correct about the rationale behind the normalisation, too. Ideally coverage should be similar and pseudocounts should be similar, and yes, your numbers will make a comparison between sample difficult! In short, I'd be more than a little worried that a lack of sequencing coverage is playing havoc here. Can you provide some more details / numbers? Best wishes, Owen Owen Marshall, PhD Menzies Institute for Medical Research University of Tasmania 17 Liverpool St Hobart TAS 7000 Australia ph: +61 3 6226 4248 www.marshall-lab.org<http://www.marshall-lab.org> @OMarshall_lab<https://twitter.com/OMarshall_lab> On Wed, 24 Jul 2019 at 02:02, mctseng2 <[email protected]<mailto:[email protected]>> wrote: Hi Owen, I am analyzing some targeted DamID samples for colleagues. We have 4 pairs of polII Fusion/DamOnly samples from drosophila and would like to compare the binding pattern between them on targeted genes (and possible mining novel genes). I ran through the default kernel density normalization methods and find that the 4 samples have very different normalization parameters: correlation normFactor pseudocounts warning(recommend rpm normalizatiom) pair1 0.16 0.08 0.62 Y pair2 0.15 0.03 0.59 Y pair3 0.55 0.13 63.07 N pair4 0.46 0.34 6.52 N When we looked at the results of a gene. We compared those to the plain log2(RPKM-Fusion/RPKM_Dam) which I calculated manually. I found that the signal of 3rd and 4th pairs have been wiped out quite a bit (possibly due to large pseudo counts added) while the first 2 pairs look normal. If I understand the idea of kernel normalization correctly, the method should correct the negative-biased signal and makes it more positive thus increasing experimental power. However, will the huge variation of the pseudo counts added making the comparison of processed values across sample difficult? If so, what normalization method would you recommend? Also, I have a question of how the correlation calculated. Are they only focusing on GATC region? The first two pairs seem to have very low correlation values and I got a warning for it when running the pipeline. I ran the correlation analysis manually using deeptools and got higher values. Thanks in advance for viewing my questions. I've also included a IGV shot of these 4 samples for the reference. The first two tracks were the RPKM coverage tracks for Fusion and DamOnly samples. The third track (in red) is the pipeline output. The fourth track is the manually calculated log(RPKM_Fusion+1/RPKM_Dam=1) vales. The pipeline-generated values of the bottom 2 pairs are smaller compare to manually calculated log2 RPMK ratio which is concerning to us. [igv_snapshot]<https://user-images.githubusercontent.com/32713330/61727577-38407400-ad39-11e9-841c-8fcab50ae366.png> Regards, Meng-Chun — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub<#9?email_source=notifications&email_token=AB4TA6YTVSPKJPVUIAJ7UQ3QA4TQDA5CNFSM4IGGDN72YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HA63GJQ>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AB4TA63D67NBGTTR7DVVB2DQA4TQDANCNFSM4IGGDN7Q>. University of Tasmania Electronic Communications Policy (December, 2014). This email is confidential, and is for the intended recipient only. Access, disclosure, copying, distribution, or reliance on any of it by anyone outside the intended recipient organisation is prohibited and may be a criminal offence. Please delete if obtained in error and email confirmation to the sender. The views expressed in this email are not necessarily the views of the University of Tasmania, unless clearly intended otherwise.

mctseng2 · 2019-09-16T19:06:37Z

Hi Owen,

Thanks for your reply. It is really helpful. It could be the coverage issue but I couldn't figure out which step went wrong. The sequencing was all done together and the libraries size are plenty, ranged between 30M and 60M reads. The mapping rate is also good between 70% - 95%. The reads are mostly aligned to main chromosomes.

However, I found something weird on the log file. When the software trying to generate new bam files, one of the samples in the pair lost a significant amount of reads (from the bigining of 42,011,068 aligned reads to 109,444). I was aware that the GATC is only built on main chromosomes, but it shouldn't be the issue as the alignment is mostly on main chromosomes. How does the new bam file creator work? Maybe the PCR duplicates got removed?

I've attached the log file from pair 2 below. Thanks in advance for helping me do the troubleshooting.

Meng-Chun

damidseq_pipeline v1.4
Command-line options: --gatc_frag_file=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/gatc_file/dmel-r6.28.GATC.gf
f --bowtie2_genome_dir=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/bowtie2-dmel-r6.28/dmel-r6.28 --threads=7

*** Reading data files ***

*** Aligning files with bowtie2 ***
Now working on Dam ...
53989947 reads; of these:
  53989947 (100.00%) were unpaired; of these:
    1988718 (3.68%) aligned 0 times
    42011068 (77.81%) aligned exactly 1 time
    9990161 (18.50%) aligned >1 times
96.32% overall alignment rate

Now working on PolII ...
63228299 reads; of these:
  63228299 (100.00%) were unpaired; of these:
    2337292 (3.70%) aligned 0 times
    57348566 (90.70%) aligned exactly 1 time
    3542441 (5.60%) aligned >1 times
96.30% overall alignment rate


*** Reading GATC file ***
*** Extending reads up to 300 bases ***
Reading input file: Dam ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278132
    211000022278151
    211000022278611
......
    Y_mapped_Scaffold_30_D1720
    Y_mapped_Scaffold_34_D1584
    Y_mapped_Scaffold_5_D1748_D1610
    Y_mapped_Scaffold_9_D1573
    mitochondrion_genome

  Seqs extended (>q30) = 38778735

Reading input file: PolII ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278164
    211000022279116
   ...
  Seqs extended (>q30) = 54358922

*** Calculating bins ***
Now working on Dam-ext300 ...
Generating .bam file ...
  sorting ...
  109444 reads <<----this went low
  Generating bins from Dam-ext300.bam ...
  Converting to GATC resolution ...


Now working on PolII-ext300 ...
Generating .bam file ...
  sorting ...
  54358922 reads
  Generating bins from PolII-ext300.bam ...
  Converting to GATC resolution ...


*** Calculating quantiles ***
Now working on Dam ...
Sorting ...
   Quantile 0.1: 0.17
   Quantile 0.2: 0.33
   Quantile 0.3: 0.44
   Quantile 0.4: 0.50
   Quantile 0.5: 0.71
   Quantile 0.6: 1.00
   Quantile 0.7: 1.00
   Quantile 0.8: 1.43
   Quantile 0.9: 2.00
   Quantile 1.0: 13.00

Now working on PolII ...
Sorting ...
   Quantile 0.1: 1.45
   Quantile 0.2: 4.00
   Quantile 0.3: 9.67
   Quantile 0.4: 20.00
   Quantile 0.5: 38.33
   Quantile 0.6: 71.67
   Quantile 0.7: 132.83
   Quantile 0.8: 248.00
   Quantile 0.9: 515.00
   Quantile 1.0: 9381.50

*** Calculating normalisation factor ***
Now working on PolII ...
  Spearman's correlation: 0.15
  *** Warning: low correlation -- kernel density estimation may not be the best normalisation method for this dataset.  Consider using readcount normalisation instead (using --norm_method=rpm) ...

  Norm factor = 0.03 based off 48163 frags (total 386533)

*** Normalising ***
Processing sample: PolII ...
  ... normalising by 0.03

*** Generating ratios ***
Now working on PolII ...
  Reading Dam ...
  Reading PolII ...
  ... adding 0.59 pseudocounts to each sample

All done.

owenjm · 2019-09-17T00:02:05Z

That's very strange. Would you be able to email me directly (owen.marshall at utas.edu.au) regarding this? I'd ideally like to get a copy of your Dam-only file and see what on earth is going on with it. It definitely doesn't make sense that only 100K reads should come back from 38.8M reads >q30 -- especially when your PolII sample has processed just fine under the same run. Thanks and best wishes, Owen

…

On Tue, 17 Sep 2019 at 05:06, mctseng2 ***@***.***> wrote: Hi Owen, Thanks for your reply. It is really helpful. It could be the coverage issue but I couldn't figure out which step went wrong. The sequencing was all done together and the libraries size are plenty, ranged between 30M and 60M reads. The mapping rate is also good between 70% - 95%. The reads are mostly aligned to main chromosomes. However, I found something weird on the log file. When the software trying to generate new bam files, one of the samples in the pair lost a significant amount of reads (from the bigining of 42,011,068 aligned reads to 109,444). I was aware that the GATC is only built on main chromosomes, but it shouldn't be the issue as the alignment is mostly on main chromosomes. How does the new bam file creator work? Maybe the PCR duplicates got removed? I've attached the log file from pair 2 below. Thanks in advance for helping me do the troubleshooting. `damidseq_pipeline v1.4 Command-line options: --gatc_frag_file=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/gatc_file/ dmel-r6.28.GATC.gf f --bowtie2_genome_dir=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/bowtie2-dmel-r6.28/dmel-r6.28 --threads=7 *** Reading data files *** *** Aligning files with bowtie2 *** Now working on Dam ... 53989947 reads; of these: 53989947 (100.00%) were unpaired; of these: 1988718 (3.68%) aligned 0 times 42011068 (77.81%) aligned exactly 1 time 9990161 (18.50%) aligned >1 times 96.32% overall alignment rate Now working on PolII ... 63228299 reads; of these: 63228299 (100.00%) were unpaired; of these: 2337292 (3.70%) aligned 0 times 57348566 (90.70%) aligned exactly 1 time 3542441 (5.60%) aligned >1 times 96.30% overall alignment rate *** Reading GATC file *** *** Extending reads up to 300 bases *** Reading input file: Dam ... Processing data ... Warning: alignment contains chromosome identities not found in GATC file: 211000022278132 211000022278151 211000022278611 ...... Y_mapped_Scaffold_30_D1720 Y_mapped_Scaffold_34_D1584 Y_mapped_Scaffold_5_D1748_D1610 Y_mapped_Scaffold_9_D1573 mitochondrion_genome Seqs extended (>q30) = 38778735 Reading input file: PolII ... Processing data ... Warning: alignment contains chromosome identities not found in GATC file: 211000022278164 211000022279116 ... Seqs extended (>q30) = 54358922 *** Calculating bins *** Now working on Dam-ext300 ... Generating .bam file ... sorting ... 109444 reads <<----this went low Generating bins from Dam-ext300.bam ... Converting to GATC resolution ... Now working on PolII-ext300 ... Generating .bam file ... sorting ... 54358922 reads Generating bins from PolII-ext300.bam ... Converting to GATC resolution ... *** Calculating quantiles *** Now working on Dam ... Sorting ... Quantile 0.1: 0.17 Quantile 0.2: 0.33 Quantile 0.3: 0.44 Quantile 0.4: 0.50 Quantile 0.5: 0.71 Quantile 0.6: 1.00 Quantile 0.7: 1.00 Quantile 0.8: 1.43 Quantile 0.9: 2.00 Quantile 1.0: 13.00 Now working on PolII ... Sorting ... Quantile 0.1: 1.45 Quantile 0.2: 4.00 Quantile 0.3: 9.67 Quantile 0.4: 20.00 Quantile 0.5: 38.33 Quantile 0.6: 71.67 Quantile 0.7: 132.83 Quantile 0.8: 248.00 Quantile 0.9: 515.00 Quantile 1.0: 9381.50 *** Calculating normalisation factor *** Now working on PolII ... Spearman's correlation: 0.15 *** Warning: low correlation -- kernel density estimation may not be the best normalisation method for this dataset. Consider using readcount normalisation instead (using --norm_method=rpm) ... Norm factor = 0.03 based off 48163 frags (total 386533) *** Normalising *** Processing sample: PolII ... ... normalising by 0.03 *** Generating ratios *** Now working on PolII ... Reading Dam ... Reading PolII ... ... adding 0.59 pseudocounts to each sample All done. ` — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#9?email_source=notifications&email_token=AB4TA63TXQRQAIMGFVMPCA3QJ7KL5A5CNFSM4IGGDN72YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD62GAGY#issuecomment-531914779>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AB4TA65AJ3BQ36YHWSIHAXTQJ7KL5ANCNFSM4IGGDN7Q> .

laugon10 · 2020-11-09T10:35:34Z

Hello both,
I am just having the same issue. Have you find any reason for this? I am attaching my log.

Thank you very much in advance.
Laura

damidseq_pipeline v1.4.5
Command-line options: --gatc_frag_file=/damidseq_pipeline/files_TaDA/dmel_r6.36.GATC.gff --bowtie2_genome_dir=/damidseq_pipeline/files_TaDA/Dm_r6.36/dmel_r6.36 --bowtie2_path=/bowtie2/2.2.3/ --samtools_path=/samtools/1.6/ --bedtools_path=/apps/bedtools/ --dam=MZ_Dam_2_1.fastq.gz MZ_DamPOL_2_1.fastq.gz

*** Reading data files ***
  Using MZ.1 as Dam control.

Using samtools version 1.6

*** Aligning files with bowtie2 ***
Now working on MZ.1 ...
26358976 reads; of these:
  26358976 (100.00%) were unpaired; of these:
    718878 (2.73%) aligned 0 times
    23912933 (90.72%) aligned exactly 1 time
    1727165 (6.55%) aligned >1 times
97.27% overall alignment rate

Now working on MZ ...
27181047 reads; of these:
  27181047 (100.00%) were unpaired; of these:
    3961397 (14.57%) aligned 0 times
    22267913 (81.92%) aligned exactly 1 time
    951737 (3.50%) aligned >1 times
85.43% overall alignment rate


*** Reading GATC file ***
*** Extending reads up to 300 bases ***
Reading input file: MZ.1 ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278521
    211000022278673
   .........
    Y_mapped_Scaffold_9_D1573
    mitochondrion_genome

  Seqs extended (>q30) = 21067303

Reading input file: MZ ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278164
    211000022278311
    ................
    Y_mapped_Scaffold_9_D1573
    mitochondrion_genome

  Seqs extended (>q30) = 18004292

*** Calculating bins ***
Now working on MZ.1-ext300 ...
Generating .bam file ...
  sorting ...
  16613660 reads
  Generating bins from MZ.1-ext300.bam ...
  Converting to GATC resolution ...                 


Now working on MZ-ext300 ...
Generating .bam file ...
  sorting ...
  793234 reads <-- low reads
  Generating bins from MZ-ext300.bam ...
  Converting to GATC resolution ...                 


*** Calculating quantiles ***
Now working on MZ.1 ...
Sorting ...
   Quantile 0.1: 1.00
   Quantile 0.2: 2.00
   Quantile 0.3: 3.75
   Quantile 0.4: 6.29
   Quantile 0.5: 10.40
   Quantile 0.6: 17.00
   Quantile 0.7: 29.00
   Quantile 0.8: 53.80
   Quantile 0.9: 123.29
   Quantile 1.0: 10260.33

Now working on MZ ...
Sorting ...
   Quantile 0.1: 0.25
   Quantile 0.2: 0.45
   Quantile 0.3: 0.62
   Quantile 0.4: 1.00
   Quantile 0.5: 1.00
   Quantile 0.6: 1.67
   Quantile 0.7: 2.50
   Quantile 0.8: 4.50
   Quantile 0.9: 11.25
   Quantile 1.0: 2870.40

*** Calculating normalisation factor ***
Now working on MZ ...
  Spearman's correlation: 0.49  
  Norm factor = 32.00 based off 100605 frags (total 389223)

*** Normalising ***
Processing sample: MZ ...
  ... normalising by 32.00

*** Generating ratios ***
Now working on MZ ...
  Reading Dam ...
  Reading MZ ...
  ... adding 4.32 pseudocounts to each sample

All done.

owenjm · 2020-11-11T03:01:34Z

Hi Laura,

I think there are two causes for this bug. The simple fix is to upgrade samtools to at least version 1.9, and it should (hopefully) process fine; although you're only the second person to report this so it's difficult to tell.

There may also be a bug in the pipeline code that affects rare and unusual alignments, and which in turn is causing old versions of samtools to exit early. I need to check this further.

But if please let me know if upgrading samtools fixes the problem?

Thanks,
Owen

laugon10 · 2020-11-11T09:03:48Z

Hi Owen,
Thank you very much for your fast answer. I was using samtools 1.6 so that is maybe the problem (I will try with an updated samtools) and also I have paired-end reads but I used only one of the reads. I just saw that the last version accepted paired-end reads and this has just worked perfectly.

Thanks a lot,
Laura

owenjm · 2020-11-11T12:53:21Z

Great to hear, Laura. Yes, we've switched over to PE reads completely now, and I can't imagine going back to SE (the only thing I'd suggest is that you don't mix SE and PE samples, as it will create some odd results in repetitive regions given the better mapping of PE data).

(Would still like to know exactly what's going on here, so will leave this open in case it crops up again.)

Cheers,
Owen

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recommendation for normalization methods when making comparison across samples #9

Recommendation for normalization methods when making comparison across samples #9

mctseng2 commented Jul 23, 2019 •

edited

Loading

owenjm commented Aug 5, 2019 via email

mctseng2 commented Sep 16, 2019 •

edited

Loading

owenjm commented Sep 17, 2019 via email

laugon10 commented Nov 9, 2020

owenjm commented Nov 11, 2020

laugon10 commented Nov 11, 2020

owenjm commented Nov 11, 2020

Recommendation for normalization methods when making comparison across samples #9

Recommendation for normalization methods when making comparison across samples #9

Comments

mctseng2 commented Jul 23, 2019 • edited Loading

owenjm commented Aug 5, 2019 via email

mctseng2 commented Sep 16, 2019 • edited Loading

owenjm commented Sep 17, 2019 via email

laugon10 commented Nov 9, 2020

owenjm commented Nov 11, 2020

laugon10 commented Nov 11, 2020

owenjm commented Nov 11, 2020

mctseng2 commented Jul 23, 2019 •

edited

Loading

mctseng2 commented Sep 16, 2019 •

edited

Loading