Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommendation for normalization methods when making comparison across samples #9

Open
mctseng2 opened this issue Jul 23, 2019 · 7 comments

Comments

@mctseng2
Copy link

mctseng2 commented Jul 23, 2019

Hi Owen,

I am analyzing some targeted DamID samples for colleagues. We have 4 pairs of polII Fusion/DamOnly samples from drosophila and would like to compare the binding pattern between them on targeted genes (and possible mining novel genes). I ran through the default kernel density normalization methods and find that the 4 samples have very different normalization parameters:

  correlation normFactor pseudocounts warning(recommend rpm normalizatiom)
pair1 0.16 0.08 0.62 Y
pair2 0.15 0.03 0.59 Y
pair3 0.55 0.13 63.07 N
pair4 0.46 0.34 6.52 N

When we looked at the results of a gene. We compared those to the plain log2(RPKM-Fusion/RPKM_Dam) which I calculated manually. I found that the signal of 3rd and 4th pairs have been wiped out quite a bit (possibly due to large pseudo counts added) while the first 2 pairs look normal. If I understand the idea of kernel normalization correctly, the method should correct the negative-biased signal and makes it more positive thus increasing experimental power. However, will the huge variation of the pseudo counts added making the comparison of processed values across sample difficult? If so, what normalization method would you recommend?

Also, I have a question of how the correlation calculated. Are they only focusing on GATC region? The first two pairs seem to have very low correlation values and I got a warning for it when running the pipeline. I ran the correlation analysis manually using deeptools and got higher values.

Thanks in advance for viewing my questions. I've also included a IGV shot of these 4 samples for the reference. The first two tracks were the RPKM coverage tracks for Fusion and DamOnly samples. The third track (in red) is the pipeline output. The fourth track is the manually calculated log(RPKM_Fusion+1/RPKM_Dam=1) vales. The pipeline-generated values of the bottom 2 pairs are smaller compare to manually calculated log2 RPMK ratio which is concerning to us.
igv_snapshot-1

Regards,

Meng-Chun

@owenjm
Copy link
Owner

owenjm commented Aug 5, 2019 via email

@mctseng2
Copy link
Author

mctseng2 commented Sep 16, 2019

Hi Owen,

Thanks for your reply. It is really helpful. It could be the coverage issue but I couldn't figure out which step went wrong. The sequencing was all done together and the libraries size are plenty, ranged between 30M and 60M reads. The mapping rate is also good between 70% - 95%. The reads are mostly aligned to main chromosomes.

However, I found something weird on the log file. When the software trying to generate new bam files, one of the samples in the pair lost a significant amount of reads (from the bigining of 42,011,068 aligned reads to 109,444). I was aware that the GATC is only built on main chromosomes, but it shouldn't be the issue as the alignment is mostly on main chromosomes. How does the new bam file creator work? Maybe the PCR duplicates got removed?

I've attached the log file from pair 2 below. Thanks in advance for helping me do the troubleshooting.

Meng-Chun

damidseq_pipeline v1.4
Command-line options: --gatc_frag_file=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/gatc_file/dmel-r6.28.GATC.gf
f --bowtie2_genome_dir=/home/groups/hpcbio/RNA-Seq/projects/lixin/2019JunDam-ID-Seq/data/genome/bowtie2-dmel-r6.28/dmel-r6.28 --threads=7

*** Reading data files ***

*** Aligning files with bowtie2 ***
Now working on Dam ...
53989947 reads; of these:
  53989947 (100.00%) were unpaired; of these:
    1988718 (3.68%) aligned 0 times
    42011068 (77.81%) aligned exactly 1 time
    9990161 (18.50%) aligned >1 times
96.32% overall alignment rate

Now working on PolII ...
63228299 reads; of these:
  63228299 (100.00%) were unpaired; of these:
    2337292 (3.70%) aligned 0 times
    57348566 (90.70%) aligned exactly 1 time
    3542441 (5.60%) aligned >1 times
96.30% overall alignment rate


*** Reading GATC file ***
*** Extending reads up to 300 bases ***
Reading input file: Dam ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278132
    211000022278151
    211000022278611
......
    Y_mapped_Scaffold_30_D1720
    Y_mapped_Scaffold_34_D1584
    Y_mapped_Scaffold_5_D1748_D1610
    Y_mapped_Scaffold_9_D1573
    mitochondrion_genome

  Seqs extended (>q30) = 38778735

Reading input file: PolII ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278164
    211000022279116
   ...
  Seqs extended (>q30) = 54358922

*** Calculating bins ***
Now working on Dam-ext300 ...
Generating .bam file ...
  sorting ...
  109444 reads <<----this went low
  Generating bins from Dam-ext300.bam ...
  Converting to GATC resolution ...


Now working on PolII-ext300 ...
Generating .bam file ...
  sorting ...
  54358922 reads
  Generating bins from PolII-ext300.bam ...
  Converting to GATC resolution ...


*** Calculating quantiles ***
Now working on Dam ...
Sorting ...
   Quantile 0.1: 0.17
   Quantile 0.2: 0.33
   Quantile 0.3: 0.44
   Quantile 0.4: 0.50
   Quantile 0.5: 0.71
   Quantile 0.6: 1.00
   Quantile 0.7: 1.00
   Quantile 0.8: 1.43
   Quantile 0.9: 2.00
   Quantile 1.0: 13.00

Now working on PolII ...
Sorting ...
   Quantile 0.1: 1.45
   Quantile 0.2: 4.00
   Quantile 0.3: 9.67
   Quantile 0.4: 20.00
   Quantile 0.5: 38.33
   Quantile 0.6: 71.67
   Quantile 0.7: 132.83
   Quantile 0.8: 248.00
   Quantile 0.9: 515.00
   Quantile 1.0: 9381.50

*** Calculating normalisation factor ***
Now working on PolII ...
  Spearman's correlation: 0.15
  *** Warning: low correlation -- kernel density estimation may not be the best normalisation method for this dataset.  Consider using readcount normalisation instead (using --norm_method=rpm) ...

  Norm factor = 0.03 based off 48163 frags (total 386533)

*** Normalising ***
Processing sample: PolII ...
  ... normalising by 0.03

*** Generating ratios ***
Now working on PolII ...
  Reading Dam ...
  Reading PolII ...
  ... adding 0.59 pseudocounts to each sample

All done.

@owenjm
Copy link
Owner

owenjm commented Sep 17, 2019 via email

@laugon10
Copy link

laugon10 commented Nov 9, 2020

Hello both,
I am just having the same issue. Have you find any reason for this? I am attaching my log.

Thank you very much in advance.
Laura

damidseq_pipeline v1.4.5
Command-line options: --gatc_frag_file=/damidseq_pipeline/files_TaDA/dmel_r6.36.GATC.gff --bowtie2_genome_dir=/damidseq_pipeline/files_TaDA/Dm_r6.36/dmel_r6.36 --bowtie2_path=/bowtie2/2.2.3/ --samtools_path=/samtools/1.6/ --bedtools_path=/apps/bedtools/ --dam=MZ_Dam_2_1.fastq.gz MZ_DamPOL_2_1.fastq.gz

*** Reading data files ***
  Using MZ.1 as Dam control.

Using samtools version 1.6

*** Aligning files with bowtie2 ***
Now working on MZ.1 ...
26358976 reads; of these:
  26358976 (100.00%) were unpaired; of these:
    718878 (2.73%) aligned 0 times
    23912933 (90.72%) aligned exactly 1 time
    1727165 (6.55%) aligned >1 times
97.27% overall alignment rate

Now working on MZ ...
27181047 reads; of these:
  27181047 (100.00%) were unpaired; of these:
    3961397 (14.57%) aligned 0 times
    22267913 (81.92%) aligned exactly 1 time
    951737 (3.50%) aligned >1 times
85.43% overall alignment rate


*** Reading GATC file ***
*** Extending reads up to 300 bases ***
Reading input file: MZ.1 ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278521
    211000022278673
   .........
    Y_mapped_Scaffold_9_D1573
    mitochondrion_genome

  Seqs extended (>q30) = 21067303

Reading input file: MZ ...
  Processing data ...
  Warning: alignment contains chromosome identities not found in GATC file:
    211000022278164
    211000022278311
    ................
    Y_mapped_Scaffold_9_D1573
    mitochondrion_genome

  Seqs extended (>q30) = 18004292

*** Calculating bins ***
Now working on MZ.1-ext300 ...
Generating .bam file ...
  sorting ...
  16613660 reads
  Generating bins from MZ.1-ext300.bam ...
  Converting to GATC resolution ...                 


Now working on MZ-ext300 ...
Generating .bam file ...
  sorting ...
  793234 reads <-- low reads
  Generating bins from MZ-ext300.bam ...
  Converting to GATC resolution ...                 


*** Calculating quantiles ***
Now working on MZ.1 ...
Sorting ...
   Quantile 0.1: 1.00
   Quantile 0.2: 2.00
   Quantile 0.3: 3.75
   Quantile 0.4: 6.29
   Quantile 0.5: 10.40
   Quantile 0.6: 17.00
   Quantile 0.7: 29.00
   Quantile 0.8: 53.80
   Quantile 0.9: 123.29
   Quantile 1.0: 10260.33

Now working on MZ ...
Sorting ...
   Quantile 0.1: 0.25
   Quantile 0.2: 0.45
   Quantile 0.3: 0.62
   Quantile 0.4: 1.00
   Quantile 0.5: 1.00
   Quantile 0.6: 1.67
   Quantile 0.7: 2.50
   Quantile 0.8: 4.50
   Quantile 0.9: 11.25
   Quantile 1.0: 2870.40

*** Calculating normalisation factor ***
Now working on MZ ...
  Spearman's correlation: 0.49  
  Norm factor = 32.00 based off 100605 frags (total 389223)

*** Normalising ***
Processing sample: MZ ...
  ... normalising by 32.00

*** Generating ratios ***
Now working on MZ ...
  Reading Dam ...
  Reading MZ ...
  ... adding 4.32 pseudocounts to each sample

All done.

@owenjm
Copy link
Owner

owenjm commented Nov 11, 2020

Hi Laura,

I think there are two causes for this bug. The simple fix is to upgrade samtools to at least version 1.9, and it should (hopefully) process fine; although you're only the second person to report this so it's difficult to tell.

There may also be a bug in the pipeline code that affects rare and unusual alignments, and which in turn is causing old versions of samtools to exit early. I need to check this further.

But if please let me know if upgrading samtools fixes the problem?

Thanks,
Owen

@laugon10
Copy link

Hi Owen,
Thank you very much for your fast answer. I was using samtools 1.6 so that is maybe the problem (I will try with an updated samtools) and also I have paired-end reads but I used only one of the reads. I just saw that the last version accepted paired-end reads and this has just worked perfectly.

Thanks a lot,
Laura

@owenjm
Copy link
Owner

owenjm commented Nov 11, 2020

Great to hear, Laura. Yes, we've switched over to PE reads completely now, and I can't imagine going back to SE (the only thing I'd suggest is that you don't mix SE and PE samples, as it will create some odd results in repetitive regions given the better mapping of PE data).

(Would still like to know exactly what's going on here, so will leave this open in case it crops up again.)

Cheers,
Owen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants