
Four questions about using proovread on Hst #126

Open
ZhangNestor opened this issue Apr 13, 2018 · 2 comments


@ZhangNestor

Hi @thackl ,

I want to use proovread to correct human transcriptome PacBio long reads, and I am confused about several things.

  1. Can I use cutadapt to get quality-trimmed reads (cutadapt -q 30,30 -m 50 -o read_PE1.fq -p read_PE2.fq CL12R_1.fastq CL12R_2.fastq)? Are the quality cutoff and minimum length appropriate for the short reads?
  2. Regarding running time: my long-read file is about 4 Gb. To run the program more efficiently, should I set the chunk size to 20M (giving about 200 files) or to 1G (giving about 4 files)? I tested with a 20M file, and after several hours it still has not finished. Do you have any suggestions?
  3. I will use the corrected long reads to identify new isoforms and fusion genes. Should I use the output file '.trimmed.fq' for that downstream analysis?
  4. For evaluating the corrected long reads, I am a little confused about some concepts. I can count the reads in the output '.trimmed.fq' file and compare that with the number of raw long reads, which gives a general idea of how many reads were corrected. Is that a reasonable performance measure? I also wonder how to assess the change in mapping rate between the raw and corrected long reads: since the raw reads contain many errors, their mapping rate against the genome should be lower than that of the corrected reads. Is the mapping rate a necessary value for evaluating correction performance? Could you explain the evaluation measures in more detail?

I'm sorry to ask so many questions, some of which may be a bit naive. I would really appreciate your help.

Regards,
Nestor

@thackl
Contributor

thackl commented Apr 20, 2018

Hi Nestor,

  1. Your cutadapt parameters sound reasonable to me (though I am not an expert on that). If low short-read coverage is an issue, I think it would also be fine to use a minimum quality of 20 (instead of 30).
  2. For more info on chunking and transcriptome data, have a look at how to correct transcriptome PacBio reads better? #97 and The entry of RS_IsoSeq pipeline to use proovread for correction? #51
  3. You probably want to use the untrimmed reads, which are still full-length and include some regions that couldn't be corrected. Note, though, that untrimmed reads may also contain chimeric reads. Double-check new isoforms in particular: they should be supported by at least 2 or 3 PacBio reads to ensure they are not just random library-prep artifacts.
  4. OK, so for the number of corrected reads - it's not as straightforward as just counting the reads in the .trimmed.fq, for a few reasons:
  • The decision is not binary, corrected vs. not corrected, but depends on how much of a read gets corrected. A few Illumina reads will map to almost every PacBio read, but only if Illumina reads cover the entire length of a PacBio read will the whole read be corrected; otherwise, only the parts with Illumina reads mapped are. The .untrimmed.fq contains the original PacBio reads with all regions that had Illumina reads mapped corrected, and all other parts looking like the uncorrected read. During trimming, only the high-quality parts of a corrected read are extracted. So if there were Illumina reads along the entire length of the PacBio read, you get one long corrected trimmed read; if there were Illumina reads for the first half and the second half, but not for a short stretch in the middle, you get one read in the .untrimmed.fq but two shorter reads in the .trimmed.fq.
  • In addition, proovread discards very short reads (<2x Illumina read length) and corrected fragments from the .trimmed.fq.
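The split-at-uncorrected-stretches behaviour described above can be illustrated with a small sketch (purely illustrative, not proovread's actual algorithm; the lowercase-marks-uncorrected convention and the 50 bp cutoff are assumptions for this demo):

```python
import re

def trim_corrected(read, min_len=50):
    # Keep only runs of corrected (uppercase) bases that are long enough;
    # lowercase stretches stand in for regions no Illumina read covered.
    # min_len is an assumed stand-in for the ~2x Illumina read length cutoff.
    return [m.group(0) for m in re.finditer(r"[ACGT]+", read)
            if len(m.group(0)) >= min_len]

# One "untrimmed" read with an uncovered stretch in the middle yields
# two shorter "trimmed" reads:
untrimmed = "A" * 60 + "acgt" * 5 + "C" * 70
pieces = trim_corrected(untrimmed)  # → ["A"*60, "C"*70]
```

Fragments shorter than the cutoff are dropped entirely, mirroring the second bullet.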

For a quick check of how much correction improved the mapping and error rates of the PacBio reads, I'd suggest mapping the uncorrected and corrected reads with https://github.com/lh3/minimap2 and comparing the mapping statistics.
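One way to compare those statistics is to summarise minimap2's PAF output (a hypothetical helper, not part of minimap2 or proovread; it approximates per-alignment identity as residue matches divided by alignment block length, PAF columns 10 and 11, which is coarser than minimap2's de:f tag):

```python
# Summarise a PAF file produced by e.g.: minimap2 ref.fa reads.fq > aln.paf
def paf_summary(paf_lines, total_reads):
    mapped = set()        # distinct query names with at least one alignment
    identities = []       # approximate identity per alignment
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue
        mapped.add(cols[0])
        matches, block_len = int(cols[9]), int(cols[10])
        if block_len > 0:
            identities.append(matches / block_len)
    mapping_rate = len(mapped) / total_reads if total_reads else 0.0
    mean_identity = sum(identities) / len(identities) if identities else 0.0
    return mapping_rate, mean_identity
```

Run it on the PAF files for the raw and corrected reads and compare: correction should raise the mean identity, and (read splitting aside) the mapping rate.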

Hope that helps, and sorry it took me a bit to get back to you.

@ZhangNestor
Author

Really appreciate your help! I still have one more question: how do I check the improved error rates with minimap2?
I used minimap2 to map the PacBio raw reads, trimmed reads, and untrimmed reads. The log files report:
  • raw reads: mapped 291643 sequences
  • trimmed reads: mapped 252915 sequences
  • untrimmed reads: mapped 283248 sequences
Do you think it's a little strange that fewer reads map to the genome after correction (i.e., the mapping rate decreased)?
And do you have any idea how to check the change in mapping error rates?
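One caveat when comparing these numbers: "mapped N sequences" counts are not mapping rates, because each file has a different total read count as denominator (trimming can split one read into several or drop short fragments). A minimal sketch for turning counts into comparable rates (hypothetical helper names):

```python
def fastq_read_count(path):
    # FASTQ records are 4 lines each (header, sequence, '+', qualities)
    with open(path) as fh:
        return sum(1 for _ in fh) // 4

def mapping_rate(mapped, total):
    # fraction of reads in *that file* with at least one alignment
    return mapped / total if total else 0.0
```

Computing mapped/total per file (raw, trimmed, untrimmed) gives rates that can be compared fairly across the three files.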
For the correction step, my output after each iteration was:
Running task bwa-sr-1
Masked : 36.2%
Running task bwa-sr-2
Masked : 48.3%
Running task bwa-sr-3
Masked : 53.4%
Running task bwa-sr-4
Masked : 64.7%
Running task bwa-sr-5
Masked : 67.4%
Running task bwa-sr-finish
Masked : 53.8%
What do you think of this result, and is it good enough for the later analysis?
Thanks, I really look forward to your reply.

Nestor
