Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract_kraken_reads.py returns different read counts for paired fastqs #85

Open
matbeale opened this issue Sep 26, 2023 · 2 comments
Open

Comments

@matbeale
Copy link

Hi Jen,

Maybe I'm missing something, but extract_kraken_reads.py seems to be returning a different number of reads for each pair in a fastq, which is creating issues for downstream processing (e.g. taxon-specific assembly). Is there an option to ensure both reads in a pair are extracted when one has a hit (or the reverse)?

Many thanks,
Mat

@jenniferlu717
Copy link
Owner

Are you saying that the script is finding one read but not the other? What sequencing platform did you use? I've had to modify the script a couple times to account for paired reads formatting.

Are you running the script on both fastq files at the same time?

@matbeale
Copy link
Author

matbeale commented Sep 27, 2023

Yes, I'm running the script on both fastqs at the same time - this is my run command:

extract_kraken_reads.py --include-children --fastq-output -k 47759_1#1\_clean_kraken.out -r 47759_1#1\_clean_kraken.report -s 47759_1#1\_clean_1.fastq.gz -s2 47759_1#1\_clean_2.fastq.gz -o 47759_1#1\_1.fastq -o2 47759_1#1\_2.fastq -t 3370480

When I check the read counts:
fastaq count_sequences 47759_1#1_1.fastq.gz
3636
fastaq count_sequences 47759_1#1_2.fastq.gz
3658

Since the read counts are discrepant between the fastq files, my original assumption was that the script is finding one but not the other, but I haven't confirmed that. In fact, looking at the first 3 reads in each file (head -12 and tail -12) seems to find the same read. I'll have to think a bit more about how to identify the discrepant reads.

This is data sequenced on NovaSeq 6000. Read headers look like this:

@A00537:889:HHJ7KDRX3:1:2278:25699:35916/2
TTCTAAAACATGAACAAAGGAAGAAGGAAATAAGTAAATGCAAAGCTATGCCAACCTTTCTAAAACATGCTTGTGCAAAAGGGCGTGTATGTTATGTATGTAAAAACATTTTGCTACTTAGCAATACCATTTGGCAGTACATATATTTGGC
+
FFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants