Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtractHLAread.sh #19

Closed
vegetableyu opened this issue Dec 9, 2023 · 5 comments
Closed

ExtractHLAread.sh #19

vegetableyu opened this issue Dec 9, 2023 · 5 comments

Comments

@vegetableyu
Copy link

Hi,
When I apply the ExtractHLAread.sh script, the unpaired reads produced are far more than the paired reads. After researching, I saw that some people say the bam file should be sorted by read name before converting to fastq, while in ExtractHLAread.sh, it is sorted by coordinates. However, when I switched to sorting by read name, although it reduced a lot of unpaired reads, running SpecHLA.sh afterwards generated a large number of “-” results. This is very confusing, do you know what might be the situation?

@wshuai294
Copy link
Collaborator

Hi,

  • The bam file should be sorted by coordinates using samtools sort, because ExtractHLAread.sh extracts HLA-related reads by coordinates. In converting bam to fastq, samtools fastq can handle such bam file.

As for the large number of unpaired reads, please check following two points:

  • please ensure the reference version (hg19 or hg38) is the same between generating bam and running ExtractHLAread.sh.
  • For the unpaired reads, please check the alignment situation in the bam, see what alleles the unextracted read end mapped to.

please let me know if it works.

Best,
shuai

@vegetableyu
Copy link
Author

Hi shuai,
I looked at samtools fastq --help, which also says it needs to be sorted by reads name. However, using samtools fastq for bam files sorted by reads name and coordinates yielded little difference in results. With bam bam2Fastq, however, bam files sorted by read name get 20 times as many paired reads as those sorted by coordinates.
samtools

@wshuai294
Copy link
Collaborator

Hi,
Apologies for any confusion caused. It seems that ExtractHLAread.sh utilizes bam bam2Fastq to obtain the fastq files. In order to investigate the discrepancy in read numbers resulting from bam bam2Fastq, I suggest examining the reads extracted from the "sorted by read name" but not from the "sorted by coordinates" approach. Please select a subset of these reads and verify the regions to which they are mapped by examining the BAM. If the reads belong to the HLA region, it is likely that the "sorted by read name" approach is correct; otherwise, the "sorted by coordinates" approach may be more appropriate.

By the way, could you please check the alignment detail of the reads ("unpaired reads produced are far more than the paired reads") in the original BAM.

@vegetableyu
Copy link
Author

Hi shuai,
Thanks for your reply and suggestions, i will check this later. I noticed that others were also discussing the issue of preprocessing and gave some suggestions. This might help:
Kingsford-Group/kourami#30
https://bitbucket.nygenome.org/projects/COMPBIO/repos/hla_prep/browse

@wshuai294
Copy link
Collaborator

Thanks for the information. According to the issue, the reason of extracting too much unpaired reads might be keeping multiple alignments other than only the primary alignment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants