Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

what do you want to fill in the “--chrom” option? I read the instructions and there is no document. #35

Open
oranges7 opened this issue Mar 17, 2023 · 22 comments

Comments

@oranges7
Copy link

No description provided.

@oranges7
Copy link
Author

I want to call on all chromosomes. How should I write them?

@umahsn
Copy link
Collaborator

umahsn commented Mar 17, 2023

Hi,

What version of NanoCaller are you using? From v3 onwards the default it all chromosomes. For versions before that you needed a separate command NanoCaller_WGS for whole genome.

@oranges7
Copy link
Author

Hi,

I should be using the latest version, because I installed it with bioconda this morning, but it requires-- chrom

usage: NanoCaller [-h] [-mode MODE] [-seq SEQUENCING] [-cpu CPU] [-mincov MINCOV] [-maxcov MAXCOV] [-keep_bam] [-o OUTPUT] [-prefix PREFIX]
[-sample SAMPLE] [-include_bed INCLUDE_BED] [-exclude_bed EXCLUDE_BED] [-start START] [-end END] [-p PRESET] -bam BAM -ref
REF -chrom CHROM [-snp_model SNP_MODEL] [-min_allele_freq MIN_ALLELE_FREQ] [-min_nbr_sites MIN_NBR_SITES]
[-nbr_t NEIGHBOR_THRESHOLD] [-sup] [-indel_model INDEL_MODEL] [-ins_t INS_THRESHOLD] [-del_t DEL_THRESHOLD]
[-win_size WIN_SIZE] [-small_win_size SMALL_WIN_SIZE] [-impute_indel_phase] [-phase_bam] [-enable_whatshap]
NanoCaller: error: the following arguments are required: -bam/--bam, -ref/--ref, -chrom/--chrom

I want to calling on the whole genome, how should I do it?
Thank you for your reply.

@oranges7
Copy link
Author

Hi,

If possible, can you give me the code of the training process? I want to use more data to train the model.
Look forward to your reply

@umahsn
Copy link
Collaborator

umahsn commented Mar 22, 2023

Hi,

It is possible the conda is still installing the older version by default. You can specify a version for conda to install: conda install -c bioconda nanocaller=3.0.1. You can check the version of NanoCaller installed via NanoCaller --version or conda list.

If you are using NanoCaller v3 and above, please use NanoCaller --bam BAM --ref REF with the rest of the options for whole genome variant calling. You do not need to specify any chromosomes or regions for whole genome.

If you want to use the NanoCaller version below 3, which it seems like you have installed, please use NanoCaller_WGS --bam BAM --ref REF with the rest of the options. In version 2 of NanoCaller, we have a command NanoCaller_WGS for whole genome variant calling which calls NanoCaller command on each chromosome.

But in version 3 of NanoCaller, we have combined the functionality into just one entry point so now in v3+, NanoCaller command does whole genome variant calling by default, and will only do chromosome specific variant calling if --regions option is used.

We will release the training code in the next week or two. I will post an update here when I do so.

@oranges7
Copy link
Author

Hi,

Thank you very much for your reply. I have updated the version. Thank you for your patient answer. I look forward to your training code.

@oranges7
Copy link
Author

Hi,

I encountered another problem, snp calls will be completed directly, generating an empty file. Because I'm running other programs, is it because I don't have enough memory? How can I correct it? Looking forward to your reply.

2023-03-24 15:58:07.022608: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /media/usb3/gyc/vl/nano_hg004_30x/args

SNP Calling Progress: 0%| | 0/6362 [00:38<?, ?it/s]

2023-03-24 15:59:01.451540: Combining SNP calls.

2023-03-24 15:59:01.453847: Compressing and indexing SNP calls.
Writing to /tmp/bcftools.fnzF04
Merging 0 temporary files
Cleaning
Done

@oranges7
Copy link
Author

Hi,

I changed the server and can run normally, but why is the result compared with the benchmark file with hap.py, the final result will be very bad?

NanoCaller --bam /home/gyc/ont/fastq/HG001_NBT2018_Guppy_4.2.2.fastq.gz.bam
--ref /home/gyc/ont/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
--cpu 40
--output /home/gyc/ont/nano
--preset ont

Looking forward to your reply.

@oranges7
Copy link
Author

Hi,

I am still looking forward to your process of raising features and training models. Thank you very much for your hard work.

@umahsn
Copy link
Collaborator

umahsn commented May 3, 2023

Hi,

Sorry for the delay, we are working on releasing the training code, and will need a couple of weeks to get it done because we are trying to finish some other projects at the moment.

Best,
Umair

@oranges7
Copy link
Author

oranges7 commented May 4, 2023

Hi,

Thank you for your reply in your busy schedule!
I have encountered some questions. I wonder if you can answer them.
“””
(nano) [gyc@six170sever hg004]$ NanoCaller --bam hg004_v4_30x.bam --ref /home/gyc/ont/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --output /media/usb3/gyc/vl/nano_try --cpu 10 --mode snps --phase --sequencing ont

2023-05-04 09:05:21.702487: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /media/usb3/gyc/vl/nano_try/args

SNP Calling Progress: 0%| | 0/6362 [00:30<?, ?it/s]

2023-05-04 09:05:54.798367: Combining SNP calls.

2023-05-04 09:05:54.801669: Compressing and indexing SNP calls.
Writing to /tmp/bcftools.CyNo4Y
Merging 0 temporary files
Cleaning
Done

2023-05-04 09:05:54.975464: SNP calling completed. Time taken= 30.9631
“””

The steps “snp calling” when I ran nanocalller did not seem to run and were completed directly. Do you know what the possible reason is?

@oranges7
Copy link
Author

oranges7 commented May 4, 2023

Hi,

I changed the server and ran it with the following error.
“””
(nano) [ywenjing@localhost hg004]$ NanoCaller --bam hg004_v4_30x.bam --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --output /mnt/ywenjing/gyc/hg004/out/ --cpu 20 --preset ont

2023-05-04 10:56:38.121256: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /mnt/ywenjing/gyc/hg004/out/args

SNP Calling Progress: 39%|????????????????????????????????? | 2493/6362 [29:50<29:41, 2.17it/s]Process Process-8:
Traceback (most recent call last):
File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/snpCaller.py", line 80, in caller
pos, test_ref, x_test, dp, freq, coverage = get_snp_testing_candidates(params, chunk)
File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/generate_SNP_pileups.py", line 136, in get_snp_testing_candidates
r=ref_dict[v_pos]
KeyError: 21950001
“””
Then it will quit snp calling and run indel calling, but will also report an error.

“””
Process Process-31:
Traceback (most recent call last):
File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/indelCaller.py", line 200, in caller
indel_run(params, indel_dict, job_Q, counter_Q, indel_files_list)
File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/indelCaller.py", line 59, in indel_run
pos, x0_test, x1_test, x2_test, alleles_seq, phase = get_indel_testing_candidates(params, chunk)
File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/generate_indel_pileups.py", line 174, in get_indel_testing_candidates
ref_dict={j:s.upper() if s in 'AGTC' else 'N' for j,s in zip(range(max(1,start-200),end+400+1),fastafile.fetch(chrom,max(1,start-200)-1,end+400)) }
File "pysam/libcfaidx.pyx", line 301, in pysam.libcfaidx.FastaFile.fetch
KeyError: "sequence 'chr14' not present"

“””
Is it because of my bam file problem? I use HG004 subsampled to 30x.

@umahsn
Copy link
Collaborator

umahsn commented May 4, 2023 via email

@oranges7
Copy link
Author

oranges7 commented May 5, 2023

Yes, I re-checked the index file and found that the file was corrupted. I fixed the file and the program could run normally. But the file on the other server is intact, but it doesn't work properly. I'm looking for a solution.
Thank you for your reply!

@oranges7
Copy link
Author

Hi,
I'm sorry to bother you again, but I'd like to know what kind of label you use in your training model. Can you give me a brief description?

@oranges7
Copy link
Author

Another question is, is the v_pos in the generate_indel_pileups.py file the last detected site?

@umahsn
Copy link
Collaborator

umahsn commented May 10, 2023

Hi,

I have added the training code here: https://github.com/WGLab/NanoCaller/tree/master/misc/training
This code uses tensorflow 1.13 for training, but later we switched to tensorflow 2 and I only move the inference code to tf2 and not the training code. I will make another update with tf2 code for training. Use ground truth SNP VCF file for SNP model training and ground truth indel VCF file for indel training.

In the inference code, v_pos is the current position being scanned, prev is the last candidate site plus some buffer. Please note that the equivalent code for training is a little different.

@oranges7
Copy link
Author

Hi,

I used nanocaller to run out of results, using hap.py for analysis, found that the precision is very poor. I am using Guppy4.2.2-based hg004 data, down to 50x, using the following command
NanoCaller --bam hg004_50x.bam --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --output nano --cpu 20 --preset ont
What adjustments or attempts should I make to improve precision? Am I going to add the "--exclude_bed" command?

@umahsn
Copy link
Collaborator

umahsn commented May 22, 2023

Hi,

Can you give some information about how you are evaluating the performance, such as: 1) what benchmark dataset are you using? 2) are you using a high confidence interval for evaluation? 3) are you separating SNP and indel performance? If yes, is the precision poor for SNP or indels?

Using --exclude_bed with hg19/hg38 parameter will skip telomere and centromere regions from variant calling so that could improve precision, but usually high confidence intervals in benchmarking dont include these regions for evaluation anyway. So it depends upon whether you are using a high confidence interval for evaluation or not.

You can also try to increase '--min_allele_freq' parameter to 0.2 or 0.25 (instead of default of 0.15), and --mincov parameter to 8 (instead of default of 4); both of these will improve precision.

@oranges7
Copy link
Author

Hi,

  1. The benchmark dataset I use is HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.
  2. I didn't use a high confidence interval.
  3. According to hap.py, the accuracy of SNP is about 55%. The accuracy of indel is about 0.06%.
    I'll try to run the program again with the '--min_allele_freq' parameter.

Thank you for your reply.

@umahsn
Copy link
Collaborator

umahsn commented May 22, 2023

You should use a high confidence interval such as: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed. Without high-confidence interval, you can end up with regions that do not have reliable SNP calls, and SNP calls from short-read sequencing for such regions are typically removed from GIAB benchmark datasets (hence you'll get false positives but not a lot of false negatives from NanoCaller).

Whereas for indels, it is ideal to remove homopolymer regions from evaluation since Nanopore sequencing is not reliable for homopolymer region (especially R9.4.1 datasets).

Can you try using the evaluation strategy shown here (https://github.com/WGLab/NanoCaller/blob/master/docs/ONT%20Case%20Study.md) which uses RTG-tool's vcfeval function? You can just replace HG002 links with HG004 links from here: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh38/

@oranges7
Copy link
Author

Hi,

I tried to use a high confidence interval. The results of rtgtools and hap.py are similar and are consistent with yours. The key is to look at the data with a quality value above 120. Thank you very much for your guidance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants