Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vg call issues #4475

Open
asylvz opened this issue Dec 10, 2024 · 7 comments
Open

vg call issues #4475

asylvz opened this issue Dec 10, 2024 · 7 comments

Comments

@asylvz
Copy link

asylvz commented Dec 10, 2024

1. What were you trying to do?
Trying to call/genotype from the vg graph.

Commands:
1 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -z > genotypes_yamnaya.vcf
2 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -v concatenated.vcf.gz > genotypes_yamnaya.vcf
3 -vg call all.xg -k yamnaya.pack -r yamnaya.snarls > genotypes_yamnaya.vcf

Here's how I generated the graph:

(seq 1 22; echo X) | parallel -j 24 "${vg} construct -r ${ref} -v 1kGP_high_coverage_Illumina.chr{}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz -R chr{} -C -a > chr{}.vg"
${vg} ids -j -m mapping $(for i in $(seq 1 22; echo X); do echo chr${i}.vg; done)
${vg} index -x graph-with-alts.xg -L $(for i in $(seq 1 22; echo X); do echo chr${i}.vg; done)
${vg} gbwt -x graph-with-alts.xg -o all.gbwt --num-jobs 24 -v \
   $(for i in $(seq 1 22; echo X); do echo 1kGP_high_coverage_Illumina.chr${i}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz; done)

${vg} index -x all.xg $(for i in $(seq 1 22; echo X); do echo chr${i}.vg; done)
${vg} gbwt -x all.xg --gbz-format -g all.gbz all.gbwt
${vg} index -j all.dist all.xg
${vg} minimizer -t 16 -d all.dist -o all.min all.gbz

And mappings:

${vg} giraffe -Z all.gbz -m all.min -d all.dist -f Yamnaya.fastq.gz >mapped_Yamnaya.gam
${vg} pack -x all.gbz -g mapped_Yamnaya.gam -o yamnaya.pack -Q 5
${vg} snarls all.gbz > yamnaya.snarls

2. What did you want to happen?
Generate small and large variants

3. What actually happened?
Commands (1) and (2) give "reference path not found" error and (3) gives the error in the attachment.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

vg_error.txt

5. What data and command can the vg dev team use to make the problem happen?
1 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -z > genotypes_yamnaya.vcf
2 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -v concatenated.vcf.gz > genotypes_yamnaya.vcf
3 -vg call all.xg -k yamnaya.pack -r yamnaya.snarls > genotypes_yamnaya.vcf

6. What does running vg version say?

vg 1.61.0 plodio
@jltsiren
Copy link
Contributor

Your GBZ graphs do not have reference paths due to the way you built them. The vg gbwt commands you used only include haplotype paths from the VCF. You would have to build separate GBWT indexes for the reference paths and merge them with the haplotype GBWTs. It is much easier to use vg autoindex, which does all this automatically.

The third command probably crashes, because the graph and the snarl file do not match. You built the snarls for the GBZ graph, which only contains the nodes and the edges used by the haplotype paths in the GBWT index. Your XG graph is based on vg construct and likely includes additional edges and nodes. Edges, because the construction creates edges for all possible recombinations, and some of them are likely not seen in the haplotypes. And nodes, because graph construction and GBWT construction interpret overlapping variants in slightly different ways for performance reasons.

@asylvz
Copy link
Author

asylvz commented Dec 17, 2024

I could not use vg autoindex, I got the error below. Note that I merged the VCFs of each chromosome and ran with vg autoindex --workflow giraffe --prefix /mnt/NEOGENE4/projects/pipeline_2024/autoindex --ref-fasta GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna --vcf concatenated.vcf.gz

last part of the error is below (full error log is attached: 2071261-gibbon.err.zip) :

b7b38b447e6c26d5ccef.383.phased.chunked.vcf.gz
warning: [HaplotypeIndexer::parse_vcf] contig chrUn_JTFH01001152v1_decoy not present in file /tmp/vg-v5hwlS/dir-8OdtUU/a461b67fd342ea89b110b7b38b447e6c26d5ccef.383.phased.chunked.vcf.gz
vg: /usr/local/sw/vg/include/sdsl/int_vector_buffer.hpp:160: sdsl::int_vector_buffer<t_width>::int_vector_buffer(std::string, std::ios_base::openmode, uint64_t, uint8_t, bool) [with unsigned char t_width = 8; std::string = std::__cxx11::basic_string<char>; std::ios_base::openmode = std::ios_base::openmode; uint64_t = long unsigned int; uint8_t = unsigned char]: Assertion `m_ifile.good()' failed.
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.61.0-26-g62ccb5575 "Plodio"
Stack trace (most recent call last) in thread 2993747:
#14   Object "", at 0xffffffffffffffff, in 
#13   Object "", at 0x7ff9ced95a6e, in 
#12   Object "", at 0x7ff9cf4caea6, in 
#11   Object "", at 0x7ff9cef98ecf, in 
#10   Object "", at 0x55adfc7c4d06, in 
#9    Object "", at 0x55adfc780a04, in 
#8    Object "", at 0x55adfc751fac, in 
#7    Object "", at 0x55adfc74fe91, in 
#6    Object "", at 0x55adfc754f00, in 
#5    Object "", at 0x55adfcdd9d4d, in 
#4    Object "", at 0x55adfc940c4c, in 
#3    Object "", at 0x7ff9ceccb661, in 
#2    Object "", at 0x7ff9cecbc40e, in 
#1    Object "", at 0x7ff9cecbc536, in 
#0    Object "", at 0x7ff9cecd2ce1, in 
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.61.0-26-g62ccb5575
Stack trace (most recent call last) in thread 2993747:
#21   Object "", at 0xffffffffffffffff, in 
#20   Object "", at 0x7ff9ced95a6e, in 
#19   Object "", at 0x7ff9cf4caea6, in 
#18   Object "", at 0x7ff9cef98ecf, in 
#17   Object "", at 0x55adfc7c4d06, in 
#16   Object "", at 0x55adfc780a04, in 
#15   Object "", at 0x55adfc751fac, in 
#14   Object "", at 0x55adfc74fe91, in 
#13   Object "", at 0x55adfc754f00, in 
#12   Object "", at 0x55adfcdd9d4d, in 
#11   Object "", at 0x55adfc940c4c, in 
#10   Object "", at 0x7ff9ceccb661, in 
#9    Object "", at 0x7ff9cecbc40e, in 
#8    Object "", at 0x7ff9cecbc536, in 
#7    Object "", at 0x7ff9cecd2ce1, in 
#6    Object "", at 0x7ff9cf4d613f, in 
#5    Object "", at 0x55adfc65c657, in 
#4    Object "", at 0x7ff9cecd56b9, in 
#3    Object "", at 0x7ff9cecd5516, in 
#2    Object "", at 0x55adfcb00908, in 
#1    Object "", at 0x55adfcb006e0, in 
#0    Object "", at 0x7ff9ced5d073, in 
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!

@jltsiren
Copy link
Contributor

My best guess is that vg tried to open too many files simultaneously and hit the limit.

The first part of the error log is vg complaining about SVs that may be incorrectly represented or vg at least does not understand them. They should be irrelevant to the issue at hand.

In the second part, vg complains about a large number of contigs with no variants. We also see that the input FASTA and VCF files have been chunked into hundreds of parts (one per contig).

And then the actual error comes from sdsl::int_vector_buffer, which fails to open a temporary file it had just created (and is still open) for reading.

You could try increasing the number of file descriptors with ulimit -n <limit> (you can see the current limit with ulimit -n). If you can't do that or it doesn't help, another option is removing all unlocalized contigs from the reference.

@asylvz
Copy link
Author

asylvz commented Dec 21, 2024

After increasing the ulimit, I now get the error below. I guess it's because there is not enough storage in "/tmp/xg-254s2K" right? Is there a way to decrease the amount of data that VG generates in that folder?

...
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Stripping allele paths from VG.
[IndexRegistry]: Constructing XG graph from VG graph.
[xg]: couldn't create temp directory: /tmp/xg-254s2K

@jltsiren
Copy link
Contributor

jltsiren commented Jan 2, 2025

The temporary files are unavoidable. You can select another directory for them by using vg autoindex option -T / --tmp-dir or by setting environment variable TMPDIR.

@asylvz
Copy link
Author

asylvz commented Jan 7, 2025

Thank you @jltsiren but our server does not allow the generation of such high memory temporary files. In your previous messages, you said that "Your GBZ graphs do not have reference paths due to the way you built them. The vg gbwt commands you used only include haplotype paths from the VCF. You would have to build separate GBWT indexes for the reference paths and merge them with the haplotype GBWTs. It is much easier to use vg autoindex, which does all this automatically. "

Can you give a bit more detail on how I can achieve this without using vg autoindex?

@jltsiren
Copy link
Contributor

jltsiren commented Jan 7, 2025

Let's go back to your original commands.

First, you should use option --preset 1000gp in the vg gbwt command where you create the haplotype GBWT. It sets --force-phasing and --discard-overlaps, which should give you chromosome-length haplotypes, instead of creating tens of millions of phase breaks when the path construction logic cannot make sense of overlapping variants. It also sets --buffer-size and --batch-size to more appropriate numbers.

Then you should create another GBWT with the reference paths. The following command should work:

vg gbwt -p -x all.xg -E -o reference.gbwt

The reference paths in reference.gbwt are technically generic named paths, because we haven't bothered updating vg construct to include reference sample name in the path names.

Then you merge all.gbwt and reference.gbwt using:

vg gbwt -p -o merged.gbwt -m all.gbwt reference.gbwt

The order of the arguments is important. If you switch the two GBWT files, the command will probably take several days to complete. Now you can create a GBZ graph from all.xg and merged.gbwt.

Because your VCF files contain thousands of haplotypes, Giraffe will probably be faster and more accurate with a downsampled GBZ. You can create it with:

vg gbwt -p -l --pass-paths --num-jobs 24 -x all.xg -g downsampled.gbz --gbz-format merged.gbwt

You may probably want to use both GBZ graphs to see which of them works better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants