vg call issues #4475

asylvz · 2024-12-10T12:51:51Z

1. What were you trying to do?
Trying to call/genotype from the vg graph.

Commands:
1 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -z > genotypes_yamnaya.vcf
2 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -v concatenated.vcf.gz > genotypes_yamnaya.vcf
3 -vg call all.xg -k yamnaya.pack -r yamnaya.snarls > genotypes_yamnaya.vcf

Here's how I generated the graph:

(seq 1 22; echo X) | parallel -j 24 "${vg} construct -r ${ref} -v 1kGP_high_coverage_Illumina.chr{}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz -R chr{} -C -a > chr{}.vg"
${vg} ids -j -m mapping $(for i in $(seq 1 22; echo X); do echo chr${i}.vg; done)
${vg} index -x graph-with-alts.xg -L $(for i in $(seq 1 22; echo X); do echo chr${i}.vg; done)
${vg} gbwt -x graph-with-alts.xg -o all.gbwt --num-jobs 24 -v \
   $(for i in $(seq 1 22; echo X); do echo 1kGP_high_coverage_Illumina.chr${i}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz; done)

${vg} index -x all.xg $(for i in $(seq 1 22; echo X); do echo chr${i}.vg; done)
${vg} gbwt -x all.xg --gbz-format -g all.gbz all.gbwt
${vg} index -j all.dist all.xg
${vg} minimizer -t 16 -d all.dist -o all.min all.gbz

And mappings:

${vg} giraffe -Z all.gbz -m all.min -d all.dist -f Yamnaya.fastq.gz >mapped_Yamnaya.gam
${vg} pack -x all.gbz -g mapped_Yamnaya.gam -o yamnaya.pack -Q 5
${vg} snarls all.gbz > yamnaya.snarls

2. What did you want to happen?
Generate small and large variants

3. What actually happened?
Commands (1) and (2) give "reference path not found" error and (3) gives the error in the attachment.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

vg_error.txt

5. What data and command can the vg dev team use to make the problem happen?
1 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -z > genotypes_yamnaya.vcf
2 - vg call all.gbz -r yamnaya.snarls -k yamnaya.pack -s Yamnaya -v concatenated.vcf.gz > genotypes_yamnaya.vcf
3 -vg call all.xg -k yamnaya.pack -r yamnaya.snarls > genotypes_yamnaya.vcf

6. What does running vg version say?

vg 1.61.0 plodio

The text was updated successfully, but these errors were encountered:

jltsiren · 2024-12-16T11:38:48Z

Your GBZ graphs do not have reference paths due to the way you built them. The vg gbwt commands you used only include haplotype paths from the VCF. You would have to build separate GBWT indexes for the reference paths and merge them with the haplotype GBWTs. It is much easier to use vg autoindex, which does all this automatically.

The third command probably crashes, because the graph and the snarl file do not match. You built the snarls for the GBZ graph, which only contains the nodes and the edges used by the haplotype paths in the GBWT index. Your XG graph is based on vg construct and likely includes additional edges and nodes. Edges, because the construction creates edges for all possible recombinations, and some of them are likely not seen in the haplotypes. And nodes, because graph construction and GBWT construction interpret overlapping variants in slightly different ways for performance reasons.

asylvz · 2024-12-17T10:10:48Z

I could not use vg autoindex, I got the error below. Note that I merged the VCFs of each chromosome and ran with vg autoindex --workflow giraffe --prefix /mnt/NEOGENE4/projects/pipeline_2024/autoindex --ref-fasta GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna --vcf concatenated.vcf.gz

last part of the error is below (full error log is attached: 2071261-gibbon.err.zip) :

b7b38b447e6c26d5ccef.383.phased.chunked.vcf.gz
warning: [HaplotypeIndexer::parse_vcf] contig chrUn_JTFH01001152v1_decoy not present in file /tmp/vg-v5hwlS/dir-8OdtUU/a461b67fd342ea89b110b7b38b447e6c26d5ccef.383.phased.chunked.vcf.gz
vg: /usr/local/sw/vg/include/sdsl/int_vector_buffer.hpp:160: sdsl::int_vector_buffer<t_width>::int_vector_buffer(std::string, std::ios_base::openmode, uint64_t, uint8_t, bool) [with unsigned char t_width = 8; std::string = std::__cxx11::basic_string<char>; std::ios_base::openmode = std::ios_base::openmode; uint64_t = long unsigned int; uint8_t = unsigned char]: Assertion `m_ifile.good()' failed.
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.61.0-26-g62ccb5575 "Plodio"
Stack trace (most recent call last) in thread 2993747:
#14   Object "", at 0xffffffffffffffff, in 
#13   Object "", at 0x7ff9ced95a6e, in 
#12   Object "", at 0x7ff9cf4caea6, in 
#11   Object "", at 0x7ff9cef98ecf, in 
#10   Object "", at 0x55adfc7c4d06, in 
#9    Object "", at 0x55adfc780a04, in 
#8    Object "", at 0x55adfc751fac, in 
#7    Object "", at 0x55adfc74fe91, in 
#6    Object "", at 0x55adfc754f00, in 
#5    Object "", at 0x55adfcdd9d4d, in 
#4    Object "", at 0x55adfc940c4c, in 
#3    Object "", at 0x7ff9ceccb661, in 
#2    Object "", at 0x7ff9cecbc40e, in 
#1    Object "", at 0x7ff9cecbc536, in 
#0    Object "", at 0x7ff9cecd2ce1, in 
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.61.0-26-g62ccb5575
Stack trace (most recent call last) in thread 2993747:
#21   Object "", at 0xffffffffffffffff, in 
#20   Object "", at 0x7ff9ced95a6e, in 
#19   Object "", at 0x7ff9cf4caea6, in 
#18   Object "", at 0x7ff9cef98ecf, in 
#17   Object "", at 0x55adfc7c4d06, in 
#16   Object "", at 0x55adfc780a04, in 
#15   Object "", at 0x55adfc751fac, in 
#14   Object "", at 0x55adfc74fe91, in 
#13   Object "", at 0x55adfc754f00, in 
#12   Object "", at 0x55adfcdd9d4d, in 
#11   Object "", at 0x55adfc940c4c, in 
#10   Object "", at 0x7ff9ceccb661, in 
#9    Object "", at 0x7ff9cecbc40e, in 
#8    Object "", at 0x7ff9cecbc536, in 
#7    Object "", at 0x7ff9cecd2ce1, in 
#6    Object "", at 0x7ff9cf4d613f, in 
#5    Object "", at 0x55adfc65c657, in 
#4    Object "", at 0x7ff9cecd56b9, in 
#3    Object "", at 0x7ff9cecd5516, in 
#2    Object "", at 0x55adfcb00908, in 
#1    Object "", at 0x55adfcb006e0, in 
#0    Object "", at 0x7ff9ced5d073, in 
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!

jltsiren · 2024-12-17T12:47:48Z

My best guess is that vg tried to open too many files simultaneously and hit the limit.

The first part of the error log is vg complaining about SVs that may be incorrectly represented or vg at least does not understand them. They should be irrelevant to the issue at hand.

In the second part, vg complains about a large number of contigs with no variants. We also see that the input FASTA and VCF files have been chunked into hundreds of parts (one per contig).

And then the actual error comes from sdsl::int_vector_buffer, which fails to open a temporary file it had just created (and is still open) for reading.

You could try increasing the number of file descriptors with ulimit -n <limit> (you can see the current limit with ulimit -n). If you can't do that or it doesn't help, another option is removing all unlocalized contigs from the reference.

asylvz · 2024-12-21T18:56:07Z

After increasing the ulimit, I now get the error below. I guess it's because there is not enough storage in "/tmp/xg-254s2K" right? Is there a way to decrease the amount of data that VG generates in that folder?

...
[IndexRegistry]: Merging contig GBWTs.
[IndexRegistry]: Stripping allele paths from VG.
[IndexRegistry]: Constructing XG graph from VG graph.
[xg]: couldn't create temp directory: /tmp/xg-254s2K

jltsiren · 2025-01-02T16:36:55Z

The temporary files are unavoidable. You can select another directory for them by using vg autoindex option -T / --tmp-dir or by setting environment variable TMPDIR.

asylvz · 2025-01-07T09:47:15Z

Thank you @jltsiren but our server does not allow the generation of such high memory temporary files. In your previous messages, you said that "Your GBZ graphs do not have reference paths due to the way you built them. The vg gbwt commands you used only include haplotype paths from the VCF. You would have to build separate GBWT indexes for the reference paths and merge them with the haplotype GBWTs. It is much easier to use vg autoindex, which does all this automatically. "

Can you give a bit more detail on how I can achieve this without using vg autoindex?

jltsiren · 2025-01-07T20:37:37Z

Let's go back to your original commands.

First, you should use option --preset 1000gp in the vg gbwt command where you create the haplotype GBWT. It sets --force-phasing and --discard-overlaps, which should give you chromosome-length haplotypes, instead of creating tens of millions of phase breaks when the path construction logic cannot make sense of overlapping variants. It also sets --buffer-size and --batch-size to more appropriate numbers.

Then you should create another GBWT with the reference paths. The following command should work:

vg gbwt -p -x all.xg -E -o reference.gbwt

The reference paths in reference.gbwt are technically generic named paths, because we haven't bothered updating vg construct to include reference sample name in the path names.

Then you merge all.gbwt and reference.gbwt using:

vg gbwt -p -o merged.gbwt -m all.gbwt reference.gbwt

The order of the arguments is important. If you switch the two GBWT files, the command will probably take several days to complete. Now you can create a GBZ graph from all.xg and merged.gbwt.

Because your VCF files contain thousands of haplotypes, Giraffe will probably be faster and more accurate with a downsampled GBZ. You can create it with:

vg gbwt -p -l --pass-paths --num-jobs 24 -x all.xg -g downsampled.gbz --gbz-format merged.gbwt

You may probably want to use both GBZ graphs to see which of them works better.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vg call issues #4475

vg call issues #4475

asylvz commented Dec 10, 2024

jltsiren commented Dec 16, 2024

asylvz commented Dec 17, 2024

jltsiren commented Dec 17, 2024

asylvz commented Dec 21, 2024

jltsiren commented Jan 2, 2025

asylvz commented Jan 7, 2025

jltsiren commented Jan 7, 2025

vg call issues #4475

vg call issues #4475

Comments

asylvz commented Dec 10, 2024

jltsiren commented Dec 16, 2024

asylvz commented Dec 17, 2024

jltsiren commented Dec 17, 2024

asylvz commented Dec 21, 2024

jltsiren commented Jan 2, 2025

asylvz commented Jan 7, 2025

jltsiren commented Jan 7, 2025