How to handle chromosome names? #41

janaobsteter · 2023-11-09T07:07:40Z

Currently, we get chromosome numbers from the config file and then loop over range(1, nChromosomes + 1). But what if we have non-numeric chromosomes in there, such as other contigs or the mitochondrial genome?
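To make the problem concrete, here is a minimal sketch of looping over chromosome names rather than a numeric range. The `chromosomes` config key and the `chromosome_names` helper are hypothetical, not part of the current pipeline; the fallback assumes `nChromosomes` is the integer read from the config today.

```python
def chromosome_names(config):
    """Return chromosome names as strings.

    Assumption: the config either has an explicit 'chromosomes' list
    (hypothetical key) or only the integer 'nChromosomes'.
    """
    if "chromosomes" in config:
        # Names are kept as strings so "X", "MT" or contig IDs work too.
        return [str(name) for name in config["chromosomes"]]
    # Fallback: numeric chromosomes 1..nChromosomes, as done currently.
    return [str(i) for i in range(1, config["nChromosomes"] + 1)]


config = {"chromosomes": [1, 2, "X", "MT", "contig_17"]}
for chrom in chromosome_names(config):
    print(f"processing chromosome {chrom}")
```

Treating every name as a string means downstream steps never need to know whether a chromosome is numeric.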

gregorgorjanc · 2023-11-09T07:17:48Z

Maybe follow what stdpopsim does?

hannesbecher · 2023-11-28T14:55:33Z

I think it would be useful to have a text file with chromosome names and lengths. See the genome file format used by bedtools. This has one chromosome per line, a tab, and the chromosome's length:

$ cat my.genome
chr1  1000
chr2  500
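Reading such a file takes only a few lines. The sketch below assumes tab-separated name and length columns as described above; `parse_genome_lines` is a hypothetical helper name.

```python
def parse_genome_lines(lines):
    """Parse genome-file lines into a {name: length} dict (insertion-ordered)."""
    lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        # First column is the chromosome name, second its length.
        lengths[fields[0]] = int(fields[1])
    return lengths


example = ["chr1\t1000\n", "chr2\t500\n"]
print(parse_genome_lines(example))
```

With a real file, the same function can be fed an open file handle, for example `parse_genome_lines(open("my.genome"))`.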

Should a genome file be generated as part of this pipeline?

This would be easy if the entry point was one multi-chromosome VCF file. The file could be parsed and each chromosome's highest variant position could be used as the chromosome length. It would also be easy if a genome FASTA file was available.
But it could be tricky if the entry point is multiple VCF files.
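As a rough sketch of the VCF idea, the function below tracks the highest variant position per chromosome. It assumes uncompressed, tab-separated VCF lines; `max_positions` is a hypothetical name. Passing the same dict across several calls is one possible way to handle multiple VCF files. Note that the highest variant position is only a lower bound on the true chromosome length.

```python
def max_positions(vcf_lines, lengths=None):
    """Update and return {chrom: highest POS seen} from VCF lines."""
    lengths = {} if lengths is None else lengths
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        chrom, pos = fields[0], int(fields[1])
        if pos > lengths.get(chrom, 0):
            lengths[chrom] = pos
    return lengths


vcf_a = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\n",
         "chr1\t100\t.\n", "chr1\t950\t.\n", "MT\t40\t.\n"]
vcf_b = ["chr1\t400\t.\n", "chr2\t500\t.\n"]

lengths = max_positions(vcf_a)
lengths = max_positions(vcf_b, lengths)  # merge a second file
print(lengths)
```

If the VCF headers carry `##contig` lines with a `length` field, those would give the actual lengths, but not every VCF includes them.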

Alternatively, we might require the genome file as an additional input, and we could supply a script to generate such a file from VCF/genome FASTA.
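A helper script for the FASTA case could look roughly like this. It is a sketch that assumes a well-formed, uncompressed FASTA in which the sequence name is the first whitespace-delimited token of each header; `fasta_lengths` and `format_genome` are hypothetical names.

```python
def fasta_lengths(fasta_lines):
    """Return {sequence name: length} from FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # Use the first token of the header as the chromosome name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def format_genome(lengths):
    """Render {name: length} as genome-file text (name, tab, length)."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())


fasta = [">chr1 some description\n", "ACGT\n", "AC\n", ">MT\n", "GG\n"]
print(format_genome(fasta_lengths(fasta)), end="")
```

The output can be written straight to a file such as `my.genome` and passed to the pipeline as the additional input.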

Opinions? @gregorgorjanc @gmafrafortuna @janaobsteter

Generally, following stdpopsim sounds good, but we may also want to run this pipeline on small test datasets and on organisms that are not in stdpopsim at the moment.

janaobsteter assigned hannesbecher Nov 17, 2023
