- arabidopsis
- 1001 Genomes
- Reference
- 1001 Genomes has full genomes for 1100 strains, so we might not need the reference (132 Gb)
- C. elegans
- humans
- 1000 Genomes
- reference
- Lots of options - not sure which files to use
- More than 3000 individuals in total across the three studies in 1000G
- mouse - 17 sequences, paper for another dataset - dataset, reference
- Mouse Genome Project
- ftp download site: ftp://ftp-mouse.sanger.ac.uk/
- reference
- 21 Gb for the variants (REL-1505-SNPs_Indels)
- yeast
Strain issues - Is it a good idea to treat all strains as a single population? If not, we might not get strong results for yeast and arabidopsis.
How to use this data - All of these datasets provide VCF files (variants crossed with population), so the actual sequences are not given. What we can do is:
1. Acquire the promoter regions for each organism
2. Find the variants within each region
3. Get the monomorphic parts of the sequence from the reference genome
4. Calculate H values
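
The steps above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not a settled method: it assumes "H" means expected heterozygosity per site, H = 1 - sum(p_i^2) over allele frequencies p_i, averaged over the promoter region. The function names `site_heterozygosity` and `region_heterozygosity`, the region coordinates, and the tiny in-memory VCF are all made up for the example. Under this assumed definition, monomorphic sites contribute H = 0, so they enter only through the region length.

```python
from collections import Counter

def site_heterozygosity(sample_fields):
    """Expected heterozygosity 1 - sum(p_i^2) at one site (ASSUMED definition of H)."""
    alleles = []
    for field in sample_fields:
        call = field.split(":")[0]            # GT is the first FORMAT sub-field when present
        for allele in call.replace("|", "/").split("/"):
            if allele != ".":                 # skip missing calls
                alleles.append(allele)
    if not alleles:
        return 0.0
    n = len(alleles)
    return 1.0 - sum((c / n) ** 2 for c in Counter(alleles).values())

def region_heterozygosity(vcf_lines, chrom, start, end):
    """Mean per-site H over a 1-based inclusive region.

    Positions without a VCF record are treated as monomorphic (H = 0).
    Multiple records at the same position are each counted; a real
    pipeline would need to decide how to handle that.
    """
    total = 0.0
    for line in vcf_lines:
        if line.startswith("#"):              # skip meta and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[0] == chrom and start <= pos <= end:
            total += site_heterozygosity(fields[9:])   # sample columns start at index 9
    return total / (end - start + 1)

# Hypothetical two-sample VCF; only the record at position 5 falls in the region.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2",
    "chr1\t5\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t1|1",
    "chr1\t50\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t0|1",
]
h = region_heterozygosity(example_vcf, "chr1", 1, 10)
print(h)
```

For the real datasets the VCF files are large and usually compressed, so the same loop would be run over lines streamed from the file (for example with `gzip.open(path, "rt")`) rather than an in-memory list. If the strains cannot be treated as one population, the sample columns could be split by group before calling `site_heterozygosity`.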