organism_data_sources.md
Species

  1. Arabidopsis
    1. 1001 Genomes
    2. Reference
    3. 1001 Genomes has the full genomes for 1100 strains (132 Gb), so we might not need the reference
  2. C. elegans
    1. CeNDR
    2. reference
    3. VCF is about 2 Gb but it is possible to download all alignment data (not sure of the size)
  3. humans
    1. 1000 Genomes
    2. reference
    3. Lots of options - not sure which files to use
    4. Over 3000 individuals in total across the three studies in 1000G
  4. mouse - 17 sequences, paper for another dataset - dataset, reference
    1. Mouse Genome Project
    2. ftp download site: ftp://ftp-mouse.sanger.ac.uk/
    3. reference
    4. 21 Gb for the variants (REL-1505-SNPs_Indels)
  5. yeast
    1. Yeast Genome Project
    2. reference
    3. reference

Strain issues - Is it reasonable to treat all strains of a species as a single population? If not, we might not get strong results for yeast and Arabidopsis, where the data are per-strain.

How to use this data - All of these sources provide VCF files (variants crossed with population), so the actual sequence of each individual is not given. What we can do is:

  1. Acquire the promoter regions for each organism
  2. Find the variants within each region
  3. Get the monomorphic parts of the sequence from the reference genome
  4. Calculate H values
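
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a settled pipeline: it assumes H means expected heterozygosity per site (H = 1 - sum of squared allele frequencies), which these notes do not define, and it assumes promoter coordinates are already known. The function names (`parse_vcf_line`, `site_h`, `region_h`) and the toy VCF records are hypothetical. Positions with no variant record are treated as monomorphic and contribute H = 0, so the reference sequence itself is not read here.

```python
from collections import Counter

# Toy VCF records (hypothetical data): 4 diploid samples, tab-separated.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\ts4",
    "chr1\t105\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\t0/1",
    "chr1\t300\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t0/0\t0/0\t0/0",
]


def parse_vcf_line(line):
    """Return (chrom, pos, allele indices) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    gt_index = fields[8].split(":").index("GT")
    alleles = []
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        # Handle phased (|) and unphased (/) calls; skip missing alleles (.)
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                alleles.append(int(allele))
    return chrom, pos, alleles


def site_h(alleles):
    """ASSUMPTION: H is expected heterozygosity, 1 - sum(p_i^2)."""
    n = len(alleles)
    if n == 0:
        return 0.0
    counts = Counter(alleles)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def region_h(vcf_lines, chrom, start, end):
    """Mean per-site H over a region (1-based, inclusive coordinates).

    Sites without a variant record count as monomorphic (H = 0).
    """
    total = 0.0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        c, pos, alleles = parse_vcf_line(line)
        if c == chrom and start <= pos <= end:
            total += site_h(alleles)
    return total / (end - start + 1)


if __name__ == "__main__":
    # Hypothetical 10 bp promoter region on chr1
    print(region_h(EXAMPLE_VCF, "chr1", 100, 109))
```

Whether pooling all samples into one allele count is appropriate depends on the answer to the strain question above; if strains should not be treated as one population, the allele counts would need to be computed per group instead.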