
summary stats on X chromosome data are incorrect and simulated SNP data are unrealistic #2

Open
bnsacks opened this issue Aug 22, 2018 · 1 comment

Comments


bnsacks commented Aug 22, 2018

I expect I am doing something wrong (v 2.1.0), but here are the two issues I am having:
(1) Using 633 X chromosome SNPs with males and females (and the sex ratio recorded as indicated in the user manual) and 3 populations, when I run scenarios and do a pre-evaluation, all of the single-sample stats are strongly rejected and the "observed values" (HM1_1, HMO_1) are considerably higher than those I get when I calculate those metrics by hand from my data set. I suspect part of this could be due to indicating the overall sex ratio (pooled across populations), which does not match the sex ratio of any of the original samples (they vary a lot among populations). Therefore, I attempted to simulate data sets using one of the same scenarios and found the second issue:
(2) Most of the simulated X chromosome F genotypes are composed of all 2s or all 0s, depending on the locus, and the M genotypes are, as expected, composed only of 0s and 1s. This seems highly unrealistic for females. They should be more or less in Hardy-Weinberg proportions, not either all heterozygous or all homozygous (and always for the allele indicated by 0 and never by 1).
I would be grateful for any advice on what I might be doing wrong.

I have attached an input file and a simulated dataset below (I changed the .SNP extension to .txt so I could attach them):
3sppinput-1hetsNoMx-corrected2.txt
TESTSIM2_1000.txt

Thanks


aestoup commented Apr 7, 2020

Problem "solved": this user did not understand how genotypes are coded. SNP genotypes are coded 0, 1 or 2 (9 for missing data) according to the number of copies of the (arbitrarily chosen) reference allele at the corresponding locus.

The user believed that 0 and 1 denote homozygous genotypes and 2 a heterozygous genotype, which is wrong. For a diploid genotype, 0 and 2 are the two homozygotes and 1 is the heterozygote.
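To make the coding concrete, here is a minimal Python sketch of the rule described above: a genotype code is the number of copies of the reference allele, with 9 for missing data. The function names `code_genotype` and `describe_diploid` are hypothetical helpers written for this illustration and are not part of DIYABC.

```python
MISSING = 9  # code used for missing data


def code_genotype(alleles, reference):
    """Return the number of copies of `reference` among the observed alleles.

    `alleles` is a tuple with one entry per chromosome copy (two entries for a
    diploid genotype, one for a haploid genotype such as a male X-linked locus).
    A `None` entry means the allele was not observed, giving the missing code.
    """
    if any(a is None for a in alleles):
        return MISSING
    return sum(1 for a in alleles if a == reference)


def describe_diploid(code):
    """Translate a diploid genotype code into words."""
    labels = {
        0: "homozygous for the non-reference allele",
        1: "heterozygous",
        2: "homozygous for the reference allele",
        MISSING: "missing",
    }
    return labels[code]


# Diploid examples (e.g. a female at an X-linked locus), reference allele "A"
for pair in [("A", "A"), ("A", "G"), ("G", "G"), (None, "A")]:
    code = code_genotype(pair, "A")
    print(pair, code, describe_diploid(code))

# Haploid example (e.g. a male at an X-linked locus): only 0 or 1 is possible
print(code_genotype(("A",), "A"), code_genotype(("G",), "A"))
```

Under this coding, a column of female genotypes that is all 2s or all 0s is a locus at which every sampled female is homozygous, not one at which every female is heterozygous.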

Conclusion: we will check anyway (Eric) that simulated (real) pseudo-observed datasets are all consistent with the genotype coding indicated above (0, 1 or 2, and 9 for missing data, according to the number of copies of the arbitrarily chosen reference allele at the corresponding locus). We will do that for all types of markers: A, H, M, X and Y.

Note: the genotypes of X-linked and Y-linked loci for individuals of unknown sex cannot be safely determined and are hence coded 9 (missing data). This choice is open to discussion, but it is a conservative one.
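The rule in the note above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sex labels `"M"` and `"F"`, the treatment of any other value as unknown sex, and the function name `mask_unknown_sex` are choices made for this example and are not taken from the DIYABC source code. The locus type letters (A, H, M, X, Y) are the ones listed in this thread.

```python
MISSING = 9                  # code used for missing data
SEX_LINKED = {"X", "Y"}      # locus types whose coding depends on sex


def mask_unknown_sex(sex, locus_types, genotypes):
    """Return a copy of `genotypes` with sex-linked loci set to missing
    when the individual's sex is not known.

    Assumption for this sketch: `sex` is "M" or "F" when known; any other
    value is treated as unknown.
    """
    if sex in ("M", "F"):
        return list(genotypes)
    return [
        MISSING if locus_type in SEX_LINKED else genotype
        for locus_type, genotype in zip(locus_types, genotypes)
    ]


locus_types = ["A", "X", "Y", "M"]
print(mask_unknown_sex("F", locus_types, [2, 1, 9, 0]))   # sex known: unchanged
print(mask_unknown_sex("9", locus_types, [2, 1, 0, 0]))   # sex unknown: X and Y masked
```

The masking is conservative in the sense of the note: for an individual of unknown sex, a 0 or 1 at an X-linked locus could be either a haploid or a diploid genotype, so the value is discarded rather than guessed.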
