forked from MPI-EVA-Archaeogenetics/comp_human_adna_book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
fstats.qmd
402 lines (304 loc) · 28.7 KB
/
fstats.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
---
title: "Introduction to F3- and F4-Statistics"
author: "Stephan Schiffels"
date: "10/26/2023"
bibliography: references.bib
toc: true
---
## Admixture - F3 Statistics
F3 statistics are a useful analytical tool to understand population relationships. F3 statistics, just as F4 and F2 statistics measure allele frequency correlations between populations and were introduced by Nick Patterson [@Patterson2012], but see also [@Peter2016] for another introduction.
F3 statistics are used for two purposes: i) as a test whether a target population (C) is admixed between two source populations (A and B), and ii) to measure shared drift between two test populations (A and B) from an outgroup (C).
F3 statistics are in both cases defined as the product of allele frequency differences between population C to A and B, respectively:
$$F3(A,B;C)=\langle(c−a)(c−b)\rangle$$
Here, $\langle\cdot\rangle$ denotes the average over all genotyped sites, and a, b and c denote the allele frequency for a given site in the three populations A, B and C.
It can be shown that if $F3(A, B; C)$ is negative, it provides unambiguous proof that population C is admixed between populations A and B, as in the following phylogeny (taken from Figure 1 from [@Patterson2012]:
<img src="img/fstats/f3_phylogeny.png" alt="F3-phylogeny" style="width: 300px;"/>
Intuitively, an F3 statistics becomes negative if the allele frequency of the target population C is on average intermediate between the allele frequencies of A and B. Consider as an extreme example a genomic site where $a=0$, $b=1$ and $c=0.5$. Then we have $(c−a)(c−b)=−0.25$, which is negative. So if the entire statistics is negative, it suggests that in many positions, the allele frequency c is indeed intermediate, suggesting admixture between the two sources.
One way to understand this is by looking what happens to a list of SNPs and allele frequencies for groups A, B and C:
| SNP | A | B | C | $(c-a)(c-b)$ |
|---------------|-----|-----|-----|--------------|
| 1 | 1 | 0 | 0.5 | -0.25 |
| 2 | 0.8 | 0 | 0 | 0 |
| 3 | 0 | 0.7 | 0 | 0 |
| 4 | 0.1 | 0.5 | 0.3 | -0.04 |
| 5 | 0 | 0.1 | 0.2 | 0.02 |
| 6 | 1 | 0.2 | 0.9 | -0.07 |
| $F_3(A, B;C)$ | | | | **-0.057** |
Every SNP where C has an allele frequency _intermediate_ between A and B contributes negatively. Here, the average is also negative, providing evidence for admixture. For statistical certainty, an error bar for this estimate is needed, which is typically computed via Jackknife (see for example the [xerxes whitepaper](https://github.com/poseidon-framework/poseidon-analysis-hs/blob/main/docs/xerxes_whitepaper.pdf)).
:::{.callout-caution}
If an F3 statistics is *not* negative, it does *not* proof that there is no admixture!
:::
### Computing Admixture-F3 with xerxes
We will use this statistics to test if Finnish are admixed between East and West, using different Eastern and Western sources. In the West, we use French, Icelandic, Lithuanian and Norwegian as source, and in the East we use Nganasan and one of the ancient individuals analysed in [@Lamnidis2018], from the site of *Bolshoy Oleni Ostrov* from the Northern Russian Kola-peninsula, and dating to 3500 years before present.
:::{.callout-tip}
If you happen to have downloaded a copy of the [Poseidon Community Archive](https://www.poseidon-adna.org/#/archive_overview?id=public-poseidon-archives) already, then just use the path to that archive in the following commands. Otherwise you download the entire archive via
```{bash eval=FALSE}
trident fetch -d /somewhere/to/store/the/archive --downloadAll
```
or just the relevant packages for the examples in this chapter:
```{bash eval=FALSE}
trident fetch -d /somewhere/to/store/the/archive -f "*2012_PattersonGenetics*,*2014_RaghavanNature*,*2014_LazaridisNature*,*2018_Lamnidis_Fennoscandia*"
```
:::
We use the software `xerxes fstats` from the [Poseidon Framework](https://www.poseidon-adna.org/#/xerxes?id=xerxes-cli-software). Here is a command line that computes 8 statistics for us:
```{bash eval=FALSE}
xerxes fstats -d ~/dev/poseidon-framework/community-archive \
--stat "F3(Nganasan,French,Finnish)" \
--stat "F3(Nganasan, Icelandic, Finnish)" \
--stat "F3(Nganasan, Lithuanian, Finnish)" \
--stat "F3(Nganasan, Norwegian, Finnish)" \
--stat "F3(Russia_Bolshoy, French, Finnish)" \
--stat "F3(Russia_Bolshoy, Icelandic, Finnish)" \
--stat "F3(Russia_Bolshoy, Lithuanian, Finnish)" \
--stat "F3(Russia_Bolshoy, Norwegian, Finnish)" \
```
:::{.callout-note}
Note that `xerxes fstats` will automatically find the right packages from your local archive that contain these groups. You can see in the output of the program which packages contribute:
```
[Info] 5 relevant packages for chosen statistics identified:
[Info] *2012_PattersonGenetics-2.1.3*
[Info] *2014_LazaridisNature-4.0.2*
[Info] *2016_LazaridisNature-2.1.3*
[Info] *2018_Lamnidis_Fennoscandia-2.1.0*
[Info] *2019_Flegontov_PalaeoEskimo-2.2.1*
```
So these five packages contain the samples requested in these statistics. You can inquire about this also more manually using [`trident list`](https://www.poseidon-adna.org/#/trident?id=list-command)
:::
Here is the result that you should get, nicely layouted in a Text-table:
```
.-----------.----------------.------------.---------.---.---------.----------------.--------------------.------------------.---------------------.
| Statistic | a | b | c | d | NrSites | Estimate_Total | Estimate_Jackknife | StdErr_Jackknife | Z_score_Jackknife |
:===========:================:============:=========:===:=========:================:====================:==================:=====================:
| F3 | Nganasan | French | Finnish | | 593124 | -1.0450e-3 | -1.0451e-3 | 1.2669e-4 | -8.249133659451905 |
| F3 | Nganasan | Icelandic | Finnish | | 593124 | -1.1920e-3 | -1.1920e-3 | 1.3381e-4 | -8.908381869946188 |
| F3 | Nganasan | Lithuanian | Finnish | | 593124 | -1.1605e-3 | -1.1605e-3 | 1.5540e-4 | -7.4680182465607245 |
| F3 | Nganasan | Norwegian | Finnish | | 593124 | -1.0913e-3 | -1.0914e-3 | 1.3921e-4 | -7.83945981272796 |
| F3 | Russia_Bolshoy | French | Finnish | | 542789 | -6.1807e-4 | -6.1809e-4 | 1.0200e-4 | -6.059872900228102 |
| F3 | Russia_Bolshoy | Icelandic | Finnish | | 542789 | -6.2801e-4 | -6.2802e-4 | 1.1792e-4 | -5.325695961373772 |
| F3 | Russia_Bolshoy | Lithuanian | Finnish | | 542789 | -3.7310e-4 | -3.7310e-4 | 1.2973e-4 | -2.8760029685791637 |
| F3 | Russia_Bolshoy | Norwegian | Finnish | | 542789 | -3.7646e-4 | -3.7653e-4 | 1.1630e-4 | -3.2375830440434323 |
'-----------'----------------'------------'---------'---'---------'----------------'--------------------'------------------'---------------------'
```
:::{.callout-tip}
Use the option `-f <FILE>` to output the results additionally to a tab-separated file, or `--raw` if you prefer the standard output to be tab-separated
:::
You can see that in all cases this statistic is negative (`Estimate_Total`). The next two columns ( `Estimate_Jackknife` and `StdErr_Jackknife`) are computed using Jackknifing (see [the xerxes whitepaper](https://github.com/poseidon-framework/poseidon-analysis-hs/blob/main/docs/xerxes_whitepaper.pdf) for details). The last column is the Z score, and it is important here: It gives the deviation of the f3 statistic from zero in units of the standard error. As general rule, a Z score of -3 or more suggests a significant rejection of the Null hypothesis that the statistic is not negative. In this case, all of the statistics are significantly negative (with one borderline exception), proving that Finnish have ancestral admixture of East and West Eurasian ancestry. Here, Eastern ancestry is represented by Nganasan or Russia_Bolshoy, respectively, while Western ancestry by French, Icelandic, Lithuanian and Norwegian, respectively. Note that the statistics does not suggest when this admixture happened!
### Running xerxes via a configuration file
As you will have noticed, the command line above is getting quite long, since a separate `--stat` option has to be entered for every statistic to be computed. There is a more powerful and elegant interface to xerxes, which uses a configuration file in [YAML format](https://de.wikipedia.org/wiki/YAML). To illustrate it, let us consider the configuration file (which can be found in `fstats_working/F3_finnish.config` in the git-repository of this book) needed to compute the same statistic as above:
```Yaml
fstats:
- type: F3
a: ["Nganasan", "Russia_Bolshoy"]
b: ["French", "Icelandic", "Lithuanian", "Norwegian"]
c: ["Finnish"]
```
You can then run xerxes as
```{basheval=FALSE}
xerxes fstats -d ~/dev/poseidon-framework/community-archive --statConfig fstats_working/F3_finnish.config
```
So `xerxes` then automatically creates all combinations of populations listed in slots `a`, `b` and `c`.
:::{.callout-note}
Note that there are actually three types of F3-statistics supported by xerxes:
* `F3vanilla`: The purest form, defined literally as $\langle(c−a)(c−b)\rangle$
* `F3`: A bias-corrected version, which is only valid for groups in C that have non-zero heterozygosity
* `F3star`: This one is normalised by the heterozgygosity of the third population, C, as suggested in [@Patterson2012] and implemented in the [Admixtools package](https://github.com/DReichLab/AdmixTools).
The [white-paper]((https://github.com/poseidon-framework/poseidon-analysis-hs/blob/main/docs/xerxes_whitepaper.pdf)) explains this in detail.
:::
:::{.callout-tip}
The configuration file format has a lot more options. Here is a bit more complex example, but see also [the documentation](https://www.poseidon-adna.org/#/xerxes?id=input-via-a-configuraton-file):
```Yaml
# You can define groups right within the configuration file.
# here we use negative selection to remove individuals from the
# newly defined groups
groupDefs:
CEU2: ["CEU.SG", "-<NA12889.SG>", "-<NA12890.SG>"]
FIN2: ["FIN.SG", "-<HG00383.SG>", "-<HG00384.SG>"]
GBR2: ["GBR.SG", "-<HG01791.SG>", "-<HG02215.SG>"]
IBS2: ["IBS.SG", "-<HG02238.SG>", "-<HG02239.SG>"]
fstats:
- type: F2 # this will create 2x2 = 4 F2-Statistics
a: ["French", "Spanish"]
b: ["Han", "CEU2"]
- type: F3vanilla # This will create 3x2x1 = 6 Statistics
a: ["French", "Spanish", "Mbuti"]
b: ["Han", "CEU2"]
c: ["<Chimp.REF>"]
- type: F4 # This will create 5x5x4x1 = 100 Statistics
a: ["<I0156.SG>", "<I0157.SG>", "<I0159.SG>", "<I0160.SG>", "<I0161.SG>"]
b: ["<I0156.SG>", "<I0157.SG>", "<I0159.SG>", "<I0160.SG>", "<I0161.SG>"]
c: ["CEU2", "FIN2", "GBR2", "IBS2"]
d: ["<Chimp.REF>"]
# Altogether: 110 statistics of different types
```
which will not just create multiple statistic using row-combinations, as described, but also uses newly defined groups and combines multiple statistic types (F2, F3 and F4) in one run.
:::
## F4 Statistics
A different way to test for admixture is by “F4 statistics” (or “D statistics” which is very similar), also introduced in [@Patterson2012].
F4 statistics are also defined in terms of correlations of allele frequency differences, similarly to F3 statistics, but involving four different populations, not just three. Specifically we define
$$F4(A,B;C,D)=\langle(a−b)(c−d)\rangle.$$
### Shaping intuition - the ABBA- and BABA-sites
To understand the statistics, consider the following tree:
<img src="img/fstats/f4_phylogeny.png" alt="F4-phylogeny" style="width: 300px;"/>
In this tree, without any additional admixture, the allele frequency difference between A and B should be completely independent from the allele frequency difference between C and D. In that case, F4(A, B; C, D) should be zero, or at least not statistically different from zero. However, if there was gene flow from C or D into A or B, the statistic should be different from zero. Specifically, if the statistic is significantly negative, it implies gene flow between either C and B, or D and A. If it is significantly positive, it implies gene flow between A and C, or B and D.
It is helpful to again consider an example using a SNP list, this time assuming that every population is just a single (haploid) individual, so each allele frequency can just be 0 or 1. For example:
| SNP | A | B | C | D | $(a-b)(c-d)$ |
|------------------|---|---|---|---|--------------|
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | -1 |
| 4 | 0 | 1 | 0 | 1 | 1 |
| 5 | 1 | 0 | 0 | 1 | -1 |
| 6 | 1 | 0 | 0 | 0 | 0 |
| $F_4(A, B;C, D)$ | | | | | **-0.0167** |
You can see that the only SNPs that contribute positively to this statistics are SNPs where the alleles are distributed as `1010` and `0101`, and the only SNPs that contribute negatively are `1001` and `0110`. In the literature, the two patterns have been dubbed "ABBA" and "BABA", which is why the statistical test behind this statistic (see below) was sometimes called the ABBA-BABA test (see for example [@Martin2015]).
The intuition here is straight-forward: In positions that are polymorphic in both $(A,B)$ and $(C,D)$, this statistic asks whether B is genetically more similar to C than it is to D. This is most useful as a test for "treeness": If A, B, C, D are related to each other as indicated in the above tree, then C should be equally closely related to C as to D. But if we actually find evidence that B is closer to C than to D, or vice versa, then this means that the tree above cannot be correct, but that there must be a closer connection between B and C or B and D, depending on the sign of the statistic.
### From single samples to allele frequencies
So the ABBA- and BABA-categories of SNPs help shape intuition for how this statistic behaves for single haploid genomes. But what about population allele frequencies? Looking back at the formula $\langle(a−b)(c−d)\rangle$ this doesn't help very much with intuition how this behaves with frequencies. Well, a nice feature of F4-Statistics is that averages factor out. This means, that if you have multiple samples in one or multiple slots A, B, C or D, the total F4-statistic of the _groups_ is exactly equal to the _average_ of F4-Statistics of the _individuals_. Here is a more mathematical definition.
Let's say we have 2 individuals in each of A and B, so we may perhaps write $A=\{A_1,A_2\}$ and $B=\{B_1,B_2\}$. Then one can show to have
$$F4(A, B; C, D) = \text{Average of}[F4(A_1, B_1; C, D), F4(A_1, B_2; C, D), F4(A_2, B_1; C, D), F4(A_2, B_2; C, D)]$$
so just thte average over all individual-based F4-statistics. And this can be shown to be true for arbitrary numbers of samples. So in other words: An F4-Statistic _always_ measures the _average_ excess of pairwise BABA SNPs over ABBA SNPs. To me, this is a useful insight, as I find thinking in terms of ABBA-BABA somehow more helpful than thinking in terms of correlations of allele-frequency differences (which is really what the original formula is).
:::{.callout-note}
F4-statistics have been famously used to show that Neanderthals are more closely related to Non-African populations than to Africans, suggesting gene-flow between Neanderthals and Non-Africans (shown in [@Green2010]). You can reproduce this famous result with
```{bash eval=FALSE}
xerxes fstats -d ~/poseidon_repo --stat 'F4(<Chimp.REF>,<Altai_published.DG>,Yoruba,French)' \
--stat 'F4(<Chimp.REF>,<Altai_published.DG>,Sardinian,French)'
```
which shows that the first statistic is significantly positive with a Z-score of 7.99, while the second one is insignificantly different from zero (Z=1.01)
:::
The way this statistic is often used, is to put a divergent outgroup as population A, for which we know for sure that there was no admixture into either C or D. With this setup, we can then test for gene flow between B and D (if the statistic is positive), or B and C (if it is negative).
### Running F4-Statistics with xerxes
Here, we can use this statistic to test for East Asian admixture in Finns, similarly to the test using Admixture F3 statistics above. We will again use `xerxes fstats`. We again prepare a configuration file (in `fstats_working/F4_finish.config` in the git-repository of this book), this time with four populations (A, B, C, D):
```Yaml
fstats:
- type: F4
a: ["Mbuti"]
b: ["Nganasan", "Russia_Bolshoy"]
c: ["French", "Icelandic", "Lithuanian", "Norwegian"]
d: ["Finnish"]
```
You can again run via
```{bash eval=FALSE}
xerxes fstats -d ~/dev/poseidon-framework/community-archive --statConfig fstats_working/F4_finnish.config
```
The result is:
```
.-----------.-------.----------------.------------.---------.---------.----------------.--------------------.------------------.--------------------.
| Statistic | a | b | c | d | NrSites | Estimate_Total | Estimate_Jackknife | StdErr_Jackknife | Z_score_Jackknife |
:===========:=======:================:============:=========:=========:================:====================:==================:====================:
| F4 | Mbuti | Nganasan | French | Finnish | 593124 | 2.3114e-3 | 2.3115e-3 | 1.2676e-4 | 18.235604067907143 |
| F4 | Mbuti | Nganasan | Icelandic | Finnish | 593124 | 1.6590e-3 | 1.6590e-3 | 1.4861e-4 | 11.163339072181776 |
| F4 | Mbuti | Nganasan | Lithuanian | Finnish | 593124 | 1.3290e-3 | 1.3290e-3 | 1.4681e-4 | 9.052979707622278 |
| F4 | Mbuti | Nganasan | Norwegian | Finnish | 593124 | 1.6503e-3 | 1.6503e-3 | 1.5358e-4 | 10.745850997260929 |
| F4 | Mbuti | Russia_Bolshoy | French | Finnish | 542789 | 1.8785e-3 | 1.8785e-3 | 1.2646e-4 | 14.854487416366263 |
| F4 | Mbuti | Russia_Bolshoy | Icelandic | Finnish | 542789 | 1.0829e-3 | 1.0828e-3 | 1.4963e-4 | 7.236818881873822 |
| F4 | Mbuti | Russia_Bolshoy | Lithuanian | Finnish | 542789 | 5.4902e-4 | 5.4907e-4 | 1.4601e-4 | 3.7605973064589096 |
| F4 | Mbuti | Russia_Bolshoy | Norwegian | Finnish | 542789 | 9.3473e-4 | 9.3475e-4 | 1.5302e-4 | 6.108881868125652 |
'-----------'-------'----------------'------------'---------'---------'----------------'--------------------'------------------'--------------------'
```
As you can see, in all cases, the Z score is positive and larger than 3, indicating a significant deviation from zero, and implying gene flow between Nganasan and Finnish, and BolshoyOleniOstrov and Finnish, when compared to French, Icelandic, Lithuanian or Norwegian.
## Outgroup-F3-Statistics
Outgroup F3 statistics are a special case how to use F3 statistics. The definition is the same as for Admixture F3 statistics, but instead of a target C and two source populations A and B, one now gives an outgroup C and two test populations A and B.
To get an intuition for this statistics, consider the following tree:
<img src="img/fstats/outgroupf3_phylogeny.png" alt="Outgroup-F3-phylogeny" style="width: 300px;"/>
In this scenario, the statistic F3(A, B; C) measures the branch length from C to the common ancestor of A and B, coloured red. So this statistic is simply a measure of how closely two population A and B are related with each other, as measured from a distant outgroup. It is thus a similarity measure: The higher the statistic, the more genetically similar A and B are to one another.
Here is again a SNP table to illustrate, using haploid individuals:
| SNP | A | B | C | $(c-a)(c-b)$ |
|---------------|---|---|---|--------------|
| 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 |
| 4 | 1 | 0 | 1 | 0 |
| 5 | 1 | 1 | 0 | 1 |
| 6 | 0 | 0 | 1 | 1 |
| $F_3(A, B;C)$ | | | | **0.5** |
You can see that each position which is similar between A and B, but different to C contributes 1, all other SNPs 0. So it directly measures similarity between A and B on alleles that differ from the outgroup C.
:::{.callout-note}
Note that the averaging-relation shown for F4 statistics above is also true for Outgroup-F3 statistics, but only for populations A and B, not for C. So if you have multiple samples in A and B, you may think of this statistic being the _average_ over all pairwise nucleotide similarities between individuals in A and B with respect to the same outgroup C.
:::
We can use this statistic to measure for example the genetic affinity to East Asia, by performing the statistic F3(Han, X; Mbuti), where Mbuti is a distant African population and acts as outgroup here, Han denote Han Chinese, and X denotes various European populations that we want to test.
You can again define a configuration file that performs looping over various populations X for you:
```Yaml
fstats:
- type: F3
a: ["Han"]
b: ["Chuvash", "Albanian", "Armenian", "Bulgarian", "Czech", "Druze", "English",
"Estonian", "Finnish", "French", "Georgian", "Greek", "Hungarian", "Icelandic",
"Italian_North", "Italian_South", "Lithuanian", "Maltese", "Mordovian", "Norwegian",
"Orcadian", "Russian", "Sardinian", "Scottish", "Sicilian", "Spanish_North",
"Spanish", "Ukrainian", "Finland_Levanluhta", "Russia_Bolshoy", "Russia_Chalmny_Varre", "Saami.DG"]
c: ["Mbuti"]
```
which cycles through many populations from Europe, including the ancient individuals from Chalmny Varre, Bolshoy Oleni Ostrov and Levänluhta (described in [@Lamnidis2018]). We store this file in a file called `fstats_working/OutgroupF3_europe.config` and run via:
```{bash, eval=FALSE}
xerxes fstats --statConfig fstats_working/OutgroupF3_europe.config -d ~/dev/poseidon-framework/community-archive -f fstats_working/outgroupf3_europe.tsv
```
:::{.callout-warning}
Often in Outgroup-F3-statistics you use single genomes for population C, sometimes even single haploid genomes. In this case, `F3` and `F3star` will get undefined results, because ordinary `F3` and `F3star` statistics require population C to have non-zero average heterozygosity, so you will need at least one diploid sample, or multiple haploid or diploid samples.
Use `F3vanilla` if your third population C is a single pseudo-haploid sample.
:::
Here is the output of this run (but note that a tab-separated version was also stored in `fstats_working/outgroupf3_europe.tsv` using the option `-f`):
```
.-----------.-----.----------------------.-------.---.---------.----------------.--------------------.------------------.--------------------.
| Statistic | a | b | c | d | NrSites | Estimate_Total | Estimate_Jackknife | StdErr_Jackknife | Z_score_Jackknife |
:===========:=====:======================:=======:===:=========:================:====================:==================:====================:
| F3 | Han | Chuvash | Mbuti | | 593124 | 5.3967e-2 | 5.3967e-2 | 5.0668e-4 | 106.51180329550319 |
| F3 | Han | Albanian | Mbuti | | 593124 | 4.9972e-2 | 4.9973e-2 | 4.9520e-4 | 100.91326321202445 |
| F3 | Han | Armenian | Mbuti | | 593124 | 4.9531e-2 | 4.9531e-2 | 4.7771e-4 | 103.68366652942314 |
| F3 | Han | Bulgarian | Mbuti | | 593124 | 5.0103e-2 | 5.0103e-2 | 4.8624e-4 | 103.04188532686614 |
| F3 | Han | Czech | Mbuti | | 593124 | 5.0536e-2 | 5.0536e-2 | 4.9261e-4 | 102.58792370749681 |
| F3 | Han | Druze | Mbuti | | 593124 | 4.8564e-2 | 4.8564e-2 | 4.6788e-4 | 103.79674299622445 |
| F3 | Han | English | Mbuti | | 593124 | 5.0280e-2 | 5.0281e-2 | 4.9183e-4 | 102.23198323949656 |
| F3 | Han | Estonian | Mbuti | | 593124 | 5.1154e-2 | 5.1155e-2 | 5.0350e-4 | 101.59882496016485 |
| F3 | Han | Finnish | Mbuti | | 593124 | 5.1784e-2 | 5.1784e-2 | 5.0603e-4 | 102.33488758899031 |
| F3 | Han | French | Mbuti | | 593124 | 5.0207e-2 | 5.0208e-2 | 4.8552e-4 | 103.40976592749682 |
| F3 | Han | Georgian | Mbuti | | 593124 | 4.9711e-2 | 4.9711e-2 | 4.8100e-4 | 103.34881140790415 |
| F3 | Han | Greek | Mbuti | | 593124 | 4.9874e-2 | 4.9874e-2 | 4.8994e-4 | 101.79554640756365 |
| F3 | Han | Hungarian | Mbuti | | 593124 | 5.0497e-2 | 5.0498e-2 | 4.9878e-4 | 101.24215699276706 |
| F3 | Han | Icelandic | Mbuti | | 593124 | 5.0680e-2 | 5.0680e-2 | 4.9729e-4 | 101.91303336514295 |
| F3 | Han | Italian_North | Mbuti | | 593124 | 4.9903e-2 | 4.9904e-2 | 4.8436e-4 | 103.03094306099203 |
| F3 | Han | Italian_South | Mbuti | | 592980 | 4.9201e-2 | 4.9201e-2 | 5.1170e-4 | 96.15239597674244 |
| F3 | Han | Lithuanian | Mbuti | | 593124 | 5.0896e-2 | 5.0896e-2 | 5.0638e-4 | 100.50984037418753 |
| F3 | Han | Maltese | Mbuti | | 593124 | 4.8751e-2 | 4.8751e-2 | 4.7500e-4 | 102.63442479673623 |
| F3 | Han | Mordovian | Mbuti | | 593124 | 5.1820e-2 | 5.1820e-2 | 4.8853e-4 | 106.07409963190884 |
| F3 | Han | Norwegian | Mbuti | | 593124 | 5.0724e-2 | 5.0724e-2 | 4.9514e-4 | 102.4454387098217 |
| F3 | Han | Orcadian | Mbuti | | 593124 | 5.0469e-2 | 5.0469e-2 | 4.9485e-4 | 101.98814656611475 |
| F3 | Han | Russian | Mbuti | | 593124 | 5.1277e-2 | 5.1277e-2 | 4.8613e-4 | 105.48070801791317 |
| F3 | Han | Sardinian | Mbuti | | 593124 | 4.9416e-2 | 4.9417e-2 | 4.8908e-4 | 101.04049389691913 |
| F3 | Han | Scottish | Mbuti | | 593124 | 5.0635e-2 | 5.0635e-2 | 5.0565e-4 | 100.13962104744425 |
| F3 | Han | Sicilian | Mbuti | | 593124 | 4.9194e-2 | 4.9194e-2 | 4.8157e-4 | 102.15353663091187 |
| F3 | Han | Spanish_North | Mbuti | | 593124 | 5.0032e-2 | 5.0032e-2 | 4.9377e-4 | 101.32594226555439 |
| F3 | Han | Spanish | Mbuti | | 593124 | 4.9693e-2 | 4.9693e-2 | 4.8551e-4 | 102.35200847948641 |
| F3 | Han | Ukrainian | Mbuti | | 593124 | 5.0731e-2 | 5.0731e-2 | 4.9506e-4 | 102.47529111692852 |
| F3 | Han | Finland_Levanluhta | Mbuti | | 303033 | 5.4488e-2 | 5.4488e-2 | 5.7681e-4 | 94.46487653920919 |
| F3 | Han | Russia_Bolshoy | Mbuti | | 542789 | 5.7273e-2 | 5.7273e-2 | 5.2875e-4 | 108.31739594898687 |
| F3 | Han | Russia_Chalmny_Varre | Mbuti | | 428215 | 5.4000e-2 | 5.4000e-2 | 5.6936e-4 | 94.84371082564112 |
| F3 | Han | Saami.DG | Mbuti | | 585193 | 5.4727e-2 | 5.4728e-2 | 5.5546e-4 | 98.5265149143263 |
'-----------'-----'----------------------'-------'---'---------'----------------'--------------------'------------------'--------------------'
```
Now it’s time to plot these results using R. Let's first read in the table:
```{r}
d <- read.csv("fstats_working/outgroupf3_europe.tsv", sep = "\t")
```
We can check that it worked:
```{r}
head(d)
```
Nice, now on to plotting (here I'm using Base R for zero-dependency pain, you're welcome!):
```{r}
#| fig-height: 8
#| fig-width: 8
#| out-width: "100%"
#| fig-align: "center"
order <- order(d$Estimate_Total) # order the estimates for visual effect
x <- d$Estimate_Jackknife[order]
xErr <- d$StdErr_Jackknife[order]
y <- seq_along(d$b)
plot(x, y, xlab = "Z Score", ylab = NA, yaxt = "n", # plot no y-axis ticks
xlim = c(0.048,0.06),
main = "F3(Han, X; Mbuti)")
# plot the labels
text(x + 0.001, y, labels = d$b, adj=0)
# plot the error bars
arrows(x - xErr, y, x + xErr, y, length=0.05, angle=90, code=3)
```
As expected, the ancient samples and modern Saami are the ones with the highest allele sharing with present-day East Asians (as represented by Han) compared to many other Europeans.