%% Introduction to statistics for hypothesis testing
% This Matlab livescript is intended to help explain statistical methods for
% hypothesis testing using code and examples (with minimal mathematics). It is
% restricted to univariate tests on continuous data, and focuses on issues that
% arise in neuroimaging, e.g, when there are multiple tests on topographically
% organised and smooth data. The code is written to aid comprehension, rather
% than for speed/efficiency. Some sections require having run code in preceding
% sections. It assumes basic knowledge of programming (eg. loops / functions /
% vectorisation). It was originally written for the MRC CBU's COGNESTIC Summer
% School, but can stand alone.
%
% It starts with sampling theory, the central limit theorem and randomisation
% testing, then proceeds to null hypothesis testing (error rate and power), focusing
% on T-tests and including Bayes Factors, then moves to the multiple comparison
% problem, height and cluster-level family-wise and false-discovery correction,
% before introducing the General Linear Model, ANOVAs, F-tests, error nonsphericity,
% error partitioning, multiple regression, timeseries, and finishes with Linear
% Mixed Effects models.
%
% Any corrections/comments welcomed. Rik Henson, May 2024. [email protected].
% Thanks to Danny Mitchell for comments.
%% Random Sampling
% Let's assume we have data from a large population of people, where the values
% have a Gaussian distribution with mean of 0 and standard deviation (std) of
% 1 (a "unit normal" distribution). We will create this distribution explicitly
% below for the sake of comprehension (even though most programming languages
% allow you to draw randomly from such a distribution on the fly):
clear all
rng(1,'twister'); % Initialise random seed
pop_size = 1e6; % Needs to be big (but not exceed RAM!)
pop_mean = 0; pop_std = 1;
pop = randn(pop_size,1)*pop_std + pop_mean; % Draw randomly from distribution
% since pop_size finite, re-centre (important if pop_size not big enough)
pop = (pop - mean(pop))/std(pop); % Reset to exactly unit normal ("Z-scoring")
pop = pop*pop_std + pop_mean; % re-set to desired std and mean
[pdf,vals] = hist(pop,100); pdf = pdf/pop_size; % plot histogram
figure,bar(vals,pdf),title('Unit normal population distribution')
xlabel('Data value'); ylabel('Proportion')
%%
% Let's pretend we don't know the mean of the population, but want to estimate
% it by drawing a random sample from the population.
%%
samp_size = 1e2; % Sample size
select = ceil(rand(samp_size,1)*pop_size); % random subset of people from population
samp = pop(select); % their values
fprintf('Mean of one sample = %4.3f\n',mean(samp))
%%
% We can make this process an inline function (technically, this is sampling
% with replacement, but when samp_size << pop_size, repeat values are extremely
% unlikely, and this is faster than sampling without replacement):
sample_pop = @(pop,samp_size,num_samp) pop(ceil(rand(samp_size,num_samp)*length(pop)));
%%
% But of course there will be uncertainty associated with any such estimate
% of the population mean from a smaller sample. We can see this by repeating the
% process many times, to get a new distribution, but this time of the sample _means_:
num_samp = 1e4; % Number of samples
samp = sample_pop(pop,samp_size,num_samp);
mean_vals = mean(samp);
[pdf,vals] = hist(mean_vals,100); pdf = pdf/num_samp;
figure,bar(vals,pdf),title(sprintf('Distribution of means from %d samples of size %d',num_samp,samp_size))
xlabel('Estimate of Mean'); ylabel('Proportion')
fprintf('Std of distribution of means of samples = %4.3f\n',std(mean_vals))
%%
% Note that the std of the distribution of mean estimates ("standard error
% of mean") is equal to the std of the population divided by the square-root of
% the sample size - something we will come back to later.
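%
% As a quick check (a minimal sketch using the variables defined above), we can
% compare the empirical standard error with this analytic value:
%%
sem_empirical = std(mean_vals); % std of the simulated distribution of sample means
sem_analytic = pop_std/sqrt(samp_size); % population std divided by sqrt of sample size
fprintf('Empirical SEM = %4.3f vs analytic SEM = %4.3f\n',sem_empirical,sem_analytic)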
%% Central Limit Theorem
% Another thing worth noting about this distribution of mean estimates is that,
% with big enough sample size, it will approach a Gaussian distribution, even
% if the original population distribution is not Gaussian ("Central Limit Theorem"):
%%
pop2 = pop.^2; % Create a positively skewed distribution
%pop3 = pop.^3; % Create a positively kurtotic distribution
pop2 = (pop2 - mean(pop2))/std(pop2); % Reset to have mean of 0 and std of 1 (if were Gaussian!)
[pdf,vals] = hist(pop2,100); pdf = pdf/pop_size;
figure,bar(vals,pdf),title('Non-normal population distribution')
xlabel('Data value'); ylabel('Proportion')
samp_size = [10 40 100]; % Explore range of sample sizes
mean_vals = nan(num_samp,length(samp_size));
for s = 1:length(samp_size)
samp = sample_pop(pop2,samp_size(s),num_samp);
mean_vals(:,s) = mean(samp)';
end
[pdf,vals] = hist(mean_vals,100); pdf = pdf/num_samp; figure,bar(vals,pdf,3)
title('Distribution of sample means for different sample sizes')
legend(compose('%d',samp_size))
%%
% Not only does the spread of the distributions decrease as sample size
% increases, but the distributions become more Gaussian (e.g, the distribution
% from a sample size of 10 is still skewed, like the original population, but hardly
% skewed when sample size is 100).
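%
% We can quantify this (a quick check, using "skewness" from the stats toolbox):
%%
fprintf('Skewness of sample means for n = %3d: %4.3f\n',[samp_size; skewness(mean_vals)])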
%% Randomisation (permutation) testing
% Normally we only have one sample, and hence only one mean, so how do we know
% whether the population mean is, for example, greater than 0? One way to estimate
% the variability of our mean estimate is to re-sample randomly, but this time
% from our single sample (rather than the population). One approach is to select
% values randomly with replacement, which is called "bootstrapping". However,
% another way that is more appropriate for hypothesis testing is called "permutation
% testing" (sampling without replacement). For testing the mean of a single sample,
% permutation is equivalent to randomly switching the sign of each value in the
% sample (see later example of a paired T-test for justification). This is a type
% of "randomisation" test (also called a "non-parametric" test). If we repeat
% this a large number of times, we get an estimate of the distribution of the
% mean:
%%
samp_size = 100
one_sample = sample_pop(pop,samp_size,1);
num_rand = 1e4; % Number of randomisations
mean_vals = nan(num_rand,1);
for r = 1:num_rand
%resamp = one_sample(ceil(rand(samp_size,1)*samp_size)); % bootstrap sample with replacement
resamp = one_sample.*sign(rand(samp_size,1)-0.5); % permutation sample (random sign flips)
mean_vals(r) = mean(resamp);
end
[pdf,vals] = hist(mean_vals,100); pdf = pdf/num_rand; figure, bar(vals,pdf)
true_mean = mean(one_sample);
hold on, line([true_mean true_mean]',[0 1.1*max(pdf)],'Color','r')
title('Distribution of sample means from randomisation')
legend({'Rand. Est.','True Mean'})
fprintf('Std of distribution of means from randomisation = %4.3f\n',std(mean_vals))
%%
% Notice that the standard error of the mean is again close to the expected
% value of pop_std/sqrt(samp_size). This randomised distribution can be used to
% calculate confidence intervals around the mean, e.g, the upper part of the 95%
% confidence interval can be obtained by sorting the estimates of the mean from
% smallest to largest, and selecting the value just above the 95th percentile
% (assuming num_rand>100; see later):
mean_vals = sort(mean_vals, 'ascend');
ci95 = mean_vals(floor(.95*length(mean_vals))+1);
fprintf('Sample mean is %4.3f with upper 95-percent confidence interval of %4.3f\n',true_mean,ci95);
%%
% In most cases (but does depend on your random seed!), the true mean should
% be less than the 95% upper confidence value.
%% Hypothesis Testing
% We can now have a first stab at hypothesis testing. In traditional "null hypothesis
% significance testing" (NHST), to test our hypothesis that the mean of the population
% is greater than 0 (sometimes called the alternate hypothesis, H1), we strive
% to reject the null hypothesis (H0) that the mean of the population is zero.
% To test this with our sample, we can use the randomise_one_sample function given
% at end of script, which just uses the permutation code above (or other randomisation
% method) and returns the proportion of permuted means that are as big, or bigger,
% than the actual sample mean:
%%
p_value = randomise_one_sample(one_sample);
fprintf('Proportion of randomised means greater than sample mean = %4.3f',p_value);
%%
% Under H0, this proportion is equally likely to lie anywhere between 0
% and 1. This proportion is an estimate of the "p-value": the probability of getting
% a statistic as large or larger than a critical value, given that H0 is true.
% We generally want this p-value to be small, i.e. want minimal risk of falsely
% rejecting H0 when it is in fact true. This threshold is called "alpha" and is
% conventionally set at .05, ie a 1 in 20 chance that we conclude the population
% mean is not zero when in fact it is. This is also called the false positive
% rate (or "type I error" rate).
%
% _Interesting factoid: What is the origin of 0.05 for alpha? A story says
% that the statistician Sir Ronald Fisher was crossing a court of his Cambridge
% college when someone yelled: "What odds would you trust for something not happening
% by chance?". Apparently he replied "1 in 20", perhaps without intending that
% this should become scientific dogma (though 0.05 was also used in his publications)._
%% False positive rate and power
% To check that our test does control false positive rate (FPR), we can simulate
% drawing many samples from the population (eg. doing many experiments), and calculate
% the proportion of experiments that give a false positive:
alpha = 0.05
samp_size = 20 % more realistic for human psychological experiment?
num_samp = 1e4; % reduce if want to speed up, but fpr will be less precise
p_values = nan(num_samp,1);
for s = 1:num_samp
samp = sample_pop(pop,samp_size,1); % Take sample from population
p_values(s) = randomise_one_sample(samp,num_rand);
end
fpr = sum(p_values<alpha)/num_samp;
fprintf('Randomised (false) positive rate when population has zero-mean = %4.3f',fpr);
%%
% This shows that our permutation test does ensure that only around 5% of
% experiments would produce a false positive.
%
% Note that the resolution of the estimated p-value depends on the number
% of random resamples. For example, if num_rand=1e2, then we could only estimate
% the p-value to 2 decimal places. In general, a minimum num_rand of 1e4 is recommended.
% However, p-value resolution can also be affected by the sample size if the samp_size
% is small. For example, if samp_size=10, there are 2^samp_size=1024 possible
% permutations for a one-sample test. Furthermore, the randomise_one_sample function
% below does not enumerate all such permutations - it randomly samples possible
% permutations - making it likely that the same permutation will be sampled more
% than once if the sample size is not large enough (proper statistical software
% does enumerate all (or a subset of num_rand) permutations if feasible, but not
% done here to save time). So be wary of randomised p-values if the sample size
% is less than log2(1e4) (though note that greater resolution can be achieved
% by parametric approximation to the randomised distribution, e.g, using "percentile"
% interpolation).
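%
% For example (a quick illustration of this resolution limit):
%%
for n = [8 10 14 20]
fprintf('samp_size = %2d: 2^n = %7d possible sign-flips (vs num_rand = %d)\n',n,2^n,num_rand)
end
%%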
%
% _Exercise: you can check robustness to non-Gaussian populations by replacing
% "pop" with "pop2" or "pop3" (for skewed or kurtotic distributions), or try "bootstrap"
% as the randomisation method (passing 'bootstrap' as the third argument to randomise_one_sample).
% What do you find, particularly when reducing samp_size to ~10? (Note there are
% yet other randomisation methods like the Jackknife (leave-one-out resampling), but
% these are beyond present scope.)_
%
% But what if our hypothesis is true, and the population mean is greater
% than 0? How often does our test detect this? This is called the "power" of a
% test (sometimes denoted as 1-beta, given a certain alpha). Let's create a new
% population with mean of 0.5 (which corresponds to a Cohen's d effect size of
% 0.5, since Cohen's d is mean divided by std, and here std=1):
%%
effect_size = 0.5; % Effect size for H1
pop_H1 = pop + effect_size; % (Population if H1 true)
for s = 1:num_samp
samp = sample_pop(pop_H1,samp_size,1);
p_values(s) = randomise_one_sample(samp,num_rand);
end
tpr = sum(p_values<alpha)/num_samp;
fprintf('Randomised (true) positive rate with population effect size %3.2f = %4.3f',effect_size,tpr);
%%
% This is a power of around 70% (normally we aim for >80%). Power increases
% with sample size, as you will see later.
%
% Note that there are other classes of non-parametric tests that have analytic
% solutions, e.g, those that ignore the magnitude of values and use only their
% ranks (or even just their signs). These rank-based statistics test the null
% hypothesis that the median (rather than mean) is zero, but are not considered
% here. Note also that you do not have to permute the mean - you can permute many
% different summary statistics, such as the median, or the T-statistic (below),
% or even properties like the maximum or cluster extent (later).
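%
% As a minimal sketch, permuting a different statistic (here the median) just
% swaps the summary statistic inside the permutation loop used earlier:
%%
median_null = nan(num_rand,1);
for r = 1:num_rand
resamp = one_sample.*sign(rand(length(one_sample),1)-0.5); % same sign-flip scheme as before
median_null(r) = median(resamp); % permute the median rather than the mean
end
p_median = sum(median_null >= median(one_sample))/num_rand;
fprintf('Permutation p-value for the median = %4.3f\n',p_median)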
%% The T-test
% Next though, we introduce the T-test. This test uses a T-statistic (sometimes
% called "Student's T") which is defined for one sample as the mean divided by
% the standard error of the mean:
%%
one_t = @(y) squeeze(mean(y)./(std(y)/sqrt(size(y,1))));
%%
% Some maths can show that, under certain assumptions (that the population
% is Gaussian), the T-values under H0 follow a distribution called the T-distribution,
% which depends also on the degrees of freedom (df). For this simple case of a
% one-sample test, the df = samp_size - 1 (the -1 because we have used one df
% in our data to estimate the mean). From this, we can calculate the (one-tailed;
% see later) probability of obtaining a value of T or higher:
T2p_higher = @(T,df) squeeze(1-tcdf(T,df));
T = one_t(samp); % use last sample from above
df = samp_size - 1; % degrees of freedom
p = 1-tcdf(T,df); % p-value (use "p" because now parametric)
fprintf('T(%d) = %3.2f, p = %4.3f\n',df,T,p);
%%
% where "tcdf" is the cumultive density function for the T distribution
% (in Matlab stats toolbox). For this sample (drawn from a population with mean
% of 0.5), we can reject H0 (that population mean is 0) because p<.05.
%
% This is called a "parametric" test, because it depends on a parametric
% (here Gaussian) assumption about the distribution of the property under test
% (here the mean). According to the CLT shown earlier, this assumption is met
% when samp_size is large (in fact, when df is very large, T-distribution becomes
% Gaussian, and the p-value is equivalent to that for the Z-statistic, Z = mean/std).
% For smaller samples, there are tests to determine whether the sample comes from
% a normal distribution, but ironically, the smaller the sample size (and hence
% bigger the risk that the assumption of a Gaussian distribution of the mean is
% violated), the less sensitive such "tests of normality" become.
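%
% For example, the critical T-value approaches the Gaussian (Z) critical value
% as the df grow (a quick check using tinv/norminv from the stats toolbox):
%%
for df_demo = [5 20 100 1e4]
fprintf('df = %5d: critical T = %4.3f (Gaussian: %4.3f)\n',df_demo,tinv(1-alpha,df_demo),norminv(1-alpha))
end
%%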
%
% _Interesting factoid: the "Student" in "Student's T-test" was the pseudonym
% of a statistician called Gosset, who worked for Guinness to quality-control
% batches of beer, and used the pseudonym to avoid the ban on publishing research
% that might have commercial implications._
%
% We can also check that the T-test controls the FPR and gives similar Power:
%%
num_samp = 1e4;
samp = sample_pop(pop, samp_size, num_samp); % Draw num_samp samples of samp_size
p = T2p_higher(one_t(samp), samp_size-1);
fpr = sum(p<alpha)/num_samp;
fprintf('False positive rate for T-test when H0 true = %4.3f\n', fpr);
samp = sample_pop(pop_H1, samp_size, num_samp);
p = T2p_higher(one_t(samp), samp_size-1);
tpr = sum(p<alpha)/num_samp;
fprintf('True positive rate for T-test under H1 with effect size of %3.2f = %4.3f\n', effect_size, tpr);
fprintf('(cf. for randomisation test above)');
%%
% (note how much quicker this is than randomisation testing!). However,
% what happens when parametric assumptions are not met? We can repeat, but drawing
% a large number of small samples from a non-Gaussian population (eg the skewed
% "pop2"):
samp_size = 10
samp = sample_pop(pop2, samp_size, num_samp); % Draw num_samp samples of samp_size
p = T2p_higher(one_t(samp), samp_size-1);
fpr = sum(p<alpha)/num_samp;
fprintf('False positive rate for T-test when H0 true = %4.3f\n', fpr);
%%
% The first thing to note is that the FPR is now less than alpha, i.e.,
% the T-test has become conservative (ie it "fails safe"). This means the test
% may be less sensitive, but at least has not increased the FPR above the accepted
% rate (cf. randomisation tests if you tried exercise above).
%
% So if this one-sample T-test is quite robust, even to small samples, you
% might wonder why we bother with permutation tests, given they are computationally
% expensive? (though note that there are situations in two-sample T-tests where the
% parametric test is unreliable; see later). One reason is that randomisation is more general, in
% that we can calculate probabilities for properties other than the mean (such
% as the maximum; see later), which may not have a parametric (analytic) solution.
%
% So far, we have been testing H1 that the mean is greater than 0. This is
% called a directional or "one-tailed" test. We can also perform a two-tailed
% test to ask whether the mean is simply different from 0, by using the absolute
% value of T:
T2p_2tails = @(T,df) squeeze(1-tcdf(abs(T),df));
%%
% Because we are effectively doing two tests now, we can divide our alpha
% by 2 (a special case of the Bonferroni correction; see later).
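%
% (Equivalently, one can double the one-tailed p-value of abs(T) and compare it
% to alpha itself; a quick check with the last T and df from above:)
%%
fprintf('p = %4.3f vs alpha/2 = %4.3f; 2*p = %4.3f vs alpha = %4.3f\n', ...
T2p_2tails(T,df), alpha/2, 2*T2p_2tails(T,df), alpha)
%%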
%
% We can now calculate power curves for a two-tailed, one-sample T-test:
%%
num_samp = 1e3;
alpha_cor = alpha/2
samp_sizes = [10:20:210];
effect_sizes = [0 0.2 0.5 0.8];
figure; hold on; title(sprintf('Power curves for two-tailed, one-sample T'))
pr = nan(length(samp_sizes),length(effect_sizes));
for es = 1:length(effect_sizes);
pop_H1 = pop + effect_sizes(es);
for s = 1:length(samp_sizes)
samp = sample_pop(pop_H1, samp_sizes(s), num_samp);
p = T2p_2tails(one_t(samp),samp_sizes(s)-1);
%p = randomise_one_sample(samp,num_rand,2,'permute');
pr(s,es) = sum(p<alpha_cor)/num_samp;
end
end
plot(samp_sizes,pr,'o-');
line([0 max(samp_sizes)+10],[alpha alpha],'LineStyle',':');
line([0 max(samp_sizes)+10],[.8 .8],'LineStyle','--');
txt = compose('%2.1f',effect_sizes); txt{end+1} = 'alpha'; txt{end+1} = '1-beta';
legend(txt,'Location','East')
axis([min(samp_sizes)-10 max(samp_sizes)+10 0 1])
xlabel('Sample Size'); ylabel('Power')
%%
% So for what Cohen called a "medium" effect size of 0.5, you now need a
% sample size of ~30 for 80% power (more than earlier because now two-tailed),
% but for what he called a "small" effect of 0.2, you need around 200.
%% Bayes Factors
% The above classical or "frequentist" statistics estimate the likelihood of
% getting a statistic as large or larger than that obtained from the data, given
% that the null hypothesis is true. One consequence of this is that, if this p-value
% does not fall below our alpha, then we cannot reject H0, but nor can we accept
% it ("absence of evidence is not evidence of absence"). A Bayes Factor (BF) on
% the other hand is the ratio of two likelihoods: the probability of exceeding
% that statistic value if H1 is true, relative to the same probability if H0 is
% true (this ratio is often called BF10; while its reciprocal is BF01). A high
% BF10 (e.g, 6 or 10; a matter of convention like alpha) would constitute evidence
% for H1, whereas a high BF01 (low BF10) would constitute evidence for the null.
% Being able to claim support for H0 is a major strength of BFs (there are other
% philosophical differences that are beyond the present remit).
%
% The problem with calculating a BF however is specifying H1 precisely. This
% means not just an effect size (which could be based on previous studies for
% example), but also the precise distribution of the effect. The nice thing about
% NHST is that scientists can generally agree that an effect of exactly 0 is a
% reasonable null, but with BFs, they also have to agree on H1. To estimate the
% likelihood under H1, one needs to impose some priors, and if scientists use
% different priors, they could come to different conclusions.
%
% One solution called "subjective priors" is to use priors that come from
% previous data or simulations or theory. While other scientists might disagree
% with these, they can at least be specified in advance, ideally with agreement
% from independent reviewers (eg in a Study Registration). An alternative is to
% stick with "objective priors", i.e, default priors that do not differ by experiment,
% and which the community agrees in advance (e.g, default priors for a Bayesian
% T-test). (A third option is "empirical priors", which arise when we have hierarchical
% models, as mentioned later).
%
% To calculate the likelihood of getting a statistic (here the mean), we
% need to parametrise its prior distribution under H0 and H1. The likelihood is
% then estimated by integrating over all possible values of these parameters (so-called
% "marginalisation").
%
% For H0, this is easy, because the prior corresponds to a delta function
% at 0. So the likelihood is just the likelihood of getting the actual mean given
% the standard error of the mean. Under the assumption of a Gaussian distribution
% of the population, this is given by the normal probability density function
% (PDF):
%%
likelihood_data = @(y,mu) normpdf(mean(y), mu, std(y)/sqrt(length(y)));
samp = sample_pop(pop,1e2,1); % a random sample of 100 from unit normal
likelihood_mean_H0 = likelihood_data(samp,0) % Likelihood of mean under H0
%%
% (where the likelihood is a probability density rather than a probability, so is not necessarily
% between 0 and 1). If we also assume a Gaussian distribution for H1, then this is fully described
% by two parameters: its mean and std. Here we assume the H1 mean is 0 (like H0),
% but by allowing a non-zero std (unlike H0), we increase the chance of mean values
% further from zero. If we can agree on a std (eg 0.5), then the likelihood under
% H1 is given by integrating over all possible values of the mean generated by
% this prior distribution:
H1_mean = 0; H1_std = 0.5;
prior = @(x) normpdf(x, H1_mean, H1_std);
marginal = @(x,y) likelihood_data(y,x) .* prior(x);
likelihood_mean_H1 = integral(@(x) marginal(x,samp), -Inf, Inf)
%%
% BF10 is then just the ratio of these likelihoods:
BF10 = likelihood_mean_H1/likelihood_mean_H0
BF01 = 1/BF10;
%%
% which will tend to be <1 (or BF01>1), since the data were sampled from
% a zero-mean population, so the BF should favour H0. Note there are other types
% of BF, for example based on the T-statistic rather than the mean, which use
% more sophisticated assumptions about the distributions of the mean and std,
% and which are beyond present scope (such as the JZS prior, commonly used as the "objective
% prior" for a T-test).
%
% The above process is implemented in the function "one_sample_BF" at end
% of this script. We can plot BFs as a function of the effect size in the data
% and the variability in the mean allowed by H1:
%%
effect_sizes = [0 0.2 0.4 0.6]; % effect sizes in data
H1_mean = [0 0 0]; H1_std = [0.2 0.4 0.6]; % variability under H1
num_samp = 1e3;
samp_size = 1e2
BF10_thr = 3
pr = nan(length(effect_sizes),length(H1_std)+1); % positive rate for BF10>thr
nr = pr; % negative rate for BF01>thr (BF10<1/thr)
for es = 1:length(effect_sizes)
pop_H1 = pop + effect_sizes(es);
samp = sample_pop(pop_H1,samp_size,num_samp);
p = T2p_2tails(one_t(samp),samp_size-1);
BF10 = nan(num_samp,length(H1_std));
for h1 = 1:length(H1_std)
for s = 1:num_samp
BF10(s,h1) = one_sample_BF(samp(:,s),H1_std(h1),H1_mean(h1));
end
end
pr(es,1) = sum(p<alpha_cor)/num_samp;
pr(es,2:end) = sum(BF10 > BF10_thr)/num_samp;
nr(es,2:end) = sum(BF10 < 1/BF10_thr)/num_samp;
end
txt = {}; for h1 = 1:length(H1_std); txt{h1} = sprintf('H1std=%2.1f',H1_std(h1)); end
txt{end+1} = sprintf('p<%4.3f',alpha_cor);
figure, bar(effect_sizes,pr(:,2:end)), title(sprintf('Positive rate for BF10>%d',BF10_thr))
hold on, plot(effect_sizes,pr(:,1),'o'), axis([-0.1 0.7 0 1])
legend(txt,'Location','NorthWest')
set(gca,'XTick',effect_sizes)
ylabel('Positive Rate'), xlabel('Data effect size')
%%
% Note that thresholding BFs in the above manner leads to the NHST notion
% of (false) positive rates, and many Bayesians prefer to keep only the BF as
% a continuous estimate of how much one should update their beliefs given the
% data (or prefer to focus on the posterior distributions of the parameters themselves,
% which is beyond present remit). Nonetheless, the bars for effect size = 0 show
% comparable false positives when there is no effect. Importantly, when the effect
% size is greater than 0, power for the BF approach increases as the prior std
% for H1 becomes tighter. Even more important however is the positive rate when
% the BF favours H0:
figure, bar(effect_sizes,nr(:,2:end)), title(sprintf('Positive rate for BF10<1/%d',BF10_thr))
legend(txt{1:(end-1)},'Location','NorthEast'), axis([-0.1 0.7 0 1])
set(gca,'XTick',effect_sizes)
ylabel('Positive Rate'), xlabel('Data effect size')
%%
% This is not something we can calculate with a p-value (though classical
% "equivalence tests" perform a similar function), but with BFs, we have a reasonable
% chance of concluding that the evidence favours the null when the null is true
% (effect size is 0). Note that there is also some chance that we will support
% H0 when the true effect size is non-zero, but smaller than we expected (i.e,
% when the effect size is 0.2 but H1std>0.2). This is not necessarily unreasonable.
% In other words, BFs are sensitive to the size of the effect; not just whether
% it is present (non-zero) or absent (zero).
%
% _Exercise: what happens if you set the prior mean of H1 to be >0, eg H1_mean
% = 0.6? You should see more cases where H0 is favoured when the effect size is
% less than 0.6 (particularly when the H1 std is small, ie when we have more precise
% expectations that the effect should be 0.6) - can you see why?_
%
% Another advantage of using BFs is they naturally allow sequential designs,
% where we keep collecting more data until our BF threshold (for H1 or H0) is
% exceeded, which can be more efficient on average (at the risk of a slight increase
% in false positives). Nonetheless, we return to the currently more conventional
% frequentist p-values for the rest of this script (where there is at least less
% room for debate about priors).
%% Multiple Comparisons (correcting for many tests)
% Sometimes we have multiple hypotheses that we wish to test. This is common
% in neuroimaging, eg for an ERP from a single channel, where we might have
% 100 time points, or a 3D MRI image of 100x100x100 voxels. We'll only consider
% the 1D case here, with the number of tests given by "num_tests", though the
% logic here generalises to more than one dimension. So our data will now be a
% 3D matrix of samp_size (eg, number of participants) by num_tests (eg, number
% of voxels) by num_samp (number of experiments), which we're going to call "y".
%
% Let's start by assuming that each test is independent, and H0 is true for
% all tests. We now wish to control the "family-wise error rate" (FWE) of at least
% one false positive across _all the tests_ (to be less than alpha):
%%
rng(10)
samp_size = 20
num_tests = 50
num_samp = 1e4;
y = nan(samp_size,num_tests,num_samp);
p = nan(num_tests,num_samp);
for s = 1:num_samp
samp = sample_pop(pop,samp_size,num_tests);
y(:,:,s) = samp; % store for below
p(:,s) = T2p_higher(one_t(samp),samp_size-1);
end
sig = (p < alpha);
figure,bar(sum(sig,2)/num_samp), title('Proportion significant for each test')
axis([0 num_tests+1 0 0.1])
xlabel('Test number (eg voxel)'); ylabel('Proportion of samples with p<alpha')
fprintf('Family-wise error rate = %4.3f\n',sum(any(sig,1))/num_samp)
%%
% Although the FPR in each test is around 0.05, the FWE is much higher (i.e,
% it is likely that at least one test is significant in nearly every experiment). One solution
% is to scale our threshold by the number of tests - the so-called "Bonferroni" correction:
%%
alpha_cor = alpha/num_tests % Bonferroni
sig = (p < alpha_cor);
figure,bar(sum(sig,2)/num_samp), title('Proportion significant per Test')
axis([0 num_tests+1 0 0.1])
xlabel('Test number (eg voxel)'); ylabel('Proportion of samples Bonferroni corrected')
fprintf('Family-wise error rate = %4.3f\n',sum(any(sig,1))/num_samp)
%%
% _Interesting factoid: Carlo Bonferroni was a keen climber of glaciers,
% but we don't know whether he corrected his chance of death by the number of
% his climbs._
%
% This does control FWE at 5%. But in imaging, our tests are rarely independent.
% For example, ERP data are temporally autocorrelated across time, while MRI voxels
% are typically correlated across space. We can simulate this by smoothing our
% data (here with a simple, moving average of all values within a smooth_window
% for speed purposes):
%%
smooth_window = 20
smooth_y = nan(size(y));
for s = 1:num_samp
for pt = 1:samp_size % loop over participants ("p" is reserved for p-values above)
smooth_y(pt,:,s) = smooth(squeeze(y(pt,:,s)),smooth_window);
end
end
%%
% In this case, Bonferroni correction is too conservative:
%%
T = one_t(smooth_y);
p = T2p_higher(T,samp_size-1);
sig = (p < alpha_cor);
%figure,bar(sum(sig,2)/num_samp), title('Proportion significant for each test')
%axis([0 num_tests+1 0 0.1])
%xlabel('Test number (eg voxel)'); ylabel('Proportion of samples Bonferonni corrected')
fprintf('Family-wise error rate = %4.3f\n',sum(any(sig,1))/num_samp)
%% Permuting maximal statistics
% Fortunately, permutation tests come in useful here. It turns out that if we
% calculate the maximum T-value across tests, and its null distribution from permuting
% the data many times, then the (1-alpha) percentile of this distribution of "maxT"s
% provides a threshold that controls FWE at alpha, as demonstrated below:
%%
y1 = squeeze(smooth_y(:,:,1)); % Calculate max T thr from one sample (experiment) for speed
num_rand = 1e4;
maxT_null = nan(num_rand,1);
for r = 1:num_rand
rs = sign(rand(samp_size,1)-0.5);
ry = y1 .* repmat(rs,[1 num_tests]); % same sign-flips must be applied to all tests (voxels)
rT = one_t(ry);
maxT_null(r) = max(rT,[],2);
end
maxT_null = sort(maxT_null,'ascend');
maxT_thr = maxT_null(floor((1-alpha)*num_rand)+1)
[pdf,vals] = hist(maxT_null,100); pdf = pdf/num_rand; figure, bar(vals,pdf)
hold on, line([maxT_thr maxT_thr],[0 max(pdf)],'Color',[1 0 0])
title('Null distribution of maximal T-statistic (and 95centile)')
xlabel('max T'); ylabel('Proportion of permutations')
%%
% (This is the basis of the "permute_maxT_one_sample" function at end of
% this script; note that it is important that each permutation is used for all
% voxels). Given this threshold for T-values, we can confirm the FWE is around
% .05:
sig = (T > maxT_thr);
figure,bar(sum(sig,2)/num_samp), title('Proportion significant for each test')
axis([0 num_tests+1 0 0.1])
xlabel('Test number (eg voxel)'); ylabel('Proportion of samples with p<alpha')
fprintf('Family-wise error rate = %4.3f\n',sum(any(sig,1))/num_samp)
%%
% Note that one should calculate maxT_thr for each new dataset (experiment),
% but since our simulated experiments are identical, it is quicker to approximate
% this threshold for one dataset and apply to all experiments (cf. cluster size
% below).
%
% We can now repeat for the unsmoothed case, confirming we still maintain
% FWE:
%%
y1 = squeeze(y(:,:,1)); % Take first experiment to calculate Tthr
maxT_thr = permute_maxT_one_sample(y1)
T = one_t(y);
sig = (T > maxT_thr);
%figure,bar(sum(sig,2)/num_samp), title('Proportion significant for each test')
%axis([0 num_tests+1 0 0.1])
%xlabel('Test number (eg voxel)'); ylabel('Proportion of samples with p<alpha')
fprintf('Family-wise error rate = %4.3f\n',sum(any(sig,1))/num_samp)
%%
% Now let's add some signal (to the original unsmoothed data). If we consider
% the tests as voxels in a 1D image for the moment, let's assume a large but "localised"
% effect for voxel index 30, plus a smaller but "distributed" effect across 10
% voxels from 10:19 inclusive:
%%
y_signal = y;
y_signal(:,30,:) = y(:,30,:) + 2;
y_signal(:,10:19,:) = y(:,10:19,:) + 0.5;
y1 = squeeze(y_signal(:,:,1)); % Select example sample (first will do)
figure,bar(mean(y1)), title('Values for one sample'), xlabel('Voxel Number'), ylabel('Value')
%%
% Let's see what the positive rates look like over lots of experiments:
%%
maxT_thr = permute_maxT_one_sample(y1)
T = one_t(y_signal);
sig = (T > maxT_thr);
figure,bar(sum(sig,2)/num_samp), title('Proportion significant for each voxel')
axis([0 num_tests+1 0 1]) % Note increase in y-axis scale
xlabel('Voxel number'); ylabel('Proportion of samples with p<alpha')
fprintf('Family-wise positive rate = %4.3f\n',sum(any(sig,1))/num_samp)
%%
% We can detect the localised activation most of the time, but not the diffuse
% activation. One solution to the latter is to smooth the data, choosing a smoothing
% window that matches the extent of the diffuse cluster:
%%
smooth_window = 10
smooth_y_signal = nan(size(y_signal));
for s = 1:num_samp
for pt = 1:samp_size % loop over participants
smooth_y_signal(pt,:,s) = smooth(squeeze(y_signal(pt,:,s)),smooth_window);
end
end
%%
% Note that data may be intrinsically smooth, but we often want to explicitly
% smooth further. One reason for this is to help justify the parametric assumptions of the
% T-test, because smoothing effectively increases the amount of data being averaged,
% and so benefits from the CLT.
%
% So what happens when we apply the same maximum T threshold to the smoothed
% data:
%%
y1 = squeeze(smooth_y_signal(:,:,1));
maxT_thr = permute_maxT_one_sample(y1)
T = one_t(smooth_y_signal);
sig = (T > maxT_thr);
figure,bar(sum(sig,2)/num_samp), title('Proportion significant after smoothing')
axis([0 num_tests+1 0 1]) % Note increase in y-axis scale
xlabel('Voxel number'); ylabel('Proportion of samples with p<alpha')
%fprintf('Family-wise positive rate = %4.3f\n',sum(any(sig,1))/num_samp)
%%
% You will notice that, while the power to detect voxels in the diffuse
% cluster has increased, that for the localised voxel has decreased, and its signal
% has in fact been smoothed into nearby voxels.
%
% _Exercise: compare results with a smaller (eg 5) or bigger (eg 20) smoothing
% window (kernel) - do the results support the "matched filter theorem"?_
%% Cluster-level Correction
% But can we avoid "blurring" the data by smoothing, and still detect diffuse
% signals? Here permutation testing again helps us, because rather than permuting
% the maximal T value (which is suitable for detecting whether a single voxel
% is significant, sometimes called "peak" or "height" thresholding), we could
% permute the "cluster extent" - ie the number of contiguous voxels above some
% initial threshold.
%
% We first need to determine the cluster size after an initial thresholding
% of voxels, and label each voxel with the size of the cluster in which it sits.
% This is done by the function "label_with_cluster_size" at end below, but here
% is an example:
%%
y1 = squeeze(mean(y_signal,3)); % mean of many samples of unsmoothed
cluster_defining_Tval = tinv(1-alpha, samp_size-1) % initial threshold to define clusters
cs = label_with_cluster_size(one_t(y1), cluster_defining_Tval);
figure,bar(cs), title('Voxels labelled by cluster size'), xlabel('Voxel Number'), ylabel('Cluster size')
%%
% Note that the diffuse cluster is labelled with true size of 10 (because
% we have averaged out most of noise by averaging across many samples), while
% the local activation is labelled correctly as size 1 (plus there might be some
% false positive clusters). We can now estimate the distribution of maximal cluster
% size:
y1 = squeeze(y_signal(:,:,1));
cs_null = nan(num_rand,1);
for r = 1:num_rand
rs = sign(rand(samp_size,1)-0.5);
ry = y1 .* repmat(rs,[1 num_tests]); % must use same randomisation for all tests
rT = one_t(ry);
cs_null(r) = max(label_with_cluster_size(rT, cluster_defining_Tval));
end
cs_null = sort(cs_null,'ascend');
cs_thr = cs_null(floor((1-alpha)*num_rand)+1);
fprintf('Minimum extent for cluster-level FWE correction: %d\n', cs_thr)
%%
% This is called "cluster-level correction" (and is the basis of the "permute_cluster_metric_one_sample"
% function at end). Thus clusters of at least 2 voxels are unlikely by chance.
% Note that the initial threshold (here the T-value corresponding to p<.05 uncorrected)
% will affect extent threshold.
%
% With only 50 voxels, the range of possible cluster sizes is small, so we need to
% lower the cluster-defining threshold to obtain a sufficient range.
% Also, it is again more accurate to calculate the cs_thr for each sample (experiment),
% but for purposes of speed, we estimate it from the first sample only:
%%
%num_samp = 1e4; % reduce if need to speed up
cluster_defining_Tval = 0.5 % initial cluster-defining height threshold
cs = nan(num_samp,num_tests); cs_thr = [];
for s = 1:num_samp
ys = squeeze(y_signal(:,:,s));
if s==1 % Faster to estimate from one sample, since all samples equivalent, and CS_thr relatively stable
cs_thr = permute_cluster_metric_one_sample(ys,cluster_defining_Tval,alpha,num_rand,'size');
% else
% cs_thr(end+1) = permute_cluster_metric_one_sample(ys,cluster_defining_Tval,alpha,num_rand,'size');
end
T = one_t(ys);
cs(s,:) = label_with_cluster_size(T,cluster_defining_Tval);
cs(s,:) = (cs(s,:) > cs_thr(end));
end
figure,bar(sum(cs)/num_samp), title('Proportion significant using cluster-threshold')
xlabel('Voxel Number'), ylabel('Proportion of samples'), axis([0 num_tests+1 0 1])
fprintf('Family-wise positive rate using Cluster-threshold = %4.3f\n',sum(any(cs,2))/num_samp)
%%
% Notice that the diffuse cluster is now detected reasonably well, but the
% localised one is not (cf. the case for the corrected peak threshold earlier). We
% are basically trading off statistical sensitivity against localising power: with
% the cluster-level correction, we are more sensitive (to clusters), but are unable
% to localise within clusters (ie cannot claim a particular voxel within a cluster
% is significant).
%
% _Exercise: Here we have applied this to the unsmoothed data with signal, but
% you can see what happens for the smoothed data, smooth_y_signal, or the unsmoothed
% data with no signal, y (to check FWE), and what happens as you change the initial
% cluster-defining height threshold. You can also compare results when the threshold
% is calculated for each sample separately (by uncommenting the "else" part above),
% and examine the range of extent thresholds resulting (cs_thr), but it will take
% a long time..._
%
% Note that there are parametric versions that correct FWE for peak or cluster
% extent using "Random Field Theory" but they are beyond scope here, and make
% more assumptions, such as requiring data to have a minimal smoothness, and a
% sufficiently stringent initial threshold for defining cluster size.
%% Threshold-free Cluster Enhancement
% Finally, one could permute a property that includes a contribution from both
% the height and extent of each cluster, such as the cluster "mass" (e.g, the
% product of height and extent of each cluster in the current 1D example). Note
% also that the results from the above cluster-level correction still depend on
% the initial height threshold that defines the clusters. A popular method that
% combines these ideas is the "Threshold-free Cluster Enhancement" (TFCE), which
% integrates the cluster masses across a range of thresholds, and is implemented
% in the "label_with_cluster_TFCE" function at end (with also allows different
% weightings of height and extent).
%%
cm = nan(num_samp,num_tests); cm_thr = []; % cm = cluster mass
for s = 1:num_samp
ys = squeeze(y_signal(:,:,s));
T = one_t(ys);
thrs = prctile(T,[60:10:90]); % range of height thresholds to integrate over
if s==1 % Faster to estimate from one sample, though TFCE threshold quite variable
cm_thr = permute_cluster_metric_one_sample(ys,thrs,alpha,num_rand,'TFCE');
% else
% cm_thr(end+1) = permute_cluster_metric_one_sample(ys,thrs,alpha,num_rand,'TFCE');
end
cm(s,:) = label_with_cluster_TFCE(T,thrs);
cm(s,:) = (cm(s,:) > cm_thr(end));
end
figure,bar(sum(cm)/num_samp), title('Proportion significant using TFCE-threshold')
xlabel('Voxel Number'), ylabel('Proportion of samples'), axis([0 num_tests+1 0 1])
fprintf('Family-wise positive rate using TFCE-threshold = %4.3f\n',sum(any(cm,2))/num_samp)
%%
% Now we are detecting both the diffuse activation and the localised activation
% with reasonable power. Note that, unlike the cs_thr above, the TFCE threshold
% varies quite a lot with the data.
%% False Discovery Rate
% Another common correction for multiple comparisons is not to control the FWE,
% but rather something called the False Discovery Rate (FDR). In the "Benjamini-Hochberg"
% version of FDR, the p-values are sorted from smallest to largest, and the k-th
% smallest is checked against a threshold of k*alpha/m (for m tests); all p-values
% up to the largest that survives its threshold are then declared significant ("discoveries").
%%
%y1 = squeeze(smooth_y_signal(:,:,1)); % Select one sample for illustration
%y1 = squeeze(smooth_y(:,:,1)); % No true positives
%y1 = squeeze(smooth_y(:,:,1))+10; % All true positives
y1 = squeeze(y_signal(:,:,1));
T = one_t(y1);
p = T2p_higher(one_t(y1),samp_size-1);
thr = alpha*[1:num_tests]/num_tests; % Benjamini-Hochberg thresholds: k*alpha/m for k-th smallest p
ps = sort(p,'ascend') + eps; % eps = tiny float, to avoid -Inf in log plot below
st = find(ps < thr);
figure,plot(log10(ps),'o'); hold on; ylabel('Log10(p)'), xlabel('Sorted Voxels')
plot([1:num_tests], log10(thr),'r-')
FDRthr = thr(max(st));
if isempty(FDRthr), FDRthr = thr(1); end
line([1 num_tests], [1 1]*log10(FDRthr), 'Color','k')
axis([1 num_tests log10(min([ps thr]))-0.1 0])
legend({'p-values', 'thresholds','FDR threshold'},'Location','SouthEast')
fprintf('Number of discoveries = %d\n',length(st))
%%
% Note that FDR defaults to Bonferroni if there are no true positives, and
% defaults to the uncorrected alpha if they are all true positives.
%
% FDR is compared with FWE in the example below, where signal is put in first
% 40 voxels but not the remaining 10:
%%
y_signal = y;
y_signal(:,[1:40],:) = y_signal(:,[1:40],:) + 1; % Add signal to first 40 voxels
% FWE
y1 = squeeze(y_signal(:,:,1)); % Select one sample for max T threshold estimation
maxT_thr = permute_maxT_one_sample(y1);
T = one_t(y_signal);
sig = (T > maxT_thr)';
figure,bar(sum(sig,1)/num_samp), title('Height FWE')
axis([0 num_tests+1 0 1])
xlabel('Voxel number'); ylabel('Positive rate per voxel')
fprintf('Family-wise error rate in non-active voxels under FWE = %4.3f\n',sum(any(sig(:,41:end),2))/num_samp)
% FDR
p = T2p_higher(T,samp_size-1);
sig = zeros(num_samp,num_tests);
for s = 1:num_samp
[ps,ind] = sort(p(:,s),'ascend');
st = find(ps' < thr);
sig(s,ind(1:max(st))) = 1;
end
figure,bar(sum(sig,1)/num_samp), title('Height FDR')
axis([0 num_tests+1 0 1])
xlabel('Voxel number'); ylabel('Discovery rate per voxel')
fprintf('Family-wise error rate in non-active voxels under FDR = %4.3f\n',sum(any(sig(:,41:end),2))/num_samp)
%%
% Note FDR leads to more true positives for the first 40 voxels. However,
% this comes at the cost of increased false positives in non-active voxels (the
% remaining 10 voxels). This is because FDR controls the number of false positives
% as a proportion (eg 5%) of voxels declared significant, rather than as a proportion
% of tests performed (like FWE). So the more active voxels there are in one part
% of the image, the more false positives are likely in other parts of the image
% where there is no signal. Thus it is really a qualitatively different type of
% inference, and some would argue that maintaining the traditional FWE is more
% important.
%
% Finally, note that all the methods above are designed for (safe) inferences
% on "mass univariate" tests. This is appropriate if the aim is localisation in
% space or time, but if the aim is to detect signal anywhere, there are much more
% sensitive "multivariate" tests like PLS/CCA that leverage the covariance between
% tests.
%% Two-sample Tests
% We now turn to two-sample tests. Here H0 is that the means of the populations
% that generated the two samples are equal. We start with the easy case of a "paired" test, where
% two "conditions" are measured on the same people, i.e., each person contributes
% two values (sometimes called a "repeated measure"). In this case, we can simply
% calculate the difference between these two values for each person, and run a
% one-sample test on these "difference scores". In the context of a permutation
% test, we would be randomly swapping the two values for each person, hence the
% basis for randomising the sign of the (difference) values in the one-sample
% test examples above.
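%
% A minimal sketch of a paired test via difference scores (hypothetical data,
% reusing the randomise_one_sample function from the end of this script):
%%
cond1 = sample_pop(pop,20,1); % condition 1, one value per person
cond2 = sample_pop(pop,20,1) + 0.5; % condition 2 (shifted, so H1 true)
p_paired = randomise_one_sample(cond2 - cond1); % one-sample test on difference scores
fprintf('Paired permutation p-value = %4.3f\n',p_paired)
%%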
%
% More interesting is the "unpaired" (or "independent samples") case, when
% the samples (groups) are drawn from two populations of different people, and
% we have one value per person. Under H0 that these populations are identical,
% we could randomly assign people to either group. This is called "exchangeability";
% in the unpaired case, all values are exchangeable, whereas in the paired case,
% only values within the same person are exchangeable (exchangeability is normally
% a condition of randomisation tests). If there are N1 people in sample of first
% population, and N2 in second sample, then for a permutation test, there are
% (N1+N2)!/(N1!N2!) permutations, which quickly gets very large as N1 and/or N2
% increase. Nonetheless, it is normally sufficient to sample ~1e4 of them to estimate
% a p-value, as done in the "randomise_two_samples" function at end.
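%
% For example, nchoosek computes (N1+N2)!/(N1!N2!) directly, showing how quickly
% the number of possible group assignments grows:
%%
fprintf('N1=N2=10: %d assignments; N1=N2=20: %g assignments\n',nchoosek(20,10),nchoosek(40,20))
%%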
%
% Alternatively, we can also use a parametric approach. There are several
% parametric versions of the two-sample T-test, but we can start
% with a standard one that assumes equal variance in both groups:
%%
pool_var = @(y) ((size(y{1},1)-1)*var(y{1}) + (size(y{2},1)-1)*var(y{2})) / (size(y{1},1) + size(y{2},1) - 2)
two_t = @(y) (mean(y{2}) - mean(y{1})) ./ sqrt(pool_var(y) * (1/size(y{1},1) + 1/size(y{2},1)))
%%
% The first "pool_var" function above pools the variance across both groups,
% weighted by their respective sizes. This pooled estimate of variance is then
% used in the second function "two_t", in which the numerator is the difference
% of group means, while the denominator is equivalent to the pooled standard error.
% The p-value can then be calculated from the same T-distribution with dfs equal
% to the total number across both groups minus 2 (the 2 dfs used for estimating
% the mean within each group).
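%
% For example, a single hypothetical comparison of two groups of 20 (with group
% 2 drawn from a population shifted by 0.5):
%%
y_demo = {sample_pop(pop,20,1), sample_pop(pop+0.5,20,1)};
df_demo = 20 + 20 - 2; % total N across both groups minus 2
T_demo = two_t(y_demo);
fprintf('Two-sample T(%d) = %3.2f, p = %4.3f\n',df_demo,T_demo,T2p_2tails(T_demo,df_demo))
%%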
%
% Assuming equal group sizes, we can plot power curves for two-tailed, unpaired
% T-tests as a function of group sizes:
%%
%num_samp = 1e3; % Reduce if doing randomisation below and want to save time
alpha_cor = alpha/2; % Two-tailed
samp_sizes = [10:20:210]; % Number of participants per group
effect_sizes = [0 0.2 0.5 0.8]';
figure; hold on; title(sprintf('Power curves for two-tailed, two-sample unpaired T-test'))
pr = nan(length(samp_sizes),length(effect_sizes));
y = cell(1,2);
for es = 1:length(effect_sizes)
pop_H1 = pop + effect_sizes(es);
for s = 1:length(samp_sizes)
df = 2*samp_sizes(s)-2;
y{1} = sample_pop(pop, samp_sizes(s), num_samp);
y{2} = sample_pop(pop_H1, samp_sizes(s), num_samp);
p = T2p_2tails(two_t(y),df);
%p = randomise_two_samples(y, num_rand, 2);
pr(s,es) = squeeze(sum(p < alpha_cor)/num_samp);
end
end
plot(samp_sizes,pr,'o-');
line([0 max(samp_sizes)+10],[alpha alpha],'LineStyle',':');
line([0 max(samp_sizes)+10],[.8 .8],'LineStyle','--');
txt = cellstr(num2str(effect_sizes)); txt{end+1} = 'alpha'; txt{end+1} = '1-beta';
legend(txt,'Location','SouthEast')
axis([min(samp_sizes)-10 max(samp_sizes)+10 0 1])
xlabel('Sample Size in each group'); ylabel('Power')
%%
% Note that groups of ~70 are needed to get 80% power with a medium effect
% size, which means testing 140 people in total. This is quite a bit more than
% the ~30 needed for the one-sample test above (which is equivalent to a paired,
% two-sample T-test, demonstrating the increased power of repeated-measures designs).
% Note also that many imaging studies use relatively small samples of ~20 when
% comparing groups, which would entail power of only ~30%...
%
% _Exercise: Calculate power curves after controlling FWE across a large
% number of tests. Imagine for example you have an fMRI study with 1e6 voxels
% (and no a priori idea where you might find an effect): even allowing for an
% intrinsic smoothness in the data (such that you might only have 1e3 "resels"
% - "resolvable elements" - for Bonferonni correction), what are chances of traditional
% fMRI group sizes finding a true effect somewhere in the brain while controlling
% FWE?_
%% Unequal group sizes and pooled variance estimates
% We can also plot what happens when the total number of participants is fixed,
% but the ratio of group sizes varies:
%%
%num_samp = 1e3; % Reduce if doing randomisation below and want to save time
totsamp_size = 50;
df = totsamp_size - 2;
group_ratio = [0.1:0.1:0.5]; % max 0.5
figure; hold on; title(sprintf('Power as function of group ratio for two-tailed unpaired T-test'))
pr = nan(length(group_ratio),length(effect_sizes));
y = cell(1,2);
for es = 1:length(effect_sizes);
pop_H1 = pop + effect_sizes(es);
for s = 1:length(group_ratio)
samp_size(1) = round(group_ratio(s)*totsamp_size);
samp_size(2) = totsamp_size - samp_size(1);
y{1} = sample_pop(pop, samp_size(1), num_samp);
y{2} = sample_pop(pop_H1, samp_size(2), num_samp);
p = T2p_2tails(two_t(y),df);
%p = randomise_two_samples(y, num_rand, 2);
pr(s,es) = squeeze(sum(p < alpha_cor)/num_samp);
end
end
plot(group_ratio,pr,'o-');
line(group_ratio([1 end]),[alpha alpha],'LineStyle',':');
line(group_ratio([1 end]),[.8 .8],'LineStyle','--');
txt = cellstr(num2str(effect_sizes)); txt{end+1} = 'alpha'; txt{end+1} = '1-beta';
legend(txt,'Location','NorthWest')
axis([0.05 0.55 0 1])
xlabel('Proportion of total sample in Group 1'); ylabel('Power')
%%
% Note that power decreases as the groups become more unbalanced (though
% FPR is still controlled).
%% Inhomogeneity of variance: correcting dfs (Welch)
% However, the variance pooling used by the standard T-test fails when the variance
% of one group differs from the other. Such "inhomogeneity" of variance is a special
% case of the more general problem of "nonsphericity" discussed in the GLM section
% below.
%
% Indeed, the combination of unequal variance and unequal group sizes is
% particularly serious for the standard T-test, and can lead to an inflated FPR,
% as we will see below. One solution is called Welch's T-test, which does not
% pool the variance, and reduces the dfs to a fraction of the original dfs (an example
% of the Satterthwaite approximation, to which we will return in the GLM section
% below). This correction is illustrated in the simulations below:
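%
% (For reference, a minimal sketch of the standard Welch formulas - unpooled
% variances plus Satterthwaite dfs - which the two_sample_T helper used below
% should implement equivalently:)
%%
welch_se2 = @(y) var(y{1})/size(y{1},1) + var(y{2})/size(y{2},1); % unpooled squared standard error
welch_t = @(y) (mean(y{2}) - mean(y{1})) ./ sqrt(welch_se2(y));
welch_df = @(y) welch_se2(y).^2 ./ ((var(y{1})/size(y{1},1)).^2/(size(y{1},1)-1) + ...
(var(y{2})/size(y{2},1)).^2/(size(y{2},1)-1)); % Welch-Satterthwaite df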
%%
rng(1) % reset seed to get same results
samp_sizes = [5 45]; % At least one small group
diff_means = [0 0.8]; % Difference in means
var_ratios = [0.5 1 2]; % Ratio of variances in Group 2 vs Group 1
%num_samp = 1e3; % reduce if need speed up permutation test below
y = cell(1,2);
pr = nan(length(diff_means),length(var_ratios),3); % 3rd dimension indexes the method
for es = 1:length(diff_means)
fprintf('Difference in mean = %2.1f\n', diff_means(es))
for vr = 1:length(var_ratios)
fprintf('\t\tVariance ratio = %2.1f\n', var_ratios(vr))
pop_H1 = pop * var_ratios(vr) + diff_means(es);
y{1} = sample_pop(pop, samp_sizes(1), num_samp);
y{2} = sample_pop(pop_H1, samp_sizes(2), num_samp);
[T,df] = two_sample_T(y,'pooled'); p = T2p_2tails(T,df);
pr(es,vr,1) = sum(p<alpha_cor)/num_samp;
fprintf('\t\t\t\tPooled T (df=%d),\tPR=%4.3f\n',mean(df),pr(es,vr,1))
[T,df] = two_sample_T(y,'welch'); p = T2p_2tails(T,df);
pr(es,vr,2) = sum(p<alpha_cor)/num_samp;
fprintf('\t\t\t\tWelchs T (df=%3.2f),\tPR=%4.3f\n',mean(df),pr(es,vr,2));
%p = randomise_two_samples(y,num_rand,2);
%fprintf('\t\t\t\tPermuted \tPR=%4.3f\n',sum(p<alpha_cor)/num_samp)
end
end
figure; hold on; title(sprintf('Power as function of variance ratio and method'))
txt = {}; mkr{1,1}='ro:'; mkr{1,2}='bo:'; mkr{2,1}='ro--'; mkr{2,2}='bo--';
for q = 1:length(diff_means)
plot(var_ratios,squeeze(pr(q,:,1)),mkr{q,1});
txt{end+1} = sprintf('Pooled T: Effect %2.1f',diff_means(q));