# Elements of Bayesian statistics {#ebs}
```{r setup, message = FALSE, warning = FALSE, include = FALSE}
source("RMDR.R")
CE = c("Rainfall", "Sec99", "IPump", "Rational")
load_CE_(CE)
```
As human beings we base our decisions on what has happened to us in the past. For example, we trust a person or a company more when we can look back at a series of successful transactions. And we have a remarkable capability to recall what has just happened, but also what happened yesterday or years ago. By integrating over all the evidence, we form a view of the world we forage in. When evidence is abundant, we experience a strong feeling of certainty, or lack of doubt. That is not to deny that in a variety of situations the boundedness of the human mind kicks in and we become terrible decision makers. This has a variety of psychological reasons, to name just a few:
- plain forgetting
- the recency effect: recent events get more weight
- confirmation bias: evidence that supports a belief is actively sought, whereas counter-evidence gets ignored.
- the hindsight bias: once a situation has taken a certain outcome, we believe that it had to happen that way.
The very aim of scientific research is to avoid the pitfalls of our minds and act as rationally as possible by translating our theory into a formal model of reality, gathering evidence in an unbiased way and weighing the evidence by formal procedures. This weighing of evidence using data essentially is *statistical modeling*, and the statistical models in this book all produce two sorts of numbers: *magnitude of effects* and *level of certainty*. <!-- When a statistic is reported together with the strength of evidence, this is conventionally called an *inferential statistic*. --> In applied research, real-world decisions depend on the evidence, which has these two aspects: the strength of effects and the level of certainty we have reached.
Bayesian inferential statistics is grounded in the idea of accumulating evidence, where past data is not lost. *Certainty* (or *strength of belief* or *credibility* or *credence*) in Bayesian statistics is formalized on a *probability scale (0 = impossible, 1 = certain)*. The level of certainty is determined by two sources: everything that we already know about the subject and the data we just gathered. Both sources are only seemingly different, because when new data is analyzed, a transition occurs from what you knew before, the *prior belief*, to what you know after seeing the data, the *posterior belief*. In other words: the data gives the current belief an update.
Updating our beliefs is essential for acting in rational ways. The first section of this chapter is intended to tell the Big Picture. It puts statistics into the context of decision-making in design research. For those readers with a background in statistics, this section may be a sufficient introduction all by itself.
In the remainder of this chapter the essential concepts of statistics and Bayesian analysis will be introduced from the ground up. First we will look at descriptive statistics and I will introduce basic statistical tools, such as summary statistics and figures. Descriptive statistics can be used effectively to explore data and prepare the statistical modeling, but they lack one important ingredient: information about the level of certainty.
Section \@ref(bayes-prob-theory) first derives probability from set theory and relative frequencies, before we turn to the famous Bayes' theorem \@ref(bayes-theorem), followed by an introduction to Bayesian thinking \@ref(dynamics-belief).
Then we go on to basic concepts of statistical modeling, such as the likelihood, and we finish with Bayes' theorem, which does the calculation of the posterior certainty from prior knowledge and data. Despite the subject matter, I will make minimal use of mathematical formalism. Instead, I use R code as much as possible to illustrate the concepts. If you are not yet familiar with R, you may want to read chapter \@ref(r-primer) first, or alongside.
Section \@ref(statmod) goes into the practical details of modeling. A statistical model is introduced by its two components: the structural part, which typically carries the research question or theory, followed by a rather deep account of the second component of statistical models: the random part.
## Rational decision making in design research {#decision-making}
- I see clouds. Should I take my umbrella?
- Should I do this bungee jump? How many people came to death by jumping? (More or less than alpine skiing?) And how much fun is it really?
- Overhauling our company website will cost us EUR 100.000. Is it worth it?
All the above cases are examples of decision making under uncertainty. The actors aim for maximizing their outcome, be it well-being, fun or money. But they are uncertain about what will really happen. And their uncertainty occurs on two levels:
1. One cannot precisely foresee the exact outcome of one's chosen action:
- Taking the umbrella with you can have two consequences: if it rains, you have the benefit of staying dry. If it does not rain, you have the inconvenience of carrying it with you.
- You don't know if you will be the rare unlucky one whose bungee rope breaks.
- You don't know by how much the new design will attract more visitors and by how much the income will rise.
2. It can be difficult to precisely determine the benefits or losses of potential outcomes:
- How much worse is your day when carrying a useless object with you? How much do you hate moisture? In order to compare the two, they must be assessed on the same scale.
- How much fun (or other sources of reward, like social acknowledgment) is it to jump a 120 meter canyon? And how much is your own life worth to you?
- What is the average revenue generated per visit? What is an increase of recurrence rate of, say, 50% worth?
Once you know the probabilities of all outcomes and the respective losses, *decision theory* provides an intuitive framework to evaluate the options. The *expected utility* $U$ of an action is the sum over all outcomes of the *outcome probability* $P$ multiplied by the involved *loss* $L$.
In the case of the umbrella, the decision is between two options: taking an umbrella versus taking no umbrella, when it is cloudy.
$$
\begin{aligned}
&P(\text{rain}) &&&=& 0.6 \\
&P(\text{no rain}) &=& 1 - P(\text{rain}) &=& 0.4 \\
&L(\text{carry}) &&&=& 2 \\
&L(\text{wet}) &&&=& 4 \\
&U(\text{umbrella}) &=& P(\text{rain}) L(\text{carry}) + P(\text{no rain}) L(\text{carry}) = L(\text{carry}) &=& 2\\
&U(\text{no umbrella}) &=& P(\text{rain}) L(\text{wet}) &=& 2.4
\end{aligned}
$$
Tables \@ref(tab:umbrella-2) through \@ref(tab:umbrella-4) develop the calculation of expected utilities `U` in a step-by-step manner:
```{r umbrella-1}
attach(Rainfall)
```
```{r umbrella-2}
Outcomes <-
tibble(outcome = c("rain", "no rain"),
prob = c(0.6, 0.4))
Outcomes %>% kable(caption = "Probabilities of outcomes")
```
```{r umbrella-3}
Actions <-
tibble(action = c("umbrella", "no umbrella"))
Losses <-
expand.grid(action = Actions$action,
outcome = Outcomes$outcome) %>%
join(Outcomes) %>%
mutate(loss = c(2, 4, 2, 0))
Losses %>% kable(caption = "Losses conditional on outcome and action")
```
```{r umbrella-4}
Utility <-
Losses %>%
mutate(conditional_loss = prob * loss) %>%
group_by(action) %>%
summarise(expected_loss = sum(conditional_loss))
Utility %>% kable(caption = "Expected losses")
```
We conclude that, given the high chance of rain and the conditional losses, the expected loss is larger when not taking an umbrella with you. It is rational to take an umbrella when it is cloudy.
### Measuring uncertainty {#measuring-uncertainty}
As we have seen above, a decision requires two investigations: the outcomes and their probabilities, and the assigned losses. Assigning losses to decisions is highly context-dependent and often requires domain-specific expertise. Dealing with probabilistic processes and the uncertainty that arises from them is basically what the idea of *New Statistics* represents. We encounter uncertainty in two forms: First, we usually have just a limited set of observations to draw inference from; this is *uncertainty of parameter estimates*. From just 20 days of observation, we cannot be absolutely certain about the true chance of rain. It may be 60%, but it could also be 62% or 56%. Second, even if we precisely knew the chance of rain, this would not allow a certain statement about future weather conditions; this is *predictive uncertainty*. For a perfect forecast, we would have to have a complete and exact figure of the physical properties of the atmosphere, and a fully valid model to predict future states from it. For all non-trivial systems (which includes living organisms and the weather), this is impossible.
Review the rainfall example: the strategy of taking an umbrella with you has proven to be superior under the very assumption of *predictive uncertainty*. As long as you are interested in long-term benefit (i.e. optimizing the average loss over a long series of days), this is the best strategy. This may sound obvious, but it is not. In many cases where we make decisions under uncertainty, the decision is not part of a homogeneous series. If you are a member of a startup team, you only have this one chance to make a fortune. There is not much opportunity to average out a single failure at future occasions. In contrast, the investor who lends you the money for your endeavor probably has a series. You and the investor are playing by very different rules. For the investor it is rational to optimize the strategy towards a minimum average loss. The entrepreneur is best advised to keep the maximum possible loss at a minimum.
As we have seen, predictive uncertainty is already embedded in the framework of rational decision making. Some concepts in statistics can be of help here: the uncertainty regarding future events can be quantified (for example, with posterior predictive distributions \@ref(posterior-dist)), and the process of model selection can assist in finding the model that provides the best predictive power.
Still, in our formalization of the Rainfall case, the estimates for the chance of rain appear magically. Having these estimates is crucial for finding an optimal decision, but they are created outside of the framework. Furthermore, we pretended to know the chance of rain exactly, which is unrealistic. Estimating parameters from observations is the realm of statistics. Statistical reasoning differs from naive calculation by also regarding the uncertainty of estimates. Generally, we aim for making statements of the following form:
"With probability $p$, the attribute $A$ is of magnitude $X$."
In the umbrella example above, the magnitude of interest is the chance of rain. It was assumed to be 60%. This appears extremely high for an average day. A more realistic assumption would be that the probability of rainfall is 60% *given the observation of a cloudy sky*. How could we have come to the belief that, with 60% chance, it will rain when the sky is cloudy? We have several options here:
1. Suppose you know that, on average, it rains on 60% of all days. It is then a matter of common sense that the probability of rain, when it's cloudy, must be equal to or larger than that.
2. You could go and ask a number of experts about the association of clouds and rain.
3. You could do some systematic observations yourself.
Imagine you have recorded the co-occurrence of clouds and rainfall over a period of, let's say, 20 days (Table \@ref(tab:umbrella-5)):
```{r umbrella-5}
Rain
```
Intuitively, you would use the average to estimate the probability of rain under every condition (Table \@ref(tab:umbrella-6)).
```{r umbrella-6}
Rain %>%
group_by(cloudy) %>%
summarize(chance_of_rain = mean(rain)) %>%
kable(caption = "Chance of rain depending on cloudiness of the sky")
```
```{r opts.label = "inv"}
T_desc <-
Rain %>%
group_by(cloudy) %>%
summarize(chance_of_rain = mean(rain))
C_desc <- as.matrix(ungroup(T_desc))
```
```{r}
detach(Rainfall)
```
These probabilities we can feed into the decision framework as outlined above. The problem is that we obtained just a few observations to infer the magnitude of the parameter $P(rain|cloudy) = `r C_desc[2,2] * 100`$%. Imagine you would repeat the observation series on another 20 days. Due to random fluctuations, you would get a more or less different series and different estimates for the probability of rain. More generally, the *true* parameter is only imperfectly represented by any sample; it is not unlikely that it is close to the estimate, but it could also be somewhere else, for example $P(rain|cloudy) = `r jitter(as.numeric(C_desc[2,2]), 5) * 100`$%.
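To get a feeling for how much such estimates can fluctuate, one can simulate repeated observation series. The following sketch is purely illustrative (it is not part of the Rainfall data) and assumes a true chance of rain of 60% on cloudy days:

```{r}
set.seed(42)
# ten simulated series of 20 cloudy days each, true chance of rain = 60%
tibble(series = 1:10,
       n_rain = rbinom(10, size = 20, prob = 0.6)) %>%
  mutate(est_chance_of_rain = n_rain / 20)
```

Even though every series comes from the very same process, the estimated chance of rain varies noticeably from series to series.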
The trust you put in your estimate is called *level of certainty* or *belief* or *confidence*. It is the primary aim of statistics to rationally deal with uncertainty, which involves *measuring the level of certainty* associated with any statement derived from the data. So, what would be a good way to determine certainty? Think for a moment: if you were asking an expert, how would you do that to learn about magnitude and uncertainty regarding $P(rain|cloudy)$?
Maybe, the conversation would be as follows:
> YOU: What is the chance of rain when it's cloudy?
> EXPERT: Wow, difficult question. I don't have a definite answer.
> YOU: Oh, c'mon. I just need a rough answer. Is it more like 50%-ish, or rather 70%-ish?
> EXPERT: Hmm, maybe somewhere between 50 and 70%.
> YOU: Then, I guess, taking an umbrella with me is the rational choice of action.
Note how the expert gave two endpoints for the parameter in question, to indicate the location and the level of uncertainty. If she had been more certain, she would have said "between 55 and 65%". While this is better than nothing, it remains unclear which level of uncertainty is enclosed. Is the expert 100%, 90% or just 50% sure that the true chance is in the interval? Next time, you could ask as follows:
> ...
> EXPERT: Hmm, maybe somewhere between 70-90%
> YOU: What do you bet? I'm betting 5 EUR that the true parameter is outside the range you just gave.
> EXPERT: I dare you! 95 EUR it's inside!
The expert feels 95% certain that the parameter in question is in the interval. However, for many questions of interest, we have no expert at hand (or we may not trust them altogether). Then we proceed with option 3: making our own observations.
### Benchmarking designs {#benchmarking-designs}
The most basic decision in practical design research is whether a design fulfills an external criterion. External criteria for human performance in a human-machine system are most common, albeit not abundant, in safety-critical domains.
Consider Jane: she is a user experience researcher at the mega-large rent-a-car company smartr.car. Jane was responsible for an overhaul of the customer interface for mobile users. The goal of the redesign was to streamline the user interface, which had grown wild over the years. Early customer studies indicated that the app needed a serious visual de-cluttering and stronger funneling of tasks. 300 person-months went into the re-development and the team did well: a recent A/B study had shown that users learned smartr.car v2.0 fast and could use its functionality very efficiently. Jane's team is prepared for the roll-out, when Marketing comes up with the following request:
> Marketing: We want to support market introduction with the following slogan: "rent a car in 99 seconds".
> Jane: Not all users manage a full transaction in that short time. That could be a lie.
> Marketing: Legally, the claim is fine if it holds on average.
> Jane: That I can find out for you.
Jane takes another look at the performance of users in the smartr.car v2.0 condition. As she understands it, she has to find out whether the average of all recorded times-on-task with smartr.car 2.0 is 99 seconds, or better. Here is what the data looks like:
```{r}
attach(Sec99)
```
```{r sec99-1, opts.label = "fig.small", fig.cap = "Distribution of ToT"}
Ver20 %>%
ggplot(aes(x = ToT)) +
geom_histogram()
```
The performance is not completely off the 99 seconds; many users are even faster. Jane figures out that she first has to ask a more precise question, as the slogan can mean different things, like:
- all users can do it within 99 seconds
- at least one user can do it
- half of the users can do it
Jane decides to go the middle way and chooses the population average; hence the average ToT must not be more than 99 seconds. Unfortunately, she has only tested a small minority of users and therefore cannot be certain about the true average:
```{r}
mean(Ver20$ToT)
```
Because the sample average is uncertain, Jane is afraid Marketing could use this as an argument to ignore the data and go with the claim. Jane sees no better way than quantifying the chance of being wrong using a statistical model, which we will get to know as the *Grand Mean Model* (\@ref(gmm)). The following code estimates the model and shows the estimated coefficient (Table \@ref(tab:sec99-3)).
```{r mc:sec99-2, opts.label = "mcmc"}
M_1 <-
Ver20 %>%
stan_glm(ToT ~ 1, data = .)
P_1 <-
posterior(M_1)
```
```{r opts.label = "mcsync"}
sync_CE(Sec99, M_1, P_1)
```
```{r sec99-3}
coef(P_1)
```
```{r opts.label = "inv"}
T_fixef <- coef(P_1)
```
Let's see what the GMM reports about the population average and its uncertainty: The table above is called a *CLU table*, because it reports three estimates per coefficient:
- Center, which (approximately) is the most likely position of the true value
- Lower, which is the lower 95% credibility limit. There is a 2.5% chance that the true value is lower
- Upper, the 95% upper limit. The true value is larger than this with a chance of 2.5%.
This tells Jane that most likely the average time-on-task is $`r T_fixef[[1,2]]`$. That is not very promising, and it gets worse: 99 is even below the lower 95% credibility limit. So, Jane can send a strong message: the probability that this claim is justified is smaller than 2.5%.
Luckily, Jane has the idea that the slogan could be changed to *"rent a car in 1-1-1 seconds"*. The 95% credibility limits are in her favor, since 111 is near the upper credibility limit, so she could already say that the probability of erring is around 2.5%. But Jane desires to make an accurate statement: what precisely is the chance that the true population average is 111 or lower? In Bayesian analysis there is a solution to that. When estimating such a model, we get the complete distribution of certainty, called the posterior distribution. In fact, a CLU table with 95% credibility limits is just a summary of the posterior distribution. This distribution is not given as a function, but has been generated by a (finite) random-walk algorithm, known as Markov Chain Monte Carlo (MCMC). At every step (or most, to be precise), this algorithm jumps to another set of coordinates in parameter space, and a frequency distribution arises that can be used to approximate levels of certainty. Figure \@ref(fig:sec99-4) shows the result of the MCMC estimation as a frequency distribution (based on 4000 iterations in total), divided into the two possible outcomes.
```{r sec99-4, opts.label = "fig.small", fig.cap = "A histogram of MCMC results split by the 111-seconds criterion"}
P_1 %>%
filter(parameter == "Intercept") %>%
mutate(outcome = ifelse(value <= 111,
"111 seconds or shorter",
"longer than 111 seconds")) %>%
ggplot(aes(x = value,
fill = outcome)) +
geom_histogram(binwidth = 2)
```
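Because a CLU table is just a summary of these MCMC samples, (approximately) the same three numbers can be recovered by hand from the posterior object. This is a minimal sketch, assuming (as throughout this chapter) that `P_1` is a data frame with the columns `parameter` and `value`; whether the center is taken as the mean or the median is a choice of the summary function:

```{r}
P_1 %>%
  filter(parameter == "Intercept") %>%
  summarize(center = median(value),          # most likely position
            lower  = quantile(value, 0.025), # lower 95% credibility limit
            upper  = quantile(value, 0.975)) # upper 95% credibility limit
```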
<!-- In the present case, we can derive the chance that the true average of the customer population is lower than the target of 111: -->
```{r opts.label = "deprecated"}
N_99s <-
P_1 %>%
filter(parameter == "Intercept") %>%
mutate(confirm = value <= 99) %>%
summarise(certainty = mean(confirm)) %>%
mutate(odds = (1-certainty)/certainty) %>%
as.numeric()
```
In a similar manner to how the graph above was produced, a precise certainty level can be estimated from the MCMC frequency distribution contained in the posterior object. Table \@ref(tab:sec99-5) shows that the 111-seconds claim holds with much better certainty.
```{r sec99-5}
P_1 %>%
filter(parameter == "Intercept") %>%
summarise(certainty = mean(value <= 111)) %>%
kable(caption = "Level of certainty that the population average is 111 second or better.")
```
```{r}
detach(Sec99)
```
The story of Jane is about decision making under risk and uncertainty. We have seen how easily precise statements on uncertainty can be derived from a statistical model. But regarding rational decision making, this is not an ideal story: what is missing is a systematic analysis of losses (and wins). The benefit of going with the slogan has never been quantified. How many new customers it will really attract and how much they will spend cannot be known upfront. Predicting the chance of losing in court, let alone what that would cost, is almost impossible. The question must be allowed: what good is the formula for utility when it is practically impossible to determine the losses? And if we cannot estimate utilities, what are the certainties good for?
Sometimes, one possible outcome is just so bad that the only thing that practically matters is to avoid it at all costs. Losing a legal battle often falls into this category and the strategy of Marketing/Jane effectively reduced this risk: they dismissed a risky action, the 99-seconds statement, and replaced it with a slogan that they can prove is true with good certainty.
In general, we can be sure that there is at least some implicit calculation of utilities going on in the minds of Marketing. Perhaps that is a truly intuitive process, which is felt as an emotional struggle between the fear of telling the untruth and love for the slogan. This utility analysis probably is inaccurate, but that does not mean it is completely misleading. A rough guess always beats complete ignorance, especially when you know about the attached uncertainty. Decision-makers tend to be prejudiced, but even then probabilities can help find out to what extent this is the case: just tell the probabilities and see who listens.
### Comparison of designs {#comparing-designs}
The design of systems can be conceived as a choice between design options. For example, when designing an informational website, you have the choice of making the navigation structure flat or deep. Your choice will change the usability of the website, hence your customer satisfaction, rate of recurrence, revenue etc. Much practical design research aims at making good choices and from a statistical perspective that means comparing the outcomes of two (or more) design options. A typical situation is to compare a redesign with its predecessor, which will now be illustrated by a hypothetical case:
Violet is a manager of an e-commerce website and at present, a major overhaul of the website is under debate. The management team agrees that this overhaul is about time and will most likely increase the revenue per existing customer. Still, there are considerable development costs involved and the question arises whether the update will pay for itself in a reasonable time frame. To answer this question, it is not enough to know that revenues increase; a more accurate prediction of *how much precisely* is gained is needed. In order to return the investment, the increase must be in the ballpark of a 10% increase in revenue. For this purpose, Violet carries out a user study to compare the two designs. Essentially, she observes the transactions of 50 random customers using the current system and compares them with 50 transactions with a prototype of the new design. The measured variable is the money every user spends during the visit. The research question is: *By how much do revenues increase in the prototype condition?* Figure \@ref(fig:rational-1) shows the distribution of measured revenue in the experiment.
```{r}
attach(Rational)
```
```{r rational-1, opts.label = "fig.small", fig.cap = "Density plot comparing the revenue of two designs"}
RD %>%
ggplot(aes(x = Euro, fill = Design)) +
geom_density(alpha = .5)
```
There seems to be a slight benefit for the prototype condition. But is it a 10% increase? The following calculation shows Violet that it could be the case (Table \@ref(tab:rational-2)).
```{r rational-2}
RD %>%
group_by(Design) %>%
summarize(mean_revenue = mean(Euro)) %>%
kable(caption = "Comparing the mean revenue of two designs")
```
Like in the previous case, testing only a sample of 100 users out of the whole population leaves room for uncertainty. So, how certain can Violet be? A statistical model can give a more complete answer, covering the magnitude of the improvement as well as a level of certainty. Violet estimates a model that compares the means of the two conditions \@ref(cgm), assuming that the randomness follows a Gamma distribution \@ref(exp-gam-reg).
```{r rational-3, opts.label = "mcmc"}
library(rstanarm)
library(tidyverse)
library(bayr)
M_1 <-
RD %>%
stan_glm(Euro ~ Design,
family = Gamma(link="log"),
data = .)
P_1 <- posterior(M_1)
```
```{r opts.label = "mcsync"}
sync_CE(Rational, M_1, P_1)
```
The coefficients of the Gamma model are on a logarithmic scale, but when exponentiated, they can directly be interpreted as multiplications \@ref(speaking-multipliers). That precisely matches the research question, which is stated as a percentage increase rather than a difference (Table \@ref(tab:rational-4)).
```{r rational-4}
coef(P_1, mean.func = exp)
```
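The multiplicative reading rests on nothing more than the identity $\exp(a + b) = \exp(a)\exp(b)$: on the log scale the model adds the Design coefficient to the intercept, which after exponentiation becomes a multiplication. A quick check with arbitrary (hypothetical) numbers:

```{r}
a <- log(40)  # hypothetical intercept on the log scale
b <- log(1.1) # hypothetical design effect on the log scale
c(exp_of_sum = exp(a + b), product = exp(a) * exp(b))
```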
```{r opts.label = "inv"}
T_fixef <- fixef(P_1, mean.func = exp) %>%
select(center, lower, upper) %>%
as.matrix()
```
The results tell Violet that, most likely, the average user spends `r T_fixef[1,1]` Euro with the current design. The prototype seems to increase the revenue per transaction by a factor of $`r T_fixef[2,1]`$. That would be a sufficient increase; however, this estimate comes from a small sample of users and there remains a considerable risk that the true improvement factor is much weaker (or stronger). The coefficient table above tells us that with a certainty of 95% the true value lies between $`r T_fixef[2,2]`$ and $`r T_fixef[2,3]`$. But what precisely is the risk of the true value being lower than 1.1? This information can be extracted from the model (or rather, the posterior distribution):
```{r rational-5}
N_risk_of_failure <-
P_1 %>%
filter(parameter == "Designproto") %>%
summarize(risk_of_failure = mean(exp(value) < 1.1))
paste0("risk of failure is ", c(N_risk_of_failure))
```
So, risk of failure is just below 30%. With this information in mind Violet now has several options:
1. deciding that `r N_risk_of_failure$risk_of_failure * 100`% is a risk she *dares* to take and recommending to go forward with the development
2. *continue testing* more users to reach a higher level of certainty
3. improve the model by taking into account sources of evidence from the past, i.e. *prior knowledge*
<!-- The first option represents an unspecific tolerance towards risk. Extreme risk tolerance of managers became is infamous as one cause for large scale economic crises [BLACK_SWANs]. Her motives could be manifold, still: maybe she was nursed to take risks in her life; or she is confident that consequences in case of failure will be bearable. -->
<!-- The second and third options have one thing in common: they make use of *external knowledge*. This is explained in the coming section. The fourth option is another way of making *cumulative use of knowledge*, namely *[adaptive testing: find correct term]*. -->
```{r}
detach(Rational)
```
### Prior knowledge {#prior-knowledge}
It is rarely the case that we encounter a situation as a *blank slate*. Whether we are correct or not, when we look at the sky in the morning, we have some expectation of how likely rain is. We also take into account the season and the region, and even the very planet is sometimes taken into account: the Pathfinder probe carried a bag of high-tech gadgets to planet Mars. However, the included umbrella was for safe landing only, not to cover from precipitation, as Mars is a dry planet.
In most behavioral research it still is the standard that every experiment has to be judged on the produced data alone. For the sake of objectivity, researchers are not allowed to take into account previous results, let alone their personal opinion. In Bayesian statistics, you have the choice. You can make use of external knowledge, but you don't have to.
Violet, the rational design researcher, has been designing and testing e-commerce systems for many years and has supervised several large-scale roll-outs. So the current project is not a new situation at all. Off the top of her head, Violet produces the following table to capture her past twenty projects and the increase in revenue that had been recorded afterwards.
```{r}
attach(Rational)
```
```{r sim:rational-prior-1, opts.label = "invisible"}
set.seed(2)
D_prior <-
tibble(project = as.factor(1:20),
revenue_increase = exp(rnorm(20, log(1.01), .03))) %>%
as_tbl_obs()
```
```{r rational-prior-2}
D_prior
```
On this data set, Violet estimates another grand mean model that essentially captures prior knowledge about revenue increases after redesign:
```{r mc:sim:rational-prior-3, opts.label = "mcmc"}
M_prior <-
D_prior %>%
stan_glm(revenue_increase ~ 1,
family = Gamma(link = "log"),
data = .,
iter = 5000)
P_prior <- posterior(M_prior)
```
```{r opts.label = "mcsync"}
sync_CE(Rational, D_prior, M_prior, P_prior)
```
*Note* that the above model is a so-called Generalized Linear Model with a Gamma shape of randomness, which will be explained more deeply in chapter \@ref(duration-measures).
```{r include=FALSE}
T_prior <- coef(P_prior, mean.func = exp)
```
Table \@ref(tab:rational-prior-4) shows the results. The mean increase was `r T_prior[[1,2]]` and, without any further information, this is the best guess for the revenue increase in any future project (of Violet). A statistical model that is based on such a small number of observations usually produces very uncertain estimates, which is why the 95% credibility limits are wide. There even remains a considerable risk that a project results in a decrease of revenue, although that has never been recorded. (In Gamma models coefficients are usually multiplicative, so a coefficient $<1$ is a decline.)
```{r rational-prior-4}
coef(P_prior, mean.func = exp)
```
Or graphically, we can depict the belief as in Figure \@ref(fig:rational-prior-5):
```{r rational-prior-5, opts.label = "fig.small", fig.cap = "Prior knowledge about a parameter is expressed as a distribution"}
P_prior %>%
filter(parameter == "Intercept") %>%
mutate(value = exp(value)) %>%
ggplot(aes(x = value)) +
geom_density() +
xlab("Strength of prior belief")
```
The population average (of projects) is less favorable than what Violet saw in her present experiment. If the estimated revenue increase on the experimental data is correct, it would be a rather extreme outcome. And that is a potential problem, because extreme outcomes are rare. Possibly, the present results are overly optimistic (which can happen by chance) and do not represent the true change in revenue, i.e. on the whole population of users. In Bayesian statistics, mixing present results with prior knowledge is a standard procedure to correct this problem. In the following step, she uses the (posterior) certainty from M_prior and employs it as prior information (by means of a Gaussian distribution). Model M_2 has the same formula as M_1 before, but combines the information of both sources, data and prior certainty.
```{r mc:rational-prior-6, opts.label = "mcmc"}
T_prior <-
P_prior %>%
filter(parameter == "Intercept") %>%
summarize(mean = mean(value), sd = sd(value))
M_2 <-
stan_glm(formula = Euro ~ Design,
prior_intercept = normal(0, 100),
prior = normal(T_prior[[1,1]], T_prior[[1,2]]),
family = Gamma(link = "log"),
data = RD)
P_2 <- posterior(M_2)
```
*Note* that the standard deviation here is saying how the strength of belief is distributed for the average revenue increase. It is not the standard deviation in a population of projects. Table \@ref(tab:rational-prior-7) and Figure \@ref(fig:rational-prior-8) show a comparison of the models with and without prior information.
```{r opts.label = "mcsync"}
sync_CE(Rational, T_prior, M_2, P_2)
```
```{r opts.label = "invisible"}
T_fixef_2 <- fixef(P_2, mean.func = exp)
```
```{r rational-prior-7, echo = F}
P_comb <-
bind_rows(
filter(P_1, parameter == "Designproto"),
filter(P_2, parameter == "Designproto")) %>%
mutate(value = exp(value))
coef(P_comb)
```
```{r rational-prior-8, fig.cap = "Comparison of expected revenue increase with and without prior information"}
P_comb %>%
ggplot(aes(x = model, y = value)) +
geom_violin() +
geom_crossbar(data = coef(P_comb),
aes(y = center, ymin = lower, ymax = upper),
width = .05)
```
Model M_2 reduces the estimated expected revenue by a small amount. But remember that Violet has to meet the criterion of 110% in revenue. In the following she extracts the risk of failure (revenue increase smaller than 110%) from the posterior distribution (Table \@ref(tab:rational-prior-9)).
```{r rational-prior-9, opts.label = "fig.small"}
P_comb %>%
mutate(outcome = value < 1.1) %>%
group_by(model) %>%
summarize(risk_to_fail = mean(outcome)) %>%
kable(caption = "Estimated risk to fail with prior (M2) and without (M1)")
```
So what should Violet report to decision makers? Something like this: "The data from the study tells us that our chance of success is around two-thirds. *However*, in my long career I have never actually reached such an increase in revenue, so I'd rather say our chance of success is more around 50%."
```{r}
detach(Rational)
```
<!-- ## What is wrong with classic statistics? [TBC] -->
<!-- The short version of the story is this: in face of typically noisy data, there is always the risk that experimental data supports a certain hypothesis, seemingly. If the researcher calls the results supportive, although they were just due to randomness, this is called a *false alarm*. Many scientific disciplines have therefore committed to keep the overall number of false alarms under a threshold, for example 5%. -->
<!-- The even shorter version is that voluntary commitment failed. The true rate of false alarms is about a magnitude higher. -->
<!-- The p-value [...] Imagine giant replication study that showed that 35% of published psychology studies are not replicable. In fact, such a study exists and it has far-reaching consequences. [...] [REF] shows that rejection bias is one reason: studies that have not reached the *magic .05 level* of significance, have a lower chance of publication. [REF] sees as a reason that author's implicitly trick the system by *the garden of forking paths* strategy. The research project collects an array of variables, followed by a fishing expenditure. Theory is conveniently considered last, but written up in a pseudo-a prior way. -->
## Observations and measures {#observations-measures}
The statistical models presented in this book have in common that there is exactly one measure, which we call the *outcome*. In design research, outcome variables directly represent the current value of a design, or an aspect of it. This section provides an overview of the types of outcome variables.
### Interaction sequences {#interaction-seq}
A general concept in design is to think of purposes as *user tasks*. A user task can be defined by an initial state, a desired outcome and a procedure to get there. The knowledge required to follow the procedure can be found in two major locations: the user's mind (mental model) and the interface (design model). A principle of user-centered design is to create a design model that matches the user's current representation of the task, or at least easily enters the user's mind. The first is called *intuitiveness* of a design, the latter *ease-of-learning*.
A highly intuitive design matches the user's mental model and can be used out of the box, without the requirement to instruct or train a user. Intuitiveness relies on the match between design and mental model. Whereas the design model is explicit, the mental model needs to be elicited from the user's mind. Eliciting procedural mental models is difficult. You may think it is sufficient to just ask users about the idea of a procedure, but that is a limited approach. If I ask you about the precise steps in preparing a cup of coffee in your kitchen, that is probably more demanding and error-prone than asking you to prepare an actual cup of coffee. Two reasons make up for this discrepancy:
1. The human mind has separate facilities for procedural knowledge and declarative knowledge. If I ask you about a procedure, you first have to make a translation.
2. The more automated a task is, the less available it is to your conscious mind.
Our ability to verbalize our actions is limited. If there were no such verbalization limit, designing for tasks would be simple: just ask! Only mind-reading is more convenient. In user-centered design verbal inquiries are a central method. But results from user interviews can be incomplete or biased.
An example of incompleteness is that users sometimes fail to report what they think is obvious. For example, in modern desktop applications a function to undo a step is standard and it is probably one of the most frequently used functions. In other types of systems, for example database-driven information systems, implementing an Undo function is non-trivial, because database records can be changed by multiple users. It can be implemented by so-called roll-back mechanisms, but these reduce the performance and storage capacity of the system. When collecting user requirements for such a system, the developer has little interest in Undo, whereas the user has a huge interest, but does not mention it. If that is not resolved during the design process, there will be trouble.
Verbal reports of action sequences are often incomplete. But if there is something even worse, it is the limited ability of humans to imagine new situations, or novel ways of doing something. Imagine you had a time machine and traveled back to the year 1981, one year before the series Knight Rider went on broadcast. If you asked car drivers from that era how they would envision a navigation assistant device, you would probably end up with something that looks more like a Star Trek computer console than anything comparable to a modern navigation device.
This is why one central principle of user-centered design is the *direct observation* of users. By directly observing how a person prepares a cup of coffee, we can learn about the details of behavior and close the verbalization gap. By observing many action sequences, we can carve out detailed user requirements, or fix an existing design. Observing user interactions does not require one-way mirrors and nights of video coding. Because of the verbalization gap, a lot can be learned just by watching over the user's shoulder. The method of *think-aloud usability testing* even goes one step further and combines behavior observation with verbalization.
Interaction sequences can sometimes be collected without a human observer. Log files of web servers provide sequences of users navigating a web site. Plugin software is available that records keystrokes and mouse actions on computers. The difficult part is this: when observing 50 users doing a non-trivial task, no two interaction sequences are exactly the same (if I had to bet on it). By themselves, such sequences are of little value without further means of interpretation, and interpretation can go two ways: qualitative and quantitative.
The *qualitative design researcher* will collect interaction sequences, as well as verbal reports, and shuffle them around until an intelligible picture emerges. One way to describe the purpose of user testing is to *find out all the possible ways in which things can go wrong.* Every break-down that is observed at least once in a sample of test users tells a story of what may happen to hundreds or thousands of users in the future. That is why in early development phases, qualitative research rules.
<!-- is not just an art, but also scientific at its heart. -->
<!-- The diligent and skilled *qualitative* researchers are able to extract meaningful patterns from such sequences. For example, in usability tests, the following patterns are frequently observed and interpreted. -->
<!-- + a user jumping back to the main screen felt being in a dead-end -->
<!-- + a frowning user feels confused -->
<!-- + -->
The *quantitative* researcher will aim at deriving measures from the interaction sequence. Formally, *to measure* means assigning numbers to observed sequences, so that these can be brought into an *order*, at least. If the original data is qualitative, we need some method of transformation that gives us measures.
Sometimes, you have to go a long way up the qualitative route before you can derive useful measures. In [@schnittker2016] we coded sequences from nurses using an infusion pump. Individual sequences were compared to an optimal reference path. The closer a user stays on this path, the better. But how to measure similarity? We used the *Levenshtein distance*, which takes two sequences, the optimal path and the observed path, and determines the minimum number of edits to transform one sequence into the other. This results in a score for error proneness, the *deviation from optimal path*.
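In R, the Levenshtein distance is readily available through the base function `adist()`, once every action is encoded as a single character. A minimal sketch with made-up sequences:

```{r}
# one letter per action; the first observed path is optimal, the others deviate
optimal  <- "abcde"
observed <- c("abcde", "abxcde", "aced")
# minimum number of insertions, deletions and substitutions per observed path
adist(observed, optimal)
```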
Another example is a study we did on the active user paradox. We recorded interaction sequences of users doing editing tasks with a graphics software. The sessions were recorded and analysed using a behavioural coding scheme. First, events were classified on a low level (e.g. "reading the handbook", "first time trying a function") and later aggregated to broader classes (e.g. "exploratory behavior"). The number of exploratory events then served as a measure for the exploratory tendencies of the participant.
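Once the events are coded, deriving the measure is a matter of counting. A small sketch with hypothetical coded events:

```{r}
# hypothetical coded events, one row per observed event
Events <-
  tibble(Part  = c(1, 1, 1, 2, 2, 3),
         class = c("exploratory", "handbook", "exploratory",
                   "exploratory", "handbook", "handbook"))
# number of exploratory events per participant as an outcome measure
Events %>%
  group_by(Part) %>%
  summarize(n_explor = sum(class == "exploratory"))
```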
### Performance measures {#perf-measures}
User errors are qualitative by nature and need a transformation to be used as outcome variables. Other aspects of performance can be measured directly. A useful framework for outcome measures is the classic concept of *usability*, which ISO 9241-11 defines by the following three high-level criteria:
- effectiveness: can users accomplish their tasks?
- efficiency: what resources have to be spent on a task, e.g. time.
- satisfaction: how did the user like it?
While these criteria originated in the field of Human-Computer Interaction, they can easily be adapted to compare everything that people do their work with. Even within the field, they have been applied in hundreds of studies, and a hundred different ways of assessing these criteria have been reported.
*Effectiveness* is often measured by the *completion rate (CR)*. A classic waterfall approach would be to consult the user requirements documents and identify the, let's say eight, crucial tasks the system must support. A user test might then show that most users fail at two of these tasks, and a completion rate of 75% is recorded for the system. Completion rate is only a valid effectiveness measure with *distinct tasks*. Strictly, the set of tasks would also have to be *complete*, covering the whole system. When completion rate is taken from a series of *repetitive tasks*, it depends on the consequences of failure whether it measures effectiveness or efficiency. It is effectiveness when a failed operation is irreparable, such as a traffic accident, data loss on a computer or a medical error. But who cares for a single number between 0 and 1, when the user test provides such a wealth of information on *why* users failed? Effectiveness, in the sense of task completion, is primarily a qualitative issue and we shall rarely encounter it in this book.
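For the record, computing a completion rate is trivial once task outcomes are coded as logical values. A sketch with hypothetical outcomes of eight distinct tasks (six completed, two failed):

```{r}
# hypothetical outcomes of eight distinct tasks in one user test
Tasks <-
  tibble(task      = 1:8,
         completed = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE))
# completion rate is the proportion of completed tasks
mean(Tasks$completed)
```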
A more subtle notion of effectiveness is the *quality of outcome*, and despite the very term, it is a measure. (Perhaps, it should better be called *level of perfection*.) Reconsider the AUP study, where participants had to modify a given graphic, e.g. change some colors and erase parts of it. Of course, some participants worked neatly, whereas others used broader strokes (literally). There are several ways to rank all results by quality and thereby create an outcome variable.
*Efficiency* is where it really gets quantitative as we ask about resources: time, attention, strain, distraction and Euros. Efficiency can be measured in a variety of ways: time-on-task (ToT), clicks, mental workload or time spent watching the traffic while driving.
Counting the *number of clicks* before someone has found the desired piece of information is a coarse, but easy to acquire and intuitive measure of efficiency, and so is *time-on-task (ToT)*. <!-- Still, these It differs from ToT in that it only counts real action, not the time a participant read the manual or watched the sky. --> <!-- In the downside, this is a very limited definition of action. --> <!-- While errors are usually best analyzed qualitatively, *error rate* can be efficiency, too, when failures are non-critical, but can be repaired in some time, such as correcting a typing error. -->
ToT and other performance measures can be very noisy. The general recipe to reduce noise and improve certainty is to collect more data. The obvious way of doing this is by increasing the sample size, but that can be costly. A more efficient way to reduce uncertainty with a given sample is to collect more data per participant, which is called *repeated measures*. Chapter \@ref(mlm) will introduce models that deal with repeated measures in a straight-forward way.
<!-- Above, I did not mean to say that design research always requires the sometimes painful recording of interaction sequences. The majority of studies jumps over them and proceed to performance measures, directly, by counting attempts or pressing stop watches. -->
Counting steps-to-completion or errors often requires the laborious process of interaction analysis. Measures of duration are much easier to obtain and in many situations they are spot-on: in usability studies, *time-on-task* is a direct measure of efficiency. When controlling a car in dense city traffic, a few hundred milliseconds make a huge difference, and therefore *reaction time* is a valid performance measure. In experimental cognitive research, reaction times have successfully been used in countless experiments to reveal the structure of the mind.
<!-- The Stroop tasks is a golden evergreen cognitive psychology and serves as an example. Participants are shown a words drawn in one of three ink colors and are asked to name the ink as quick as possible. The Stroop effect occurs: in the incongruent condition, participants respond much slower than in the congruent and neutral (a few hundred ms). There is ample sweet replication of this effect and an ecosystem of theories surround it. -->
<!-- black tower PC + "explore" -> "can someone hand me a screwdriver" -->
<!-- CONDITION WORD : Ink -->
<!-- congruent [GREEN ]: green -->
<!-- incongr [YELLOW]: green -->
<!-- neutral [CHAIR ]: green -->
### Satisfaction and other feelings
The standard definition of usability covers three criteria: the two performance-related ones, effectiveness and efficiency, and satisfaction. The latter has long been held to be the enfant terrible of the three. That's not a big surprise, as satisfaction is about the *feelings* of users.
Early research in the 1990s made do with a few rating scales that measure satisfaction as the *absence of negative feelings*. Later, when the User Experience era took off, a wave of constructs and measures washed over the research community; here are just a few examples:
- beauty
- hedonic quality
- coolness
- credibility
- meaning of life
Feelings are a tricky business when it comes to measuring them. Most frequently, rating scales are employed, and often the same authors remark that the study should be replicated with more objective measures for emotional states, such as physiological measures or implicit measures. However, I haven't seen many studies where such measures produced good results.
<!-- Unfortunately those do not exist. -->
<!-- Also, several authors took on the same construct, but called it differently, or the other way round. -->
Despite all criticism, *self-report rating scales* are still the primary method of measuring feelings towards designs. Rating scale instruments exist for a variety of concepts, but they come in only a few forms: first, there exist single-item or multi-item scales. *Single-item* instruments are the hydrogen of rating scales. Because they are so short, they cause little interruption and are therefore useful for obtaining repeated measures over time. That is very valuable when the goal is to track feelings over an extended period of time. For example, the effect of learning can be seen in a decline of cognitive workload over time. From a methodological point of view, single-item scales are inferior, because many quality checks rest on having multiple items (see \@ref(psychometrics)).
*Multi-item* rating scales are the standard. Participants or users respond to a number of statements that have an overarching theme. In contrast to single items, multi-item scales are amenable to a number of procedures for quality checks, such as reliability. <!-- In design research, measures of feelings are usually taken as design benchmarks. --> <!-- It would be impractical to report responses of individual items. --> Whereas a psychometric assessment requires the data on item level, for the purpose of design evaluation multi-item ratings are often aggregated into total scores, such as the *average score*. However, aggregation always causes a loss of information, and with the multi-level models presented in \@ref(psychometrics), it is no longer required.
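As an illustration of such an aggregation, the following sketch averages hypothetical responses to a three-item scale per participant (at the cost of discarding item-level information):

```{r}
# hypothetical responses of two participants to a three-item scale
Ratings <-
  tibble(Part   = c(1, 1, 1, 2, 2, 2),
         Item   = c(1, 2, 3, 1, 2, 3),
         rating = c(4, 5, 4, 2, 3, 2))
# average score per participant
Ratings %>%
  group_by(Part) %>%
  summarize(score = mean(rating))
```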
Next to the number of items, rating scale instruments differ in *cardinality of the response*, which is the number of possible responses to an item. By far most rating scales use between four and nine *ordered bins*. In contrast, so called *visual analog* rating scales measure on a continuum. This can be done on paper, using a ruler to translate the response into numbers, or using a slider bar in a computer program. <!-- That being said, one can also think of situations, where the answer is a simple Yes/No, such as for the question: Do you think you solved the task correctly? -->
Finally, rating scales have either *unipolar or bipolar anchoring*. Many rating scales put labels (instead of just numbers) on the minimum and maximum bins. With a unipolar item, the left anchor is neutral, whereas the right is not, as in:
0: Dull ... 1: Bliss
Dull is not the opposite of bliss, it is the absence of it. (Just like zero is not the opposite of one.) One example of a bipolar scale is:
-1: Ugly ... + 1: Beautiful
Another consideration applies to binned (rather than continuous) bipolar scales: if such a scale has an uneven number of bins, the center bin is neutral.
-1: Ugly ... 0: Neutral ... +1: Beautiful
Sometimes, it is useful that participants have the opportunity for a neutral answer, for example, when the participant may just not know:
The help pages of the program are written in a clear language.
-1: Not at all ... 0: I didn't notice ... +1: Totally
Especially with long questionnaires, it can happen that participants get bored and just always respond neutrally. In such a case, an even number of bins, i.e. the absence of a neutral response, forces participants to make a choice of direction, at least.
What kind of rating scale you use has consequences for your statistical model. As we will see in section \@ref(psychometrics), multi-level models are a good match for multi-item rating scales. What also matters for the choice of model is whether you have been using a visual analog scale or a binned scale. Data from visual analog scales is more easily treated, by either a Gaussian \@ref(psychometrics) or a Beta linearized model \@ref(beta-reg). Both are relatively lean methods and easy to report. For binned rating scales, the complicated beast called ordinal logistic regression \@ref(ord-logist-reg) applies. My advice would be to use visual analog scales whenever possible, even if this means not exactly following the original instructions for the scale.
While rating scales are prevalent for measuring feelings, that does not mean there are no other, more objective, ways. There has been some buzz about *physiological measures*, like galvanic skin response or EEG, lately. In our own lab, I have seen these measures fail more often than succeed for the evaluation of products (such as wine or floor cleaning robots).
*Implicit measures* are means to assess certain attitudes or feelings by means of experimental setups. For example, we once tried to measure technophile attitude (geekism) using a variant of the classic Stroop task. In [@schmettow2013] we showed pictures of computers to a sample of students, followed by the Stroop task, which means that participants had to name the ink color of a color-printed word (e.g., "explore" printed in red). It was conceived that reaction time increases when a participant experiences a strong association, like how good it felt to build a computer all by yourself. The initial success was soon washed away by a failure to reproduce these results. Another implicit method sometimes proposed is the approach-avoidance task, which has gotten some merits in research on addiction and racism. In simple terms, participants (or users) are asked to push or pull a joystick, and it seems that they pull faster when they see something they like (a bottle of booze) and push faster when they dislike what they see (a photograph of war victims). However, I have seen this approach fail to produce relevant results in a design comparison experiment. Generally, such experiments produce reaction time differences below the 100 ms mark and therefore many trials are needed to carve out any differences. At the same time, I have doubts that the emotional reactions towards, say, computer interfaces play in the same league as the stimuli used in research on addiction or racism. Emotional responses in design research may just be too mellow to disturb cognitive processes with a measurable strength.
<!-- The problem, I feel, is that too many reseachers base their research on feelings on their own feelings. That is not always wrong, if these feelings are well-reflected, like was the case when the apple fell on Newton's head or when Mori discovered the Uncanny Valley effect. -->
<!-- ### Designs and Features [TBD] -->
<!-- ### the human factor [TBD] -->
## Descriptive statistics {#descriptive-stats}
In empirical research we systematically gather observations. Observations of the same kind are usually subsumed as variables. A set of variables that have been gathered on the same sample is called a data set, which typically is a table with variables in columns. In the most general meaning, *a statistic* is a single number that somehow represents relevant features of a data set, such as:
- frequency: how many measures of a certain kind can be found in the data set?
- mean: do measures tend to be located left (weak) or right (strong) on a scale?
- variance: are measures close together or widely distributed along the scale?
- association: does one variable X tend to change when another variable Y changes?
### Frequencies {#frequencies}
```{r}
attach(Sec99)
```
The most basic statistic of all probably is the number of observations on a variable $x$, usually denoted by $n_{x}$. The number of observations is a rough indicator for the amount of data that has been gathered. In turn, more data usually results in more accurate statistics, and higher levels of certainty can be reached.
```{r}
length(Ver20$ToT)
```
The number of observations is not as trivial as it may appear at first. In particular, it is usually not the same as the sample size, for two reasons: First, most studies employ repeated measures to some extent. You may have invited $N_\textrm{Part} = 20$ participants to your lab, but if each participant is tested on, let's say, $N_\textrm{Task} = 5$ tasks, the number of observations is $N_\textrm{Obs} = N_\textrm{Part}N_\textrm{Task} = 100$. Second, taking a valid measure can always fail for a variety of reasons, resulting in *missing values (NA)*. For example, in the 99 seconds study it happened that a few participants failed to fill in their age on the intake form. The researcher is left with fewer measures of age $n_{age}$ than there were participants.
```{r}
N_obs <- function(x) sum(!is.na(x))
N_obs(Ver20$age)
```
Another important issue is the distribution of observations across groups (Table \@ref(tab:sec99-6)). Again, the number of observations in a group is linked to the certainty we can gain on statistics of that group. Furthermore, it is sometimes important to have the distribution match the proportions in the population, as otherwise biases may occur.
```{r sec99-6}
Ver20 %>%
group_by(Gender) %>%
summarize(n()) %>%
kable(caption = "Absolute frequencies of gender")
```
The table above shows so-called *absolute frequencies*. When we have to compare frequencies of two groups of different size, it is often more appropriate to report *relative frequencies* or *proportions*:
```{r sec99-7}
n_Gender <- N_obs(Ver20$Gender)
Ver20 %>%
group_by(Gender) %>%
summarize(rel_freq = n()/n_Gender) %>%
kable(caption = "Relative frequencies of gender")
```
Summarizing frequencies of metric measures, such as time-on-task (ToT) or number of errors, is useful, too. However, a complication arises from the fact that continuous measures do not naturally fall into groups. Especially with duration measures, hardly any two measures are exactly the same.
```{r}
length(unique(Ver20$Gender))
length(unique(Ver20$ToT))
```
The answer to this problem is *binning*: the scale of measurement is divided into a number of adjacent sections, called bins, and all measures that fall into one bin are counted. For example, we could use bins of 10 seconds and assess whether the bin covering values from 90 up to (but not including) 100 is representative in that it contains a large proportion of values. If we put such a binned summary of frequencies into a graph, that is called a *histogram*.
```{r sec99-8, fig.cap = "Histogram showing relative frequencies"}
bin <- function(x, bin_width = 10) floor(x/bin_width) * bin_width
n_ToT <- N_obs(Ver20$ToT)
Ver20 %>%
mutate(bin = bin(ToT)) %>%
group_by(bin) %>%
summarize(rel_freq = n()/n_ToT) %>%
ggplot(aes(x = bin, y = rel_freq)) +
geom_col()
```
Strictly speaking, grouped and binned frequencies are not one statistic, but a vector of statistics. They approximate what we will later get to know more closely as a *distribution* \@ref(distributions).
### Central tendency {#central-tendency}
Reconsider the rational design researcher Jane \@ref(benchmarking-designs). When asked whether users can complete a transaction within 99 seconds, she looked at the population average of her measures. The population average is what we call the *(arithmetic) mean*. The mean is computed by summing over all measures and dividing by the number of observations. The mean is probably the most often used measure of central tendency, but two more are in use and have their own advantages: *median* and *mode*.
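Written as a formula, with $x_1, \ldots, x_n$ denoting the individual measures, the mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$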
```{r}
my_mean <- function(x) sum(x)/N_obs(x) ## not length()
my_mean(Ver20$ToT)
```
*Note* that I am using function `N_obs()` from Section \@ref(frequencies), not `length()`, to not accidentally count missing values (`NA`).
Imagine a competitor of the car rental company goes to court to fight the 99-seconds claim. I am no expert in juridical matters, but my suggestion is that one of the first questions regarded in court would probably be: what does "rent a car in 99 seconds" actually promise? One way to read it would be the mean ("on average users can rent a car in 99 seconds"), but here are some other ways to interpret the same slogan:
"50% (or more) of users can ... ". This is called the *median*. The median is computed by ordering all measures and identify the the element right in the center. If the number of observations is even, there is no one center value, and the mean of the center pair is used, instead.
```{r opts.label = "rtut.nr"}
my_median <- function(x){
n <- length(x)
center <- (n + 1)%/%2
if (n%%2 == 1)
sort(x, partial = center)[center]
else mean(sort(x, partial = center + 0:1)[center + 0:1])
}
my_median(Ver20$ToT)
```
Actually, the median is a special case of so-called *quantiles*. Generally, quantiles are based on the order of measures, and an X% quantile is that value where X% of measures are equal to or smaller. The court could decide that 50% of users is too lenient a criterion and demand that 75% of users must complete the task within 99 seconds for the slogan to be considered valid.
```{r}
quantile(Ver20$ToT, c(.50, .75))
```
A common pattern to be found in distributions of measures is that a majority of observations are clumped in the center region. The point of highest density of a distribution is called the *mode*. In other words: the mode is the region (or point) that is most likely to occur. For continuous measures this once again poses the problem that every value is unique. Sophisticated procedures exist to smooth over this inconvenience, but by the simple method of binning we can construct an approximation of the mode: just choose the center of the bin with the highest frequency. This remains a crude approximation; more advanced algorithms for estimating modes can be found in the R package *modeest*.
```{r}
mode <- function(x, bin_width = 10) {
  bins <- bin(x, bin_width) # assign every measure to its bin
  unique_bins <- unique(bins)
  # pick the most frequent bin and return its center
  unique_bins[which.max(tabulate(match(bins, unique_bins)))] + bin_width / 2
}
```
```{r}
mode(Ver20$ToT)
```
```{r sec99-9}
Ver20 %>%
summarize(
mean_ToT = mean(ToT),
median_ToT = median(ToT),
mode_ToT = mode(ToT)) %>%
kable(caption = "Summary of central tendency statistics")
```
The table above shows the three statistics for central tendency side by side. Mean and median are close together. This is frequently the case, but not always. Only when a distribution of measures is completely symmetric do mean and median coincide perfectly. In section \@ref(distributions) we will encounter distributions that are not symmetric. The more skewed a distribution is, the larger the difference between mean and median becomes (Figure \@ref(fig:dist-skew)).
```{r dist-skew, opts.label = "invisible", fig.cap = "Left-skewed, right-skewed and symmetric distributions"}
set.seed(42)
Distributions <- tribble(~Dist, ~a, ~b,
"left_skewed", 8, 1,
"symmetric", 8, 8,
"right_skewed", 1, 8)
Data <- as_tibble(lapply(Distributions, rep, 1000)) %>%
mutate(y = rbeta(3000, a, b))
Stats <-
Data %>%
group_by(Dist) %>%
summarize(mean = mean(y),
median = median(y),
mode = modeest::shorth(y)) %>%
gather(Statistic, value, - Dist)
Data %>%
ggplot(aes(x = y)) +
facet_grid(Dist ~ .) +
geom_histogram(binwidth = .05) +
geom_vline(data = Stats, aes(col = Statistic, xintercept = value))
```
To be more precise: for left-skewed distributions, the mean is strongly influenced by a few, but extreme, values in the left tail of the distribution. The median only counts the number of observations to both sides and is not influenced by how extreme these values are. Therefore, it is located more to the right. The mode does not regard any values other than those in the densest region and just marks that peak. The same principles hold in reversed order for right-skewed distributions.
To summarize: the mean is the most frequently used measure of central tendency, one reason being that it is a so-called *sufficient statistic*, meaning that it exploits the full information present in the data. The median is frequently used when extreme measures are a concern. The mode is the point in the distribution that is most typical.
### Dispersion {#dispersion}
In a symmetric distribution with exactly one peak, mean and mode coincide and the mean represents the most typical value. But a value being *more* typical does *not* mean it is *very* typical. That depends on how the measures are dispersed over the whole range. In the figure below, the center of the narrow distribution holds 60% of all measures, as compared to 40% in the wide distribution, and is therefore more representative (Figure \@ref(fig:dispersion-1)).
```{r dispersion-1, fig.cap = "Narrow and wide distributions"}
D_disp <-
tribble(~y, ~narrow, ~wide,
1, 0, 1,
2, 2, 2,
3, 6, 4,
4, 2, 2,
5, 0, 1) %>%
gather(Distribution, frequency, -y)
D_disp %>%
ggplot(aes(x = y,
y = frequency)) +
facet_grid(Distribution ~ .) +
geom_col()
```
A very basic way to describe dispersion of a distribution is to report the *range* between the two extreme values, *minimum* and *maximum*. These are easily computed by sorting all values and selecting the first and the last element. Coincidentally, they are also special cases of quantiles, namely the 0% and 100% quantiles.
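As a quick check, the 0% and 100% quantiles indeed reproduce minimum and maximum (using the `Ver20` data that is still attached at this point):

```{r}
quantile(Ver20$ToT, c(0, 1))
```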
A *boxplot* is a commonly used geometry to examine the shape of dispersion. Similar to histograms, boxplots use a binning mechanism and are useful for continuous measures. Whereas histograms use equidistant bins on the scale of measurement, boxplots create four bins based on 25% quantile steps (Figure \@ref(fig:sec99-10)). These are also called *quartiles*.
```{r sec99-10, fig.cap = "A boxplot shows quartiles of a distribution"}
Ver20 %>%
ggplot(aes(y = ToT)) +
geom_boxplot()
```
The min/max statistic uses just these two values and therefore does not fully represent the amount of dispersion. A statistic for dispersion that exploits the full data is the *variance*, which is the mean of squared deviations from the mean. Squaring the deviations makes variance difficult to interpret, as it is no longer on the same scale as the measures. The *standard deviation* solves this problem by taking the square root of the variance. By reversing the square, the standard deviation is on the same scale as the original measures and can easily be compared to the mean (Table \@ref(tab:sec99-11)).
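In formulas, using the same notation as for the mean:

$$\textrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad \textrm{SD}(x) = \sqrt{\textrm{Var}(x)}$$

Note that this is the population form with divisor $n$, which matches the little `var()` helper defined below; base R's `var()` uses the sample divisor $n - 1$, so the results differ slightly.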
```{r sec99-11}
min <- function(x) sort(x)[1]
max <- function(x) quantile(x, 1)
range <- function(x) max(x) - min(x)
var <- function(x) mean((mean(x) - x)^2)
sd <- function(x) sqrt(var(x))
Ver20 %>%
summarize(min(ToT),
max(ToT),
range(ToT),
var(ToT),
sd(ToT)) %>%
kable(caption = "Summary table with five dispersion statistics")
```
```{r opts.label = "invisible"}
rm(min, max, range, var, sd)
```
```{r}
detach(Sec99)
```
### Associations {#associations}
- Are elderly users slower at navigating websites?
- How does reading speed depend on font size?
- Is the result of an intelligence test independent from gender?
In the previous section we have seen how all individual variables can be described by location and dispersion. A majority of research deals with associations between measures and the present section introduces some statistics to describe them. Variables represent properties of the objects of research and fall into two categories: *Metric variables* represent a measured property, such as speed, height, money or perceived satisfaction. *Categorical variables* put observations (or objects of research) into non-overlapping groups, such as experimental conditions, persons who can program or cannot, type of education etc. Consequently, associations between any two variables fall into precisely one of three cases, as shown in Table \@ref(tab:associations-1).
```{r associations-1}
tribble(~between, ~categorical, ~metric,
"categorical", "frequency cross tables", "differences in mean",
"metric", "", "correlations") %>%
kable(caption = "Possible associations between any two variables")
```
#### Categorical associations {#cat-associations}
Categorical variables group observations, and when both variables in an association are categorical, the only way to compare them is by frequencies. To illustrate the categorical-categorical case, consider a study to assess the safety of two syringe infusion pump designs, called Legacy and Novel. All participants of the study were asked to perform a typical sequence of operation on both devices (categorical variable Design), and it was recorded whether the sequence was completed correctly or not (categorical variable Correctness, Table \@ref(tab:ipump-1)).
```{r ipump-1}
attach(IPump)
D_agg %>%
filter(Session == 3) %>%
group_by(Design, completion) %>%
summarize(frequency = n()) %>%
ungroup() %>%
spread(completion, frequency) %>%
kable(caption = "Summary of task completion")
```
Besides the troubling result that incorrect completion is the rule, not the exception, there is almost no difference between the two designs. Note that in this study both professional groups were even in number. If that is not the case, absolute frequencies are difficult to compare and we had better report *relative frequencies*. Note how every row sums up to $1$ in Table \@ref(tab:ipump-2):
```{r ipump-2}
D_agg %>%
filter(Session == 3) %>%
group_by(Design, completion) %>%
summarize(frequency = n()) %>%
group_by(Design) %>%
mutate(frequency = frequency/sum(frequency)) %>%
ungroup() %>%
spread(completion, frequency) %>%
kable(caption = "A cross table with relative frequencies")
```
In addition, absolute or relative frequencies can be shown in a stacked *bar plot* (Figure \@ref(fig:ipump-3)).
```{r ipump-3, fig.cap = "A stacked bar plot with absolute frequencies"}
D_agg %>%
ggplot(aes(x = Design, fill = completion)) +
geom_bar()
```
#### Categorical-metric associations {#cat-metric-associations}
Associations between *categorical and metric* variables are reported by *grouped location statistics*. In the case of the two infusion pump designs, the time spent to complete the sequence is compared by group means in Table \@ref(tab:ipump-4).
```{r ipump-4}
D_agg %>%
filter(Session == 3) %>%
group_by(Design) %>%
summarize(mean_ToT = mean(ToT),
sd_ToT = sd(ToT)) %>%
kable(caption = "Comparison of group means")
```
For illustrating categorical-metric associations, *boxplots* have proven useful. Boxplots show differences in central tendency (median) and dispersion (other quartiles) simultaneously. In Figure \@ref(fig:ipump-5), we observe that Novel produces shorter ToT and also seems to be less dispersed.
```{r ipump-5, fig.cap = "Boxplot comparison of two groups"}
D_agg %>%
ggplot(aes(x = Design, y = ToT)) +
geom_boxplot()
```
#### Covariance and correlation {#cov-corr}
For associations between a pair of metric variables, *covariance* and *correlations* are commonly employed statistics.
A covariance is a real number that is *zero* when there is *no association* between two variables. When two variables move in the *same direction*, covariance becomes *positive*. When they move in *opposite directions*, covariance is *negative*.
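As a formula, with $\bar{x}$ and $\bar{y}$ denoting the respective means and $n$ the number of (paired) observations, the covariance of two variables $x$ and $y$ is

$$\textrm{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

The correlation re-scales the covariance to the range $[-1; 1]$ by dividing by both standard deviations: $r_{xy} = \textrm{Cov}(x, y) / (\textrm{SD}(x)\,\textrm{SD}(y))$.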
For an illustration, consider the following hypothetical example of a study on the relationship between mental ability and performance in a minimally-invasive surgery (MIS) task. MIS tasks are known to involve a lot of visual-spatial cognition, which means that performance on other visual-spatial tasks should be associated.
<!-- In the Laptrain study, a number of visual-spatial tests were given to participants, such as a mental rotation task (MRT). The Corsi block-tapping task was even taken twice, because the researchers are also interested in how stable over time the scores are. -->
<!-- The data set is contained in package *Asymptote*, which resides on GitHub. -->
<!-- In this experiment, participants were trained in twelve trials, but we only take the last. -->
<!-- ```{r} -->
<!-- attach(asymptote) -->
<!-- D_tests <- -->
<!-- D_laptrain %>% -->
<!-- filter(trial == 12) %>% -->
<!-- select(MRT, Corsi1, Corsi2, ToT = Duration) -->
<!-- D_tests %>% sample_n(5) -->
<!-- ``` -->
The following code simulates such a set of measures from a multivariate normal distribution. The associations are defined as a matrix of correlations and are then up-scaled by the standard deviations to result in covariances. Later, we will do the reverse to obtain correlations from covariances.
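In matrix notation, this up-scaling multiplies the correlation matrix $R$ by the standard deviations on both sides, which is what the `cor2cov()` helper below computes:

$$\Sigma = \textrm{diag}(\sigma)\, R\, \textrm{diag}(\sigma)$$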
```{r sim:covariance-1}
cor2cov <- function(cor, sd) diag(sd) %*% cor %*% t(diag(sd))
cor_mat <- matrix(c(1, .95, -.5, .2,
.95, 1, -.5, .2,
-.5, -.5, 1, .15,
.2, .2, .15, 1), ncol = 4)
sd_vec <- c(.2, .2, 40, 2)
mean_vec <- c(2, 2, 180, 6)