<!DOCTYPE html>
<html lang="en">
<head>
<!-- ## for client-side less
<link rel="stylesheet/less" type="text/css" href="/theme/css/style.less">
<script src="http://cdnjs.cloudflare.com/ajax/libs/less.js/1.7.3/less.min.js" type="text/javascript"></script>
-->
<link rel="stylesheet" type="text/css" href="/theme/css/style.css">
<link rel="stylesheet" type="text/css" href="/theme/css/pygments.css">
<link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=PT+Sans|PT+Serif|PT+Mono">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Gaël">
<meta name="description" content="Posts and writings by Gaël">
<meta name="keywords" content="">
<title>
And yet it moves
– The pros of the A/B testing method </title>
</head>
<body>
<aside>
<div id="user_meta">
<div id="metalogo">
<a href="/">
<img src="/images/logo.svg" alt="logo" id="logo">
</a>
<a href="/" id="title">And yet it moves</a>
</div>
<p></p>
<ul>
<li><a href="/category/data-processing.html" class="active">Data processing</a></li>
</ul>
</div>
</aside>
<main>
<header>
<p>
<a href="/">Index</a> ¦ <a href="/archives.html">Archives</a>
</p>
</header>
<article>
<div class="series_links">
<a style="float: left;" href='/basics.html'>
&lt;&lt; First article of the series
</a>
</div>
<div class="article_meta">
<p style="text-align: right; margin: 0;">Posted on: Sun, 15 May 2016</p>
</div>
<div class="article_title">
<h3><a href="/pros-of-AB-testing.html">The pros of the A/B testing method</a></h3>
</div>
<div class="article_text">
<img alt="" class="align-center" src="/images/dab.gif" style="width: 640px; height: auto; max-width: 100%;"/>
<div class="section" id="test-setup">
<h2 id="test-setup">Test setup</h2>
<p>As we have already said, when a (statistical) test is applied to data,
it answers the question for which it was designed with a given
<em>level of significance</em> and <em>power</em>: two components of what we could
improperly call “level of confidence” for simplicity. Here:</p>
<ul class="simple">
<li>The <em>level of significance</em> is understood here as the probability of correctly
concluding that the variants (<object class="align-middle" data="/images/orange_btn.svg" type="image/svg+xml">orangeBtn</object> and <object class="align-middle" data="/images/green_btn.svg" type="image/svg+xml">greenBtn</object>
here) <em>have no significant impact</em> on the click-through rate when they indeed have none
(i.e. one minus the type I error rate, which is more commonly called the confidence level).</li>
<li>The <em>power</em> is the probability of correctly concluding that the
variants <em>have a significant impact</em> on the click-through rate when they indeed have one.</li>
</ul>
<p>These two values in turn depend on the chosen sample size and the actual
difference in click-through rate: a larger sample size lets the test reach a higher
level of significance, a higher power, or both, and a larger actual difference is
easier to detect, which raises the power of the test.</p>
<p>Given a minimum level of significance, a minimum power and a detection
threshold (i.e. minimal significant difference to be identified), the sample
size is fully determined <em>from the start</em> in A/B testing.</p>
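<p>To give an idea of what <em>determined from the start</em> means, here is a minimal
sketch of such a calculation. It relies on the usual normal approximation for comparing two
proportions rather than on an exact test, so the figures are only indicative. The function
name, the 10% and 15% click-through rates and the default thresholds are assumptions made
for the illustration. Note that <code>alpha</code> is the type I error rate, i.e. one minus
the level of significance as defined above.</p>

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test.

    Normal approximation: an exact test (e.g. Fisher's) would give a
    slightly different figure.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    p_mean = (p_a + p_b) / 2
    spread_null = sqrt(2 * p_mean * (1 - p_mean))         # no real difference
    spread_alt = sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))  # real difference
    n = ((z_alpha * spread_null + z_power * spread_alt) / (p_a - p_b)) ** 2
    return ceil(n)


# Hypothetical click-through rates: 10% for A and 15% for B.
n = sample_size_per_variant(0.10, 0.15)
print(f"{n} visitors per variant, {2 * n} in total")
```

<p>Since the detection threshold enters the formula squared, halving it roughly
quadruples the required sample size.</p>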
<p>In the multi-armed bandit framework, the sample size cannot be determined that
way, basically because we don’t know when and how many times the algorithm is
going to favour another variant (i.e. switch).
In practice, this means that you have
no idea beforehand of the level of confidence to expect in that case.</p>
<p><em>Really? No idea at all?!</em> Actually, you could resort to mathematics and even
numerical modelling to get a proper estimate. However, that estimate depends on
the current state of the algorithm (so it can only be calculated after the data is gathered, not before).
That said, I will show some hints that the proposed multi-armed bandit
strategy does introduce a bias that would invalidate our confidence estimates.
Anyway, there seems to be no straightforward way to properly set up the
multi-armed bandit strategy from the start.</p>
<p>On the other hand, the multi-armed
bandit strategy can be set up in a <em>set-and-forget</em> mode. You would not get any
more information about the confidence in the results, but you could use your
whole population instead of a sample of it and yet keep a high overall
click-through rate.</p>
</div>
<div class="section" id="a-b-testing-is-more-powerful">
<h2 id="ab-testing-is-more-powerful">A/B testing is more powerful</h2>
<p>Now, determining that the variants do have an impact on the
click-through rate seems more important than proving that there is none.
So, it seems interesting to represent the power of each method at a given
level of significance to see how large the sample should be to reach a given power:</p>
<div class="figure">
<img alt="" src="/images/power_vs_sample_size.svg"/>
<p class="caption">Monte-Carlo estimation of the power of Fisher’s exact contingency test
as a function of the sample size (i.e. the number of visitors “spent” in
each method) at a given level of significance.</p>
<div class="legend">
<p>significance: level of significance of the contingency test.</p>
<p><span class="math">\(P(O|A)\)</span>: (hidden) real click-through rate of the <span class="math">\(A\)</span>
(<object class="align-middle" data="/images/orange_btn.svg" type="image/svg+xml">orangeBtn</object>) website variant used as an input in the simulation.</p>
<p><span class="math">\(P(O|B)\)</span>: (hidden) real click-through rate of the <span class="math">\(B\)</span>
(<object class="align-middle" data="/images/green_btn.svg" type="image/svg+xml">greenBtn</object>) website variant used as an input in the simulation.</p>
<p>error bands: error bar on the averaged power value.</p>
</div>
</div>
<p>It appears clearly that A/B testing reaches higher powers
at far smaller sample sizes: A/B testing typically requires sample sizes <em>5
times</em> smaller than the multi-armed bandit strategy!</p>
<p><em>Why is that?</em> As I claimed before when I presented the A/B testing
method, it is generally more efficient to serve each variant equally often
than it is to favour one. We will see by the end of this series that A/B
testing always seems more powerful than the multi-armed bandit strategy with
non-zero <span class="math">\(\varepsilon\)</span>.</p>
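<p>The kind of Monte-Carlo estimation behind such a figure can be sketched as follows.
This is not the code used to produce the figure above: the click-through rates, the number
of runs, the function names and the reading of <span class="math">\(\varepsilon\)</span> as
the share of visitors sent to the best-looking variant are all assumptions. Fisher’s exact
test is written with the standard library only to keep the example self-contained.</p>

```python
import random
from math import comb


def fisher_exact_p(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value of Fisher's exact test on a 2x2 click table."""
    clicks = clicks_a + clicks_b
    denom = comb(n_a + n_b, clicks)

    def prob(k):  # hypergeometric probability that A gets k of the clicks
        return comb(n_a, k) * comb(n_b, clicks - k) / denom

    observed = prob(clicks_a)
    low, high = max(0, clicks - n_b), min(n_a, clicks)
    # Two-sided: add up every table no more likely than the observed one.
    tail = sum(p for p in map(prob, range(low, high + 1))
               if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def gather_ab(rates, n_visitors, rng):
    """A/B testing: alternate variants so that each gets half the visitors."""
    shown, clicks = [0, 0], [0, 0]
    for i in range(n_visitors):
        arm = i % 2
        shown[arm] += 1
        clicks[arm] += rng.random() < rates[arm]
    return shown, clicks


def gather_bandit(rates, n_visitors, rng, exploit_prob=0.9):
    """Epsilon-greedy (assumed form): mostly show the best-looking variant."""
    shown, clicks = [0, 0], [0, 0]
    for _ in range(n_visitors):
        if 0 in shown or rng.random() >= exploit_prob:
            arm = rng.randrange(2)  # explore: pick a variant at random
        else:  # exploit: best observed rate so far, ties broken at random
            arm = max((0, 1), key=lambda a: (clicks[a] / shown[a], rng.random()))
        shown[arm] += 1
        clicks[arm] += rng.random() < rates[arm]
    return shown, clicks


def estimate_power(gather, rates, n_visitors, alpha=0.05, runs=300, seed=0):
    """Fraction of simulated experiments in which the test sees a difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        shown, clicks = gather(rates, n_visitors, rng)
        hits += fisher_exact_p(clicks[0], shown[0], clicks[1], shown[1]) < alpha
    return hits / runs


rates = (0.10, 0.15)  # hypothetical click-through rates of A and B
for n in (200, 800):
    print(n, estimate_power(gather_ab, rates, n),
          estimate_power(gather_bandit, rates, n))
```

<p>With only 300 runs per point the estimates are noisy; more runs would narrow the
error bands at the cost of a longer simulation.</p>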
</div>
<div class="section" id="a-b-testing-is-correct-more-often-at-small-sample-sizes">
<h2 id="ab-testing-is-correct-more-often-at-small-sample-sizes">A/B testing is correct more often at small sample sizes</h2>
<p>The problem with power estimations is that they are tied to one test.</p>
<p>From what I have seen, the original blog author and quite a few people out there
really don’t see the appeal of proper testing (e.g. the original blog author
seems to believe that A/B testing is one-staged and he never talks about
confidence, significance, power or even testing). In a nutshell: they
rely on luck alone to provide a correct answer.</p>
<p>From the testing perspective however, <strong>assuming</strong> that there is a difference
anyway is equivalent to choosing a level of significance of <span class="math">\(0\%\)</span>.
Yet that does not mean that the power would reach <span class="math">\(100\%\)</span>: we can still
choose the wrong variant.</p>
<p>So, if we ditch the test altogether, would the multi-armed
bandit method be correct more often than A/B testing? The answer turned out to
be (at least generally) <em>no</em>:</p>
<div class="figure">
<img alt="" src="/images/correct_vs_sample_size.svg"/>
<p class="caption">The A/B testing data-gathering strategy and the multi-armed bandit strategy
were used to estimate the click-through rates of both variants of the website.</p>
<div class="legend">
<p><span class="math">\(P(O|A)\)</span>: (hidden) real click-through rate of the <span class="math">\(A\)</span>
(<object class="align-middle" data="/images/orange_btn.svg" type="image/svg+xml">orangeBtn</object>) website variant used as an input in the simulation.</p>
<p><span class="math">\(P(O|B)\)</span>: (hidden) real click-through rate of the <span class="math">\(B\)</span>
(<object class="align-middle" data="/images/green_btn.svg" type="image/svg+xml">greenBtn</object>) website variant used as an input in the simulation.</p>
<p>error bands: error bar on the averaged power value.</p>
</div>
</div>
<p>In this example, the A/B framework again provides better results at
smaller sample sizes. For instance, the A/B framework only requires
<span class="math">\(300\)</span> trials to be correct <span class="math">\(9\)</span> times out of
<span class="math">\(10\)</span>. To provide the <em>same</em> guarantee, the <span class="caps">MAB</span> strategy would
require <span class="math">\(600\)</span> trials: twice as many!</p>
<p>This can be explained the same way as above: showing each website
variant to the visitors equally often is a more efficient way to find the
best-performing variant. This also means that comparing the performances of the two
methods at the same sample size is largely misleading.</p>
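<p>On the A/B side, the chance of picking the right variant without any test can even be
computed exactly, because each variant is shown to a fixed number of visitors. Here is a
minimal sketch of that computation; the rates (10% and 15%) and the function names are
hypothetical. The bandit side has no such simple formula, since its allocation depends on
the data gathered so far, hence the simulations.</p>

```python
from math import comb


def binomial_pmf(n, p):
    """Probabilities of 0..n clicks among n visitors with click rate p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def prob_ab_picks_best(p_worse, p_better, n_visitors):
    """Chance that the better variant gets strictly more clicks.

    Each variant is shown to n_visitors // 2 visitors and ties count as
    failures. Meant for moderate sizes (a few thousand visitors at most).
    """
    per_variant = n_visitors // 2
    pmf_worse = binomial_pmf(per_variant, p_worse)
    pmf_better = binomial_pmf(per_variant, p_better)
    prob = 0.0
    cdf_worse = 0.0  # P(worse variant gets fewer than k clicks)
    for k in range(per_variant + 1):
        prob += pmf_better[k] * cdf_worse
        cdf_worse += pmf_worse[k]
    return prob


# Hypothetical rates: 10% for the worse variant, 15% for the better one.
for n in (100, 300, 600):
    print(n, round(prob_ab_picks_best(0.10, 0.15, n), 3))
```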
</div>
<div class="section" id="a-b-testing-provides-more-accurate-estimations-of-the-differences">
<h2 id="ab-testing-provides-more-accurate-estimations-of-the-differences">A/B testing provides more accurate estimations of the differences</h2>
<p>A correct estimation of the differences in click-through rates is
important: it is obviously required in methods relying on its quantification
(e.g. multivariate analysis) but also to actually determine which
variant performs better, as stating that</p>
<div class="math">
\begin{equation*}
a > b
\end{equation*}
</div>
<p>is equivalent to</p>
<div class="math">
\begin{equation*}
a - b > 0\text{.}
\end{equation*}
</div>
<p>I estimated the difference in click-through rates in the same setup
as above and obtained the following figure:</p>
<div class="figure">
<img alt="" src="/images/difference_vs_sample_size.svg"/>
<p class="caption">Estimates of the difference in click-through rate at different sample sizes
using the A/B testing data-gathering strategy and the multi-armed bandit
strategy (<span class="math">\(\varepsilon = 90\%\)</span>). The A/B testing method
needs a smaller sample size to reach the expected value of
<span class="math">\(P(O|B) - P(O|A) = 5\%\)</span>.</p>
<div class="legend">
<p><span class="math">\(P(O|A)\)</span>: (hidden) real click-through rate of the <span class="math">\(A\)</span>
(<object class="align-middle" data="/images/orange_btn.svg" type="image/svg+xml">orangeBtn</object>) website variant used as an input in the simulation.</p>
<p><span class="math">\(P(O|B)\)</span>: (hidden) real click-through rate of the <span class="math">\(B\)</span>
(<object class="align-middle" data="/images/green_btn.svg" type="image/svg+xml">greenBtn</object>) website variant used as an input in the simulation.</p>
<p>error bands: error bar on the averaged power value.</p>
</div>
</div>
<p>We can clearly see that the A/B testing data-gathering method converges
toward the expected value of <span class="math">\(5\%\)</span> much faster than the <span class="caps">MAB</span>
strategy. Worse, the gap that remains between the <span class="caps">MAB</span> estimate and the
expected value, though small, persists for a very long time. This remaining gap is not enough to
talk about a bias in the results yet, but it is one of the many reasons why
we need to go further there.</p>
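<p>Part of the slower convergence can be understood from the precision of the estimate
alone. The sketch below computes the standard error of the estimated difference for a given
split of the visitors between the two variants. It assumes a fixed split, which is only a
rough stand-in for the bandit (whose split is random and depends on the data), and it says
nothing about the remaining offset discussed above. The rates, the 95% share and the
function name are assumptions.</p>

```python
from math import sqrt


def difference_std_error(p_a, p_b, n_total, share_b):
    """Standard error of the estimated difference p_b - p_a.

    share_b is the fraction of the n_total visitors who saw variant B;
    the two samples are assumed independent and of fixed size.
    """
    n_b = n_total * share_b
    n_a = n_total - n_b
    return sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)


p_a, p_b = 0.10, 0.15  # hypothetical rates, 5% apart
# 50%: A/B testing. 95%: a bandit assumed to favour B almost all the time.
for share_b in (0.5, 0.95):
    se = difference_std_error(p_a, p_b, n_total=1000, share_b=share_b)
    print(f"share of B = {share_b:.0%}: standard error = {se:.4f}")
```

<p>For the same total number of visitors, the rarely shown variant dominates the
uncertainty, so the unbalanced split needs more visitors to reach the precision of the
balanced one.</p>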
</div>
<div class="section" id="summary-what-is-a-b-testing-good-at">
<h2 id="summary-what-is-ab-testing-good-at">Summary: what is A/B testing good at?</h2>
<p>These results illustrate what it means for A/B testing to answer the
question: <em>Is there a difference?</em></p>
<ul class="simple">
<li>Its data-gathering scheme is <strong>efficient</strong>: it provides a very high power
in the contingency tests and it leads to correct results more often at
smaller sample sizes than the <span class="caps">MAB</span> strategy.</li>
<li>It makes use of the contingency test to actually answer that question
at a given “level of confidence” (significance and power). It is also
used to determine what the sample size should be from the start.</li>
<li>It also generally provides more precise and more accurate estimates
of the actual difference in click-through rates.</li>
</ul>
<p>Beyond demonstrating that the presented multi-armed bandit strategy <strong>is not
systematically better</strong> than A/B testing, the results outline another important
fact: this multi-armed bandit strategy requires a larger sample size to provide
the same practical guarantees as A/B testing.</p>
<p>From my perspective, this is just like an insurance seller undermining
the competition by comparing the price of his most basic contract with
the price of a broader contract (covering more stuff). Yes, the basic
contract is cheaper, but this is comparing apples to oranges.</p>
<p>That said, that basic contract might very well be sufficient for web optimisation
and this is what I am going to assess in the next post.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
var location_protocol = (false) ? 'https' : document.location.protocol;
if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
</div>
<div class="series_links">
<a style="float: left;" href='/AB-and-MAB.html'>
&lt; Part 2: A/B testing and the multi-armed bandit
</a>
</div>
<div class="series_links">
<a style="float: right;" href='/reward-maximisation.html'>
Part 4: Reward maximisation &gt;
</a>
</div>
<div style="clear: both;"></div>
</article>
<div id="ending_message">
<p>© Gaël. Built using <a href="http://getpelican.com" target="_blank">Pelican</a>. Theme by Giulio Fidente on <a href="https://github.com/gfidente/pelican-svbhack" target="_blank">github</a>. </p>
</div>
</main>
</body>
</html>