<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>The pros of the A/B testing method</title>
<link rel="stylesheet" href="/theme/css/main.css" />
<!--[if IE]>
<script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body id="index" class="home">
<header id="banner" class="body">
<h1><a href="/">And yet it moves! </a></h1>
<nav><ul>
<li class="active"><a href="/category/data-processing.html">Data processing</a></li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header>
<h1 class="entry-title">
<a href="/pros-of-ab-testing.html" rel="bookmark"
title="Permalink to The pros of the A/B testing method">The pros of the A/B testing method</a></h1>
</header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2016-05-03T15:00:00+01:00">
Published: Tue. 03 May 2016
</abbr>
<address class="vcard author">
By <a class="url fn" href="/author/gael.html">Gaël</a>
</address>
<p>In <a href="/category/data-processing.html">Data processing</a>.</p>
</footer><!-- /.post-info --> <p>This post is the fourth technical post of the <a href="{filename}01-index.md">series</a>.</p>
<p><em>A/B testing was designed to tell whether or not an event has a significant
effect on an outcome. The strength of the method comes from the contingency
test, which helps both to determine what the sample size should be and how much
confidence we should have in the results.</em></p>
<p><em>It should not be surprising that the multi-armed bandit strategy is not as
good as A/B testing at answering this question. In particular, the
multi-armed bandit strategy requires a larger sample to provide the same
guarantees as A/B testing under the same conditions, and that strategy actually
skews the results (contrary to what was claimed in the original blog post).</em></p>
<h1 id="ab-testing-is-more-efficient-at-telling-whether-there-is-a-difference">A/B testing is more efficient at telling whether there is a difference</h1>
<p>For now, we are going to blindly assume that we can apply a <em>contingency
test</em> (here, <a href="https://en.wikipedia.org/wiki/Fisher%27s_exact_test">Fisher's exact test</a>)
to the data gathered through the multi-armed bandit strategy.</p>
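<p>As a quick reference, here is a minimal sketch of how such a test can be run on
click data with SciPy; the counts are made up for illustration and this is not the
code used to produce the figures in this series:</p>
<pre><code>
# Minimal sketch: Fisher's exact test on a 2x2 contingency table
# (clicks / no-clicks for variants A and B). Counts are illustrative.
from scipy.stats import fisher_exact

clicks_a, trials_a = 110, 1000      # variant A: ~11% observed click-through rate
clicks_b, trials_b = 150, 1000      # variant B: ~15% observed click-through rate

table = [[clicks_a, trials_a - clicks_a],
         [clicks_b, trials_b - clicks_b]]

odds_ratio, p_value = fisher_exact(table)
print(f"p-value: {p_value:.4f}")    # a small p-value suggests a real difference
</code></pre>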
<p>Given a sample size and a click-through rate difference (<span class="math">\(|P(O|A)-P(O|B)|\)</span>),
two "confidence-related" criteria can be associated with a test:</p>
<ul>
<li>Its <em>level of significance</em>: the probability of correctly concluding that
there is no difference (when there is in fact none).</li>
<li>Its <em>power</em>: the probability of correctly concluding that there is a
difference (when there is in fact one).</li>
</ul>
<p>Increasing either the sample size or <span class="math">\(|P(O|A)-P(O|B)|\)</span> lets us improve both the
level of significance and the power. At a fixed sample size and difference, however,
increasing the power decreases the significance and vice-versa.</p>
<p>As we just said, given a sample size and a difference, a higher significance
should reduce the power of the method. What we want to assess here is
which method exhibits the highest power given a fixed significance. This is
what the following figure shows.</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
We should see that the A/B testing method provides a better power (i.e.
confidence) than the multi-armed bandit strategy in the whole range of
sample sizes (20 -> 10000).
" src="/images/power_vs_sample_size.svg"/>
</center></p>
<p>We observe that the power of the A/B testing method is always above the
power provided by the multi-armed bandit strategy. This is actually caused by
the different sampling methods: it is usually more efficient to sample the
events equally than to favor one particular event.</p>
<p>In this case, we would have to gather around <strong>5 times as many</strong>
trials with the multi-armed bandit strategy to get the same confidence in
our results! Conversely, if we are happy
with that lower confidence, we could reduce the sample
size by as much by using A/B testing.</p>
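<p>As a rough illustration of how such a power curve can be obtained (this is not the
code behind the figure above), the power at a fixed significance level can be estimated
by simulation for the equal-allocation scheme; the rates, threshold and number of
repetitions below are arbitrary choices, and the multi-armed bandit curve would be
computed the same way but with the data gathered by its own allocation rule:</p>
<pre><code>
# Minimal sketch: estimating the power of Fisher's exact test at a fixed
# significance level (alpha) by simulation, for 50/50 (A/B) sampling.
import numpy as np
from scipy.stats import fisher_exact

def estimated_power(p_a, p_b, sample_size, alpha=0.05, repetitions=2000, seed=0):
    # Fraction of simulated experiments in which the test detects the difference.
    rng = np.random.default_rng(seed)
    n_a = sample_size // 2          # equal allocation between the two variants
    n_b = sample_size - n_a
    detected = 0
    for _ in range(repetitions):
        clicks_a = rng.binomial(n_a, p_a)
        clicks_b = rng.binomial(n_b, p_b)
        table = [[clicks_a, n_a - clicks_a],
                 [clicks_b, n_b - clicks_b]]
        _, p_value = fisher_exact(table)
        if p_value &lt;= alpha:
            detected += 1
    return detected / repetitions

print(estimated_power(p_a=0.10, p_b=0.15, sample_size=1000))
</code></pre>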
<h1 id="ab-testing-chooses-the-correct-variant-more-often">A/B testing chooses the correct variant more often</h1>
<p>To be fair, it is important to understand that these significance/power
values only concern our <em>confidence</em> in the results, as assessed
through the contingency test. In particular, they
do <em>not</em> represent how often the best variant is correctly chosen
by each method.</p>
<p>That said, such a number can easily be calculated. This time, we will not
apply any statistical test. Instead, we are going to estimate the ratio of
good guesses (i.e. choosing the event providing the highest click-through rate
P(O|X)) obtained through the data-gathering stage of A/B testing
and through the multi-armed bandit strategy.</p>
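<p>A minimal sketch of this estimation follows; an epsilon-greedy rule stands in for
the multi-armed bandit strategy, and the rates, sample size and number of runs are
illustrative, so this is not the exact code used in this series:</p>
<pre><code>
# Minimal sketch: how often each data-gathering scheme picks the variant
# with the higher true click-through rate.
import numpy as np

rng = np.random.default_rng(0)
p = {"A": 0.10, "B": 0.15}                     # true rates (illustrative)

def run_ab(sample_size):
    # Equal allocation: half of the trials go to each variant.
    n = sample_size // 2
    rates = {v: rng.binomial(n, p[v]) / n for v in p}
    return max(rates, key=rates.get)

def run_bandit(sample_size, epsilon=0.1):
    # Epsilon-greedy stand-in for the multi-armed bandit strategy.
    clicks = {"A": 0, "B": 0}
    trials = {"A": 0, "B": 0}
    for _ in range(sample_size):
        if trials["A"] == 0 or trials["B"] == 0 or rng.random() &lt; epsilon:
            variant = rng.choice(["A", "B"])                        # explore
        else:
            variant = max(p, key=lambda v: clicks[v] / trials[v])   # exploit
        trials[variant] += 1
        clicks[variant] += rng.binomial(1, p[variant])
    rates = {v: clicks[v] / max(trials[v], 1) for v in p}
    return max(rates, key=rates.get)

runs = 2000
for name, run in (("A/B", run_ab), ("bandit", run_bandit)):
    correct = sum(run(300) == "B" for _ in range(runs))   # "B" is the better variant
    print(name, correct / runs)
</code></pre>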
<p><center>
<img alt="Here should be a figure (try with another web browser).
We should see that the A/B testing guesses are correct more often
for smaller sample sizes (here, 40 -> 3000). At most, A/B testing
provides 10% more correct guesses.
" src="/images/correct_vs_sample_size.svg"/>
</center></p>
<p>We can see that the A/B testing method does provide better guesses at any
sample size. For instance, only <span class="math">\(300\)</span> trials are
required to identify that <span class="math">\(5\%\)</span> difference in click-through rates with
A/B testing (being correct <span class="math">\(9\)</span> times out
of <span class="math">\(10\)</span>). The exact same confidence would require a sample size of
around <span class="math">\(600\)</span> trials in the multi-armed bandit framework, or <strong>twice as many</strong>.</p>
<p>This confirms that comparing A/B testing with the multi-armed bandit method at
constant sample size, as was done in the original blog post,
is clearly misleading: the sample size must be adjusted to get
the same confidence guarantees from both methods; otherwise, we are comparing
apples to oranges.</p>
<h1 id="ab-testing-is-better-at-quantifying-the-difference-in-click-through-rates">A/B testing is better at quantifying the difference in click-through rates</h1>
<p>Estimating the difference <span class="math">\(\tilde{P}(O|B) - \tilde{P}(O|A)\)</span>
is important because this is what we ultimately use to determine which version
performs best. It is also a great way to prevent the introduction of
a few biases (like some sort of hidden coupling between <span class="math">\(\tilde{P}(O|A)\)</span> and
<span class="math">\(\tilde{P}(O|B)\)</span>).</p>
<p>Until now, we only simulated Bernoulli processes: we "repeated" the same
observation multiple times and got series of true/false values (be it to
determine whether a user clicked on the button or whether a given method
correctly identified the best variant).
Each time, we estimated the uncertainties by resorting to the Beta
distribution, which we know describes the outcome of such processes.</p>
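<p>For reference, here is a minimal sketch of such a Beta-based uncertainty estimate,
assuming a uniform Beta(1, 1) prior (which may differ from the prior actually used
earlier in this series); the counts are illustrative:</p>
<pre><code>
# Minimal sketch: Beta-distribution uncertainty for a Bernoulli process.
from scipy.stats import beta

clicks, trials = 110, 1000
posterior = beta(clicks + 1, trials - clicks + 1)   # Beta(1, 1) prior assumed

low, high = posterior.interval(0.95)                # central 95% credible interval
print(f"estimated rate: {posterior.mean():.3f} ({low:.3f} - {high:.3f})")
</code></pre>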
<p>Here we are going a step further to actually estimate the distribution
of the difference <span class="math">\(\tilde{P}(O|B) - \tilde{P}(O|A)\)</span>. To do that, we need to
choose a sample size, calculate
<span class="math">\(\tilde{P}(O|A)\)</span> and <span class="math">\(\tilde{P}(O|B)\)</span> and finally perform the subtraction.</p>
<p>This whole process is repeated <span class="math">\(M\)</span> times, and the actual
distribution can be estimated either by calculating a histogram of the values
or by using a more advanced method. We chose
<a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">kernel density estimation</a>
with a Gaussian kernel, which behaves better than a histogram.</p>
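<p>A minimal sketch of this repeated-subtraction procedure for the equal-allocation
case could look as follows (the rates, sample size and number of repetitions are
illustrative choices):</p>
<pre><code>
# Minimal sketch: distribution of P~(O|B) - P~(O|A), estimated by repeating
# the experiment M times and smoothing with a Gaussian kernel density estimate.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
p_a, p_b, sample_size, M = 0.10, 0.15, 1000, 5000

n = sample_size // 2                           # equal (A/B) allocation
differences = (rng.binomial(n, p_b, size=M) - rng.binomial(n, p_a, size=M)) / n

density = gaussian_kde(differences)            # smooth estimate of the distribution
xs = np.linspace(differences.min(), differences.max(), 200)
print(xs[np.argmax(density(xs))])              # mode of the estimate, close to 0.05
</code></pre>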
<p><center>
<img alt="Here should be a figure (try with another web browser).
It shows the estimated distributions of the difference between the click-through rates
(\(\tilde{P}(O|B) - \tilde{P}(O|A)\)). Both distributions are roughly Gaussian-shaped,
and the one obtained with A/B testing is narrower than the multi-armed bandit one.
" src="/images/dist_difference_vs_sample_size.svg"/>
</center></p>
<p>We can see that the distributions of values are nearly Gaussian-shaped
(i.e. normally distributed).
The A/B testing data-gathering process does provide a narrower distribution
than the multi-armed bandit method.</p>
<p>This is in total accordance with the
earlier claim that it is more efficient to sample each event equally rather than
to favor one. This is also why the power of the contingency test applied to
the multi-armed bandit method is so much worse: a wider spread means that the
differences need to be bigger for the test to succeed.</p>
<p>Because the data are normally distributed, we can reduce these distributions
to an expected value (the arithmetic mean) and a spread indicator (derived from
the standard deviation). This is particularly useful for representing the expected
difference as a function of the sample size:</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
We can see the average of the difference between the click-through rates
(\(\tilde{P}(O|B) - \tilde{P}(O|A)\)). We can observe that the A/B method is
the fastest to converge to the expected value (\(5\%\)).
" src="/images/difference_vs_sample_size.svg"/>
</center></p>
<p>We observe that the A/B method converges towards the expected
<span class="math">\(P(O|B) - P(O|A) = 5\%\)</span> much more quickly than the multi-armed bandit strategy
does. For instance, with <span class="math">\(1000\)</span> trials, the A/B testing method provided
a precision of <span class="math">\(2\%\)</span>, while the multi-armed bandit method provided
a precision of <span class="math">\(20\%\)</span>.</p>
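<p>For reference, the mean-and-spread reduction described above can be sketched as
follows for the equal-allocation case (again with illustrative values):</p>
<pre><code>
# Minimal sketch: reducing the simulated differences to an expected value
# (mean) and a spread (standard deviation) for several sample sizes.
import numpy as np

rng = np.random.default_rng(0)
p_a, p_b, M = 0.10, 0.15, 5000

for sample_size in (100, 300, 1000, 3000):
    n = sample_size // 2
    diffs = (rng.binomial(n, p_b, size=M) - rng.binomial(n, p_a, size=M)) / n
    print(sample_size, f"mean={diffs.mean():+.4f}", f"std={diffs.std():.4f}")
</code></pre>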
<p>This tends to prove that the A/B testing data-gathering process is more
efficient at determining the actual difference. This result is in complete
accordance with the fact that A/B testing also seems more efficient at
determining which variant is better: if the estimate of </p>
<div class="math">$$P(O|B) - P(O|A)$$</div>
<p> is more accurate,
then deciding whether </p>
<div class="math">$$P(O|B) > P(O|A) \equiv P(O|B) - P(O|A) > 0$$</div>
<p>
should also be more reliable.</p>
<p>It is not clear whether the remaining difference at higher sample sizes
would totally disappear:
we did the same calculation for larger sample sizes and the difference seems to
remain fairly constant (if that is really the case, it means that the
underlying distribution is skewed).</p>
<h1 id="ab-testing-provides-non-skewed-results">A/B testing provides non-skewed results</h1>
<p>At this point, it is not clear whether or not the multi-armed bandit strategy
skews the results. Proving whether it does would not be easy
without resorting to algebra. At the same time, the difference that
we observe is small, too small to matter in most cases.</p>
<p>As we want to see the effect of the strategy itself on the distribution of
differences, we should simulate the case where the algorithm has to make
the most decisions: when the two click-through rates are actually
identical.</p>
<p>We performed the same difference calculation as above, but this
time with <span class="math">\(P(O|B) = P(O|A) = 10\%\)</span>, and we obtained the following
distributions:</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
It shows two probability distribution functions. The one associated with
A/B testing is Gaussian-shaped and centered on 0. The one associated with
the multi-armed bandit strategy is also centered on 0, but it is not
Gaussian. Some non-zero difference is clearly favored.
" src="/images/dist_zero_difference_vs_sample_size.svg"/>
</center></p>
<p>We see that the distribution arising from the A/B testing data-gathering method
is (roughly) Gaussian-shaped and centered on <span class="math">\(0\%\)</span>. This is what is expected
and is consistent with what we saw above. </p>
<p>The distribution arising from the multi-armed bandit
strategy, however, is clearly skewed: the expected value, <span class="math">\(0\)</span> ("zero"), is not
where the distribution peaks. </p>
<p>The reason why the distribution has this peculiar shape is
that we are not looking at the "correct" distance: we are looking
at <span class="math">\(P(O|B) - P(O|A)\)</span> while we should really be looking at
<span class="math">\(P(O|\text{favored}) - P(O|\text{disfavored})\)</span>, the distance between the favored
and the disfavored event in each trial:</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
It represents the distribution of the absolute values of the distances
for both methods. The skew is really visible in the case of the multi-armed
bandit method (0.4%).
" src="/images/abs_dist_zero_difference_vs_sample_size.svg"/>
</center></p>
<p>We can observe that there is no skew in A/B testing: the maximal value is
indeed reached at <span class="math">\(0\%\)</span>. The multi-armed bandit strategy tends to <strong>introduce</strong>
a non-existent difference of <span class="math">\(0.4\%\)</span> in this particular case. This is clear
proof that the data are skewed by the method.</p>
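<p>For concreteness, here is a minimal sketch of that "favored minus disfavored"
distance computed from the hypothetical counts of a single bandit run:</p>
<pre><code>
# Minimal sketch: the "favored minus disfavored" distance for one run.
# The per-variant counts are hypothetical; the favored variant is simply
# the one that received the most trials.
trials = {"A": 820, "B": 180}
clicks = {"A": 85, "B": 17}

favored = max(trials, key=trials.get)
disfavored = min(trials, key=trials.get)
distance = clicks[favored] / trials[favored] - clicks[disfavored] / trials[disfavored]
print(distance)   # positive on average over many runs, which is the skew
</code></pre>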
<p>Where does it come from? Think of it this way: the problem is that <strong>we</strong>
choose <strong>when</strong> to change <span class="math">\(P(X)\)</span> <strong>using</strong> the results themselves!</p>
<p>Likewise, we could toss a coin and decide to toss it a second time
<em>if, and only if,</em> we get a head. Nobody would be the least surprised
to obtain more tails than heads in such an experiment.
In contrast, if we randomly decided to toss that coin again without
looking at the first result, we would expect to get as many heads as tails.
The same kind of problem is at work here, only more subtly:
the algorithm actually <em>favors</em> <strong>underestimated</strong> probabilities for
the disfavored events.</p>
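<p>A minimal simulation of that coin experiment (not taken from the original post)
makes the bias visible:</p>
<pre><code>
# Minimal sketch of the coin analogy: toss once, and toss a second time only
# if the first toss was a head. Because the decision to keep sampling depends
# on the outcome, the per-experiment fraction of heads is biased below 1/2.
import random

random.seed(0)
fractions = []
for _ in range(100_000):
    tosses = [random.random() &lt; 0.5]            # True means "head"
    if tosses[0]:                                # toss again only after a head
        tosses.append(random.random() &lt; 0.5)
    fractions.append(sum(tosses) / len(tosses))

print(sum(fractions) / len(fractions))           # about 0.375 rather than 0.5
</code></pre>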
<p>So, the multi-armed bandit strategy <em>does</em> skew the data
(contrary to yet another unproven claim of the
<a href="http://stevehanov.ca/blog/index.php?id=132">other blog post</a>).
This can have serious implications, especially if we need to determine
whether there is a difference or not (as in clinical trials) or if we need
to quantify these differences (for instance to perform multivariate analysis).</p>
<p><em>At this point, we have seen many of the advantages of the A/B testing method
over the multi-armed bandit strategy.
In particular, we proved that the multi-armed bandit method needs a bigger sample
size to provide the same guarantees as A/B testing (be it to exhibit the same
power in contingency tests, to correctly choose the best site variant
or to give an accurate result). We also showed how the multi-armed bandit
strategy does skew the results (especially when there is no difference between
the click-through rates to begin with).</em></p>
<p><em>We could conclude that comparisons at the same sample size are not fair:
the methods should be compared at the same level of guarantees. Likewise, the
multi-armed bandit method should <strong>not</strong> be used in cases where determining
whether there is a difference or not matters (especially in drug trials, contrary
to what is recommended in the other blog post).</em></p>
<p><em>Would that be enough to completely ditch the multi-armed bandit strategy?
Not at all: the multi-armed bandit method has been designed to maximize
the reward. This is what we are going to investigate in the
<a href="{filename}07-maximizing_reward.md">next post</a>.</em></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
var location_protocol = (false) ? 'https' : document.location.protocol;
if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
</div><!-- /.entry-content -->
</article>
</section>
<section id="extras" class="body">
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="http://getpelican.com/">Pelican</a>, which takes great advantage of <a href="http://python.org">Python</a>.
</address><!-- /#about -->
<p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
</footer><!-- /#contentinfo -->
</body>
</html>