-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.html
162 lines (121 loc) · 16.5 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
<!DOCTYPE html>
<html>
<head>
<meta charset='utf-8'>
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<link href='https://fonts.googleapis.com/css?family=Architects+Daughter' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" href="stylesheets/stylesheet.css" media="screen">
<link rel="stylesheet" type="text/css" href="stylesheets/pygment_trac.css" media="screen">
<link rel="stylesheet" type="text/css" href="stylesheets/print.css" media="print">
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js"></script>
<script type="text/javascript" src="https://www.google.com/jsapi"></script>
<script src="http://code.highcharts.com/highcharts.js"></script>
<script src="http://code.highcharts.com/highcharts-more.js"></script>
<script type="text/javascript" src="javascripts/drawWPSChart.js"></script>
<script type="text/javascript" src="javascripts/drawBoxPlot.js"></script>
<script type="text/javascript" src="javascripts/drawStackedPercentPlot.js"></script>
<script type="text/javascript" src="javascripts/drawLineGraph.js"></script>
<script type="text/javascript" src="javascripts/main.js"></script>
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<title>Against All Odds: An Examination of Bias in Sports Betting by pbang55</title>
</head>
<body>
<header>
<div class="inner">
<h1>Against All Odds: An Examination of Bias in Sports Betting</h1>
<h2>Benjamin Zhou, Howard Zhang, Kuriakin Zeng, and Peter Bang</h2>
<a href="https://github.com/pbang55/nba-betting" class="button"><small>View project on</small> GitHub</a>
</div>
</header>
<div id="content-wrapper">
<div class="inner clearfix">
<section id="main-content">
<h3>Background</h3>
<p>All of us in the group are avid sports fans, and we would like to explore the sports betting market to see if it is efficient, or if there is an opportunity to profit from betting in a certain pattern consistently. It’s a market that is oftentimes invisible, but a fascinating one nonetheless. According to ESPN, “more than $3.6 billion was wagered on sports at Nevada sports books in 2013” while estimates for illegal betting tops <strong>$380 billion</strong>. This topic is especially of interest after recent pushes for legalization in the US, which is currently only legal in Nevada and through offshore websites. Recent supporters include Adam Silver, the commissioner of the NBA, calling for the federal legalization of sports betting in the US as a common sense solution given the amount of illicit betting already at play and the potential for sports betting as a budgetary aid. This stands in contrast to similar systems in Europe, where sports betting in legal and volumes are significantly higher than comparable sports in the U.S.</p>
<p>Given the potential of sports betting in the U.S. and our collective interest in the topic, we want to explore the relationship between the structures and incentives of sports betting markets, comparing the efficiency of this market relative to financial markets, and observe where inefficiencies might arise. One piece of research of particular interest to us were the studies of Steven Levitt, who argues in a 2004 paper that bookmakers “do not play the traditional role of market makers matching buyers and sellers but, rather, take large positions with respect to the outcome of game” and bias the odds against favorites. In our project, we want to dive deeper into this hypothesis as well as take a look at what other factors might come into play regarding systematic over/undervaluation in placing odds in the world of sports betting.</p>
<h3>The Problem</h3>
<p>In both NBA and NFL betting, the odds are given in the form of a spread. For instance, when the Los Angeles Lakers plays the Cleveland Cavaliers, the spread from the Lakers’ perspective is the points that they score minus the points that the Cavaliers score. Theoretically, if betting spreads were perfect predictors of the outcome of each game, then we would expect any given team to beat the spread with 50% probability on any given night, and it would be impossible to consistently profit from betting on sports games. However, we wanted to see if we would be able to find a strategy that allows us to pick a team for each game such that the probability that our chosen team wins on any given night is better than 50%.</p>
<p>However, in reality, just being better than 50% is not enough because the odds makers take a small commission for every bet. So, how correct do we have to be if we want to profit from sports betting? If the odds maker takes 9% as commission (<a href="http://sports.bovada.lv/sports-betting/nba-basketball-lines.jsp">source</a>), then over the course of <em>n</em> bets, we need be correct with probability <em>p</em> to break even:</p>
<pre>
p * n * (1.91 * bet-size) = n * bet-size
p = 52.4 %
</pre>
<p>Therefore, in order for us to profit from sports betting, we need to win at least 52.4% of the time.</p>
<h3>Betting against the favored team.</h3>
<p>First, let's look at a simple strategy of betting against the favored team in both the NBA and the NFL.</p>
<div id="nflWPSChart"></div>
<div id="nbaWPSChart"></div>
<p>In the NFL, this strategy does not seem to be successful at all, which was to be expected. However, we see that in the NBA, this simple strategy of consistency betting against the favored team would be profitable if there was no commission taken. This is consistent with what has been noted in literature already. Research by Simmons and Nelson (2006) show that people often choose intuitively between two alternatives. “When people decide how to bet on a game, first they identify who is going to win,” Nelson said. That decision is often fast and easy, particularly when teams are not evenly matched. “The faster and easier it is, the less concerned they are with correcting that intuition when answering the more difficult question of whether the favorite is going to beat the point spread.”</p>
<p>We see that although the simple strategy described above would work in a world where odds makers do not take commission, it is not successful enough (not above 52.4% consistently) to beat a more realistic market. Therefore, we look for another strategy.</p>
<h3>Variables that may cause systematic biases</h3>
<p>Odds makers do not simply predict the outcome of the games. They also take into account how they expect people will bet. For example, let us say that team A and team B are about to play a game, and analysis based on in-game statistics shows that team A is likely to win by an average margin of 5.5. However, if team A is on a long winning streak, or if team B is on a long losing streak, or if team A is much more popular than team B, then we may expect a greater volume of bets for team A, encouraging the odds makers to set the margin to something more like 8.5 in order to gain a larger profit. We decided to look at a few characteristics which we thought may provide us information about this "unreasonableness" in how people bet. The following list includes the features we looked at:</p>
<ul>
<li>Population of city</li>
<li>Win streak of favorite team (negative if losing streak)</li>
<li>Win streak of underdog (negative if losing streak)</li>
<li>ESPN Weekly Power Rankings</li>
<li>Ticket Prices</li>
<li>Home Attendance </li>
<li>Home Attendance/Home Capacity </li>
<li>Home Advantage</li>
<li>Team Value (how much the team could be sold for)</li>
</ul>
<h3>Using random forests</h3>
<p>Using the features listed above, we constructed random forests to help predict the outcomes of both NBA and NFL games. Below are the win percentages of the picks made by the random forests using stratefied 10-Fold Cross-Validation. Below are the graphs for 2007 (first year for which full data was available) and for 2013 (most recent year). The red horizontal line at 52.4 percent indicates the win rate required to break even, as calculated above.</p>
<div id="nflBoxPlot2008"></div>
<div id="nflBoxPlot2013"></div>
<p>In the NFL, we see that the random forests do not seem to provide us an edge in estimating when bias occurs.</p>
<div id="nbaBoxPlot2007"></div>
<div id="nbaBoxPlot2013"></div>
<p>In the NBA, we see that the random forests do seem to provide a significant edge in estimating when bias occurs. Using the features we selected, we can achieve a win percentage of more than 60 percent when using random forests with more than 5 trees in 2007. In 2013, this effect is smaller compared to that of 2007, but the effect is still significant.</p>
<p>Now that we have seen that our random forests performs at well above 52.4 percent, we wanted to see which features contributed most to this result. Below is what we found:</p>
<div id="nbaStackedPercentPlot"></div>
<p>As seen above, each feature except <em>favoriteIsHome</em> had relatively similar importance in the random trees.</p>
<p>However, we must take the above result with a grain of salt. If observation points are correlated with each other, then stratified K-fold cross-validation may not reveal accurate scores. In other words, as we have time-series data, leaving out an observation does not remove all the associated information due to the correlations with other observations. </p>
<p>To test if this is the case, for each year we sort the games by date and select the training set as the first 2/3 of games. The test set is then the last 1/3 of games. After fitting on the training set and predicting on the test set, we obtain a mean accuracy score from random forests using a range of 5 to 20 trees. We also run the analysis on several different max depths, ranging from 5 to 10.</p>
<div id="nbaLineGraph"></div>
<p>Alas, the accuracy decreases substantially, and we are only able to make money in years 2007 and 2009 and perform above 50 percent in years 2007 to 2009 and 2013. This is more fitting of our expectations, as we would have been very surprised to find that the sports betting market was not efficient.</p>
<h3>Regression approach</h3>
<p>We initially started off using a linear regression model, team fixed effects (indicator variables for each team), and a logistic regression. We optimized the selection of coefficients according to adjusted R2 and BIC, which take into account both the accuracy of the predictions as well as the number of terms included. For 2013, using a linear regression showed that the relative value of the favorite team to the non-favorite team was the only variable significant at the 5% level. Relative value had a negative coefficient of -0.1282. A negative coefficient decreases the probability that WPS=1, which means that the favorite team did not beat the spread. Hence the betting spread is biased against the favorite. Nevertheless, our R^2 was only 0.009 or 0.9%. If we choose to bet on the favorite team when the probability is greater than 0.5 and on the non-favorite when the probability is less than 0.5, then we have a 49.00% accuracy rate for 2013. As the mean of WPS is 0.492, this is not a worse model then always betting for the non-favorite team, which would lead us to 50.8% win rate, and betting for the non-favorite team, which would lead us to a 49.2% win rate.</p>
<p>This coefficient was also significant when we included team fixed effects but disappeared in the logistic regression. In the logistic regression, the only significant variable was the relative population sizes. However, the coefficient is not directly interpretable due to the nonlinearity of the model.</p>
<h3>Conclusion</h3>
<p>The variables do not seem to offer any predictive power of the bias. This can be caused by two reasons:</p>
<ol>
<li>The factors are not considered by odds makers and do not affect the final game result</li>
<li>The factors affect the game results and the odds makers almost perfectly take them into account It seems more likely that the first is the correct interpretation.</li>
</ol>
<p>Our most successful technique was random forests, but we were still not able to beat 52.4 percent. Hence looking forward to beat the odds we may have to use in-game data. This is currently a significant area of exploration (and one currently active in the marchine learning literature) that uses increasing available advanced in game data to try to predict the outcome of games. (<a href="http://www.seas.upenn.edu/~gberta/uploads/3/1/4/8/31486883/">source</a>)</p>
<h3>Sources of data</h3>
<strong>NBA Data Scraping</strong>
<p>We scraped the odds (point spread) and the actual scores for each NBA match that took place between 1990 and 2015 from covers.com. The dataset has about 60,000 rows and six columns. </p>
<p>We thought that the winning/losing streaks of a team during the season could influence the betting odds, so we wrangled the data to come up with the metric. </p>
<p>We also hypothesized that metrics such as a team’s performance during the season would results in some bias on the odds, so we scraped the weekly ranking of each NBA team between 2002 and 2014 from espn.go.com. To do that, we needed to find out which week of the year is the start of the NBA season, so we scraped the information from basketball-reference.com, which results in just a 13 rows containing the start date of each season. The ranking data from espn.go.com was given in a JavaScript object format (not JSON), so we had to perform more data wrangling to extract the information we needed. The ranking dataset has 15,000 rows and one column.</p>
<p>NBA attendance data and capacity were obtained from a csv located on the website of Clay Cressler originally used by him to determine what teams were most susceptible to relocation (<a href="http://cressler.weebly.com/blog/nba-contraction-which-teams-should-be-on-the-chopping-block">source</a>).</p>
<p>Population data was downloaded using available census data, using the population of the cities in which our NBA and NFL teams were located or situated near.</p>
<p>NBA and NFL team revenue data were scraped from the Annual Forbes rankings for teams using modified R code originally written by Aragorn Technologies. We manually matched the team names used in these datasets with the team names that we use in our main NBA and NFL dataframes.</p>
<strong>NFL Data Scraping</strong>
<p>Similarly for NFL, we scraped the odds and actual scores in each match that took place between 1999 and 2015 from covers.com, and calculated each team’s winning streak. The resultant dataset has 5,219 rows and six columns. We also scraped the weekly ranking of each NBA team between 2002 and 2014 from espn.go.com. The ranking dataset has 1,849 rows and one column. We hypothesized that a team’s popularity could also affect the odds, so we scrape the attendance in each match from pro-football-reference.com. This dataset has 2,530 rows and one column.</p>
<strong>Soccer Scraping</strong>
<p>Soccer data was scraped for our analysis from http://www.football-data.co.uk/, a UK website that consolidates betting data and refers users to betting sites. In total, we scraped data on roughly 25,000 matches played within top leagues in England, France, Germany, Spain, and Italy, including betting odds, home/away distinctions, and final scores. </p>
<p>Soccer revenue data was scraped from historical data on Statto, which we used as a proxy for betting volume within a specific league</p>
<h3>Transformation of Data</h3>
<p>Team valuations are lagged by 1 year while ESPN Power Rankings are lagged by 1 week. Note that I transform the tricket price, population, value, average home attendance, and percentage attendance data by calculating the relative value between the favorite team and non-favorite team. I achieve this by dividing the favorite team observation by the non-favorite team observation. This is to account for relative differences between teams. This also helps with adjusting for inflation to allow us to draw comparisons between coefficient values among different years.</p>
</section>
<aside id="sidebar">
<a href="https://github.com/pbang55/nba-betting/blob/gh-pages/zeng_k_against_all_odds.zip?raw=true" class="button">
<small>Download</small>
notebooks
</a>
<a href="https://www.youtube.com/watch?v=s9AMxU4rjq0" class="button">
<small>Watch</small>
Screencast
</a>
<p class="repo-owner"><a href="https://github.com/pbang55/nba-betting"></a> is maintained by <a href="https://github.com/pbang55">pbang55</a>.</p>
</aside>
</div>
</div>
</body>
</html>