Bachelor Thesis: The Correlation between Individual Talent and Team Achievement: A data-driven approach
Author: Giorgio Hajimichael
Supervisor: Dr. Ioannis Katakis
This repository contains the Python scripts written to perform the calculations needed to assess the relationship between talent and achievement in football and basketball, as described in the Abstract section below. It also contains the input files for the football and basketball programs, as well as the output files generated by these two programs.
-
NBAAnalyzer.py - The Python script written to calculate the Pearson correlation coefficient between talent position and league position in basketball, specifically in the NBA Regular Season from 2010/2011 to 2018/2019. It takes basketball-player-data.csv and basketball-table-data.csv as input. It outputs two files, one with fewer fields (basketball-truncated-results.csv) and one with more fields (basketball-extended-results.csv).
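As a minimal sketch of the statistic this script reports, the Pearson correlation coefficient can be computed with the standard library alone. The function name `pearson` and the sample positions are illustrative assumptions, not code or data taken from NBAAnalyzer.py.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical talent and league positions for five teams (not thesis data).
talent_position = [1, 2, 3, 4, 5]
league_position = [2, 1, 3, 5, 4]
r = pearson(talent_position, league_position)
print(round(r, 3))
```

A value close to 1 means that teams ranked highly on talent also tend to rank highly in the league.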
-
basketball-player-data.csv - This input file contains data that was initially downloaded from Basketball-Reference and then merged across seasons with only the necessary fields kept. For each season, data was downloaded from the "NBA Player Stats: Advanced" section on Basketball-Reference - for example, for the 2018/2019 season, it was downloaded from 2018-19 NBA Player Stats: Advanced. The file contains the following fields:
- UID - Uniquely identifies players numerically.
- Player - The first and last name of each player as it appears on Basketball-Reference. This field also contains an extra alphanumeric unique identifier.
- Tm - The team that each player represents in a given season, abbreviated.
- MP - The minutes played by each player in a given season for a given team.
- PER - The Player Efficiency Rating of each player in a given season for a given team.
- Season - The season that the rest of the data relates to.
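The field list above can be read with Python's `csv` module, as in the following sketch. It assumes that the file has a header row using exactly these field names and that seasons are written like `2018/2019`; the sample rows and the `load_players` helper are hypothetical and not taken from the repository.

```python
import csv
import io

# Made-up sample in the layout described above (not real player data).
SAMPLE = """UID,Player,Tm,MP,PER,Season
1,Example Player,AAA,2100,24.5,2018/2019
2,Another Player,BBB,150,9.8,2018/2019
"""


def load_players(handle):
    """Parse player rows, converting the numeric fields from text."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "uid": int(row["UID"]),
            "player": row["Player"],
            "team": row["Tm"],
            "minutes": int(row["MP"]),
            "per": float(row["PER"]),
            "season": row["Season"],
        })
    return rows


players = load_players(io.StringIO(SAMPLE))
for player in players:
    print(player["player"], player["team"], player["minutes"], player["per"])
```

To read the real file, the `io.StringIO(SAMPLE)` argument would be replaced with a handle opened on basketball-player-data.csv using `newline=""`.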
-
basketball-table-data.csv - This input file also contains data that was initially downloaded from Basketball-Reference and then merged across seasons with only the necessary fields kept. For each season, data was downloaded from the "NBA Standings" section on Basketball-Reference - for example, for the 2018/2019 season, it was downloaded from 2018-19 NBA Standings. The table used is the "Expanded Standings", which covers teams from both the Eastern and Western Conferences. The file contains the following fields:
- Season - The season that the rest of the data relates to.
- Rank - The team's position in the NBA Regular Season Expanded Standings in a given season.
- Team - The abbreviated name of each team.
-
basketball-extended-results.csv - This output file can be used to investigate the relationship between talent and achievement in basketball. It contains the following fields:
- season - The season that the rest of the data relates to.
- team - The abbreviated name of each team.
- league_position - The team's position in the NBA Regular Season Expanded Standings in a given season.
- elites - The number of identified elite players (top 15% PER) for a given team in a given season.
- non-elites - The number of identified non-elite players (bottom 75% PER) for a given team in a given season.
- missing - The number of missing players for a given team in a given season. A missing player is a player who did not meet the minimum minutes requirement in order to be considered for calculations in the previous season (for example, a player new to the league).
- excluded - The number of excluded players for a given team in a given season. An excluded player is a player who did not meet the minimum minutes requirement in order to be considered for calculations in the current season (for example, a player who played for 200 minutes across the whole season).
- talent_position - The calculated talent rank for a given team compared to the other teams in that season. The criteria for talent are the following, with each subsequent criterion considered only in case of a draw: number of elite talents, higher collective elite talent, fewer missing players, higher collective non-elite talent.
- elite_collective_talent - The sum of each elite player's PER for each team in a given season.
- non-elite_collective_talent - The sum of each non-elite player's PER for each team in a given season.
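The tie-breaking order described for talent_position can be expressed as a single sort key. The sketch below is one possible reading, assuming that more elite players means a better talent rank; the `rank_by_talent` helper and the four sample teams are hypothetical and do not come from NBAAnalyzer.py.

```python
def rank_by_talent(teams):
    """Return copies of the team records with a 1-based talent_position added."""
    ordered = sorted(
        teams,
        key=lambda t: (
            -t["elites"],                       # 1. more elite players first (assumed)
            -t["elite_collective_talent"],      # 2. higher collective elite talent
            t["missing"],                       # 3. fewer missing players
            -t["non-elite_collective_talent"],  # 4. higher collective non-elite talent
        ),
    )
    return [dict(team, talent_position=pos) for pos, team in enumerate(ordered, start=1)]


# Made-up teams for one season (not thesis data).
teams = [
    {"team": "A", "elites": 2, "elite_collective_talent": 45.0,
     "missing": 1, "non-elite_collective_talent": 80.0},
    {"team": "B", "elites": 2, "elite_collective_talent": 45.0,
     "missing": 0, "non-elite_collective_talent": 70.0},
    {"team": "C", "elites": 3, "elite_collective_talent": 60.0,
     "missing": 2, "non-elite_collective_talent": 50.0},
    {"team": "D", "elites": 2, "elite_collective_talent": 47.5,
     "missing": 3, "non-elite_collective_talent": 90.0},
]

ranked = rank_by_talent(teams)
for team in ranked:
    print(team["talent_position"], team["team"])
```

Because Python compares tuples element by element, a later criterion only matters when all earlier ones are equal, which matches the "only in case of a draw" rule. Here teams A and B are separated by the third criterion.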
-
basketball-truncated-results.csv - This output file can be used to investigate the relationship between talent and achievement in basketball. It is identical to basketball-extended-results.csv, but keeps only the fields season, team, elites, league_position, and talent_position. It was designed for easier viewing.
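The relationship between the two output files can be illustrated by projecting extended rows onto the five truncated fields. This is a sketch only: the `truncate_rows` helper, the season format, and the sample row are assumptions, and the actual script may build the two files differently.

```python
import csv
import io

TRUNCATED_FIELDS = ["season", "team", "elites", "league_position", "talent_position"]


def truncate_rows(extended_rows):
    """Keep only the truncated-file fields from each extended row."""
    return [{field: row[field] for field in TRUNCATED_FIELDS} for row in extended_rows]


# One made-up extended row (hypothetical values, not thesis results).
extended_rows = [{
    "season": "2018/2019", "team": "AAA", "league_position": 3,
    "elites": 2, "non-elites": 6, "missing": 1, "excluded": 4,
    "talent_position": 5, "elite_collective_talent": 45.0,
    "non-elite_collective_talent": 80.0,
}]

# Write the truncated rows to an in-memory CSV instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=TRUNCATED_FIELDS)
writer.writeheader()
writer.writerows(truncate_rows(extended_rows))
truncated_csv = buffer.getvalue()
print(truncated_csv)
```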
-
PLAnalyzer.py - The Python script written to calculate the Pearson correlation coefficient between talent position and league position in football, specifically in the Premier League from 2010/2011 to 2018/2019. It takes football-player-data.csv and football-table-data.csv as input. It outputs two files, one with fewer fields (football-truncated-results.csv) and one with more fields (football-extended-results.csv).
-
WhoScoredScraper.py - This script is mostly the work of Stack Overflow users jchadwick92 and crookedleaf. It was found in a Stack Overflow question and modified slightly to fit the needs of the thesis. It uses Selenium, a browser automation tool, to scrape player data from WhoScored. In the version uploaded to this GitHub repository, it is configured to scrape data for the 2013/2014 season as an example.
-
football-player-data.csv - This input file contains data that was initially scraped from WhoScored and then merged across seasons with only the necessary fields kept. For each season, data was scraped from the "Premier League Player Statistics" section on WhoScored - for example, for the 2013/2014 season, it was taken from that season's Premier League Player Statistics page. The file contains the following fields:
- player - The first and last name of each player as it appears on WhoScored. In very few cases, players with exactly the same names were manually distinguished by adding a number to the end of their names (e.g. "Danny Ward1" and "Danny Ward2").
- team - The team that each player represents in a given season.
- rating - The WhoScored Rating of each player in a given season for a given team.
- minutes - The minutes played by each player in a given season for a given team.
- season - The season that the rest of the data relates to.
-
football-table-data.csv - This input file contains data that was initially downloaded from the official website of the Premier League and then merged across seasons with only the necessary fields kept. For each season, data was downloaded from the "Tables" section by specifying the season through the "Filter by season" feature. The file contains the following fields:
- season - The season that the rest of the data relates to.
- team - The name of each team.
- position - The team's league position in the Premier League in a given season.
- points - The points accrued by each team in a given season in the Premier League.
-
football-extended-results.csv - This output file can be used to investigate the relationship between talent and achievement in football. It contains the following fields:
- season - The season that the rest of the data relates to.
- team - The name of each team.
- league_position - The team's league position in the Premier League in a given season.
- points - The points accrued by each team in a given season in the Premier League.
- elites - The number of identified elite players (top 15% WhoScored Rating) for a given team in a given season.
- non-elites - The number of identified non-elite players (bottom 75% WhoScored Rating) for a given team in a given season.
- missing - The number of missing players for a given team in a given season. A missing player is a player who did not meet the minimum minutes requirement in order to be considered for calculations in the previous season (for example, a player new to the league).
- excluded - The number of excluded players for a given team in a given season. An excluded player is a player who did not meet the minimum minutes requirement in order to be considered for calculations in the current season (for example, a player who played for 200 minutes across the whole season).
- talent_position - The calculated talent rank for a given team compared to the other teams in that season. The criteria for talent are the following, with each subsequent criterion considered only in case of a draw: number of elite talents, higher collective elite talent, fewer missing players, higher collective non-elite talent.
- elite_collective_talent - The sum of each elite player's WhoScored Rating for each team in a given season.
- non-elite_collective_talent - The sum of each non-elite player's WhoScored Rating for each team in a given season.
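The elite and non-elite labels behind the fields above could be assigned along the following lines. This is a sketch under several assumptions that the field descriptions do not settle: that "top 15%" and "bottom 75%" are shares of the players ranked by WhoScored Rating, that shares are rounded down, that players failing the minimum minutes requirement were filtered out beforehand, and that players between the two groups receive neither label. The `classify_by_rating` helper and the ratings are hypothetical.

```python
def classify_by_rating(ratings, elite_percent=15, non_elite_percent=75):
    """Label each player as 'elite', 'non-elite' or 'unclassified' by rating rank."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)  # best rating first
    n = len(ordered)
    n_elite = n * elite_percent // 100          # size of the top group (rounded down)
    n_non_elite = n * non_elite_percent // 100  # size of the bottom group (rounded down)
    labels = {}
    for index, player in enumerate(ordered):
        if index < n_elite:
            labels[player] = "elite"
        elif index >= n - n_non_elite:
            labels[player] = "non-elite"
        else:
            labels[player] = "unclassified"
    return labels


# Twenty made-up players with distinct ratings from 7.9 down to 6.0.
ratings = {f"p{i:02d}": (80 - i) / 10 for i in range(1, 21)}
labels = classify_by_rating(ratings)

elite_players = [p for p, label in labels.items() if label == "elite"]
print(elite_players)
```

With twenty eligible players, this reading gives three elite players, fifteen non-elite players, and two players in neither group.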
-
football-truncated-results.csv - This output file can be used to investigate the relationship between talent and achievement in football. It is identical to football-extended-results.csv, but keeps only the fields season, team, elites, league_position, and talent_position. It was designed for easier viewing.
Abstract
The purpose of this thesis is to explore the relationship between talent and achievement in both football and basketball, and to assess whether the achievement of teams suffers when they are made up of a high number of elite talents.
To assess the relationship between talent and achievement in football, the correlation between talent positions and the finishing positions of teams in the Premier League was investigated over the nine seasons between 2010/2011 and 2018/2019. Talent positions were calculated primarily by considering the number of elite talents in each team, based on WhoScored Performance Ratings.
Likewise, to assess the relationship between talent and achievement in basketball, the correlation between talent positions and the finishing positions of teams in the NBA Regular Season, in the Expanded Table of their conferences, was investigated over the same nine seasons. Here, talent positions were calculated primarily by considering the number of elite talents in each team, based on Player Efficiency Ratings (PER).
The results indicated that the top 3 most talented football teams per season tended to do best in terms of achievement, while also showing a clear gap in achievement between the top 6 teams and the rest of the league. Meanwhile, the most talented football team per season tended to finish higher in the league than 95% of the remaining teams. A Pearson correlation of 65.7% was found between talent position and league position, indicating a somewhat strong positive correlation between the two.
Furthermore, the results indicated that the top 4 most talented basketball teams per season tended to do best in terms of achievement, while also showing a clear gap in achievement between the top 4 teams and the rest of the league, aside from a single outlier. Meanwhile, the most talented basketball team per season tended to finish higher in the league than 93% of the remaining teams. A Pearson correlation of 57.8% was found between talent position and league position, indicating a positive correlation between the two.
It can be argued that the difference between the Pearson correlation in football (65.7%) and in basketball (57.8%) could be influenced by team size: in basketball, a team can have no more than five players on the court at any given time, whereas in football, it can have no more than eleven on the field. This suggests that a clash between two elite players could be more influential in basketball than in football, because two basketball players account for 40% of the team on the court, whereas two football players account for about 18% of the team on the field.
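The two percentages in the argument above follow from simple division, as this short check shows. The function name `share_of_team` is illustrative only.

```python
def share_of_team(players_involved, players_in_play):
    """Fraction of a team's active players that a group of players represents."""
    return players_involved / players_in_play


basketball_share = share_of_team(2, 5)   # two of five players on the court
football_share = share_of_team(2, 11)    # two of eleven players on the field

print(f"basketball: {basketball_share:.0%}")
print(f"football: {football_share:.0%}")
```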