FanisGl/Video-games-sales-data

Video-games Sales Data Analysis

Tags: [Data Analytics], [Statistics]

Technical Skills: [Python], [Jupyter Notebook], [Pandas], [Numpy], [Matplotlib]

Theoretical Frameworks: [Poisson Distribution]

Note

This exercise was completed with help from the Workearly team during my bootcamp in 2023. They are specialised education professionals who offer extraordinary upskilling bootcamps and more.

I started upskilling with them last February, and thanks to their learning structure I became adept in 6 different tools and used them in a professional environment within only 10 months, including Python for data science, as seen below.

Make sure you check them out if you are interested in any sort of upskilling for you or your team: https://www.workearly.gr/

The dataset was published on Kaggle by the user GregorySmith (gregorut): https://www.kaggle.com/datasets/gregorut/videogamesales

A simple exploratory data analysis (EDA) that answers 7 questions and then tests whether a Poisson distribution can model the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
# Loading the CSV
# r'' (raw string) lets me paste the Windows path as-is, without changing every \ to / or \\
df = pd.read_csv(r'B:\Python Environments\Video Game Data Analysis\Raw Data\vgsales.csv')
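
As an aside on the r'' prefix: a raw string, a string with doubled backslashes, and a forward-slash path should all name the same Windows file. The sketch below uses a shortened, hypothetical path (not the one above) to illustrate this.

```python
from pathlib import PureWindowsPath

# Hypothetical, shortened path used only for illustration.
raw = r'B:\Raw Data\vgsales.csv'        # raw string: backslashes kept literally
escaped = 'B:\\Raw Data\\vgsales.csv'   # normal string: each backslash doubled
forward = 'B:/Raw Data/vgsales.csv'     # forward slashes also work on Windows

same_string = raw == escaped
same_path = PureWindowsPath(raw) == PureWindowsPath(forward)
print(same_string, same_path)
```

Any of the three spellings can be passed to pd.read_csv; the raw string is simply the least editing after a copy/paste.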

Cleaning the Dataset

Dropping N/A and nulls.

df.dropna(inplace=True)
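
Before dropping rows, it can help to see how many values are missing and how many rows the drop removes. This is a minimal sketch on a made-up four-row frame; the game names and values are hypothetical stand-ins for vgsales.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the real frame comes from vgsales.csv.
sample = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Year': [2006.0, np.nan, 2008.0, 2009.0],
    'Publisher': ['Nintendo', 'Nintendo', None, 'Activision'],
    'Global_Sales': [82.74, 1.20, 0.35, 0.10],
})

# Count missing values per column before dropping anything.
missing_per_column = sample.isna().sum()
print(missing_per_column)

rows_before = len(sample)
cleaned = sample.dropna()   # returns a new frame; `sample` is left untouched
rows_dropped = rows_before - len(cleaned)
print(f'Dropped {rows_dropped} of {rows_before} rows')
```

Running the same two checks on df would show which columns are responsible for the dropped rows.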

Exploring the CSV's features

df.head()
   Rank                      Name  Platform    Year         Genre  Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0     1                Wii Sports       Wii  2006.0        Sports   Nintendo     41.49     29.02      3.77         8.46         82.74
1     2         Super Mario Bros.       NES  1985.0      Platform   Nintendo     29.08      3.58      6.81         0.77         40.24
2     3            Mario Kart Wii       Wii  2008.0        Racing   Nintendo     15.85     12.88      3.79         3.31         35.82
3     4         Wii Sports Resort       Wii  2009.0        Sports   Nintendo     15.75     11.01      3.28         2.96         33.00
4     5  Pokemon Red/Pokemon Blue        GB  1996.0  Role-Playing   Nintendo     11.27      8.89     10.22         1.00         31.37

Question 1: Top-selling Video Games

# Group by name and sum of global sales
top_selling = df.groupby('Name')['Global_Sales'].sum()

# Sort by global sales
top_selling = top_selling.sort_values(ascending=False)

# Displaying the top 15 games
print(top_selling.head(15))
Name
Wii Sports                        82.74
Grand Theft Auto V                55.92
Super Mario Bros.                 45.31
Tetris                            35.84
Mario Kart Wii                    35.82
Wii Sports Resort                 33.00
Pokemon Red/Pokemon Blue          31.37
Call of Duty: Modern Warfare 3    30.83
New Super Mario Bros.             30.01
Call of Duty: Black Ops II        29.72
Call of Duty: Black Ops           29.40
Wii Play                          29.02
New Super Mario Bros. Wii         28.62
Duck Hunt                         28.31
Call of Duty: Ghosts              27.38
Name: Global_Sales, dtype: float64

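The best-selling game is Wii Sports with 82.74m global sales, followed by Grand Theft Auto V with 55.92m and the old classic Super Mario Bros. with 45.31m. Because sales are summed by name, a title with several entries in the dataset is combined: Super Mario Bros. shows 40.24m in its NES row above but 45.31m in total.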

Question 2: Which Platform was the most popular

# Group by platform and sum of global sales
top_platform = df.groupby('Platform')['Global_Sales'].sum()

# Sort by platform sales
top_platform = top_platform.sort_values(ascending=False)

# Displaying the top 10 platforms
print(top_platform.head(10))
Platform
PS2     1233.46
X360     969.60
PS3      949.35
Wii      909.81
DS       818.91
PS       727.39
GBA      305.62
PSP      291.71
PS4      278.10
PC       254.70
Name: Global_Sales, dtype: float64

Based on global sales, the most popular platform of all time is the PlayStation 2, with 1233.46m games sold.

Question 3: Genre Popularity

# Group by genre and sum of global sales
top_genre = df.groupby('Genre')['Global_Sales'].sum()

# Sort by genre sales
top_genre = top_genre.sort_values(ascending=False)

# Displaying the top 10 genres
print(top_genre.head(10))
Genre
Action          1722.84
Sports          1309.24
Shooter         1026.20
Role-Playing     923.83
Platform         829.13
Misc             789.87
Racing           726.76
Fighting         444.05
Simulation       389.98
Puzzle           242.21
Name: Global_Sales, dtype: float64
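
Questions 1 to 3 repeat the same group, sum and sort pattern, so it could be wrapped in a small helper. The function name top_by and the sample rows below are hypothetical; this is a sketch of one way to do it, not part of the original notebook.

```python
import pandas as pd

def top_by(frame, column, n=10, value='Global_Sales'):
    """Sum `value` per `column` and return the n largest totals, largest first."""
    return frame.groupby(column)[value].sum().nlargest(n)

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    'Name': ['Game A', 'Game A', 'Game B', 'Game C'],
    'Platform': ['PS3', 'X360', 'Wii', 'Wii'],
    'Genre': ['Action', 'Action', 'Sports', 'Racing'],
    'Global_Sales': [21.0, 16.0, 30.0, 5.0],
})

top_names = top_by(sample, 'Name', n=2)
top_platforms = top_by(sample, 'Platform', n=1)
print(top_names)
print(top_platforms)
```

With the real data, top_by(df, 'Name', 15), top_by(df, 'Platform') and top_by(df, 'Genre') would correspond to the three questions above.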

Question 4: Sales by Region

region_sales = df[['NA_Sales','EU_Sales','JP_Sales','Other_Sales']].sum()

# Creating the bar graph
region_sales.plot(kind='bar')
plt.title('VGS By Region')
plt.xlabel('Region')
plt.ylabel('Sales (in Mil.)')
plt.show()

[Figure: bar chart of total sales by region]
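
The regional totals can also be expressed as a share of all sales, which is easier to read than raw bar heights. The sketch below uses only the five rows printed by df.head() earlier, so the percentages are illustrative and are not the dataset-wide shares.

```python
import pandas as pd

# The five rows shown by df.head(); not the full dataset.
head = pd.DataFrame({
    'NA_Sales': [41.49, 29.08, 15.85, 15.75, 11.27],
    'EU_Sales': [29.02, 3.58, 12.88, 11.01, 8.89],
    'JP_Sales': [3.77, 6.81, 3.79, 3.28, 10.22],
    'Other_Sales': [8.46, 0.77, 3.31, 2.96, 1.00],
})

region_totals = head.sum()
# Each region's percentage of the combined total.
region_share = region_totals / region_totals.sum() * 100
print(region_share.round(1))
```

Replacing head with df[['NA_Sales','EU_Sales','JP_Sales','Other_Sales']] would give the shares for the whole dataset.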

Question 5, Hypothesis 1: Relationship between year of release and global sales

# Creating the Scatterplot
plt.scatter(df['Year'],df['Global_Sales'])
plt.title('Year of Release vs. Global Sales')
plt.xlabel('YoR')
plt.ylabel('Global Sales (in Mil.)')
plt.show()

[Figure: scatterplot of year of release vs. global sales]

Clearly, there is no positive correlation between the year a game was released and how well it sold. If there were, the points for later years would be denser towards the upper end of the Global Sales axis (the Y axis).

Therefore, Hypothesis 1 is rejected.
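
A visual reading of the scatterplot can be backed by a correlation coefficient. The sketch below computes Pearson's r with Series.corr and a rank-based (Spearman-style) coefficient by correlating the ranks. It uses only the five df.head() rows, so the numbers are purely illustrative; on the real data the same calls would be applied to df['Year'] and df['Global_Sales'].

```python
import pandas as pd

# The five rows shown by df.head(); far too few for a real conclusion.
head = pd.DataFrame({
    'Year': [2006.0, 1985.0, 2008.0, 2009.0, 1996.0],
    'Global_Sales': [82.74, 40.24, 35.82, 33.00, 31.37],
})

# Pearson correlation (the default method of Series.corr).
pearson_r = head['Year'].corr(head['Global_Sales'])

# Rank correlation: Pearson on the ranks, less sensitive to extreme sellers.
rank_r = head['Year'].rank().corr(head['Global_Sales'].rank())

print(f'Pearson r: {pearson_r:.3f}')
print(f'Rank correlation: {rank_r:.3f}')
```

A coefficient close to 0 would agree with the conclusion drawn from the plot.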

Question 6, Hypothesis 2: Relationship between publisher and global sales

# Creating the Scatterplot
plt.scatter(df['Publisher'],df['Global_Sales'])
plt.title('Publisher vs. Global Sales')
plt.xlabel('Publisher')
plt.ylabel('Global Sales (in Mil.)')
plt.show()

[Figure: scatterplot of publisher vs. global sales]

Whoops! A scatterplot is unreadable with so many publishers on the X axis, so we need an extra step for this one. Let's group the rows by publisher, sum their global sales, and then sort the totals.

# Grouping by the sum of sales
publisher_vs = df.groupby('Publisher')['Global_Sales'].sum()

# Sorting by highest selling publisher
publisher_vs = publisher_vs.sort_values(ascending=False)

# Creating the bar plot
publisher_vs.head(15).plot(kind='bar')
plt.title('Publisher vs. Global Sales')
plt.xlabel('Publisher')
plt.ylabel('Global Sales (in Mil.)')
plt.show()

[Figure: bar chart of global sales for the top 15 publishers]

In the graph above, we immediately observe that Nintendo and EA have the highest all-time video game sales.

We could go as far as to state that, since the dataset suggests Nintendo reached those numbers not through an irregularity/outlier (i.e. one best-selling game) but through consistently successful entries, there is definitely something to be learnt from that company.
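
Whether a publisher's total rests on one hit or on many titles can be checked by adding a title count, a median, and the share contributed by the best seller. The publisher names and figures below are made up for illustration; this is a sketch of the check, not a result from the dataset.

```python
import pandas as pd

# Hypothetical publishers: both total 30m, but with very different spreads.
sample = pd.DataFrame({
    'Publisher': ['Pub A', 'Pub A', 'Pub A', 'Pub B', 'Pub B'],
    'Global_Sales': [10.0, 8.0, 12.0, 29.0, 1.0],
})

summary = sample.groupby('Publisher')['Global_Sales'].agg(
    titles='count', total='sum', median='median', best='max')

# Fraction of the publisher's total that comes from its single best seller.
summary['best_share'] = summary['best'] / summary['total']
print(summary)
```

A low best_share combined with a high title count would support the claim of consistent success; a value near 1 would point to a single outlier.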

Question 7: Would a Poisson distribution be appropriate to model the sales data?

First Calculations

Mean and variance calculation

mean = df['Global_Sales'].mean()
var = df['Global_Sales'].var()

Expected mean and variance calculation

# Under a Poisson model the variance equals the mean
exp_mean = mean
exp_var = mean

Showing and comparing the mean & the variance vs. the expected values

print('Mean sales', mean)
print('Variance of sales', var)
print('Expected mean', exp_mean)
print('Expected variance', exp_var)
Mean sales 0.5409103185808114
Variance of sales 2.456568802945086
Expected mean 0.5409103185808114
Expected variance 0.5409103185808114

To judge whether a Poisson distribution can model the data, we need a goodness-of-fit check. A quick one is to calculate the sample mean and variance, as done in the two steps above, and compare them with the values a Poisson model would expect.

A Poisson distribution has equal mean and variance, so if the sample mean and variance are equal or approximately equal, the distribution is a plausible candidate; the closer they are, the better the fit is likely to be.

In this case the difference is substantial: the variance (about 2.46) is roughly 4.5 times the mean (about 0.54). Regardless, we will still attempt to fit the Poisson distribution to the data and see how well it fits.
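
The comparison of mean and variance can be condensed into a single number, the ratio of variance to mean, which is about 1 for Poisson data. The helper name dispersion_index is hypothetical, and the simulated sample is an assumption added only for contrast; the two reported figures are the ones printed above.

```python
import numpy as np

def dispersion_index(values):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    values = np.asarray(values, dtype=float)
    return values.var(ddof=1) / values.mean()

# Figures printed above for Global_Sales.
reported_mean = 0.5409103185808114
reported_var = 2.456568802945086
reported_ratio = reported_var / reported_mean
print(f'Reported variance / mean: {reported_ratio:.2f}')

# Simulated Poisson counts with the same mean (not part of the dataset).
rng = np.random.default_rng(0)
simulated = rng.poisson(lam=reported_mean, size=100_000)
simulated_ratio = dispersion_index(simulated)
print(f'Simulated Poisson variance / mean: {simulated_ratio:.2f}')
```

A ratio well above 1, as for the reported figures, indicates overdispersion relative to a Poisson model.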

Second Calculations

Poisson Distribution

def poisson_prob(k, lam):
    return (lam**k)*math.exp(-lam)/math.gamma(k+1)

mean = np.mean(df['Global_Sales'])

# Generating a histogram of the sales
n, bins, patches = plt.hist(df['Global_Sales'], bins=50, density=True, alpha=0.5)

# Calculating the distribution
poisson_dist = [poisson_prob(b,mean) for b in bins]

# Plotting the distribution over the histogram
plt.plot(bins, poisson_dist, 'r-', linewidth = 1)

# Labelling the plot
plt.title('Poisson Distribution of Global Sales')
plt.xlabel('Global Sales (in Mil.)')
plt.ylabel('Probability')
plt.show()

[Figure: histogram of global sales with the Poisson curve in red]

In the histogram above, the red line depicts the fitted distribution. To be more exact, it represents the Poisson probability mass function (PMF), evaluated at the bin edges with the sample mean as its rate.

The PMF of a discrete random variable X gives the probability that X takes a particular value x: f(x) = P[X = x]. Global sales are not whole numbers, which is why poisson_prob uses math.gamma(k+1) in place of k! to evaluate the curve between integers.
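
The PMF definition can be checked numerically on whole-number counts, where the probabilities should add up to 1. This is a small sketch; the function name poisson_pmf and the rate 0.54 (a rounded version of the sample mean above) are choices made for the example.

```python
import math

def poisson_pmf(k, lam):
    """P[X = k] for a Poisson variable with rate lam and integer k >= 0."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 0.54  # rounded sample mean, used here only as an example rate

# Probabilities for counts 0..19; the remaining tail is negligible at this rate.
probs = [poisson_pmf(k, lam) for k in range(20)]
total = sum(probs)
print(f'P[X = 0] = {probs[0]:.4f}')
print(f'Sum over k = 0..19: {total:.6f}')

# gamma(k + 1) equals k! at integers, so both forms agree there.
gamma_matches_factorial = math.isclose(math.gamma(4), math.factorial(3))
print(gamma_matches_factorial)
```

At integer k this function and a gamma-based version give the same values; they differ only in that the gamma form is also defined between integers.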

As the comparison of mean and variance already suggested, the distribution does not fit well: the PMF overestimates the frequency of low sales and underestimates the frequency of high sales. We conclude that Poisson is indeed not a good fit for this dataset.

Note

That does not necessarily mean that this data set does not follow any probability distribution, just that it does not follow specifically a Poisson distribution.