Data analysis with python and pandas

Welcome to Exploratory Data Analysis With Python and Pandas. In this project, I applied practical Exploratory Data Analysis (EDA) techniques on tabular dataset using Python packages such as Pandas and Numpy. Also, Produced data visualizations using Seaborn and Matplotlib.

Dataset used

Data link : Link to data source: https://www.kaggle.com/aungpyaeap/supermarket-sales

Project Objectives

In this Project, we are going to focus on three learning objectives:

Apply practical Exploratory Data Analysis (EDA) techniques on any tabular dataset using Python.
Produce data visualizations using Seaborn and Matplotlib
Identify and handle duplicate and missing data

Project Structure

The hands on project on Exploratory Data Analysis With Python and Pandas is divided into following tasks:

Task 1: Initial Data Exploration

In this task, we are introduced to the project and learning outcomes.
Once we are familiarized with the Rhyme interface, we begin working in Jupyter Notebooks, a web-based interactive computational environment for creating notebook documents.
Next, we will import essential libraries such as NumPy, Pandas, Seaborn, Matplotlib and so on. We use Pandas to read in the data, get a brief glimpse of the first few rows, and calculate some quick summary statistics of the numeric columns.

Task 2: Univariate Analysis

In this task, we conduct univariate analysis on both continuous and categorical variables.
We first plot the distribution of customer ratings with seaborn and also overlay the mean, 25th and 75th percentile quantiles calculated using Numpy.
We then use Pandas' .hist() method to plot the distribution for all numeric variables.
Using Seaborn's .countplot() method, we see the frequency distribution of 'Branch' and 'Payment' which are categorical variables.

Task 3: Bivariate Analysis

In this task, we conduct bivariate analysis on both continuous and categorical variables.
We use Seaborn to plot scatterplots and regression plots to identify the relationship between customer rating and gross income.
Additionally, we use Seaborn to plot a boxplot to check the difference in aggregate sales figures between the three branches of supermarkets, and to compare sales patterns between men and women.
We plot a time series graph to check for trends in gross income over a period of three months.

Task 4: Dealing With Duplicate Rows and Missing Values

In this task, we identify and deal with duplicate rows and missing values in our dataset.
We calculate the number of duplicate rows and delete them using Pandas.
We then do the same with missing values, but instead of deleting those rows, we replace missing values by the means of their respective columns.
We explore our dataset using Pandas Profiler to see how we can automate a lot of exploratory data analysis given certain conditions are met.

Task 5: Correlation Analysis

In this task, we conduct correlation analysis on the numeric variables in our dataset.
We use Numpy to calculate the correlation between two numeric variables.
We then use pandas to calculate a correlation matrix to show all pairwise correlations of numeric variables.
Finally, we use seaborn to plot the calculated correlation matrix as a heatmap that is easily interpretable.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
Learner_Notebook.ipynb		Learner_Notebook.ipynb
README.md		README.md
logo.png		logo.png
supermarket_sales.csv		supermarket_sales.csv
testpython.py		testpython.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data analysis with python and pandas

Dataset used

Project Objectives

Project Structure

About

Releases

Packages

Languages

kavita19/Exploratory-Data-Analysis-With-Python-and-Pandas

Folders and files

Latest commit

History

Repository files navigation

Data analysis with python and pandas

Dataset used

Project Objectives

Project Structure

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages