Kshitij Kumar
Khelawan Singh
Lokesh Khande
Pandas profiling is an open-source Python module that lets us carry out an exploratory data analysis (EDA) in just a few lines of code. It also generates interactive reports in web format that can be presented to anyone. In short, pandas profiling saves us the work of visualizing and understanding the distribution of each variable: it generates a report with all of this information readily available. Pandas itself can import data from various file formats, such as CSV, JSON and Microsoft Excel.
Exploratory Data Analysis: In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
One of the nice features of the generated report is the set of warnings that appears at the beginning. It tells us which variables contain NaN values, which contain many zeros, which are categorical, and so on. Pandas profiling is a very useful package for those who are new to data science: they can start their career by exploring these generated reports and learn many new statistical terms along the way.
pip install pandas
pip install pandas-profiling
conda install -c conda-forge pandas
conda install -c conda-forge pandas-profiling
We can view the report through two interfaces: widgets and an HTML report. Here we will generate the report as an HTML file.
import pandas as pd
from pandas_profiling import ProfileReport
To read data from a file, use the pandas reader that matches the file type (for example read_csv, read_json or read_excel):
df=pd.read_csv(r"covidindia.csv")
profile=df.profile_report(title="Covid India Analysis Report",plot={"dpi": 800, "image_format": "png"})
profile.to_file(output_file='CovidIndia.html')
The report is written to an HTML file named CovidIndia.html. In a Jupyter notebook, profile.to_widgets() displays the report as interactive widgets, and profile.to_notebook_iframe() embeds the HTML report. We can also obtain a JSON file by giving the output file a .json extension. The plot argument used above sets the resolution (dpi) and format of the images.
For large datasets we can use minimal mode. This is a predefined configuration that disables expensive computations (such as correlations and duplicate row detection).
Syntax-
profile = ProfileReport(df, minimal=True)
profile.to_file("CovidIndia.html")
For large datasets we can also take a sample of the data.
Syntax-
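# Draw 1000 random rows from the DataFrame loaded earlier
sample = df.sample(1000)
# Profile the sample, not the full DataFrame
profile = sample.profile_report(minimal=True)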
We can also profile a subset of the data, for example the first n rows, if the full dataset is too large.
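The two ways of cutting a dataset down can be compared with plain pandas. The sketch below uses a made-up DataFrame (the column names day and cases are assumptions, not the real Covid file), and the resulting subset would then be passed to the profiler in place of the full data.

```python
import pandas as pd

# Hypothetical stand-in for a large dataset; the columns are made up.
df = pd.DataFrame(
    {"day": range(1, 10001), "cases": [i % 97 for i in range(10000)]}
)

# First n rows: cheap, but only representative if the file is not
# ordered in a meaningful way (for example by date or by region).
first_rows = df.head(1000)

# Random sample of n rows; random_state makes the selection repeatable.
random_rows = df.sample(n=1000, random_state=0)

print(first_rows.shape, random_rows.shape)
```

A random sample usually gives a fairer picture of the distributions, while the first n rows are enough for a quick look at column types and formatting.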
Reproduction: Analysis started, Analysis finished, Duration, Version, Command line, Download configuration
Warnings
Type inference: detect the types of columns in a dataframe such as Boolean, Numerical, Date, Categorical, URL, Path, File and Image.
Dataset statistics: Number of variables, Number of observations, Missing cells, Missing cells (%), Duplicate rows, Duplicate rows (%), Total size in memory, Average record size in memory
Essentials: type, unique values, missing values, infinite values
Quantile statistics: minimum value, 5th percentile, Q1, median, Q3, 95th percentile, maximum, range, interquartile range
Descriptive statistics: mean, median, standard deviation, variance, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Extreme values: the five smallest and five largest values
Correlations: Spearman, Pearson, Kendall, Phik
Missing values: matrix, count, heatmap, dendrogram
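To get a feel for what the dataset statistics and warnings in this list mean, some of them can be computed by hand with plain pandas. This is only a rough sketch on a made-up table (the column names and values are assumptions); it is not how pandas-profiling itself is implemented.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are made up.
df = pd.DataFrame(
    {
        "state": ["Kerala", "Delhi", "Goa", "Delhi"],
        "confirmed": [10, 0, None, 0],
        "deaths": [1, 0, 0, 0],
    }
)

# Dataset statistics
n_observations, n_variables = df.shape
missing_cells = int(df.isna().sum().sum())
missing_pct = 100 * missing_cells / df.size
duplicate_rows = int(df.duplicated().sum())
memory_bytes = int(df.memory_usage(deep=True).sum())
avg_record_bytes = memory_bytes / n_observations

# Per-column checks in the spirit of the warnings section:
# how many values are missing, and what share of values are zero.
missing_per_column = df.isna().sum()
zeros_share = (df.select_dtypes("number") == 0).mean()

print(n_variables, n_observations, missing_cells, duplicate_rows)
print(zeros_share)
```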
Meanings of some of the important terms in profile report:
- Range-It is the difference between highest and lowest value.
- Mean-It is the average of the dataset.
- Median-It is the middle value of the set of numbers once they are sorted.
- Mode-It is the most frequently occurring value in the dataset.
- Median Absolute Deviation(MAD)-It is a robust measure (one that is not strongly affected by very high or very low values) of how spread out a set of data is. It is the median of the absolute differences between each value and the median of the data.
- Standard Deviation-It is a measure of dispersion of observations within a data set about mean.
- Variance-It is the average of the squared deviations of the observations from the mean, and equals the square of the standard deviation.
- 95th percentile-It is the value below which 95% of the numbers in a given set fall, or equivalently the highest value left when the top 5% of a numerically sorted set of data is discarded. The position of a percentile in the sorted data can be calculated using the formula n = (P/100) x N, where P = percentile, N = number of values in the data set and n = rank of the value in the sorted list.
- Coefficient of variation(CV)-It shows the extent of variability in relation to the mean of the population. It is the ratio of the standard deviation to the mean. The higher the coefficient of variation, the greater the level of dispersion around the mean.
- Interquartile range(IQR)-It describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range, first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1 = 25th percentile) and quartile 3 (Q3 = 75th percentile), and IQR = Q3 - Q1. Q2 (= 50th percentile) is the median of the dataset.
- Skewness-It is a measure of the asymmetry of a distribution about its mean. Types: positive skewness (longer tail on the right) and negative skewness (longer tail on the left).
- Kurtosis-It is a measure of the sharpness of the peak and the heaviness of the tails of a probability distribution curve.
- Dendrogram-It is a branching diagram that represents the relationships of similarity among a group of entities.
- Heatmap-It is a graphical representation of data that uses a system of color-coding to represent different values.
- Correlation-It is a statistical measure that expresses the extent to which two variables are linearly related. The value of the correlation coefficient varies between +1 and -1. As the value approaches 0, the relationship between the two variables becomes weaker. The direction of the relationship is indicated by the sign of the coefficient: a + sign indicates a positive relationship and a - sign indicates a negative relationship.
- Pearson's r Correlation-The Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and 1 indicating total positive linear correlation.
- Spearman's ρ Correlation-The Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better in catching nonlinear monotonic correlations than Pearson's r.
- Kendall's τ Correlation- The Kendall rank correlation coefficient (τ) measures ordinal association between two variables.
- Phik(φk) Correlation-Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution.
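Several of the terms above can be tried out on a small made-up list of numbers. The sketch below uses only Python's standard statistics module rather than pandas-profiling; the data and the helper names pearson and ranks are assumptions for illustration, and the report may use slightly different conventions (for example for quartile interpolation).

```python
import statistics as st

# Made-up sample with one large outlier (40).
values = [2, 4, 4, 5, 7, 9, 10, 12, 15, 40]

value_range = max(values) - min(values)
mean = st.mean(values)
median = st.median(values)
mode = st.mode(values)
std = st.stdev(values)          # sample standard deviation (divides by n - 1)
variance = st.variance(values)  # sample variance, equal to std ** 2
cv = std / mean                 # coefficient of variation

# MAD: median of the absolute deviations from the median.
mad = st.median([abs(v - median) for v in values])

# Quartiles with linear interpolation between sorted values.
q1, q2, q3 = st.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1


def pearson(x, y):
    """Pearson's r computed directly from its definition."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(x):
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = sorted(x)
    return [order.index(v) + 1 for v in x]


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]             # monotonic but not linear

r = pearson(x, y)                   # Pearson on the raw values
rho = pearson(ranks(x), ranks(y))   # Spearman = Pearson on the ranks

print(mean, median, mad, iqr)
print(r, rho)
```

The outlier pulls the mean well above the median, while the median, MAD and IQR barely react to it, which is what "robust" means here. For the cubic relationship, Spearman's ρ reaches 1 because the ordering is perfectly preserved, whereas Pearson's r stays below 1 because the relationship is not a straight line.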
So, here we analyse various information about Covid-19 in India, and we can get useful results through the pandas-profiling package.
pandas-profiling is a good package, but it is not the right tool for every problem: the report contains a great deal of information, and sometimes we do not need that much. In general, the time to generate the report increases considerably with the size of the data, so a powerful computer helps to get the work done faster. We can also take a sample of the data, analyse it, and use the generated report to understand the data and inform major decisions about the future.
The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser. We need Python 3 to run this package.
For more information, visit: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/introduction.html