Pandas Tutorial
Here we'll be learning about Pandas, a popular library for loading and manipulating datasets. Pandas makes preparing, analyzing, and presenting data easy. We'll be looking at some commonly used functions. It is strongly recommended that you read the documentation for these functions when you use them: you might find parameters useful to your use case that aren't mentioned here, and it's always good practice to learn from the documentation.
- Mac and Linux users can use the following command:
pip install pandas
- Windows users can use the following command from command line:
python -m pip install pandas
To use pandas in your code, you'll simply use the following import:
import pandas as pd
In pandas, a Series refers to a one-dimensional labeled array. The pd.Series() constructor is used to create a Series object.
import numpy as np

data = np.array(['p', 'a', 'n', 'd', 'a', 's'])
s = pd.Series(data)
print(s)
It produces the following output:
0 p
1 a
2 n
3 d
4 a
5 s
dtype: object
As you can see, pandas adds indices on its own.
Indexing a series is quite simple.
print(s[0])
print(s[:2])
Output:
p
0 p
1 a
dtype: object
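A Series also accepts custom index labels in place of the defaults; a small sketch (the car names here are just illustrative values):

```python
import pandas as pd

# Supply our own labels instead of the default 0..n-1 index
s = pd.Series([365, 612, 335], index=['G 90', 'S Class', 'A8'])
print(s['S Class'])  # prints 612
```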
Dataframes are 2-dimensional labeled data structures, and you'll be using them more often. Dataframes provide an important advantage: each column in a dataframe can have a different datatype. Pandas provides many functions that allow us to perform various operations on these dataframes.
To create a dataframe we use the pd.DataFrame() constructor. It has the following syntax:
pd.DataFrame(data, index, columns, dtype, copy)
- data can be an array, Series, or dict.
- index can be used to provide indices of our own; it defaults to 0, 1, 2, ..., n-1.
- columns is used to specify labels for each column. It defaults to a range index if no labels are provided.
- dtype can be used to enforce a single datatype on all columns.
- copy is a boolean value, False by default. When set to True, a copy of the original data is created, so changes to the original data won't be reflected in the dataframe.
cars = ["G 90", "S Class", "A8", "7 Series"]
hp = np.array([365, 612, 335, 535])
mpg = np.array([20, 23, 22, 20])
df = pd.DataFrame(data={'Cars' : cars, 'Horsepower' : hp, 'Miles Per Gallon' : mpg}, index=['car1', 'car2', 'car3', 'car4'])
print(df)
Output:
Cars Horsepower Miles Per Gallon
car1 G 90 365 20
car2 S Class 612 23
car3 A8 335 22
car4 7 Series 535 20
Accessing data in a pandas dataframe is a little different from numpy.
To access a column, we can pass its label directly inside square brackets. To get a row by label, we use the loc indexer. To access data by integer position, we use the iloc indexer. : indicates selection of all values along that row or column.
print(df['Cars']) # Gets entire column, is same as
print(df.iloc[:, 0])
print('')
print(df.loc['car2']) # Gets entire row, is same as
print(df.iloc[1, :])
print('')
print(df['Miles Per Gallon'].loc['car3']) # Is same as
print(df.iloc[2, 2])
Output:
car1 G 90
car2 S Class
car3 A8
car4 7 Series
Name: Cars, dtype: object
car1 G 90
car2 S Class
car3 A8
car4 7 Series
Name: Cars, dtype: object
Cars S Class
Horsepower 612
Miles Per Gallon 23
Name: car2, dtype: object
Cars S Class
Horsepower 612
Miles Per Gallon 23
Name: car2, dtype: object
22
22
To append row(s) to a dataframe, older versions of pandas used the append method. It took a dataframe or series object as a parameter. The syntax was as follows:
df = df.append(row) # row is a dataframe object
Note that you need to assign the result of df.append() back to df; this is because append returns a new object and does not work in place. append was deprecated in pandas 1.4 and removed in pandas 2.0, so in current versions use pd.concat instead.
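In pandas 2.0 and later, adding rows is done with pd.concat. A minimal sketch (the values here are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Cars': ['G 90', 'S Class'], 'Horsepower': [365, 612]},
                  index=['car1', 'car2'])
new_row = pd.DataFrame({'Cars': ['A8'], 'Horsepower': [335]}, index=['car3'])

# Like append, pd.concat returns a new object rather than modifying df in place
df = pd.concat([df, new_row])
print(df)
```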
axes returns a list of row and column labels.
print(df.axes)
Output:
[Index(['car1', 'car2', 'car3', 'car4'], dtype='object'), Index(['Cars', 'Horsepower', 'Miles Per Gallon'], dtype='object')]
shape returns a tuple containing number of rows and columns in the dataframe.
size returns total number of elements in the dataframe.
print(df.shape)
print(df.size)
Output:
(4, 3)
12
head and tail
head returns the first n rows of the dataframe, whereas tail returns the last n rows. If no parameter is passed, the first/last 5 rows are returned.
mean returns the mean of all values along the given axis (rows or columns here). You can also select a column, e.g. df['Horsepower'].mean(), to get the mean of all values in that column.
count returns a count of all non-null observations.
In addition to these, pandas provides many other mathematical functions for the median, mode, standard deviation, absolute value, etc.
print(df.head(2))
print(df['Miles Per Gallon'].mean()) # axis is 0 by default
print(df.count())
Output:
Cars Horsepower Miles Per Gallon
car1 G 90 365 20
car2 S Class 612 23
21.25
Cars 4
Horsepower 4
Miles Per Gallon 4
dtype: int64
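As a quick sketch of tail and a few of the other statistics mentioned above, using the same car data:

```python
import pandas as pd

df = pd.DataFrame(
    {'Cars': ['G 90', 'S Class', 'A8', '7 Series'],
     'Horsepower': [365, 612, 335, 535],
     'Miles Per Gallon': [20, 23, 22, 20]},
    index=['car1', 'car2', 'car3', 'car4'])

print(df.tail(2))                        # last 2 rows: car3 and car4
print(df['Horsepower'].median())         # 450.0
print(df['Miles Per Gallon'].std())      # 1.5 (sample standard deviation)
print(df['Miles Per Gallon'].mode()[0])  # 20, the most frequent value
```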
sort_values sorts the dataframe by the column label passed in as a parameter. You can also pass in multiple labels: when two values in the first column are equal, the tie is broken by the next column.
apply can be used to apply a function to each element of a series, or along an axis of a dataframe.
df = df.sort_values(by=['Miles Per Gallon', 'Horsepower'], ascending=False)
print(df)
print(df['Horsepower'].apply(lambda x : 1.1*x + 5)) # Read up about lambda function if you're new to python
Output:
Cars Horsepower Miles Per Gallon
car2 S Class 612 23
car3 A8 335 22
car4 7 Series 535 20
car1 G 90 365 20
car2 678.2
car3 373.5
car4 593.5
car1 406.5
Name: Horsepower, dtype: float64
Often you'll find datasets with missing values, so it's always good practice to check for them. Pandas provides functions to do exactly that, and there are also functions to "fix" these missing values to ensure minimum loss of information.
isna is used to find null values in a dataframe. It returns a dataframe of booleans, with positions corresponding to missing data set to True.
dropna drops a row if it contains at least one null value. You can set a different threshold with the thresh parameter: thresh is the minimum number of non-null values a row must have in order to be kept.
fillna fills null values with the specified value or method.
# Making a dataframe with missing values
cars = ["G 90", "S Class", "A8", "7 Series"]
hp = np.array([365, np.nan, 335, 535]) # NaN is not exactly the same as None, but both are considered null
mpg = np.array([20, None, None, 20])
df = pd.DataFrame(data={'Cars' : cars, 'Horsepower' : hp, 'Miles Per Gallon' : mpg}, index=['car1', 'car2', 'car3', 'car4'])
print(df.isna())
print('')
print(df.isna().sum()) # Gives a count of null values per column
print('')
print(df.dropna(thresh=2)) # Only the car2 row is dropped: it has just 1 non-null value
print('')
print(df['Miles Per Gallon'].fillna(df['Miles Per Gallon'].mean())) # You can give custom methods also to fill missing values
Output:
Cars Horsepower Miles Per Gallon
car1 False False False
car2 False True True
car3 False False True
car4 False False False
Cars 0
Horsepower 1
Miles Per Gallon 2
dtype: int64
Cars Horsepower Miles Per Gallon
car1 G 90 365.0 20
car3 A8 335.0 None
car4 7 Series 535.0 20
car1 20.0
car2 20.0
car3 20.0
car4 20.0
Name: Miles Per Gallon, dtype: float64
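Besides a constant value, missing entries can also be filled from neighboring rows; in current pandas this is done with the ffill (forward fill) and bfill (backward fill) methods. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([365.0, np.nan, np.nan, 535.0])
print(s.ffill())  # each NaN takes the previous value: 365, 365, 365, 535
print(s.bfill())  # each NaN takes the next value: 365, 535, 535, 535
```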
read_csv is used to read CSV files. Simply passing the filename with the correct path is enough, but there are many parameters you can use according to your needs. It returns a dataframe object.
df = pd.read_csv("filename")
to_csv is a dataframe method that writes the dataframe to a CSV file. You simply need to pass the filename.
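A small round-trip sketch, writing a dataframe out and reading it back (the filename here is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'Cars': ['G 90', 'S Class'], 'Horsepower': [365, 612]})

# index=False avoids writing the row index as an extra column
df.to_csv('cars.csv', index=False)

df2 = pd.read_csv('cars.csv')
print(df2)
```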