Pandas Tutorial

Pandas

Here we'll be learning about Pandas, a popular library used to load and manipulate datasets. Pandas makes preparing, analyzing, and presenting data easy. We'll be looking at some functions that are used often. It is strongly recommended that you look at the documentation for these functions when you use them: you might find parameters useful to your use case that are not mentioned here, and it's always good practice to learn from the documentation.

Installation

  1. Mac and Linux users can use the following command:

pip install pandas

  2. Windows users can use the following command from the command line:

python -m pip install pandas

To use pandas in your code, you simply import it as follows:

import pandas as pd

Basics

Series

In pandas, a Series is a one-dimensional labelled array. The pd.Series() constructor is used to create a Series object.

import numpy as np

data = np.array(['p', 'a', 'n', 'd', 'a', 's'])
s = pd.Series(data)
print(s)

It produces the following output:

0    p
1    a
2    n
3    d
4    a
5    s
dtype: object

As you can see, pandas adds indices on its own.
Indexing a Series is quite simple.

print(s[0])
print(s[:2])

Output:

p
0    p
1    a
dtype: object
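
You can also supply your own labels through the index parameter of pd.Series(). A minimal sketch (the values and labels below are made up purely for illustration):

marks = pd.Series([85, 92, 78], index=['maths', 'physics', 'chemistry'])
print(marks['physics']) # Label-based lookup, prints 92
print(marks[:2]) # Integer slicing is still positional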

Dataframes

Dataframes are 2-dimensional labelled arrays, and you'll be using them more often.
Dataframes provide an important advantage: each column in a dataframe can be of a different datatype. Pandas provides many functions that allow us to perform various operations on these dataframes.

To create a dataframe we use the pd.DataFrame() constructor. It has the following syntax:

pd.DataFrame(data, index, columns, dtype, copy)

data can be an array, series, or dict.
index can be used to provide indices of our own; it defaults to 0, 1, 2, ..., n-1.
columns is used for specifying labels for each column. It defaults to a range index if no labels are provided.
dtype can be used to enforce a single datatype on all columns.
copy is a boolean value, set to False by default. When set to True, a copy of the original data is created, so changes to the original data won't be reflected in the dataframe. (A short sketch of the columns and dtype parameters follows the example below.)

cars = ["G 90", "S Class", "A8", "7 Series"]
hp = np.array([365, 612, 335, 535])
mpg = np.array([20, 23, 22, 20])
df = pd.DataFrame(data={'Cars' : cars, 'Horsepower' : hp, 'Miles Per Gallon' : mpg}, index=['car1', 'car2', 'car3', 'car4'])
print(df)

Output:

          Cars  Horsepower  Miles Per Gallon
car1      G 90         365                20
car2   S Class         612                23
car3        A8         335                22
car4  7 Series         535                20
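
The example above used only the data and index parameters. Here is a minimal sketch of the columns and dtype parameters, building a dataframe from a plain 2-D array (the values are reused from above just for illustration):

specs = np.array([[365, 20], [612, 23], [335, 22], [535, 20]])
df2 = pd.DataFrame(data=specs, columns=['Horsepower', 'Miles Per Gallon'], dtype=float)
print(df2.dtypes) # Both columns become float64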

Accessing data in a pandas dataframe is a little different from numpy.
To access a column, we can pass its label directly inside square brackets. To get a row, we use the loc indexer. To access data purely by integer position, we use the iloc indexer. : is used to indicate selection of all values in that row or column.

print(df['Cars']) # Gets entire column, is same as
print(df.iloc[:, 0])
print('')
print(df.loc['car2']) # Gets entire row, is same as
print(df.iloc[1, :]) 
print('')
print(df['Miles Per Gallon'].loc['car3']) # Is same as
print(df.iloc[2, 2])

Output:

car1        G 90
car2     S Class
car3          A8
car4    7 Series
Name: Cars, dtype: object
car1        G 90
car2     S Class
car3          A8
car4    7 Series
Name: Cars, dtype: object

Cars                S Class
Horsepower              612
Miles Per Gallon         23
Name: car2, dtype: object
Cars                S Class
Horsepower              612
Miles Per Gallon         23
Name: car2, dtype: object

22
22

To append row(s) to a dataframe, pandas provides the append method. It takes a dataframe or Series object as a parameter. The syntax is as follows:

df = df.append(row) # row is a dataframe object

Note that you need to assign the result of df.append() back to df. This is because append returns a new object; it does not work in place.
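
A minimal sketch of appending a row, built here as a one-row dataframe (the new car and its values are made up purely for illustration; newer pandas versions deprecate append in favour of pd.concat):

new_row = pd.DataFrame({'Cars' : ['Model S'], 'Horsepower' : [670], 'Miles Per Gallon' : [28]}, index=['car5'])
df = df.append(new_row) # append returns a new dataframe, so reassign it
# In newer pandas versions: df = pd.concat([df, new_row])
print(df)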

Useful dataframe methods

axes returns a list of row and column labels.

print(df.axes)

Output:

[Index(['car1', 'car2', 'car3', 'car4'], dtype='object'), Index(['Cars', 'Horsepower', 'Miles Per Gallon'], dtype='object')]

shape returns a tuple containing the number of rows and columns in the dataframe.
size returns the total number of elements in the dataframe.

print(df.shape)
print(df.size)

Output:

(4, 3)
12

head and tail
head returns the first n rows of the dataframe, whereas tail returns the last n rows. If no parameter is passed, the first/last 5 rows are returned.

mean returns the mean of all values along the mentioned axis (rows or columns here). You can also select a column by its label to get the mean of all values in that column.
count returns a count of all non-null observations.
In addition to these, pandas provides many other statistical functions for the median, mode, standard deviation, absolute value, etc. (a short sketch of a few of them follows the output below).

print(df.head(2))
print(df['Miles Per Gallon'].mean()) # axis is 0 by default
print(df.count())

Output:

         Cars  Horsepower  Miles Per Gallon
car1     G 90         365                20
car2  S Class         612                23
21.25
Cars                4
Horsepower          4
Miles Per Gallon    4
dtype: int64
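
A minimal sketch of a few of the other methods mentioned above, on the same dataframe:

print(df.tail(2)) # Last 2 rows
print(df['Horsepower'].median()) # Median of a single column
print(df['Miles Per Gallon'].std()) # Standard deviation of a column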

sort_values sorts the dataframe by the column label(s) passed through the by parameter. You can also pass multiple labels: when two values for the first label are the same, the corresponding values for the next label are compared.
apply can be used to apply an operation to a column, or to each column of the dataframe (a sketch of column-wise apply follows the output below).

df = df.sort_values(ascending=False, by=['Miles Per Gallon', 'Horsepower'])
print(df)
print(df['Horsepower'].apply(lambda x : 1.1*x + 5)) # Read up on lambda functions if you're new to Python

Output:

          Cars  Horsepower  Miles Per Gallon
car2   S Class         612                23
car3        A8         335                22
car4  7 Series         535                20
car1      G 90         365                20
car2    678.2
car3    373.5
car4    593.5
car1    406.5
Name: Horsepower, dtype: float64
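
A minimal sketch of column-wise apply over the numeric columns (the function here is just for illustration):

print(df[['Horsepower', 'Miles Per Gallon']].apply(lambda col : col.max() - col.min()))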

Dealing with missing values

Often you'll find datasets with missing values. It's always good practice to check for missing values. Pandas provides functions to do exactly that. And guess what, there are also functions to "fix" these missing values, to ensure minimum loss of information.

isna is used to find null values in a dataframe. It returns a dataframe object with the positions corresponding to missing data set to True.
dropna drops a row if it contains at least one null value. You can change this behaviour with the thresh parameter, which sets the minimum number of non-null values a row must have to be kept.
fillna fills null values using the specified value or method.

# Making a dataframe with missing values
cars = ["G 90", "S Class", "A8", "7 Series"]
hp = np.array([365, np.NaN,335, 535]) # NaN is not exactly the same as None, but it will be considered as null
mpg = np.array([20, None, None, 20])
df = pd.DataFrame(data={'Cars' : cars, 'Horsepower' : hp, 'Miles Per Gallon' : mpg}, index=['car1', 'car2', 'car3', 'car4'])
print(df.isna())
print('')
print(df.isna().sum()) # Gives a count of null values
print('')
print(df.dropna(thresh=2)) # Only the car2 row is dropped, since it has fewer than 2 non-null values
print('')
print(df['Miles Per Gallon'].fillna(df['Miles Per Gallon'].mean())) # You can also fill with custom values or methods

Output:

       Cars  Horsepower  Miles Per Gallon
car1  False       False             False
car2  False        True              True
car3  False       False              True
car4  False       False             False

Cars                0
Horsepower          1
Miles Per Gallon    2
dtype: int64

          Cars  Horsepower Miles Per Gallon
car1      G 90       365.0               20
car3        A8       335.0             None
car4  7 Series       535.0               20

car1    20.0
car2    20.0
car3    20.0
car4    20.0
Name: Miles Per Gallon, dtype: float64
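
A minimal sketch of filling with a method instead of a value (ffill propagates the last valid value forward; shown here only to illustrate the method parameter):

print(df['Miles Per Gallon'].fillna(method='ffill'))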

Other Functions

read_csv is used to read CSV files. Simply passing the filename (with the correct path) is enough to load the file, but there are many parameters you can use according to your needs. It returns a dataframe object.

df = pd.read_csv("filename")

to_csv is a dataframe method that writes the dataframe to a CSV file. You simply need to pass the filename.
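
A minimal sketch (the filename and the index=False option here are just one common choice; check the documentation for the many other parameters):

df.to_csv("cars.csv", index=False) # index=False skips writing the row labels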
