- Load data
- Work with data
- Visualization
- Machine Learning
- Titanic Example
pandas is the most widely used Python library for data analysis. It is built on top of NumPy, which keeps most of its operations fast.
The most convenient way to read files is to use the reader functions that pandas provides. They can load the following file types into a pandas object:
- csv
- Excel
- hdf
- sql
- json
- msgpack (deprecated in pandas 0.25 and removed in 1.0)
- html
- gbq
- stata
- sas
- pickle
pandas can also read text from your clipboard.
Usage is as follows:
import pandas as pd
df = pd.read_csv('datos.csv')
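The other readers follow the same pattern. The sketch below is only an illustration: the in-memory strings csv_text and json_text are made-up stand-ins for files on disk, so that the example runs without any external data.

```python
import io

import pandas as pd

# Made-up sample data; in practice these would be paths to real files.
csv_text = "name,age\nAna,34\nLuis,29\n"
json_text = '[{"name": "Ana", "age": 34}, {"name": "Luis", "age": 29}]'

# Each reader returns a DataFrame, whatever the source format is.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```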
Because JSON is so widely used, it is worth knowing how pandas can load data in that format.
The function json_normalize flattens semi-structured JSON data (nested dictionaries and lists) into a pandas DataFrame.
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {
'governor': 'Rick Scott'
},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {
'governor': 'John Kasich'
},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
import pandas as pd
# json_normalize is exposed at the top level since pandas 1.0;
# importing it from pandas.io.json is deprecated.
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                              ['info', 'governor']])
The output would look like this:
index | name | population | info.governor | state | shortname |
---|---|---|---|---|---|
0 | Dade | 12345 | Rick Scott | Florida | FL |
1 | Broward | 40000 | Rick Scott | Florida | FL |
2 | Palm Beach | 60000 | Rick Scott | Florida | FL |
3 | Summit | 1234 | John Kasich | Ohio | OH |
4 | Cuyahoga | 1337 | John Kasich | Ohio | OH |
Example taken from the pandas documentation.
json_normalize still has trouble with some nested objects, but nothing that cannot be solved with a few additional operations.
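As an illustration of such additional operations, here is a minimal sketch. The records list is invented for this example. It assumes pandas 1.1 or later, where json_normalize is available at the top level and explode accepts ignore_index.

```python
import pandas as pd

# Invented sample records: nested dictionaries plus a list-valued field.
records = [
    {"id": 1, "user": {"name": "Ana", "address": {"city": "Madrid"}}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Luis", "address": {"city": "Bilbao"}}, "tags": ["c"]},
]

# Nested dictionaries become flat columns such as user_name and user_address_city.
flat = pd.json_normalize(records, sep="_")

# Lists are left untouched by json_normalize; explode gives each item its own row.
exploded = flat.explode("tags", ignore_index=True)

print(exploded)
```

Nested dictionaries are flattened automatically, while list-valued fields need an explicit step such as explode or the record_path argument used in the earlier example.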
To read a file whose values are separated by something other than the usual commas or tabs, pass the separator through the sep parameter (delimiter is an accepted alias).
Keep in mind that a custom delimiter may force pandas to switch from the C engine to the Python engine. The C engine is faster, but it only accepts single-character delimiters (plus the special case \s+). The Python engine is slower, but it accepts multi-character delimiters and even regular expressions.
import pandas as pd
df = pd.read_csv('datos.csv', sep='::', engine='python') # Setting the engine removes warning message.
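The following sketch shows both cases on made-up in-memory data (the strings raw and messy are assumptions for the example): a multi-character delimiter and a regular expression that accepts either a semicolon surrounded by optional spaces or a pipe.

```python
import io

import pandas as pd

# Made-up data using a two-character delimiter.
raw = "user::movie::rating\n1::10::4.5\n2::20::3.0\n"
df_double_colon = pd.read_csv(io.StringIO(raw), sep="::", engine="python")

# Made-up data mixing two delimiters; the regular expression matches both.
messy = "a ; b|c\n1 ; 2|3\n"
df_regex = pd.read_csv(io.StringIO(messy), sep=r"\s*;\s*|\|", engine="python")

print(df_double_colon)
print(df_regex)
```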
To display a progress bar while iterating over a dataset, the tqdm library is a good choice.
Suppose you want to modify something in a very large input dataset. You can check whether the process is still running or has stalled with:
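import numpy as np
from tqdm import tqdm
for elem in tqdm(np.nditer(elements), total=elements.size):  # nditer visits every element
    do_something()  # placeholder for the per-element work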
The output would be something like:
76%|████████████████████████████ | 7568/10000 [00:33<00:10, 229.00it/s]
NumPy is a Python library that provides mathematical functions for problems involving array and matrix computations. For MATLAB users, NumPy can be a good substitute. NumPy also has the advantage of being one of the oldest scientific libraries in the Python ecosystem, with a long history of active development. The following line loads the library:
import numpy as np
The core of NumPy is its array class. An array is similar to a Python list, with one condition: all the elements of a NumPy array must be of the same type (e.g. float, int, str). This makes mathematical operations faster and more memory-efficient than with lists.
For example, the next code creates a NumPy array with 2 rows and 3 columns. The function np.shape() returns the dimensions of the array, which is useful when tracking down array multiplication errors.
X = np.array([[1, 2, 3], [4, 5, 6]])
np.shape(X)  # (2, 3)
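The sketch below illustrates the two points above: how NumPy enforces a single element type, and how checking shapes explains a failed multiplication. The arrays mixed and Y are invented for the example.

```python
import numpy as np

# Mixing ints and floats: every element is upcast to a common type (float64).
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

X = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
Y = np.ones((3, 2))                   # shape (3, 2)

# Matrix product: inner dimensions match (3 and 3), result has shape (2, 2).
product = X @ Y
print(product.shape)

# Operations apply to the whole array at once, without an explicit loop.
doubled = X * 2

# A (2, 3) array cannot be matrix-multiplied by another (2, 3) array.
try:
    X @ X
except ValueError as error:
    mismatch_message = str(error)
    print("Shape mismatch:", mismatch_message)
```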
How to index and slice a NumPy array?
This is often one of the first questions when starting with this kind of numerical library. Using the previous X array, the next code shows how to access the first and the last element of the first row. Unlike MATLAB (or R), NumPy uses zero-based indexing, i.e. the first element has index 0, not 1.
first = X[0, 0]
last = X[0, -1]
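Beyond single elements, slices select whole rows, columns or sub-ranges. A short sketch using the same 2x3 array (the variable names are chosen for this example):

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])

first_row = X[0]         # whole first row
second_column = X[:, 1]  # every row, column with index 1
row_tail = X[1, 1:]      # second row, from column 1 to the end
greater_than_3 = X[X > 3]  # boolean mask: keeps the elements that satisfy the condition

print(first_row, second_column, row_tail, greater_than_3)
```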
As in MATLAB, the eye() function is helpful when you want to create a 2D array with ones on the diagonal and zeros elsewhere, i.e. an identity matrix. It is a common building block in many optimization algorithms.
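A brief sketch of eye() in use. The matrix A and the regularization weight 0.1 are arbitrary values picked for the illustration.

```python
import numpy as np

identity = np.eye(3)        # 3x3 identity matrix
shifted = np.eye(3, k=1)    # ones on the first diagonal above the main one

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0]])

# Multiplying by the identity leaves a matrix unchanged.
unchanged = A @ identity

# Adding a scaled identity to a matrix is a common regularization step.
regularized = A + 0.1 * np.eye(3)

print(identity)
print(regularized)
```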
NumPy has many useful functions for working with random numbers. They are available through the numpy.random module. Note that you must set a seed with seed() before calling these functions in order to get reproducible results.
np.random.seed(32) # example seed is set to 32
Some functions from numpy.random are: randn(), which draws samples from the standard normal distribution; randint(), which returns random integers from a low (inclusive) to a high (exclusive) value; shuffle(), which shuffles the contents of a sequence in place; and permutation(), which returns a randomly permuted copy of a sequence.
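The sketch below runs each of these functions and shows that resetting the seed reproduces the same draws. The sizes and ranges are arbitrary choices for the example.

```python
import numpy as np

np.random.seed(32)

normal_sample = np.random.randn(3, 2)        # 3x2 draws from the standard normal
integers = np.random.randint(0, 10, size=5)  # 5 integers in [0, 10)

items = np.arange(5)
permuted = np.random.permutation(items)  # returns a shuffled copy, items is untouched
np.random.shuffle(items)                 # shuffles items in place and returns None

# Resetting the seed reproduces exactly the same sequence of draws.
np.random.seed(32)
repeated_sample = np.random.randn(3, 2)

print(np.array_equal(normal_sample, repeated_sample))
```

Recent NumPy versions also offer np.random.default_rng(seed) as an alternative way to obtain a seeded generator object.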
SciPy (Scientific Python) is a Python library that is often mentioned alongside NumPy. SciPy extends the capabilities of NumPy with further functions for minimization, regression, Fourier transforms and many other tasks.
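As a small taste of the minimization tools, the sketch below uses scipy.optimize.minimize on a toy quadratic. The objective function and the starting point are invented for the illustration.

```python
import numpy as np
from scipy.optimize import minimize


def objective(x):
    """Toy quadratic bowl whose minimum lies at (3, -1)."""
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2


# Start the search from the origin; the default solver is used.
result = minimize(objective, x0=np.array([0.0, 0.0]))

print(result.x)    # location of the minimum found
print(result.fun)  # objective value at that location
```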
This part gives a brief introduction to pandas data structures, along with some advice. pandas is a Python library for data analysis with many functions for working with DataFrame structures. A DataFrame called df is used throughout the examples in this part. The next code imports the library and creates an empty DataFrame.
import pandas as pd
df = pd.DataFrame()
An easy way to start with pandas is to load a dataset from a CSV file, which returns a DataFrame. The next code shows how; fillna(" ") replaces any missing value with a single space.
df = pd.read_csv('../datos.csv').fillna(" ")
To introduce the library, some typical questions are answered below.
How to get information from a DataFrame structure?
It is useful to extract some information from your DataFrame, for example with the methods df.info() and df.describe(). The first one lists the columns with their types and non-null counts. The second one provides a brief statistical description of your dataset, for example the mean, standard deviation, minimum and maximum values, and percentiles.
A good way to check the types of all the columns in your DataFrame is the attribute df.dtypes.
A quick way to see the first and the last records is to use df.head(N) and df.tail(N) respectively, where N is the number of records that you want to check.
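The following sketch runs these inspection tools on a small made-up DataFrame (the column names and values are assumptions for the example):

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva", "Iker"],
    "age": [34, 29, 41, 25],
    "city": ["Madrid", "Bilbao", "Sevilla", "Vitoria"],
})

df.info()                # prints column names, non-null counts and types
summary = df.describe()  # statistics for the numeric columns
print(summary)
print(df.dtypes)         # type of each column

first_two = df.head(2)   # first 2 records
last_one = df.tail(1)    # last record
```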
How to select a certain field or slice a DataFrame structure?
The easiest way to select a column (field) of a DataFrame is the notation df['name']. The previous functions can then be applied to that column alone, for example df['name'].describe() or df['name'].dtypes. Several columns can be selected by passing a list inside the brackets, as in df[['name1', 'name2']].
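A minimal sketch of column selection and slicing on a made-up DataFrame. Besides the bracket notation described above, it also uses boolean filtering and the loc and iloc indexers, which the text does not cover, as one possible way to slice rows.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva", "Iker"],
    "age": [34, 29, 41, 25],
    "city": ["Madrid", "Bilbao", "Sevilla", "Vitoria"],
})

ages = df["age"]              # a single column is returned as a Series
subset = df[["name", "age"]]  # a list of columns returns a DataFrame
age_stats = df["age"].describe()

older_than_30 = df[df["age"] > 30]      # rows that satisfy a condition
by_label = df.loc[0:1, ["name"]]        # label-based slice, end label included
by_position = df.iloc[0:1, 0:2]         # position-based slice, end excluded

print(subset)
print(older_than_30)
```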
How to join, combine and group several DataFrame structures?
Almost every analysis requires merging and joining datasets, usually in a specific order and according to some relation between them. For this purpose pandas offers at least three useful functions: groupby(), merge() and concat().
groupby() is basically used to split the data into groups, apply an aggregation (e.g. sum, mean) or a transformation to each group, and combine the results. It returns a GroupBy object, which offers further functionality. It can also group by multiple columns. For example, to group by the columns named A and B and compute the mean value of each group:
group = df.groupby(['A', 'B']).mean()
It is also useful when you want to apply multiple functions to a group and collect the results. Again, describe() is handy after grouping and applying functions because it summarizes the output. The groupby functionality of pandas performs an operation on each of the pieces, much like the plyr and dplyr packages in R.
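Here is a sketch of grouping by two columns and applying several functions at once, on a made-up DataFrame whose columns A, B and C follow the naming used above:

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "B": ["u", "u", "u", "v"],
    "C": [1, 3, 5, 7],
})

# Mean of C for every (A, B) combination.
means = df.groupby(["A", "B"]).mean()

# Several aggregations at once, collected as columns of the result.
stats = df.groupby(["A", "B"])["C"].agg(["sum", "mean", "count"])

print(means)
print(stats)
```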
For SQL programmers, merge() joins two DataFrames on one or more keys, using familiar syntax (on, left, right, inner, outer...). For example:
pd.merge(df1, df2, on='key', how='outer')
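To make the join types concrete, the sketch below builds two small made-up DataFrames and compares an outer and an inner join. The indicator=True option adds a _merge column that tells where each row came from.

```python
import pandas as pd

# Made-up sample data sharing the column "key".
df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_value": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "right_value": [20, 30, 40]})

# Outer join: keeps every key from both sides and fills the gaps with NaN.
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)

# Inner join: keeps only the keys present on both sides.
inner = pd.merge(df1, df2, on="key", how="inner")

print(outer)
print(inner)
```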
pandas also provides concat() as a way to combine DataFrame structures. It is similar to UNION ALL in SQL, since duplicate rows are kept. It is useful when different approaches or models each produce a part of the final result and you just want to combine them.
pd.concat([df1, df2])
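A short sketch with two made-up DataFrames that share one row. It shows that concat() keeps duplicates and how they could be dropped afterwards if set-like behaviour is wanted.

```python
import pandas as pd

# Made-up partial results; the row with id 2 appears in both.
df1 = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
df2 = pd.DataFrame({"id": [2, 3], "score": [0.7, 0.9]})

# Stack the rows; ignore_index=True renumbers the index from 0.
stacked = pd.concat([df1, df2], ignore_index=True)

# Dropping duplicates afterwards removes the repeated row.
deduplicated = stacked.drop_duplicates().reset_index(drop=True)

print(stacked)
print(deduplicated)
```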
Matplotlib, seaborn and Bokeh are libraries used for plotting and visualization. They are usually imported as follows:
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure
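As a minimal illustration of plotting, the sketch below draws a sine curve with Matplotlib only. The Agg backend and the in-memory buffer are choices made so that the example runs without a display and without writing files; in a notebook or desktop session plt.show() would be used instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG into memory instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```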
The main Python library for machine learning is scikit-learn. It is built on top of NumPy, SciPy and Matplotlib, and it is well documented.
# k nearest neighbours
from sklearn.neighbors import KNeighborsClassifier
# Random Forest
from sklearn.ensemble import RandomForestClassifier
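The sketch below trains both classifiers on the iris dataset bundled with scikit-learn. The dataset, the 70/30 split and the hyperparameters are arbitrary choices for the illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small example dataset shipped with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples to evaluate the models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Both estimators share the same fit / score interface.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

knn_accuracy = knn.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)

print("k-NN accuracy:", knn_accuracy)
print("Random forest accuracy:", forest_accuracy)
```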