- Introduction
- Project Components
  - 2.1. ETL Pipeline
  - 2.2. ML Pipeline
  - 2.3. Flask Web App
- Program
  - 3.1. Dependencies
  - 3.2. Running
  - 3.3. Additional Material
- Acknowledgement
- Screenshots
## Introduction

This project is part of the Data Science Nanodegree Program given by Udacity in collaboration with Figure Eight.
The initial dataset contains pre-labelled messages that were sent during disaster events.
The aim of the project is to build a Natural Language Processing tool that categorizes messages.
## Project Components

There are three main components.
### 2.1. ETL Pipeline

File `process_data.py` contains an ETL (Extract, Transform, Load) pipeline that:

- Loads the `messages` and `categories` datasets
- Merges the two datasets
- Cleans the data
- Stores the result in an SQLite database
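The steps above can be sketched as follows. The miniature DataFrames, the `water-1;roads-0` category encoding, and the `messages` table name are illustrative assumptions, not the project's actual data layout:

```python
import pandas as pd
from sqlalchemy import create_engine

# Miniature stand-ins for disaster_messages.csv and disaster_categories.csv;
# the column layout here is an illustrative assumption.
messages = pd.DataFrame({
    "id": [1, 2],
    "message": ["We need water", "Roads are blocked"],
})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": ["water-1;roads-0", "water-0;roads-1"],
})

# Merge the two datasets on their shared id column.
df = messages.merge(categories, on="id")

# Clean: expand the semicolon-separated category string into binary columns.
expanded = df["categories"].str.split(";", expand=True)
for col in expanded.columns:
    name = expanded[col].iloc[0][:-2]              # e.g. "water-1" -> "water"
    df[name] = expanded[col].str[-1].astype(int)   # last character is 0 or 1
df = df.drop(columns=["categories"]).drop_duplicates()

# Store the cleaned table in an SQLite database (in-memory here; the real
# script would write to a file such as data/DisasterResponse.db).
engine = create_engine("sqlite://")
df.to_sql("messages", engine, index=False, if_exists="replace")
```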
### 2.2. ML Pipeline

File `train_classifier.py` contains a machine learning pipeline that:

- Loads data from the SQLite database
- Splits the dataset into training and test sets
- Builds a text processing and machine learning pipeline
- Trains and tunes a model using GridSearchCV
- Outputs results on the test set
- Exports the final model as a pickle file
### 2.3. Flask Web App

File `run.py` launches a web app where users can input a message and have it classified with the previously trained model.
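A minimal sketch of such an app follows. The `/classify` route, the category names, and the stub model are hypothetical stand-ins so the example is self-contained; the project's actual `run.py` unpickles the trained classifier instead:

```python
from flask import Flask, request

app = Flask(__name__)

# Stub standing in for the pickled classifier so this sketch runs on its own;
# the real app would load models/classifier.pkl at startup.
class StubModel:
    def predict(self, texts):
        return [[1, 0] for _ in texts]  # pretend every message is "water"

model = StubModel()
CATEGORIES = ["water", "roads"]  # illustrative category names

# Hypothetical route: classify the message passed as ?query=...
@app.route("/classify")
def classify():
    query = request.args.get("query", "")
    labels = model.predict([query])[0]
    return {c: int(v) for c, v in zip(CATEGORIES, labels)}

# To serve it on the address the README points to:
# app.run(host="127.0.0.1", port=4000)
```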
## Program

### 3.1. Dependencies

- Python 3.5+
- Machine Learning Libraries: NumPy, Pandas, Scikit-Learn
- Natural Language Processing Libraries: NLTK
- SQLite Database Libraries: SQLAlchemy
- Web App and Data Visualization: Flask, Plotly
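Since no requirements file is mentioned, the libraries above can be installed directly under their standard PyPI names:

```shell
pip install numpy pandas scikit-learn nltk sqlalchemy flask plotly
```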
### 3.2. Running

Run the following commands in the project's root directory to set up your database and model. Note that steps 1 and 2 (ETL and ML) are optional because a trained model is already included in the repository.
- To run the ETL pipeline that cleans data and stores it in the database:

  ```shell
  python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
  ```

- To run the ML pipeline that trains and saves the classifier:

  ```shell
  python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
  ```

- To run the web app (run it from the app's directory):

  ```shell
  python run.py
  ```

- Go to http://127.0.0.1:4000/
### 3.3. Additional Material

In the `Notebooks` directory you will find the ETL and ML pipelines in Jupyter notebooks, which may help you understand how the model works step by step.

Please note that the ML pipeline in the notebook (`.ipynb`) performs parameter tuning, whereas the one in script (`.py`) format uses a fixed set of parameters. The script's parameters are not the optimal ones found in the notebook: the optimal model produced a large pickle file (>100 MB), and simplifying the parameters considerably reduced its size.
## Acknowledgement

- Udacity for providing such a complete Data Science Nanodegree Program.
- Figure Eight for providing the dataset.