This repository contains Jupyter notebooks and associated data for the District Data Labs introductory training on data science and big data.
Before running any code in this repository, make sure you have installed the requirements (preferably in a virtualenv
or conda
environment) with:
pip install -r requirements.txt
This repository contains the following notebooks:
exploratory_data_analysis.ipynb
: a notebook demonstrating basic exploratory data analysis (EDA) techniques using Yelp and U.S. Census datasupervised_learning.ipynb
: a notebook demonstrating supervised learning techniques on baseball player statisticsdata_collection.ipynb
: a notebook demonstrating data acquisition through webscraping public speeches by the U.S. Secretary of Defenseunsupervised_learning.ipynb
: a notebook demonstrating unsupervised learning (clustering) on the public speeches referenced abovestring_matching.ipynb
: a notebook demonstrating techniques for entity resolution using string matchingelasticsearch_overview.ipynb
: a notebook demonstrating how to interact with an Elasticsearch cluster (requires access to Elasticsearch either remotely or locally--using Docker, etc.)
This repository is self-contained: the relevant data for the notebooks is available in /data
. There are a number of .csv
files. Each of these files provenance is explained in the relevant notebook where it gets used.