The purpose of this project is to analyze and visualize parking and camera violations in NYC. Along the way we cover containerization, terminal navigation, Python scripting, artifact deployment, and AWS EC2 provisioning.
According to NYC Open Data, the Open Parking and Camera Violations (OPCV) dataset was initially loaded on 05/14/2016 with all violations contained in the mainframe database as of May 2016. New or open violations are updated weekly, on Sundays.
You will need to install Docker and Docker Compose, register a Docker Hub account, create an Amazon EC2 server account, and set up an AWS EC2 instance. You will also need to create an NYC App Token from NYC Open Data.
The project is broken into three parts.
- Part1: Python Scripting (Docker & AWS EC2)
- Part2: Loading into ElasticSearch
- Part3: Visualizing and Analysis on Kibana
Develop a Python command-line interface that connects to the OPCV API and demonstrates that the data is accessible from Python.
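As a rough illustration of connecting to the API, the sketch below requests one page of records using only the standard library. The endpoint URL and dataset ID, the `$limit`/`$offset` paging parameters, and the `X-App-Token` header are assumptions based on how NYC Open Data's Socrata API is commonly used, and `build_url` and `fetch_page` are hypothetical names. The project's own script may use a client library instead.

```python
import json
import urllib.parse
import urllib.request

# Assumed Socrata endpoint for the OPCV dataset; verify the dataset ID on NYC Open Data.
BASE_URL = "https://data.cityofnewyork.us/resource/nc67-uf89.json"


def build_url(page_size, page_index):
    """Return the URL for one page, using $limit/$offset style paging."""
    params = {"$limit": page_size, "$offset": page_size * page_index}
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_page(page_size, page_index, app_token):
    """Request one page and return the decoded list of records."""
    request = urllib.request.Request(
        build_url(page_size, page_index),
        headers={"X-App-Token": app_token},  # assumed header name for the app token
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_page(5, 0, os.environ["APP_KEY"])` inside the container would then return the first five records, assuming the token is passed in as `APP_KEY` as in the `docker run` commands in this document.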
Build an image from the Dockerfile and push it to Docker Hub. Finally, run the image on an AWS EC2 server.
- Build an image from the Dockerfile on your computer:
$ docker build -t image_name .
- Create a container and run the Python script locally:
$ docker run -v $(pwd):/app -e APP_KEY=API_Token -t image_name python main.py --page_size=1000 --num_pages=4 --output=results.json
Key Arguments
- --page_size: This command line argument is required. It sets how many records to request from the API per call.
- --num_pages: This command line argument is optional. If not provided, your script should continue requesting data until the entirety of the content has been exhausted. If this argument is provided, continue querying for data num_pages times.
- --output: This command line argument is optional. If not provided, your script should simply print results to stdout. If provided, your script should write the data to the file (in this case, results.json).
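The three arguments above could be wired up roughly as follows. This is a minimal sketch, not the project's actual `main.py`: `parse_args`, `iter_pages`, and `write_pages` are hypothetical names, `fetch_page` stands in for whatever function calls the API, and writing one JSON record per line is an assumed output format.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the three command line arguments described above."""
    parser = argparse.ArgumentParser(description="Download OPCV records")
    parser.add_argument("--page_size", type=int, required=True,
                        help="records to request per API call")
    parser.add_argument("--num_pages", type=int, default=None,
                        help="pages to request; omit to read until exhausted")
    parser.add_argument("--output", default=None,
                        help="file to write to; omit to print to stdout")
    return parser.parse_args(argv)


def iter_pages(fetch_page, page_size, num_pages=None):
    """Yield pages until num_pages is reached or an empty page comes back."""
    page_index = 0
    while num_pages is None or page_index < num_pages:
        rows = fetch_page(page_size, page_index)
        if not rows:
            break
        yield rows
        page_index += 1


def write_pages(pages, output=None):
    """Write one JSON record per line to the output file, or print to stdout."""
    if output is None:
        for rows in pages:
            for row in rows:
                print(json.dumps(row))
    else:
        with open(output, "w") as handle:
            for rows in pages:
                for row in rows:
                    handle.write(json.dumps(row) + "\n")
```

Passing the fetch function in as a parameter keeps the paging loop independent of the HTTP client, so it can be exercised with a stub before an API token is available.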
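- Deploy via Docker Hub. Tag the local image with your Docker Hub username, then push it:
$ docker tag image_name your_username/image_name
$ docker push your_username/image_name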
- Pull the image from Docker Hub and run it on AWS EC2:
~$ sudo docker pull your_username/image_name
~$ sudo docker run -it your_username/image_name /bin/bash
~$ sudo docker run -e APP_KEY=API_token -it your_username/image_name python main.py --page_size=1000 --num_pages=4 --output=results.json
In this part, you will leverage docker-compose to bring up a service that encapsulates your bigdata1 container and an elasticsearch container and ensures that they are able to interact.
You must update your original script (from Part_1) to now not only download the data but also load it into the elasticsearch instance.
Under the Part_2/bigdata1 folder:
- docker-compose.yml - create it and include three images/services (python, elasticsearch, kibana) to interact with your container
- requirements.txt - add elasticsearch
- main.py - add a 'push_elastic' argument and pass it to the call_api function
Under the Part_2/bigdata1/src/bigdata1 folder:
- callapi.py:
  - import Elasticsearch and datetime, date
  - add the create_update_index() function - creates an elasticsearch index to store the data
  - add the format_push_data() function - formats Python data types and pushes the data to elasticsearch
  - update the call_api function - takes the argument from the interface (main.py)
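The formatting half of that step might look something like the sketch below, which converts the string values returned by the API into typed values before they are indexed. It is an illustration only, not the project's `format_push_data()`: the helper name `format_record`, the field names, and the MM/DD/YYYY date format are all assumptions to check against the real records. The push to elasticsearch is left out.

```python
from datetime import datetime

# Assumed field names; check them against the records the API actually returns.
AMOUNT_FIELDS = ("fine_amount", "penalty_amount", "interest_amount",
                 "reduction_amount", "payment_amount", "amount_due")


def format_record(record):
    """Return a copy of one record with numeric amounts and an ISO issue date."""
    doc = dict(record)
    for field in AMOUNT_FIELDS:
        if field in doc:
            try:
                doc[field] = float(doc[field])
            except (TypeError, ValueError):
                doc.pop(field)  # drop values that cannot be parsed as numbers
    if "issue_date" in doc:
        try:
            # Assumes dates arrive as MM/DD/YYYY strings.
            parsed = datetime.strptime(doc["issue_date"], "%m/%d/%Y")
            doc["issue_date"] = parsed.date().isoformat()
        except (TypeError, ValueError):
            doc.pop("issue_date")  # drop malformed dates rather than fail the load
    return doc
```

Dropping unparseable values, rather than raising, is one possible design choice. It keeps a single bad record from stopping a long load, at the cost of silently losing that field.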
- Build ElasticSearch and Kibana (Create Images):
$ docker-compose build pyth
- Run ElasticSearch and Kibana (Run Services):
$ docker-compose up -d
- Interact at the interface:
$ docker-compose run -e APP_KEY=API_token -v ${PWD}:/app pyth python -m main --page_size=100 --num_pages=10 --output=results.json --push_elastic=True
- Query ElasticSearch (the URL is quoted so the shell does not treat & as a background operator):
$ curl -o output.txt 'http://localhost:9200/opcv/_search?q=state:NY&size=10'
- Return 10 records where state is NY
- output.txt will be generated under Part_2/bigdata1
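The same query can be issued from Python. The sketch below builds the search URL used in the curl command and reads the matching documents from the response. `search_url` and `search` are hypothetical helper names, and the host and `opcv` index are taken from the command above.

```python
import json
import urllib.parse
import urllib.request


def search_url(index, query, size, host="http://localhost:9200"):
    """Build a URI search request such as /opcv/_search?q=state:NY&size=10."""
    params = urllib.parse.urlencode({"q": query, "size": size})
    return f"{host}/{index}/_search?{params}"


def search(index, query, size):
    """Run the search and return the list of matching documents."""
    with urllib.request.urlopen(search_url(index, query, size)) as response:
        return json.load(response)["hits"]["hits"]
```

With the services running, `search("opcv", "state:NY", 10)` should return the same ten hits that curl writes to output.txt. Note that `urlencode` percent-encodes the colon in the query.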
You have successfully pushed to ElasticSearch
You can now visualize and analyze the data in Kibana by accessing http://localhost:5601/app/kibana
Then go to the Management tab, define your index pattern using indices, and set the Time Filter field to issue_date.
Here are some data visualization examples:
Horizontal Bar Chart - Average Reduction Amount by County
Pie Chart - Top 10 Violation Types
Bar Chart - Number of Violations by County
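Charts like the top 10 pie chart are typically backed by a terms aggregation. As a hedged sketch, the request body below asks Elasticsearch for the ten most frequent values of one field. `top_terms_body` is a hypothetical helper, and the field name `violation.keyword` assumes the index uses Elasticsearch's default dynamic mapping for text fields.

```python
import json


def top_terms_body(field, size=10):
    """Build a search body that returns only the most frequent values of a field."""
    return {
        "size": 0,  # skip the individual hits, keep only the aggregation
        "aggs": {"top_terms": {"terms": {"field": field, "size": size}}},
    }


body = top_terms_body("violation.keyword")
print(json.dumps(body, indent=2))
```

Posting this body to the index's `_search` endpoint should return buckets comparable to the slices Kibana draws, which is a handy way to cross-check a visualization.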