This project demonstrates how to explode JSON arrays into a relational format using Spark.
Example:
This source JSON object:
{"business_id":"1","categories":["Tobacco Shops","Nightlife","Vape Shops"],"price":100}
will be converted into the following table in Parquet format:
business_id | category | price |
---|---|---|
1 | Tobacco Shops | 100 |
1 | Nightlife | 100 |
1 | Vape Shops | 100 |
As input it uses the JSON files provided by the Yelp Dataset Challenge round 9.
The project consists of a Spark app that performs the transformation and a Docker image that provides the required software and libraries to run it.
The Spark app reads the JSON files one after another into a DataFrame, validates the schema with the help of case classes, explodes every array into an additional dataset, and passes all remaining attributes through into the parent table. All generated datasets are persisted as Parquet.
The parent table additionally contains the array values as a comma-separated string (helpful for some analyses because it avoids a join).
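In a nutshell, the transformation looks like the following sketch. This is a simplified illustration, not the exact code in ./src; the case class, column names, and paths are taken from the example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat_ws, explode}

// Illustrative case class used to validate the schema of one entity type.
case class Business(business_id: String, categories: Seq[String], price: Long)

object ExplodeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("json2table-sketch").getOrCreate()
    import spark.implicits._

    // Read the JSON lines and enforce the expected schema via the case class.
    val business = spark.read.json("business.json").as[Business]

    // Parent table: all scalar attributes plus the array as a comma-separated string.
    business
      .withColumn("categories", concat_ws(",", $"categories"))
      .write.parquet("output/businessAsTable")

    // Child table: one row per array element, keyed by the parent id.
    business
      .select($"business_id", explode($"categories").as("category"))
      .write.parquet("output/businessCategories")

    spark.stop()
  }
}
```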
The Spark app is written in Scala and built with SBT. It uses ScalaTest for unit and integration tests. The sources are in ./src.
To run the Spark app with spark-submit, this project also provides Spark 2.1 together with Zeppelin as a Docker image.
The Docker image is available in a public repository on Docker Hub.
A sample Zeppelin notebook for analyzing the exploded tables is located at ./zeppelin_notebooks/dataset-analysis.json.
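To give an idea of the kind of analysis such a notebook can run, here is an illustrative Zeppelin paragraph (not the notebook's actual content; the path and column name are assumptions based on the output layout described below):

```scala
// Zeppelin paragraph (Scala/Spark): top categories by number of businesses.
import org.apache.spark.sql.functions.desc

val categories = spark.read.parquet("/home/output/businessCategories")
categories
  .groupBy("category")
  .count()
  .orderBy(desc("count"))
  .show(20)
```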
- docker 1.13.1
- spark 2.1
- scala 2.11
- sbt 0.13.9
To get this project running you need at least 10 GB of free disk space. Follow these steps to use the compiled Spark app in the provided Docker image.
- Download the tar file from the Yelp Dataset Challenge round 9, which contains the input JSON files.
- Run the Docker container
It will download the image from Docker Hub and run it in a container:
docker run -it -p 8088:8080 mirkoprescha/spark-zeppelin
If you want to use Zeppelin right away, wait roughly 10 seconds until the daemon has started.
- Copy yelp_dataset_challenge_round9.tgz to the Docker container
Start another shell session and copy the file into your most recently started container:
docker cp yelp_dataset_challenge_round9.tgz $(docker ps -l -q):/home/
- Run the Spark job
Go back to your first session; you should be connected as root inside the Docker container.
cd /home
spark-submit --class com.mprescha.json2Table.Json2Table \
/usr/local/bin/spark-json-to-table_2.11-1.0.jar \
/home/yelp_dataset_challenge_round9.tgz
Spark processing will take roughly 5 minutes.
If the job ran successfully, the following output structure is generated in /home/output/:
- businessAsTable
- businessAttributes
- businessCategories
- businessHours
- checkinAsTable
- checkinTimes
- review
- tip
- userAsTable
- userElite
- userFriends
Each subdirectory represents an entity type that can be analyzed in the Zeppelin notebook.
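If you want to inspect a table directly instead of using the notebook, a spark-shell session inside the container can read any of the Parquet directories. A minimal sketch (the column names depend on the generated table):

```scala
// spark-shell inside the container: inspect one of the generated entity types.
val categories = spark.read.parquet("/home/output/businessCategories")
categories.printSchema()                // column names of the exploded table
categories.show(10, truncate = false)   // a few sample rows
println(s"row count: ${categories.count()}")
```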
You can verify the result on your machine with du -h output/.
This should produce output like the following:
root@c6c0a39bc1fa:/home# du -h output/
4.8M output/businessCategories
17M output/checkinAsTable
4.2M output/businessHours
4.8M output/businessAttributes
703M output/userAsTable
712M output/userFriends
25M output/userElite
9.5M output/checkinTimes
1.8G output/review
21M output/businessAsTable
55M output/tip
3.3G output/
- Go to the Zeppelin UI: http://localhost:8088/#/
Open the notebook called analysis.
Accept ("save") the interpreter bindings.
In the menu bar, click the play button to run all paragraphs.
If the notebook is not available, download it from this Git repo and import it into Zeppelin. Alternatively, check the results of the notebook on Zeppelin Hub.
Clone this project.
After any changes to the Spark app you need to build a new package with
sbt package
If all tests are successful, place the package at
./spark-docker/bin/spark-json-to-table_2.11-1.0.jar
After changes to the Dockerfile, go to the project home directory and run
docker build --file spark-docker/Dockerfile -t mirkoprescha/spark-zeppelin .