spark-json-to-parquet-table

This project demonstrates how to explode JSON arrays into a relational format using Spark. It is a sample ETL process written in Spark 2.1 that makes use of Dataset type safety and includes unit tests; it runs on a Docker image providing Spark and Zeppelin.

Example:

This source JSON object: {"business_id":"1","categories":["Tobacco Shops","Nightlife","Vape Shops"],"price":100}

will be converted into the following table in Parquet format:

business_id  category       price
1            Tobacco Shops  100
1            Nightlife      100
1            Vape Shops     100
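A minimal sketch of this transformation using Spark's explode function (column and class names follow the example above; the actual implementation lives in ./src and may differ):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Case class mirroring the JSON schema of the example above
case class Business(business_id: String, categories: Seq[String], price: Long)

object ExplodeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explode-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq(Business("1", Seq("Tobacco Shops", "Nightlife", "Vape Shops"), 100L)).toDS()

    // explode() turns each array element into its own row; scalar columns repeat
    ds.select($"business_id", explode($"categories").as("category"), $"price").show()
  }
}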

As input it uses the JSON files provided by the Yelp Dataset Challenge (round 9).

This project consists of a Spark app that performs the transformation and a Docker image providing the required software and libraries to run it.

Spark-App

The Spark-App reads all JSON files one after another into a DataFrame, validates the schema with the help of case classes, explodes each array into an additional dataset, and passes all remaining attributes through into the parent table. All generated datasets are persisted as Parquet. The parent table additionally contains the array values as a comma-separated string (helpful for some analyses to avoid joins). The Spark-App is written in Scala and built with SBT. It uses scalatest for unit and integration tests. The sources are in ./src.
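A condensed sketch of that flow (assuming an active SparkSession named spark with import spark.implicits._ in scope, reusing the Business case class from the sketch above; the input path is illustrative):

import org.apache.spark.sql.functions.{concat_ws, explode}

// Schema validation: converting the raw DataFrame into a typed Dataset
// fails fast if the JSON does not match the case class
val businesses = spark.read.json("/home/business.json").as[Business]

// Child table: one row per array element
businesses
  .select($"business_id", explode($"categories").as("category"))
  .write.parquet("/home/output/businessCategories")

// Parent table: the array is additionally kept as a comma-separated string to avoid joins
businesses
  .withColumn("categories", concat_ws(",", $"categories"))
  .write.parquet("/home/output/businessAsTable")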

Spark-Zeppelin Docker Image

To run the Spark-App with spark-submit, this project also provides Spark 2.1 together with Zeppelin as a Docker image. The image is published in a public repository on Docker Hub. A sample Zeppelin notebook for analyzing the exploded tables is located at ./zeppelin_notebooks/dataset-analysis.json.

Works with

  • docker 1.13.1
  • spark 2.1
  • scala 2.11
  • sbt 0.13.9

Getting Started

To get this project running you need at least 10 GB of free disk space. Follow these steps to run the compiled Spark-App in the provided Docker image.

  1. Download the tar file from the Yelp Dataset Challenge (round 9) containing the input JSON files

  2. Run the docker container

This will download the image from Docker Hub and run it in a container:

docker run -it -p 8088:8080 mirkoprescha/spark-zeppelin

If you want to use Zeppelin immediately, wait roughly 10 seconds until the daemon has started.

  3. Copy yelp_dataset_challenge_round9.tgz to the docker container

Start another shell session and copy the file into the docker container (your most recently started container):

docker cp yelp_dataset_challenge_round9.tgz $(docker ps -l -q):/home/

  4. Run the spark job

Go back to your first session. You should be connected as root inside the docker container:

cd /home
spark-submit --class com.mprescha.json2Table.Json2Table \
      /usr/local/bin/spark-json-to-table_2.11-1.0.jar \
      /home/yelp_dataset_challenge_round9.tgz
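The jar's main class receives the tgz path as its only argument. A hypothetical skeleton of such an entry point (not the actual Json2Table source) could look like:

import org.apache.spark.sql.SparkSession

object Json2Table {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // e.g. /home/yelp_dataset_challenge_round9.tgz
    val spark = SparkSession.builder().appName("json2Table").getOrCreate()
    // extract the archive, then for each JSON file: read, validate via
    // case classes, explode arrays, and persist everything as parquet under /home/output/
    spark.stop()
  }
}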

Spark processing will take roughly 5 minutes.

If the job ran successfully, the following output structure is generated in /home/output/:

  • businessAsTable
  • businessAttributes
  • businessCategories
  • businessHours
  • checkinAsTable
  • checkinTimes
  • review
  • tip
  • userAsTable
  • userElite
  • userFriends

Each subdirectory represents an entity type that can be analyzed in the Zeppelin notebook.
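For a quick check outside Zeppelin, the same directories can also be queried from spark-shell; a small example (the column name category is an assumption carried over from the sketch above):

// Count businesses per category from the exploded table
val categories = spark.read.parquet("/home/output/businessCategories")
categories.groupBy("category").count().orderBy(org.apache.spark.sql.functions.desc("count")).show(10)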

You can verify the result on your machine with du -h output/.

This should produce output like this:

root@c6c0a39bc1fa:/home# du -h output/
4.8M	output/businessCategories
17M	output/checkinAsTable
4.2M	output/businessHours
4.8M	output/businessAttributes
703M	output/userAsTable
712M	output/userFriends
25M	output/userElite
9.5M	output/checkinTimes
1.8G	output/review
21M	output/businessAsTable
55M	output/tip
3.3G	output/
  5. Go to the Zeppelin UI: http://localhost:8088/#/

Open the notebook called analysis and accept ("save") the interpreter bindings. In the menu bar, click the play button to run all paragraphs.

If the notebook is not available, download it from this git repo and import it into Zeppelin. Alternatively, check the results of the notebook on Zeppelin Hub.

Deploy changes in Spark-App

Clone this project.

After any changes to the spark-app you need to build a new package with

sbt package

If all tests are successful, place the package at ./spark-docker/bin/spark-json-to-table_2.11-1.0.jar.
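A minimal sketch of what one of those scalatest cases could look like (class names are hypothetical; Business is the case class from the example above):

import org.scalatest.FlatSpec
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

class ExplodeSpec extends FlatSpec {
  "exploding categories" should "produce one row per array element" in {
    val spark = SparkSession.builder().appName("test").master("local[1]").getOrCreate()
    import spark.implicits._
    val ds = Seq(Business("1", Seq("Tobacco Shops", "Nightlife"), 100L)).toDS()
    assert(ds.select(explode($"categories")).count() === 2L)
    spark.stop()
  }
}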

Changes in Dockerfile

After changes to the Dockerfile, go to the project home directory and run

docker build --file spark-docker/Dockerfile -t mirkoprescha/spark-zeppelin .
