The repo contains a Hadoop cluster configuration and a client-server app. The goal is to predict a smartphone's price range using a machine learning model built with Apache Spark, and to visualize charts of smartphone statistics using data queried from Apache Hive.
This image represents one possible output result.
The application has been tested in both local and cluster modes.
We used ZeroTier to connect our two machines.
These configurations are available as READMEs too:
To run the project you need these packages:
- NumPy library
- JPMML-SparkML jar
- PySpark2PMML
- OpenScoring Server
- Node.js
- Google Chrome
First, download the JPMML-SparkML jar and add it to the Spark jars folder. After that, you can install PySpark2PMML. Before downloading and installing any of the PMML libraries, check which Spark version you have and consult the JPMML-SparkML and PySpark2PMML documentation to find the release compatible with it. Remember to install NumPy too, because it is used by PySpark2PMML.
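As a minimal sketch of the version check, the snippet below reads the installed PySpark version and reduces it to its `major.minor` release line, which is the part you would compare against the compatibility notes in the PMML documentation. The helper names `spark_major_minor` and `installed_spark_version` are hypothetical and not part of this project; the sketch also assumes PySpark is importable from the Python interpreter you use.

```python
import re


def spark_major_minor(version: str) -> str:
    """Return the 'major.minor' part of a Spark version string such as '3.3.2'."""
    match = re.match(r"^(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Spark version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def installed_spark_version():
    """Return pyspark.__version__, or None if PySpark is not importable."""
    try:
        import pyspark
    except ImportError:
        return None
    return pyspark.__version__


if __name__ == "__main__":
    version = installed_spark_version()
    if version is None:
        # Fall back to the Spark command-line tools if PySpark is not on the Python path.
        print("PySpark not found; run `spark-submit --version` instead.")
    else:
        print("Spark release line:", spark_major_minor(version))
```

The script only reports the release line; which JPMML-SparkML and PySpark2PMML releases match it still has to be looked up in their documentation.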
The project uses Google Chrome to open the client because it allows the CORS policy to be managed.
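One common way to relax CORS checks in Chrome during development is to start it with the `--disable-web-security` flag together with a separate `--user-data-dir`. Whether the project's scripts do exactly this is an assumption; the sketch below only builds such a command line. The function name `chrome_command`, the binary name `google-chrome`, the URL and the profile directory are all hypothetical placeholders.

```python
from typing import List


def chrome_command(url: str, profile_dir: str, chrome_bin: str = "google-chrome") -> List[str]:
    """Build an argument list that opens `url` in Chrome with web security disabled.

    Assumption: the client is opened this way; check the project's scripts
    for the actual flags before relying on it.
    """
    return [
        chrome_bin,
        "--disable-web-security",           # turns off same-origin/CORS enforcement
        f"--user-data-dir={profile_dir}",   # throwaway profile, kept apart from your normal one
        url,
    ]


if __name__ == "__main__":
    # Placeholder URL and profile directory; print the command instead of running it.
    print(" ".join(chrome_command("http://localhost:3000", "/tmp/chrome-dev-profile")))
```

A browser started like this should only be used for the local client, since it no longer enforces the same-origin policy.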
In both the cluster and local folders there is a folder named script, which contains the main scripts for launching the project.
These scripts are:
- runHadoop.sh
- runHive.sh
- runSparkAPP.sh
- runProject.sh
You only need to run runProject.sh, but before running it you should change all the paths contained in the other scripts, such as the Hadoop, Spark, Hive and Python paths.
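Before launching, it can help to confirm that the paths you wrote into the scripts exist on your machine. The following is a small sketch of such a check; the function `missing_paths` and the example paths are hypothetical placeholders and do not come from the project's scripts.

```python
import os
from typing import Dict, List


def missing_paths(paths: Dict[str, str]) -> List[str]:
    """Return the labels whose configured path does not exist on this machine."""
    return [label for label, path in paths.items() if not os.path.exists(path)]


if __name__ == "__main__":
    # Placeholder values: replace them with the paths used in your own scripts.
    configured = {
        "Hadoop": "/path/to/hadoop",
        "Spark": "/path/to/spark",
        "Hive": "/path/to/hive",
        "Python": "/path/to/python3",
    }
    missing = missing_paths(configured)
    if missing:
        print("Fix these paths before running runProject.sh:", ", ".join(missing))
    else:
        print("All configured paths exist.")
```

The check only tests that each path exists; it does not verify that the installation found there is the right version.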