This project creates a Hadoop and Spark cluster on Amazon AWS using Terraform
Name | Description | Default |
---|---|---|
region | AWS region | us-east-1 |
access_key | AWS access key | |
secret_key | AWS secret key | |
token | AWS token | null |
instance_type | AWS instance type | m5.xlarge |
ami_image | AWS AMI image | ami-0885b1f6bd170450c |
key_name | Name of the key pair used between nodes | localkey |
key_path | Path of the key pair used between nodes | . |
aws_key_name | AWS key pair used to connect to nodes | amzkey |
amz_key_path | AWS key pair path used to connect to nodes | amzkey.pem |
namenode_count | Namenode count | 1 |
datanode_count | Datanode count | 3 |
ips | Default private IPs used for the nodes | See variables.tf |
hostnames | Default private hostnames used for nodes | See variables.tf |
- Default AMI image: ami-0885b1f6bd170450c (Ubuntu 20.04, amd64, hvm-ssd)
- Spark: 3.0.1
- Hadoop: 2.7.7
- Python: latest available (currently 3.8)
- Java: OpenJDK 8u275 (JDK)
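If you want to double-check what actually ended up on a node after the install script has run, a quick sketch (assuming Spark is under /opt/spark-3.0.1-bin-hadoop2.7 and $HADOOP_HOME is set, as the commands further below suggest):

java -version
python3 --version
$HADOOP_HOME/bin/hadoop version
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --version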
- app/: folder where you can put your application; it will be copied to the namenode
- install-all.sh: script executed on every node; it installs Hadoop/Spark and does all the configuration for you
- main.tf: definition of the resources
- output.tf: terraform output declaration
- variables.tf: terraform variable declaration
- Download and install Terraform
- Download the project and unzip it
- Open the terraform project folder "spark-terraform-master/"
- Create a file named "terraform.tfvars" and paste this:
access_key="<YOUR AWS ACCESS KEY>"
secret_key="<YOUR AWS SECRET KEY>"
token="<YOUR AWS TOKEN>"
Note: if you do not set the other variables (you can find them in variables.tf), Terraform will create a cluster in region "us-east-1", with 1 namenode, 3 datanodes and instance type m5.xlarge.
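For example, to override some of the defaults you could add lines like these to the same terraform.tfvars (the values below are purely illustrative):

region="us-east-2"
instance_type="m5.2xlarge"
datanode_count=5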
- Put your application files into the "app" terraform project folder
- Open a terminal and generate a new ssh-key
ssh-keygen -f <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/localkey
where <PATH_TO_SPARK_TERRAFORM> is the path to the /spark-terraform-master/ folder (e.g. /home/user/)
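If you would rather generate the key non-interactively (the nodes presumably use it for passwordless SSH between each other, so an empty passphrase is the simplest choice), an equivalent sketch is:

ssh-keygen -t rsa -N "" -f <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/localkey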
- Login to AWS and create a key pair named amzkey in PEM format (follow the guide in the AWS docs). Download the key and put it in the spark-terraform-master/ folder.
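If you prefer the command line to the web console, a rough equivalent with the AWS CLI (assuming the CLI is installed and configured with the same credentials) is:

aws ec2 create-key-pair --key-name amzkey --query 'KeyMaterial' --output text > amzkey.pem
chmod 400 amzkey.pem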
- Open a terminal, go to the spark-terraform-master/ folder and execute:
terraform init
terraform apply
After a while (be patient!) it should print some public DNS names in green; these are the public DNS names of your instances.
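If you need those addresses again later, Terraform can re-print them at any time (assuming the public DNS values are among the outputs declared in output.tf):

terraform output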
- Connect via SSH to each of your instances with
ssh -i <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/amzkey.pem ubuntu@<PUBLIC DNS>
- Execute on the master (one by one):
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
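Before submitting anything, it can help to verify that the daemons actually came up; a quick sanity check on the master (jps ships with the JDK, dfsadmin with Hadoop):

jps
$HADOOP_HOME/bin/hdfs dfsadmin -report
# The Spark master web UI should also be reachable on port 8080 of the master, if the security group allows it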
- You are ready to run your app! Execute this command on the master:
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077 --executor-cores 2 --executor-memory 14g yourfile.py
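If your application reads its input from HDFS, upload the data before submitting; a sketch where the file name and HDFS path are just examples:

$HADOOP_HOME/bin/hdfs dfs -mkdir -p /input
$HADOOP_HOME/bin/hdfs dfs -put yourdata.txt /input/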
- Remember to run
terraform destroy
to delete your EC2 instances
Note: steps 0 to 5 (inclusive) are needed only on the very first run
- TransE PySpark: an application using this project
- hadoop-spark-cluster-deployment: the starting point of this project