Hadoop/Spark with Terraform on AWS

This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform.

  1. Variables
  2. Software version
  3. Project Structure
  4. How to
  5. See also

Variables

| Name | Description | Default |
| --- | --- | --- |
| region | AWS region | us-east-1 |
| access_key | AWS access key | |
| secret_key | AWS secret key | |
| token | AWS token | null |
| instance_type | AWS instance type | m5.xlarge |
| ami_image | AWS AMI image | ami-0885b1f6bd170450c |
| key_name | Name of the key pair used between nodes | localkey |
| key_path | Path of the key pair used between nodes | . |
| aws_key_name | AWS key pair used to connect to nodes | amzkey |
| amz_key_path | AWS key pair path used to connect to nodes | amzkey.pem |
| namenode_count | Namenode count | 1 |
| datanode_count | Datanode count | 3 |
| ips | Default private IPs used for nodes | See variables.tf |
| hostnames | Default private hostnames used for nodes | See variables.tf |
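Any of these defaults can also be overridden at apply time with Terraform's standard -var flag instead of editing files; the value below is hypothetical and only illustrates the mechanism:

terraform apply -var="datanode_count=5"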

Software version

  • Default AMI image: ami-0885b1f6bd170450c (Ubuntu 20.04, amd64, hvm-ssd)
  • Spark: 3.0.1
  • Hadoop: 2.7.7
  • Python: latest available (3.8 at the time of writing)
  • Java: OpenJDK 8u275 (JDK)

Project Structure

  • app/: folder where you can put your application; it will be copied to the namenode (a sample application is sketched after this list)
  • install-all.sh: script executed on every node; it installs Hadoop/Spark and does all the configuration for you
  • main.tf: definition of the resources
  • output.tf: Terraform output declarations
  • variables.tf: Terraform variable declarations
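As an illustration of what could go into app/, here is a minimal PySpark word-count script. This is only a sketch: the file name wordcount.py and the HDFS input path are hypothetical and not part of this project.

# wordcount.py -- minimal, hypothetical PySpark example to drop into app/
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The master URL is supplied by spark-submit, so nothing is hard-coded here
    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Read a text file from HDFS (the path is an assumption; point it at your own data)
    lines = spark.read.text("hdfs:///user/ubuntu/input.txt").rdd.map(lambda r: r[0])

    # Classic word count: split into words, pair each with 1, sum by key
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.collect():
        print(word, count)

    spark.stop()

It could then be submitted exactly like the spark-submit command shown at the end of the How to section, replacing yourfile.py with wordcount.py.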

How to

  1. Download and install Terraform
  2. Download the project and unzip it
  3. Open the terraform project folder "spark-terraform-master/"
  4. Create a file named "terraform.tfvars" and paste this:
access_key="<YOUR AWS ACCESS KEY>"
secret_key="<YOUR AWS SECRET KEY>"
token="<YOUR AWS TOKEN>"

Note: without setting the other variables (you can find them in variables.tf), Terraform will create a cluster in region "us-east-1", with 1 namenode, 3 datanodes, and an instance type of m5.xlarge.
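If you do want to change those defaults, the overrides can go in the same terraform.tfvars file. The values below are hypothetical and only show the syntax:

datanode_count = 5
instance_type = "m5.2xlarge"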

  5. Put your application files into the "app" folder of the Terraform project
  6. Open a terminal and generate a new SSH key
ssh-keygen -f <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/localkey

Where <PATH_TO_SPARK_TERRAFORM> is the path to the /spark-terraform-master/ folder (e.g. /home/user/)

  7. Log in to AWS and create a key pair named amzkey in PEM file format. Follow the guide in the AWS docs. Download the key and put it in the spark-terraform-master/ folder.

  8. Open a terminal, go to the spark-terraform-master/ folder, and execute the commands

terraform init
terraform apply

After a while (be patient!) it should print some public DNS names in green; these are the public DNS names of your instances.
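If you lose that output, the declared outputs can be printed again at any time from the same folder with the standard Terraform command (the exact output names are defined in output.tf):

terraform output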

  9. Connect via SSH to each of your instances with
ssh -i <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/amzkey.pem ubuntu@<PUBLIC DNS>
  10. Execute on the master (one by one):
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
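As a quick sanity check, you can list the running Java daemons with jps (shipped with the installed JDK). On the master you would roughly expect to see NameNode, ResourceManager, JobHistoryServer and the Spark Master (shown as Master), assuming the default setup of this project:

jps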
  11. You are ready to execute your app! Execute this command on the master
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077  --executor-cores 2 --executor-memory 14g yourfile.py
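While the job runs, you can usually follow its progress from the Spark standalone master web UI, which listens on port 8080 by default; whether it is reachable from outside depends on the security group created by main.tf, so treat this as an assumption:

http://<PUBLIC DNS of the master>:8080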
  12. Remember to run terraform destroy to delete your EC2 instances
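For completeness, it is run from the same spark-terraform-master/ folder, and Terraform will ask for confirmation before tearing the resources down:

terraform destroy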

Note: steps 1 to 6 (included) are needed only on the very first execution.

See also