Setting up remote Spark job execution
To enable running Spark jobs from the Curation Interface (e.g. for restoring versions or committing changes), do the following:
- Create an SSH key pair for the user that is running the Curation Interface API server (see the key setup sketch after this list)
- Copy the created PUBLIC key (`~/.ssh/id_rsa.pub`) into the cluster's `~/.ssh/authorized_keys` file (for the user that will run the Spark jobs)
- Try logging in via SSH (e.g. `ssh <user>@<cluster host>`) and check that it works without a password (if it doesn't, configure your SSH server for public key authentication or fix the access rights of the `.ssh` folder)
- Customize the `run_job.sh` file in the Curation repository with your user name and hostname (a sketch of such a script follows below)
- On the cluster node, add a build of the Ingestion pipeline containing the jobs that need to be run from Curation as `~/jars/curation_jobs.jar`, and add the `spark.sh` script from the Ingestion repository as `~/scripts/spark.sh`
- Modify the paths in the `run_job.sh` script if necessary (e.g. for a different user name)
- Test the job execution by navigating to `http://<curation host>:3000/api/run/versiondiff/667ccd90-5cc4-11e7-9047-dfcf226f2431,aa8ac8e0-5ca9-11e7-aea9-c37dbfcb3b83` (or use the curl example at the end of this page)
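
A minimal sketch of the key setup from the first three steps, assuming OpenSSH on both machines; `<user>` and `<cluster host>` are placeholders:

```sh
# Generate a key pair for the user running the Curation API server
# (press Enter at the passphrase prompt to allow passwordless logins)
ssh-keygen -t rsa -f ~/.ssh/id_rsa

# Install the PUBLIC key on the cluster; ssh-copy-id appends it to
# ~/.ssh/authorized_keys of the target user and fixes the permissions
ssh-copy-id <user>@<cluster host>

# Verify that the login now succeeds without a password prompt
ssh <user>@<cluster host> 'echo ok'
```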
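
The actual `run_job.sh` lives in the Curation repository and is not reproduced here; the following is only a hypothetical sketch of what such a wrapper could look like, assuming `spark.sh` accepts the jar path followed by the job name (the user, host, and argument conventions are placeholders):

```sh
#!/bin/bash
# Hypothetical sketch - not the actual run_job.sh from the repository.
# Submits a Spark job on the cluster via SSH.
USER=<user>           # placeholder: user that runs the Spark jobs
HOST=<cluster host>   # placeholder: cluster entry node

JOB=$1                # job name, e.g. "versiondiff"
shift                 # remaining arguments are passed through to the job

# Assumes spark.sh takes the jar and the job to run; the real
# interface is defined by the script in the Ingestion repository.
ssh "$USER@$HOST" "~/scripts/spark.sh ~/jars/curation_jobs.jar $JOB $*"
```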
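
The final test can also be run from a terminal, e.g. with curl; the host placeholder and the two version UUIDs are taken from the example URL above:

```sh
# Trigger the versiondiff job through the Curation API
curl "http://<curation host>:3000/api/run/versiondiff/667ccd90-5cc4-11e7-9047-dfcf226f2431,aa8ac8e0-5ca9-11e7-aea9-c37dbfcb3b83"
```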