
Setting up remote Spark job execution


To enable running Spark jobs from the Curation Interface (e.g. for restoring versions or committing changes), do the following:

  • Create an SSH key pair for the user that is running the Curation Interface API server
  • Copy the created public key (~/.ssh/id_rsa.pub) into the cluster's ~/.ssh/authorized_keys file (for the user that will run the Spark jobs)
  • Test the SSH login (e.g. ssh <user>@<cluster host>) and check that it works without a password. If it does not, enable public key authentication in the cluster's SSH server configuration or fix the permissions of the .ssh folder (a sketch of the key setup is shown after this list)
  • Customize the run_job.sh file in the Curation repository with your username and hostname (a hypothetical sketch of such a script is shown after this list)
  • On the cluster node, place a build of the Ingestion pipeline that contains the jobs to be run from Curation at ~/jars/curation_jobs.jar, and place the spark.sh script from the Ingestion repository at ~/scripts/spark.sh
  • Adjust the paths in the run_job.sh script if necessary (e.g. for a different user name)
  • Test the job execution by navigating to http://<curation host>:3000/api/run/versiondiff/667ccd90-5cc4-11e7-9047-dfcf226f2431,aa8ac8e0-5ca9-11e7-aea9-c37dbfcb3b83 (or by issuing the curl request shown below)
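
A minimal sketch of the SSH key setup described above; `<user>` and `<cluster host>` are placeholders for your environment, and `ssh-copy-id` is just one way to append the key (you can also append ~/.ssh/id_rsa.pub to the cluster's ~/.ssh/authorized_keys manually):

```bash
# Generate a key pair for the user running the Curation API server
# (press Enter to accept the default location ~/.ssh/id_rsa).
ssh-keygen -t rsa

# Copy the public key to the cluster user's authorized_keys.
# ssh-copy-id appends the key and sets the file permissions for you.
ssh-copy-id <user>@<cluster host>

# Verify that the login now works without a password prompt.
ssh <user>@<cluster host> 'echo "key-based login works"'
```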
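The actual run_job.sh lives in the Curation repository; the following is only a hypothetical sketch of what such a wrapper typically does (user, host, and the spark.sh invocation pattern are assumptions, not the real values), namely SSH into the cluster and start the job via the spark.sh script and the deployed jar:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a run_job.sh wrapper -- the real script is in
# the Curation repository; adjust user, host, and paths to your cluster.
JOB_CLASS="$1"                     # Spark job to run, passed by the API
shift                              # any further arguments go to the job

CLUSTER_USER="<user>"              # assumption: cluster login from above
CLUSTER_HOST="<cluster host>"      # assumption: your cluster's hostname

# Paths on the cluster node (see the jar/script setup step above).
# The argument order of spark.sh is assumed; check the Ingestion
# repository for its actual interface.
ssh "$CLUSTER_USER@$CLUSTER_HOST" \
    "~/scripts/spark.sh ~/jars/curation_jobs.jar $JOB_CLASS $*"
```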
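Instead of a browser, one way to trigger the same test run is from the command line; the two UUIDs are the example version IDs from the step above:

```bash
# Trigger a versiondiff run via the Curation API (replace <curation host>
# with the host running the API server; port 3000 as configured above).
curl "http://<curation host>:3000/api/run/versiondiff/667ccd90-5cc4-11e7-9047-dfcf226f2431,aa8ac8e0-5ca9-11e7-aea9-c37dbfcb3b83"
```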