
Grove Grooviness


Ubuntu 16.04 docker image doesn't work

A bit of a time waster for me was using the Ubuntu 16.04 image with udocker. The image seems to be broken: I tried it on my own machine and had the same issues. It has out-of-date or unsupported dependencies, and it gives you an error that the GPG keys have a bad signature. As soon as I installed the Ubuntu 18.04 image instead, everything worked well.
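For reference, switching to 18.04 with udocker looks roughly like this (the container name u18 is just an example):

udocker pull ubuntu:18.04
udocker create --name=u18 ubuntu:18.04
udocker run u18 cat /etc/lsb-release    # sanity check that the container starts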

Building docker images on your machine (laptop/desktop) won't work

As a consequence of udocker being slow on Grove, I tried to build the container locally, export it and load it on the cluster. This didn't work: even though I was using an Ubuntu image, the dependencies were installed for my machine's hardware, which is different from Grove's, so Grove couldn't use the container.
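For the record, the export/load workflow I'm talking about is roughly this (image and file names are just examples):

docker build -t mytf .          # on the laptop/desktop
docker save -o mytf.tar mytf    # export the image to a tar file
udocker load -i mytf.tar        # on Grove, after copying the tar over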

udocker is so slow you can barely do anything

My solution to this one was a bit creative (I think). Being unable to create the container on my local machine, and with udocker being really, really slow on the head node of Grove, I had to wait a long time at every step of installing dependencies. The solution I came up with was to create a script like this:

#!/bin/bash
#SBATCH -p compute    # submit to the compute partition
#SBATCH -w g005       # request node g005 specifically

sleep 7200            # hold the allocation for 2 hours

This creates a job on node g005. Once a job is running on g005 we are allowed to ssh into that node with ssh g005. Unlike the head node, this node is not painfully slow, and we can use it to install our dependencies; the dependencies are still accessible from the other nodes. The script gives you 2 hours to do your work, and you can increase the time limit if you need longer. If you finish early you can cancel this workaround script using scancel <jobid>.
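End to end, the workflow looks roughly like this (keepalive.sh is just my name for the script above):

sbatch keepalive.sh    # submit the sleep job
squeue -u $USER        # check it is running on g005 and note the job id
ssh g005               # allowed now that we have a job on that node
# ... install dependencies ...
scancel <jobid>        # release the node if you finish early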

Can't find keras.*

This might be because of the TensorFlow version being 1.8.0; my workaround was just to import tensorflow as tf and qualify the others with tf.keras.<whatever>
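In other words, something like this (a minimal sketch; the layer sizes are just placeholders):

import tensorflow as tf  # instead of: import keras

# the keras submodules are reached through tf.keras
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')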

Environment variables needed for TensorFlow

In the .sh file run by sbatch I create the udocker container and run a bash command in there using bash -c "cd /home/skennedy/scripts; python p.py", but in order to use TensorFlow an environment variable needs to be set: export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/
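So the sbatch script ends up looking something like this, assuming the TensorFlow image has already been pulled (the image tag is just an example; tfcuda is the container name I use):

#!/bin/bash
#SBATCH -p compute
#SBATCH -w g005

udocker create --name=tfcuda tensorflow/tensorflow:1.8.0-gpu
udocker run tfcuda bash -c "cd /home/skennedy/scripts; python p.py"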

My problem was getting the environment variable to be set inside the docker container when using bash -c. With normal bash you can simply include it in ~/.profile or ~/.bashrc, however with bash -c these files are not loaded. The udocker documentation talks about setting environment variables using a -env="VAR=VAL" switch, however I could not make this work in any form. Eventually the solution I found was to edit the udocker properties file, .udocker/containers/tfcuda/container.json, and change the entries for env to include another array element: LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/
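A quick way to check the edit took effect (note the single quotes, so the variable is expanded inside the container rather than on the host):

udocker run tfcuda bash -c 'echo $LD_LIBRARY_PATH'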

sbatch slow and can't allocate more memory

This one could not be solved before the end of the sprint. All attempts to allocate more memory return the same error, 'Illegal instruction'. This doesn't appear to be a Slurm error message, because searching the Slurm source does not show that message anywhere. The message does seem to come from assembly though, so it could be Slurm, or it could come from anything else.
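For context, the usual Slurm way to request more memory is a directive like this in the sbatch script (the value is just an example, and not necessarily the exact variant that was tried):

#SBATCH --mem=16G    # ask Slurm for 16 GB on the node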

Not using the HPC correctly

We need to put some work into figuring out how to make proper use of the HPC; our Jupyter notebook scripts probably need to be parallelized/distributed.

Jupyter Notebook grooviness

Big thanks to Lucas for his genius help with this

I did this with venv. Everything seems to get slower with udocker :(. For a lab computer the port forwarding will be like this:

ssh -vL 1235:localhost:1235 [email protected]

And once you're on Grove you can do this:

ssh -L 1235:localhost:8888 g005

Then to use Jupyter Notebook, open your browser and navigate to localhost:1235. You can find the authorisation token in the terminal where you ran Jupyter Notebook.
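For completeness, the notebook server itself is started on g005, listening on the port used in the second forward above (the venv path is just an example):

ssh g005
source ~/venv/bin/activate                   # activate the venv that has jupyter installed
jupyter notebook --no-browser --port=8888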
