Grove Grooviness
A bit of a time waster for me was using the Ubuntu 16.04 image with udocker. The image seems to be broken; I tried it on my own machine and hit the same issues. It has out-of-date or unsupported dependencies, and it gives you an error that the GPG keys have a bad signature. As soon as I switched to the Ubuntu 18.04 image it worked well.
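For reference, getting the 18.04 image going with udocker is just the usual pull/create/run sequence; the container name `u1804` below is only a placeholder, not what we actually called ours:

```bash
# Pull the Ubuntu 18.04 image and create a container from it (name is a placeholder)
udocker pull ubuntu:18.04
udocker create --name=u1804 ubuntu:18.04
# Quick smoke test that the container actually runs
udocker run u1804 cat /etc/lsb-release
```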
As a consequence of udocker being slow on Grove, I tried to build the container locally, export it, and load it on the cluster. This didn't work: even though I was using an Ubuntu image, the dependencies had been installed for my machine's hardware, which is different to Grove's, so Grove couldn't use the container.
My solution to this one was a bit creative (I think). Being unable to create the container on my local machine, and with udocker being really, really slow on the head node of Grove, I had to wait a long time between steps when installing dependencies. The solution I came up with is to create a script like this:
#!/bin/bash
#SBATCH -p compute
#SBATCH -w g005
sleep 7200
This creates a job on node g005. Once a job is running on g005, we are allowed to ssh into that node with `ssh g005`. Unlike the head node, this node is not really, really slow, and we can use it to install our dependencies; they are still accessible from the other nodes. The script gives you 2 hours to do your work, and you can increase the time limit if you need longer. If you finish early you can cancel this workaround script using `scancel <jobid>`.
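In case it helps, this is roughly how I drive the workaround end to end; `keepalive.sh` is just my own name for the script above, and the job id comes from `squeue`:

```bash
sbatch keepalive.sh    # submit the 2-hour sleep job pinned to g005
squeue -u $USER        # note the job id and wait until it shows as RUNNING on g005
ssh g005               # allowed now, because we own a job on that node
# ... install dependencies at a sensible speed ...
scancel <jobid>        # release the node if you finish before the 2 hours are up
```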
This might be because of the TensorFlow version being 1.8.0; my workaround for the Keras import trouble was just to `import tensorflow as tf` and qualify the others with `tf.keras.<whatever>`.
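A quick way to check that the qualified names resolve under TensorFlow 1.8.0 (this one-liner is my own, run it wherever your TensorFlow environment lives):

```bash
# Print the TF version and confirm tf.keras is reachable via the qualified path
python -c "import tensorflow as tf; print(tf.__version__); print(tf.keras.layers.Dense)"
```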
In the `.sh` file run by sbatch I create the udocker container and run a bash command in there using `bash -c "cd /home/skennedy/scripts; python p.py"`, but in order to use TensorFlow an environment variable needs to be set: `export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/`.
My problem was getting the environment variable set inside the udocker container when using `bash -c`. With a normal interactive bash you can simply include it in `~/.profile` or `~/.bashrc`; however, with `bash -c` these files are not loaded. The documentation for udocker talks about setting environment variables using a `--env="VAR=VAL"` switch, however I could not make this work in any form. Eventually the solution I found was to edit the udocker properties file, `.udocker/containers/tfcuda/container.json`, and change the entries for env to include another array element: `LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/`
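Putting the pieces together, my understanding of the resulting job script looks roughly like this; the partition and node lines are copied from the keep-alive script above, and the container name `tfcuda` and paths are the ones mentioned earlier, so treat it as a sketch rather than the exact file:

```bash
#!/bin/bash
#SBATCH -p compute
#SBATCH -w g005
# Assumes .udocker/containers/tfcuda/container.json has already been edited so that its
# env array also contains "LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/", as described above.
udocker run tfcuda bash -c "cd /home/skennedy/scripts; python p.py"
```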
This one could not be solved before sprint end. All attempts to allocate more memory return the same error, 'Illegal instruction'. This doesn't appear to be a Slurm error message, because searching the Slurm source does not show that message anywhere. ('Illegal instruction' is usually the shell reporting that a process died with SIGILL, which can happen when a binary was built for a CPU whose instructions the node doesn't support.) The message does seem to come from something low-level, so it could be Slurm, or it could come from anything else.
We need to put some work into figuring out how to make proper use of the HPC; probably our Jupyter notebook scripts need to be able to be parallelized/distributed.
Big thanks to Lucas for his genius help with this
I did this with venv; everything seems to get slower with udocker :(. From a lab computer the port forwarding will be like this:
`ssh -vL 1235:localhost:1235 [email protected]`
And once you're on Grove you can do this:
`ssh -L 1235:localhost:8888 g005`
Then to use Jupyter Notebook, open your browser and navigate to `localhost:1235`. You can find the authorisation token in the terminal where you ran Jupyter Notebook.
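For completeness, this is roughly what the g005 side looks like for me; the venv path is only a placeholder for wherever you created your environment, and remember you can only ssh to g005 while you have a job running there:

```bash
ssh g005                                   # needs the keep-alive job from above to be running
source ~/venvs/tf/bin/activate             # placeholder path: activate whatever venv you made
jupyter notebook --no-browser --port=8888  # matches the 8888 end of the port forward
# Copy the token it prints, then browse to http://localhost:1235 on the lab machine
```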