
VGCN health check #16

Open
bgruening opened this issue Nov 17, 2018 · 7 comments

@bgruening (Member)

It would be nice if we could check the image before Condor starts up and adds itself to the cluster.

A few ideas:

  • check filesystems
    • NFS
    • CVMFS
    • tmp
    • check size of filesystems
  • check important ENVs
  • run a tiny job in Singularity, Condor, and Docker
  • check network
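
A minimal sketch of what the filesystem part of such a check could look like (the mount points and thresholds here are hypothetical, not the final script):

import os
import shutil

# Hypothetical mount points and minimum free space; adjust to the real image.
CHECKS = {
    "/data": 10 * 1024**3,                # NFS share, at least 10 GiB free
    "/cvmfs/main.galaxyproject.org": 0,   # CVMFS, just needs to be reachable
    "/tmp": 1 * 1024**3,
}

def check_filesystems():
    """Return a list of error strings; an empty list means healthy."""
    errors = []
    for path, min_free in CHECKS.items():
        if not os.path.isdir(path):
            errors.append(path + " does not exist")
            continue
        free = shutil.disk_usage(path).free
        if free < min_free:
            errors.append("%s: only %.1f GiB free" % (path, free / 1024**3))
    return errors

if __name__ == "__main__":
    problems = check_filesystems()
    for p in problems:
        print(p)
    raise SystemExit(1 if problems else 0)
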
@AndreasSko (Contributor)

I already found a solution on GitHub for a similar problem (https://github.com/HEP-Puppet/htcondor/blob/master/templates/20_workernode.config.erb & https://github.com/HEP-Puppet/htcondor/blob/master/files/healthcheck_wn_condor) and tested this approach (calling a healthcheck script on startup and via cron) myself - it worked pretty well. The next step is to write the actual health checks, which I will discuss with @bgruening on Thursday.

@bgruening (Member, Author)

Status: Andreas has a Python script that is started with Condor and blocks Condor as long as there is an error. Next steps are building the VGCN and getting this script in as a PR. We are also playing around with a local InfluxDB and sending events from this script to InfluxDB to indicate that a new node is available.
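
For the InfluxDB part, a sketch of such an event write against a local InfluxDB 1.x instance could look like this (the database and measurement names are made up):

import requests

# InfluxDB 1.x line-protocol write; database/measurement names are hypothetical.
resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "vgcn"},
    data="node_available,host=vgcn-worker-1 value=1",
)
resp.raise_for_status()  # InfluxDB answers 204 No Content on success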

We also discussed potential FaaS for creating training materials with planemo.

@hexylena (Member) commented Nov 22, 2018

We generally don't need to send events to InfluxDB: our VGCN already reports a health = 1 stat to InfluxDB every few seconds, which gives us this notification.

Edit: this is how we built this graph: https://stats.galaxyproject.eu/d/000000021/galaxy-condor-cluster?refresh=15m&panelId=1&fullscreen&orgId=1

@AndreasSko (Contributor)

A little status update:

  • I improved my healthcheck script (it now also detects unresponsive mounts, which took longer than expected; see the sketch at the end of this comment) and fixed some bugs.

  • I also played with the idea of running the script outside of/independently from HTCondor: on boot it starts right after HTCondor, checks the health, and decides whether to start or stop it. You can also check periodically via cron jobs whether everything is still working and stop HTCondor if there is an error. The nice thing is that you could easily extend it to cover other services. If you like the idea, I could extend the script to use something like a config file so it's easier to work with.

  • I tried multiple times to build the VGCN, but without success (I already changed the iso_url to version 1810, as the old one isn't working anymore). I'm using Ubuntu 18.04 on the BWCloud, with the newest versions of Packer, Ansible, and QEMU. While building, I get the following error:

==> qemu: Connected to SSH!
==> qemu: Provisioning with Ansible...
==> qemu: Executing Ansible: ansible-playbook --extra-vars packer_build_name=qemu packer_builder_type=qemu -i /tmp/packer-provisioner-ansible619958900 /home/ubuntu/vgcn/ansible-roles/setup-vgcn-bwcloud.yml -e ansible_ssh_private_key_file=/tmp/ansible-key248446249
qemu: ERROR! 'umask' is not a valid attribute for a Task
qemu:
qemu: The error appears to have been in '/home/ubuntu/vgcn/ansible-roles/CyVerse-Ansible.singularity/tasks/main.yml': line 43, column 3, but may
qemu: be elsewhere in the file depending on the exact syntax problem.
qemu:
qemu: The offending line appears to be:
qemu:
qemu:
qemu: - name: execute make install
qemu: ^ here
qemu:
qemu: This error can be suppressed as a warning using the "invalid_task_attribute_failed" configuration
==> qemu: Deleting output directory...
Build 'qemu' errored: Error executing Ansible: Non-zero exit status: exit status 4

Do you have an idea what my problem could be?
If I can get the build to work, I will include my script so you can have a look.
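
For reference, the usual trick for detecting an unresponsive mount is to run the stat in a child process with a timeout, because a hung NFS or CVMFS mount blocks the calling process indefinitely. A minimal sketch (the mount point is hypothetical):

import subprocess

def mount_responsive(path, timeout=10):
    # A hung mount blocks stat() forever, so run it in a child
    # process and treat a timeout as unhealthy.
    try:
        subprocess.run(
            ["stat", "-t", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

print(mount_responsive("/data/share"))  # hypothetical mount point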

@hexylena (Member) commented Jan 8, 2019

I improved my healthcheck-script (it now also detects unresponsive mounts, which took a longer time than expected) and fixed some bugs.

cool, could you share it somehow? As a pull request to this repo maybe? :)

If you like the idea, I could extend the script to use something like a config file so its easier to work with.

This sounds nice in theory, but I am not sure what other services we might target. I had not discussed this extensively with Björn; in my mind I had imagined this as one script that we run, and based on its exit code the deployer of the script would write some wrapper which decides which services to start/stop. I had come in with the assumption that Björn had only asked you to write the detection routine, and that I would then supply something like

check_stuff
ec=$?
if (( ec != 0 )); then
  systemctl stop htcondor ....
fi

and as deployer I'd choose to run that just on boot, or via cron, or something else. But if it is in the scope of your project to do these things additionally, then a config file sounds nice! :)

I’m using Ubuntu 18.04 on the BWCloud, with the newest versions of Packer, Ansible and qemu. While building, I get the following error:

It is generally not possible to build images within VMs on the bwcloud. I'm amazed it got as far as the playbook; it should have crashed much earlier. But yes, umask is not a valid task attribute, and I have now removed it.

I'm guessing it failed for you and not for us because you are using the newest version of Ansible? We use 2.7.1.

@AndreasSko (Contributor)

So I made a pull request with my first version. What do you think? Is there maybe something I should add or do differently?

I had not discussed this extensively with Björn; in my mind I had imagined this as one script that we run, and based on its exit code the deployer of the script would write some wrapper which decides which services to start/stop. I had come in with the assumption that Björn had only asked you to write the detection routine, and that I would then supply something like

The idea of my additional script was basically that, just implemented in Python: it checks whether everything is healthy and stops the service if a problem occurs. As I understood it, this was also part of my project's scope. Depending on your needs (is a simple healthcheck script enough, or do you need more?) I can keep working on this :)
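
A rough sketch of that watchdog idea (the service name and the check itself are placeholders):

import subprocess

def healthy():
    # Placeholder for the actual checks: mounts, ENVs, test jobs, network.
    return True

def ensure_service(name="condor"):
    # Stop the service when unhealthy; start it (again) when healthy.
    action = "start" if healthy() else "stop"
    subprocess.run(["systemctl", action, name], check=True)

if __name__ == "__main__":
    ensure_service()  # run once at boot and periodically via cron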

It is generally not possible to build images within VMs on the bwcloud. I'm amazed it got as far as the playbook; it should have crashed much earlier. But yes, umask is not a valid task attribute, and I have now removed it.
I'm guessing it failed for you and not for us because you are using the newest version of Ansible? We use 2.7.1.

Ok, good to know :D After your commit (and installing Ansible 2.7.1) I tried it one more time on the BWCloud, and this time it got further, to this error:

qemu: TASK [galaxy : galaxy account group] *******************************************
qemu: fatal: [default]: FAILED! => {"changed": false, "msg": "groupadd: GID '999' already exists\n", "name": "galaxy"}
qemu: to retry, use: --limit @/home/ubuntu/vgcn/ansible-roles/setup-vgcn-bwcloud.retry
qemu:
qemu: PLAY RECAP *********************************************************************
qemu: default : ok=1 changed=0 unreachable=0 failed=1

I also tried to build it on my Mac (with the KVM accelerator disabled), which just didn't do anything, and on Google Cloud (as it supports nested VMs), this time with this error:

==> qemu: Executing Ansible: ansible-playbook --extra-vars packer_build_name=qemu packer_builder_type=qemu -i /tmp/packer-provisioner-ansible592665748 /home/askorczyk/vgcn/ansible-roles/setup-vgcn-bwcloud.yml -e ansible_ssh_private_key_file=/tmp/ansible-key358398409
qemu: ERROR! no action detected in task
qemu:
qemu: The error appears to have been in '/home/askorczyk/vgcn/ansible-roles/basic/tasks/main.yml': line 47, column 3, but may
qemu: be elsewhere in the file depending on the exact syntax problem.
qemu:
qemu: The offending line appears to be:
qemu:
qemu:
qemu: - name: Ensure services are enabled + started
qemu: ^ here

Unfortunately, I don't have a local Linux machine available. I will still try a few things, but I think the best plan for me is to concentrate on the script first - at least until my next meeting with Björn.

@hexylena (Member)

The groupadd error has now been fixed in #19.
