
VGCN health check #16

Open
bgruening opened this issue Nov 17, 2018 · 7 comments

@bgruening (Member)

It would be nice if we could check the image before Condor starts up and adds itself to the cluster.

A few ideas:

  • check filesystems
    • NFS
    • CVMFS
    • tmp
    • check size of filesystems
  • check important ENVs
  • run a tiny job in Singularity, Condor, and Docker
  • check network
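
A minimal sketch of what the filesystem part of such a check could look like (the mount points and thresholds here are hypothetical, not the final script):

import os
import shutil

# Hypothetical mount points and minimum free space; adjust to the real image.
CHECKS = {
    "/data": 10 * 1024**3,                # NFS share, at least 10 GiB free
    "/cvmfs/main.galaxyproject.org": 0,   # CVMFS, just needs to be reachable
    "/tmp": 1 * 1024**3,
}

def check_filesystems():
    """Return a list of error strings; an empty list means healthy."""
    errors = []
    for path, min_free in CHECKS.items():
        if not os.path.isdir(path):
            errors.append(path + " does not exist")
            continue
        free = shutil.disk_usage(path).free
        if free < min_free:
            errors.append("%s: only %.1f GiB free" % (path, free / 1024**3))
    return errors

if __name__ == "__main__":
    problems = check_filesystems()
    for p in problems:
        print(p)
    raise SystemExit(1 if problems else 0)
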
@AndreasSko (Contributor)

I already found a solution on GitHub for a similar problem (https://github.com/HEP-Puppet/htcondor/blob/master/templates/20_workernode.config.erb & https://github.com/HEP-Puppet/htcondor/blob/master/files/healthcheck_wn_condor) and tested this approach (calling a healthcheck script on startup and via cron) myself - it worked pretty well. The next step is to write the actual health checks, which I will discuss with @bgruening on Thursday.

@bgruening (Member, Author)

Status: Andreas has a Python script that is started with Condor and blocks Condor as long as there is an error. Next steps are building the VGCN and getting this script in as a PR. We are also playing around with a local InfluxDB and sending events from this script to InfluxDB to indicate that a new node is available.
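
For the InfluxDB part, a sketch of such an event write against a local InfluxDB 1.x instance could look like this (the database and measurement names are made up):

import requests

# InfluxDB 1.x line-protocol write; database/measurement names are hypothetical.
resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "vgcn"},
    data="node_available,host=vgcn-worker-1 value=1",
)
resp.raise_for_status()  # InfluxDB answers 204 No Content on success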

We also discussed potential FaaS for creating training materials with planemo.

@hexylena (Member) commented Nov 22, 2018

We generally don't need to send events to InfluxDB: our VGCN already reports a health = 1 stat to InfluxDB every few seconds, which gives us this notification.

Edit: this is how we built this graph: https://stats.galaxyproject.eu/d/000000021/galaxy-condor-cluster?refresh=15m&panelId=1&fullscreen&orgId=1

@AndreasSko (Contributor)

A little status update:

  • I improved my healthcheck script (it now also detects unresponsive mounts, which took longer than expected; see the sketch at the end of this comment) and fixed some bugs.

  • I also played with the idea of running the script outside of/independently from HTCondor: on boot it starts right after HTCondor, checks the health, and decides whether to start or stop it. You can also check periodically via cron jobs whether everything is still working and stop HTCondor if there is an error. The nice thing is that you could easily extend it to cover other services. If you like the idea, I could extend the script to use something like a config file so it's easier to work with.

  • I tried multiple times to build the VGCN, but without success (I already changed the iso_url to version 1810, as the old one isn't working anymore). I'm using Ubuntu 18.04 on the BWCloud, with the newest versions of Packer, Ansible, and QEMU. While building, I get the following error:

==> qemu: Connected to SSH!
==> qemu: Provisioning with Ansible...
==> qemu: Executing Ansible: ansible-playbook --extra-vars packer_build_name=qemu packer_builder_type=qemu -i /tmp/packer-provisioner-ansible619958900 /home/ubuntu/vgcn/ansible-roles/setup-vgcn-bwcloud.yml -e ansible_ssh_private_key_file=/tmp/ansible-key248446249
qemu: ERROR! 'umask' is not a valid attribute for a Task
qemu:
qemu: The error appears to have been in '/home/ubuntu/vgcn/ansible-roles/CyVerse-Ansible.singularity/tasks/main.yml': line 43, column 3, but may
qemu: be elsewhere in the file depending on the exact syntax problem.
qemu:
qemu: The offending line appears to be:
qemu:
qemu:
qemu: - name: execute make install
qemu: ^ here
qemu:
qemu: This error can be suppressed as a warning using the "invalid_task_attribute_failed" configuration
==> qemu: Deleting output directory...
Build 'qemu' errored: Error executing Ansible: Non-zero exit status: exit status 4

Do you have an idea what my problem could be?
If I can get the build to work, I will include my script so you can have a look.
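
For reference, the usual trick for detecting an unresponsive mount is to run the stat in a child process with a timeout, because a hung NFS or CVMFS mount blocks the calling process indefinitely. A minimal sketch (the mount point is hypothetical):

import subprocess

def mount_responsive(path, timeout=10):
    # A hung mount blocks stat() forever, so run it in a child
    # process and treat a timeout as unhealthy.
    try:
        subprocess.run(
            ["stat", "-t", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

print(mount_responsive("/data/share"))  # hypothetical mount point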

@hexylena (Member) commented Jan 8, 2019

I improved my healthcheck-script (it now also detects unresponsive mounts, which took a longer time than expected) and fixed some bugs.

cool, could you share it somehow? As a pull request to this repo maybe? :)

If you like the idea, I could extend the script to use something like a config file so its easier to work with.

This sounds nice in theory, but I am not sure what other services we might target. I had not discussed this extensively with Björn; in my mind I had imagined this as one script that we run, and based on its exit code the deployer of the script would write some wrapper which decides which services to start/stop. I had come in with the assumption that Björn had only asked you to write the detection routine, and that I would then supply something like

check_stuff
ec=$?
if (( ec != 0 )); then
  systemctl stop htcondor ....
fi

and as deployer I'd choose to run that just on boot, or via cron, or something else. But if it is in the scope of your project to do these things additionally, then a config file sounds nice! :)

I’m using Ubuntu 18.04 on the BWCloud, with the newest versions of Packer, Ansible and qemu. While building, I get the following error:

It is generally not possible to build images within VMs on the bwcloud. I'm amazed it got as far as the playbook; it should have crashed much earlier. But yes, umask is not a valid task attribute, and I have now removed it.

I'm guessing it failed for you and not for us because you are using the newest version of Ansible? We use 2.7.1.

@AndreasSko (Contributor)

So I made a pull request with my first version. What do you think? Is there maybe something I should add or do differently?

I had not discussed this extensively with Björn; in my mind I had imagined this as one script that we run, and based on its exit code the deployer of the script would write some wrapper which decides which services to start/stop. I had come in with the assumption that Björn had only asked you to write the detection routine, and that I would then supply something like

The idea of my additional script was basically that, just implemented in Python: it checks whether everything is healthy and stops the service if a problem occurs. As I understood it, this was also part of my project's scope. Depending on your needs (is a simple healthcheck script enough, or do you need more?) I can keep working on this :)
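
A rough sketch of that watchdog idea (the service name and the check itself are placeholders):

import subprocess

def healthy():
    # Placeholder for the actual checks: mounts, ENVs, test jobs, network.
    return True

def ensure_service(name="condor"):
    # Stop the service when unhealthy; start it (again) when healthy.
    action = "start" if healthy() else "stop"
    subprocess.run(["systemctl", action, name], check=True)

if __name__ == "__main__":
    ensure_service()  # run once at boot and periodically via cron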

It is generally not possible to build images within VMs on the bwcloud. I'm amazed it got as far as the playbook; it should have crashed much earlier. But yes, umask is not a valid task attribute, and I have now removed it.
I'm guessing it failed for you and not for us because you are using the newest version of Ansible? We use 2.7.1.

Ok, good to know :D After your commit (and installing Ansible 2.7.1) I tried it one more time on the BWCloud, and this time it got further, to this error:

qemu: TASK [galaxy : galaxy account group] *******************************************
qemu: fatal: [default]: FAILED! => {"changed": false, "msg": "groupadd: GID '999' already exists\n", "name": "galaxy"}
qemu: to retry, use: --limit @/home/ubuntu/vgcn/ansible-roles/setup-vgcn-bwcloud.retry
qemu:
qemu: PLAY RECAP *********************************************************************
qemu: default : ok=1 changed=0 unreachable=0 failed=1

I also tried to build it on my Mac (with the KVM accelerator disabled), which just didn't do anything, and on Google Cloud (as it supports nested VMs), this time with this error:

==> qemu: Executing Ansible: ansible-playbook --extra-vars packer_build_name=qemu packer_builder_type=qemu -i /tmp/packer-provisioner-ansible592665748 /home/askorczyk/vgcn/ansible-roles/setup-vgcn-bwcloud.yml -e ansible_ssh_private_key_file=/tmp/ansible-key358398409
qemu: ERROR! no action detected in task
qemu:
qemu: The error appears to have been in '/home/askorczyk/vgcn/ansible-roles/basic/tasks/main.yml': line 47, column 3, but may
qemu: be elsewhere in the file depending on the exact syntax problem.
qemu:
qemu: The offending line appears to be:
qemu:
qemu:
qemu: - name: Ensure services are enabled + started
qemu: ^ here

Unfortunately, I don't have a local Linux machine available. I will still try a few things, but I think the best plan for me is to concentrate on the script first - at least until my next meeting with Björn.

@hexylena (Member)

The groupadd error has now been fixed in #19.
