Cluster Testing Plan #17

Open
hexylena opened this issue Dec 13, 2018 · 3 comments

@hexylena
Member

hexylena commented Dec 13, 2018

I'd like to stress-test Condor and figure out where we can rely on it to restart jobs and where we need Galaxy to be smarter.

Condor

Set up a Condor cluster with VGCN, including a central manager, a dedicated submit node, and maybe 4 m1.small executors.

  • Launch 1M tiny jobs (just exit 0 or so). Do all 1M complete? At what throughput? Dump the output of condor_history to a file, process it, and maybe make some nice graphs.
  • Do the same, but repeatedly reboot the central manager at roughly 5-minute intervals throughout the test. Does everything come back successfully?
  • Do the same, but kill the central manager in the middle of the test: just delete it in OpenStack and replace it. What is lost?
  • Do the same, but repeatedly reboot compute nodes at random.
  • Do the same, but repeatedly kill and replace compute nodes (e.g. with Terraform).
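For the first test, something like the following might do as a starting point. It is only a sketch: it assumes `/bin/true` exists on every executor, that the history is dumped with `condor_history -af ClusterId ProcId QDate CompletionDate ExitCode`, and the function names and the `stress.log` path are made up for illustration.

```python
from collections import Counter


def build_submit_description(n_jobs, log_path="stress.log"):
    """Return a submit description that queues n_jobs no-op jobs.

    Assumption: /bin/true is present on every executor, so the
    executable does not need to be transferred.
    """
    return "\n".join([
        "universe = vanilla",
        "executable = /bin/true",
        "transfer_executable = false",
        f"log = {log_path}",
        f"queue {n_jobs}",
    ]) + "\n"


def summarize_history(text):
    """Summarize whitespace-separated history lines.

    Expected columns per line (an assumption about how the dump was made):
    ClusterId ProcId QDate CompletionDate ExitCode
    """
    exit_codes = Counter()
    queued, finished = [], []
    total = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        total += 1
        _cluster, _proc, qdate, completion, exit_code = fields
        exit_codes[exit_code] += 1
        # A job counts as finished only if it has a non-zero completion time.
        if qdate.isdigit() and completion.isdigit() and int(completion) > 0:
            queued.append(int(qdate))
            finished.append(int(completion))
    # Wall-clock span from the first submission to the last completion.
    elapsed = max(finished) - min(queued) if finished else 0
    return {
        "jobs": total,
        "finished": len(finished),
        "exit_codes": dict(exit_codes),
        "jobs_per_second": len(finished) / elapsed if elapsed > 0 else None,
    }
```

Writing the result of `build_submit_description(1_000_000)` to a file and passing it to `condor_submit` would queue the jobs; running the same summary after each reboot/kill scenario makes the runs comparable.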

(If 1M is too high and takes multiple hours then decrease until the tests run in ~20 minutes.)

Galaxy

Same setup, but add a Galaxy server and an NFS server. (We can help here.)

  • Launch thousands of jobs that take some time to complete (e.g. sleep 60; echo "hi" in a tool), and repeatedly kill compute nodes. Do the jobs complete successfully with their expected output?
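Checking "expected output" by hand won't scale to thousands of jobs, so a small checker could help. This is a rough sketch that assumes each job's output ends up as one file in a single directory and should contain just `hi`; the directory layout, the file names, and `check_outputs` itself are hypothetical, not how Galaxy actually stores datasets.

```python
from pathlib import Path


def check_outputs(output_dir, expected_names, expected_text="hi"):
    """Sort expected output files into ok / wrong / missing.

    expected_names is the list of file names we expect to find
    (hypothetical; derive it from however the jobs were launched).
    """
    result = {"ok": [], "wrong": [], "missing": []}
    for name in expected_names:
        path = Path(output_dir) / name
        if not path.is_file():
            result["missing"].append(name)
        elif path.read_text().strip() == expected_text:
            result["ok"].append(name)
        else:
            result["wrong"].append(name)
    return result
```

Anything in `wrong` or `missing` after a run with killed compute nodes would be a job that Condor did not transparently restart.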
@bgruening
Member

ping @AndreasSko

@AndreasSko
Contributor

I tested it with 10k small jobs, which took around 6.5 minutes to submit (50k already took more than 30 minutes). The jobs were a simple sleep 1; echo "$(hostname)". I tried rebooting and replacing the exec nodes and the central manager, and at least in my tests all jobs completed successfully with the correct outputs. I still need to do the Galaxy part.
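To put the submit times in perspective, here is the back-of-the-envelope arithmetic, using only the numbers above. The extrapolation to 1M jobs assumes submission time scales linearly with job count, which these two data points don't prove.

```python
def submit_rate(jobs, minutes):
    """Average submission rate in jobs per second."""
    return jobs / (minutes * 60)


# 10k jobs in about 6.5 minutes.
rate_10k = submit_rate(10_000, 6.5)

# 50k took MORE than 30 minutes, so this is only an upper bound on the rate.
rate_50k_max = submit_rate(50_000, 30)

# Assumption: linear scaling at the 10k rate.
hours_for_1m = 1_000_000 / rate_10k / 3600

print(f"10k run: {rate_10k:.1f} jobs/s")             # 25.6 jobs/s
print(f"50k run: under {rate_50k_max:.1f} jobs/s")   # under 27.8 jobs/s
print(f"1M jobs: about {hours_for_1m:.1f} h")        # about 10.8 h
```

So at this rate, submitting 1M jobs alone would take most of a working day, which is why the smaller job counts were used.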

@erasche I'm in the office tomorrow. If you are there and have the time, I could show you the rest of my results.

@hexylena
Member Author

Sounds great! Let's talk then :)
