dynamic rate limiting of job submissions? #442

Closed

BenWibking opened this issue May 3, 2024 · 5 comments

Comments
@BenWibking

On the cluster I'm using, there is a hard limit of 36 jobs per user that are running or pending in the SLURM queue.

However, I need to run a 200-parameter study. Is there any workaround for this other than splitting the large study into smaller studies of at most 36 parameters each?

It would be ideal if it were possible for the conductor process to wait until jobs complete and then submit new jobs.
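
(For illustration only: a manual version of that kind of rate limiting, written as a login-node shell loop. The 36-job cap is this cluster's limit; step_*.sbatch stands in for hypothetical per-step batch scripts and is not anything Maestro generates by that name.)

MAX_JOBS=36
for script in step_*.sbatch; do
    # Wait until the per-user queue (running + pending) has room under the cap.
    while [ "$(squeue -u "$USER" -h -t RUNNING,PENDING | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 60
    done
    sbatch "$script"
done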

@BenWibking
Author

BenWibking commented May 4, 2024

Following the route described in the docs here (https://maestrowf.readthedocs.io/en/latest/Maestro/how_to_guides/running_with_flux.html#launch-maestro-external-to-the-batch-jobflux-broker) seems like the best option for my use case.

I've managed to install Flux via Spack on this cluster. The one remaining issue is that I have to wait until the SLURM job starts before I can run maestro run on the login node.

If I wanted to modify the Maestro conductor code so it polls SLURM to see whether the Flux broker job has started, where should I start to do that? Is this feasible?
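
(A possible stopgap outside the conductor itself, sketched on the assumption that the Flux broker runs inside an ordinary SLURM batch job: poll squeue from the login node until that job is RUNNING, then start Maestro. The job id and spec filename below are placeholders.)

BROKER_JOBID=1234567   # placeholder: SLURM job id of the batch job hosting the Flux broker
until [ "$(squeue -j "$BROKER_JOBID" -h -o '%T')" = "RUNNING" ]; do
    sleep 30
done
maestro run study.yaml   # placeholder spec; the broker connection setup from the linked docs is omitted here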

BenWibking changed the title from "automatic rate limiting of job submissions?" to "dynamic rate limiting of job submissions?" on May 4, 2024
@FrankD412
Member

Hi @BenWibking -- one thing to note is that maestro run also has a throttle option; you could limit the jobs to 36 there. Do keep in mind that it is a universal limit across local and scheduled steps, so if you have a lot of local steps ahead of submitted steps, you will artificially limit yourself there.
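
(A minimal invocation along those lines, with the spec filename as a placeholder:)

maestro run study.yaml --throttle 36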

@BenWibking
Author

> Hi @BenWibking -- one thing to note is that maestro run also has a throttle option; you could limit the jobs to 36 there. Do keep in mind that it is a universal limit across local and scheduled steps, so if you have a lot of local steps ahead of submitted steps, you will artificially limit yourself there.

Adding --throttle 36 solves the problem and works perfectly.

I was a bit thrown off by the wording in the documentation for the --throttle option. It might help to clarify that it refers to the total number of jobs in the (external, non-Maestro) scheduler queue (both running and pending), rather than only those that are actually executing.

@BenWibking
Author

BenWibking commented May 5, 2024

I checked the status of this study today and it seems to have stopped submitting new jobs to SLURM.

maestro status reports that several dozen steps are PENDING and dozens more are INITIALIZED, but nothing is in the SLURM queue. Maybe this is related to #441?

The last log entries are:

2024-05-05 11:40:01,492 - maestrowf.conductor:monitor_study:349 - INFO - Checking DAG status at 2024-05-05 11:40:01.492025
2024-05-05 11:40:01,597 - maestrowf.datastructures.core.executiongraph:check_study_status:963 - INFO - Jobs found for user 'bwibking'.
2024-05-05 11:40:01,598 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:916 - INFO - Found 0 available slots...
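
(A quick way to compare the scheduler's view with the conductor's while this is happening, using the study directory from the ps output below; the squeue count of running plus pending jobs is the number --throttle is measured against:)

squeue -u "$USER" -h -t RUNNING,PENDING | wc -l   # what SLURM actually has queued for this user
maestro status /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240504-193253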

The full log for this study is here:
medres_compressive.log.zip

The conductor process for this is still running:

login4.stampede3(1011)$ ps aux | grep $USER
bwibking 3092832  0.0  0.0  20660 11896 ?        Ss   May04   0:01 /usr/lib/systemd/systemd --user
bwibking 3092835  0.0  0.0 202568  6948 ?        S    May04   0:00 (sd-pam)
bwibking 3093950  0.0  0.0   7264  3472 ?        S    May04   0:00 /bin/sh -c nohup conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240504-193253 > /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240504-193253/medres_compressive.txt 2>&1
bwibking 3093951  0.2  0.0 328808 72948 ?        S    May04   2:17 /scratch/projects/compilers/intel24.0/oneapi/intelpython/python3.9/bin/python3.9 /home1/02661/bwibking/.local/bin/conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240504-193253
root     3993349  0.0  0.0  39960 12012 ?        Ss   11:32   0:00 sshd: bwibking [priv]
bwibking 3993762  0.0  0.0  40144  7516 ?        S    11:33   0:00 sshd: bwibking@pts/73
bwibking 3993765  0.0  0.0  18048  6128 pts/73   Ss   11:33   0:00 -bash
bwibking 3998925  0.0  0.0  19236  3652 pts/73   R+   11:39   0:00 ps aux
bwibking 3998926  0.0  0.0   6432  2336 pts/73   S+   11:39   0:00 grep --color=auto bwibking

This seems to reliably happen for studies that I run on this machine.

BenWibking reopened this on May 5, 2024
@BenWibking
Author

This issue seems to be the same as #441, and that has more informative logs, so I'll close this.

BenWibking closed this as not planned (duplicate) on May 7, 2024