all pending jobs killed after Flux update #406
Comments
Without trying to construct a reproducer (yet), I believe this type of exception message is raised when a job's user/bank information is being updated in `flux-accounting/src/plugins/mf_priority.cpp`, lines 676 to 679 at commit 1a84215.
Looking at the timestamps of the eventlog, it looks like the job was:
I'll see if I can reproduce this behavior.
OK, I actually think I was able to reproduce this without having to restart Flux, just by unloading and reloading the plugin. If I have a number of jobs in SCHED state (i.e., they've received a priority and are waiting to run) and I reload the plugin without updating it with flux-accounting information and call ( they will have an exception raised on them saying that the plugin cannot find a valid user/bank entry for those previously held jobs.

Is there a process in restarting Flux where jobs could be reprioritized? My thinking is that might have been what caused this.

In any case, the plugin should probably handle this case more gracefully so a bunch of users' jobs don't get canceled if Flux gets restarted. I'll need to test this, but off the top of my head, I think I can include a check in the callback for If the plugin's internal map is empty (i.e., it is waiting for flux-accounting information), it can continue to hold the job in PRIORITY until it loads some information. This would be similar to the behavior in the callback for
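To make the proposed check concrete, here is a minimal, standalone sketch of the decision logic. It is not the plugin's actual code: `Association`, `UserMap`, `Decision`, and `on_priority_request` are hypothetical names invented for illustration, and the real plugin would act through the jobtap API rather than returning an enum. The only point is the ordering of the checks: an empty map means "still waiting for flux-accounting data", so the job is held instead of having an exception raised on it.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the per-user/bank data the
// plugin receives from flux-accounting.
struct Association {
    std::string bank;
    double fairshare;
};

// Hypothetical layout: userid -> (bank name -> association).
using UserMap = std::map<int, std::map<std::string, Association>>;

// What the plugin would do with a job whose priority is requested.
enum class Decision { Hold, Prioritize, RaiseException };

Decision on_priority_request (const UserMap &users,
                              int userid,
                              const std::string &bank)
{
    // No data loaded at all (e.g. the plugin was just reloaded):
    // keep holding the job rather than treating the missing entry
    // as an error.
    if (users.empty ())
        return Decision::Hold;

    // Data is loaded, so a missing user or bank is a genuine error.
    auto user_it = users.find (userid);
    if (user_it == users.end ())
        return Decision::RaiseException;
    if (user_it->second.find (bank) == user_it->second.end ())
        return Decision::RaiseException;

    return Decision::Prioritize;
}
```

With an empty `UserMap`, every request yields `Decision::Hold`; once at least one user has been loaded, a lookup for an unknown user or bank yields `Decision::RaiseException`, which preserves the existing behavior for jobs that really have no valid entry.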
Nice job debugging @cmoussa1! All jobs are prioritized any time a jobtap plugin is loaded, so I had assumed this would happen after
Problem: the priority plugin will raise an exception on a job if it is held in SCHED state while the plugin is reloaded (or Flux is restarted) and jobs are reprioritized without first loading flux-accounting data to this plugin. This behavior is not graceful and we should instead continue to hold a job in PRIORITY while the plugin waits to receive flux-accounting data. Add a check of the plugin's internal map to see if we are still waiting on flux-accounting data to be loaded in; if so, continue to hold the job while we wait for data. Add a sharness test that reproduces the issue raised in flux-framework#406 and ensure that jobs continue to be held after a reprioritization without loading flux-accounting data to the priority plugin.
Closed by #407?
Ah, yes, should be closed by #407 - sorry that I didn't close this yesterday. Closing now.
After a flux-core update to v0.58.0, all the pending jobs on fluke were killed with an exception from the `mf_priority` plugin, e.g.:
The user, `testqe`, is in the `guests` bank though: