all pending jobs killed after Flux update #406

Closed
grondo opened this issue Jan 8, 2024 · 5 comments

@grondo
Contributor

grondo commented Jan 8, 2024

After a flux-core update to v0.58.0, all the pending jobs on fluke were killed with an exception from the mf_priority plugin. For example:

[root@fluke108:~]# flux job eventlog -H f2m1t75LN7vo
[Jan05 18:00] submit userid=61494 urgency=16 flags=0 version=1
[  +0.018063] jobspec-update attributes.system.bank="guests"
[  +0.018097] jobspec-update attributes.system.bank="guests"
[  +0.018144] validate
[  +0.034171] depend
[  +0.034206] priority priority=625
[Jan08 08:20] flux-restart
[  +0.000034] exception type="mf_priority" severity=0 note="not a member of guests"
[  +0.000062] priority priority=16
[  +0.000083] clean

The user, testqe, is in the guests bank though:

$ flux account view-bank --users guests | grep testqe
testqe            61494             guests            1                 175899119.29370007 0.00625           100
@cmoussa1
Member

cmoussa1 commented Jan 8, 2024

Without trying to construct a reproducer (yet), I believe this type of exception message is raised when a job's user/bank information is being updated in job.state.priority and the plugin cannot find a valid user/bank entry for the user that this job is submitted under:

if (bank_it == it->second.end ()) {
    flux_jobtap_raise_exception (p, FLUX_JOBTAP_CURRENT_JOB,
                                 "mf_priority", 0,
                                 "not a member of %s", bank);

Looking at the timestamps in the eventlog, the sequence appears to be:

  1. the job was submitted successfully (and received a priority)
  2. Flux was restarted
  3. the priority plugin was loaded
  4. jobs were reprioritized before the plugin received any flux-accounting data, so it rejected this job and presumably all other pending jobs.
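
The sequence above can be modeled in a few lines of Python. This is only a toy sketch, not the plugin's C++ code: `ToyPriorityPlugin`, `users`, `load_accounting_data`, and `reprioritize` are hypothetical names, and the nested dict is an assumed stand-in for the plugin's internal user/bank map.

```python
class ToyPriorityPlugin:
    """Toy stand-in for the priority plugin (hypothetical, not the real API)."""

    def __init__(self):
        # userid -> {bank name -> bank info}; empty until accounting data arrives
        self.users = {}

    def load_accounting_data(self, userid, bank):
        # Record that this user belongs to this bank
        self.users.setdefault(userid, {})[bank] = {"shares": 1}

    def reprioritize(self, userid, bank):
        banks = self.users.get(userid, {})
        if bank not in banks:
            # Mirrors the quoted check: no user/bank entry means an exception
            return f"exception: not a member of {bank}"
        return "priority assigned"


plugin = ToyPriorityPlugin()                    # plugin freshly loaded, map empty
before = plugin.reprioritize(61494, "guests")   # reprioritized too early
plugin.load_accounting_data(61494, "guests")    # accounting data arrives later
after = plugin.reprioritize(61494, "guests")
print(before)
print(after)
```

In this model the first call fails only because the map has not been populated yet, which would match the eventlog if reprioritization ran before any flux-accounting data was sent to the plugin.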

I'll see if I can reproduce this behavior.

@cmoussa1
Member

cmoussa1 commented Jan 8, 2024

OK, I think I was able to reproduce this without restarting Flux, just by unloading and reloading the plugin. If I have a number of jobs in SCHED state (i.e., they've received a priority and are waiting to run), reload the plugin without updating it with flux-accounting information, and then call reprioritize on all jobs:

(flux.Flux().rpc("job-manager.mf_priority.reprioritize"))

they will have an exception raised on them saying that the plugin cannot find a valid user/bank entry for those previously held jobs.

Is there a step during a Flux restart in which jobs are reprioritized? My thinking is that this might have been the cause.

In any case, the plugin should probably handle this case more gracefully so a bunch of users' jobs don't get canceled if Flux gets restarted.

I'll need to test this, but off the top of my head, I think I can include a check in the callback for job.state.priority that checks the plugin's internal map for data before deciding what to do with the job going through reprioritization.

If the plugin's internal map is empty (i.e., it is still waiting for flux-accounting information), it can continue to hold the job in PRIORITY until it loads some information. This would be similar to the behavior in the callback for job.validate.
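
A rough sketch of that check, written as a Python model rather than the plugin's actual C++ callback. The function name `priority_callback`, the `users` dict, and the returned strings are all hypothetical and only illustrate the order of the checks; the real fix may differ.

```python
def priority_callback(users, userid, bank):
    """Toy model of the proposed job.state.priority handling (hypothetical)."""
    if not users:
        # Map is empty: still waiting on flux-accounting data, so hold the job
        return "hold in PRIORITY"
    if bank not in users.get(userid, {}):
        # Data is loaded but this user/bank pair is genuinely missing
        return f"exception: not a member of {bank}"
    return "priority assigned"


empty_map = {}
loaded_map = {61494: {"guests": {"shares": 1}}}

held = priority_callback(empty_map, 61494, "guests")
assigned = priority_callback(loaded_map, 61494, "guests")
rejected = priority_callback(loaded_map, 61494, "other_bank")
print(held, assigned, rejected, sep="\n")
```

The empty-map check comes first so that a missing entry is only treated as an error once the plugin has actually received some data.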

@grondo
Contributor Author

grondo commented Jan 8, 2024

Nice job debugging @cmoussa1!

All jobs are prioritized any time a jobtap plugin is loaded, so I had assumed this would happen after mf_priority.so is loaded.

cmoussa1 added a commit to cmoussa1/flux-accounting that referenced this issue Jan 8, 2024
Problem: the priority plugin will raise an exception on a job if it is
held in SCHED state while the plugin is reloaded (or Flux is restarted)
and jobs are reprioritized without first loading flux-accounting data to
this plugin. This behavior is not graceful and we should instead
continue to hold a job in PRIORITY while the plugin waits to receive
flux-accounting data.

Add a check of the plugin's internal map to see if we are still waiting
on flux-accounting data to be loaded in; if so, continue to hold the job
while we wait for data.

Add a sharness test that reproduces the issue raised in flux-framework#406 and ensure
that jobs continue to be held after a reprioritization without loading
flux-accounting data to the priority plugin.
@cmoussa1 cmoussa1 added the bug Something isn't working label Jan 8, 2024
@cmoussa1 cmoussa1 self-assigned this Jan 8, 2024
@grondo
Contributor Author

grondo commented Jan 9, 2024

Closed by #407?

@cmoussa1
Member

cmoussa1 commented Jan 9, 2024

Ah, yes, should be closed by #407 - sorry that I didn't close this yesterday. Closing now

@cmoussa1 cmoussa1 closed this as completed Jan 9, 2024