Bugfix/squeue last job workaround (#74)
* When squeue -j is given only one job id, and that job id is invalid, the squeue exit status is 1. That causes a WorkflowMgr::SchedulerDown exception, preventing sacct from being run. Rocotorun will refuse to submit any more jobs at that point until the user manually intervenes.

* Do not catch an exception that is never thrown. Remove some commented-out code.

Co-authored-by: Christopher Harrop <[email protected]>
Co-authored-by: samuel.trahan <[email protected]>
3 people authored Jul 30, 2020
1 parent 6ffbf6f commit 0906e5f
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions lib/workflowmgr/slurmbatchsystem.rb
@@ -300,13 +300,13 @@ def submit(task)

 queued_jobs,errors,exit_status=WorkflowMgr.run4("squeue -u #{username} -M all -t all -O jobid:40,comment:32", 45)

-# Raise SchedulerDown if the command failed
-raise WorkflowMgr::SchedulerDown,errors unless exit_status==0
+# Don't raise SchedulerDown if the command failed, otherwise
+# jobs that have moved to sacct will be missed.

 # Return if the output is empty
 return nil,output if queued_jobs.empty?

-rescue Timeout::Error,WorkflowMgr::SchedulerDown
+rescue Timeout::Error
 WorkflowMgr.log("#{$!}")
 WorkflowMgr.stderr("#{$!}",3)
 raise WorkflowMgr::SchedulerDown
@@ -385,13 +385,13 @@ def refresh_jobqueue(jobids)
 queued_jobs,errors,exit_status=WorkflowMgr.run4("squeue --jobs=#{joblist} -M all -t all -O jobid:40,username:40,numcpus:10,partition:20,submittime:30,starttime:30,endtime:30,priority:30,exit_code:10,state:30,name:200",45)
 end

-# Raise SchedulerDown if the command failed
-raise WorkflowMgr::SchedulerDown,errors unless exit_status==0
+# Don't raise SchedulerDown if the command failed, otherwise
+# jobs that have moved to sacct will be missed

 # Return if the output is empty
 return if queued_jobs.empty?

-rescue Timeout::Error,WorkflowMgr::SchedulerDown
+rescue Timeout::Error
 WorkflowMgr.log("#{$!}")
 WorkflowMgr.stderr("#{$!}",3)
 raise WorkflowMgr::SchedulerDown
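
The control flow this change aims for can be sketched in isolation. The following is a minimal illustration, not the actual Rocoto implementation: `SchedulerDown` is a local stand-in for `WorkflowMgr::SchedulerDown`, `parse_jobs` and `job_states` are hypothetical helpers, the `squeue` and `sacct` arguments are injected callables that return the same three values the diff unpacks from `WorkflowMgr.run4`, and the two-column "jobid state" output format is assumed for brevity.

```ruby
require 'timeout'

# Hypothetical stand-in for WorkflowMgr::SchedulerDown.
class SchedulerDown < RuntimeError; end

# Parse simplified "jobid state" lines into a hash (format assumed for this sketch).
def parse_jobs(text)
  text.each_line.each_with_object({}) do |line, jobs|
    id, state = line.split
    jobs[id] = state if id && state
  end
end

# squeue and sacct are callables returning [stdout, stderr, exit_status].
def job_states(jobids, squeue:, sacct:)
  begin
    out, _errors, _exit_status = squeue.call(jobids)
  rescue Timeout::Error => e
    # Only a timeout is treated as "scheduler down".
    raise SchedulerDown, e.message
  end

  # The exit status is deliberately not checked: a nonzero status with
  # empty output may only mean the requested job has left the queue.
  jobs = parse_jobs(out)

  # Any job squeue did not report is looked up in the accounting records.
  missing = jobids - jobs.keys
  unless missing.empty?
    acct_out, _errors, _exit_status = sacct.call(missing)
    jobs.merge!(parse_jobs(acct_out))
  end
  jobs
end

# Simulated squeue: exit status 1 and no output, as the commit message
# describes for a single invalid job id (error text is illustrative).
squeue = ->(_ids) { ["", "invalid job id", 1] }
# Simulated sacct: reports every requested job as completed.
sacct = ->(ids) { [ids.map { |id| "#{id} COMPLETED" }.join("\n"), "", 0] }

p job_states(["1234"], squeue: squeue, sacct: sacct)
```

With the exit-status check in place, the simulated `squeue` failure would have raised before `sacct` was ever consulted. Without it, the missing job is resolved from the accounting lookup, while a genuine timeout still surfaces as `SchedulerDown`.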