[WIP] plugin: add max nodes limit support #444
Conversation
Problem: The Association class has no member to track an association's current node count across all of their running jobs. Add a new member called "cur_nodes" which represents an association's current node count across all of their running jobs. Add a default value in the case where we are creating a special temporary association while the plugin waits for flux-accounting data to be loaded in. Add cur_nodes values to the accounting unit tests as well as to the expected output of the plugin.query tests.
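A rough sketch of what this could look like; the member names other than cur_nodes are only illustrative of the kinds of fields the Association class already carries:

class Association {
public:
    int max_run_jobs;    // limit on concurrently running jobs
    int cur_run_jobs;    // currently running jobs
    int max_nodes;       // limit on nodes across all running jobs
    int cur_nodes = 0;   // new: current node count across all running jobs
    // ... other accounting fields (bank, fairshare, queues, ...) ...
};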
Problem: The priority plugin needs a way to count the number of nodes an association is using for their job. Add two new helper functions to the priority plugin. extract_nodelist () will be responsible for extracting the "nodelist" key-value pair out of a job's R. process_nodelist () is responsible for counting the number of nodes from this nodelist by using flux-core's hostlist library. Include flux-core's hostlist library in the plugin and in the plugin's Makefile.
Problem: The priority plugin does not increment the cur_nodes count when an association's job is about to run. Add an increment of an association's cur_nodes count by the number of nodes their job is going to use in the callback for job.state.run.
Problem: The priority plugin does not decrement the cur_nodes count when an association's job is finished running. Add a decrement of an association's cur_nodes count by the number of nodes their job used in the callback for job.state.inactive.
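As a rough sketch of the bookkeeping these two commits add (which a review comment below suggests merging into one commit), and assuming extract_nodelist () returns the nodelist array from R and process_nodelist () returns a node count, the shared logic could look like:

static int adjust_cur_nodes (Association *b, json_t *R, int delta)
{
    // delta is +1 in the job.state.run callback and -1 in job.state.inactive
    json_t *nodelist = extract_nodelist (R);
    if (!nodelist)
        return -1;
    int nnodes = process_nodelist (nodelist);
    if (nnodes < 0)
        return -1;    // node count unknown; leave cur_nodes untouched
    b->cur_nodes += delta * nnodes;
    return 0;
}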
Problem: flux-accounting has no tests for keeping track of the number of nodes an association is using across all of their running jobs. Add some tests that track the number of nodes an association is using while they submit jobs.
Problem: The priority plugin has no way of estimating the number of nodes a job is looking to use from its jobspec. Add a helper function to count the number of nodes a job is requesting. If a job is not requesting nodes explicitly, just return 1.
Problem: The plugin does not enforce a max-nodes limit on an association's set of running jobs. Add a check in job.state.depend that looks at the current and max node counts for the association's set of running jobs. If the job currently in job.state.depend would put the association over their max nodes limit, add a dependency to hold the job until a currently running one finishes. Since the callback for job.state.depend already checks and adds job dependencies for an association's max running jobs limit, combine the named dependency added for either limit (max running jobs or max nodes) to use the same dependency name. In the callback for job.state.inactive, rework the check for releasing a held job to also consider the association's node usage before releasing a job. The logic for removing an accounting dependency from a job now looks like this:
- If an association has at least one held job due to a flux-accounting dependency, grab the first held job.
- If the association is under their max running jobs limit, continue on to grab the jobspec of the first held job using flux_jobtap_job_lookup () and extract its node count.
- If releasing this job would keep the association under their max nodes limit, remove the dependency from the job.
If either limit cannot be satisfied, the dependency on any held job is not removed, and those jobs continue to be held until another of the association's running jobs finishes.
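A minimal sketch of the check described here for job.state.depend, assuming the association carries cur_run_jobs/max_run_jobs and cur_nodes/max_nodes counters; the helper name is illustrative:

static bool needs_accounting_dependency (const Association *b, int job_nnodes)
{
    // hold the job if the association is already at its max running jobs
    // limit, or if this job's node request would push it past max_nodes
    if (b->cur_run_jobs >= b->max_run_jobs)
        return true;
    if ((b->cur_nodes + job_nnodes) > b->max_nodes)
        return true;
    return false;
}

When this returns true, the depend callback would add the shared named dependency with flux_jobtap_dependency_add (); the inactive callback performs the mirror-image check described above before removing that dependency from a held job.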
Problem: Some of the test payloads in the testsuite don't have a "max_nodes" key-value pair, which is needed by the plugin in order to enforce limits. Add "max_nodes" key-value pairs to the test payloads sent to the priority plugin.
Problem: t1034-mf-priority-max-nodes.t does not have any tests simulating holding and releasing an association's set of jobs due to a max nodes limit. Add some tests.
I've made a first pass here.
I think the main thing we have to discuss and consider is expanding the "max nodes" concept to "max size" including cores+nodes+gpus.
Another thing that came to mind while I was looking through this:
- I wonder if we have to add more helpers to the jobtap interface so that 3rd party plugins don't have to parse jobspec and R (e.g. something like flux_jobtap_get_jobspec_size() and flux_jobtap_resource_size()), so that every plugin doesn't have to calculate these values individually (and potentially get it wrong)
// assume nodelist contains a single string entry
json_t *nodelist_str = json_array_get (nodelist, 0);
if (!nodelist_str || !json_is_string (nodelist_str)) return "";
I'm not sure this will always be the case. Instead the code should loop through all entries in the nodelist array and append them to a hostlist object with hostlist_append(). It may be easier, therefore, to just have a function that returns the nodelist count for an R object, e.g.
int R_nodelist_count (json_t *R)
{
// struct hostlist *hl = hostlist_create ();
// for each entry in R.execution.nodelist:
// hostlist_append (hl, entry);
// count = hostlist_count (hl);
// hostlist_destroy (hl);
// return count;
}
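One possible way to fill in that sketch (not tied to this PR's actual implementation), using jansson plus flux-core's hostlist API:

#include <jansson.h>
#include <flux/hostlist.h>

int R_nodelist_count (json_t *R)
{
    size_t index;
    json_t *entry;
    json_t *nodelist = NULL;
    int count = -1;
    struct hostlist *hl;

    if (!(hl = hostlist_create ()))
        return -1;
    // R.execution.nodelist is an array of hostlist-encoded strings
    if (json_unpack (R, "{s:{s:o}}", "execution", "nodelist", &nodelist) < 0)
        goto done;
    json_array_foreach (nodelist, index, entry) {
        if (hostlist_append (hl, json_string_value (entry)) < 0)
            goto done;
    }
    count = hostlist_count (hl);
done:
    hostlist_destroy (hl);
    return count;
}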
@@ -1051,12 +1051,14 @@ static int inactive_cb (flux_plugin_t *p,
    Association *b;
    std::map<int, std::map<std::string, Association>>::iterator it;
    std::map<std::string, Association>::iterator bank_it;
    json_t *R = NULL;
It's kind of strange for me to have the increment and decrement of cur_nodes in separate commits. They should probably be combined so that the commit makes one useful change, maybe titled "plugin: support accounting of association cur_nodes" or similar.
int nnodes = process_nodelist (nodelist);
if (nnodes < 0)
    flux_jobtap_raise_exception (p, FLUX_JOBTAP_CURRENT_JOB,
                                 "mf_priority", 0,
                                 "job.state.inactive: error "
                                 "counting nnodes for job");
Since the job is inactive, raising an exception probably won't do anything here.

(Also, the decrement of b->cur_nodes should be skipped if nnodes < 0 to avoid instead incrementing it. The same goes for where b->cur_nodes is incremented in the previous commit.)

Actually I wonder if there's a good way for the plugin to save the nnodes for a job in the job aux item list so it doesn't even have to be recalculated here. Let me think on that.
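If that turns out to be workable, a rough sketch using the jobtap per-job aux list might look like the following (the aux key name here is made up):

#include <stdlib.h>
#include <flux/core.h>
#include <flux/jobtap.h>

// store the node count on the current job so inactive_cb can read it back
static int save_job_nnodes (flux_plugin_t *p, int nnodes)
{
    int *np = (int *) malloc (sizeof (int));
    if (!np)
        return -1;
    *np = nnodes;
    if (flux_jobtap_job_aux_set (p, FLUX_JOBTAP_CURRENT_JOB,
                                 "mf_priority::nnodes", np, free) < 0) {
        free (np);
        return -1;
    }
    return 0;
}

// return the cached node count, or -1 if it was never stored
static int lookup_job_nnodes (flux_plugin_t *p)
{
    int *np = (int *) flux_jobtap_job_aux_get (p, FLUX_JOBTAP_CURRENT_JOB,
                                               "mf_priority::nnodes");
    return np ? *np : -1;
}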
test_under_flux 5 job -o,--config-path=$(pwd)/conf.d

flux setattr log-stderr-level 1
Note: Due to changes in flux-core you'll now have to enable guest access to the testexec implementation with
flux config load <<-EOF
[exec.testexec]
allow-guests = true
EOF
int extract_resources (json_t *root) {
    if (!root || !json_is_array (root)) return -1;

    json_t *resources = json_array_get (root, 0);
    if (!resources || !json_is_object (resources)) return -1;

    json_t *type = json_object_get (resources, "type");
    json_t *count = json_object_get (resources, "count");
    if (!type || !json_is_string (type)) return -1;

    // if the association did not explicitly request nodes for their job,
    // just return 1
    if (strcmp (json_string_value (type), "node") != 0) return 1;

    if (!count || !json_is_integer (count)) return -1;
    int nnodes = json_integer_value (count);
    return nnodes;
}
Let's consider how to do this more reliably. Probably returning 1 node if a user doesn't explicitly request nodes is not going to work in production, because users could bypass all node limits by always requesting only cores, e.g. flux alloc -n64000. Instead we may have to do something more like the flux-core limit-job-size plugin, where there is a limit on the total number of cores+nodes+gpus, and if a job exceeds any one of those it is deferred.
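For illustration only, a limit-job-size-style tally over the jobspec resources section could look something like this (the helper is hypothetical and ignores ranges and exclusivity flags):

#include <string.h>
#include <jansson.h>

// walk the jobspec resources array, multiplying counts down the "with"
// hierarchy, and accumulate per-type totals
static void count_resources (json_t *resources, int multiplier,
                             int *nnodes, int *ncores, int *ngpus)
{
    size_t index;
    json_t *entry;

    json_array_foreach (resources, index, entry) {
        const char *type = json_string_value (json_object_get (entry, "type"));
        int count = (int) json_integer_value (json_object_get (entry, "count"));
        json_t *with = json_object_get (entry, "with");

        if (!type || count <= 0)
            continue;
        if (strcmp (type, "node") == 0)
            *nnodes += multiplier * count;
        else if (strcmp (type, "core") == 0)
            *ncores += multiplier * count;
        else if (strcmp (type, "gpu") == 0)
            *ngpus += multiplier * count;
        if (with)
            count_resources (with, multiplier * count, nnodes, ncores, ngpus);
    }
}

Each total could then be checked against its own per-association limit, deferring the job if any one of them would be exceeded.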
Background
The priority plugin does not implement a max nodes limit to be enforced on an association's set of currently running jobs.
This PR is built on top of #442 and looks to add basic support for enforcing a max nodes limit on an association's set of running jobs. The method for implementing this limit follows the same template as the max running jobs limit: when a job is in job.state.depend, the number of nodes the submitted job is looking to use is extracted from its jobspec and added to the association's current node count. If a job does not specify any nodes, the plugin just assumes a node count of 1 for the job. If the job would put the association over their limit, a dependency is added to the job and it is held until a currently running job finishes and cleans up.

To do this, I've proposed combining the two limits (max running jobs & max nodes) into one named dependency. So, if an association hits either their max running jobs or max nodes limit, the same dependency is added.

In the callback for job.state.inactive, the logic for releasing a held job due to a flux-accounting limit is slightly reworked. If an association is under their max running jobs limit, the first held job is grabbed and its node count is inspected, similar to how it is checked in job.state.depend. If releasing this held job would keep the association under or equal to their max nodes limit, the dependency is removed and the job can move on to being run. If the limit cannot be satisfied, the dependency is not removed and no held jobs are released until another of the association's currently running jobs finishes and cleans up.

A couple of basic tests are added to t1034-mf-priority-max-nodes.t to simulate submitting jobs that take up all of the association's node limit and having a job held due to their max nodes limit. Once the currently running job finishes, a test checks that the held job transitions to run.

TODO