
[WIP] plugin: add max nodes limit support #444

Closed
wants to merge 9 commits

Conversation

cmoussa1 (Member)

Background

The priority plugin does not enforce a max nodes limit on an association's set of currently running jobs.


This PR is built on top of #442 and adds basic support for enforcing a max nodes limit on an association's set of running jobs. The implementation follows the same template as the max running jobs limit: when a job is in job.state.depend, the number of nodes the submitted job is requesting is extracted from its jobspec and added to the association's current node count. If a job does not explicitly request any nodes, the plugin just assumes a node count of 1. If the job would put the association over their limit, a dependency is added to the job and it is held until one of the association's currently running jobs finishes and cleans up.

To do this, I've proposed combining the two limits (max running jobs & max nodes) into one named dependency. So, if an association hits either their max running jobs or max node limit(s), the same dependency is added.
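A rough sketch of what that check in job.state.depend could look like is below. This is illustrative only: the Association member names, the held-jobs list, and the dependency string are stand-ins, not necessarily what this branch uses.

// Illustrative sketch of the job.state.depend check; member names and the
// dependency string are stand-ins.
static int maybe_hold_job (flux_plugin_t *p,
                           flux_jobid_t id,
                           Association *b,
                           int nnodes)
{
    bool over_jobs = b->max_run_jobs > 0
                     && b->cur_run_jobs >= b->max_run_jobs;
    bool over_nodes = b->max_nodes > 0
                      && b->cur_nodes + nnodes > b->max_nodes;

    if (!over_jobs && !over_nodes)
        return 0;
    // one shared named dependency covers both limits
    if (flux_jobtap_dependency_add (p, id, "max-resources-user-limit") < 0)
        return -1;
    b->held_jobs.push_back (id);
    return 0;
}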

In the callback for job.state.inactive, the logic for releasing a job held due to a flux-accounting limit is slightly reworked. If the association is under their max running jobs limit, the first held job is grabbed and its node count is inspected, similar to the check in job.state.depend. If releasing this held job would keep the association at or under their max nodes limit, the dependency is removed and the job can move on to being run. If the limit cannot be satisfied, the dependency is not removed and no held jobs are released until another one of the association's currently running jobs finishes and cleans up.
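Again as a sketch only (the helper for estimating the held job's node count and the member/dependency names below are hypothetical), the reworked release check amounts to something like:

// Illustrative sketch of the reworked release check in job.state.inactive.
static void maybe_release_held_job (flux_plugin_t *p, Association *b)
{
    if (b->held_jobs.empty ())
        return;
    // stay held while the association is still at their max running jobs limit
    if (b->max_run_jobs > 0 && b->cur_run_jobs >= b->max_run_jobs)
        return;

    flux_jobid_t id = b->held_jobs.front ();
    // estimate the held job's node request from its jobspec (e.g. looked up
    // with flux_jobtap_job_lookup (); details elided); hypothetical helper:
    int nnodes = estimated_nnodes (p, id);
    if (nnodes < 0)
        return;

    // only release if doing so keeps the association at or under max_nodes
    if (b->max_nodes > 0 && b->cur_nodes + nnodes > b->max_nodes)
        return;

    if (flux_jobtap_dependency_remove (p, id, "max-resources-user-limit") == 0)
        b->held_jobs.erase (b->held_jobs.begin ());
}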

A couple of basic tests are added to t1034-mf-priority-max-nodes.t to simulate submitting jobs that take up all of the association's node limit and having a job held due to their max nodes limit. Once the currently running job finishes, a test checks that the held job transitions to run.

TODO
  • I should probably add some tests that simulate submitting some jobs that don't specify any nodes to make sure it works as expected

cmoussa1 added 8 commits April 9, 2024 08:52
Problem: The Association class has no member to track an association's
current node count across all of their running jobs.

Add a new member called "cur_nodes" which represents an association's
current node count across all of their running jobs.

Add a default value in the case where we are creating a special
temporary association while the plugin waits for flux-accounting data
to be loaded in.

Add cur_nodes values to the accounting unit tests as well as to the
expected output of the plugin.query tests.
Problem: The priority plugin needs a way to count the number of nodes an
association is using for their job.

Add two new helper functions to the priority plugin.

extract_nodelist () will be responsible for extracting the "nodelist"
key-value pair out of a job's R.

process_nodelist () is responsible for counting the number of nodes from
this nodelist by using flux-core's hostlist library.

Include flux-core's hostlist library in the plugin and in the plugin's
Makefile.
Problem: The priority plugin does not increment the cur_nodes count when
an association's job is about to run.

Add an increment of an association's cur_nodes count by the number of
nodes their job is going to use in the callback for job.state.run.
Problem: The priority plugin does not decrement the cur_nodes count when
an association's job is finished running.

Add a decrement of an association's cur_nodes count by the number of
nodes their job used in the callback for job.state.inactive.
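Taken together, these two commits amount to bookkeeping along the following lines (sketch only; nnodes here is assumed to come from the R-parsing helpers above, and the guard against a failed count follows the review suggestion further down):

// job.state.run: charge the nodes this job will use to the association
if (nnodes >= 0)
    b->cur_nodes += nnodes;

// job.state.inactive: give those nodes back once the job completes
if (nnodes >= 0)
    b->cur_nodes -= nnodes;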
Problem: flux-accounting has no tests for keeping track of the number of
nodes an association is using across all of their running jobs.

Add some tests that track the number of nodes an association is using
while they submit jobs.
Problem: The priority plugin has no way of estimating the number of
nodes a job is looking to use from its jobspec.

Add a helper function to count the number of nodes a job is requesting.
If a job is not requesting nodes explicitly, just return 1.
Problem: The plugin does not enforce a max-nodes limit on an
association's set of running jobs.

Add a check in job.state.depend to look at the current and max nodes
count for the association's set of running jobs. If the job currently in
job.state.depend would put the association over their max nodes limit,
add a dependency to hold the job until a currently running one finishes
running.

Since the callback for job.state.depend already checks and adds a job
dependency for an association's max running jobs limit, combine the two
limits (max running jobs and max nodes) so that either one adds the same
named dependency.

In the callback for job.state.inactive, rework the check for releasing a
held job to also check for the association's node usage before releasing
a job. The logic for removing an accounting dependency for a job now
looks like this:

- If an association has at least one held job due to a flux-accounting
dependency, grab the first held job.
- If the association is under their max running jobs limit, continue on
to grab the jobspec of the first held job using
flux_jobtap_job_lookup (). Extract the node count from the held job.
- If releasing this job would keep the association under their max
nodes limit, remove the dependency from the job.

If either limit cannot be satisfied, the dependency on any held job is
not removed and those jobs will continue to be held until another one of
the association's set of running jobs finishes running.
Problem: Some of the test payloads in the testsuite don't have a
"max_nodes" key-value pair, which is needed by the plugin in order to
enforce limits.

Add "max_nodes" key-value pairs to the test payloads sent to the
priority plugin.
cmoussa1 added the "new feature" and "plugin" (related to the multi-factor priority plugin) labels on Apr 18, 2024
Problem: t1034-mf-priority-max-nodes.t does not have any tests
simulating holding and releasing an association's set of jobs due to a
max nodes limit.

Add some tests.
grondo (Contributor) left a comment

I've made a first pass here.

I think the main thing we have to discuss and consider is expanding the "max nodes" concept to "max size" including cores+nodes+gpus.

Another thing that came to mind while I was looking through this:

  • I wonder if we have to add more helpers to the jobtap interface (e.g. something like flux_jobtap_get_jobspec_size() and flux_jobtap_resource_size()) so that 3rd party plugins don't have to parse jobspec and R and calculate these values individually themselves (and potentially get it wrong).

Comment on lines +182 to +184
// assume nodelist contains a single string entry
json_t *nodelist_str = json_array_get (nodelist, 0);
if (!nodelist_str || !json_is_string (nodelist_str)) return "";

I'm not sure this will always be the case. Instead the code should loop through all entries in the nodelist array and append them to a hostlist object with hostlist_append(). It may be easier, therefore, to just have a function that returns the nodelist count for an R object, e.g.

int R_nodelist_count (json_t *R)
{
    // struct hostlist *hl = hostlist_create ();
    // for each entry in R.execution.nodelist:
    //     hostlist_append (hl, entry);
    // count = hostlist_count (hl);
    // hostlist_destroy (hl);
    // return count;
}
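For reference, a filled-in version of that sketch might look like the following, assuming jansson plus flux-core's hostlist API (flux/hostlist.h):

#include <jansson.h>
#include <flux/hostlist.h>

// Count the total number of nodes in an R object by appending every entry
// of R.execution.nodelist to a hostlist. Returns -1 on error.
int R_nodelist_count (json_t *R)
{
    json_t *nodelist;
    if (json_unpack (R, "{s:{s:o}}", "execution", "nodelist", &nodelist) < 0
        || !json_is_array (nodelist))
        return -1;

    struct hostlist *hl = hostlist_create ();
    if (!hl)
        return -1;

    size_t index;
    json_t *entry;
    json_array_foreach (nodelist, index, entry) {
        if (!json_is_string (entry)
            || hostlist_append (hl, json_string_value (entry)) < 0) {
            hostlist_destroy (hl);
            return -1;
        }
    }
    int count = hostlist_count (hl);
    hostlist_destroy (hl);
    return count;
}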

@@ -1051,12 +1051,14 @@ static int inactive_cb (flux_plugin_t *p,
Association *b;
std::map<int, std::map<std::string, Association>>::iterator it;
std::map<std::string, Association>::iterator bank_it;
json_t *R = NULL;

It's kind of strange to me to have the increment and decrement of cur_nodes in separate commits. They should probably be combined so that the commit makes one useful change, maybe titled "plugin: support accounting of association cur_nodes" or similar.

Comment on lines +1095 to +1100
int nnodes = process_nodelist (nodelist);
if (nnodes < 0)
    flux_jobtap_raise_exception (p, FLUX_JOBTAP_CURRENT_JOB,
                                 "mf_priority", 0,
                                 "job.state.inactive: error "
                                 "counting nnodes for job");

Since the job is inactive, raising an exception probably won't do anything here.
(Also, the decrement of b->cur_nodes should be skipped if nnodes < 0 to avoid instead incrementing it. The same goes for where b->cur_nodes is incremented in the previous commit)

Actually I wonder if there's a good way for the plugin to save the nnodes for a job in the job aux item list so it doesn't even have to be recalculated here. Let me think on that.
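For illustration only, caching the count with the jobtap job aux API (flux_jobtap_job_aux_set()/flux_jobtap_job_aux_get()) could look roughly like the following; the aux key name is made up, and encoding the count in the pointer value is just one way to avoid an allocation (it cannot distinguish a stored 0 from "not set"):

#include <stdint.h>

// Where nnodes is first computed (e.g. job.state.run), stash it on the job;
// on error, fall back to recounting from R later.
(void) flux_jobtap_job_aux_set (p,
                                FLUX_JOBTAP_CURRENT_JOB,
                                "mf_priority::nnodes",
                                (void *)(intptr_t) nnodes,
                                NULL);

// In job.state.inactive, retrieve it instead of re-parsing R:
int nnodes = (int)(intptr_t) flux_jobtap_job_aux_get (p,
                                                      FLUX_JOBTAP_CURRENT_JOB,
                                                      "mf_priority::nnodes");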

test_under_flux 5 job -o,--config-path=$(pwd)/conf.d

flux setattr log-stderr-level 1


Note: Due to changes in flux-core you'll now have to enable guest access to the testexec implementation with

flux config load <<-EOF
[exec.testexec]
allow-guests = true
EOF

Comment on lines +209 to +226
int extract_resources (json_t *root) {
if (!root || !json_is_array (root)) return -1;

json_t *resources = json_array_get (root, 0);
if (!resources || !json_is_object (resources)) return -1;

json_t *type = json_object_get (resources, "type");
json_t *count = json_object_get (resources, "count");
if (!type || !json_is_string (type)) return -1;

// if the association did not explicitly request nodes for their job,
// just return 1
if (strcmp (json_string_value (type), "node") != 0) return 1;

if (!count || !json_is_integer (count)) return -1;
int nnodes = json_integer_value (count);
return nnodes;
}

Let's consider how to do this more reliably. Probably returning 1 node if a user doesn't explicitly request nodes is not going to work in production because users could bypass all node limits by always requesting only cores: e.g. flux alloc -n64000

Instead we may have to do something more like the flux-core limit-job-size plugin, where there is a limit on the total number of cores+nodes+gpus and if a job exceeds any one of those it is deferred.
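Sketching that idea (purely illustrative; the counts struct and limit fields below are made up), the per-resource comparison could look something like:

// Purely illustrative: defer the job if any one resource request would push
// the association past the corresponding limit.
struct resource_counts {
    int nnodes;
    int ncores;
    int ngpus;
};

static bool exceeds_size_limits (const struct resource_counts *req,
                                 const struct resource_counts *cur,
                                 const struct resource_counts *max)
{
    return (max->nnodes > 0 && cur->nnodes + req->nnodes > max->nnodes)
        || (max->ncores > 0 && cur->ncores + req->ncores > max->ncores)
        || (max->ngpus > 0 && cur->ngpus + req->ngpus > max->ngpus);
}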
