
[WIP] plugin: add max nodes limit support #444

Closed
wants to merge 9 commits

Conversation

cmoussa1 (Member)

Background

The priority plugin does not enforce a max nodes limit on an association's set of currently running jobs.


This PR is built on top of #442 and adds basic support for enforcing a max nodes limit on an association's set of running jobs. The implementation follows the same template as the max running jobs limit: when a job is in job.state.depend, the number of nodes the submitted job is requesting is extracted from its jobspec and added to the association's current node count. If a job does not explicitly request any nodes, the plugin just assumes a node count of 1. If the job would put the association over their limit, a dependency is added to the job and it is held until one of the association's currently running jobs finishes and cleans up.

To do this, I've proposed combining the two limits (max running jobs & max nodes) into one named dependency. So, if an association hits either their max running jobs or max node limit(s), the same dependency is added.
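A rough sketch of what that check in job.state.depend could look like is below. This is illustrative only: the Association member names, the held-jobs list, and the dependency string are stand-ins, not necessarily what this branch uses.

// Illustrative sketch of the job.state.depend check; member names and the
// dependency string are stand-ins.
static int maybe_hold_job (flux_plugin_t *p,
                           flux_jobid_t id,
                           Association *b,
                           int nnodes)
{
    bool over_jobs = b->max_run_jobs > 0
                     && b->cur_run_jobs >= b->max_run_jobs;
    bool over_nodes = b->max_nodes > 0
                      && b->cur_nodes + nnodes > b->max_nodes;

    if (!over_jobs && !over_nodes)
        return 0;
    // one shared named dependency covers both limits
    if (flux_jobtap_dependency_add (p, id, "max-resources-user-limit") < 0)
        return -1;
    b->held_jobs.push_back (id);
    return 0;
}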

In the callback for job.state.inactive, the logic for releasing a job held due to a flux-accounting limit is slightly reworked. If the association is under their max running jobs limit, the first held job is grabbed and its node count is inspected, similar to the check in job.state.depend. If releasing this held job would keep the association at or under their max nodes limit, the dependency is removed and the job can move on to being run. If the limit cannot be satisfied, the dependency is not removed and no held jobs are released until another one of the association's currently running jobs finishes and cleans up.
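Again as a sketch only (the helper for estimating the held job's node count and the member/dependency names below are hypothetical), the reworked release check amounts to something like:

// Illustrative sketch of the reworked release check in job.state.inactive.
static void maybe_release_held_job (flux_plugin_t *p, Association *b)
{
    if (b->held_jobs.empty ())
        return;
    // stay held while the association is still at their max running jobs limit
    if (b->max_run_jobs > 0 && b->cur_run_jobs >= b->max_run_jobs)
        return;

    flux_jobid_t id = b->held_jobs.front ();
    // estimate the held job's node request from its jobspec (e.g. looked up
    // with flux_jobtap_job_lookup (); details elided); hypothetical helper:
    int nnodes = estimated_nnodes (p, id);
    if (nnodes < 0)
        return;

    // only release if doing so keeps the association at or under max_nodes
    if (b->max_nodes > 0 && b->cur_nodes + nnodes > b->max_nodes)
        return;

    if (flux_jobtap_dependency_remove (p, id, "max-resources-user-limit") == 0)
        b->held_jobs.erase (b->held_jobs.begin ());
}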

A couple of basic tests are added to t1034-mf-priority-max-nodes.t to simulate submitting jobs that take up all of the association's node limit and having a job held due to their max nodes limit. Once the currently running job finishes, a test checks that the held job transitions to run.

TODO
  • I should probably add some tests that simulate submitting some jobs that don't specify any nodes to make sure it works as expected

cmoussa1 added 8 commits April 9, 2024 08:52
Problem: The Association class has no member to track an association's
current node count across all of their running jobs.

Add a new member called "cur_nodes" which represents an association's
current node count across all of their running jobs.

Add a default value in the case where we are creating a special
temporary association while the plugin waits for flux-accounting data
to be loaded in.

Add cur_nodes values to the accounting unit tests as well as to the
expected output of the plugin.query tests.
Problem: The priority plugin needs a way to count the number of nodes an
association is using for their job.

Add two new helper functions to the priority plugin.

extract_nodelist () will be responsible for extracting the "nodelist"
key-value pair out of a job's R.

process_nodelist () is responsible for counting the number of nodes from
this nodelist by using flux-core's hostlist library.

Include flux-core's hostlist library in the plugin and in the plugin's
Makefile.
Problem: The priority plugin does not increment the cur_nodes count when
an association's job is about to run.

Add an increment of an association's cur_nodes count by the number of
nodes their job is going to use in the callback for job.state.run.
Problem: The priority plugin does not decrement the cur_nodes count when
an association's job is finished running.

Add a decrement of an association's cur_nodes count by the number of
nodes their job used in the callback for job.state.inactive.
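Taken together, these two commits amount to bookkeeping along the following lines (sketch only; nnodes here is assumed to come from the R-parsing helpers above, and the guard against a failed count follows the review suggestion further down):

// job.state.run: charge the nodes this job will use to the association
if (nnodes >= 0)
    b->cur_nodes += nnodes;

// job.state.inactive: give those nodes back once the job completes
if (nnodes >= 0)
    b->cur_nodes -= nnodes;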
Problem: flux-accounting has no tests for keeping track of the number of
nodes an association is using across all of their running jobs.

Add some tests that track the number of nodes an association is using
while they submit jobs.
Problem: The priority plugin has no way of estimating the number of
nodes a job is looking to use from its jobspec.

Add a helper function to count the number of nodes a job is requesting.
If a job is not requesting nodes explicitly, just return 1.
Problem: The plugin does not enforce a max-nodes limit on an
association's set of running jobs.

Add a check in job.state.depend to look at the current and max nodes
count for the association's set of running jobs. If the job currently in
job.state.depend would put the association over their max nodes limit,
add a dependency to hold the job until a currently running one finishes
running.

Since the callback for job.state.depend already checks and adds a job
dependency for an association's max running jobs limit, combine the two
limits (max running jobs and max nodes) so that either one adds the same
named dependency.

In the callback for job.state.inactive, rework the check for releasing a
held job to also check for the association's node usage before releasing
a job. The logic for removing an accounting dependency for a job now
looks like this:

- If an association has at least one held job due to a flux-accounting
dependency, grab the first held job.
- If the association is under their max running jobs limit, continue on
to grab the jobspec of the first held job using
flux_jobtap_job_lookup (). Extract the node count from the held job.
- If releasing this job would keep the association under their max
nodes limit, remove the dependency from the job.

If either limit cannot be satisfied, the dependency on any held job is
not removed and those jobs will continue to be held until another one of
the association's set of running jobs finishes running.
Problem: Some of the test payloads in the testsuite don't have a
"max_nodes" key-value pair, which is needed by the plugin in order to
enforce limits.

Add "max_nodes" key-value pairs to the test payloads sent to the
priority plugin.
cmoussa1 added the "new feature" and "plugin" (related to the multi-factor priority plugin) labels on Apr 18, 2024
Problem: t1034-mf-priority-max-nodes.t does not have any tests
simulating holding and releasing an association's set of jobs due to a
max nodes limit.

Add some tests.
grondo (Contributor) left a comment

I've made a first pass here.

I think the main thing we have to discuss and consider is expanding the "max nodes" concept to "max size" including cores+nodes+gpus.

Another thing that came to mind while I was looking through this:

  • I wonder if we have to add more helpers to the jobtap interface (e.g. something like flux_jobtap_get_jobspec_size() and flux_jobtap_resource_size()) so that 3rd party plugins don't have to parse jobspec and R and calculate these values individually themselves (and potentially get it wrong).

Comment on lines +182 to +184
// assume nodelist contains a single string entry
json_t *nodelist_str = json_array_get (nodelist, 0);
if (!nodelist_str || !json_is_string (nodelist_str)) return "";

I'm not sure this will always be the case. Instead the code should loop through all entries in the nodelist array and append them to a hostlist object with hostlist_append(). It may be easier, therefore, to just have a function that returns the nodelist count for an R object, e.g.

int R_nodelist_count (json_t *R)
{
    // struct hostlist *hl = hostlist_create ();
    // for each entry in R.execution.nodelist:
    //     hostlist_append (hl, entry);
    // count = hostlist_count (hl);
    // hostlist_destroy (hl);
    // return count;
}
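For reference, a filled-in version of that sketch might look like the following, assuming jansson plus flux-core's hostlist API (flux/hostlist.h):

#include <jansson.h>
#include <flux/hostlist.h>

// Count the total number of nodes in an R object by appending every entry
// of R.execution.nodelist to a hostlist. Returns -1 on error.
int R_nodelist_count (json_t *R)
{
    json_t *nodelist;
    if (json_unpack (R, "{s:{s:o}}", "execution", "nodelist", &nodelist) < 0
        || !json_is_array (nodelist))
        return -1;

    struct hostlist *hl = hostlist_create ();
    if (!hl)
        return -1;

    size_t index;
    json_t *entry;
    json_array_foreach (nodelist, index, entry) {
        if (!json_is_string (entry)
            || hostlist_append (hl, json_string_value (entry)) < 0) {
            hostlist_destroy (hl);
            return -1;
        }
    }
    int count = hostlist_count (hl);
    hostlist_destroy (hl);
    return count;
}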

@@ -1051,12 +1051,14 @@ static int inactive_cb (flux_plugin_t *p,
Association *b;
std::map<int, std::map<std::string, Association>>::iterator it;
std::map<std::string, Association>::iterator bank_it;
json_t *R = NULL;

It's kind of strange to me to have the increment and decrement of cur_nodes in separate commits. They should probably be combined so that the commit makes one useful change, maybe titled "plugin: support accounting of association cur_nodes" or similar.

Comment on lines +1095 to +1100
int nnodes = process_nodelist (nodelist);
if (nnodes < 0)
    flux_jobtap_raise_exception (p, FLUX_JOBTAP_CURRENT_JOB,
                                 "mf_priority", 0,
                                 "job.state.inactive: error "
                                 "counting nnodes for job");

Since the job is inactive, raising an exception probably won't do anything here.
(Also, the decrement of b->cur_nodes should be skipped if nnodes < 0 to avoid instead incrementing it. The same goes for where b->cur_nodes is incremented in the previous commit)

Actually I wonder if there's a good way for the plugin to save the nnodes for a job in the job aux item list so it doesn't even have to be recalculated here. Let me think on that.
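For illustration only, caching the count with the jobtap job aux API (flux_jobtap_job_aux_set()/flux_jobtap_job_aux_get()) could look roughly like the following; the aux key name is made up, and encoding the count in the pointer value is just one way to avoid an allocation (it cannot distinguish a stored 0 from "not set"):

#include <stdint.h>

// Where nnodes is first computed (e.g. job.state.run), stash it on the job;
// on error, fall back to recounting from R later.
(void) flux_jobtap_job_aux_set (p,
                                FLUX_JOBTAP_CURRENT_JOB,
                                "mf_priority::nnodes",
                                (void *)(intptr_t) nnodes,
                                NULL);

// In job.state.inactive, retrieve it instead of re-parsing R:
int nnodes = (int)(intptr_t) flux_jobtap_job_aux_get (p,
                                                      FLUX_JOBTAP_CURRENT_JOB,
                                                      "mf_priority::nnodes");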

test_under_flux 5 job -o,--config-path=$(pwd)/conf.d

flux setattr log-stderr-level 1


Note: Due to changes in flux-core you'll now have to enable guest access to the testexec implementation with

flux config load <<-EOF
[exec.testexec]
allow-guests = true
EOF

Comment on lines +209 to +226
int extract_resources (json_t *root) {
if (!root || !json_is_array (root)) return -1;

json_t *resources = json_array_get (root, 0);
if (!resources || !json_is_object (resources)) return -1;

json_t *type = json_object_get (resources, "type");
json_t *count = json_object_get (resources, "count");
if (!type || !json_is_string (type)) return -1;

// if the association did not explicitly request nodes for their job,
// just return 1
if (strcmp (json_string_value (type), "node") != 0) return 1;

if (!count || !json_is_integer (count)) return -1;
int nnodes = json_integer_value (count);
return nnodes;
}

Let's consider how to do this more reliably. Probably returning 1 node if a user doesn't explicitly request nodes is not going to work in production because users could bypass all node limits by always requesting only cores: e.g. flux alloc -n64000

Instead we may have to do something more like the flux-core limit-job-size plugin, where there is a limit on the total number of cores+nodes+gpus and if a job exceeds any one of those it is deferred.
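Sketching that idea (purely illustrative; the counts struct and limit fields below are made up), the per-resource comparison could look something like:

// Purely illustrative: defer the job if any one resource request would push
// the association past the corresponding limit.
struct resource_counts {
    int nnodes;
    int ncores;
    int ngpus;
};

static bool exceeds_size_limits (const struct resource_counts *req,
                                 const struct resource_counts *cur,
                                 const struct resource_counts *max)
{
    return (max->nnodes > 0 && cur->nnodes + req->nnodes > max->nnodes)
        || (max->ncores > 0 && cur->ncores + req->ncores > max->ncores)
        || (max->ngpus > 0 && cur->ngpus + req->ngpus > max->ngpus);
}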
