plugin: add node count tracking for an association's running jobs #442

Closed
wants to merge 5 commits into from

Conversation

cmoussa1
Member

@cmoussa1 cmoussa1 commented Apr 9, 2024

Background

The priority plugin has no way to track the number of nodes an association is using across all of their running jobs.


This PR looks to add basic node count tracking to the priority plugin, specifically in the callbacks for job.state.run and job.state.inactive.

Two new helper functions are added to the plugin to do this: extract_nodelist () and process_nodelist (). extract_nodelist () looks for the "nodelist" key-value pair in a job's R, and process_nodelist () takes that string and uses flux-core's hostlist library to count the number of nodes. In the job.state.run callback, the association's cur_nodes count is incremented by the number of nodes returned by process_nodelist (). The job.state.inactive callback follows a similar process, decrementing the association's cur_nodes count by the number of nodes returned by process_nodelist ().
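To make the counting step concrete, here is a minimal, self-contained sketch of what counting nodes from a "nodelist" string involves. It is not the code in this PR: the PR relies on flux-core's hostlist library, whereas count_nodes () and its helpers below are hypothetical stand-ins that only understand a simplified form such as "node[0-3],node7" (one bracket group per entry, decimal indices).

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical helper: parse a non-negative decimal index, rejecting anything else.
static bool parse_index (const std::string &s, unsigned long &out)
{
    if (s.empty () || s.find_first_not_of ("0123456789") != std::string::npos)
        return false;
    try {
        out = std::stoul (s);
    } catch (const std::exception &) {
        return false;  // e.g. value out of range
    }
    return true;
}

// Count the indices named inside one bracket group, e.g. "0-3,7" -> 5.
// Returns -1 on malformed input.
static long count_range_items (const std::string &ranges)
{
    long total = 0;
    std::size_t pos = 0;
    while (pos <= ranges.size ()) {
        std::size_t comma = ranges.find (',', pos);
        if (comma == std::string::npos)
            comma = ranges.size ();
        std::string item = ranges.substr (pos, comma - pos);
        std::size_t dash = item.find ('-');
        unsigned long lo = 0, hi = 0;
        if (dash == std::string::npos) {
            if (!parse_index (item, lo))
                return -1;
            hi = lo;
        } else if (!parse_index (item.substr (0, dash), lo)
                   || !parse_index (item.substr (dash + 1), hi)
                   || hi < lo) {
            return -1;
        }
        total += static_cast<long> (hi - lo + 1);
        pos = comma + 1;
    }
    return total;
}

// Simplified stand-in for process_nodelist (): count the hosts in a
// nodelist string. Returns 0 for an empty string, -1 on malformed input.
long count_nodes (const std::string &nodelist)
{
    if (nodelist.empty ())
        return 0;
    long total = 0;
    std::size_t pos = 0;
    while (pos <= nodelist.size ()) {
        // Find the end of this entry: the next comma that is outside brackets.
        std::size_t end = pos;
        bool in_brackets = false;
        while (end < nodelist.size ()
               && (in_brackets || nodelist[end] != ',')) {
            if (nodelist[end] == '[')
                in_brackets = true;
            else if (nodelist[end] == ']')
                in_brackets = false;
            ++end;
        }
        std::string entry = nodelist.substr (pos, end - pos);
        if (entry.empty () || in_brackets)
            return -1;  // empty entry or unterminated bracket group
        std::size_t open = entry.find ('[');
        if (open == std::string::npos) {
            total += 1;  // a single plain hostname
        } else {
            std::size_t close = entry.find (']', open);
            if (close == std::string::npos)
                return -1;
            assert (close > open);
            long n = count_range_items (entry.substr (open + 1, close - open - 1));
            if (n < 0)
                return -1;
            total += n;
        }
        pos = end + 1;
    }
    return total;
}
```

With this sketch, count_nodes ("node[0-3],node7") yields 5 and count_nodes ("node1") yields 1. A real hostlist parser handles more forms than this one does, which is one reason the plugin delegates to flux-core's library rather than parsing the string itself.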

A basic set of tests is also added to simulate submitting a couple of different-sized jobs and to ensure that the priority plugin accurately keeps track of their node counts as they are submitted and finish running.
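The bookkeeping those tests exercise can be pictured with the rough sketch below. Only the cur_nodes member name comes from this PR; the Association struct here is a stripped-down stand-in, job_run () and job_inactive () are hypothetical names for the work done in the job.state.run and job.state.inactive callbacks, and clamping at zero is an assumption of the sketch rather than documented plugin behavior.

```cpp
#include <cassert>

// Stripped-down stand-in for the plugin's Association class.
struct Association {
    int cur_nodes = 0;  // nodes in use across all of the association's running jobs
};

// Hypothetical equivalent of the job.state.run step: the job is about to
// run on nnodes nodes, so add them to the association's running total.
void job_run (Association &a, int nnodes)
{
    assert (nnodes >= 0);
    a.cur_nodes += nnodes;
}

// Hypothetical equivalent of the job.state.inactive step: the job has
// finished, so give its nodes back. Clamping at zero is an assumption
// made here so the count can never go negative.
void job_inactive (Association &a, int nnodes)
{
    assert (nnodes >= 0);
    a.cur_nodes -= nnodes;
    if (a.cur_nodes < 0)
        a.cur_nodes = 0;
}
```

For example, starting a 1-node job and then a 2-node job brings cur_nodes to 3; when the 1-node job becomes inactive the count drops to 2, and it returns to 0 once the 2-node job finishes.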

Note that this PR does not look to add actual enforcement of an association's max_nodes limit, but rather just looks to track the cur_nodes number for an association. Once this looks good, I can create a follow-up PR that tries to add the actual enforcement of a node count limit.

cmoussa1 added 5 commits April 9, 2024 08:52
Problem: The Association class has no member to track an association's
current node count across all of their running jobs.

Add a new member called "cur_nodes" which represents an association's
current node count across all of their running jobs.

Add a default value in the case where we are creating a special
temporary association while the plugin waits for flux-accounting data
to be loaded in.

Add cur_nodes values to the accounting unit tests as well as to the
expected output of the plugin.query tests.
Problem: The priority plugin needs a way to count the number of nodes an
association is using for their job.

Add two new helper functions to the priority plugin.

extract_nodelist () will be responsible for extracting the "nodelist"
key-value pair out of a job's R.

process_nodelist () is responsible for counting the number of nodes from
this nodelist by using flux-core's hostlist library.

Include flux-core's hostlist library in the plugin and in the plugin's
Makefile.
Problem: The priority plugin does not increment the cur_nodes count when
an association's job is about to run.

Add an increment of an association's cur_nodes count by the number of
nodes their job is going to use in the callback for job.state.run.
Problem: The priority plugin does not decrement the cur_nodes count when
an association's job is finished running.

Add a decrement of an association's cur_nodes count by the number of
nodes their job used in the callback for job.state.inactive.
Problem: flux-accounting has no tests for keeping track of the number of
nodes an association is using across all of their running jobs.

Add some tests that track the number of nodes an association is using
while they submit jobs.
@cmoussa1 cmoussa1 added the "new feature" and "plugin" (related to the multi-factor priority plugin) labels Apr 9, 2024
@cmoussa1 cmoussa1 requested a review from grondo April 9, 2024 19:55
@cmoussa1 cmoussa1 closed this Aug 26, 2024