-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
plugin: add "age" as a factor to priority calculation for jobs #455
base: master
Are you sure you want to change the base?
Conversation
I haven't looked yet, so I apologize if this is obvious, but what happens if a user submits a job, it goes into SCHED state and gets a priority adjustment due to age, and then a user holds the job by setting an urgency of 0? When the job is released again with a future urgency update, is the job's age calculated from the timestamp it originally entered the SCHED state, or is it the most recent urgency event? |
Ah, I believe the job's age is calculated from the timestamp it originally entered the SCHED state, not the most recent urgency event, so the scenario you proposed could result in someone boosting their own job's priority. Perhaps a watcher for an urgency event needs to be added to check for this? Do you happen to know if this watcher would follow a similar format to the other |
Actually, any I wonder if this means there's nothing more to do here and your current approach would work as is? That is, the age is updated with the most recent transition to SCHED which is what we want. That works out nicely if so. |
If this is the case, I would add a comment to the age calculation with the details above because it probably isn't clear at first look. |
Ah, I believe you are right. The age is in fact updated with the most recent transition to SCHED. If I simulate the scenario you proposed above, the eventlog for the job looks like this:
And the various priority calculations (in order) in the eventlog are as follows:
I can add a comment with your description above. Thanks! |
1945b12
to
9b6e4d7
Compare
9b6e4d7
to
9f50970
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a quick pass I had some questions, so thought I'd stop here for now.
src/plugins/mf_priority.cpp
Outdated
* age: a factor that considers the time that a job was released by the | ||
* scheduler to be run but could not due to another constraint (e.g a | ||
* resource constraint). Any urgency, priority, or jobspec-update event |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This statement is unclear to me, but I may just be reading it wrong. What do you mean by "released by the scheduler to run"?
Would it be clearer to just say: "The age factor considers the most recent timestamp that a job entered the SCHED state and therefore represents the amount of time the job has been eligible for scheduling." Or is that statement incorrect (I haven't fully reviewed this PR yet, so apologies if so.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, maybe you just mean "released to the scheduler"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes I mean to the scheduler here, not by. I can fix this - thanks!
src/plugins/mf_priority.cpp
Outdated
t_released = std::chrono::duration_cast <std::chrono::duration<double>> ( | ||
std::chrono::system_clock::now ().time_since_epoch ()).count (); | ||
// store job id, t_released in map | ||
released_jobs[FLUX_JOBTAP_CURRENT_JOB] = t_released; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FLUX_JOBTAP_CURRENT_JOB
is a constant, is that really what you want to use here? I could be misunderstanding use of the map, but I think with multiple jobs you'd end up overwriting each job's t_released
?
Seems like you could keep a map of released jobs by jobid, then you wouldn't need to set the value in the aux item list of the job, or you can allocate space for t_released
and just store it in the job's aux item list without the need for the released_jobs
global?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, it's my own misunderstanding of FLUX_JOBTAP_CURRENT_JOB
- I thought that I could use that in place of unpacking the jobid.
And you are absolutely right about not setting anything in the aux item list of the job if I use a map. Originally, I had tried allocating space for t_released
instead of using a map but wasn't able to figure it out - probably because I was doing something wrong and don't understand memory allocation fully.
Can I allocate space for a double*
in one callback and clean it up in another? In this case, allocate in job.state.sched
and clean it up in job.state.inactive
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be an easier way using a C++ smart pointer, but in C I'd do something like:
// Allocate space for a double, e.g. in `job.state.sched`
double *t_released = malloc (sizeof (double));
if (!t_released)
// handle enomem error
*t_released = value;
// Place value in job_aux hash and schedule destruction with job
// (be sure to check return value in actual code)
flux_jobtap_job_aux_set (p,
FLUX_JOBTAP_CURRENT_JOB,
"mf_priority::t_released",
t_released,
free);
I think the above will work, but untested so there may be typos, etc.
Then you'd access the value the way you're currently doing it via flux_jobtap_job_aux_get(3)
. The value will automatically be freed when the job is destroyed, so you don't have to manually do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried playing around with this implementation for a bit and was just never able to get it working. For now, I've pushed up a fix to just store/retrieve job IDs and their associated timestamps with a map.
I think the nice thing with this method is the ability to return the information stored in this map with the plugin.query
callback. I'm definitely open to more discussion, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: wait, I'm sorry, I think I might have just figured it out. 😆 I don't think I was correctly managing memory with my C++ pointers, but if I declare a C pointer (like you suggested), I think this actually becomes pretty straightforward. I can push up an additional commit that uses this approach of aux_set
for comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if @trws would have a suggestion for using smart pointers stashed in a the job aux item hash, but o/w using a C pointer works fine too.
However, I liked your idea of exposing the age (and maybe other factors too) in the plugin stats returned from flux jobtap query
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! In that case, I'll leave my proposal as is. 👍 But it is good to know that we have the aux_set
/aux_get
implementation for the future!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable. FWIW, using a smart pointer in the aux array would kinda defeat the purpose, I could make something for us that makes it easy to capture smart pointers into something managed by the aux array, so things like lambdas or classes with complex state would be easier to manage, but doing it directly wouldn't really buy much unless you had a smart pointer already for some reason.
src/plugins/mf_priority.cpp
Outdated
"mf_priority:t_released")); | ||
if (timestamp) { | ||
// remove the entry from the map | ||
auto it = released_jobs.find (FLUX_JOBTAP_CURRENT_JOB); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, it may not have been clear but FLUX_JOBTAP_CURRENT_JOB
is a constant. To get the actual jobid you'll have to unpack it from flux_plugin_arg_t *args
9f50970
to
147085a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've made another pass through this PR with some comments and questions.
While I was looking at this, another question occurred to me.
It seems like once age is considered as a factor, all jobs have to be reprioritized periodically so that the priority assigned due to age is fair. I don't think this occurs by default anywhere now?
For an example, one job is submitted and immediately proceeds to SCHED state and gets a small priority bump due to its age. Later another job is submitted and the same occurs. Its age is less than the first job, but they will have similar priorities because the priority of the first job has not been recomputed in the intervening time.
This means any time one job has a priority adjustement, all other jobs priorities should be recalculated? This may create quite large eventlogs for jobs on systems where jobs stay pending for awhile...
src/plugins/mf_priority.cpp
Outdated
age_weight = priority_weights["age"]; | ||
|
||
// check values of priority factor weights; if not configured, | ||
// these will be set to -1, so just use default weights | ||
if (fshare_weight == -1) fshare_weight = 100000; | ||
if (queue_weight == -1) queue_weight = 10000; | ||
if (age_weight == -1) age_weight = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleanup suggestion: place defaults as defines at top of the file and initialize each `priority_weights[factor] = DEFAULT`` so that you don't have to repeat these checks for every job.
src/plugins/mf_priority.cpp
Outdated
auto it = released_jobs.find (jobid); | ||
if (it != released_jobs.end ()) { | ||
// job has an "age" associated with it; fetch the timestamp | ||
t_released = it->second; | ||
// get current time | ||
t_current = std::chrono::duration_cast <std::chrono::duration<double>> | ||
(std::chrono::system_clock::now ().time_since_epoch ()).count (); | ||
// calculate age of job | ||
age_factor = t_current - t_released; | ||
} else { | ||
// job does not have an age; don't factor in the age in | ||
// the priority calculation | ||
age_factor = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest refactoring this find into a function, something like double released_job_age (jobid)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally chrono::duration<double>
as the return there rather than a regular double unless you have a reason you need it as a double in all cases. I'll comment a couple other places, but try to avoid using .count()
unless you're interfacing with C code or really, really need a bare number. Chrono prevents a lot of mistakes when you use its types, and it's a bit of a shame to immediately convert that away.
src/plugins/mf_priority.cpp
Outdated
* When a job gets to the job.state.sched state, pack a timestamp to be | ||
* associated with the job in the case that it is eligible to be run but | ||
* cannot (e.g due to resource constraints). The timestamp will be used | ||
* to calculate the "age" of a job and increase its priority according to | ||
* the time it has been waiting to run. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: "pack a timestamp" -> "generate a timestamp"
src/plugins/mf_priority.cpp
Outdated
auto it = released_jobs.find (jobid); | ||
if (it != released_jobs.end ()) | ||
// job had an "age" associated with it; remove it from the map that | ||
// stores its ID | ||
released_jobs.erase (it); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like std::map erarse() accepts a key, so could this be replaced with released_jobs.erase (jobid)
?
src/plugins/mf_priority.cpp
Outdated
if (urgency > 0) { | ||
// the job is not being held, and therefore we need to get the current | ||
// time to keep track of how long the job has been released and waiting | ||
t_released = std::chrono::duration_cast <std::chrono::duration<double>> | ||
(std::chrono::system_clock::now ().time_since_epoch ()).count (); | ||
// store job id, t_released in map | ||
released_jobs[jobid] = t_released; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: If urgency is 0, should released_jobs.erase (jobid)
be called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like there may be times when the t_released
for a job should not be updated in SCHED state. E.g. if a priority adjustment is made and the job was not previously held, shouldn't the age stay the same?
test_expect_success 'disable age factor in multi-factor priority plugin' ' | ||
cat >config/test.toml <<-EOT && | ||
[accounting.factor-weights] | ||
age = 0 | ||
EOT | ||
flux config reload | ||
' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as an FYI for next time, I think this could be a one liner:
echo accounting.factor-weights.age=0 | flux config load
Thanks @grondo for your review and help so far. I could definitely be misunderstanding the intention behind implementing it, but my thoughts on how "age" should be factored into the calculation were that it was only considered a factor if the job was still in SCHED state by the time all jobs were called to reprioritized. If it is in SCHED state and waiting to be run before a call was made to reprioritize all jobs, its age is not considered in the calculation of its priority if it later proceeds to RUN (even just before a reprioritization of all jobs). In the example you laid out and with my proposal, I think the first job submitted that immediately proceeds to SCHED state would not get a priority bump due to its age unless it never proceeded to RUN before a reprioritization was called. If a second job is submitted and the same occurs, wouldn't the first job still have a significantly higher priority after all jobs were reprioritized because it was waiting for a longer period of time? Sorry if I am misunderstanding your example! In any case, you bring up a good point about if one job needs a priority adjustment, all other jobs need their priority to be recalculated. Maybe the plugin needs to implement a |
src/plugins/mf_priority.cpp
Outdated
flux_plugin_arg_t *args, | ||
void *data) | ||
{ | ||
double t_released = 0.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example here, this should be a chrono::time_point
of whatever type you're using.
src/plugins/mf_priority.cpp
Outdated
@@ -43,6 +44,7 @@ std::map<std::string, Queue> queues; | |||
std::map<int, std::string> users_def_bank; | |||
std::vector<std::string> projects; | |||
std::map<std::string, int> priority_weights; | |||
std::map<flux_jobid_t, double> released_jobs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a map of jobid-> time_point also I believe.
src/plugins/mf_priority.cpp
Outdated
// job has an "age" associated with it; fetch the timestamp | ||
t_released = it->second; | ||
// get current time | ||
t_current = std::chrono::duration_cast <std::chrono::duration<double>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then if those are time_point, you can directly subtract the system_clock::now result instead of doing this duration_cast at all. If you need the double in age_factor then use the cast on the result of the subtraction.
Yes, that sounds correct. I wasn't really considering when all jobs are reprioritized because I didn't really see where that is happening in the context of this PR. Does it occur periodically as information is pulled from the accounting DB and pushed into the priority plugin?
Yeah, I think this was the main point I was trying to make. Implementing this could be tricky because you don't want to reprioritize jobs recursively (i.e. you have to tell the difference between a job being reprioritized for other reasons, vs one being reprioritized because all jobs are being reprioritized due to the first job's prioritization) |
Yes,
I'm trying to think of how reprioritizing a job due to its age might make sense in the case where it should be the only job that should have its priority recalculated - my thought was that this wouldn't be necessary. I guess I figured that something like an age factor would/should only be considered when every pending job's priority was being reprioritized, not just a single one. Does that make sense? Sorry if I am misunderstanding you. |
I think we're saying the same thing. My question is about the implementation in the current PR. I may be missing a subtlety, but with the following assumptions:
Therefore, any job with an urgency update will have its priority recalculated using |
I think I see what you're saying - sorry for it taking me until now to understand. 🤦 Any urgency/jobspec update will send a job in SCHED state back to PRIORITY, and if they already have an age, their priority will get bumped up. It sounds to me like, then, the plugin needs to handle additional cases where a job's:
Does that sound right? |
Yeah, I guess if a job's age is reset any time the job is modified then that is fair and wouldn't require recalculating the priority for all other jobs at the time of that job's priority update, since it would by definition have the smallest age. Should this rule only apply if the user themselves requested the job update? E.g. an urgency >16 set by the instance owner should perhaps not reset the age? Another question for which I'm not sure I have the answer. |
Problem: The priority plugin does not unpack the priority weight associated with an "age" factor for a job's priority calculation. Add the "age" factor's associated integer weight to the plugin's internal map that stores the weights of all of the priority factors by unpacking it in the callback for conf.update.
Hopefully I'm actually understanding the questions(s) here. I think that the ideal age factor would be based on the total time that the job is eligible to be run (i.e. urgency!=0, not in any sort of depend state). That said, considering the age to be the time since the urgency was set to >0 seems like a perfectly reasonable approximation of this. We shouldn't reset the age factor when a job goes from a 'default' urgency (16) to an 'expedite' urgency though. |
147085a
to
811dc2e
Compare
Problem: The priority plugin does not factor in the age of a job in its priority calculation. Add a new callback to the plugin for job.state.sched and store the timestamp of a job in SCHED state in a map where the key is the job ID and the value is the timestamp. When the job enters job.priority.get, the timestamp will be used with the time at that moment to calculate the "age" of the job, where the older the age, the more of a priority bump the job gets. If a job currently accumulating age has its urgency changed to 0, the age of the job will be saved and no longer grow until the urgency of the job is changed back to a value > 0. If a job currently accumulating age has one of its attributes updated, such as the queue or the bank it is running under, any previously-accumulated age will be cleared and reset for the job. When the job enters job.state.inactive, the key-value pair for the job ID and the timestamp of it entering state SCHED is removed.
Problem: A number of tests in the testsuite look for a specific priority value that becomes very difficult to test with the addition of the "age" factor. In the tests that look for a specific priority value, disable the "age" factor in the priority calculation of a job.
Problem: flux-accounting has no tests for the implementation of the "age" factor in the priority plugin. Add some tests.
Problem: The flux-config-accounting manpage does not describe the "age" factor in the priority plugin even though it is a factor in the calculation of a job's priority. Add "age" to the description of the priority factors in the flux-config-accounting manpage.
Problem: The accounting guide does not list "queue" or "age" as factors considered by the plugin when calculating a job's priority. Add both of these factors to the listed description of factors used when calculating a job's priority in the priority plugin.
811dc2e
to
37cad48
Compare
OK, I've force-pushed up some changes (notably to the More specifically, I've added logic to pause the accumulation of age for a job when it is held with an urgency update of If a job's attributes are updated (for example, if they update the bank or queue they want their job to run under), the age for the job is now cleared and reset. I've added a short set of tests to One thing to note with this most recent iteration I've pushed up: updating a job from one urgency value > 0 to another value <= 16 will not reset the age it accumulated before the urgency update since we are considering the age to be based on the total time that the job is eligible to be run (i.e. urgency != 0, not in any sort of DEPEND state). I'm wondering here if the plugin needs to completely reset the age of the job in this case, or if it is okay to leave as is. @ryanday36 do you have any thoughts or opinions on this? |
Problem
The priority plugin does not factor in the age of a job in its priority calculation. For the sake of flux-accounting's priority plugin, age will be considered the time that a job was released by the scheduler to be run but could not due to another constraint (e.g a resource constraint).
This PR looks to add age as a factor when calculating a job's priority. It adds a new callback to the plugin for
job.state.sched
where it will store the current timestamp in a map along with the job ID if the job does not have anurgency
of 0. If the job is still active when jobs are reprioritized andjob.priority.get
is called, this timestamp will be subtracted from the current time to calculate an age, boosting the job's integer priority. When the job finishes running and entersjob.state.inactive
, the job ID/timestamp key-value pair is removed from the map.A set of tests are added to the testsuite that look to simulate the following scenarios:
--urgency=0
, preventing their job from ever running. The urgency of the job is updated, but the priority of the job should not increase since the job was purposely held from running.--urgency=0
. The urgency of the job is updated while both jobs are active, so it cannot run. Jobs are reprioritized and the job that could not run is expected to have its priority increase as a result.Leaving this as [WIP] to get feedback on the implementation to see if this is a viable approach.
In terms of testing, you'll also notice that I've inserted various
sleep 1
orsleep 1.5
throughout the test file in order to simulate a job having to wait at least a second so that it's age factor will actually contribute to the calculation of a job's priority. I know we like to try and stray away from puttingsleep
commands in our test file, so if there is a better approach here, I'm all ears. :-)