use task_struct ptr instead of pid in rusty #672
27c717b to f4d3ef3
Have rusty use task_id instead of pid. task_id is the task_struct pointer stored in a u64. This should fix the issue with missing pids because the task pointer does not change.
f4d3ef3 to d8a1a89
Split per-scheduler CI job: https://github.com/likewhatevs/scx/actions/runs/11001710685/job/30547024955 (lavd is failing in the non-split one atm).
I don't think |
You are right wrt/ tgid and pid and the multi-thread scenario. I was looking for a non-changing ID I could use to track a task (which is the concept behind this PR, track a non-changing ID to resolve "task not found" errors).
The stress tests for rusty pass with this change, and they are configured to reproduce this error (they fork, which triggers the issue). I kinda didn't want to use a pointer to the task struct because that didn't sound as "nice" a solution as using some kind of ID (i.e., ps can print TGID; AFAIK it can't print the pointer to a task struct), but per the guidance on #610, I tried using the task struct ptr, and it seems to work.
@@ -177,11 +178,18 @@ static struct dom_ctx *lookup_dom_ctx(u32 dom_id)
 	return domc;
 }

+static int get_task_from_id(u64 task_id, struct task_struct *p_in){
+	struct task_struct *p;
+	p_in = (struct task_struct*) task_id;
Does this work? I'm not sure BPF would allow casting from u64 to a pointer like this. One way to obtain the trusted `task_struct` from the ID would be to do a `probe_read` on `&((struct task_struct *)task_id)->pid` and then call `bpf_task_from_pid()` on the PID that was read.
Also, wouldn't a more conventional name for the out parameter be `p_out` instead of `p_in`?
 	/*
 	 * XXX - We want BPF_NOEXIST but bpf_map_delete_elem() in .disable() may
 	 * fail spuriously due to BPF recursion protection triggering
 	 * unnecessarily.
 	 */
-	ret = bpf_map_update_elem(&task_data, &pid, &taskc, 0 /*BPF_NOEXIST*/);
+	ret = bpf_map_update_elem(&task_data, &task_id, &taskc, 0 /*BPF_NOEXIST*/);
Was BPF okay with this? It sometimes complains about converting pointers to scalar.
 const u64 ravg_1 = 1 << RAVG_FRAC_BITS;

-/* Map pid -> task_ctx */
+/* Map task_id -> task_ctx */
I wonder whether a shorter name would make things a bit easier - e.g. something like `taddr` or `tptr`.
I think the reason this works is that we never ask BPF to deref the converted pointer, so the value is just being treated as a scalar all along. If that's the case, it probably makes more sense to drop the conversion function and just treat it as a u64 value everywhere.
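To illustrate the "just treat it as a u64 everywhere" idea, here is a minimal userspace sketch, not the rusty BPF code. It assumes a hypothetical fixed-size table standing in for the `task_data` map; `task_key`, `task_ctx_update`, `task_ctx_lookup`, `task_ctx_delete`, `MAP_SLOTS`, and the `dom_id` field are made-up names for illustration. The point it tries to show is that the pointer is converted to an integer once and is never cast back or dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 64 /* hypothetical capacity for this sketch */

/* Stand-in for per-task scheduler state (hypothetical field). */
struct task_ctx {
	uint32_t dom_id;
};

struct slot {
	bool used;
	uint64_t key;
	struct task_ctx val;
};

static struct slot task_data[MAP_SLOTS];

/* Convert a task pointer to an opaque key. It is never cast back. */
uint64_t task_key(const void *task)
{
	return (uint64_t)(uintptr_t)task;
}

/* Linear scan keeps the sketch short; a real map would hash the key. */
static struct slot *find_slot(uint64_t key)
{
	for (size_t i = 0; i < MAP_SLOTS; i++)
		if (task_data[i].used && task_data[i].key == key)
			return &task_data[i];
	return NULL;
}

/* Insert or overwrite. Returns 0 on success, -1 if the table is full. */
int task_ctx_update(uint64_t key, struct task_ctx val)
{
	struct slot *s = find_slot(key);

	for (size_t i = 0; !s && i < MAP_SLOTS; i++)
		if (!task_data[i].used)
			s = &task_data[i];
	if (!s)
		return -1;
	s->used = true;
	s->key = key;
	s->val = val;
	return 0;
}

/* Returns NULL when no entry exists for the key. */
struct task_ctx *task_ctx_lookup(uint64_t key)
{
	struct slot *s = find_slot(key);

	return s ? &s->val : NULL;
}

/* Returns 0 on success, -1 if the key was not present. */
int task_ctx_delete(uint64_t key)
{
	struct slot *s = find_slot(key);

	if (!s)
		return -1;
	s->used = false;
	return 0;
}
```

Because every function takes the key as a plain `uint64_t`, there is no place where a helper like `get_task_from_id()` would be needed; whether the BPF verifier accepts the initial pointer-to-scalar conversion is a separate question raised in the review above.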
Opening a new PR with the changes discussed here, thanks!
I think this works wrt/ fixing rusty.

I think the issue w/ rusty being unreliable has multiple causes. I think the first might be #610, which https://stackoverflow.com/a/9306150 also explains a bit. I say might because I could kind of see us wanting to use the kernel PID instead of TGID (but I could also kind of see us wanting to track TGID). In any event, moving to TGID alone didn't fix this issue.

I think the second issue might be that all rusty code needs to gracefully handle situations where PIDs (or now, TGIDs) disappear while it is working. I came to this conclusion after adding check_task_exist and seeing stress tests fail less (i.e. move from always to sometimes), and the placement of that function in quiescent kind of confirms that (i.e. tasks existed, passed that check, then didn't a few lines/fn calls later).

If this all sounds about right, and I'm not missing anything huge (which really wouldn't surprise me), I'll run longer stress tests (up to, idk, overnight) to see if I can find any other places where rusty code needs to be tweaked to better handle tasks disappearing, and then maybe this issue will be gone, idk.

I updated this to track task_id (task_struct * cast to u64) and it seems to work (passes stress test).
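The "passed the check, then disappeared a few calls later" pattern can be sketched in plain C. This is only an illustration under assumptions, not rusty's code: `lookup_task`, `migrate_candidates`, `struct balance_stats`, and the `alive`/`load` fields are hypothetical. The idea is that an up-front existence check cannot be trusted, so each use looks the task up again and treats a miss as a normal outcome to skip and count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task record; `alive` is cleared when the task exits. */
struct task_ctx {
	uint64_t task_id;
	int alive;
	uint32_t load;
};

/* Hypothetical counters for one pass over the candidate list. */
struct balance_stats {
	uint32_t moved_load;
	uint32_t skipped;
};

/* Returns NULL when the task is unknown or has already exited. */
struct task_ctx *lookup_task(struct task_ctx *tbl, size_t n, uint64_t task_id)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].alive && tbl[i].task_id == task_id)
			return &tbl[i];
	return NULL;
}

/*
 * Walk a list of candidate ids. A task may have exited since the list was
 * built, so a failed lookup is counted and skipped instead of being
 * reported as an error.
 */
struct balance_stats migrate_candidates(struct task_ctx *tbl, size_t n,
					const uint64_t *ids, size_t nr_ids)
{
	struct balance_stats st = { 0, 0 };

	for (size_t i = 0; i < nr_ids; i++) {
		struct task_ctx *t = lookup_task(tbl, n, ids[i]);

		if (!t) {
			st.skipped++;
			continue;
		}
		st.moved_load += t->load;
	}
	return st;
}
```

Counting skips rather than failing keeps a single vanished task from aborting the whole pass, and the counter gives a rough signal of how often tasks disappear mid-operation during a stress run.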