Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use task_struct ptr instead of pid in rusty #672

Closed

Conversation

likewhatevs
Copy link
Contributor

@likewhatevs likewhatevs commented Sep 23, 2024

I think this works wrt/ fixing rusty.

I think the issue w/ rusty being unreliable has multiple cases.

I think the first might be #610 , which https://stackoverflow.com/a/9306150 explains a bit also. I say might because like, I could kind of see us wanting to the kernel PID instead of TGID (but I could also kind of see us wanting to track TGID).
In any event, moving to TGID alone didn't fix this issue.

I think the second issue might be that all rusty code needs to like, gracefully handle situations where PIDs (or now, TGIDs) disappear while it is working.

I came to this conclusion after adding check_task_exist and seeing stress tests fail less (i.e. move from always to sometimes) and the placement of that function in quescient kind of confirms that (i.e. tasks existed, passed that check, then didn't a few lines/fn calls later).

If this all sounds about right, and I'm not missing anything huge (which really wouldn't surprise me), I'll run longer stress tests (up to like, idk, overnight) to see if I can find any other places where rusty code needs to be tweaked to better handle tasks disappearing and then maybe this issue will be gone, idk.

I updated this to track task_id (task_struct * casted to u64) and it seems to work (passes stress test).

@likewhatevs likewhatevs marked this pull request as draft September 23, 2024 15:52
@likewhatevs likewhatevs force-pushed the have-rusty-track-tgid branch 3 times, most recently from 27c717b to f4d3ef3 Compare September 23, 2024 19:44
@likewhatevs likewhatevs marked this pull request as ready for review September 23, 2024 19:45
@likewhatevs likewhatevs changed the title use tgid as pid and add a task existence check to rusty use task_struct ptr instead of pid in rusty Sep 23, 2024
have rusty use task_id instead of pid.

task_id is task_ptr in a u64.

this should fix the issue with missing pids because
task_ptr does not change.
@likewhatevs
Copy link
Contributor Author

split per scheduler ci job: https://github.com/likewhatevs/scx/actions/runs/11001710685/job/30547024955 (lavd failing in the non-split one atm).

@vax-r
Copy link
Contributor

vax-r commented Sep 24, 2024

I don't think tgid and pid are the same inside task_struct in case of multi-thread scenario , may I ask what's the main concept behind this PR ? also why can casting task_struct * to u64 work correctly? any reference? thanks

@likewhatevs
Copy link
Contributor Author

I don't think tgid and pid are the same inside task_struct in case of multi-thread scenario , may I ask what's the main concept behind this PR ? also why can casting task_struct * to u64 work correctly? any reference? thanks

You are right wrt/ tgid and pid and the multi-thread scenario. I was looking for a non-changing ID I could use to track a task (which is the concept behind this PR, track a non-changing ID to resolve "task not found" errors).

also why can casting task_struct * to u64 work correctly?

The stress tests for rusty pass with this change, and the stress tests for rusty are configured to reproduce this error (the stress tests are configured to fork which triggers this issue).

I kinda didn't want to use a pointer to task struct because that didn't sound like as "nice" a solution as using some kind of ID (i.e., ps can print TGID, AFAIK it can't print the pointer to a task struct), but per the guidance on #610, I tried using task struct ptr, and it seems to work.

@@ -177,11 +178,18 @@ static struct dom_ctx *lookup_dom_ctx(u32 dom_id)
return domc;
}

static int get_task_from_id(u64 task_id, struct task_struct *p_in){
struct task_struct *p;
p_in = (struct task_struct*) task_id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work? I'm not sure BPF would allow casting from u64 to a pointer like this. One way to obtain the trusted task_struct from the ID would be doing probe_read on the &((struct task_struct *)task_id)->pid and then do bpf_task_from_pid() on the read PID.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, wouldn't a more conventional name for the out parameter be p_out instead of p_in?


/*
* XXX - We want BPF_NOEXIST but bpf_map_delete_elem() in .disable() may
* fail spuriously due to BPF recursion protection triggering
* unnecessarily.
*/
ret = bpf_map_update_elem(&task_data, &pid, &taskc, 0 /*BPF_NOEXIST*/);
ret = bpf_map_update_elem(&task_data, &task_id, &taskc, 0 /*BPF_NOEXIST*/);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was BPF okay with this? It sometimes complains about converting pointers to scalar.


const u64 ravg_1 = 1 << RAVG_FRAC_BITS;

/* Map pid -> task_ctx */
/* Map task_id -> task_ctx */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether a shorter name would make things a bit easier - e.g. something like taddr or tptr.

@htejun
Copy link
Contributor

htejun commented Sep 24, 2024

I think the reason why this works is because we never ask BPF to deref the converted pointer so the value is just being treated as a scalar value all along. If that's the case, it probably makes more sense to drop the conversion function and just treat it as u64 value everywhere.

@likewhatevs likewhatevs marked this pull request as draft September 24, 2024 20:40
@likewhatevs
Copy link
Contributor Author

opening a new PR with the changes discussed here, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants