Use `os.sched_getaffinity` instead of `os.cpu_count` where possible #2160
base: master
@bryant1410 thanks for the PR! Is there any before/after analysis for this change?
What do you mean?
AWESOME!! We should probably also change S3_WORKER_COUNT to something like this instead of using 64 all the time (that may be more debatable, but we have had issues when it brings down the machine). @savingoyal, this change is very nice because cpu_count returns the number of CPUs on the entire box, not just the ones for your container. I just tested this, and the affinity one returns the correct value, which is much more likely what you want. I remember this had come up in the past, but I never stopped to make the (clearly simple) change.
Thanks for fixing this! Looks like in 3.13+ we can use
I am curious to see what the observed change in behavior is after this patch.
Oh. I didn't test it myself in this repo, but in other cases where I made a similar change, I observed that it started considering cgroup- or container-assigned CPUs rather than the system total, which is the desired behavior IMHO.
Yeah, at the moment, exclusive CPU ownership doesn't happen by default on Kubernetes (I am not sure if it's even an option with AWS Batch; maybe @npow knows), so os.sched_getaffinity and os.cpu_count will return the same value, and this should be safe to roll out for now.
When automatically choosing the number of parallel workers, `os.sched_getaffinity` is a better choice than the currently used `os.cpu_count`. The former reports the CPUs assigned to the process rather than the system total. See this Stack Overflow answer for an explanation.

I changed this codebase to first check `os.sched_getaffinity` and otherwise default to `os.cpu_count` (and then default to 1, as the latter could potentially be `None`). As some form of validation, this is something PyTorch uses as well.

In the (rare) case that `os.sched_getaffinity` isn't defined, I make it default to `os.cpu_count`. PyTorch's code behaves differently by using the value 0. I think using 0 doesn't make sense. Still, this shouldn't happen, as I was reading that in Linux you have to assign at least one.
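
For illustration, the fallback order described above could be sketched roughly as follows. This is a minimal sketch, not the code from this PR: the helper name `available_cpu_count` is hypothetical, and the handling of `OSError` is an extra assumption added for robustness.

```python
import os


def available_cpu_count() -> int:
    """Best-effort count of the CPUs this process may run on (hypothetical helper)."""
    # os.sched_getaffinity only exists on some platforms (e.g. Linux),
    # so look it up defensively instead of calling it directly.
    get_affinity = getattr(os, "sched_getaffinity", None)
    if get_affinity is not None:
        try:
            # Pid 0 means "the calling process"; the result is a set of CPU ids.
            count = len(get_affinity(0))
            if count > 0:
                return count
        except OSError:
            # Assumption: if the call fails, fall through to os.cpu_count.
            pass
    # os.cpu_count() may return None, so fall back to 1 in that case.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("system CPUs:", os.cpu_count())
    print("usable CPUs:", available_cpu_count())
```

On a machine where the process is pinned to a subset of CPUs (for example via `taskset`), the second number would be expected to be smaller than the first; otherwise the two should match.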