Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Reduce number of processes spawned by ProcessPoolExecutor in blocked_io.py #426

Open
amitpendharkar opened this issue Aug 21, 2024 · 3 comments

Comments

@amitpendharkar
Copy link

Issue:
ProcessPoolExecutor spawns the processes based on max_workers argument. Total number of logical cpus are currently used as default value for max_workers

executor = ProcessPoolExecutor(max_workers=num_workers, mp_context=context)

This spawns too many processes. Based of the size of the segy file, resource contention and context switching bogs the machine down and eventually the run get's terminated.

Suggested solution:
Make default value of max_workers as either number of cores or limit it to 80% of available logical cpus.(discuss)

@tasansal
Copy link
Collaborator

Hi, thanks for looking into it. The max cpus is a good approximate for normal sized machines. If we are using a large machine with 48+ cores I can see where this could start trashing the machine.

MDIO takes the following environment variable to control number of CPUs used in ingestion: MDIO__IMPORT__CPU_COUNT.
I propose, instead of changing the default, you should set this value based on your machine.

@amitpendharkar
Copy link
Author

We noticed this issue only after we saw some random ingestion failures. Out of two similar files, one fails and other other succeeds. The failed one would sometimes succeed if re-submitted multiple times. This causes loss of time in production.

I believe setting the value to 80% of available logical cpus would take care of the issue. It would also leave some room to other processes (like opentelemetry) to run.

@amitpendharkar
Copy link
Author

I believe, setting MDIO_IMPORT_CPUT_COUNT when pod get's launched makes things really unclear. One would not know why it's only using few resources. Coming up with right value for MDIO_IMPORT_CPUT_COUNT is also difficult.

Also, setting MDIO_IMPORT_CPUT_COUNT should not be the norm. It should be a one off thing.

We can make
default_cpus = cpu_count(logical=False) below and other such places.

default_cpus = cpu_count(logical=True)

FYI: We routinely use 48+ core machines in production

I will update here with some benchmarking results to validate the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants