
DataLoader num_workers > 0 causes CPU memory from parent process to be replicated in all worker processes #165

sethiay commented Mar 11, 2024
Hey Team,

We are using DLIO to simulate the Unet3d workload, e.g.

mpirun -np 8 python3 dlio_benchmark/main.py workload=unet3d ++workload.workflow.generate_data=False ++workload.workflow.train=True ++workload.dataset.num_files_train=500000 ++workload.workflow.checkpoint=False ++workload.dataset.data_folder=/mnt/disks/100KB-50GB ++workload.dataset.record_length=100000 ++workload.reader.batch_size=1500 ++workload.reader.read_threads=12 ++workload.reader.file_shuffle=seed ++workload.reader.sample_shuffle=seed ++workload.train.epochs=5

While the above DLIO command was running, we monitored the memory profile of our machine and found that:

  1. Total memory utilization of user-space processes is around ~40 GiB.
  2. Total memory reported by the OS is up to 370-380 GiB, of which up to 330-340 GiB is taken by shared memory (/dev/shm). We confirmed these numbers using cat /proc/meminfo and df -h (see the sketch below).
  3. Dropping the page cache by running echo 3 | sudo tee /proc/sys/vm/drop_caches also doesn't clear this shared memory, which is expected since tmpfs (/dev/shm) pages are not reclaimed by drop_caches.
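For reference, a minimal Python sketch of the same check (the helper names shmem_kib and dev_shm_used_bytes are made up for illustration; this is not part of the benchmark):

```python
# Read the Shmem figure from /proc/meminfo and the usage of the /dev/shm
# tmpfs mount, which together show how much memory is held as shared
# memory rather than by user-space processes.
import os


def shmem_kib():
    """Return the Shmem value from /proc/meminfo, in KiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Shmem:"):
                return int(line.split()[1])
    return 0


def dev_shm_used_bytes():
    """Approximate bytes used on the /dev/shm tmpfs (like `df -h /dev/shm`)."""
    st = os.statvfs("/dev/shm")
    return (st.f_blocks - st.f_bfree) * st.f_frsize


print(f"Shmem: {shmem_kib() / 1024 / 1024:.1f} GiB")
print(f"/dev/shm used: {dev_shm_used_bytes() / 1024**3:.1f} GiB")
```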

Given the above, this looks like the same issue as pytorch/pytorch#13246 (comment), i.e. it may be necessary to use multiprocessing.Array (or similar shared, refcount-free structures) to avoid the data being duplicated across the different worker processes.
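To illustrate the workaround discussed in that issue, here is a minimal, hypothetical sketch (not DLIO's actual reader code): per-sample metadata such as the file list is stored in one contiguous NumPy array instead of a Python list, so that DataLoader workers forked from the parent do not dirty copy-on-write pages through Python refcount updates. The FileListDataset name and the paths below are made up for illustration.

```python
# Hypothetical sketch of the workaround from pytorch/pytorch#13246:
# keep large per-sample metadata in one contiguous NumPy array rather than
# a Python list, so DataLoader workers (forked from the parent) do not
# dirty copy-on-write pages via Python refcount updates.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class FileListDataset(Dataset):
    """Illustrative dataset whose file index is a fixed-width NumPy array."""

    def __init__(self, file_paths):
        # dtype 'S' array: one contiguous buffer with no per-element refcounts.
        self._paths = np.array([p.encode("utf-8") for p in file_paths])

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, idx):
        path = self._paths[idx].decode("utf-8")
        # Placeholder read: a real reader would parse the sample format here.
        sample = np.fromfile(path, dtype=np.uint8)
        return torch.from_numpy(sample)


if __name__ == "__main__":
    # Paths are hypothetical; the loader is constructed only for illustration.
    paths = [f"/mnt/disks/100KB-50GB/sample_{i}.bin" for i in range(500_000)]
    loader = DataLoader(FileListDataset(paths), batch_size=1500, num_workers=12)
```

The same idea applies to multiprocessing.Array or pre-allocated torch tensors; the key point is avoiding large collections of individual Python objects held by the parent process.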

Could you please look into this?

Thanks,
Ayush
