IO does not appear to be asynchronous with the compute phase #178

crtierney · 2024-04-03T00:48:31Z

Around line 261 in dlio_benchmark/dlio_benchmark/main.py is the main loop that reads batches and simulates the computation time.

    loader = self.framework.get_loader(dataset_type=DatasetType.TRAIN)
    t0 = time()
    for batch in dlp.iter(loader.next()):
        self.stats.batch_loaded(epoch, overall_step, block, t0)

When I run my code with native_dali (I understand this isn't fully supported yet), the first step reports a reasonable processed time, but all subsequent steps report much larger times.

I have been adjusting my .yaml file for the resnet50_h100 case. My current computation_time is 0.1 seconds. If I set computation_time to zero, the actual time reported is ~0.055 seconds, which is the fastest my storage can do for this configuration.

The "loaded" time, which comes from the batch_loaded() function, is the actual IO time for all steps > 1. So the processed time is the IO time plus the computation time: for step 2 in the log below, ~0.055 s of IO plus 0.1 s of compute gives the reported ~0.156 s. It's as if IO is not being done asynchronously.
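To make the expectation concrete, here is a minimal sketch of the difference between a synchronous loop and one that overlaps IO with compute. It is not DLIO or DALI code: `read_batch`, `run_synchronous`, `run_prefetched`, and the timing constants are all hypothetical stand-ins, and `sleep` replaces both the storage read and the simulated compute.

```python
import queue
import threading
from time import perf_counter, sleep

# Hypothetical timings for illustration only (not measured values).
IO_TIME = 0.02       # seconds to "read" one batch
COMPUTE_TIME = 0.04  # seconds of simulated compute per step
STEPS = 5


def read_batch(step):
    """Stand-in for a storage read; sleeps instead of doing real IO."""
    sleep(IO_TIME)
    return step


def run_synchronous():
    """Read, then compute, in one thread: each step costs IO + compute."""
    t0 = perf_counter()
    for step in range(STEPS):
        read_batch(step)
        sleep(COMPUTE_TIME)
    return perf_counter() - t0


def run_prefetched(depth=2):
    """Read in a background thread so IO overlaps with compute."""
    done = object()  # sentinel marking the end of the data
    batches = queue.Queue(maxsize=depth)

    def producer():
        for step in range(STEPS):
            batches.put(read_batch(step))
        batches.put(done)

    t0 = perf_counter()
    threading.Thread(target=producer, daemon=True).start()
    while batches.get() is not done:
        sleep(COMPUTE_TIME)
    return perf_counter() - t0


if __name__ == "__main__":
    print(f"synchronous: {run_synchronous():.3f} s")
    print(f"prefetched:  {run_prefetched():.3f} s")
```

With these assumed timings, the synchronous loop should take roughly STEPS * (IO_TIME + COMPUTE_TIME), while the prefetched loop should be closer to IO_TIME + STEPS * COMPUTE_TIME, because only the first read is exposed. The behavior described above matches the synchronous pattern.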

Until the dali reader is fixed, I can't test that. I want to get this written down in case it also affects the dali reader.

[INFO] 2024-04-02T17:40:01.359388 Rank 0 step 1: loaded 400 samples in 9.298324584960938e-05 s
[INFO] 2024-04-02T17:40:01.479351 Rank 0 step 1 processed 400 samples in 0.12005329132080078 s

[INFO] 2024-04-02T17:40:01.534976 Rank 0 step 2: loaded 400 samples in 0.05459904670715332 s
[INFO] 2024-04-02T17:40:01.636629 Rank 0 step 2 processed 400 samples in 0.15625238418579102 s
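
The gap between "processed" and "loaded" can be checked directly from these lines. The following is a small sketch that parses the four log lines above; the regular expression and the `compute_gaps` helper are assumptions written for this log format, not part of DLIO.

```python
import re

# The four log lines quoted above, copied verbatim.
LOG = """\
[INFO] 2024-04-02T17:40:01.359388 Rank 0 step 1: loaded 400 samples in 9.298324584960938e-05 s
[INFO] 2024-04-02T17:40:01.479351 Rank 0 step 1 processed 400 samples in 0.12005329132080078 s
[INFO] 2024-04-02T17:40:01.534976 Rank 0 step 2: loaded 400 samples in 0.05459904670715332 s
[INFO] 2024-04-02T17:40:01.636629 Rank 0 step 2 processed 400 samples in 0.15625238418579102 s
"""

# "loaded" lines have a colon after the step number, "processed" lines do not.
PATTERN = re.compile(r"step (\d+):? (loaded|processed) \d+ samples in (\S+) s")


def compute_gaps(log_text):
    """Return {step: processed_time - loaded_time} for each step in the log."""
    times = {}
    for step, kind, secs in PATTERN.findall(log_text):
        times.setdefault(int(step), {})[kind] = float(secs)
    return {step: t["processed"] - t["loaded"] for step, t in times.items()}


if __name__ == "__main__":
    for step, gap in sorted(compute_gaps(LOG).items()):
        print(f"step {step}: processed - loaded = {gap:.4f} s")
```

For step 2 the gap is about 0.102 s, close to the configured computation_time of 0.1 s, which is what one would see if the read and the simulated compute ran back to back rather than overlapping.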

zhenghh04 · 2024-04-04T14:16:08Z

Yes, we are aware of this issue. There is some CPU function blocking the I/O call in native_dali. We are working on a PR to address that.

crtierney changed the title ~~Is dlp.~~ IO does not appear to be asynchronous with the compute phase Apr 3, 2024
