[dataloader] dataloading improvement tracking issue #37

d4l3k · 2024-12-12T17:51:30Z

This is a tracking issue for dataloader improvements. The current support is very basic, and we will likely need some bigger changes to make it more efficient:

- Track dataloader step counts on a per-`replica_id` basis.
- Add a mechanism for reinstantiating the dataloader from a checkpoint and fast-forwarding to the correct step count.
- Or throw this all out and use a deterministic index managed by Lighthouse?
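A minimal sketch of the first two items, assuming the underlying data source yields batches in a deterministic order. `SteppedLoader`, `make_iter`, and `from_state` are hypothetical names used only for illustration; they are not part of any existing API.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator


class SteppedLoader:
    """Hypothetical wrapper that counts the batches consumed by one replica."""

    def __init__(
        self,
        make_iter: Callable[[], Iterable[Any]],
        replica_id: str,
        step: int = 0,
    ) -> None:
        self.replica_id = replica_id
        self.step = 0
        # Assumption: make_iter() always yields batches in the same order.
        self._it: Iterator[Any] = iter(make_iter())
        if step:
            self.fast_forward(step)

    def fast_forward(self, n: int) -> None:
        # Consume and discard up to n batches, counting how many were skipped.
        self.step += sum(1 for _ in islice(self._it, n))

    def __iter__(self) -> "SteppedLoader":
        return self

    def __next__(self) -> Any:
        batch = next(self._it)
        self.step += 1
        return batch

    def state_dict(self) -> Dict[str, Any]:
        # The step count is stored per replica_id.
        return {"replica_id": self.replica_id, "step": self.step}

    @classmethod
    def from_state(
        cls, make_iter: Callable[[], Iterable[Any]], state: Dict[str, Any]
    ) -> "SteppedLoader":
        # Reinstantiate from a checkpointed state and skip ahead.
        return cls(make_iter, state["replica_id"], step=state["step"])


def make_batches() -> Iterable[int]:
    return range(10)  # stand-in for a real dataset


loader = SteppedLoader(make_batches, "replica_0")
consumed = [next(loader) for _ in range(3)]
restored = SteppedLoader.from_state(make_batches, loader.state_dict())
```

Here `restored` resumes at the fourth batch. Fast-forwarding this way costs time proportional to the step count, which is one reason a deterministic index may be attractive.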

d4l3k · 2024-12-12T20:17:36Z

This relates to pytorch/data#1337

d4l3k · 2024-12-12T21:14:17Z

Notes from Andrew:

We do have a flag, “snapshot_every_n_steps”, that only updates the checkpoint every N steps (say, 10). There is also a counter in there, so if you request a checkpoint at step 15, it loads the snapshot from step 10 and then throws away 5 batches to recover the state.

This is very similar to what we want
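The arithmetic in Andrew's note can be sketched as follows. `plan_recovery` is a hypothetical helper written for illustration, not an existing function; only the `snapshot_every_n_steps` name comes from the note above.

```python
from typing import Tuple


def plan_recovery(requested_step: int, snapshot_every_n_steps: int) -> Tuple[int, int]:
    """Return (step of the snapshot to load, number of batches to discard)."""
    if snapshot_every_n_steps <= 0:
        raise ValueError("snapshot_every_n_steps must be positive")
    if requested_step < 0:
        raise ValueError("requested_step must be non-negative")
    # Most recent snapshot taken at or before the requested step.
    snapshot_step = (requested_step // snapshot_every_n_steps) * snapshot_every_n_steps
    return snapshot_step, requested_step - snapshot_step


snapshot_step, to_discard = plan_recovery(15, 10)
print(snapshot_step, to_discard)  # 10 5
```

With this scheme, at most `snapshot_every_n_steps - 1` batches are replayed on recovery, so the interval trades checkpoint overhead against recovery time.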
