[dataloader] dataloading improvement tracking issue #37

Open
3 tasks
d4l3k opened this issue Dec 12, 2024 · 2 comments

Comments


d4l3k commented Dec 12, 2024

This is a tracking issue for dataloader improvements. The current support is very basic, and we likely need to make some bigger changes to make it more efficient.

  • track dataloader step counts per replica_id
  • add a mechanism for re-instantiating the dataloader from a checkpoint and fast-forwarding to the correct step count
  • throw all of this out and use a deterministic index managed by Lighthouse?
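
A minimal sketch of how the first two tasks could fit together, assuming the dataloader yields batches in the same order every time it is built. `ReplicaStepTracker`, `record_step`, and `resume_dataloader` are hypothetical names used for illustration only; they are not part of any existing API in this project.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator


class ReplicaStepTracker:
    """Hypothetical helper: counts batches consumed by each replica_id."""

    def __init__(self) -> None:
        self.steps: Dict[str, int] = {}

    def record_step(self, replica_id: str) -> None:
        self.steps[replica_id] = self.steps.get(replica_id, 0) + 1

    def state_dict(self) -> Dict[str, int]:
        # Copy so the checkpoint is not mutated by later steps.
        return dict(self.steps)

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.steps = dict(state)


def resume_dataloader(
    make_loader: Callable[[], Iterable],
    tracker: ReplicaStepTracker,
    replica_id: str,
) -> Iterator:
    """Re-instantiate the loader and skip batches this replica already consumed.

    Assumes the loader yields batches in the same order every time it is built.
    """
    it = iter(make_loader())
    consumed = tracker.steps.get(replica_id, 0)
    # Fast-forward: draw and drop `consumed` batches.
    for _ in islice(it, consumed):
        pass
    return it
```

The tracker's `state_dict()` would be saved alongside the model checkpoint. On recovery, a replica restores it and calls `resume_dataloader` with its own `replica_id`. The cost of the fast-forward grows linearly with the step count, which is one reason a deterministic index may be the better option.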

d4l3k commented Dec 12, 2024

This relates to pytorch/data#1337


d4l3k commented Dec 12, 2024

Notes from Andrew:

We do have a flag, “snapshot_every_n_steps”, that updates the checkpoint only every N steps (say, 10). There is also a counter, so if you request a checkpoint at step 15, it loads the snapshot from step 10 and then throws away 5 batches to recover the state.

This is very similar to what we want.
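
To make the snapshot-plus-counter behavior concrete, here is a toy sketch of the idea described in the notes. Only the flag name `snapshot_every_n_steps` comes from the notes. The `SnapshotEveryN` class and its fields are assumptions for illustration and do not reflect how the real dataloader is implemented.

```python
from typing import Any, Dict, Iterator, List


class SnapshotEveryN:
    """Toy loader that refreshes its snapshot only every n steps."""

    def __init__(self, batches: List[Any], snapshot_every_n_steps: int) -> None:
        self.batches = batches
        self.n = snapshot_every_n_steps
        self.snapshot_step = 0  # step at which the last snapshot was taken
        self.steps_since = 0    # batches produced since that snapshot
        self._to_discard = 0    # batches to replay and drop after a restore

    def state_dict(self) -> Dict[str, int]:
        return {
            "snapshot_step": self.snapshot_step,
            "steps_since_snapshot": self.steps_since,
        }

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.snapshot_step = state["snapshot_step"]
        self._to_discard = state["steps_since_snapshot"]

    def __iter__(self) -> Iterator[Any]:
        step = self.snapshot_step  # restart from the last snapshot
        self.steps_since = 0
        discard, self._to_discard = self._to_discard, 0
        while step < len(self.batches):
            batch = self.batches[step]
            step += 1
            if step % self.n == 0:
                # Take a fresh snapshot and reset the counter.
                self.snapshot_step, self.steps_since = step, 0
            else:
                self.steps_since += 1
            if discard > 0:
                # Replay after a restore: throw this batch away.
                discard -= 1
                continue
            yield batch
        # End of the pass: the next iteration starts from the beginning.
        self.snapshot_step, self.steps_since = 0, 0
```

With `snapshot_every_n_steps=10`, a state captured after 15 batches records the snapshot at step 10 and a counter of 5. A loader restored from that state restarts at step 10 and drops 5 batches before yielding the batch for step 15.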
