Processing the data up front is slow and doesn't make good use of the hardware. It's better to tokenize/group on the CPUs while the GPUs are busy (see e.g. how DataLoader is used elsewhere).
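A minimal sketch of what this could look like with a PyTorch `IterableDataset`, assuming a Hugging Face-style `tokenizer` with an `encode()` method and an iterable of raw strings `texts` (both placeholders here). The point is that the tokenize/group work runs lazily in DataLoader worker processes on the CPU, overlapping with GPU training steps:

```python
# Sketch only: `tokenizer` and `texts` are assumed to exist in scope.
import itertools

import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class TokenizingDataset(IterableDataset):
    """Tokenizes lazily, so the work happens in DataLoader worker
    processes (on CPU) while the GPU executes training steps."""

    def __init__(self, texts, tokenizer, seq_len=1024):
        self.texts = texts
        self.tokenizer = tokenizer
        self.seq_len = seq_len

    def __iter__(self):
        info = get_worker_info()
        # Shard the text stream across workers so samples aren't duplicated.
        stride = info.num_workers if info else 1
        start = info.id if info else 0
        buffer = []
        for text in itertools.islice(self.texts, start, None, stride):
            buffer.extend(self.tokenizer.encode(text))
            # Group tokens into fixed-length training sequences as we go.
            while len(buffer) >= self.seq_len:
                yield torch.tensor(buffer[:self.seq_len])
                buffer = buffer[self.seq_len:]

# num_workers > 0 moves tokenization into CPU subprocesses, overlapping
# data prep with GPU compute instead of paying for it all up front.
loader = DataLoader(TokenizingDataset(texts, tokenizer),
                    batch_size=8, num_workers=4)
```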
Stream data rather than downloading it up front.
Sharded streaming for multi-node training.
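One way to get both of these, sketched with the Hugging Face `datasets` library; the dataset name is illustrative, and `RANK`/`WORLD_SIZE` are the environment variables a launcher like `torchrun` sets:

```python
# Sketch assuming the Hugging Face `datasets` library.
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# streaming=True iterates over the remote files instead of downloading
# and caching the whole corpus first.
ds = load_dataset("c4", "en", split="train", streaming=True)

# For multi-node training, give each rank a disjoint shard of the stream.
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

for example in ds:
    ...  # tokenize/group on the fly, as in the DataLoader sketch above
```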
Related: caching the fully preprocessed data to disk is very space-inefficient; a 100GB corpus blows up to 400GB.
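A back-of-envelope on where a ~4x blow-up could come from (assumed figures, not measured; the actual cache layout may differ):

```python
# Assumption: raw UTF-8 text averages ~4 bytes per BPE token, while the
# cache stores each token id plus an attention-mask entry as int64.
bytes_per_token_text = 4   # raw text on disk
bytes_per_token_ids = 8    # cached token id (int64)
bytes_per_token_mask = 8   # cached attention mask (int64)

blowup = (bytes_per_token_ids + bytes_per_token_mask) / bytes_per_token_text
print(blowup)  # 4.0 -> a 100GB corpus becomes ~400GB on disk
```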
One thing to note: we have to consider how to do this in the presence of multi-dataset training, i.e. when several corpora are mixed in a single run.
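One way to handle the multi-dataset case while still streaming, sketched with `datasets.interleave_datasets`; the dataset names and mixture weights are made up:

```python
# Sketch assuming the Hugging Face `datasets` library.
from datasets import interleave_datasets, load_dataset

web = load_dataset("c4", "en", split="train", streaming=True)
code = load_dataset("codeparrot/github-code", split="train", streaming=True)

# Sample from each stream with fixed probabilities; per-node sharding and
# on-the-fly tokenization then apply to the mixed stream exactly as in
# the single-dataset case.
mixed = interleave_datasets([web, code], probabilities=[0.8, 0.2], seed=42)
```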