Streaming and Sharded Data Loading #116

dlwh · 2022-03-10T19:34:03Z

Processing the data up front is slow and doesn't make good use of the hardware. Better to tokenize/group on the CPUs while the GPUs are busy (see e.g. DataLoader elsewhere).
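A rough sketch of what "tokenize/group while the GPUs are busy" could look like, using only the standard library. Everything here is an assumption for illustration: `prefetch_tokenized` and `toy_tokenize` are hypothetical names, and the toy tokenizer just maps characters to code points. A background thread fills a bounded queue with fixed-length token blocks while the consumer (the training loop) drains it.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def prefetch_tokenized(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_len: int,
    buffer_size: int = 8,
) -> Iterator[List[int]]:
    """Tokenize and group `texts` into blocks of `seq_len` tokens on a background thread."""
    out: queue.Queue = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        buf: List[int] = []
        try:
            for text in texts:
                buf.extend(tokenize(text))
                # Emit every full block; keep the remainder for the next document.
                while len(buf) >= seq_len:
                    out.put(buf[:seq_len])  # blocks when the queue is full
                    buf = buf[seq_len:]
        finally:
            # Always signal completion so the consumer never hangs.
            # Any trailing partial block is dropped.
            out.put(_DONE)

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = out.get()
        if item is _DONE:
            return
        yield item


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one token per character."""
    return [ord(c) for c in text]


blocks = list(prefetch_tokenized(["abcd", "efg", "hi"], toy_tokenize, seq_len=3))
```

A thread only helps if the tokenizer releases the GIL; otherwise worker processes (e.g. `num_workers` on a PyTorch `DataLoader`) are probably the better fit.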

Stream data rather than downloading it.
Sharded streaming for multi-node training.
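One way the sharding could work, as a sketch under assumptions: the corpus is already split into shard files, and each process knows its `rank` and the `world_size`. The helper names (`shards_for_rank`, `stream_for_rank`) and the shard names are hypothetical. Each rank reads a disjoint subset of shards, with a fallback that strides over a single shared stream.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shards_for_rank(shards: List[str], rank: int, world_size: int) -> List[str]:
    """Assign shards round-robin so each rank reads a disjoint subset."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return shards[rank::world_size]


def stream_for_rank(examples: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Fallback for a single stream: keep every `world_size`-th example, offset by `rank`."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return islice(examples, rank, None, world_size)


all_shards = [f"shard-{i}" for i in range(5)]  # hypothetical shard names
rank1_shards = shards_for_rank(all_shards, rank=1, world_size=2)
rank2_examples = list(stream_for_rank(range(10), rank=2, world_size=4))
```

Note that neither helper guarantees every rank sees the same number of examples, which matters if training steps are synchronized across nodes.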

Related: caching the fully preprocessed data to disk is very inefficient. A 100GB corpus blows up to 400GB.

We also have to consider how to do this in the presence of multi-dataset training.
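To make the multi-dataset concern concrete, here is one possible policy, not a decision: sample which stream to read from by weight, and drop a stream once it is exhausted. The function name `interleave`, the weights, and the toy data are all assumptions for illustration.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def interleave(
    streams: Dict[str, Iterable[T]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, T]]:
    """Yield (dataset_name, example) pairs, picking the source by weight until all are exhausted."""
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    active = {name: iter(stream) for name, stream in streams.items()}
    while active:
        names = list(active)
        # Weights must be positive; they are renormalized over the remaining streams.
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(active[name])
        except StopIteration:
            del active[name]  # this dataset is done; keep sampling from the rest


toy_streams = {"a": [1, 2, 3], "b": [10, 20]}  # toy data, not real corpora
toy_weights = {"a": 0.5, "b": 0.5}
mixed = list(interleave(toy_streams, toy_weights))
```

With streaming, the open questions are mostly about what happens when one dataset runs out first (stop, restart it, or continue with the rest, as above) and how this composes with per-rank sharding.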

dlwh added the enhancement (New feature or request) label Mar 10, 2022

dlwh added this to the Mistral V2 milestone Mar 10, 2022

This was referenced Mar 10, 2022

Tokenization crashes when using deepspeed #117

Closed

Streaming data for larger datasets #126

Closed

dlwh self-assigned this Mar 14, 2022

dlwh removed this from the Mistral V2 milestone Jun 6, 2022
