-
Anyway, I'm training with webdatasets, so I thought it'd be best to use […]. Any tips on how to load and write JSONs quickly would be appreciated.
-
I recommend JSONL; every other format is problematic in some way. We may remove JSON support at some point, or make it actually use JSONL.
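
Since the question asks for tips on loading and writing JSONs quickly, here is a minimal standard-library sketch of the JSONL pattern being recommended: one record per line, streamed rather than parsed as a single array. This is illustrative, not project code.

```python
import json

def write_jsonl(path, records):
    # One JSON object per line; appending is cheap and no enclosing
    # structure (like a JSON array) ever has to be rewritten.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(path):
    # Stream records one line at a time instead of loading the whole file.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)  # orjson.loads is a faster drop-in

records = list(read_jsonl("cuts.jsonl"))  # filename is illustrative
```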
-
There was a direction I was hoping we could go in order to easily support very large datasets, which is to start using stateless samplers that always select elements fully i.i.d., with no attempt to avoid duplicates. The sampler would be initialized with a list of manifests, and the sampling procedure would be to randomly select a manifest, then randomly select an item from it. I believe Piotr implemented this. To make it efficient you need to create a file that tells you the start character of each line in the JSONL file; I believe this was done as well.

With that in place, there isn't a strong need to process the entire dataset into a single JSONL file; you can just have a list of JSONL files, and they don't even have to be homogeneous in any sense. I suspect all the tools to do this already exist, although I forget the class names / PR numbers.

The only thing is, if you restart training you have to reset the seed to something that depends on the (restarted) batch number, so that it doesn't use the exact same items as it used on batch 0. The idea is that, generally speaking, we wouldn't use the concept of an epoch for training, just the batch number; see the Eden2 learning rate schedule, which supports this, if you were previously using the (epoch-dependent) Eden.
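
To make the shape of this concrete, here is a hedged sketch of the scheme described above. The names are mine for illustration; they are not the actual classes or PRs Piotr implemented. It builds a byte-offset index per JSONL manifest so items can be read by `seek()`, and derives the RNG seed from the batch number so a restart doesn't replay earlier batches.

```python
import json
import random

def build_offset_index(jsonl_path):
    """Record the start byte of every line so items can be read by seek()."""
    offsets = []
    with open(jsonl_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

class StatelessSampler:
    """Selects items fully i.i.d.: pick a manifest uniformly at random,
    then a line uniformly at random. No duplicate avoidance, no epochs."""

    def __init__(self, jsonl_paths, base_seed=0):
        self.paths = list(jsonl_paths)
        self.indexes = [build_offset_index(p) for p in self.paths]
        self.base_seed = base_seed

    def sample_batch(self, batch_idx, batch_size):
        # Seed the RNG from the (possibly restarted) batch number, so a
        # restart at batch N doesn't reuse the items from batch 0 onward.
        rng = random.Random(self.base_seed * 1_000_000_007 + batch_idx)
        batch = []
        for _ in range(batch_size):
            m = rng.randrange(len(self.paths))
            i = rng.randrange(len(self.indexes[m]))
            with open(self.paths[m], "rb") as f:
                f.seek(self.indexes[m][i])
                batch.append(json.loads(f.readline()))
        return batch
```

A real implementation would presumably keep file handles open rather than reopening per item, and store the offset index on disk next to each manifest, but the idea is the same: sampling depends only on the batch number and seed, never on accumulated state.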