Litdata optimize is very slow #417
Hi! Thanks for your contribution, great first issue!
Yes, this is expected. LitData orders the files based on their size. To process 10B tokens, I strongly recommend using a multi-machine job on the Lightning-AI platform. Processing 1B tokens took us 4 hours on 8 machines.
Thanks for the response, that makes sense! How many CPUs would you recommend for processing 10B, 100B, 500B, and 1T tokens in a reasonable timespan (say, <=3 days)?
Adding this studio here just in case it could serve as a helpful reference: Prepare the TinyLlama 1T Token Dataset. 😊
Thanks for the response! I'll definitely scale up the number of machines I'm using in that case!
Update: this may be a niche issue, but I suddenly realized that since I'm in an HPC environment, SLURM has only allocated a couple of CPUs to my job, even though `os.cpu_count()` reports every core on the node:

```python
>>> os.environ["SLURM_CPUS_ON_NODE"]
'2'
>>> os.cpu_count()
256
```
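If litdata derives its worker count from `os.cpu_count()`, it will try to use far more CPUs than SLURM has actually granted the job. Below is a minimal sketch of a workaround, assuming `optimize` accepts an explicit `num_workers` argument (it does in recent litdata releases, but check your version); the helper function is my own:

```python
import os


def allocated_cpus() -> int:
    """Prefer the SLURM allocation over os.cpu_count(),
    which sees the whole node rather than the job's share."""
    slurm_cpus = os.environ.get("SLURM_CPUS_ON_NODE")
    if slurm_cpus is not None:
        return int(slurm_cpus)
    return os.cpu_count() or 1


# Pass the value explicitly so litdata does not oversubscribe the job:
# optimize(fn=..., inputs=..., output_dir=..., num_workers=allocated_cpus())
```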
🐛 Bug
I'm processing 10B tokens of text with the `optimize` function in litdata. However, it seems like progress is very slow and often gets stuck altogether. The time estimate keeps increasing as more samples are processed (for instance, with 8 workers, the last time estimate was 7 hours, and it now appears to be stuck). I also tried using 1 worker, which produced a less varied time estimate but will take 50+ hours.

To Reproduce
I used the following script (the temp file creation was just to troubleshoot a previous issue). I have the latest version of litdata installed (0.2.29).
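The original script did not survive here; a minimal sketch of an equivalent `optimize` call, with a hypothetical `tokenize_fn` and placeholder paths, might look like this:

```python
from pathlib import Path

from litdata import optimize


def tokenize_fn(filepath: str):
    # Hypothetical per-file processing: yield one encoded item per line so
    # litdata can pack the results into chunks incrementally. A real script
    # would run an actual tokenizer here instead.
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            yield line.encode("utf-8")


if __name__ == "__main__":  # optimize spawns worker processes, so guard the entry point
    input_files = [str(p) for p in Path("data/raw").glob("*.txt")]
    optimize(
        fn=tokenize_fn,
        inputs=input_files,
        output_dir="data/optimized",
        chunk_bytes="64MB",  # target size of each output chunk
        num_workers=8,       # set explicitly; see the SLURM note above
    )
```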
Expected behavior
I expected processing to take a consistent time and not get stuck when using multiple workers. The number of CPUs available is much larger than 8.
Additional context
Environment detail
- How you installed litdata (`conda`, `pip`, source): pip