
Using data.cache() causes an out-of-memory error #258

Open
HamLaertes opened this issue Nov 6, 2020 · 5 comments

Comments

@HamLaertes

Hi guys, when I ran the command lqz TrainR2BStrongBaseline, my machine ran out of memory. My server has 64GB of RAM.
I checked the code and found that it caches both the train_data and the validation_data, and I think this is why my memory is exhausted.
I didn't see any existing issue about this, and I'm curious: do you have machines with enough memory to cache the whole ImageNet dataset (up to 150GB)?
For now I have deleted the code that caches the dataset, and the training process runs smoothly. Is there any way to cache only part of the training data?

@jneeven
Contributor

jneeven commented Nov 6, 2020

Hi! ImageNet is indeed quite large, which is why we use multi-GPU training. Since each of the GPUs comes with several CPUs and 64GB of RAM on the compute platform we use, ImageNet does fit into memory if you use four GPUs in parallel. Unfortunately I don't think there is currently a clean way to cache only part of a dataset; you could try to split the dataset up and get creative with tf.data.Dataset.concatenate and then do the caching before concatenating both parts of the dataset, but I doubt this would be very efficient either way. Not caching the dataset is probably your best solution (although it is unfortunately a bit slow). Good luck!
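For reference, a minimal sketch of that split-and-concatenate idea on a plain tf.data pipeline; the split point and the downstream shuffle/batch calls are illustrative placeholders, not larq-zoo's actual pipeline:

```python
import tensorflow as tf

def partially_cached(dataset: tf.data.Dataset, num_cached: int) -> tf.data.Dataset:
    # Cache only the first `num_cached` examples in memory; the remainder
    # is re-read from disk every epoch.
    cached_part = dataset.take(num_cached).cache()
    uncached_part = dataset.skip(num_cached)
    return cached_part.concatenate(uncached_part)

# Hypothetical usage with an existing `train_data` pipeline:
# train_data = partially_cached(train_data, num_cached=100_000)
# train_data = train_data.shuffle(10_000).batch(128).prefetch(tf.data.AUTOTUNE)
```

Note that `.skip()` still has to read and discard the cached prefix from the source each epoch, which is part of why this is unlikely to be very efficient, as mentioned above.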

@AdamHillier
Contributor

Just to add, if you're on a single-GPU machine, disabling caching shouldn't have too much of an impact, especially if your dataset is stored on fast storage, e.g. an SSD.
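To illustrate, a hedged sketch of what an uncached pipeline could look like; `preprocess`, the image size, and the batch size are placeholders rather than larq-zoo's actual code:

```python
import tensorflow as tf

def preprocess(image, label):
    # Placeholder preprocessing: resize and scale pixel values to [0, 1].
    image = tf.image.resize(image, (224, 224))
    return tf.cast(image, tf.float32) / 255.0, label

def make_uncached_pipeline(dataset: tf.data.Dataset, batch_size: int = 128) -> tf.data.Dataset:
    return (
        dataset
        # No .cache(): each epoch re-reads from disk, which an SSD handles well.
        .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(10_000)
        .batch(batch_size)
        # Prefetching overlaps input I/O with training to hide most of the extra cost.
        .prefetch(tf.data.AUTOTUNE)
    )
```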

@sfalkena

Hi, I have an additional question about caching ImageNet. I can configure my training setup with enough RAM to cache ImageNet. However, when I run a multi-stage experiment, RAM usage increases again during the second stage. I think the problem is that each TrainLarqZooModel caches the dataset again. Are you aware of any way to reuse the dataset across stages, or to release the dataset from RAM? Perhaps it would make sense to move data loading one level up so that it is called once per experiment, regardless of whether it is a single- or multi-stage experiment.

@jneeven
Contributor

jneeven commented Nov 12, 2021

Hi @sfalkena, I have indeed run into this problem before and didn't realise it would apply here as well. Unfortunately I have not found any robust way to remove the dataset from RAM, so your suggestion of moving the data loading one level up sounds like the best approach. In my case I wasn't using ImageNet and was only running slightly over the memory limit, so it was feasible to just increase RAM a little, but that is of course not a maintainable solution, especially at ImageNet scale...

@sfalkena

Thanks for your fast answer. I am currently testing whether keeping the cache file on disk works without slowing training down too much. Otherwise I'll work around it for now by training the stages separately. If I have more time in the future, I'd be happy to contribute to moving the data loading a level up.
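In case it helps, a tiny sketch of the on-disk variant; the cache path and the Dataset.range stand-in are placeholders:

```python
import tensorflow as tf

# Passing a filename to .cache() writes the cache to disk instead of holding it
# in RAM. The first pass populates the cache files; subsequent epochs (and runs
# that point at the same path) read from them instead of re-decoding the data.
train_data = tf.data.Dataset.range(10)  # stand-in for the real training dataset
train_data = train_data.cache("/tmp/imagenet_train.cache")
```

If the same cache path is reused, a later stage may be able to read the cache written by an earlier one rather than rebuilding it, though that would need to be verified for the multi-stage setup described here.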
