PersistentDataset and CacheDataset hybrid #6753

Open
ibro45 opened this issue Jul 21, 2023 · 3 comments

ibro45 (Contributor) commented Jul 21, 2023

Is your feature request related to a problem? Please describe.
CacheDataset applies the non-random transforms and keeps the results in RAM.
PersistentDataset applies the non-random transforms on its first run and pickles the results to disk; every subsequent run reads those files on the fly and applies only the random transforms.

However, when prototyping you often rerun a setup with different hyper-parameters, and each time you wait for CacheDataset to redo all the non-random transforms. PersistentDataset, on the other hand, doesn't need to preprocess again on each run, but it can still be slower than CacheDataset because it reads the objects from the drive instead of from RAM.

Describe the solution you'd like
I propose a combination of the two, which could also be framed as an extension of PersistentDataset that loads the pickled files into RAM. This way the non-random transforms are only ever applied once, rather than being redone every time the data is loaded into RAM.
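A minimal sketch of what I have in mind (the class name is hypothetical, and it relies on PersistentDataset's internal _cachecheck/_post_transform hooks, which are MONAI 1.x implementation details and may change):

```python
import copy

from monai.data import PersistentDataset


class RAMPersistentDataset(PersistentDataset):
    """Hypothetical hybrid: pickled pre-random-transform items are read
    from cache_dir once, then served from RAM on every later access."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._ram_cache = {}  # index -> deserialized cached item

    def _transform(self, index: int):
        if index not in self._ram_cache:
            # First access: load the pickled item from cache_dir
            # (PersistentDataset creates it here if it doesn't exist yet).
            self._ram_cache[index] = self._cachecheck(self.data[index])
        # Deep-copy so the random transforms can't mutate the cached copy,
        # mirroring CacheDataset's copy_cache=True behavior.
        return self._post_transform(copy.deepcopy(self._ram_cache[index]))
```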

wyli (Contributor) commented Jul 21, 2023

Thanks for the feature request. A runtime_cache flag was recently added to CacheDataset; I think it does what you described:

MONAI/monai/data/dataset.py

Lines 787 to 801 in 4addc5d

runtime_cache: mode of cache at the runtime. Default to `False` to prepare
the cache content for the entire ``data`` during initialization, this potentially largely increase the
time required between the constructor called and first mini-batch generated.
Three options are provided to compute the cache on the fly after the dataset initialization:
1. ``"threads"`` or ``True``: use a regular ``list`` to store the cache items.
2. ``"processes"``: use a ListProxy to store the cache items, it can be shared among processes.
3. A list-like object: a users-provided container to be used to store the cache items.
For `thread-based` caching (typically for caching cuda tensors), option 1 is recommended.
For single process workflows with multiprocessing data loading, option 2 is recommended.
For multiprocessing workflows (typically for distributed training),
where this class is initialized in subprocesses, option 3 is recommended,
and the list-like object should be prepared in the main process and passed to all subprocesses.
Not following these recommendations may lead to runtime errors or duplicated cache across processes.
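For reference, a minimal usage sketch of the flag (the file list and transform chain are placeholders):

```python
from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, RandFlipd

# Placeholder file list and transform chain.
data = [{"image": "img0.nii.gz"}, {"image": "img1.nii.gz"}]
transforms = Compose([
    LoadImaged(keys="image"),
    EnsureChannelFirstd(keys="image"),
    RandFlipd(keys="image", prob=0.5),  # random transform, never cached
])

# runtime_cache=True fills the cache lazily on first access instead of
# blocking in the constructor; the deterministic transforms still run
# only once per item. With num_workers > 0 you'd want "processes" instead.
ds = CacheDataset(data=data, transform=transforms, runtime_cache=True)
loader = DataLoader(ds, batch_size=1, num_workers=0)
```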

ibro45 (Contributor, Author) commented Jul 21, 2023

This is a nice addition, thanks for pointing it out!

I was suggesting something a bit different. Maybe I should frame it this way: a CacheDataset that pickles the non-random-transformed objects on the first run, and in subsequent runs loads them directly into RAM without having to redo the non-random transforms.

Or, equivalently, a PersistentDataset whose pickled files are loaded into RAM instead of being read from the drive throughout training. If the current PersistentDataset behavior were expressed with a cache_rate, it would correspond to cache_rate=0. I'm basically suggesting a PersistentDataset with a cache_rate, as in CacheDataset, specifying how much of the data should be kept in RAM at all times; see the sketch below.
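A self-contained sketch of that idea, refining the one in the original post (the class name and the cache_rate argument are hypothetical; it again leans on the internal _cachecheck/_post_transform hooks):

```python
import copy

from monai.data import PersistentDataset


class CachedPersistentDataset(PersistentDataset):
    """Hypothetical PersistentDataset with a CacheDataset-style cache_rate:
    the first cache_rate fraction of items is kept deserialized in RAM;
    the rest keeps the plain read-from-cache_dir behavior."""

    def __init__(self, *args, cache_rate: float = 1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self._cache_num = int(len(self.data) * cache_rate)
        self._ram_cache = {}

    def _transform(self, index: int):
        if index < self._cache_num:
            # RAM-cached slice: deserialize once, reuse afterwards.
            if index not in self._ram_cache:
                self._ram_cache[index] = self._cachecheck(self.data[index])
            item = copy.deepcopy(self._ram_cache[index])
        else:
            # Uncached slice: read the pickled file on every access.
            item = self._cachecheck(self.data[index])
        return self._post_transform(item)
```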

wyli added the enhancement (New feature or request) and Feature request labels on Jul 21, 2023
wyli (Contributor) commented Jul 21, 2023

Thanks @ibro45, that's a good idea; adding the Feature request label here.
