PersistentDataset and CacheDataset hybrid #6753

Open
ibro45 opened this issue Jul 21, 2023 · 3 comments

ibro45 (Contributor) commented Jul 21, 2023

Is your feature request related to a problem? Please describe.
CacheDataset applies the non-random transforms and keeps the results in RAM.
PersistentDataset applies the non-random transforms on its first run and pickles the results to disk; every subsequent run reads those files on the fly and applies only the random transforms.

However, when prototyping you often rerun a setup with different hyper-parameters, and each time you wait for CacheDataset to redo all the non-random transforms. PersistentDataset, on the other hand, doesn't need to preprocess again on each run, but it can still be slower than CacheDataset because it reads the objects from the drive instead of from RAM.

Describe the solution you'd like
I propose a combination of the two, which could also be framed as an extension of PersistentDataset that loads the pickled files into RAM. This way the non-random transforms are only ever applied once, rather than being redone every time the data is loaded into RAM.
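A minimal sketch of what I have in mind (the class name is hypothetical, and it relies on PersistentDataset's internal _cachecheck/_post_transform hooks, which are MONAI 1.x implementation details and may change):

```python
import copy

from monai.data import PersistentDataset


class RAMPersistentDataset(PersistentDataset):
    """Hypothetical hybrid: pickled pre-random-transform items are read
    from cache_dir once, then served from RAM on every later access."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._ram_cache = {}  # index -> deserialized cached item

    def _transform(self, index: int):
        if index not in self._ram_cache:
            # First access: load the pickled item from cache_dir
            # (PersistentDataset creates it here if it doesn't exist yet).
            self._ram_cache[index] = self._cachecheck(self.data[index])
        # Deep-copy so the random transforms can't mutate the cached copy,
        # mirroring CacheDataset's copy_cache=True behavior.
        return self._post_transform(copy.deepcopy(self._ram_cache[index]))
```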

wyli (Contributor) commented Jul 21, 2023

Thanks for the feature request. A runtime_cache flag was recently added to CacheDataset; I think it does what you described:

MONAI/monai/data/dataset.py

Lines 787 to 801 in 4addc5d

runtime_cache: mode of cache at the runtime. Default to `False` to prepare
the cache content for the entire ``data`` during initialization, this potentially largely increase the
time required between the constructor called and first mini-batch generated.
Three options are provided to compute the cache on the fly after the dataset initialization:
1. ``"threads"`` or ``True``: use a regular ``list`` to store the cache items.
2. ``"processes"``: use a ListProxy to store the cache items, it can be shared among processes.
3. A list-like object: a users-provided container to be used to store the cache items.
For `thread-based` caching (typically for caching cuda tensors), option 1 is recommended.
For single process workflows with multiprocessing data loading, option 2 is recommended.
For multiprocessing workflows (typically for distributed training),
where this class is initialized in subprocesses, option 3 is recommended,
and the list-like object should be prepared in the main process and passed to all subprocesses.
Not following these recommendations may lead to runtime errors or duplicated cache across processes.
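For reference, a minimal usage sketch of the flag (the file list and transform chain are placeholders):

```python
from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, RandFlipd

# Placeholder file list and transform chain.
data = [{"image": "img0.nii.gz"}, {"image": "img1.nii.gz"}]
transforms = Compose([
    LoadImaged(keys="image"),
    EnsureChannelFirstd(keys="image"),
    RandFlipd(keys="image", prob=0.5),  # random transform, never cached
])

# runtime_cache=True fills the cache lazily on first access instead of
# blocking in the constructor; the deterministic transforms still run
# only once per item. With num_workers > 0 you'd want "processes" instead.
ds = CacheDataset(data=data, transform=transforms, runtime_cache=True)
loader = DataLoader(ds, batch_size=1, num_workers=0)
```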

ibro45 (Contributor, Author) commented Jul 21, 2023

This is a nice addition, thanks for pointing it out!

I was suggesting something a bit different. Maybe I should frame it this way: a CacheDataset that pickles the non-random-transformed objects on the first run, and in subsequent runs loads them directly into RAM without having to redo the non-random transforms.

Or, equivalently, a PersistentDataset whose pickled files are loaded into RAM instead of being read from the drive throughout training. If the current PersistentDataset behavior were expressed with a cache_rate, it would correspond to cache_rate=0. I'm basically suggesting a PersistentDataset with a cache_rate, as in CacheDataset, specifying how much of the data should be kept in RAM at all times; see the sketch below.
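A self-contained sketch of that idea, refining the one in the original post (the class name and the cache_rate argument are hypothetical; it again leans on the internal _cachecheck/_post_transform hooks):

```python
import copy

from monai.data import PersistentDataset


class CachedPersistentDataset(PersistentDataset):
    """Hypothetical PersistentDataset with a CacheDataset-style cache_rate:
    the first cache_rate fraction of items is kept deserialized in RAM;
    the rest keeps the plain read-from-cache_dir behavior."""

    def __init__(self, *args, cache_rate: float = 1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self._cache_num = int(len(self.data) * cache_rate)
        self._ram_cache = {}

    def _transform(self, index: int):
        if index < self._cache_num:
            # RAM-cached slice: deserialize once, reuse afterwards.
            if index not in self._ram_cache:
                self._ram_cache[index] = self._cachecheck(self.data[index])
            item = copy.deepcopy(self._ram_cache[index])
        else:
            # Uncached slice: read the pickled file on every access.
            item = self._cachecheck(self.data[index])
        return self._post_transform(item)
```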

wyli added the enhancement (New feature or request) and Feature request labels on Jul 21, 2023
wyli (Contributor) commented Jul 21, 2023

Thanks @ibro45, that's a good idea; adding the Feature request label here.
