-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optional file-based listings caching #895
base: master
Are you sure you want to change the base?
Add optional file-based listings caching #895
Conversation
f990b3e
to
2eb120d
Compare
Dear Benjamin, I want to thank you very much for tackling this improvement to fsspec. In order to shed more light on the background of this, I would like to reference earthobservations/wetterdienst#243, where we are coming from. This improvement solves a longstanding issue for us, as the dbmfile-based backend of dogpile.cache proved to be too brittle when operating the Wetterdienst library in concurrent workload situations. I used to work with both Beaker and dogpile.cache in the past and they really shine in environments where you can easily afford to run a proper database-based cache backend, like Memcached, Redis, or even MongoDB. However, this is not the situation with Wetterdienst: It is mostly library- and commandline-driven, so we didn't want to impose the overhead of running a caching database on its users. As discussed at earthobservations/wetterdienst#243 (comment), I was very happy to learn about DiskCache by @grantjenks (thank you!) from you, as its features promised it to be the absolute right thing to use in a library context. On my ramblings
@martindurant answered:
In this manner, thank you again for making a start with this patch. I hope the integration will go along well. With kind regards, Further referencesI discovered those resources in the issue tracker, which also indicate that the current fsspec/caching.py might have shortcomings in certain situations. Maybe those routines can also be improved by bringing DiskCache to the table?
|
Should/is this feature disabled by default? I couldn't tell from the code what the default behavior is. I could see it being pretty confusing if not. E.g. an application that monitors for changes in a bucket using
|
The dircache is used for more operations than simply caching the output of ls, so I don't think the simple approach above does everything you want. |
I'm mainly just cautioning against caching by default...is it a default? |
Yes, we cache file listings for remote filesystems, where getting such listings can be pretty expensive. |
So this is the expected default behavior:
I could imagine losing hours debugging something like that. |
|
I'm not sure it is a bug. I didn't run the code above...I just meant it as an illustrative example. Suppose instead of |
Agreed, the cache can get out of sync when processes are happening elsewhere, so we have the ability to force refresh, set an expiry time or turn the listings cache off. However, the default is to cache for the duration of the python session, which seems (to me) to align with the most common use cases. |
a151320
to
17e926e
Compare
Dear all, finally updated the PR. Could you please give me a review? |
Hi Benjamin and Martin, there is another discussion over at #1127, which I wanted to make you aware about. It may happen that this patch will get rejected, or not merged in its current form.
If @martindurant agrees with my evaluation that it is indeed unlikely that it will get merged, I am thinking about shipping it on behalf of a separate utility package, in order to overcome the roadblock in making the recipe reusable beyond the Wetterdienst package, where it is currently used. What do you think about this proposal? With kind regards, |
docs/source/features.rst
Outdated
skip the caching altogether, use ``use_listings_cache=False``. That would be appropriate | ||
when the target location is known to be volatile because it is being written to from other | ||
sources. If you want to use the file-based caching, you can also provide the argument | ||
``listings_cache_location`` to determine where the cache file is stored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think there's anything that can be done at this stage to simplify all these parameters? Should we have a caching_options dict, or config options (or both) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear @martindurant ,
I updated the configurations and put them all into one dict named listings_cache_options
to make clear what it refers to.
Probably options for [data] caching and listings caching should not be put together to keep it simple and because keywords e.g. "expiriy_time" are similarly named.
fsspec/dircache.py
Outdated
listings_expiry_time = listings_expiry_time and float(listings_expiry_time) | ||
|
||
if listings_cache_location: | ||
listings_cache_location = Path(listings_cache_location) / str( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we use Path within a filesystem library :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can diskcache handle fsspec paths? E.g., memory://cache would be a very similar solution to MemDirCache. Such an option might even be useful for semi-speedy remotes like smb:
fsspec/implementations/http.py
Outdated
@@ -103,6 +103,8 @@ def __init__( | |||
request_options.pop("listings_expiry_time", None) | |||
request_options.pop("max_paths", None) | |||
request_options.pop("skip_instance_cache", None) | |||
request_options.pop("listings_cache_type", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is problematic - where else might these kwargs pollute calls? Do we need to change the signatures of all backends?
Given progress here, I think we can fix this up, and see whether it garners any interest. |
4dad114
to
58165e4
Compare
fsspec/asyn.py
Outdated
@@ -297,15 +297,15 @@ class AsyncFileSystem(AbstractFileSystem): | |||
mirror_sync_methods = True | |||
disable_throttling = False | |||
|
|||
def __init__(self, *args, asynchronous=False, loop=None, batch_size=None, **kwargs): | |||
def __init__(self, *args, asynchronous=False, loop=None, batch_size=None, listings_cache_options=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we name it dir_cache_options
instead to adhere with MemoryDirCache/FileDirCache
?
fsspec/spec.py
Outdated
if not listings_cache_options: | ||
listings_cache_options = {} | ||
elif listings_cache_options is True: | ||
listings_cache_options = {"cache_type": CacheType.MEMORY, "expiry_time": None} | ||
else: | ||
listings_cache_options = {"expiry_time": None} | listings_cache_options | ||
cache_type = listings_cache_options.get("cache_type") | ||
if cache_type: | ||
if isinstance(cache_type, str): | ||
try: | ||
cache_type = CacheType[cache_type.upper()] | ||
except KeyError as e: | ||
raise ValueError( | ||
f"Cache type must be one of {', '.join(ct.name.lower() for ct in CacheType)}") from e | ||
listings_cache_options["cache_type"] = cache_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's too much going on here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine! It could be a norm_cache_options
util functions to pull it out of here.
It seems like the "no cache" case should still have a no-op dict to cope with this kind of thing. So, there would be a third cache type, one that corresponds to the |
Dear @martindurant , thanks for the reminder. I think I need just a bit of time. I will work on this in about 2 weeks again and push it actively. |
029d267
to
eb42007
Compare
eb42007
to
6133acb
Compare
Dear @martindurant , still on it... Currently some tests are failing which I believe I'm not responsible for. Maybe some reallife backend has changed or so. |
This seems like it might be real:
Could it be that the file cache is causing a significant slow-down? Perhaps we just need to alter the test a bit to allow for a smaller speedup factor like 1.1 . |
Dear @martindurant and fsspec team,
this PR adds an additional file-based listings cache with the help of the beautiful diskcache[1]. This is relevant as the listings cache docs mention that the creation of certain large listings can be really expensive.
Following changes are done:
Please check it out and let me know for required changes!
Cheers
Benjamin
[1] http://www.grantjenks.com/docs/diskcache/
[2] http://www.grantjenks.com/docs/diskcache/api.html#diskcache.Cache
#787
CC @amotl