Make NetCDF file cache handling compatible with dask distributed #2822

gerritholl · 2024-06-14T08:05:47Z

This PR makes file cache handling in the NetCDF4FileHandler compatible with dask distributed. It adds a utility function in satpy.readers.utils called get_distributed_friendly_dask_array, which produces a dask.array from a netCDF4 variable. The resulting array can be used in an xarray object, and dask graphs that include it remain picklable and thus computable with dask distributed. NetCDF4FileHandler now uses this utility function and replaces its homegrown file handle caching with xarray.backends.CachingFileManager, which the utility function requires.
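The picklability point can be illustrated with the standard library alone. The sketch below is not Satpy or xarray code: it uses a throwaway binary file in place of a NetCDF file and `functools.partial(open, ...)` as a stand-in for the idea of shipping a recipe for opening a file rather than the open handle itself.

```python
import functools
import os
import pickle
import tempfile

# Write a small throwaway file to stand in for a NetCDF file (assumption).
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"\x01\x02\x03")
    path = tmp.name

# An open file handle cannot be pickled, so a task that holds one cannot
# be sent to a worker running in another process.
handle = open(path, "rb")
try:
    pickle.dumps(handle)
    handle_picklable = True
except TypeError:
    handle_picklable = False
finally:
    handle.close()

# A recipe for opening the file (callable plus arguments) pickles fine.
recipe = functools.partial(open, path, "rb")
restored = pickle.loads(pickle.dumps(recipe))
with restored() as reopened:
    data = reopened.read()

os.unlink(path)
```

A file manager that stores how to open the file, and only acquires the handle inside the task, follows the same principle.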

Closes LI and FCI readers do not work with dask distributed scheduler #2815
Tests added
Fully documented

Start work on a utility function to get a dask array from a dataset variable in a way that is friendly to dask.distributed.

For the distributed-friendly dask array helper, parameterise the test to cover more cases. Simplify the implementation.

We need to force the shape and the dtype when getting the dask-distributed-friendly xarray-dataarray. Seems to have a first working prototype now.

Add group support for getting a dask distributed friendly dask array. Speed up the related tests by sharing the dask distributed client setup and breakdown.

Add partial backward compatibility for accessing the file handle attribute when using caching with a NetCDF4FileHandler base class. Backward compatibility is not 100%. Deleting the ``FileHandler`` closes the manager and therefore the ``file_handle`` property. However, accessing the ``file_handle`` property after deleting the ``FileHandler`` reopens the file. Therefore, calling ``__del__()`` manually and then accessing ``fh.file_handle`` will now return an open file (previously a closed file). This should not happen in any sane use scenario.
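The reopening behaviour described in this commit message can be sketched with a small stand-in class. `CachedHandler` and its `close()` method are hypothetical and only mimic the pattern of a property that acquires a handle on demand; they are not the NetCDF4FileHandler implementation.

```python
import os


class CachedHandler:
    """Hypothetical handler whose ``file_handle`` is opened on demand."""

    def __init__(self, path):
        self._path = path
        self._handle = None

    @property
    def file_handle(self):
        # Reopen whenever there is no usable handle, including after close().
        if self._handle is None or self._handle.closed:
            self._handle = open(self._path, "rb")
        return self._handle

    def close(self):
        # Stands in for what deleting the real file handler does.
        if self._handle is not None:
            self._handle.close()


handler = CachedHandler(os.devnull)
first = handler.file_handle
handler.close()
closed_after_close = first.closed

# Accessing the property again yields a fresh, open handle.
reopened = handler.file_handle
reopened_is_open = not reopened.closed
handler.close()
```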

With the new dask-distributed-friendly caching, make sure we are respecting auto_maskandscale and are not applying scale factors twice.
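To see why applying scale factors twice matters, here is a small arithmetic sketch. The packed values, `scale_factor` and `add_offset` are made up for illustration, and `apply_scaling` is a hypothetical helper that mirrors the usual CF unpacking formula. If the NetCDF library has already unpacked the data because automatic masking and scaling is on, a reader that applies the same formula again gets wrong values.

```python
def apply_scaling(values, scale_factor, add_offset):
    """Apply CF-style unpacking: value * scale_factor + add_offset."""
    return [value * scale_factor + add_offset for value in values]


# Illustrative packed integers and attributes (assumptions, not real data).
packed = [0, 100, 200]
scale_factor, add_offset = 0.5, 10.0

# Correct: unpack exactly once.
unpacked_once = apply_scaling(packed, scale_factor, add_offset)

# Wrong: unpacking data that was already unpacked.
unpacked_twice = apply_scaling(unpacked_once, scale_factor, add_offset)
```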

codecov · 2024-06-20T09:36:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.06%. Comparing base (5e27be4) to head (7c173e7).
Report is 67 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2822   +/-   ##
=======================================
  Coverage   96.05%   96.06%           
=======================================
  Files         370      370           
  Lines       54320    54382   +62     
=======================================
+ Hits        52177    52240   +63     
+ Misses       2143     2142    -1

| Flag | Coverage Δ | |
|---|---|---|
| behaviourtests | `3.99% <0.00%> (-0.01%)` | ⬇️ |
| unittests | `96.15% <100.00%> (+<0.01%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

Remove a dead code except block that should never be reached.

Migrate TestNetCDF4FileHandler from unittest.TestCase to a regular class. Use a pytest fixture for the temporary NetCDF file.

Broaden the string that is matched against in TestNetCDF4FileHandler.test_filenotfound. On Linux and MacOS the expected failure gives "No such file or directory". On Windows it gives "Invalid file format".
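A broadened match of this kind is typically a regular-expression alternation, since `pytest.raises(..., match=...)` checks the pattern with `re.search`. The sketch below uses only `re`; the two message strings are illustrative examples built from the phrases quoted in the commit message, not captured output.

```python
import re

# One pattern that accepts either platform-specific message.
pattern = re.compile(r"No such file or directory|Invalid file format")

# Illustrative error messages (assumptions based on the phrases above).
messages = {
    "linux_or_macos": "No such file or directory: 'missing.nc'",
    "windows": "Invalid file format: 'missing.nc'",
    "unrelated": "Permission denied: 'missing.nc'",
}

matched = {name: bool(pattern.search(text)) for name, text in messages.items()}
```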

coveralls · 2024-06-20T13:20:09Z

Pull Request Test Coverage Report for Build 9597790771

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

92 of 92 (100.0%) changed or added relevant lines in 4 files are covered.
21 unchanged lines in 5 files lost coverage.
Overall coverage increased (+0.006%) to 96.046%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| satpy/readers/utils.py | 1 | 93.19% |
| satpy/readers/generic_image.py | 1 | 97.7% |
| satpy/readers/__init__.py | 2 | 98.68% |
| satpy/readers/hdf5_utils.py | 5 | 92.77% |
| satpy/readers/sar_c_safe.py | 12 | 97.28% |

Totals
Change from base Build 9497248893:	0.006%
Covered Lines:	51643
Relevant Lines:	53769

💛 - Coveralls

djhoese

Awesome job deciphering how to use the CachingFileManager and wrapping things in a map_blocks task. I think this is really close to being done, but I had some concerns about the helper function.

djhoese · 2024-06-26T17:24:34Z

satpy/readers/utils.py

+        manager (xarray.backends.CachingFileManager):
+            Instance of xarray.backends.CachingFileManager encapsulating the
+            dataset to be read.


We should check how the docs render this. If the argument type isn't "clickable" to go directly to the xarray docs for the CFM then we could wrap the mention of it in the description with:

:class:`xarray.backends.CachingFileManager`

The argument type was already clickable, but the mention in the description was not. I have now made it clickable in both cases, as seen in a local documentation build.

djhoese · 2024-06-26T17:32:05Z

satpy/readers/utils.py

+def get_distributed_friendly_dask_array(manager, varname, chunks, dtype,
+                                        group="/", auto_maskandscale=None):


I'm not sure how I feel about this function name. Obviously it makes sense in this PR because it solves this specific problem, but it feels like there is a (shorter) more generic name that gets the point across. Another thing is that distributed_friendly is mentioned here, but that friendliness is a side effect of the "serializable" nature of the way you're accessing the data here, right? get_serializable_dask_array?

I don't feel super strongly about this, but the name was distracting to me so I thought I'd say something.

Renamed to get_serializable_dask_array.

djhoese · 2024-06-26T17:47:18Z

satpy/readers/utils.py

+            method set_auto_maskandscale, such as is the case for
+            NetCDF4.Dataset.
+    """
+    def get_chunk():


The chunks is never used here. The current calling from the file handler is accessing the full shape of the variable so this is fine, but only for now. I mean that map_blocks will only ever call this function once. However, if you added a block_info kwarg to the function signature or whatever the map_blocks special keyword argument is, then you could change [:] to access a specific sub-set of the NetCDF file variable and only do a partial load. This should improve performance a lot (I think 🤞) if it was actually used in the file handler.

> The chunks is never used here.

Hm? I'm passing chunks=chunks when I call da.map_blocks. What do you mean, it is never used? Do you mean I could be using chunk-location and num-chunks from a block_info dictionary passed to get_chunk?

> The current calling from the file handler is accessing the full shape of the variable so this is fine, but only for now. I mean that map_blocks will only ever call this function once. However, if you added a block_info kwarg to the function signature or whatever the map_blocks special keyword argument is, then you could change [:] to access a specific sub-set of the NetCDF file variable and only do a partial load. This should improve performance a lot (I think 🤞) if it was actually used in the file handler.

I will try to wrap my head around this ☺

Yes I think that's what I'm saying. I think the result of get_chunk() right now is broken for any chunk size other than the full shape of the array because you never do any slicing of the NetCDF variable inside get_chunk(). So, if you had a full array of 100x100 and a chunk size of 50x50, then map_blocks would call this function 4 times ((0-50, 0-50), (0-50, 50-100), (50-100, 0-50), (50-100, 50-100)). BUT each call would return the full variable 100x100. So I think this would be a case where the dask array would say "yeah, I have shape 100x100", but then once you computed it you'd get a 200x200 array back.

Fixed it now, I think.
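The block-wise slicing discussed in this thread can be sketched as follows. This is not the code from the PR: a NumPy array stands in for the NetCDF variable, and the 100x100 shape with 50x50 chunks follows the example in the comment above. The sketch relies on dask's documented `block_info` keyword, where `block_info[None]["array-location"]` gives the (start, stop) range of the output block in each dimension.

```python
import dask.array as da
import numpy as np

# Stand-in for a NetCDF variable (assumption); anything sliceable works.
variable = np.arange(100 * 100, dtype="float32").reshape(100, 100)


def get_chunk(block_info=None):
    """Return only the part of ``variable`` that belongs to this block."""
    # One (start, stop) pair per dimension of the output array.
    location = block_info[None]["array-location"]
    return variable[tuple(slice(start, stop) for start, stop in location)]


lazy = da.map_blocks(
    get_chunk,
    chunks=((50, 50), (50, 50)),
    dtype=variable.dtype,
    meta=np.array((), dtype=variable.dtype),
)
result = lazy.compute(scheduler="synchronous")

# For comparison: stitching four full-size copies, which is what returning
# the whole variable from every block amounts to, gives a 200x200 array.
naive = np.block([[variable, variable], [variable, variable]])
```

With slicing, each of the four calls reads only its own quarter, so the computed result matches the declared shape.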

Fix the spelling in the docstring example using netCDF4. Co-authored-by: David Hoese <[email protected]>

Add a workaround to prevent an unexpected type promotion in the unit test for dask distributed friendly dask arrays.

When getting a dask-distributed friendly dask array from a NetCDF file using the CachingFileManager, use the information provided in block_info on the array location in case we are not reading the entire variable.

Rename get_distributed_friendly_dask_array to get_serialisable_dask_array and remove the group argument, moving the responsibility for handling groups to the caller.

Pytroll uses US spelling. Rename serialisable to serializable. Remove the dropped keyword argument from the call.

Ensure that the meta we pass to map_blocks also has the right dtype. Not sure if this is necessary when map_blocks already has the right dtype, but it can't hurt.

Fixing three merge conflicts.

coveralls · 2024-07-25T11:39:32Z

Pull Request Test Coverage Report for Build 10528447135

Details

105 of 105 (100.0%) changed or added relevant lines in 4 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.004%) to 96.155%

Totals
Change from base Build 10528275069:	0.004%
Covered Lines:	52470
Relevant Lines:	54568

💛 - Coveralls

mraspaud

LGTM, just one comment inline. How is this affecting the performance of the reader? Do you consider this ready to be merged?

gerritholl · 2024-07-26T09:12:23Z

> How is this affecting the performance of the reader?

I had not tested that yet. I have now. Sadly, it gets worse :(

I used a simple test script that reads FCI data, loads some channels, resamples them, and writes them out again, without specifying a dask scheduler.

With Satpy main:

time 0:45.93, RAM 9.27 GB

With this PR:

time 0:55.0, RAM 10.0 GB

Additionally, upon exiting, there is the repeated error message:

 Original exception was:
Error in sys.excepthook:

Original exception was:
Error in sys.excepthook:

Original exception was:
Error in sys.excepthook:

> Do you consider this ready to be merged?

Sadly no, considering the problems described above. I will dig into this.

gerritholl · 2024-07-26T09:27:32Z

With Satpy main, three runs, times in seconds:

Scene creation: 13.9, 11.9, 11.5

Loading: 1.5, 1.3, 1.0

Computing: 6.5, 6.9, 6.5

With this PR:

Scene creation: 13.2, 12.0, 12.0

Loading: 3.5, 3.5, 4.4

Computing: 5.5, 5.8, 5.6

So it's in particular the loading that gets slower.

gerritholl · 2024-07-26T09:54:44Z

Profiling reveals that there are 160 calls to acquire_context, but only 40 to _NCDatasetWrapper.__init__, so the caching appears to be doing its job. However, the load call does not reuse the file that was opened during Scene creation, which is why performance suffers. I will see what I can improve.

When caching, make sure we use the CachingFileManager already upon scene creation and not only by the time we are loading.

gerritholl · 2024-07-26T10:49:07Z

With c2b1533 loading is much faster, although with more variability than for satpy main. Scene creation is a little slower.

Scene creation: 13.4, 14.5, 12.8, 12.6, 12.8, 13.0

Loading: 1.9, 1.0, 1.6, 1.4, 1.3, 1.0

gerritholl · 2024-07-26T15:33:30Z

I can't reliably reproduce the performance differences. Running it again with the main branch gives:

Scene creation, main branch: 14.5, 14.3, 13.2

Scene creation, this PR: 13.6, 12.7, 13.8

And with cProfile, it's always faster with my PR.

Considering those uncertainties, I conclude that performance is the same within the measurement uncertainty.
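One rough way to check this conclusion is to compare the mean difference with the run-to-run spread. The sketch below uses the Scene creation timings quoted above and the standard library's `statistics` module. With only three runs per branch the sample standard deviation is a coarse guide, and this is not necessarily how the author judged the result.

```python
from statistics import mean, stdev

# Scene creation timings in seconds, as reported in this comment.
scene_creation = {
    "main": [14.5, 14.3, 13.2],
    "this PR": [13.6, 12.7, 13.8],
}

# Mean and sample standard deviation per branch.
summary = {
    label: (mean(times), stdev(times))
    for label, times in scene_creation.items()
}
difference = summary["main"][0] - summary["this PR"][0]

for label, (average, spread) in summary.items():
    print(f"{label}: mean {average:.2f} s, sample std {spread:.2f} s")
print(f"difference of means: {difference:.2f} s")
```

The difference between the means is smaller than the spread within the main-branch runs, which is consistent with treating the two as equal.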

Don't subclass netCDF4.Dataset, rather just return an instance from a helper function. Seems good enough and gets rid of the weird error messages upon exit.

gerritholl · 2024-07-26T16:05:54Z

Fixed the problem with the strange exception/error messages upon exit in 9fce5a7.

Some readers read entire groups; this needs xarray kwargs to be set even if caching is used.

mraspaud · 2024-07-29T08:09:27Z

I'm happy with this. @djhoese can you just confirm that you are good with this being merged? (and feel free to merge it if that's the case)

djhoese · 2024-07-29T15:05:29Z

satpy/readers/netcdf_utils.py

+        if self.manager is None:
+            return None
+        return self.manager.acquire()
+
    @staticmethod
    def _set_file_handle_auto_maskandscale(file_handle, auto_maskandscale):
        if hasattr(file_handle, "set_auto_maskandscale"):


Not that this has to be handled in your PR, but if I remember correctly this Dataset-level set_auto_maskandscale was added to netcdf4-python quite a while ago. It seems error prone and confusing to silently call the method only if it exists and to not log/inform the user that it wasn't used when it was expected. Maybe we should remove this method on the file handler class and always call file_handle.set_auto_maskandscale no matter what. Your wrapper does it already.
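One possible shape for the "inform the user" variant of this suggestion is sketched below. `set_auto_maskandscale_or_warn` and the two dummy handle classes are hypothetical; the real fix proposed above is simply to always call `file_handle.set_auto_maskandscale`.

```python
import logging

logger = logging.getLogger(__name__)


def set_auto_maskandscale_or_warn(file_handle, auto_maskandscale):
    """Set the flag if supported; otherwise log a warning instead of skipping silently."""
    setter = getattr(file_handle, "set_auto_maskandscale", None)
    if setter is None:
        logger.warning(
            "%s has no set_auto_maskandscale; requested value %r was not applied",
            type(file_handle).__name__,
            auto_maskandscale,
        )
        return False
    setter(auto_maskandscale)
    return True


class HandleWithSetter:
    """Dummy handle that records the requested flag."""

    def __init__(self):
        self.auto_maskandscale = None

    def set_auto_maskandscale(self, flag):
        self.auto_maskandscale = flag


class HandleWithoutSetter:
    """Dummy handle lacking the method."""


supported = HandleWithSetter()
applied = set_auto_maskandscale_or_warn(supported, False)
skipped = set_auto_maskandscale_or_warn(HandleWithoutSetter(), False)
```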

djhoese · 2024-07-29T15:13:31Z

satpy/tests/reader_tests/test_utils.py

+    @pytest.fixture(scope="class")
+    def dask_dist_client(self):
+        """Set up and close a dask distributed client."""
+        from dask.distributed import Client
+        cl = Client()
+        yield cl
+        cl.close()


It looks like the dask developers recommend using their test utilities for writing distributed-based tests:

https://distributed.dask.org/en/latest/develop.html#writing-tests

Would it be possible to use their tools instead of this?

Looks like you might be able to import this fixture and use it:

https://github.com/dask/distributed/blob/386e5fea1cde4aefaf821e319405188266b41832/distributed/utils_test.py#L521-L525

gerritholl added 5 commits June 14, 2024 10:02

Add test to reproduce GH 2815

7f6a8d4

make sure distributed client is local

6d31c20

Start utility function for distributed friendly

1e26d1a

Start work on a utility function to get a dask array from a dataset variable in a way that is friendly to dask.distributed.

Parameterise test and simplify implementation

be40c5b

For the distributed-friendly dask array helper, parameterise the test to cover more cases. Simplify the implementation.

Force shape and dtype. First working prototype.

cbd00f0

We need to force the shape and the dtype when getting the dask-distributed-friendly xarray-dataarray. Seems to have a first working prototype now.

gerritholl marked this pull request as ready for review June 14, 2024 12:46

gerritholl requested review from djhoese and mraspaud as code owners June 14, 2024 12:46

gerritholl marked this pull request as draft June 14, 2024 12:50

gerritholl added 3 commits June 20, 2024 09:24

Add group support and speed up tests

af4ee66

Add group support for getting a dask distributed friendly dask array. Speed up the related tests by sharing the dask distributed client setup and breakdown.

Respect auto_maskandscale with new caching

fc58ca4

With the new dask-distributed-friendly caching, make sure we are respecting auto_maskandscale and are not applying scale factors twice.

gerritholl mentioned this pull request Jun 20, 2024

Preload (FCI) filehandlers for eager processing #2686

Open

7 tasks

Remove needless except block

09c821a

Remove a dead code except block that should never be reached.

gerritholl marked this pull request as ready for review June 20, 2024 10:25

gerritholl added 2 commits June 20, 2024 14:19

Test refactoring

4f9c5ed

Migrate TestNetCDF4FileHandler from unittest.TestCase to a regular class. Use a pytest fixture for the temporary NetCDF file.

Broaden test match string for test_filenotfound

ec76fa6

Broaden the string that is matched against in TestNetCDF4FileHandler.test_filenotfound. On Linux and MacOS the expected failure gives "No such file or directory". On Windows it gives "Invalid file format".

djhoese added the enhancement (code enhancements, features, improvements) and component:readers labels Jun 26, 2024

djhoese requested changes Jun 26, 2024

gerritholl and others added 8 commits July 24, 2024 11:37

fix docstring example spelling

06d8811

Fix the spelling in the docstring example using netCDF4. Co-authored-by: David Hoese <[email protected]>

Prevent unexpected type promotion in unit test

aaf91b9

Add a workaround to prevent an unexpected type promotion in the unit test for dask distributed friendly dask arrays.

Use block info getting a dd-friendly da

a2ad42f

When getting a dask-distributed friendly dask array from a NetCDF file using the CachingFileManager, use the information provided in block_info on the array location in case we are not reading the entire variable.

Rename to serialisable and remove group argument

9126bbe

Rename get_distributed_friendly_dask_array to get_serialisable_dask_array and remove the group argument, moving the responsibility for handling groups to the caller.

Use wrapper class for auto_maskandscale

5e576f9

GB -> US spelling

63e7507

Pytroll uses US spelling. Rename serialisable to serializable. Remove the dropped keyword argument from the call.

Ensure meta dtype

ea04595

Ensure that the meta we pass to map_blocks also has the right dtype. Not sure if this is necessary when map_blocks already has the right dtype, but it can't hurt.

Merge branch 'main' into bugfix-2815

523671a

Fixing three merge conflicts.

Fix spelling in test

fde3896

mraspaud approved these changes Jul 26, 2024

Clarify docstring

5b137e8

gerritholl marked this pull request as draft July 26, 2024 09:55

Use cache already in scene creation

c2b1533

When caching, make sure we use the CachingFileManager already upon scene creation and not only by the time we are loading.

gerritholl marked this pull request as ready for review July 26, 2024 15:34

Use helper function rather than subclass

9fce5a7

Don't subclass netCDF4.Dataset, rather just return an instance from a helper function. Seems good enough and gets rid of the weird error messages upon exit.

restore non-cached group retrieval

4993b65

Some readers read entire groups; this needs xarray kwargs to be set even if caching is used.

djhoese reviewed Jul 29, 2024

djhoese requested changes Jul 29, 2024

Merge branch 'main' into bugfix-2815

7c173e7
