
Add fsspec functionality to viirs_sdr reader #2534

Merged (9 commits) on Jan 10, 2024

Conversation

@martin-rdz (Contributor)

Add the option to read AWS S3 files in the viirs_sdr reader, similar to the abi_l1b reader, as described in the documentation.
h5py is also fine with the return value of satpy.readers.open_file_or_filename.
(Thanks to @ajelenak for the inspiration in his Jupyter notebook.)
Though I am not sure where to properly document the added functionality.
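
For illustration, a minimal sketch of how the reader could then be used with files on S3, following the same pattern documented for abi_l1b (the bucket and filename below are hypothetical; anonymous access is assumed):

import fsspec
from satpy import Scene
from satpy.readers import FSFile

# Hypothetical S3 location of a VIIRS SDR granule (bucket and file name made up).
open_files = fsspec.open_files("s3://some-bucket/SVM05_npp_d20230101_t0000000_*.h5", anon=True)
fs_files = [FSFile(f) for f in open_files]

scn = Scene(filenames=fs_files, reader="viirs_sdr")
scn.load(["M05"])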

Cheers,
martin

@djhoese (Member) left a review comment

Looks good to me, but I haven't been a big part of these types of changes in Satpy. I'm hoping @pnuu or @mraspaud can take a look at this before it gets merged. We may want to have some tests added for this, but I'm not sure if we required it for the changes to the NetCDF helper file handler.

Thanks so much for starting the work on this though. This will be great to have.

@djhoese added the enhancement and component:readers labels on Jul 24, 2023
@djhoese (Member) commented on Jul 24, 2023

Hm, looks like these changes are breaking the test_epic_l1b_h5.py tests:

FAILED satpy/tests/reader_tests/test_epic_l1b_h5.py::TestEPICL1bReader::test_times - UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
FAILED satpy/tests/reader_tests/test_epic_l1b_h5.py::TestEPICL1bReader::test_counts_calibration - UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
FAILED satpy/tests/reader_tests/test_epic_l1b_h5.py::TestEPICL1bReader::test_refl_calibration - UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
FAILED satpy/tests/reader_tests/test_epic_l1b_h5.py::TestEPICL1bReader::test_bad_calibration - UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
FAILED satpy/tests/reader_tests/test_epic_l1b_h5.py::TestEPICL1bReader::test_load_ancillary - UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

I don't see any mocking in the tests so this looks like it is purely these changes that are breaking things. I could be wrong too...

@ajelenak commented on Sep 4, 2023

Hello!

Since I was mentioned here related to access to HDF5 files in the cloud, here's the link to my recent presentation about cloud-optimized HDF5 files (applies to netCDF-4 files, too). I hope you find it useful.

@mraspaud (Member) commented on Sep 7, 2023

@martin-rdz did you check why your changes broke the epic tests?

@martin-rdz (Contributor, Author)

Not yet, sorry, I was occupied with other projects.
But for sure I can have a look.

@mraspaud (Member) commented on Sep 8, 2023

That would be great!

@martin-rdz (Contributor, Author)

I tried to dig into the issue. From my understanding, satpy/tests/reader_tests/test_epic_l1b_h5.py passed the filename (....pytest-10/test_times0/epic_1b_20150613120251_03.h5) as a pathlib.PosixPath.
pathlib.Path possesses an .open method, so open_file_or_filename succeeded in opening the file as an io.TextIOWrapper, which caused trouble downstream.

I reckon this issue would also occur if a pathlib.Path is provided by the user, so I decided to fix it within open_file_or_filename with an explicit if statement.

In my test environment, test_epic_l1b_h5.py and test_abi_l1b.py succeeded, but I could not run all the reader tests as I am lacking some libraries.

If that did not fix the issue, I can take an even deeper look.
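
For reference, a sketch of what such an explicit check in open_file_or_filename could look like (an illustration of the approach described above, not necessarily the exact code in this PR):

import pathlib

def open_file_or_filename(unknown_file_thing):
    """Return an open file object if possible, otherwise the path/filename itself."""
    if isinstance(unknown_file_thing, pathlib.Path):
        # Hand the Path through unchanged so h5py/xarray/netCDF4 open it
        # themselves instead of receiving a text-mode TextIOWrapper.
        return unknown_file_thing
    try:
        # fsspec OpenFile / FSFile objects expose .open()
        return unknown_file_thing.open()
    except AttributeError:
        # plain string filename
        return unknown_file_thing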

@codecov (bot) commented on Sep 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (442325d) 95.39% compared to head (529de8f) 95.40%.
Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2534      +/-   ##
==========================================
+ Coverage   95.39%   95.40%   +0.01%     
==========================================
  Files         371      371              
  Lines       52690    52796     +106     
==========================================
+ Hits        50263    50370     +107     
+ Misses       2427     2426       -1     
Flag            Coverage             Δ
behaviourtests  4.15% <0.71%>        (-0.01%) ⬇️
unittests       96.01% <100.00%>     (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@djhoese (Member) commented on Sep 12, 2023

pre-commit.ci autofix

Review comment on satpy/readers/__init__.py (outdated, resolved)
@coveralls commented on Sep 12, 2023

Pull Request Test Coverage Report for Build 7469856199

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 95.965%

Totals Coverage Status
Change from base Build 7379871277: 0.01%
Covered Lines: 50496
Relevant Lines: 52619

💛 - Coveralls

@djhoese (Member) left a review comment

This looks reasonable to me. I'm a little concerned that the new if statement will break something, but as far as I can tell it should only fix things that we didn't know were wrong. I guess the last question for me is how hard would it be to add a test for this change? I'm not sure we need to test data actually coming from S3, but something local wrapped in an fsspec or FSFile object would be good. We already have helpers used by the Scene to do this conversion (s3:// URL to FSFile):

https://github.com/martin-rdz/satpy/blob/e9e5fcbf1fcaede5e97d4407086a12376fd64944/satpy/utils.py#L660-L675

But this could be done with a local file too.

A test to make sure that a Path object isn't opened when passed to the HDF5 file handler would be good too.

@mraspaud @pnuu any other requests?

@martin-rdz (Contributor, Author)

For sure I can have a look at the tests as well, though I am not very experienced with how to inject the dummy files into the test environment.
It might take a while until I get those running.

@djhoese (Member) commented on Oct 4, 2023

It will be difficult to test with VIIRS SDR files because they are so complex. It may be better (and I'm OK with this) to create a fake file handler based on the HDF5 handler and, in your test setup (you could create a pytest fixture as well), create a very basic HDF5 file to give to it. I don't remember, but you might be able to get away with creating the HDF5 handler itself and not a subclass; there may be some abstract base class stuff that will be mad that some methods aren't implemented. You should then be able to make Mock versions of some filename inputs (a Path-like object, an FSFile, etc.) and assert in your test that things are called in a certain order. For example, make sure .open() is never called on the Path-like object, or at least not until it is done by the HDF5/h5py library. Passing an FSFile for a local file (your test file) should also work...I think.

Let me know if I can provide more guidance. I won't have much time to help actually write the tests, but I'll try to help where I can.
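
That said, a very rough sketch of the kind of test described above (the fixture name, dataset name, and assertions are illustrative, not the tests that ended up being merged):

import h5py
import pytest

from satpy.readers import FSFile
from satpy.readers.hdf5_utils import HDF5FileHandler


@pytest.fixture
def dummy_h5_file(tmp_path):
    """Create a very basic HDF5 file for the handler to open."""
    fname = tmp_path / "dummy.h5"
    with h5py.File(fname, "w") as h5f:
        h5f["band_data"] = [1, 2, 3]
    return fname


def test_hdf5_handler_accepts_path(dummy_h5_file):
    """A pathlib.Path must not be pre-opened in text mode before h5py sees it."""
    # tmp_path gives a pathlib.Path; without the fix, open_file_or_filename
    # would call .open() on it and h5py would choke on the TextIOWrapper.
    handler = HDF5FileHandler(dummy_h5_file, {}, {})
    assert handler["band_data"] is not None


def test_hdf5_handler_accepts_fsfile(dummy_h5_file):
    """An FSFile wrapping a local fsspec file should also be readable."""
    import fsspec

    fs_file = FSFile(fsspec.open(str(dummy_h5_file)))
    handler = HDF5FileHandler(fs_file, {}, {})
    assert handler["band_data"] is not None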

@djhoese (Member) commented on Oct 29, 2023

I'm running into the main issue you ran into here with passing a Path object and having Satpy try to give xarray.open_dataset a TextIOWrapper rather than the original filename. We should see what we can do to get this PR updated and/or merged.

@martin-rdz (Contributor, Author)

Apologies for the long delay, but I was occupied with other issues.
I am still trying to figure out what we should actually test for and would appreciate some guidance.
So open_file_or_filename should never return something that cannot be handled by xarray.open_dataset, especially not an io.TextIOWrapper? Is this generally true for all readers?
That test would then ideally be located in satpy/tests/test_readers.py?

@mraspaud (Member) commented on Dec 4, 2023

@martin-rdz Thanks for coming back to this.
In my opinion, the test to add would be to just pass an FSFile instance with, e.g., a local fsspec file in it and make sure that file is actually opened. As @djhoese mentioned earlier, it's probably easier with just a dummy file. The rest of the file handler is already tested, so that would be enough I think.

As for the location, I would put it in the test_viirs_sdr.py file.

@djhoese (Member) commented on Dec 4, 2023

> So open_file_or_filename should never return something that cannot be handled by xarray.open_dataset, especially not an io.TextIOWrapper? Is this generally true for all readers?

@mraspaud Is this true?

@djhoese (Member) commented on Jan 4, 2024

@martin-rdz we may have to check this specifically, but it may be that the h5netcdf engine of xarray (xr.open_dataset(..., engine="h5netcdf")) can read TextIOWrapper objects or file-like objects. I believe it reads them as a series of bytes. The other engines (e.g. netcdf4) can't handle open file-like objects like this, last time I checked.

@djhoese (Member) commented on Jan 4, 2024

OK, what are our cases? Readers are given:

  1. fsspec OpenFile objects. These are not pathlib.Path subclasses and they have an .open() method which returns a LocalFileOpener (for local files of course). These Opener objects can be passed to xr.open_dataset as long as h5netcdf is available. Xarray will switch the default engine to h5netcdf if it detects a file-like object (see below).
  2. Path objects. These are obviously a Path subclass and have an .open() method which returns a TextIOWrapper. The Path itself is passable to the h5netcdf and netcdf4 engines for xr.open_dataset. The open path gives the expected error when passed to the netcdf4 engine. It also gives an error when passed to h5netcdf for some reason. I'm not sure what's going on here. ...oh, I need to use .open(mode="rb"), but that syntax isn't supported by fsspec.
  3. String local filenames. Not Path subclasses and no .open() methods. They work.

I tested the same with h5py.File and all was about as expected. Path works, fsspec Opener works, Path.open(mode="rb") also worked.

Testing

In [1]: import fsspec

In [2]: import xarray as xr

In [3]: fn = "/data/satellite/abi/20181127/OR_ABI-L1b-RadC-M3C01_G16_s20183291257184_e20183291259557_c20183291300002.nc"

In [4]: fs_file = fsspec.open(fn)

In [5]: fs_file
Out[5]: <OpenFile '/data/satellite/abi/20181127/OR_ABI-L1b-RadC-M3C01_G16_s20183291257184_e20183291259557_c20183291300002.nc'>

In [6]: import pathlib

In [7]: isinstance(fs_file, pathlib.Path)
Out[7]: False

In [9]: open_fs_file = fs_file.open()

In [10]: open_fs_file
Out[10]: <fsspec.implementations.local.LocalFileOpener at 0x7bcf34137910>

In [11]: open_ds = xr.open_dataset(open_fs_file)

In [12]: open_ds
Out[12]:
<xarray.Dataset>
Dimensions:                                 (y: 3000, x: 5000,
...

In [13]: open_ds = xr.open_dataset(open_fs_file, engine="netcdf4")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...

File ~/miniconda3/envs/satpy_py312/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:373, in NetCDF4DataStore.open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    370     filename = os.fspath(filename)
    372 if not isinstance(filename, str):
--> 373     raise ValueError(
    374         "can only read bytes or file-like objects "
    375         "with engine='scipy' or 'h5netcdf'"
    376     )
    378 if format is None:
    379     format = "NETCDF4"

ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf'


In [14]: open_ds = xr.open_dataset(open_fs_file, engine="h5netcdf")

In [15]: p = pathlib.Path(fn)

In [16]: path_ds = xr.open_dataset(p, engine="netcdf4")

In [17]: path_ds.dims
Out[17]: FrozenMappingWarningOnValuesAccess({'y': 3000, 'x': 5000, 'number_of_time_bounds': 2, 'band': 1, 'number_of_image_bounds': 2, 'num_star_looks': 24})

In [27]: open_path = p.open()

In [28]: path_ds = xr.open_dataset(open_path, engine="h5netcdf")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
...

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

In [29]: path_ds = xr.open_dataset(open_path, engine="h5netcdf")
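
For completeness, a sketch of the equivalent h5py.File checks mentioned above (illustrative, not a transcript of the actual session):

import pathlib
import fsspec
import h5py

fn = "/data/satellite/abi/20181127/OR_ABI-L1b-RadC-M3C01_G16_s20183291257184_e20183291259557_c20183291300002.nc"
p = pathlib.Path(fn)

h5py.File(p, "r")                        # Path works
h5py.File(fsspec.open(fn).open(), "r")   # fsspec LocalFileOpener works
h5py.File(p.open(mode="rb"), "r")        # binary file-like object works
# h5py.File(p.open(), "r") would fail: text-mode TextIOWrapper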

Other comments

> So open_file_or_filename should never return something that cannot be handled by xarray.open_dataset, especially not an io.TextIOWrapper? Is this generally true for all readers?

@martin-rdz I think this is generally true. Most readers are rather naive and don't assume or implement any "stream" functionality like reading in byte arrays/strings.

@djhoese (Member) commented on Jan 4, 2024

Sorry for all the comments. I see @mraspaud suggested adding tests to the reader for this logic, but I disagree with this. If the reader(s) were ever extracted out of Satpy (or just this VIIRS SDR reader) then we'd lose out on all of this important testing. I'll see if I can add some testing for this.

@djhoese (Member) commented on Jan 4, 2024

Ok, I added some tests and refactored others. Turns out I didn't need to refactor the FSFile tests to write mine, but...well...that's done now.

All of the Path cases of my tests failed without the changes from this PR. Otherwise, the only thing that surprised me was that I originally wrote the FSFile-based opening tests with FSFile(local_filename) and they failed for the same reason I described in my big comment above: the .open call opens them in text mode instead of as bytes, so it fails to decode the binary data as UTF-8. An expected failure, but also something that should probably work. @mraspaud thoughts?
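
For context, a rough illustration of the behaviour described above (this assumes FSFile falls back to the builtin open(), which defaults to text mode, when it is given a plain string path and no filesystem):

import fsspec
from satpy.readers import FSFile

local_filename = "/path/to/test_file.h5"  # hypothetical local test file

# Surprising failure: a plain string path yields a text-mode file object,
# so the binary HDF5 data cannot be decoded as UTF-8 downstream.
text_mode_file = FSFile(local_filename).open()

# Works: wrapping an fsspec OpenFile yields a binary-mode file object.
binary_file = FSFile(fsspec.open(local_filename)).open()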

@mraspaud (Member) left a review comment

LGTM, thanks for the fix and tests @martin-rdz @djhoese!
I would just skip the file deletions when pytest's tmp_path or tmp_path_factory are used.

Review comments on satpy/tests/test_readers.py (outdated, resolved)
@djhoese (Member) commented on Jan 10, 2024

Assuming tests pass, does anyone else have more comments? Otherwise I can merge this tomorrow (or someone else can).

@mraspaud merged commit bca22b4 into pytroll:main on Jan 10, 2024
17 of 19 checks passed
Labels: component:readers, enhancement
5 participants