Opening virtual datasets (dmr-adapter) #606

Draft: ayushnag wants to merge 5 commits into base: main

@ayushnag ayushnag commented Jun 18, 2024

  • Closes Opening virtual datasets with NASA dmrpp files #605
  • Fix scale_factor, add_offset bug
  • Find precise permissions URL (not DAAC general)
  • Add docs
  • Unit tests. Update current test to test specific portions of the virtual dataset and also test the numpy/dask loaded dataset
  • Check indirect access support
  • Update earthaccess documentation
  • Use updated virtualizarr version (with numpy 2.0 manifest)

📚 Documentation preview 📚: https://earthaccess--606.org.readthedocs.build/en/606/

Exception
If the DMR++ file is not found or if there is an error parsing the DMR++
"""
from virtualizarr.readers.dmrpp import DMRParser


It's interesting that you're not actually using the filetype='dmr++' option to virtualizarr.open_virtual_dataset here. It seems to me that one alternative option would be for everything in zarr-developers/VirtualiZarr#113 to also live in this library, as it already pretty much entirely uses public virtualizarr API... But I guess that depends on whether you think the dmr++ option to virtualizarr.open_virtual_dataset is likely to be useful outside the context of the earthaccess library.

Author

The only reason is that the parser takes an additional kwarg, data_filepath, which is required in cases where the dmr path cannot be derived from the data path simply by appending .dmrpp. If there is a way to pass engine-specific args to virtualizarr.open_dataset, then I can switch to that.


The only reason is because the parser has an additional kwarg data_filepath which is required in cases where the dmr path cannot be simply derived by just adding .dmrpp.

I'm not sure I understand - why is the main filepath you pass to virtualizarr.open_virtual_dataset not sufficient?

Author

For this case: virtualizarr.open_dataset(filepath="s3://air.dmrpp", data_filepath="s3://datafiles/air.nc", engine="dmr++"), where the dmr path is independent of the data path. The chunk manifest needs to store data_filepath instead of the dmr filepath.
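To make the two cases concrete, here is a minimal sketch of the path resolution being discussed. The helper name `resolve_data_path` is hypothetical and is not part of virtualizarr or earthaccess; it only assumes the convention mentioned above, that a DMR++ file usually sits next to its data file with `.dmrpp` appended.

```python
from typing import Optional


def resolve_data_path(dmrpp_path: str, data_filepath: Optional[str] = None) -> str:
    """Return the path that chunk references should point at.

    Hypothetical helper for illustration only.
    """
    if data_filepath is not None:
        # The caller knows where the data lives, so trust that path.
        return data_filepath
    # Fall back to the naming convention: data path + ".dmrpp" == dmr path.
    return dmrpp_path.removesuffix(".dmrpp")


# Convention holds: the data path can be derived from the dmr path.
derived = resolve_data_path("s3://bucket/air.nc.dmrpp")

# Convention does not hold: the dmr and data files live in unrelated locations,
# so the explicit data path has to be supplied.
explicit = resolve_data_path("s3://air.dmrpp", data_filepath="s3://datafiles/air.nc")
```

Without the explicit argument, the second call would fall back to `s3://air`, which is not where the data lives. That is the situation the extra kwarg is meant to cover.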


Huh - does the dmr++ data not contain the path to the original data??

Author

No, the dmr file contains only the file name, not the full path. This is an example of the name that a dmr file contains: name="20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
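As a rough illustration, the snippet below parses a heavily trimmed, made-up DMR++ root element with the standard library and reads its `name` attribute. The XML is a stand-in written for this example (the namespace URI is an assumption and real DMR++ files carry many more attributes and child elements); only the file name is taken from the comment above.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative stand-in for the root element of a DMR++ document.
DMRPP_SNIPPET = (
    '<Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#" '
    'name="20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"/>'
)

root = ET.fromstring(DMRPP_SNIPPET)

# The attribute itself is not namespaced, so a plain lookup works.
dataset_name = root.attrib["name"]

# A bare file name has no scheme and no directory separators, so it cannot
# be used on its own to locate the data file.
has_full_path = "://" in dataset_name or "/" in dataset_name
```

Here `has_full_path` is false, which is why the location of the data file has to come from somewhere else, such as the earthaccess search results.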

Author

@ayushnag ayushnag Jun 27, 2024

I noticed that path renaming was added to virtualizarr, which can solve this issue. I will just call vz.open_dataset and then rename the data paths using the earthaccess results. That lets me switch to the public virtualizarr API and remove the _parse_dmr function.
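The idea of renaming paths after opening can be sketched on a plain dictionary. This is not the virtualizarr API: the function `rename_manifest_paths` is hypothetical, and the entry layout (chunk key mapped to `path`, `offset`, `length`) is an assumption used only to show the transformation.

```python
from typing import Dict, Union

ChunkEntry = Dict[str, Union[str, int]]


def rename_manifest_paths(
    manifest: Dict[str, ChunkEntry], new_path: str
) -> Dict[str, ChunkEntry]:
    """Return a copy of the manifest with every chunk pointing at new_path.

    Byte offsets and lengths are left untouched; only the file location changes.
    """
    return {key: {**entry, "path": new_path} for key, entry in manifest.items()}


# Paths as they might look right after parsing: only the bare file name is known.
parsed = {
    "0.0": {"path": "air.nc", "offset": 100, "length": 50},
    "0.1": {"path": "air.nc", "offset": 150, "length": 50},
}

# Full data location, for example taken from a search result.
renamed = rename_manifest_paths(parsed, "s3://datafiles/air.nc")
```

Because the function builds new dictionaries, the parsed manifest stays unchanged and can be renamed again if the data is later reached through a different URL.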


So what does that imply for my original question:

It seems to me that one alternative option would be for everything in zarr-developers/VirtualiZarr#113 to also live in this library, as it already pretty much entirely uses public virtualizarr API...

Author

@betolink How about if I move the parser code into earthaccess? Now that I think about it, it makes more sense for it to live in the NASA-related repository. It would also make unit tests easier, since earthaccess can easily access NASA dmrpp files. Then this PR will add dmrpp.py and virtualizarr.py.

@betolink
Member

This PR looks good to me (we need to fix some minor formatting issues with Ruff). Maybe the only missing thing would be a notebook demonstrating how to use this feature? @ayushnag

@ayushnag
Author

ayushnag commented Jun 20, 2024

https://gist.github.com/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

Better viewing experience: https://nbviewer.org/gist/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

If there's more you want me to add or if any step is unclear I can update the notebook

@betolink betolink self-assigned this Jun 25, 2024
Development

Successfully merging this pull request may close these issues.

Opening virtual datasets with NASA dmrpp files
3 participants