-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow GeoDataset to list files in VSI path(s) #1399
base: main
Are you sure you want to change the base?
Allow GeoDataset to list files in VSI path(s) #1399
Conversation
@microsoft-github-policy-service agree company="vake" |
Hey, sorry it took us so long to review this! We're normally much more responsive to issues/PRs but we just finished a paper deadline. This looks great! We've been wanting better support for cloud (AWS/GCP/Azure) for a long time. The biggest thing holding that up is my own perfectionism (what about VectorDataset or NonGeoDataset?). There isn't an easy way to add support for cloud for every dataset without refactoring the entire library. I was considering doing this at the same time as porting to TorchData, but we're still on the fence about when or if that will actually happen. So in my opinion, starting with Raster/VectorDataset for now and worrying about the others should be fine. In terms of the current implementation, here are my preliminary thoughts:
@calebrob6 will likely want to check this to make sure this works for him on Azure/Planetary Computer. P.S. Thanks for the PR! This would be a fantastic first contribution. |
As Sean mentioned in my Fiona feature request maybe libcloud is a good option. But we could start with the simple
Looks like the behaviour will be the same. Everything that can be opened using fiona/rasterio/gdal will support vsi.
Looks like rasterio does not have this method. Do you think the above mentioned
This (should) work with the cloud providers listen in the gdal docs. fiona.listdir use method VSIReadDir from gdal/ogr internally, which I assume is thoroughly tested. I have tested my implementation on Azure and GS.
Looks like moto can mock virtual file systems. |
We're trying to limit additional dependencies but I'll keep libcloud and moto in mind. I agree that we should start with a simple implementation and worry about making it "perfect" later. Especially since it wouldn't be an API change if we modify the implementation later.
|
We had previously discussed possibly making a This also has the benefit of making the dependencies optional and only imported inside the method. |
We should definitely have a This is somewhat orthogonal to this PR. We should add @adriantre would you be interested in implementing |
I will give it a shot. I'm initially thinking that, analogously to |
The vsi can be passed in two ways. E.g. for azure:
Looks like 2 is preferred, at least somewhat documented, although internally 2. is converted to 1. before it is passed to the reader. For either we would need to use I realised that we cannot use Lastly, another vsi complicating the matter is zip-archives, which I'm starting to feel like Thoughts? |
It's certainly possible, but as you mention, the scope makes it difficult. I think |
I think |
3237d5f
to
6624dee
Compare
I have now refactored this and rebased on #1442, so all changes from that PR is reflected here until that is merged. |
6624dee
to
7ca3cb7
Compare
Is now outdated wrt. to #1442 but still works as proof of concept. Any feedback? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comments:
- Code looks kinda hacky, wondering if there's a simpler way
- Tests look extremely complex, wish we could simplify them greatly
torchgeo/datasets/utils.py
Outdated
# fiona.listdir can throw FionaValueError for only two reasons | ||
# 1. 'is not a directory' | ||
# 2. 'does not exist' | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be more specific about what we raise here. Is this try-except really the only way to do this? Seems dirty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would checking the error message and raising our own error be better?
Here is the error that might occur:
https://github.com/Toblerity/Fiona/blob/7490643d145ce0be506a210abd2802fd0fff63f4/fiona/ogrext.pyx#L2171
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Fiona accepts this, would this better? (Would still need some workaround until then).
if not fiona.dir_exists(dir):
continue
elif fiona.isdir(dir):
subdirs = fiona.listdir(dir)
dirs.extend([f'{dir}/{subdir}' for subdir in subdirs])
else:
files.append(dir)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the feedback from fiona is that "torchgeo use a dedicated package like fsspec or the Azure Python module." But then we don't get the silver bullet for all of them, and it seems like additional libraries for each file system will have to be installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're definitely trying to minimize our dependencies when possible (0.6 actually removes more dependencies than it adds) so I would prefer not to do this. Worst case, users can always use these libraries themselves to get a list of files, then pass these to a GeoDataset (like in 0.5).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what do you think, merge this, or go with my first proposal; make it easier to handle "non-local-files" by adding a method that the users can override?
if os.path.isdir(path):
pathname = os.path.join(path, '**', self.filename_glob)
files |= set(glob.iglob(pathname, recursive=True))
elif (os.path.isfile(path) or path_is_vsi(path)) and fnmatch.fnmatch(
str(path), f'*{self.filename_glob}'
):
files.add(path)
elif not hasattr(self, 'download'):
self.handle_non_local_file()
def handle_non_local_file(self):
warnings.warn(
f"Could not find any relevant files for provided path '{path}'. "
f'Path was ignored.',
UserWarning,
)
torchgeo/datasets/utils.py
Outdated
return files | ||
|
||
|
||
def list_directory_recursive(root: Path, filename_glob: str) -> list[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return a set instead of a list? The files
method is the only place they are used and uses a set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats ok by me. The only reason I did not is because similar methods returns a list (iglob, fnmatch etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could change this method to remove duplicates (set) and sort it, then return a list as well. Or is it better to be explicit about that in the files-property?
Co-authored-by: Adam J. Stewart <[email protected]>
Co-authored-by: Adam J. Stewart <[email protected]>
Co-authored-by: Adam J. Stewart <[email protected]>
Co-authored-by: Adam J. Stewart <[email protected]>
Let's work our way through this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to devote some time to peeking into fiona.listdir
and seeing if there's a way to recursively do this (or add this option to fiona if not) instead of writing our own while-statement, but I'm starting to understand and come to terms with your implementation. Let's focus on simplifying the tests before finalizing the code. If needed, we can also simplify the tests in a separate PR so that this PR only adds a couple new lines to testing (zip files in parametrize).
tests/datasets/test_geo.py
Outdated
@@ -84,6 +84,19 @@ def __len__(self) -> int: | |||
return 2 | |||
|
|||
|
|||
@pytest.fixture(scope='module') | |||
def temp_archive(request: SubRequest) -> Generator[tuple[str, str], None, None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably be using the tmp_path
fixture here instead of creating the archive in a git-tracked location. This will require the default function scope though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a module-scoped fixture instead, but I can opt for class-scoped if it will only be used for TestGeoDataset.
tests/datasets/test_geo.py
Outdated
], | ||
indirect=True, | ||
) | ||
def test_zipped_specific_file_dir(self, temp_archive: tuple[str, str]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simplify the testing drastically, I think I would rather:
- Create all uncompressed (.tif) and compressed (.zip) files in
tests/data/<whatever>
and store them in git so we don't have to generate and delete them on the fly - Have only 2 test functions (expected to pass, expected to fail)
- Parametrize each of these test functions with all supported inputs (.tif, zipped file, zipped dir, etc.)
This will remove a lot of the code duplication and make things easier to add new tests for (just a single line in parametrize). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look! I like the idea.
Might be annoying to keep the archives up-to-date with the source if we ever modify them though. The tests does not seem to be much slower when archiving on the fly, but it would reduce code-complexity for the test
torchgeo/datasets/utils.py
Outdated
@@ -607,18 +611,22 @@ def percentile_normalization( | |||
return img_normalized | |||
|
|||
|
|||
def path_is_vsi(path: Path) -> bool: | |||
"""Checks if the given path is pointing to a Virtual File System. | |||
def path_is_gdal_vsi(path: Path) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still considering deleting this function. It's only used in a single place, and it's only 1 line long, so we could just replace the if-statement on line 693 with this line. I would like to keep the documentation somehow though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats a lot of documentation to have in-line in the other method :P
FYI: Toblerity/Fiona#1275 |
Fix #1398
Fix #1165
This PR remove blockers for reading raster files directly from Cloud Storage (buckets) and other GDAL Virtual File Systems.