
Allow GeoDataset to list files in VSI path(s) #1399

Open
wants to merge 70 commits into base: main

Conversation

adriantre
Contributor

@adriantre adriantre commented Jun 5, 2023

Fix #1398
Fix #1165

This PR removes blockers for reading raster files directly from cloud storage (buckets) and other GDAL Virtual File Systems.

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Jun 5, 2023
@adriantre
Contributor Author

@microsoft-github-policy-service agree company="vake"

@adamjstewart
Collaborator

Hey, sorry it took us so long to review this! We're normally much more responsive to issues/PRs but we just finished a paper deadline.

This looks great! We've been wanting better support for cloud (AWS/GCP/Azure) for a long time. The biggest thing holding that up is my own perfectionism (what about VectorDataset or NonGeoDataset?). There isn't an easy way to add support for cloud for every dataset without refactoring the entire library. I was considering doing this at the same time as porting to TorchData, but we're still on the fence about when or if that will actually happen. So in my opinion, starting with Raster/VectorDataset for now and worrying about the others should be fine.

In terms of the current implementation, here are my preliminary thoughts:

  1. Is it possible to use the same logic for VectorDataset?
  2. Can VSI vs. non-VSI be detected based on whether or not root starts with /vsi? Or would it start with https://? We can check for /vsizip/, /vsitar/, etc. if we need to be more explicit. I can't imagine too many users will happen to have a /vsi* directory on their root filesystem, that's the only issue I can think of.
  3. The listdir stuff seems fragile, especially the error handling. What happens if the URL/path is not actually a real VSI filesystem? Does it return an invalid root directory? Is this only a fiona feature or does rasterio have the same feature?
  4. Does this work for all cloud providers? Some of us may have access to others if you need help checking this.
  5. Testing this will be fun. For the normal tests that run on all PRs, we try to avoid any tests that require internet access. So if we can create a local VSI file to test this that would be ideal. We could potentially add tests that only get run during the release process if we want to check remote cloud providers, but idk how stable those would be.
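For point 2, the prefix check could be as simple as the sketch below (the function name is illustrative, not torchgeo's actual API):

```python
# Illustrative helper: decide whether a path targets a GDAL Virtual
# File System by checking for the '/vsi' prefix discussed above.
def looks_like_vsi(path: str) -> bool:
    return path.startswith('/vsi')

print(looks_like_vsi('/vsizip/archive.zip/raster.tif'))  # True
print(looks_like_vsi('/home/user/raster.tif'))           # False
```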

@calebrob6 will likely want to check this to make sure this works for him on Azure/Planetary Computer.

P.S. Thanks for the PR! This would be a fantastic first contribution.

@adamjstewart adamjstewart added this to the 0.5.0 milestone Jun 9, 2023
@adriantre
Contributor Author

adriantre commented Jun 19, 2023

There isn't an easy way to add support for cloud for every dataset without refactoring the entire library.

As Sean mentioned in my Fiona feature request, maybe libcloud is a good option. But we could start with the simple fiona.listdir and, if users report incompatibilities, put more effort into the "perfect" solution?

  1. Is it possible to use the same logic for VectorDataset?

Looks like the behaviour will be the same: everything that can be opened using fiona/rasterio/GDAL will support VSI.

  2. Can VSI vs. non-VSI be detected based on whether or not root starts with /vsi? Or would it start with https://? We can check for /vsizip/, /vsitar/, etc. if we need to be more explicit. I can't imagine too many users will happen to have a /vsi* directory on their root filesystem, that's the only issue I can think of.

fiona.listdir will work for local filesystems as well; root will start with /vsi. If users creating a local /vsi folder is a concern, TorchGeo datasets already have this problem, since they read files using fiona and rasterio. We can use is_remote or valid_vsi from fiona instead of the vsi kwarg in my suggestion.

  3. The listdir stuff seems fragile, especially the error handling. What happens if the URL/path is not actually a real VSI filesystem? Does it return an invalid root directory? Is this only a fiona feature or does rasterio have the same feature?

Looks like rasterio does not have this method. Do you think the above-mentioned valid_vsi will be sufficient regarding the fragility?

  4. Does this work for all cloud providers? Some of us may have access to others if you need help checking this.

This (should) work with the cloud providers listed in the GDAL docs. fiona.listdir uses the GDAL/OGR method VSIReadDir internally, which I assume is thoroughly tested. I have tested my implementation on Azure and GS.

  5. Testing this will be fun. For the normal tests that run on all PRs, we try to avoid any tests that require internet access.

Looks like moto can mock virtual file systems.

@adamjstewart
Collaborator

We're trying to limit additional dependencies but I'll keep libcloud and moto in mind.

I agree that we should start with a simple implementation and worry about making it "perfect" later. Especially since it wouldn't be an API change if we modify the implementation later.

valid_vsi looks good to me, let's use that and remove the user-defined parameter.

@isaaccorley
Collaborator

isaaccorley commented Jun 19, 2023

We had previously discussed possibly making a from_s3 or from_files method instead so a user could instantiate a dataset like dataset = RasterDataset.from_files(...). This would also reduce code in the constructor. Do we want to take this approach instead?

This also has the benefit of making the dependencies optional and only imported inside the method.
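A rough sketch of the classmethod-constructor shape being proposed (the class, signature, and behaviour here are hypothetical stand-ins, not torchgeo's actual API):

```python
from typing import Iterable


class RasterDatasetSketch:
    """Toy stand-in for RasterDataset to illustrate the from_files idea."""

    def __init__(self, files: list[str]) -> None:
        # In the real dataset, each file would be opened and spatially
        # indexed here; we just store a deduplicated, sorted list.
        self.files = sorted(set(files))

    @classmethod
    def from_files(cls, paths: Iterable[str]) -> 'RasterDatasetSketch':
        # Alternate constructor: accept an explicit list of files (local,
        # VSI, or HTTPS) instead of discovering them under a root directory.
        return cls(list(paths))


ds = RasterDatasetSketch.from_files(['/vsiaz/b/a.tif', '/data/c.tif'])
print(ds.files)  # ['/data/c.tif', '/vsiaz/b/a.tif']
```

Keeping the discovery logic out of the constructor is what makes the heavier cloud dependencies optional, as noted above.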

@adamjstewart
Collaborator

We should definitely have a RasterDataset.from_files classmethod constructor, @calebrob6 has been asking for that for years.

This is somewhat orthogonal to this PR. We should add from_files in a separate PR. Then we can decide whether we want root to support both local and remote file systems or just tell users to use from_files.

@adriantre would you be interested in implementing from_files? It would share most of the code with __init__ (we can move it to a subfunction) but accept a list of files instead of a single directory. I'll open an issue so we can track this, it should be relatively easy. Also let me know what you think about requiring from_files for remote file systems or not.

@adriantre
Contributor Author

I will give it a shot. I'm initially thinking that, analogously to rasterio.open, we shouldn't need different constructors for remote and local files. We may look to e.g. geopandas.read_file to see how they handle imports and drivers for different file types.

@adriantre
Contributor Author

The VSI path can be passed in two ways. E.g. for Azure:

  1. /vsiaz/my/root is OGR VSI
  2. az://my/root is Apache Commons VFS

Looks like 2 is preferred and at least somewhat documented, although internally 2 is converted to 1 before being passed to the reader.

For either we would need to use fiona.listdir (or alternatives) to list the contents.
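The relationship between the two forms can be sketched as a scheme-to-prefix mapping (illustrative only; GDAL/fiona's real conversion handles more schemes and options):

```python
# Map Apache Commons VFS-style URIs (form 2) onto GDAL VSI prefixes
# (form 1). The scheme table is a small, illustrative subset.
SCHEME_TO_VSI = {'az': '/vsiaz', 's3': '/vsis3', 'gs': '/vsigs'}


def to_vsi(path: str) -> str:
    scheme, sep, rest = path.partition('://')
    if sep and scheme in SCHEME_TO_VSI:
        return f'{SCHEME_TO_VSI[scheme]}/{rest}'
    return path  # already a VSI path or a plain local path


print(to_vsi('az://my/root'))    # /vsiaz/my/root
print(to_vsi('/vsiaz/my/root'))  # unchanged
```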

I realised that we cannot use valid_vsi from fiona: it only supports 2 and is not up to date with all supported VSI prefixes.

Lastly, another VSI complicating matters is zip archives, which fiona.listdir supports to a lesser degree: it works only if the root is the actual zip archive, so my implementation listdir_vsi_recursive won't be sufficient if one wants all .tif files across all zip archives within the root.
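The recursive-listing loop can be sketched like this (a hypothetical helper mirroring the listdir_vsi_recursive idea; listdir and isdir are injected so the logic is shown independently of fiona):

```python
import fnmatch
from collections import deque
from typing import Callable


def listdir_recursive(
    root: str,
    pattern: str,
    listdir: Callable[[str], list[str]],
    isdir: Callable[[str], bool],
) -> list[str]:
    # Breadth-first walk: expand directories, collect files matching the glob.
    files, queue = [], deque([root])
    while queue:
        current = queue.popleft()
        for name in listdir(current):
            child = f'{current}/{name}'
            if isdir(child):
                queue.append(child)
            elif fnmatch.fnmatch(name, pattern):
                files.append(child)
    return files
```

With fiona, listdir would be fiona.listdir; the zip-archive caveat above is exactly the case where the injected listdir stops descending.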

I'm starting to feel like RasterDataset.from_files is the most flexible way of solving this. But at the same time it feels like it should be possible to make it work for root-directory discovery as well, but we would have to narrow the scope somehow.

Thoughts?

@adamjstewart
Collaborator

It's certainly possible, but as you mention, the scope makes it difficult. I think from_files should be added regardless. If someone asks for native support in __init__, we can consider adding it, but let's see if from_files is sufficient for most people for now. We'll also want to document it so people know it actually exists. Maybe add a new tutorial. Not required for the initial PR unless you're really gung ho about it.

@calebrob6
Member

I think .from_uris is more what I was thinking about, then the user could pass vsi strings, https links to COGs, local paths, etc.

@adriantre adriantre force-pushed the feature/support_gdal_virtual_file_systems branch from 3237d5f to 6624dee on June 22, 2023
@adriantre
Contributor Author

adriantre commented Jun 22, 2023

I have now refactored this and rebased on #1442, so all changes from that PR are reflected here until it is merged.

@adriantre adriantre force-pushed the feature/support_gdal_virtual_file_systems branch from 6624dee to 7ca3cb7 on June 23, 2023
torchgeo/datasets/geo.py (review thread resolved, outdated)
@adriantre
Contributor Author

This is now outdated with respect to #1442 but still works as a proof of concept. Any feedback?

.gitignore (review thread resolved, outdated)
@adamjstewart adamjstewart left a comment (Collaborator)

General comments:

  1. Code looks kinda hacky, wondering if there's a simpler way
  2. Tests look extremely complex, wish we could simplify them greatly

torchgeo/datasets/geo.py (review thread resolved, outdated)
torchgeo/datasets/utils.py (review thread resolved, outdated)
# fiona.listdir can throw FionaValueError for only two reasons
# 1. 'is not a directory'
# 2. 'does not exist'
raise
Collaborator

We should be more specific about what we raise here. Is this try-except really the only way to do this? Seems dirty.

Contributor Author

Would checking the error message and raising our own error be better?

Here is the error that might occur:
https://github.com/Toblerity/Fiona/blob/7490643d145ce0be506a210abd2802fd0fff63f4/fiona/ogrext.pyx#L2171
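One possible shape for the message-based translation being discussed (FionaValueError here is a stand-in class, not an import of fiona's real exception; the two messages are the ones the linked fiona source can raise):

```python
class FionaValueError(Exception):
    """Stand-in for fiona.errors.FionaValueError in this sketch."""


def safe_listdir(path, listdir):
    # Re-raise fiona's generic error as a more specific builtin,
    # keyed on the two messages fiona.listdir can emit.
    try:
        return listdir(path)
    except FionaValueError as err:
        if 'does not exist' in str(err):
            raise FileNotFoundError(path) from err
        if 'is not a directory' in str(err):
            raise NotADirectoryError(path) from err
        raise
```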

Contributor Author

If Fiona accepts this, would this be better? (We would still need some workaround until then.)

if not fiona.dir_exists(dir):
    continue
elif fiona.isdir(dir):
    subdirs = fiona.listdir(dir)
    dirs.extend([f'{dir}/{subdir}' for subdir in subdirs])
else:
    files.append(dir)

Contributor Author

So the feedback from fiona is that "torchgeo use a dedicated package like fsspec or the Azure Python module." But then we don't get a silver bullet covering all of them, and it seems additional libraries would have to be installed for each file system.

Collaborator

We're definitely trying to minimize our dependencies when possible (0.6 actually removes more dependencies than it adds) so I would prefer not to do this. Worst case, users can always use these libraries themselves to get a list of files, then pass these to a GeoDataset (like in 0.5).

Contributor Author

So what do you think: merge this, or go with my first proposal and make it easier to handle "non-local" files by adding a method that users can override?

if os.path.isdir(path):
    pathname = os.path.join(path, '**', self.filename_glob)
    files |= set(glob.iglob(pathname, recursive=True))
elif (os.path.isfile(path) or path_is_vsi(path)) and fnmatch.fnmatch(
    str(path), f'*{self.filename_glob}'
):
    files.add(path)
elif not hasattr(self, 'download'):
    self.handle_non_local_file(path)

def handle_non_local_file(self, path: Path) -> None:
    warnings.warn(
        f"Could not find any relevant files for provided path '{path}'. "
        'Path was ignored.',
        UserWarning,
    )

torchgeo/datasets/utils.py (review thread resolved)
torchgeo/datasets/utils.py (review thread resolved, outdated)
return files


def list_directory_recursive(root: Path, filename_glob: str) -> list[str]:
Collaborator

Should we return a set instead of a list? The files property is the only place these are used, and it already works with a set.

Contributor Author

That's fine by me. The only reason I didn't is that similar methods return lists (iglob, fnmatch, etc.).

Contributor Author

We could change this method to remove duplicates (via a set) and sort, then return a list. Or is it better to be explicit about that in the files property?
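The set-then-sorted option amounts to a one-liner (a sketch of what list_directory_recursive might return under that choice; the helper name is illustrative):

```python
def dedupe_and_sort(paths: list[str]) -> list[str]:
    # Drop duplicates with a set, then return a deterministic sorted list.
    return sorted(set(paths))


print(dedupe_and_sort(['b.tif', 'a.tif', 'b.tif']))  # ['a.tif', 'b.tif']
```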

torchgeo/datasets/utils.py (review thread resolved, outdated)
@adriantre
Contributor Author

General comments:

  1. Code looks kinda hacky, wondering if there's a simpler way
  2. Tests look extremely complex, wish we could simplify them greatly

Let's work our way through this.

@adamjstewart adamjstewart modified the milestones: 0.6.0, 0.7.0 Aug 29, 2024
@adamjstewart adamjstewart left a comment (Collaborator)

Still need to devote some time to peeking into fiona.listdir and seeing if there's a way to recursively do this (or add this option to fiona if not) instead of writing our own while-statement, but I'm starting to understand and come to terms with your implementation. Let's focus on simplifying the tests before finalizing the code. If needed, we can also simplify the tests in a separate PR so that this PR only adds a couple new lines to testing (zip files in parametrize).

pyproject.toml (review thread resolved)
@@ -84,6 +84,19 @@ def __len__(self) -> int:
return 2


@pytest.fixture(scope='module')
def temp_archive(request: SubRequest) -> Generator[tuple[str, str], None, None]:
Collaborator

We should probably be using the tmp_path fixture here instead of creating the archive in a git-tracked location. This will require the default function scope though.

Contributor Author

I made a module-scoped fixture instead, but I can opt for class-scoped if it will only be used for TestGeoDataset.
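A stdlib-only sketch of what such a fixture would build (tempfile.mkdtemp stands in for pytest's tmp_path; the function name and file contents are illustrative):

```python
import os
import tempfile
import zipfile


def make_temp_archive(files: dict) -> str:
    # Create a zip archive in a fresh temporary directory, outside any
    # git-tracked location, mimicking a tmp_path-based fixture.
    tmpdir = tempfile.mkdtemp()
    archive = os.path.join(tmpdir, 'data.zip')
    with zipfile.ZipFile(archive, 'w') as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return archive


path = make_temp_archive({'raster.tif': b'placeholder bytes'})
```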

torchgeo/datasets/utils.py (review thread resolved, outdated)
tests/datasets/test_geo.py (review thread resolved, outdated)
],
indirect=True,
)
def test_zipped_specific_file_dir(self, temp_archive: tuple[str, str]) -> None:
Collaborator

To simplify the testing drastically, I think I would rather:

  1. Create all uncompressed (.tif) and compressed (.zip) files in tests/data/<whatever> and store them in git so we don't have to generate and delete them on the fly
  2. Have only 2 test functions (expected to pass, expected to fail)
  3. Parametrize each of these test functions with all supported inputs (.tif, zipped file, zipped dir, etc.)

This will remove a lot of the code duplication and make things easier to add new tests for (just a single line in parametrize). What do you think?

Contributor Author

I'll take a look! I like the idea.

Might be annoying to keep the archives up to date with the source if we ever modify them, though. The tests don't seem to be much slower when archiving on the fly, but storing the archives would reduce the tests' code complexity.

@@ -607,18 +611,22 @@ def percentile_normalization(
return img_normalized


def path_is_vsi(path: Path) -> bool:
"""Checks if the given path is pointing to a Virtual File System.
def path_is_gdal_vsi(path: Path) -> bool:
Collaborator

I'm still considering deleting this function. It's only used in a single place, and it's only 1 line long, so we could just replace the if-statement on line 693 with this line. I would like to keep the documentation somehow though.

Contributor Author

That's a lot of documentation to have inline in the other method :P

torchgeo/datasets/utils.py Outdated Show resolved Hide resolved
@adriantre
Contributor Author

Still need to devote some time to peeking into fiona.listdir and seeing if there's a way to recursively do this (or add this option to fiona if not) instead of writing our own while-statement

FYI: Toblerity/Fiona#1275

Labels: datasets (Geospatial or benchmark datasets), dependencies (Packaging and dependencies), testing (Continuous integration testing)

Successfully merging this pull request may close these issues:

  1. Support reading from GDAL virtual file systems (e.g. cloud storage)
  2. Reading S3File using RasterDataset

4 participants