Allow GeoDataset to list files in VSI path(s) #1399

Open

adriantre wants to merge 70 commits into main from feature/support_gdal_virtual_file_systems
Changes from 65 commits

Commits (70)
72702ef
Make RasterDataset accept list of files
Jun 22, 2023
b68dd5a
Fix check if str
adriantre Jun 22, 2023
e90d01c
Use isdir and isfile
adriantre Jun 23, 2023
dfad079
Add kwarg vsi to RasterDataset to support GDAL VSI
adriantre Jun 5, 2023
7291f3e
Fix formatting
adriantre Jun 5, 2023
6b41f18
Add type hints and docstring to method in utils
adriantre Jun 5, 2023
2b2be02
Fix missing import List
adriantre Jun 5, 2023
3f91e97
Fix type hints
adriantre Jun 5, 2023
f0d9475
Refactor with respect to other branch
adriantre Jun 22, 2023
7ca3cb7
Make try-catch more targeted
adriantre Jun 23, 2023
d519061
Remove redundant iglob usage
adriantre Jun 23, 2023
a841fb7
Merge main
adriantre Aug 6, 2024
2ee3c85
Remove unused imports
adriantre Aug 6, 2024
5c6e444
Add wildcard for directories
adriantre Aug 8, 2024
2da9c4a
Allow vsi files to not exist
adriantre Aug 8, 2024
e79061f
Make protected method public
adriantre Aug 8, 2024
00594de
Add zipped dataset to test vsi listdir
adriantre Aug 8, 2024
9ef7669
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 8, 2024
8a61108
Update fiona version in min-reqs.old
adriantre Aug 8, 2024
e8367ca
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 12, 2024
014e5f5
Update docstring of list_directory_recursive
adriantre Aug 12, 2024
0bf5d68
Set fiona version in min-reqs.old
adriantre Aug 12, 2024
41e3dfb
Remove redundant path exists
adriantre Aug 12, 2024
56b1c76
Remove duplicated import
adriantre Aug 12, 2024
e96cfa9
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 12, 2024
d03120d
Create temp archive for test
adriantre Aug 12, 2024
dfc571e
Add docstring to listdir_vsi_recursive
adriantre Aug 12, 2024
aaea729
Change list_directory_recursive return type
adriantre Aug 12, 2024
740dcec
Bump fiona version in pyproject.toml
adriantre Aug 12, 2024
ba014af
Introduce fixture temp_archive for reuse in tests
adriantre Aug 12, 2024
e00b356
Make GeoDataset.files warn if VSI does not exist
adriantre Aug 12, 2024
65ef978
Replace failing test due to new behaviour of vsi
adriantre Aug 12, 2024
5192acd
Fix breaking test due to os.path.join not working on zip parent dir (…
adriantre Aug 12, 2024
9f800f1
Collect test on zip-archive in test class
adriantre Aug 12, 2024
93de375
Update versionadded in dataset/utils.py
adriantre Aug 12, 2024
37c4980
Remove patch from fiona version
adriantre Aug 12, 2024
395b4bf
Add .DS_Store to gitignore
adriantre Aug 13, 2024
6f041ee
Add test for https/curl files
adriantre Aug 13, 2024
dc334d5
Check filname_glob before adding to files property
adriantre Aug 13, 2024
0b80c12
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 13, 2024
dbbcec7
Define should_warn outside if
adriantre Aug 13, 2024
0cc7e15
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 14, 2024
c990591
Merge branch 'refs/heads/main' into feature/support_gdal_virtual_file…
adriantre Aug 26, 2024
ccdbe9b
Try to support windows for vsi tests
adriantre Aug 26, 2024
a2dd0d6
Revert windows-specific path format
adriantre Aug 26, 2024
ecce011
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 26, 2024
08fde98
Skip TestVirtualFilesystems if platform is windows
adriantre Aug 27, 2024
26920ae
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 27, 2024
3baa587
Remove user-specific ignore from gitignore
adriantre Aug 27, 2024
be142aa
Properly remove changes from .gitignore
adriantre Aug 27, 2024
e7659a9
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 27, 2024
c88111e
Rename VSI to VFS where appropriate
adriantre Aug 28, 2024
cffd707
Format docstring of listdir_vfs_recursive
adriantre Aug 28, 2024
04740a2
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
51a3d64
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
7b33550
Don't use os.path.join within VFS
adriantre Aug 28, 2024
b81591d
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
c786a57
String format wildcard instead of os.path.join
adriantre Aug 28, 2024
bb3ad78
Document raises
adriantre Aug 28, 2024
0824026
Dont use os.path.join for zip test
adriantre Aug 28, 2024
6a3d128
Update type of error in try except
adriantre Aug 28, 2024
8ea368f
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 28, 2024
0ad5db2
Simplify tests
adriantre Aug 28, 2024
87ea802
Simplify files property
adriantre Aug 28, 2024
94c3835
Remove unnecessary check in if
adriantre Aug 28, 2024
3ddee3a
Merge branch 'refs/heads/main' into feature/support_gdal_virtual_file…
adriantre Sep 6, 2024
46e2375
Temp archive into tmp_path
adriantre Sep 6, 2024
2d54680
Make utility funcitons non-public
adriantre Sep 6, 2024
d1d8a4a
Fix typo in comment
adriantre Sep 6, 2024
d22a594
Move VFS tests to TestGeoDataset
adriantre Sep 6, 2024
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -39,7 +39,8 @@ dependencies = [
# einops 0.3+ required for einops.repeat
"einops>=0.3",
# fiona 1.8.21+ required for Python 3.10 wheels
"fiona>=1.8.21",
# fiona 1.9+ required for fiona.listdir
"fiona>=1.9",
# kornia 0.7.3+ required for instance segmentation support in AugmentationSequential
"kornia>=0.7.3",
# lightly 1.4.5+ required for LARS optimizer
2 changes: 1 addition & 1 deletion requirements/min-reqs.old
@@ -3,7 +3,7 @@ setuptools==61.0.0

# install
einops==0.3.0
fiona==1.8.21
fiona==1.9.0
kornia==0.7.3
lightly==1.4.5
lightning[pytorch-extra]==2.0.0
128 changes: 109 additions & 19 deletions tests/datasets/test_geo.py
@@ -1,11 +1,11 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import math
import os
import pickle
import shutil
import sys
from collections.abc import Iterable
from collections.abc import Generator, Iterable
from pathlib import Path
from typing import Any

@@ -84,6 +84,19 @@ def __len__(self) -> int:
return 2


@pytest.fixture(scope='module')
def temp_archive(request: SubRequest) -> Generator[tuple[str, str], None, None]:
Collaborator:
We should probably be using the tmp_path fixture here instead of creating the archive in a git-tracked location. This will require the default function scope, though.

Contributor Author:
I made a module-scoped fixture instead, but I can opt for class-scoped if it will only be used for TestGeoDataset.

# Runs before tests
dir_not_zipped = request.param
dir_zipped = f'{dir_not_zipped}.zip'
_ = shutil.make_archive(dir_not_zipped, 'zip', dir_not_zipped)
# make_archive returns absolute path, while input may be relative path.
# we opt to return the (relative) path as provided.
yield dir_not_zipped, dir_zipped
# Runs after tests
os.remove(dir_zipped)
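
Below is a minimal sketch (not part of this diff) of the tmp_path-based, function-scoped variant floated in the review comment above; the fixture name is illustrative:

import shutil
from pathlib import Path

import pytest
from _pytest.fixtures import SubRequest


@pytest.fixture
def temp_archive_in_tmp_path(request: SubRequest, tmp_path: Path) -> tuple[str, str]:
    # Build the archive inside pytest's per-test tmp_path rather than a
    # git-tracked location; pytest cleans tmp_path up, so no teardown is needed.
    dir_not_zipped = request.param
    zip_base = str(tmp_path / Path(dir_not_zipped).name)
    dir_zipped = shutil.make_archive(zip_base, 'zip', dir_not_zipped)
    return dir_not_zipped, dir_zipped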


class TestGeoDataset:
@pytest.fixture(scope='class')
def dataset(self) -> GeoDataset:
@@ -178,29 +191,24 @@ def test_files_property_for_non_existing_file_or_dir(self, tmp_path: Path) -> No
with pytest.warns(UserWarning, match='Path was ignored.'):
assert len(CustomGeoDataset(paths=paths).files) == 0

def test_files_property_for_virtual_files(self) -> None:
# Tests only a subset of schemes and combinations.
paths = [
'file://directory/file.tif',
'zip://archive.zip!folder/file.tif',
'az://azure_bucket/prefix/file.tif',
'/vsiaz/azure_bucket/prefix/file.tif',
'zip+az://azure_bucket/prefix/archive.zip!folder_in_archive/file.tif',
'/vsizip//vsiaz/azure_bucket/prefix/archive.zip/folder_in_archive/file.tif',
]
assert len(CustomGeoDataset(paths=paths).files) == len(paths)

def test_files_property_ordered(self) -> None:
def test_files_property_ordered(self, tmp_path: Path) -> None:
"""Ensure that the list of files is ordered."""
paths = ['file://file3.tif', 'file://file1.tif', 'file://file2.tif']

files = ['file3.tif', 'file1.tif', 'file2.tif']
paths = [tmp_path / fake_file for fake_file in files]
for fake_file in paths:
fake_file.touch()
assert CustomGeoDataset(paths=paths).files == sorted(paths)

def test_files_property_deterministic(self) -> None:
def test_files_property_deterministic(self, tmp_path: Path) -> None:
"""Ensure that the list of files is consistent regardless of their original
order.
"""
paths1 = ['file://file3.tif', 'file://file1.tif', 'file://file2.tif']
paths2 = ['file://file2.tif', 'file://file3.tif', 'file://file1.tif']
files = ['file3.tif', 'file1.tif', 'file2.tif']
paths1 = [tmp_path / fake_file for fake_file in files]
paths2 = paths1[::-1] # reverse order
for fake_file in paths1:
fake_file.touch()
assert (
CustomGeoDataset(paths=paths1).files == CustomGeoDataset(paths=paths2).files
)
@@ -368,6 +376,88 @@ def test_no_all_bands(self) -> None:
CustomSentinelDataset(root, bands=bands, transforms=transforms, cache=cache)


class TestVirtualFilesystems:
@pytest.mark.parametrize(
'temp_archive', [os.path.join('tests', 'data', 'vector')], indirect=True
)
def test_zipped_file(self, temp_archive: tuple[str, str]) -> None:
_, dir_zipped = temp_archive
filename = 'vector_2024.geojson'

specific_file_zipped = f'{dir_zipped}!{filename}'

files_found = CustomGeoDataset(paths=f'zip://{specific_file_zipped}').files
assert len(files_found) == 1
assert str(files_found[0]).endswith(filename)

@pytest.mark.parametrize(
'temp_archive', [os.path.join('tests', 'data', 'vector')], indirect=True
)
def test_zipped_file_non_existing(self, temp_archive: tuple[str, str]) -> None:
_, dir_zipped = temp_archive
with pytest.warns(UserWarning, match='Path was ignored.'):
files = CustomGeoDataset(
paths=f'zip://{dir_zipped}!/non_existing_file.tif'
).files
assert len(files) == 0

@pytest.mark.parametrize(
'temp_archive',
[
os.path.join(
'tests',
'data',
'sentinel2',
'S2A_MSIL2A_20220414T110751_N0400_R108_T26EMU_20220414T165533.SAFE',
)
],
indirect=True,
)
def test_zipped_specific_file_dir(self, temp_archive: tuple[str, str]) -> None:
Collaborator:
To simplify the testing drastically, I think I would rather:

1. Create all uncompressed (.tif) and compressed (.zip) files in tests/data/<whatever> and store them in git, so we don't have to generate and delete them on the fly.
2. Have only 2 test functions (expected to pass, expected to fail).
3. Parametrize each of these test functions with all supported inputs (.tif, zipped file, zipped dir, etc.).

This will remove a lot of the code duplication and make it easier to add new tests (just a single line in parametrize). What do you think?

Contributor Author:
I'll take a look! I like the idea.

Might be annoying to keep the archives up to date with the source if we ever modify them, though. The tests do not seem to be much slower when archiving on the fly, but it would reduce code complexity for the tests.

dir_not_zipped, dir_zipped = temp_archive

filepath_within_dir = (
'GRANULE/L2A_T26EMU_A035569_20220414T110747'
'/IMG_DATA/R60m/T26EMU_20220414T110751_B02_60m.jp2'
)

files_found = CustomGeoDataset(
paths=f'zip://{dir_zipped}!/{filepath_within_dir}'
).files
assert len(files_found) == 1
assert str(files_found[0]).endswith(filepath_within_dir)
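
As a hypothetical illustration of the parametrized consolidation suggested in the review comment above (it assumes the archives are pre-built and stored under tests/data; the test name is made up):

@pytest.mark.parametrize(
    'path',
    [
        # plain file on disk
        os.path.join('tests', 'data', 'vector', 'vector_2024.geojson'),
        # same file inside a pre-built archive
        'zip://' + os.path.join('tests', 'data', 'vector.zip') + '!vector_2024.geojson',
    ],
)
def test_files_property_finds_single_file(path: str) -> None:
    assert len(CustomGeoDataset(paths=path).files) == 1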

@pytest.mark.parametrize(
'temp_archive',
[
os.path.join(
'tests',
'data',
'sentinel2',
'S2A_MSIL2A_20220414T110751_N0400_R108_T26EMU_20220414T165533.SAFE',
)
],
indirect=True,
)
def test_zipped_directory(self, temp_archive: tuple[str, str]) -> None:
dir_not_zipped, dir_zipped = temp_archive
bands = Sentinel2.rgb_bands
transforms = nn.Identity()
cache = False

files_not_zipped = Sentinel2(
paths=dir_not_zipped, bands=bands, transforms=transforms, cache=cache
).files

files_zipped = Sentinel2(
paths=f'zip://{dir_zipped}', bands=bands, transforms=transforms, cache=cache
).files

basenames_not_zipped = [Path(path).stem for path in files_not_zipped]
basenames_zipped = [Path(path).stem for path in files_zipped]
assert basenames_zipped == basenames_not_zipped


class TestVectorDataset:
@pytest.fixture(scope='class')
def dataset(self) -> CustomVectorDataset:
12 changes: 5 additions & 7 deletions torchgeo/datasets/geo.py
@@ -6,7 +6,6 @@
import abc
import fnmatch
import functools
import glob
import os
import pathlib
import re
@@ -40,8 +39,8 @@
array_to_tensor,
concat_samples,
disambiguate_timestamp,
list_directory_recursive,
merge_samples,
path_is_vsi,
)


@@ -308,13 +307,12 @@ def files(self) -> list[Path]:
# Using set to remove any duplicates if directories are overlapping
files: set[Path] = set()
for path in paths:
if os.path.isdir(path):
pathname = os.path.join(path, '**', self.filename_glob)
files |= set(glob.iglob(pathname, recursive=True))
elif (os.path.isfile(path) or path_is_vsi(path)) and fnmatch.fnmatch(
str(path), f'*{self.filename_glob}'
if os.path.isfile(path) and fnmatch.fnmatch(
str(path), os.path.join('*', self.filename_glob)
):
files.add(path)
elif files_found := set(list_directory_recursive(path, self.filename_glob)):
files |= files_found
elif not hasattr(self, 'download'):
warnings.warn(
f"Could not find any relevant files for provided path '{path}'. "
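
For context, a small sketch (paths are illustrative) of the mixed input the updated files property is meant to resolve: plain files are matched directly against filename_glob, while directories and VFS paths are expanded via list_directory_recursive.

dataset = CustomGeoDataset(
    paths=[
        os.path.join('tests', 'data', 'vector'),            # local directory
        'zip://tests/data/vector.zip!vector_2024.geojson',  # file inside an archive
    ]
)
print(dataset.files)  # sorted, de-duplicated list of matching files
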
82 changes: 79 additions & 3 deletions torchgeo/datasets/utils.py
@@ -8,6 +8,8 @@

import collections
import contextlib
import fnmatch
import glob
import importlib
import os
import pathlib
@@ -19,9 +21,11 @@
from datetime import datetime, timedelta
from typing import Any, TypeAlias, cast, overload

import fiona
import numpy as np
import rasterio
import torch
from fiona.errors import FionaValueError
from torch import Tensor
from torchvision.datasets.utils import (
check_integrity,
@@ -607,18 +611,22 @@ def percentile_normalization(
return img_normalized


def path_is_vsi(path: Path) -> bool:
"""Checks if the given path is pointing to a Virtual File System.
def path_is_gdal_vsi(path: Path) -> bool:
Collaborator:
I'm still considering deleting this function. It's only used in a single place, and it's only 1 line long, so we could just replace the if-statement on line 693 with this line. I would like to keep the documentation somehow, though.

Contributor Author:
That's a lot of documentation to have in-line in the other method :P

"""Checks if the given path has a GDAL Virtual File System Interface (VSI) prefix.

This is a path within an Apache Virtual File System (VFS) supported by GDAL and
related libraries (rasterio and fiona).

.. note::
Does not check if the path exists, or if it is a dir or file.

VSI can for instance be Cloud Storage Blobs or zip-archives.
VFS can for instance be Cloud Storage Blobs or zip-archives.
They will start with a prefix indicating this.
For examples of these, see references for the two accepted syntaxes.

* https://gdal.org/user/virtual_file_systems.html
* https://rasterio.readthedocs.io/en/latest/topics/datasets.html
* https://commons.apache.org/proper/commons-vfs/filesystems.html

Args:
path: a directory or file
@@ -631,6 +639,74 @@ def path_is_vsi(path: Path) -> bool:
return '://' in str(path) or str(path).startswith('/vsi')
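
A quick illustration (a sketch, not part of the diff) of what this check accepts and rejects, reusing path styles from the tests above:

assert path_is_gdal_vsi('zip://archive.zip!folder/file.tif')          # URI-style VFS prefix
assert path_is_gdal_vsi('/vsizip//vsiaz/azure_bucket/prefix/archive.zip/folder_in_archive/file.tif')  # GDAL /vsi prefix
assert not path_is_gdal_vsi(os.path.join('tests', 'data', 'vector'))  # ordinary local path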


def listdir_vfs_recursive(root: Path) -> list[str]:
"""Lists all files in Virtual File Systems (VFS) recursively.

Args:
root: directory to list. These must contain the prefix for the VFS
(e.g., '/vsiaz/' or 'az://' for azure blob storage, or
'/vsizip/' or 'zip://' for zipped archives).

Returns:
A list of all file paths in the root VFS directory and its subdirectories.

Raises:
FileNotFoundError: If root does not exist.

.. versionadded:: 0.6
"""
dirs = [str(root)]
files = []
while dirs:
dir = dirs.pop()
try:
subdirs = fiona.listdir(dir)
# Don't use os.path.join here because vsi uri's require forward-slash,
# even on windows.
dirs.extend([f'{dir}/{subdir}' for subdir in subdirs])
except FionaValueError as e:
if 'is not a directory' in str(e):
files.append(dir)
else:
raise FileNotFoundError(f'No such file or directory: {dir}')
return files
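
A hedged usage sketch (the archive path is assumed, e.g. one produced by the temp_archive test fixture): list every file inside a zip archive, however deeply nested.

all_files = listdir_vfs_recursive('zip://tests/data/vector.zip')
# roughly: ['zip://tests/data/vector.zip/vector_2024.geojson', ...]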


def list_directory_recursive(root: Path, filename_glob: str) -> list[str]:
Collaborator:
Should we return a set instead of a list? The files method is the only place they are used, and it uses a set.

Contributor Author:
That's OK by me. The only reason I did not is that similar methods return a list (iglob, fnmatch, etc.).

Contributor Author:
We could change this method to remove duplicates (set) and sort it, then return a list as well. Or is it better to be explicit about that in the files property?

"""Lists files in directory recursively matching the given glob expression.

Also supports GDAL Virtual File Systems (VFS).

Args:
root: directory to list. For VFS these will have prefix
e.g. /vsiaz/ or az:// for azure blob storage
filename_glob: filename pattern to filter filenames

Returns:
A list of all file paths matching filename_glob in the root directory or its
subdirectories.

.. versionadded:: 0.6
"""
files: list[str]
if path_is_gdal_vsi(root):
# Change type to match expected input to filter
all_files: list[str] = []
try:
all_files = listdir_vfs_recursive(root)
except FileNotFoundError:
# To match the behaviour of glob.iglob we silently return empty list
# for non-existing root.
pass
# Prefix glob with wildcard to ignore directories
files = fnmatch.filter(all_files, f'*{filename_glob}')
else:
pathname = os.path.join(root, '**', filename_glob)
files = glob.glob(pathname, recursive=True)
return files
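
A short sketch of the two branches (paths and globs are illustrative): a local root falls through to glob, while a VFS root is listed with fiona and filtered with fnmatch.

local_matches = list_directory_recursive(os.path.join('tests', 'data', 'vector'), '*.geojson')
vfs_matches = list_directory_recursive('zip://tests/data/vector.zip', '*.geojson')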


def array_to_tensor(array: np.typing.NDArray[Any]) -> Tensor:
"""Converts a :class:`numpy.ndarray` to :class:`torch.Tensor`.
