Allow GeoDataset to list files in VSI path(s) #1399

Open
wants to merge 70 commits into
base: main
Changes from 20 commits
Commits
72702ef
Make RasterDataset accept list of files
Jun 22, 2023
b68dd5a
Fix check if str
adriantre Jun 22, 2023
e90d01c
Use isdir and isfile
adriantre Jun 23, 2023
dfad079
Add kwarg vsi to RasterDataset to support GDAL VSI
adriantre Jun 5, 2023
7291f3e
Fix formatting
adriantre Jun 5, 2023
6b41f18
Add type hints and docstring to method in utils
adriantre Jun 5, 2023
2b2be02
Fix missing import List
adriantre Jun 5, 2023
3f91e97
Fix type hints
adriantre Jun 5, 2023
f0d9475
Refactor with respect to other branch
adriantre Jun 22, 2023
7ca3cb7
Make try-catch more targeted
adriantre Jun 23, 2023
d519061
Remove redundant iglob usage
adriantre Jun 23, 2023
a841fb7
Merge main
adriantre Aug 6, 2024
2ee3c85
Remove unused imports
adriantre Aug 6, 2024
5c6e444
Add wildcard for directories
adriantre Aug 8, 2024
2da9c4a
Allow vsi files to not exist
adriantre Aug 8, 2024
e79061f
Make protected method public
adriantre Aug 8, 2024
00594de
Add zipped dataset to test vsi listdir
adriantre Aug 8, 2024
9ef7669
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 8, 2024
8a61108
Update fiona version in min-reqs.old
adriantre Aug 8, 2024
e8367ca
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 12, 2024
014e5f5
Update docstring of list_directory_recursive
adriantre Aug 12, 2024
0bf5d68
Set fiona version in min-reqs.old
adriantre Aug 12, 2024
41e3dfb
Remove redundant path exists
adriantre Aug 12, 2024
56b1c76
Remove duplicated import
adriantre Aug 12, 2024
e96cfa9
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 12, 2024
d03120d
Create temp archive for test
adriantre Aug 12, 2024
dfc571e
Add docstring to listdir_vsi_recursive
adriantre Aug 12, 2024
aaea729
Change list_directory_recursive return type
adriantre Aug 12, 2024
740dcec
Bump fiona version in pyproject.toml
adriantre Aug 12, 2024
ba014af
Introduce fixture temp_archive for reuse in tests
adriantre Aug 12, 2024
e00b356
Make GeoDataset.files warn if VSI does not exist
adriantre Aug 12, 2024
65ef978
Replace failing test due to new behaviour of vsi
adriantre Aug 12, 2024
5192acd
Fix breaking test due to os.path.join not working on zip parent dir (…
adriantre Aug 12, 2024
9f800f1
Collect test on zip-archive in test class
adriantre Aug 12, 2024
93de375
Update versionadded in dataset/utils.py
adriantre Aug 12, 2024
37c4980
Remove patch from fiona version
adriantre Aug 12, 2024
395b4bf
Add .DS_Store to gitignore
adriantre Aug 13, 2024
6f041ee
Add test for https/curl files
adriantre Aug 13, 2024
dc334d5
Check filename_glob before adding to files property
adriantre Aug 13, 2024
0b80c12
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 13, 2024
dbbcec7
Define should_warn outside if
adriantre Aug 13, 2024
0cc7e15
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 14, 2024
c990591
Merge branch 'refs/heads/main' into feature/support_gdal_virtual_file…
adriantre Aug 26, 2024
ccdbe9b
Try to support windows for vsi tests
adriantre Aug 26, 2024
a2dd0d6
Revert windows-specific path format
adriantre Aug 26, 2024
ecce011
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 26, 2024
08fde98
Skip TestVirtualFilesystems if platform is windows
adriantre Aug 27, 2024
26920ae
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 27, 2024
3baa587
Remove user-specific ignore from gitignore
adriantre Aug 27, 2024
be142aa
Properly remove changes from .gitignore
adriantre Aug 27, 2024
e7659a9
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 27, 2024
c88111e
Rename VSI to VFS where appropriate
adriantre Aug 28, 2024
cffd707
Format docstring of listdir_vfs_recursive
adriantre Aug 28, 2024
04740a2
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
51a3d64
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
7b33550
Don't use os.path.join within VFS
adriantre Aug 28, 2024
b81591d
Update torchgeo/datasets/utils.py
adriantre Aug 28, 2024
c786a57
String format wildcard instead of os.path.join
adriantre Aug 28, 2024
bb3ad78
Document raises
adriantre Aug 28, 2024
0824026
Dont use os.path.join for zip test
adriantre Aug 28, 2024
6a3d128
Update type of error in try except
adriantre Aug 28, 2024
8ea368f
Merge branch 'main' into feature/support_gdal_virtual_file_systems
adriantre Aug 28, 2024
0ad5db2
Simplify tests
adriantre Aug 28, 2024
87ea802
Simplify files property
adriantre Aug 28, 2024
94c3835
Remove unnecessary check in if
adriantre Aug 28, 2024
3ddee3a
Merge branch 'refs/heads/main' into feature/support_gdal_virtual_file…
adriantre Sep 6, 2024
46e2375
Temp archive into tmp_path
adriantre Sep 6, 2024
2d54680
Make utility functions non-public
adriantre Sep 6, 2024
d1d8a4a
Fix typo in comment
adriantre Sep 6, 2024
d22a594
Move VFS tests to TestGeoDataset
adriantre Sep 6, 2024
2 changes: 1 addition & 1 deletion requirements/min-reqs.old
@@ -3,7 +3,7 @@ setuptools==61.0.0

 # install
 einops==0.3.0
-fiona==1.8.21
+fiona==1.9.6
 kornia==0.7.3
 lightly==1.4.5
 lightning[pytorch-extra]==2.0.0
Binary file not shown.
21 changes: 20 additions & 1 deletion tests/datasets/test_geo.py
@@ -1,8 +1,8 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import math
import os
import pathlib
import pickle
import sys
from collections.abc import Iterable
@@ -205,6 +205,25 @@ def test_files_property_deterministic(self) -> None:
            CustomGeoDataset(paths=paths1).files == CustomGeoDataset(paths=paths2).files
        )

    def test_zipped_sentinel2_dataset(self) -> None:
        bands = ['B04', 'B03', 'B02']
        transforms = nn.Identity()
        cache = False

        dir_not_zipped = 'tests/data/sentinel2/S2A_MSIL2A_20220414T110751_N0400_R108_T26EMU_20220414T165533.SAFE'
        dir_zipped = f'zip://{dir_not_zipped}.zip'

        files_not_zipped = Sentinel2(
            paths=dir_not_zipped, bands=bands, transforms=transforms, cache=cache
        ).files
        files_zipped = Sentinel2(
            paths=[dir_zipped], bands=bands, transforms=transforms, cache=cache
        ).files

        basenames_not_zipped = [pathlib.Path(path).stem for path in files_not_zipped]
        basenames_zipped = [pathlib.Path(path).stem for path in files_zipped]
        assert basenames_zipped == basenames_not_zipped


class TestRasterDataset:
    @pytest.fixture(params=zip([['R', 'G', 'B'], None], [True, False]))
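The test in tests/datasets/test_geo.py compares file basenames between a directory and its zipped copy. A minimal standalone sketch of that comparison follows, using only the standard library. The directory and file names are hypothetical, the `zip://archive!/member` path form is assumed to follow the convention used by rasterio and fiona, and the lists are sorted here so that the comparison does not depend on listing order.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def make_archive(src_dir: Path, archive: Path) -> None:
    """Zip every file under src_dir, storing paths relative to its parent."""
    with zipfile.ZipFile(archive, 'w') as zf:
        for file in sorted(src_dir.rglob('*')):
            if file.is_file():
                zf.write(file, file.relative_to(src_dir.parent).as_posix())


def stems(paths: list[str]) -> list[str]:
    """Return sorted basenames without extension, ignoring any VFS prefix."""
    return sorted(PurePosixPath(p).stem for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a Sentinel-2 .SAFE directory.
    root = Path(tmp) / 'scene.SAFE'
    (root / 'IMG_DATA').mkdir(parents=True)
    for band in ('B02', 'B03', 'B04'):
        (root / 'IMG_DATA' / f'T26EMU_{band}_10m.jp2').touch()

    archive = Path(tmp) / 'scene.SAFE.zip'
    make_archive(root, archive)

    # Paths as they would look on disk and inside the archive.
    local = [p.as_posix() for p in root.rglob('*.jp2')]
    with zipfile.ZipFile(archive) as zf:
        zipped = [f'zip://{archive.as_posix()}!/{name}' for name in zf.namelist()]

    assert stems(zipped) == stems(local)
```

Comparing stems rather than full paths sidesteps the differing prefixes, which is the same idea the test relies on.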
12 changes: 5 additions & 7 deletions torchgeo/datasets/geo.py
@@ -5,7 +5,6 @@

 import abc
 import functools
-import glob
 import os
 import pathlib
 import re
@@ -39,8 +38,8 @@
     array_to_tensor,
     concat_samples,
     disambiguate_timestamp,
+    list_directory_recursive,
     merge_samples,
-    path_is_vsi,
 )


@@ -307,12 +306,11 @@ def files(self) -> list[Path]:
         # Using set to remove any duplicates if directories are overlapping
         files: set[Path] = set()
         for path in paths:
-            if os.path.isdir(path):
-                pathname = os.path.join(path, '**', self.filename_glob)
-                files |= set(glob.iglob(pathname, recursive=True))
-            elif os.path.isfile(path) or path_is_vsi(path):
+            if os.path.exists(path) and os.path.isfile(path):
                 files.add(path)
-            elif not hasattr(self, 'download'):
+            else:
+                files |= set(list_directory_recursive(path, self.filename_glob))
+            if not files and not hasattr(self, 'download'):
                 warnings.warn(
                     f"Could not find any relevant files for provided path '{path}'. "
                     f'Path was ignored.',
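The updated `files` property in torchgeo/datasets/geo.py adds existing files directly, lists everything else recursively, and warns when the accumulated set is still empty. A rough sketch of that control flow for local paths only is shown below. `collect_files` is a hypothetical helper, not part of torchgeo, and it omits the `download` attribute check. One consequence worth noting, assuming the check stays on the accumulated set: a path that contributes nothing triggers a warning only if no earlier path has produced files.

```python
import glob
import os
import tempfile
import warnings


def collect_files(paths: list[str], filename_glob: str) -> list[str]:
    """Mimic the control flow of the updated files property (local paths only)."""
    files: set[str] = set()
    for path in paths:
        if os.path.isfile(path):
            files.add(path)
        else:
            pattern = os.path.join(path, '**', filename_glob)
            files |= set(glob.iglob(pattern, recursive=True))
        # As in the diff, the check looks at the accumulated set, not at
        # what this particular path contributed.
        if not files:
            warnings.warn(
                f"Could not find any relevant files for provided path '{path}'.",
                UserWarning,
            )
    return sorted(files)


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'a.tif'), 'w').close()
    missing = os.path.join(tmp, 'does_not_exist')

    # Missing path first: the set is empty when it is checked, so it warns.
    with warnings.catch_warnings(record=True) as caught_first:
        warnings.simplefilter('always')
        found_missing_first = collect_files([missing, tmp], '*.tif')

    # Missing path last: the set already holds a.tif, so no warning is raised.
    with warnings.catch_warnings(record=True) as caught_last:
        warnings.simplefilter('always')
        found_missing_last = collect_files([tmp, missing], '*.tif')
```

If a per-path warning is the intent, tracking the files found for each path separately would be one way to get it.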
45 changes: 45 additions & 0 deletions torchgeo/datasets/utils.py
@@ -9,6 +9,8 @@
import bz2
import collections
import contextlib
import fnmatch
import glob
import gzip
import importlib
import lzma
@@ -23,9 +25,11 @@
from datetime import datetime, timedelta
from typing import Any, TypeAlias, cast, overload

import fiona
import numpy as np
import rasterio
import torch
from fiona.errors import FionaValueError
from torch import Tensor
from torchvision.datasets.utils import check_integrity, download_url
from torchvision.utils import draw_segmentation_masks
@@ -772,6 +776,47 @@ def path_is_vsi(path: Path) -> bool:
    return '://' in str(path) or str(path).startswith('/vsi')


def listdir_vsi_recursive(root: Path) -> list[Path]:
    dirs = [root]
    files = []
    while dirs:
        dir = dirs.pop()
        try:
            subdirs = fiona.listdir(dir)
            dirs.extend([os.path.join(dir, subdir) for subdir in subdirs])
        except FionaValueError:
            # Assuming dir is a file as it is not a directory
            # fiona.listdir can throw FionaValueError for only two reasons
            # 1. 'is not a directory'
            # 2. 'does not exist'
            # We currently don't have tests for existing vsi, and will thus
            # allow vsi to not exist.
            files.append(dir)
    return files


def list_directory_recursive(root: Path, filename_glob: str) -> list[Path]:
    """Lists files in directory recursively matching the given glob expression.

    Also supports gdal virtual file systems (vsi).

    Args:
        root: directory to list. For vsi these will have prefix
            e.g. /vsiaz or az:// for azure blob storage
        filename_glob: filename pattern to filter filenames
    """
    files: list[Path]
    if path_is_vsi(root):
        # Change type to match expected input to filter
        files_as_str: list[str] = [str(file) for file in listdir_vsi_recursive(root)]
        # Prefix glob with wildcard to ignore directories
        files = cast(list[Path], fnmatch.filter(files_as_str, '*' + filename_glob))
    else:
        pathname = os.path.join(root, '**', filename_glob)
        files = list(glob.iglob(pathname, recursive=True))
    return files


def array_to_tensor(array: np.typing.NDArray[Any]) -> Tensor:
    """Converts a :class:`numpy.ndarray` to :class:`torch.Tensor`.

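The two helpers in torchgeo/datasets/utils.py combine a stack-based walk, where any path the lister rejects is treated as a file, with an `fnmatch` filter whose glob is prefixed by `*`. The sketch below illustrates that technique without fiona. `TREE`, `fake_listdir`, and `listdir_recursive` are hypothetical names, and the in-memory tree only stands in for a real virtual file system, so its path layout is an assumption.

```python
import fnmatch
from collections.abc import Callable

# Hypothetical in-memory tree standing in for a remote or zipped file system.
TREE: dict[str, list[str]] = {
    'zip://scene.zip': ['IMG_DATA', 'README.txt'],
    'zip://scene.zip/IMG_DATA': ['T26EMU_B02.jp2', 'T26EMU_B03.jp2'],
}


def fake_listdir(path: str) -> list[str]:
    """Return the children of path, or raise ValueError if it is not a directory."""
    try:
        return TREE[path]
    except KeyError:
        raise ValueError(f'{path} is not a directory') from None


def listdir_recursive(root: str, listdir: Callable[[str], list[str]]) -> list[str]:
    """Depth-first walk that treats any path the lister rejects as a file."""
    stack = [root]
    files: list[str] = []
    while stack:
        current = stack.pop()
        try:
            children = listdir(current)
        except ValueError:
            # Not a directory (or missing): record it as a file.
            files.append(current)
        else:
            # Join with '/' explicitly; os.path.join would use '\\' on Windows.
            stack.extend(f'{current}/{child}' for child in children)
    return files


all_files = listdir_recursive('zip://scene.zip', fake_listdir)
# The leading '*' lets a filename-only glob match the full path.
matches = sorted(fnmatch.filter(all_files, '*' + 'T26EMU_*.jp2'))
```

Passing the lister as a parameter keeps the walk testable without network or archive access. As in the diff, a path that does not exist is indistinguishable from a file in this scheme.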