TileDatasets #1353

calebrob6 · 2023-05-20T21:50:18Z

Introduces a new class of datasets called TileDatasets that are indexed by filename, xoffset, yoffset, and patch size.
Implements samplers for these
Implements a L7IrishDataModule using this scheme

adamjstewart · 2023-05-21T03:04:48Z

Thanks for opening this proof of concept!

There's a few questions here:

Do we want to add this base class to TorchGeo?
Do we want all of our curated benchmark GeoDatasets to subclass this?
Do we want to search for files or pass in a list of files?

My current opinions:

I think we'll likely want something like this in TorchGeo
This one is more nuanced. Using this for all benchmark GeoDatasets lets us avoid reprojection, but prevents us from combining those datasets with other GeoDatasets. We'll have to decide how important this functionality is. Depending on how much time I have this summer, I may try to tackle GeoDataset: avoid unnecessary reprojection #409, which would make the gains less.
I know you've been looking for something like this for a long time. I would like to be consistent here. Either Raster/Vector/Tile search for files, or get passed a list of files, or support both options.

calebrob6 · 2023-05-21T04:19:09Z

Do we want to add this base class to TorchGeo?

Maybe -- although it feels like we should be able to get there by making RasterDataset and the Samplers more complex. I think I sketched out a method for allowing RasterDataset __getitem__ to take in two different types of bounding boxes, the current one, and the one used in this implementation (call these BoundingBox and Patch or something -- note you can convert from Patch to BoundingBox). If we're in a situation where there are overlapping datasets or something you can just stick with the current implementation.

Do we want all of our curated benchmark GeoDatasets to subclass this?

Depends on the first question

Do we want to search for files or pass in a list of files?

We've talked about this a bunch of times before -- I almost never want to search a file system and keep things that match a regex. The nice thing about this is that you can rename fns to uris and pass it lists of COG files on remote servers and it will work, while you definitely can't do that with glob.

nilsleh · 2023-05-22T06:48:15Z

experiments/ssl4eo/run_l7irish.py

+        else:
+            experiment_name = f"{model}_{lr}_{loss}_{wd}_{weights}_{seed}"
+
+        config_file = os.path.join("conf", "l7irishtile.yaml")


Should we make one run_{downstream_task}.py script that has the config file name as an additional variable at the beginning of the file? Then we can use this one script for all the different downstream tasks, or are there some differences beyond the config files that we need to account for?

isaaccorley · 2023-12-17T03:52:26Z

I'd like to revive this one but maybe we can also make the samplers work with non georeferenced images. Some of the datasets we have like GID-15, LEVIR-CD, etc. are large images that I may want to sample smaller patches from for training but I don't want to manually preprocess them to a specific patch size beforehand in case I want to run an ablation by varying the patch size.

calebrob6 · 2024-08-29T16:07:22Z

@adamjstewart -- do you have any more recent thoughts on this? I use these types of datasets somewhat frequently in my day-to-day work and would love to have them in TorchGeo, but also don't have a burning passion to merge it

adamjstewart · 2024-08-29T18:06:46Z

Overall, I think I would be okay with this approach. It's kind of like what other libraries did before TorchGeo existed. I'm very okay with this for datasets like GID-15 and LEVIR-CD where the images are large but there are no geocoordinates. I'm slightly okay with this for datasets like L7 Irish where the images are large and there are geocoordinates.

Fancy is cool, but fast is better. Especially if you're already regularly using datasets like this. I want to make sure TorchGeo is actually useful to its intended audience, including its maintainers. If the maintainers aren't using the builtin datasets, we're doing something wrong.

Depending on how much time I have this summer, I may try to tackle GeoDataset: avoid unnecessary reprojection #409, which would make the gains less.

Well this did not age well...

it feels like we should be able to get there by making RasterDataset and the Samplers more complex.

I don't think this is possible. If we subclass from GeoDataset, we are declaring that the dataset can interoperate with all other GeoDatasets. TileDataset really does deserve its own base class and samplers. I'm not even sure if samplers are the right approach here. This requires a lot of rethinking. How do people normally use these kind of datasets for ML? Are there other ideas from high-res imagery we can borrow from? Presumably we are not the only ones using large images.

Do we want to search for files or pass in a list of files?

We now support this, yay!

Implements a L7IrishDataModule using this scheme

Note that I converted L7 Irish to an IntersectionDataset in 0.6.0. This significantly cuts down on the code (no more getitem). Just an aside, not really relevant to the discussion.

calebrob6 · 2024-09-04T17:03:21Z

Note that this would fix all issues with ChesapeakeCVPR as ChesapeakeCVPR should be a tile dataset!
The unchipped version of LandCoverAI and OSCD (more generally, any dataset where we currently use NCrops) could also be tile datasets. Any situation where you have image tiles of different shapes, that have no CRS information (or many different CRSs) you can treat as a tile dataset.

calebrob6 added 2 commits May 20, 2023 20:45

this is how I normally write code

62e3e9c

getting some experiments running

b30a295

github-actions bot added datamodules PyTorch Lightning datamodules datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets scripts Training and evaluation scripts labels May 20, 2023

Grid time

e043a0e

github-actions bot added the trainers PyTorch Lightning trainers label May 21, 2023

experiment_name

0ae66f9

nilsleh reviewed May 22, 2023

View reviewed changes

adamjstewart added this to the 0.5.0 milestone May 25, 2023

Add L8BiomeTile

6934dc5

adamjstewart removed this from the 0.5.0 milestone Sep 28, 2023

calebrob6 mentioned this pull request Sep 4, 2024

Incompatible image size with RandomGeoSampler #1778

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TileDatasets #1353

TileDatasets #1353

calebrob6 commented May 20, 2023

adamjstewart commented May 21, 2023

calebrob6 commented May 21, 2023

nilsleh May 22, 2023 •

edited

Loading

isaaccorley commented Dec 17, 2023

calebrob6 commented Aug 29, 2024

adamjstewart commented Aug 29, 2024

calebrob6 commented Sep 4, 2024 •

edited

Loading

TileDatasets #1353

Are you sure you want to change the base?

TileDatasets #1353

Conversation

calebrob6 commented May 20, 2023

adamjstewart commented May 21, 2023

calebrob6 commented May 21, 2023

nilsleh May 22, 2023 • edited Loading

Choose a reason for hiding this comment

isaaccorley commented Dec 17, 2023

calebrob6 commented Aug 29, 2024

adamjstewart commented Aug 29, 2024

calebrob6 commented Sep 4, 2024 • edited Loading

nilsleh May 22, 2023 •

edited

Loading

calebrob6 commented Sep 4, 2024 •

edited

Loading