I/O Bench: add new dataset #1972

adamjstewart · 2024-04-01T14:02:17Z

Our paper includes an I/O performance benchmark consisting of 114 Landsat 8 Level-2 scenes and 1 CDL 2019 mask. Since its publication, we have not been benchmarking I/O during PRs or between releases. This has led to difficulty in properly evaluating proposed changes (#1881) and massive performance bugs that have gone undetected for years (#1968).

This PR adds a small dataset for benchmarking I/O performance. It consists of a single Landsat 9 scene and CDL mask with the following splits:

original: the original files as downloaded from USGS Earth Explorer and USDA CropScape
raw: the same files with compression and with CDL clipped to the bounds of the Landsat scene
preprocessed: the same files with compression, reprojected to the same CRS, stored as Cloud Optimized GeoTIFFs (COGs), with target-aligned pixels (TAP)

These roughly correspond to the same categories as used in our paper, with a few changes:

Only a single Landsat scene
CDL is clipped to the bounds of Landsat

These changes make the dataset as small as possible so that it can be downloaded quickly and fit on any system. I believe the dataset is still useful even at such a small size. This should be considered version 1 of the dataset, with many future changes planned to evaluate a broader set of conditions.
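For readers who want to reproduce something like the preprocessed split, here is a rough sketch of how such a file could be generated with GDAL's gdalwarp. This is not necessarily the command used to build the dataset: the target CRS (EPSG:32616), the 30 m resolution, the DEFLATE compression, and the file names are all assumptions chosen for illustration, and the helper `build_gdalwarp_command` is hypothetical.

```python
import shlex


def build_gdalwarp_command(
    src: str,
    dst: str,
    target_crs: str = "EPSG:32616",  # assumption: UTM zone 16N
    resolution: float = 30.0,  # assumption: 30 m pixels
    compression: str = "DEFLATE",  # assumption: the PR does not name a codec
) -> list[str]:
    """Build a gdalwarp call that reprojects, aligns, compresses, and writes a COG."""
    return [
        "gdalwarp",
        "-t_srs", target_crs,  # reproject to a common CRS
        "-tr", str(resolution), str(resolution),  # -tap requires an explicit resolution
        "-tap",  # target-aligned pixels
        "-of", "COG",  # Cloud Optimized GeoTIFF driver (GDAL >= 3.1)
        "-co", f"COMPRESS={compression}",
        src,
        dst,
    ]


# Hypothetical file names, for illustration only.
cmd = build_gdalwarp_command("landsat_band.tif", "preprocessed/landsat_band.tif")
print(shlex.join(cmd))
```

The sketch only builds and prints the command. Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it on a system with GDAL installed.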

[Image: sample image over Champaign County, IL, USA]

This PR also represents an interesting and novel dataset design. Note that instead of writing a custom RasterDataset and overriding __getitem__ to handle both Landsat and CDL files, I'm actually subclassing from IntersectionDataset, instantiating Landsat and CDL classes, and computing the intersection. This is significantly easier and could be used to simplify our implementations of other raster datasets with both images and masks: AgriFieldNet, Chesapeake, EnviroAtlas, GlobBiomass, L7 Irish, L8 Biome, LandCover.ai, etc.
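To illustrate the design pattern described above without depending on TorchGeo itself, here is a minimal, library-free sketch. All class names (`Box`, `ToyRaster`, `ToyIntersection`, `ToyBenchmark`) and the coordinates are hypothetical stand-ins, not the TorchGeo API and not the code in this PR. The point is only that the subclass constructor instantiates an image dataset and a mask dataset, then hands both to the intersection parent, so no custom `__getitem__` is needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (toy stand-in for a geospatial extent)."""

    minx: float
    miny: float
    maxx: float
    maxy: float

    def __and__(self, other: "Box") -> "Box":
        """Return the overlap of two boxes, or raise if they are disjoint."""
        box = Box(
            max(self.minx, other.minx),
            max(self.miny, other.miny),
            min(self.maxx, other.maxx),
            min(self.maxy, other.maxy),
        )
        if box.minx >= box.maxx or box.miny >= box.maxy:
            raise ValueError("boxes do not overlap")
        return box

    def contains(self, other: "Box") -> bool:
        return (
            self.minx <= other.minx
            and self.miny <= other.miny
            and self.maxx >= other.maxx
            and self.maxy >= other.maxy
        )


class ToyRaster:
    """Toy raster dataset that returns a placeholder under a single key."""

    def __init__(self, key: str, bounds: Box) -> None:
        self.key = key
        self.bounds = bounds

    def __getitem__(self, query: Box) -> dict:
        # A real dataset would read pixels here; this returns a label instead.
        return {self.key: f"{self.key} data for {query}"}


class ToyIntersection:
    """Toy intersection: valid only where both datasets overlap."""

    def __init__(self, dataset1: ToyRaster, dataset2: ToyRaster) -> None:
        self.datasets = (dataset1, dataset2)
        self.bounds = dataset1.bounds & dataset2.bounds

    def __getitem__(self, query: Box) -> dict:
        if not self.bounds.contains(query):
            raise IndexError(f"query {query} is outside {self.bounds}")
        sample: dict = {}
        for dataset in self.datasets:
            sample.update(dataset[query])  # merge image and mask entries
        return sample


class ToyBenchmark(ToyIntersection):
    """Subclass that builds both component datasets in its constructor."""

    def __init__(self) -> None:
        image = ToyRaster("image", Box(0, 0, 10, 10))  # made-up extent
        mask = ToyRaster("mask", Box(-5, -5, 20, 20))  # made-up, larger extent
        super().__init__(image, mask)


dataset = ToyBenchmark()
sample = dataset[Box(2, 2, 4, 4)]
print(sorted(sample))
```

Because the parent class already knows how to merge samples from its two children, converting an existing image-plus-mask dataset to this pattern mostly means moving file handling into the two component classes.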

FYI @yichiac

Closes #190

adamjstewart · 2024-04-04T19:34:41Z

Trying to resist the urge to write a custom profiler that automatically formats the output we care about as a Markdown table: https://lightning.ai/docs/pytorch/stable/tuning/profiler_expert.html

github-actions bot added the datasets label Apr 1, 2024

adamjstewart added this to the 0.6.0 milestone Apr 1, 2024

github-actions bot added the testing, datamodules, documentation, and trainers labels Apr 1, 2024

adamjstewart added 14 commits April 4, 2024 20:40

I/O Bench: add new dataset

0442068

Add tests

afb7434

IOBenchDataModule: add new data module

ecda517

Style fixes

2e501d8

Style fixes

f4c787f

RandomGeoSampler == RandomBatchGeoSampler when 1 scene

fb027b9

IOBenchDataModule: add tests

548004d

Add API docs

4ed2a5e

Smaller class size

28e4bb3

Fix typo

04531c3

Add config file

378b728

Add IOBenchTask

fbe7f87

Add tests

2642f44

pyupgrade

aa60ec9

adamjstewart force-pushed the datasets/iobench branch from a095943 to aa60ec9 Compare April 4, 2024 18:43

Fix support for older PyTorch

d10d7be

Add another config file

f90c490

adamjstewart mentioned this pull request Apr 4, 2024

RandomGeoSampler: fix performance regression #1968

Merged

Add usage documentation

ecb5619

adamjstewart marked this pull request as ready for review April 4, 2024 20:01

adamjstewart added 2 commits April 15, 2024 23:57

Merge branch 'main' into datasets/iobench

6ec8985

Merge branch 'main' into datasets/iobench

4dec6a8

adamjstewart merged commit 04a85a5 into microsoft:main Apr 19, 2024
15 checks passed

adamjstewart deleted the datasets/iobench branch April 19, 2024 16:22

This was referenced Apr 20, 2024

RasterDataset: add control over resampling algorithm #2015

Merged

Add South Africa Crop Type DataModule #1970

Merged

adamjstewart mentioned this pull request Apr 27, 2024

L7 Irish: convert to IntersectionDataset #2034

Merged

adamjstewart mentioned this pull request May 13, 2024

L8 Biome: convert to IntersectionDataset #2058

Merged
