I/O Bench: add new dataset #1972

Merged: 19 commits, Apr 19, 2024

@adamjstewart (Collaborator) commented Apr 1, 2024

Our paper includes an I/O performance benchmark consisting of 114 Landsat 8 Level-2 scenes and 1 CDL 2019 mask. Since its publication, we have not been benchmarking I/O during PRs or between releases. As a result, proposed changes have been difficult to evaluate properly (#1881), and massive performance bugs have gone undetected for years (#1968).

This PR adds a dataset for benchmarking I/O performance. It consists of a single Landsat 9 scene and a CDL mask, with the following splits:

  • original: the original files as downloaded from USGS Earth Explorer and USDA CropScape
  • raw: the same files with compression and with CDL clipped to the bounds of the Landsat scene
  • preprocessed: the same files with compression, reprojected to the same CRS, converted to COGs, with target-aligned pixels (TAP)
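
For readers unfamiliar with TAP: assuming it means target-aligned pixels here (as in GDAL's `-tap` option), the output extent is snapped outward to whole multiples of the pixel size, so that rasters sharing a CRS and resolution line up on the same grid. A minimal sketch of that arithmetic follows; the function name and the coordinates are hypothetical and are not taken from this PR.

```python
import math


def tap_bounds(bounds, res):
    """Snap (minx, miny, maxx, maxy) outward to whole multiples of the pixel size."""
    minx, miny, maxx, maxy = bounds
    xres, yres = res
    return (
        math.floor(minx / xres) * xres,  # move left edge down to the grid
        math.floor(miny / yres) * yres,  # move bottom edge down to the grid
        math.ceil(maxx / xres) * xres,   # move right edge up to the grid
        math.ceil(maxy / yres) * yres,   # move top edge up to the grid
    )


# Hypothetical extent in projected metres, with 30 m pixels
aligned = tap_bounds((500012.0, 4400047.0, 500950.0, 4401001.0), (30.0, 30.0))
print(aligned)
```

The aligned extent always contains the original one, so no data is cropped; the raster only gains a partial row or column of padding at each edge.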

These roughly correspond to the same categories as used in our paper, with a few changes:

  • Only a single Landsat scene
  • CDL is clipped to the bounds of Landsat

These changes keep the dataset as small as possible so that it can be downloaded quickly and fits on any system. I believe that the dataset is still useful even at such a small size. Treat this as version 1 of the dataset: many changes are expected in future versions to evaluate a broader set of conditions.

  • Add dataset
  • Add data module
  • Add trainer
  • Add tests
  • Add documentation

Sample image over Champaign County, IL, USA:

[Image: iobench]

This PR also introduces a novel dataset design. Instead of writing a custom RasterDataset and overriding __getitem__ to handle both Landsat and CDL files, I subclass IntersectionDataset, instantiate the Landsat and CDL classes, and compute their intersection. This is significantly easier and could be used to simplify our implementations of other raster datasets that have both images and masks: AgriFieldNet, Chesapeake, EnviroAtlas, GlobBiomass, L7 Irish, L8 Biome, LandCover.ai, etc.
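
The design described above can be illustrated with a toy, dependency-free sketch. Every name below (`Box`, `ToyRaster`, `ToyIntersection`, `ToyBenchmark`) is hypothetical and only mimics the pattern; this is not the TorchGeo API, and the real IntersectionDataset also deals with CRS, resolution, and spatial indexing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical stand-in for a spatial index)."""

    minx: float
    miny: float
    maxx: float
    maxy: float

    def __and__(self, other: "Box") -> "Box":
        """Return the overlap of two boxes, or raise if they are disjoint."""
        box = Box(
            max(self.minx, other.minx),
            max(self.miny, other.miny),
            min(self.maxx, other.maxx),
            min(self.maxy, other.maxy),
        )
        if box.minx >= box.maxx or box.miny >= box.maxy:
            raise ValueError("datasets do not overlap")
        return box


class ToyRaster:
    """Toy layer that returns a constant value under a fixed key."""

    def __init__(self, key: str, value: int, bounds: Box) -> None:
        self.key, self.value, self.bounds = key, value, bounds

    def __getitem__(self, query: Box) -> dict:
        return {self.key: self.value}


class ToyIntersection:
    """Pairs two layers and exposes only the area they share."""

    def __init__(self, first: ToyRaster, second: ToyRaster) -> None:
        self.datasets = (first, second)
        self.bounds = first.bounds & second.bounds

    def __getitem__(self, query: Box) -> dict:
        sample = {"bounds": query}
        for dataset in self.datasets:
            sample.update(dataset[query])  # merge image and mask entries
        return sample


class ToyBenchmark(ToyIntersection):
    """Subclass that only wires up its two layers; no custom __getitem__."""

    def __init__(self) -> None:
        image = ToyRaster("image", 1, Box(0, 0, 10, 10))
        mask = ToyRaster("mask", 2, Box(5, 5, 20, 20))
        super().__init__(image, mask)


ds = ToyBenchmark()
sample = ds[ds.bounds]
print(ds.bounds, sorted(sample))
```

The point of the pattern is that the subclass contains no sampling logic of its own: pairing the image and mask layers, and restricting queries to their shared extent, is inherited from the parent class.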

FYI @yichiac

Closes #190

@github-actions bot added the datasets (Geospatial or benchmark datasets) label Apr 1, 2024
@adamjstewart added this to the 0.6.0 milestone Apr 1, 2024
@github-actions bot added the testing (Continuous integration testing), datamodules (PyTorch Lightning datamodules), documentation (Improvements or additions to documentation), and trainers (PyTorch Lightning trainers) labels Apr 1, 2024
@adamjstewart (Collaborator, Author) commented

Trying to resist the urge to write a custom profiler that automatically formats the output we care about as a Markdown table: https://lightning.ai/docs/pytorch/stable/tuning/profiler_expert.html
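
For illustration only, the formatting half of that idea fits in a few lines of plain Python. This sketch deliberately does not subclass or call anything from Lightning; the function name, column layout, and action names are all hypothetical, and the timings are made-up numbers rather than benchmark results.

```python
from statistics import mean


def to_markdown(timings: dict[str, list[float]]) -> str:
    """Render per-action durations (in seconds) as a Markdown table."""
    lines = [
        "| Action | Calls | Total (s) | Mean (s) |",
        "| --- | ---: | ---: | ---: |",
    ]
    for action, durations in timings.items():
        lines.append(
            f"| {action} | {len(durations)} "
            f"| {sum(durations):.3f} | {mean(durations):.3f} |"
        )
    return "\n".join(lines)


# Hypothetical measurements, not results from this PR
table = to_markdown({"fetch_batch": [0.5, 0.25, 0.25], "training_step": [0.125]})
print(table)
```

A real profiler would still need to collect the per-action durations; only the table rendering is shown here.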

@adamjstewart marked this pull request as ready for review April 4, 2024 20:01
@adamjstewart merged commit 04a85a5 into microsoft:main Apr 19, 2024
15 checks passed
Linked issue: GeoDataset performance