
Add STACAPI dataset #412

Draft · nilsleh wants to merge 3 commits into main
Conversation

@nilsleh (Collaborator) commented Feb 17, 2022

This PR closes #403 by adding a STACAPI dataset. Here is the thinking behind the choices I made:

  • constructor takes the usual root, crs, res, and transforms arguments, like other raster datasets
  • additionally, there are bands, api_endpoint, and **query_parameters arguments. I kept bands separate because other raster datasets also specifically have this argument (see the usage sketch below)
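
For illustration, a hypothetical constructor call matching the arguments described above (the endpoint and query values are examples only, and the import assumes this PR's proposed class):

from torchgeo.datasets import STACAPIDataset  # proposed in this PR

ds = STACAPIDataset(
    root="data",
    crs=None,  # fall back to a CRS inferred from the returned items
    res=10.0,
    bands=["B04", "B03", "B02"],
    api_endpoint="https://planetarycomputer.microsoft.com/api/stac/v1",
    # everything below is forwarded as **query_parameters
    collections=["sentinel-2-l2a"],
    datetime="2021-06-01/2021-07-01",
)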

When the __getitem__ method receives a bbox query:

  • each item among the returned hits is merged into a single raster and computed, resulting in a 4D xarray of shape (time x channel x height x width)
  • since multiple hits can be returned for a particular bbox query, these have to be merged. Usually this happens with rasterio, because the inputs are data-reader objects or file paths, not the xarrays that stackstac produces. The only library I found that merges xarrays is rioxarray, so this library is currently also required
  • however, since rioxarray.merge only supports 2D or 3D arrays, and a STAC search can return many samples over a time period, I loop over the time dimension, merge with rioxarray per time step, and afterwards stack everything back together into one final tensor, which is returned as the image in the sample (see the sketch after this list)
  • the time dimension is squeezed out if there is only a single time instance
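
A minimal sketch of that per-time-step merge, assuming stacks is a list of overlapping 4D (time, band, y, x) DataArrays produced by stackstac (the function name and details are illustrative):

import xarray as xr
from rioxarray.merge import merge_arrays

def merge_hits(stacks):
    """Merge overlapping 4D (time, band, y, x) stacks into one DataArray."""
    merged_steps = []
    for t in stacks[0].time.values:
        # merge_arrays only supports 2D/3D arrays, so merge one
        # 3D (band, y, x) time slice at a time
        merged_steps.append(merge_arrays([s.sel(time=t) for s in stacks]))
    merged = xr.concat(merged_steps, dim="time")
    # squeeze out the time dimension for single-timestamp queries
    return merged.squeeze("time") if merged.sizes["time"] == 1 else merged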

Questions / Left to do:

  • several new libraries are needed; maybe some, like rioxarray, can be avoided?
  • computing the aoi can take fairly long, although it is much faster than I expected
  • handling the self.res resolution parameter when each band has a different resolution
  • deciding whether an image should go under the image or mask key, because a STAC API serves both images and labels; maybe another constructor argument?
  • given the last point, we need to implement a plot method that can handle both scenarios (the dataset containing images or labels)
  • testing with sampler and dataloader

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Feb 17, 2022
@nilsleh nilsleh marked this pull request as draft February 17, 2022 21:00
@adamjstewart adamjstewart added this to the 0.3.0 milestone Feb 18, 2022
@calebrob6 (Member) commented Feb 20, 2022

instead of what is currently in __getitem__ we can use something like:

items = TODO  # list of the STAC Items that the current query intersects with (based on the results from the rtree)

stack = stackstac.stack(
    items,
    assets=self.bands,
    resolution=self.res,
    epsg=TODO,  # convert self._crs to an epsg code
)

aoi = stack.loc[..., maxy:miny, minx:maxx]
# can do some checking here to see what's up with the time dimension
data = aoi.compute()
out_image = data.data
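
For the epsg=TODO line, rasterio's CRS objects can usually report their EPSG code directly; a minimal sketch:

from rasterio.crs import CRS

crs = CRS.from_string("EPSG:32606")  # stand-in for self._crs
epsg = crs.to_epsg()  # 32606; returns None if no EPSG code matches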

@calebrob6 (Member)

several new libraries are needed; maybe some, like rioxarray, can be avoided?

Generally, we want to minimize direct dependencies.

computing the aoi can take fairly long, although it is much faster than I expected

I have previously benchmarked different ways of transforming coordinates (https://gist.github.com/calebrob6/334af85deb4c862689020ae4b3344bfc) and pyproj was by far the fastest. The constructor may be sped up if you replace the fiona.transform.transform(...) call with a pyproj version.
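
What the pyproj version might look like (a sketch; the target UTM zone and coordinates are arbitrary examples):

from pyproj import Transformer

# Build the transformer once and reuse it across items; constructing it
# is the expensive part. always_xy=True keeps (x, y) = (lon, lat)
# ordering regardless of the CRS's axis conventions.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32606", always_xy=True)

minx, miny, maxx, maxy = -148.46876, 61.0491, -148.31072, 61.12567
(minx, maxx), (miny, maxy) = transformer.transform([minx, maxx], [miny, maxy])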

handling the self.res resolution parameter when each band has a different resolution

I think we have to reproject to a single resolution, and it makes sense to choose the highest resolution if the user doesn't specify one (which is what stackstac does by default).

deciding whether an image should go under the image or mask key, because a STAC API serves both images and labels; maybe another constructor argument?

I think this makes sense -- what do you think @adamjstewart?

given the last point, we need to implement a plot method that can handle both scenarios (the dataset containing images or labels)

I think we can't have a plot method for this class, as it is too generic (e.g. it could be used to grab anything on the Planetary Computer).

@calebrob6 (Member)

Another thought -- for the Planetary Computer specifically, we will need to sign STAC Items before we can load data from them. Perhaps there should be a Callable "item_transformer" argument that takes an Item as input and returns an Item, and is the identity transform by default? For the Planetary Computer case we can provide a transform that signs the item.

@TomAugspurger, any thoughts in general here?
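
A rough sketch of that idea, with item_transformer as a hypothetical constructor argument; planetary_computer.sign is a real function that accepts a pystac.Item and returns a signed copy:

import planetary_computer
import pystac

from torchgeo.datasets import STACAPIDataset  # proposed in this PR

def identity(item: pystac.Item) -> pystac.Item:
    """Default item_transformer: pass items through unchanged."""
    return item

# Hypothetical usage (other constructor arguments omitted): sign each
# Planetary Computer item so its asset URLs carry valid SAS tokens.
ds = STACAPIDataset(
    api_endpoint="https://planetarycomputer.microsoft.com/api/stac/v1",
    item_transformer=planetary_computer.sign,
)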

@adamjstewart (Collaborator)

For me, the biggest remaining question is how to integrate this with RasterDataset/VectorDataset. Is the intention that we will have subclasses of STACAPIDataset for each dataset on the Planetary Computer? For things like Landsat and Sentinel that are available in both STAC and non-STAC forms, will we have a single class or will we have Landsat and LandsatSTAC classes?

@calebrob6 (Member)

how to integrate this with RasterDataset/VectorDataset

What do you mean here? (e.g. I would consider this integrated as it extends RasterDataset)

Is the intention that we will have subclasses of STACAPIDataset for each dataset on the Planetary Computer?

No, I don't think so. Users could easily use it with the Landsat8 collection on the Planetary Computer, but would need to do more work if they wanted the features that come with the Landsat dataset.

# from torchgeo.samplers import RandomGeoSampler


class STACAPIDataset(RasterDataset, abc.ABC):
Review comment (Collaborator):

Suggested change
class STACAPIDataset(RasterDataset, abc.ABC):
class STACAPIDataset(GeoDataset):

This is neither an ABC (there are no abstract methods), nor does it need to be a subclass of RasterDataset (it doesn't re-use any of the logic from RasterDataset). I would consider this to be just another possible base class that subclasses GeoDataset.

@adamjstewart (Collaborator)

@calebrob6 I guess we're not on the same page then; I've always thought of STAC integration as no different than finding files on your filesystem. My hope was that we could have a single dataset Landsat (or rather, Landsat8, etc.) that would work for both locally downloaded files without a STAC index and remote files with a STAC index. Just like we have subclasses of RasterDataset for various datasets, I think we'll want subclasses of STACAPIDataset for each dataset. Is there any reason that we don't need this? Or would you suggest that Landsat8 isn't needed and users should just directly use RasterDataset and pass in a file glob or list of files or something?

@calebrob6 (Member)

I see what you mean now -- I think we're not on the same page simply because you are one chapter ahead of me ;)

I've always thought of STAC integration as no different than finding files on your filesystem

I like this! Some thoughts:

  • If we implement this for e.g. Landsat8 then we would probably be coupled to a particular STAC API (e.g. "the planetary computer API").
  • I think STACAPIDataset should still be able to function independently, similar to how you can pass some TIFFs to RasterDataset, so that we don't have to implement a specific dataset for every possible collection that a user wants to use that is hosted behind a STAC API.
  • Maybe a "mixin" type pattern is appropriate for implementing this? Landsat8 is "STACAPI'able", Sentinel2 is "STACAPI'able", etc.

@nilsleh (Collaborator, Author) commented Feb 27, 2022

instead of what is currently in __getitem__ we can use something like: [the stackstac.stack suggestion quoted above]

@calebrob6 I have been trying out your suggestion with different bounds and stacks, but here the .loc[] indexing always returns an array of shape [num_items, channels, 0, 0] so no height or width, and I can't seem to figure out what is causing that.

@weiji14 (Contributor) left a review comment:

Hi @nilsleh, thanks for opening up this PR, can't wait to see torchgeo be able to use STAC datasets directly! I've got a few suggested changes, including one that should hopefully fix the empty array problem you found.

At the moment this STACAPIDataset implementation seems to be specific to Sentinel-2, and it also assumes the use of the Planetary Computer. Maybe we should think about how to make this a bit more generic. The easiest way I can think of is to use some try...except or if...then statements to check whether the api_endpoint contains planetarycomputer, but this needs a bit more thought.


if not items:
    raise RuntimeError(
        f"Your search criteria off {query_parameters} did not return any items"
Review comment (Contributor):
Suggested change
f"Your search criteria off {query_parameters} did not return any items"
f"No items returned from search criteria: {query_parameters}"

Comment on lines 105 to 108
crs_dict = {"init": "epsg:{}".format(epsg)}
src_crs = CRS.from_dict(crs_dict)
if crs is None:
crs = CRS.from_dict(crs_dict)
Review comment (Contributor):

Should be OK to just construct the CRS from the EPSG code directly.

Suggested change
crs_dict = {"init": "epsg:{}".format(epsg)}
src_crs = CRS.from_dict(crs_dict)
if crs is None:
    crs = CRS.from_dict(crs_dict)
src_crs = CRS.from_epsg(epsg)
if crs is None:
    crs = CRS.from_epsg(epsg)

self.index.insert(i, coords, item)

self._crs = crs
self.res = 10
Review comment (Contributor):

Probably shouldn't hardcode the resolution to 10 here, not all STAC datasets have a 10m spatial resolution.

Suggested change
self.res = 10
self.res = res

Reply (Collaborator, Author):

Yes, sorry, this was a mypy workaround. I think mypy fails because res in GeoDataset is expected to be a float, while STACAPIDataset takes res: Optional[float] as an argument.

Reply (Contributor):

Not sure if it helps, but I see other parts of torchgeo using self.res = typing.cast(float, res), e.g.:

self.res = cast(float, res)

Comment on lines +169 to +171
aoi = stack.loc[
    ..., query.maxy : query.miny, query.minx : query.maxx  # type: ignore[misc]
]
Review comment from @weiji14 (Contributor), Apr 17, 2022:

@calebrob6 I have been trying out your suggestion with different bounds and stacks, but here the .loc[] indexing always returns an array of shape [num_items, channels, 0, 0] so no height or width, and I can't seem to figure out what is causing that.

This doesn't work because the stack DataArray has coordinates in a UTM projection, but the query was using longitude/latitude coordinates. You need to use the same coordinate reference system in both for this to work. See my suggestion at L113 (#412 (comment)), which should fix this.

Comment on lines 111 to 113
minx, miny, maxx, maxy = item.bbox

transformer = Transformer.from_crs(src_crs.to_epsg(), crs.to_epsg())
Review comment (Contributor):

The output from item.bbox appears to be longitude/latitude coordinates, i.e. EPSG:4326. I don't know if this is the case for all STAC datasets, but assuming item.bbox is always EPSG:4326, the pyproj.Transformer should do the coordinate transform like so. Oh, and it's usually a good idea to use always_xy=True.

Suggested change
minx, miny, maxx, maxy = item.bbox
transformer = Transformer.from_crs(src_crs.to_epsg(), crs.to_epsg())
minx, miny, maxx, maxy = item.bbox
transformer = Transformer.from_crs(4326, crs.to_epsg(), always_xy=True)

Comment on lines 299 to 302
minx = -148.46876
maxx = -148.31072
miny = 61.0491
maxy = 61.12567489536982
Review comment (Contributor):

Torchgeo's GeoSampler does the bounding box query in the Rtree spatial index's CRS, which would be a UTM projection in this case.

Suggested change
minx = -148.46876
maxx = -148.31072
miny = 61.0491
maxy = 61.12567489536982
minx = 420688.14962388354
maxx = 429392.15007465985
miny = 6769145.954634559
maxy = 6777492.989499866

aoi = stackstac.stack(
    signed_items,
    assets=self.bands,
    bounds_latlon=bounds,
Review comment (Contributor):

The bounds are now not in EPSG:4326 but in the STAC dataset's native CRS (i.e. some UTM projection).

Suggested change
bounds_latlon=bounds,
bounds=bounds,

@nilsleh (Collaborator, Author) commented Apr 19, 2022

At the moment this STACAPIDataset implementation seems to be specific to Sentinel-2, and it also assumes the use of the Planetary Computer. Maybe we should think about how to make this a bit more generic. The easiest way I can think of is to use some try...except or if...then statements to check whether the api_endpoint contains planetarycomputer, but this needs a bit more thought.

Thank you for reviewing and finding the error; I now understand what I was doing wrong. I mainly intended this as a draft, self-contained example of how such a dataset class could work, which is why it is so specific at the moment.

@weiji14 (Contributor) left a review comment:

Thank you for reviewing and finding the error; I now understand what I was doing wrong. I mainly intended this as a draft, self-contained example of how such a dataset class could work, which is why it is so specific at the moment.

No worries, we all need to start somewhere and Sentinel-2 is a pretty common dataset so it makes sense to use this as a test case 😄

Comment on lines +154 to +155
key = "image" if self.is_image else "mask"
sample = {key: image, "crs": self.crs, "bbox": query}
Review comment (Contributor):

Do you think key could be turned into a parameter set by the user, so people can name the dataset directly? I'm thinking of cases where there's a Sentinel-2 input, a Landsat input, or a DEM input, and people might want to keep them separate when merging (using IntersectionDataset).
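
A minimal sketch of that suggestion, with key as a hypothetical user-settable argument that defaults to the current image/mask behaviour:

from typing import Optional

import torch

def make_sample(image: torch.Tensor, is_image: bool = True,
                key: Optional[str] = None) -> dict:
    """Build a sample dict; a user-supplied key overrides image/mask."""
    if key is None:
        key = "image" if is_image else "mask"
    return {key: image}

# Distinct keys keep Sentinel-2 and DEM inputs separate after a merge
sample = {**make_sample(torch.rand(4, 64, 64), key="sentinel2"),
          **make_sample(torch.rand(1, 64, 64), key="dem")}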

@jamesvrt commented Aug 31, 2022

Does this need any support with development? I'd like to contribute my experience with STAC if that would help, in particular to generalize the approach for a range of different STACs. Let me know the best way to do that if so (a PR to your fork, @nilsleh?).

I'm aware potential changes to torchgeo's approach to DataLoaders are incoming (#576 (comment)). Will that require a complete rewrite of this?

@weiji14 (Contributor) commented Aug 31, 2022

I'm aware potential changes to torchgeo's approach to DataLoaders are incoming (#576 (comment)). Will that require a complete rewrite of this?

If we're going for torchdata, then yes, this will need to be rewritten completely. I've been thinking of implementing a torchdata reader that does just the STAC reading part (i.e. no transforms or anything). The idea is to then feed the STAC URLs to another DataPipe that uses stackstac or odc-stac (see also opendatacube/odc-stac#54 (comment)), and then feed that into other 'geo'-DataPipes. Would be great to hear your opinion on this!

@adamjstewart (Collaborator)

Yes, in the long run, the plan is to move towards TorchData, although I don't know when that will happen. I'm fine with having a STACAPIDataset base class while we slowly transition towards TorchData, or I'm fine with experimenting with this as our first TorchData use case. I'm not super familiar with TorchData myself; I need to block off a week where I can experiment with it to better understand what it entails.

@KennSmithDS (Contributor) commented Aug 31, 2022

@weiji14 @jamesvrt having looked through the code for the current STACAPIDataset, it appears there is only support for catalogs served over an API via catalog querying with pystac_client. Is the intention for this custom Dataset to support any STAC catalog, or only those served over an API via pystac_client? How would Torchdata change that integration (I'm not familiar with it either)?

The reason I ask is that I'm going to create a new custom base Dataset class for working with datasets downloaded from Radiant MLHub using our radiant-mlhub Python client v0.5.1+, which will include functionality to build image and mask tensors by traversing a STAC catalog locally. Seems there's potential for overlap here, but perhaps it's best to keep them separate. Thoughts?

@adamjstewart (Collaborator)

Seems there's potential for overlap here, but perhaps it's best to keep them separate.

We can keep them separate at the start, but I imagine a future in which RadiantDataset is a subclass of STACDataset (names TBD). We definitely want to avoid code duplication if we can.

@weiji14 (Contributor) commented Aug 31, 2022

having looked through the code for the current STACAPIDataset, it appears there is only support for catalogs served over an API via catalog querying with pystac_client.

Currently yes, this PR does use pystac_client, and is somewhat hardcoded to the Planetary Computer STAC API.

Is the intention for this custom Dataset to support any STAC catalog, or only those served over an API via pystac_client?

At the very least, we should generalize this PR's implementation to work for any Planetary Computer dataset on https://planetarycomputer.microsoft.com/catalog. Ideally though, it should work for any API endpoint supported by pystac_client.

How would Torchdata change that integration (I'm not familiar with it either)?

Torchdata (see https://pytorch.org/blog/pytorch-1.11-released/#introducing-torchdata) would decompose the current implementation into multiple parts as it follows a composition over inheritance philosophy.

The reason I ask is that I'm going to create a new custom base Dataset class for working with datasets downloaded from Radiant MLHub using our radiant-mlhub Python client v0.5.1+, which will include functionality to build image and mask tensors by traversing a STAC catalog locally. Seems there's potential for overlap here, but perhaps it's best to keep them separate. Thoughts?

Now, I'm not familiar with radiant-mlhub either, but one idea might be to have a PySTAC torch Dataset/DataPipe that has an engine parameter with options of radiant-mlhub or pystac-client (a sketch follows below)? Or we could settle on common ground and build a torch DataPipe on just pystac, with separate radiant-mlhub or pystac-client interfaces on top.
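
The engine idea might look roughly like this (a sketch; Client.open and Dataset.fetch are real client calls, but the wrapper itself and its behaviour are assumptions):

def open_stac(source: str, engine: str = "pystac-client"):
    """Open a STAC source with a pluggable backend (illustrative)."""
    if engine == "pystac-client":
        from pystac_client import Client
        return Client.open(source)  # source is a STAC API endpoint URL
    if engine == "radiant-mlhub":
        from radiant_mlhub import Dataset
        return Dataset.fetch(source)  # source is a Radiant MLHub dataset ID
    raise ValueError(f"unknown engine: {engine!r}")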

@KennSmithDS (Contributor) commented Aug 31, 2022

@adamjstewart @weiji14 thanks for the quick responses!

It seems there is possibly a need for an even more generic base class, STACDataset, based on the STAC specification. Yet the way the catalog references STAC objects might differ depending on whether it's a local directory of Catalog, Collection, and Item objects in JSON files or an API; the key distinction I can think of off the top of my head is that the former can traverse the entire catalog on disk, whereas the latter returns Items from the API with pagination.
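
To illustrate the distinction (the catalog path and collection ID are placeholders):

import pystac
from pystac_client import Client

# Static catalog: the whole tree is on disk and can be walked directly
catalog = pystac.Catalog.from_file("catalog.json")
local_items = list(catalog.get_all_items())

# API: items come back through a paginated search
api = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
remote_items = api.search(collections=["landsat-c2-l2"], max_items=100).get_items()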

I was not intending to use the PySTAC library for the MLHubDataset (or RadiantDataset, whichever) because of performance issues we've run into when loading and validating STAC objects and item relationships within catalogs. When a catalog grows to hundreds of thousands or millions of STAC objects (as is the case in a few of our recent datasets), the computation can become $O(n^2)$ (if I recall correctly; I'll have to double-check the code base). We have found using multiprocessing and direct JSON object manipulation to be much more performant in our workflows. However, PySTAC does a great job of enforcing STAC object compliance with the specification and extensions.

@weiji14 the radiant-mlhub Python client is a library that half a dozen torchgeo datasets use to fetch datasets from our platform. When we released v0.5.1, this gave us reason to consider 1) updating the existing torchgeo datasets to be compliant with the new client functionality, and/or 2) creating a new base class that interacts with our Python client directly, upon which to build child classes that extend the MLHubDataset base class. Our STAC catalogs contain some metadata that is not standard but is still STAC compliant, and we also use a slightly unconventional architecture for serving our catalogs over an API using stac-fastapi, which would make it less than straightforward for STACAPIDataset to work with our datasets directly.

@adamjstewart (Collaborator)

@KennSmithDS didn't realize GitHub supports LaTeX now, you've just made my day!

@jamesvrt commented Sep 1, 2022

If I understand all this correctly, there are two main aspects to consider:

  • Providing access to local STAC Catalogs as well as STAC APIs
    • For local Catalogs, how much flexibility for the reading engine?
    • For APIs, how hardcoded to the Planetary Computer should this be?
    • For APIs, how much flexibility for unconventional architectures (e.g. Radiant MLHub)?
  • Using torchdata or the existing approach
    • If torchdata, focus now on only the STAC reading part (no transforms etc.)?

Given these, I think it makes sense to use torchdata already. It allows more granular control over reading STAC, since having separate STACAPI and STACLocal DataPipes would disconnect two quite different approaches to accessing a STAC. These DataPipes should cover most use cases without modification, but could act as a base for more niche requirements (reading engine, API configuration, auth, etc.). These DataPipes could be connected by a user to whichever raster-reading DataPipe makes sense for their target datastore.

To get a workable first pass at this we might focus first on a STACAPI DataPipe and a stackstac DataPipe, with the intention of serving users of Planetary Computer and open STAC APIs (e.g. element-84's earth search).

What do you think?

@weiji14 (Contributor) commented Sep 1, 2022

Given these, I think it makes sense to use torchdata already. It allows more granular control over reading STAC, since having separate STACAPI and STACLocal DataPipes would disconnect two quite different approaches to accessing a STAC. These DataPipes should cover most use cases without modification, but could act as a base for more niche requirements (reading engine, API configuration, auth, etc.). These DataPipes could be connected by a user to whichever raster-reading DataPipe makes sense for their target datastore.

To get a workable first pass at this we might focus first on a STACAPI DataPipe and a stackstac DataPipe, with the intention of serving users of Planetary Computer and open STAC APIs (e.g. element-84's earth search).

What do you think?

Yep, totally agree on going straight for torchdata! Let's start from a pure and simple STAC-something DataPipe using the basic tools in https://github.com/stac-utils. I've coded up a draft pystac.Item.from_file() DataPipe over at weiji14/zen3geo#46, and we can slowly add in the pystac-client and stackstac DataPipe pieces next 😄

@KennSmithDS (Contributor) commented Sep 1, 2022

@weiji14 thanks for sharing the development on weiji14/zen3geo#46, this is great! It raises a few questions in my mind, looking at the zen3geo readthedocs:

  1. I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI DataPipe requirements?
  2. Is there functionality for reading in the catalog and collections as well, or just items? (looking at the source code, just Items) :)
  3. Above I saw mention of stackstac, which takes in an ItemCollection and returns an xarray. To generalize the workflow a bit: regardless of whether the catalog is local or remote, something reads in STAC Items and makes an iterable/collection of items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

@weiji14 (Contributor) commented Sep 1, 2022

  1. I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI DataPipe requirements?

Yes for STACLocal, no for STACAPI? My impression of a STACAPI DataPipe was that it would be a wrapper around https://github.com/stac-utils/pystac-client, i.e. for STAC Catalogs/APIs. But maybe I'm getting confused by the terminology...

  2. Is there functionality for reading in the catalog and collections as well, or just items? (looking at the source code, just Items) :)

We could have a PySTACCatalogReader wrapping pystac.Catalog.from_file and PySTACCollectionReader wrapping pystac.ItemCollection.from_file perhaps?

  3. Above I saw mention of stackstac, which takes in an ItemCollection and returns an xarray. To generalize the workflow a bit: regardless of whether the catalog is local or remote, something reads in STAC Items and makes an iterable/collection of items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

Yep, and there are probably a few ways to go from many STAC Items to an xarray.DataArray. The power of torchdata is that, because it uses composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer then chains them up based on what they think is best. E.g. STAC Items -> ItemCollection -> stackstac xarray.DataArray; STAC Items -> xarray.DataArrays -> xarray.Dataset; etc. A sketch follows below.
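
A sketch of such a chain in that composition style (the item URL is a placeholder, and nothing here is an existing torchgeo API):

import pystac
import stackstac
from torchdata.datapipes.iter import IterableWrapper

item_urls = IterableWrapper(["https://example.com/stac/item.json"])  # placeholder

# Each stage is an independent DataPipe; swap or reorder stages to build
# a different pipeline from the same pieces.
items_dp = item_urls.map(pystac.Item.from_file)                # STAC Items
collections_dp = items_dp.batch(4).map(pystac.ItemCollection)  # ItemCollections
stacks_dp = collections_dp.map(stackstac.stack)                # xarray.DataArrays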

@adamjstewart (Collaborator)

The power of torchdata is that, because it uses composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer then chains them up based on what they think is best.

@weiji14 just to make sure I understand, is this something that TorchGeo users will need to do, or something that TorchGeo developers will need to do? I'm still hoping to keep a single Dataset instance for each dataset that optionally allows users to choose where to download data from. Is that possible, or is that counter to the idea of DataPipes? Still catching up on this TorchData stuff...

@jamesvrt commented Sep 1, 2022

I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI DataPipe requirements?

@KennSmithDS @weiji14 Maybe I confused the issue by calling the proposed class STACLocal. It should be STACStatic. A static STAC Catalog can be hosted locally or remotely, which is why PySTACItemReader can take either a local file path or a URL. The difference versus accessing a STAC API using pystac_client is that you can search Items by attribute when using the API, something you cannot readily do with a static catalog.

@rbavery commented Sep 1, 2022

The power of torchdata is that, because it uses composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer then chains them up based on what they think is best. E.g. STAC Items -> ItemCollection -> stackstac xarray.DataArray; STAC Items -> xarray.DataArrays -> xarray.Dataset; etc.

Been following this discussion and very interested in the solutions being discussed! I think there might be some solutions for iterating over xarray DataArrays that TorchGeo could standardize on, rather than leaving it to a data engineer to decide whether to represent an imagery dataset as either a DataArray or Dataset.

Above I saw mention of stackstac, which takes in an ItemCollection and returns an xarray. To generalize the workflow a bit: regardless of whether the catalog is local or remote, something reads in STAC Items and makes an iterable/collection of items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

After creating the xarray.DataArray, making an iterable and performant xarray object for data loading sounds like a possible job for xbatcher.

I haven't used xbatcher yet, but on the face of it, it seems like this might be the library to consolidate performance improvements for loading remotely sensed imagery from xarray objects, and maybe there could be a torch DataPipe built around xbatcher. Performance of data loading for ML is an open issue: xarray-contrib/xbatcher#37
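
What that might look like (a sketch, with a dummy array standing in for a real stackstac output):

import numpy as np
import xarray as xr
import xbatcher

# Dummy (band, y, x) array standing in for a stackstac/rioxarray output
da = xr.DataArray(np.zeros((3, 512, 512)), dims=("band", "y", "x"))

# Yield fixed-size chips that can then be handed to torch.as_tensor(...)
bgen = xbatcher.BatchGenerator(da, input_dims={"x": 256, "y": 256})
for chip in bgen:
    assert chip.sizes["x"] == 256 and chip.sizes["y"] == 256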

@adamjstewart (Collaborator)

Before we get too far down the line with xarray discussions, note that TorchGeo does not currently use xarray for anything. We currently rely on rasterio/fiona (aka GDAL) for all file I/O since it can handle reprojection/resampling. Ultimately, all arrays will end up as PyTorch Tensors, so it doesn't really matter what the intermediate step is. But we need something that can support reprojection/resampling, which numpy/torch/xarray cannot. Maybe something like rioxarray is needed.

@weiji14 (Contributor) commented Sep 1, 2022

After creating the xarray.DataArray, making an iterable and performant xarray object for data loading sounds like a possible job for xbatcher.

I haven't used xbatcher yet, but on the face of it, it seems like this might be the library to consolidate performance improvements for loading remotely sensed imagery from xarray objects, and maybe there could be a torch DataPipe built around xbatcher.

Done at weiji14/zen3geo#22, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/chipping.html. (Disclaimer: I'm actually an xbatcher dev).

But we need something that can support reprojection/resampling, which numpy/torch/xarray cannot. Maybe something like rioxarray is needed.

Done at weiji14/zen3geo#6, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/walkthrough.html#read-using-rioxarrayreader

Anyways, this DataPipe discussion really should be in #576, let's focus on the STAC discussion 🙂

@jamesvrt commented Sep 1, 2022

@adamjstewart Xarray comes as a requirement of stackstac and odc-stac, if we were to make pipelines using either of those packages (similar to what @nilsleh has done in this PR). They're not necessary by any means, but they may simplify the STAC Item --> Tensor piece. Agreed, rioxarray is an extra requirement for reprojection and general spatial awareness.

More STAC related... shall I make a start on the STACAPI DataPipe? Maybe we can use a draft PR to drive further discussion.

@nilsleh (Collaborator, Author) commented Sep 2, 2022

Apologies, I have been busy finishing my master's thesis until a couple of days ago. I was planning to pick this up again, but it seems like there is a lot of interest and expertise around this already. When I started this it was merely an exploration of one approach. I don't have a preference for how you proceed, so feel free to take this over @jamesvrt @weiji14 @KennSmithDS

@jamesvrt commented Sep 2, 2022

Sorry @nilsleh, I didn't mean to sidetrack your work here. Congratulations on finishing your master's. Happy to move this discussion elsewhere if that makes more sense to the organisers of this project.

I made a start at writing a DataPipe for a generic STAC API, but having done a little digging into torchdata, I'm not sure that DataPipes are intended for API calls, because they operate on iterables or mappable objects. Given that pystac_client already exists for the purpose of retrieving Items from a STAC API as an iterable, writing a DataPipe around it that is called only once might add redundancy and introduce an unnecessary requirement to torchgeo. I don't have a good grasp of torchdata yet, so please let me know if these findings are inaccurate.

If I'm correct about the above, maybe it makes more sense for the user to retrieve all of their desired STAC Items up front using their own method (probably pystac related) and feed these as an iterable into a class built on torchdata.datapipes.iter.IterableWrapper, like the one @weiji14 created in weiji14/zen3geo#46. Something like this:

from pystac_client import Client
from torchgeo.datapipes import STACItemReader  # proposed; does not exist yet

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

bbox_of_interest = [-122.27, 47.54, -121.96, 47.74]  # example bbox
time_of_interest = "2021-07-01/2021-08-01"  # example date range

search = catalog.search(
    collections=["landsat-c2-l2"],
    bbox=bbox_of_interest,
    datetime=time_of_interest,
    query={"eo:cloud_cover": {"lt": 10}},
)

items = list(search.get_items())
dp = STACItemReader(iterable=items)

This approach has the advantage of allowing users to use their own Item discovery methods for static or API-based STACs.

The difficult part, as I see it, is selecting the required Item Assets and providing an appropriate method for IO. For those who aren't familiar with STAC, a single Item can contain many Assets (e.g. individually stored bands, masks, thumbnails), with no requirement for these assets to be stored in the same location (e.g. S3) or format (e.g. Cloud Optimized GeoTIFF). Thankfully, torchdata already contains many DataPipes for different locations and formats (see pytorch/data#672 and the TorchData IO DataPipes).

In my mind, an ideal STACItemReader would:

  • Take an iterable of STAC Items (Iterable[pystac.Item] or Iterable[Dict])
  • Allow the user to specify which Assets to retrieve from each Item (e.g. list of bands)
  • Provide the correct IO for each Asset given its (location, format)
  • Capture the appropriate metadata from the STAC Item. For example, imagery stored in a non-georeferenced format will require the geometry attribute of the Item.

stackstac and odc-stac already do these in their own way. I don't know how well they will work in a DataPipe, given they use xarray and its lazy loading.

@weiji14 (Contributor) commented Sep 2, 2022

In my mind, an ideal STACItemReader would:

  • Take an iterable of STAC Items (Iterable[pystac.Item] or Iterable[Dict])
  • Allow the user to specify which Assets to retrieve from each Item (e.g. list of bands)
  • Provide the correct IO for each Asset given its (location, format)
  • Capture the appropriate metadata from the STAC Item. For example, imagery stored in a non-georeferenced format will require the geometry attribute of the Item.

You're on the right track 😄 This entire thing sounds more like a DataPipeLine than a single DataPipe, but I think a curated DataPipeLine like this (which chains together multiple DataPipes) is what torchgeo will end up providing (once the matter of optional dependencies in #662 is resolved).

While it's possible to create a DataPipe with several DataPipes inside, you only have to look at torchdata's code (e.g. https://github.com/pytorch/data/tree/v0.4.0/torchdata/datapipes/iter) to see that each official torch DataPipe focuses on one task only, e.g. one function with one dependency. So I recommend having STACItemReader do only STAC Item reading, and having a separate StackSTACReader read those STAC Item objects into an xarray.DataArray (while handling the Asset loading/filtering). This separation of concerns perhaps makes it less user friendly, but it's easier to write and maintain for developers and better suited to power users wanting full customization of their data pipeline.

P.S. I've started a page at weiji14/zen3geo#48 summarizing some of the ideas in this thread, feel free anyone to continue the discussion there!

P.P.S. Sorry @nilsleh for going off track from your PR, I'll keep quiet now 🙂
