
Add STACAPI dataset #412

Draft · nilsleh wants to merge 3 commits into main
Conversation

@nilsleh (Collaborator) commented Feb 17, 2022

This PR closes #403 by adding a STACAPI dataset. Here is the thinking behind the choices I made:

  • constructor takes the usual root, crs, res, and transforms arguments, like other raster datasets
  • additionally, there are bands, api_endpoint, and **query_parameters arguments. I kept bands separate because other raster datasets also specifically have this argument (see the usage sketch below)
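
For illustration, a hypothetical constructor call matching the arguments described above (the endpoint and query values are examples only, and the import assumes this PR's proposed class):

from torchgeo.datasets import STACAPIDataset  # proposed in this PR

ds = STACAPIDataset(
    root="data",
    crs=None,  # fall back to a CRS inferred from the returned items
    res=10.0,
    bands=["B04", "B03", "B02"],
    api_endpoint="https://planetarycomputer.microsoft.com/api/stac/v1",
    # everything below is forwarded as **query_parameters
    collections=["sentinel-2-l2a"],
    datetime="2021-06-01/2021-07-01",
)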

When the __getitem__ method receives a bbox query:

  • each item among the returned hits is merged into a single raster and computed, resulting in a 4D xarray of shape (time x channel x height x width)
  • since multiple hits can be returned for a particular bbox query, these have to be merged. Usually this happens with rasterio, because the inputs are data-reader objects or file paths, not the xarrays that stackstac produces. The only library I found that merges xarrays is rioxarray, so this library is currently also required
  • however, since rioxarray.merge only supports 2D or 3D arrays, and a STAC search can return many samples over a time period, I loop over the time dimension, merge with rioxarray per time step, and afterwards stack everything back together into one final tensor, which is returned as the image in the sample (see the sketch after this list)
  • the time dimension is squeezed out if there is only a single time instance
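
A minimal sketch of that per-time-step merge, assuming stacks is a list of overlapping 4D (time, band, y, x) DataArrays produced by stackstac (the function name and details are illustrative):

import xarray as xr
from rioxarray.merge import merge_arrays

def merge_hits(stacks):
    """Merge overlapping 4D (time, band, y, x) stacks into one DataArray."""
    merged_steps = []
    for t in stacks[0].time.values:
        # merge_arrays only supports 2D/3D arrays, so merge one
        # 3D (band, y, x) time slice at a time
        merged_steps.append(merge_arrays([s.sel(time=t) for s in stacks]))
    merged = xr.concat(merged_steps, dim="time")
    # squeeze out the time dimension for single-timestamp queries
    return merged.squeeze("time") if merged.sizes["time"] == 1 else merged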

Questions / Left to do:

  • several new libraries are needed; maybe some, like rioxarray, can be avoided?
  • computing the aoi can take fairly long, although it is much faster than I expected
  • handling the self.res resolution parameter when each band has a different resolution
  • deciding whether an image should go under the image or mask key, because a STAC API serves both images and labels; maybe another constructor argument?
  • given the last point, we need to implement a plot method that can handle both scenarios (the dataset containing images or labels)
  • testing with sampler and dataloader

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Feb 17, 2022
@nilsleh nilsleh marked this pull request as draft February 17, 2022 21:00
@adamjstewart adamjstewart added this to the 0.3.0 milestone Feb 18, 2022
@calebrob6 (Member) commented Feb 20, 2022

instead of what is currently in __getitem__ we can use something like:

items = TODO  # list of the STAC Items that the current query intersects with (based on the results from the rtree)

stack = stackstac.stack(
    items,
    assets=self.bands,
    resolution=self.res,
    epsg=TODO,  # convert self._crs to an epsg code
)

aoi = stack.loc[..., maxy:miny, minx:maxx]
# can do some checking here to see what's up with the time dimension
data = aoi.compute()
out_image = data.data
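
For the epsg=TODO line, rasterio's CRS objects can usually report their EPSG code directly; a minimal sketch:

from rasterio.crs import CRS

crs = CRS.from_string("EPSG:32606")  # stand-in for self._crs
epsg = crs.to_epsg()  # 32606; returns None if no EPSG code matches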

@calebrob6 (Member)

several new libraries are needed; maybe some, like rioxarray, can be avoided?

Generally, we want to minimize direct dependencies.

computing the aoi can take fairly long, although it is much faster than I expected

I have previously benchmarked different ways of transforming coordinates (https://gist.github.com/calebrob6/334af85deb4c862689020ae4b3344bfc) and pyproj was by far the fastest. The constructor may be sped up if you replace the fiona.transform.transform(...) call with a pyproj version.
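
What the pyproj version might look like (a sketch; the target UTM zone and coordinates are arbitrary examples):

from pyproj import Transformer

# Build the transformer once and reuse it across items; constructing it
# is the expensive part. always_xy=True keeps (x, y) = (lon, lat)
# ordering regardless of the CRS's axis conventions.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32606", always_xy=True)

minx, miny, maxx, maxy = -148.46876, 61.0491, -148.31072, 61.12567
(minx, maxx), (miny, maxy) = transformer.transform([minx, maxx], [miny, maxy])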

handling the self.res resolution parameter when each band has a different resolution

I think we have to reproject to a single resolution, and it makes sense to choose the highest resolution if the user doesn't specify one (which is what stackstac does by default).

deciding whether an image should go under the image or mask key, because a STAC API serves both images and labels; maybe another constructor argument?

I think this makes sense -- what do you think @adamjstewart?

given the last point, we need to implement a plot method that can handle both scenarios (the dataset containing images or labels)

I think we can't have a plot method for this class, as it is too generic (e.g. it could be used to grab anything on the Planetary Computer).

@calebrob6 (Member)

Another thought -- for the Planetary Computer specifically, we will need to sign STAC Items before we can load data from them. Perhaps there should be a Callable "item_transformer" argument that takes an Item as input and returns an Item, and is the identity transform by default? For the Planetary Computer case we can provide a transform that signs the item.

@TomAugspurger, any thoughts in general here?
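
A rough sketch of that idea, with item_transformer as a hypothetical constructor argument; planetary_computer.sign is a real function that accepts a pystac.Item and returns a signed copy:

import planetary_computer
import pystac

from torchgeo.datasets import STACAPIDataset  # proposed in this PR

def identity(item: pystac.Item) -> pystac.Item:
    """Default item_transformer: pass items through unchanged."""
    return item

# Hypothetical usage (other constructor arguments omitted): sign each
# Planetary Computer item so its asset URLs carry valid SAS tokens.
ds = STACAPIDataset(
    api_endpoint="https://planetarycomputer.microsoft.com/api/stac/v1",
    item_transformer=planetary_computer.sign,
)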

@adamjstewart (Collaborator)

For me, the biggest remaining question is how to integrate this with RasterDataset/VectorDataset. Is the intention that we will have subclasses of STACAPIDataset for each dataset on the Planetary Computer? For things like Landsat and Sentinel that are available in both STAC and non-STAC forms, will we have a single class or will we have Landsat and LandsatSTAC classes?

@calebrob6 (Member)

how to integrate this with RasterDataset/VectorDataset

What do you mean here? (e.g. I would consider this integrated as it extends RasterDataset)

Is the intention that we will have subclasses of STACAPIDataset for each dataset on the Planetary Computer?

No, I don't think so. Users could easily use it with the Landsat8 collection on the Planetary Computer, but would need to do more work if they wanted the features that come with the Landsat dataset.

# from torchgeo.samplers import RandomGeoSampler


class STACAPIDataset(RasterDataset, abc.ABC):
Review comment (Collaborator):

Suggested change
class STACAPIDataset(RasterDataset, abc.ABC):
class STACAPIDataset(GeoDataset):

This is neither an ABC (there are no abstract methods), nor does it need to be a subclass of RasterDataset (it doesn't re-use any of the logic from RasterDataset). I would consider this to be just another possible base class that subclasses GeoDataset.

@adamjstewart (Collaborator)

@calebrob6 I guess we're not on the same page then; I've always thought of STAC integration as no different than finding files on your filesystem. My hope was that we could have a single dataset Landsat (or rather, Landsat8, etc.) that would work for both locally downloaded files without a STAC index and remote files with a STAC index. Just like we have subclasses of RasterDataset for various datasets, I think we'll want subclasses of STACAPIDataset for each dataset. Is there any reason that we don't need this? Or would you suggest that Landsat8 isn't needed and users should just directly use RasterDataset and pass in a file glob or list of files or something?

@calebrob6 (Member)

I see what you mean now -- I think we're not on the same page simply because you are one chapter ahead of me ;)

I've always thought of STAC integration as no different than finding files on your filesystem

I like this! Some thoughts:

  • If we implement this for e.g. Landsat8 then we would probably be coupled to a particular STAC API (e.g. "the planetary computer API").
  • I think STACAPIDataset should still be able to function independently, similar to how you can pass some TIFFs to RasterDataset, so that we don't have to implement a specific dataset for every possible collection that a user wants to use that is hosted behind a STAC API.
  • Maybe a "mixin" type pattern is appropriate for implementing this? Landsat8 is "STACAPI'able", Sentinel2 is "STACAPI'able", etc.

@nilsleh (Collaborator, Author) commented Feb 27, 2022

instead of what is currently in __getitem__ we can use something like: [the stackstac.stack suggestion quoted above]

@calebrob6 I have been trying out your suggestion with different bounds and stacks, but here the .loc[] indexing always returns an array of shape [num_items, channels, 0, 0] so no height or width, and I can't seem to figure out what is causing that.

@weiji14 (Contributor) left a review comment:

Hi @nilsleh, thanks for opening up this PR, can't wait to see torchgeo be able to use STAC datasets directly! I've got a few suggested changes, including one that should hopefully fix the empty array problem you found.

At the moment this STACAPIDataset implementation seems to be specific to Sentinel-2, and it also assumes the use of the Planetary Computer. Maybe we should think about how to make this a bit more generic. The easiest way I can think of is to use some try...except or if...then statements to check whether the api_endpoint contains planetarycomputer, but this needs a bit more thought.


if not items:
    raise RuntimeError(
        f"Your search criteria off {query_parameters} did not return any items"
Review comment (Contributor):
Suggested change
f"Your search criteria off {query_parameters} did not return any items"
f"No items returned from search criteria: {query_parameters}"

Comment on lines 105 to 108
crs_dict = {"init": "epsg:{}".format(epsg)}
src_crs = CRS.from_dict(crs_dict)
if crs is None:
crs = CRS.from_dict(crs_dict)
Review comment (Contributor):

Should be OK to just construct the CRS from the EPSG code directly.

Suggested change
crs_dict = {"init": "epsg:{}".format(epsg)}
src_crs = CRS.from_dict(crs_dict)
if crs is None:
    crs = CRS.from_dict(crs_dict)
src_crs = CRS.from_epsg(epsg)
if crs is None:
    crs = CRS.from_epsg(epsg)

self.index.insert(i, coords, item)

self._crs = crs
self.res = 10
Review comment (Contributor):

Probably shouldn't hardcode the resolution to 10 here, not all STAC datasets have a 10m spatial resolution.

Suggested change
self.res = 10
self.res = res

Reply (Collaborator, Author):

Yes, sorry, this was a mypy workaround. I think mypy fails because res in GeoDataset is expected to be a float, while STACAPIDataset takes res: Optional[float] as an argument.

Reply (Contributor):

Not sure if it helps, but I see other parts of torchgeo using self.res = typing.cast(float, res), e.g.:

self.res = cast(float, res)

Comment on lines +169 to +171
aoi = stack.loc[
    ..., query.maxy : query.miny, query.minx : query.maxx  # type: ignore[misc]
]
Review comment from @weiji14 (Contributor), Apr 17, 2022:

@calebrob6 I have been trying out your suggestion with different bounds and stacks, but here the .loc[] indexing always returns an array of shape [num_items, channels, 0, 0] so no height or width, and I can't seem to figure out what is causing that.

This doesn't work because the stack DataArray has coordinates in a UTM projection, but the query was using longitude/latitude coordinates. You need to use the same coordinate reference system in both for this to work. See my suggestion at L113 (#412 (comment)), which should fix this.

Comment on lines 111 to 113
minx, miny, maxx, maxy = item.bbox

transformer = Transformer.from_crs(src_crs.to_epsg(), crs.to_epsg())
Review comment (Contributor):

The output from item.bbox appears to be longitude/latitude coordinates, i.e. EPSG:4326. I don't know if this is the case for all STAC datasets, but assuming item.bbox is always EPSG:4326, the pyproj.Transformer should do the coordinate transform like so. Oh, and it's usually a good idea to use always_xy=True.

Suggested change
minx, miny, maxx, maxy = item.bbox
transformer = Transformer.from_crs(src_crs.to_epsg(), crs.to_epsg())
minx, miny, maxx, maxy = item.bbox
transformer = Transformer.from_crs(4326, crs.to_epsg(), always_xy=True)

Comment on lines 299 to 302
minx = -148.46876
maxx = -148.31072
miny = 61.0491
maxy = 61.12567489536982
Review comment (Contributor):

Torchgeo's GeoSampler does the bounding box query in the Rtree spatial index's CRS, which would be a UTM projection in this case.

Suggested change
minx = -148.46876
maxx = -148.31072
miny = 61.0491
maxy = 61.12567489536982
minx = 420688.14962388354
maxx = 429392.15007465985
miny = 6769145.954634559
maxy = 6777492.989499866

aoi = stackstac.stack(
    signed_items,
    assets=self.bands,
    bounds_latlon=bounds,
Review comment (Contributor):

The bounds are now not in EPSG:4326 but in the STAC dataset's native CRS (i.e. some UTM projection).

Suggested change
bounds_latlon=bounds,
bounds=bounds,

@nilsleh (Collaborator, Author) commented Apr 19, 2022

At the moment this STACAPIDataset implementation seems to be specific to Sentinel-2, and it also assumes the use of the Planetary Computer. Maybe we should think about how to make this a bit more generic. The easiest way I can think of is to use some try...except or if...then statements to check whether the api_endpoint contains planetarycomputer, but this needs a bit more thought.

Thank you for reviewing and finding the error; I now understand what I was doing wrong. I mainly intended this as a draft, self-contained example of how such a dataset class could work, which is why it is so specific at the moment.

@weiji14 (Contributor) left a review comment:

Thank you for reviewing and finding the error; I now understand what I was doing wrong. I mainly intended this as a draft, self-contained example of how such a dataset class could work, which is why it is so specific at the moment.

No worries, we all need to start somewhere and Sentinel-2 is a pretty common dataset so it makes sense to use this as a test case 😄

Comment on lines +154 to +155
key = "image" if self.is_image else "mask"
sample = {key: image, "crs": self.crs, "bbox": query}
Review comment (Contributor):

Do you think key could be turned into a parameter set by the user, so people can name the dataset directly? I'm thinking of cases where there's a Sentinel-2 input, a Landsat input, or a DEM input, and people might want to keep them separate when merging (using IntersectionDataset).
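
A minimal sketch of that suggestion, with key as a hypothetical user-settable argument that defaults to the current image/mask behaviour:

from typing import Optional

import torch

def make_sample(image: torch.Tensor, is_image: bool = True,
                key: Optional[str] = None) -> dict:
    """Build a sample dict; a user-supplied key overrides image/mask."""
    if key is None:
        key = "image" if is_image else "mask"
    return {key: image}

# Distinct keys keep Sentinel-2 and DEM inputs separate after a merge
sample = {**make_sample(torch.rand(4, 64, 64), key="sentinel2"),
          **make_sample(torch.rand(1, 64, 64), key="dem")}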

@jamesvrt commented Aug 31, 2022

Does this need any support with development? I'd like to contribute my experience with STAC if that would help, in particular to generalize the approach for a range of different STACs. Let me know the best way to do that if so (a PR to your fork, @nilsleh?).

I'm aware potential changes to torchgeo's approach to DataLoaders are incoming (#576 (comment)). Will that require a complete rewrite of this?

@weiji14 (Contributor) commented Aug 31, 2022

I'm aware potential changes to torchgeo's approach to DataLoaders are incoming (#576 (comment)). Will that require a complete rewrite of this?

If we're going for torchdata, then yes, this will need to be rewritten completely. I've been thinking of implementing a torchdata reader that does just the STAC reading part (i.e. no transforms or anything). The idea is to then feed the STAC URLs to another DataPipe that uses stackstac or odc-stac (see also opendatacube/odc-stac#54 (comment)), and then feed that into other 'geo'-DataPipes. Would be great to hear your opinion on this!

@adamjstewart (Collaborator)

Yes, in the long run, the plan is to move towards TorchData, although I don't know when that will happen. I'm fine with having a STACAPIDataset base class while we slowly transition towards TorchData, or I'm fine with experimenting with this as our first TorchData use case. I'm not super familiar with TorchData myself; I need to block off a week where I can experiment with it to better understand what it entails.

@KennSmithDS (Contributor) commented Aug 31, 2022

@weiji14 @jamesvrt having looked through the code for the current STACAPIDataset, it appears there is only support for catalogs served over an API via catalog querying with pystac_client. Is the intention for this custom Dataset to support any STAC catalog, or only those served over an API via pystac_client? How would Torchdata change that integration (I'm not familiar with it either)?

The reason I ask is that I'm going to create a new custom base Dataset class for working with datasets downloaded from Radiant MLHub using our radiant-mlhub Python client v0.5.1+, which will include functionality to build image and mask tensors by traversing a STAC catalog locally. Seems there's potential for overlap here, but perhaps it's best to keep them separate. Thoughts?

@adamjstewart (Collaborator)

Seems there's potential for overlap here, but perhaps it's best to keep them separate.

We can keep them separate at the start, but I imagine a future in which RadiantDataset is a subclass of STACDataset (names TBD). We definitely want to avoid code duplication if we can.

@weiji14 (Contributor) commented Aug 31, 2022

having looked through the code for the current STACAPIDataset, it appears there is only support for catalogs served over an API via catalog querying with pystac_client.

Currently yes, this PR does use pystac_client, and is somewhat hardcoded to the Planetary Computer STAC API.

Is the intention for this custom Dataset to support any STAC catalog, or only those served over an API via pystac_client?

At the very least, we should generalize this PR's implementation to work for any Planetary Computer dataset on https://planetarycomputer.microsoft.com/catalog. Ideally though, it should work for any API endpoint supported by pystac_client.

How would Torchdata change that integration (I'm not familiar with it either)?

Torchdata (see https://pytorch.org/blog/pytorch-1.11-released/#introducing-torchdata) would decompose the current implementation into multiple parts as it follows a composition over inheritance philosophy.

The reason I ask is that I'm going to create a new custom base Dataset class for working with datasets downloaded from Radiant MLHub using our radiant-mlhub Python client v0.5.1+, which will include functionality to build image and mask tensors by traversing a STAC catalog locally. Seems there's potential for overlap here, but perhaps it's best to keep them separate. Thoughts?

Now, I'm not familiar with radiant-mlhub either, but one idea might be to have a PySTAC torch Dataset/DataPipe that has an engine parameter with options of radiant-mlhub or pystac-client (a sketch follows below)? Or we could settle on common ground and build a torch DataPipe on just pystac, with separate radiant-mlhub or pystac-client interfaces on top.
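
The engine idea might look roughly like this (a sketch; Client.open and Dataset.fetch are real client calls, but the wrapper itself and its behaviour are assumptions):

def open_stac(source: str, engine: str = "pystac-client"):
    """Open a STAC source with a pluggable backend (illustrative)."""
    if engine == "pystac-client":
        from pystac_client import Client
        return Client.open(source)  # source is a STAC API endpoint URL
    if engine == "radiant-mlhub":
        from radiant_mlhub import Dataset
        return Dataset.fetch(source)  # source is a Radiant MLHub dataset ID
    raise ValueError(f"unknown engine: {engine!r}")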

@KennSmithDS (Contributor) commented Aug 31, 2022

@adamjstewart @weiji14 thanks for the quick responses!

It seems there is possibly a need for an even more generic base class, STACDataset, based on the STAC specification. Yet the way the catalog references STAC objects might differ depending on whether it's a local directory of Catalog, Collection, and Item objects in JSON files or an API; the key distinction I can think of off the top of my head is that the former can traverse the entire catalog on disk, whereas the latter returns Items from the API with pagination.
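
To illustrate the distinction (the catalog path and collection ID are placeholders):

import pystac
from pystac_client import Client

# Static catalog: the whole tree is on disk and can be walked directly
catalog = pystac.Catalog.from_file("catalog.json")
local_items = list(catalog.get_all_items())

# API: items come back through a paginated search
api = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
remote_items = api.search(collections=["landsat-c2-l2"], max_items=100).get_items()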

I was not intending to use the PySTAC library for the MLHubDataset (or RadiantDataset, whichever) because of performance issues we've run into when loading and validating STAC objects and item relationships within catalogs. When a catalog grows to hundreds of thousands or millions of STAC objects (as is the case in a few of our recent datasets), the computation can become $O(n^2)$ (if I recall correctly; I'll have to double-check the code base). We have found using multiprocessing and direct JSON object manipulation to be much more performant in our workflows. However, PySTAC does a great job of enforcing STAC object compliance with the specification and extensions.

@weiji14 the radiant-mlhub Python client is a library that half a dozen torchgeo datasets use to fetch datasets from our platform. When we released v0.5.1, this gave us reason to consider 1) updating the existing torchgeo datasets to be compliant with the new client functionality, and/or 2) creating a new base class that interacts with our Python client directly, upon which to build child classes that extend the MLHubDataset base class. Our STAC catalogs contain some metadata that is not standard but is still STAC compliant, and we also use a slightly unconventional architecture for serving our catalogs over an API using stac-fastapi, which would make it less than straightforward for STACAPIDataset to work with our datasets directly.

@adamjstewart (Collaborator)

@KennSmithDS didn't realize GitHub supports LaTeX now, you've just made my day!

@jamesvrt commented Sep 1, 2022

If I understand all this correctly, there are two main aspects to consider:

  • Providing access to local STAC Catalogs as well as STAC APIs
    • For local Catalogs, how much flexibility for the reading engine?
    • For APIs, how hardcoded to the Planetary Computer should this be?
    • For APIs, how much flexibility for unconventional architectures (e.g. Radiant MLHub)?
  • Using torchdata or the existing approach
    • If torchdata, focus now on only the STAC reading part (no transforms etc.)?

Given these, I think it makes sense to use torchdata already. It allows more granular control over reading STAC, since having separate STACAPI and STACLocal DataPipes would disconnect two quite different approaches to accessing a STAC. These DataPipes should cover most use cases without modification, but could act as a base for more niche requirements (reading engine, API configuration, auth, etc.). These DataPipes could be connected by a user to whichever raster-reading DataPipe makes sense for their target datastore.

To get a workable first pass at this we might focus first on a STACAPI DataPipe and a stackstac DataPipe, with the intention of serving users of Planetary Computer and open STAC APIs (e.g. element-84's earth search).

What do you think?

@weiji14 (Contributor) commented Sep 1, 2022

Given these, I think it makes sense to use torchdata already. It allows more granular control over reading STAC, since having separate STACAPI and STACLocal DataPipes would disconnect two quite different approaches to accessing a STAC. These DataPipes should cover most use cases without modification, but could act as a base for more niche requirements (reading engine, API configuration, auth, etc.). These DataPipes could be connected by a user to whichever raster-reading DataPipe makes sense for their target datastore.

To get a workable first pass at this we might focus first on a STACAPI DataPipe and a stackstac DataPipe, with the intention of serving users of Planetary Computer and open STAC APIs (e.g. element-84's earth search).

What do you think?

Yep, totally agree on going straight for torchdata! Let's start from a pure and simple STAC-something DataPipe using the basic tools in https://github.com/stac-utils. I've coded up a draft pystac.Item.from_file() DataPipe over at weiji14/zen3geo#46, and we can slowly add in the pystac-client and stackstac DataPipe pieces next 😄

@KennSmithDS (Contributor) commented Sep 1, 2022

@weiji14 thanks for sharing the development on weiji14/zen3geo#46, this is great! It raises a few questions in my mind, looking at the zen3geo readthedocs:

  1. I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI DataPipe requirements?
  2. Is there functionality for reading in the catalog and collections as well, or just items? (looking at the source code, just Items) :)
  3. Above I saw mention of stackstac, which takes in an ItemCollection and returns an xarray. To generalize the workflow a bit: regardless of whether the catalog is local or remote, something reads in STAC Items and makes an iterable/collection of items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

@weiji14 (Contributor) commented Sep 1, 2022

  1. I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI DataPipe requirements?

Yes for STACLocal, no for STACAPI? My impression of a STACAPI DataPipe was that it would be a wrapper around https://github.com/stac-utils/pystac-client, i.e. for STAC Catalogs/APIs. But maybe I'm getting confused by the terminology...

  2. Is there functionality for reading in the catalog and collections as well, or just items? (looking at the source code, just Items) :)

We could have a PySTACCatalogReader wrapping pystac.Catalog.from_file and PySTACCollectionReader wrapping pystac.ItemCollection.from_file perhaps?

  3. Above I saw mention of stackstac, which takes in an ItemCollection and returns an xarray. To generalize the workflow a bit: regardless of whether the catalog is local or remote, something reads in STAC Items and makes an iterable/collection of items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

Yep, and there are probably a few ways to go from many STAC Items to an xarray.DataArray. The power of torchdata is that, because it uses composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer then chains them up based on what they think is best. E.g. STAC Items -> ItemCollection -> stackstac xarray.DataArray; STAC Items -> xarray.DataArrays -> xarray.Dataset; etc. A sketch follows below.
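
A sketch of such a chain in that composition style (the item URL is a placeholder, and nothing here is an existing torchgeo API):

import pystac
import stackstac
from torchdata.datapipes.iter import IterableWrapper

item_urls = IterableWrapper(["https://example.com/stac/item.json"])  # placeholder

# Each stage is an independent DataPipe; swap or reorder stages to build
# a different pipeline from the same pieces.
items_dp = item_urls.map(pystac.Item.from_file)                # STAC Items
collections_dp = items_dp.batch(4).map(pystac.ItemCollection)  # ItemCollections
stacks_dp = collections_dp.map(stackstac.stack)                # xarray.DataArrays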

@adamjstewart (Collaborator)

The power of torchdata is that, because it uses composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer then chains them up based on what they think is best.

@weiji14 just to make sure I understand, is this something that TorchGeo users will need to do, or something that TorchGeo developers will need to do? I'm still hoping to keep a single Dataset instance for each dataset that optionally allows users to choose where to download data from. Is that possible, or is that counter to the idea of DataPipes? Still catching up on this TorchData stuff...

@jamesvrt commented Sep 1, 2022

I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI DataPipe requirements?

@KennSmithDS @weiji14 Maybe I confused the issue by calling the proposed class STACLocal. It should be STACStatic. A static STAC Catalog can be hosted locally or remotely, which is why PySTACItemReader can take either a local file path or a URL. The difference versus accessing a STAC API using pystac_client is that you can search Items by attribute when using the API, something you cannot readily do with a static catalog.

@rbavery commented Sep 1, 2022

The power of torchdata is that, because it uses composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer then chains them up based on what they think is best. E.g. STAC Items -> ItemCollection -> stackstac xarray.DataArray; STAC Items -> xarray.DataArrays -> xarray.Dataset; etc.

Been following this discussion and very interested in the solutions being discussed! I think there might be some solutions for iterating over xarray DataArrays that TorchGeo could standardize on, rather than leaving it to a data engineer to decide whether to represent an imagery dataset as either a DataArray or Dataset.

Above I saw mention of stackstac, which takes in an ItemCollection and returns an xarray. To generalize the workflow a bit: regardless of whether the catalog is local or remote, something reads in STAC Items and makes an iterable/collection of items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

After creating the xarray.DataArray, making an iterable and performant xarray object for data loading sounds like a possible job for xbatcher.

I haven't used xbatcher yet, but on the face of it, it seems like this might be the library to consolidate performance improvements for loading remotely sensed imagery from xarray objects, and maybe there could be a torch DataPipe built around xbatcher. Performance of data loading for ML is an open issue: xarray-contrib/xbatcher#37
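
What that might look like (a sketch, with a dummy array standing in for a real stackstac output):

import numpy as np
import xarray as xr
import xbatcher

# Dummy (band, y, x) array standing in for a stackstac/rioxarray output
da = xr.DataArray(np.zeros((3, 512, 512)), dims=("band", "y", "x"))

# Yield fixed-size chips that can then be handed to torch.as_tensor(...)
bgen = xbatcher.BatchGenerator(da, input_dims={"x": 256, "y": 256})
for chip in bgen:
    assert chip.sizes["x"] == 256 and chip.sizes["y"] == 256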

@adamjstewart (Collaborator)

Before we get too far down the line with xarray discussions, note that TorchGeo does not currently use xarray for anything. We currently rely on rasterio/fiona (aka GDAL) for all file I/O since it can handle reprojection/resampling. Ultimately, all arrays will end up as PyTorch Tensors, so it doesn't really matter what the intermediate step is. But we need something that can support reprojection/resampling, which numpy/torch/xarray cannot. Maybe something like rioxarray is needed.

@weiji14 (Contributor) commented Sep 1, 2022

After creating the xarray.DataArray, making an iterable and performant xarray object for data loading sounds like a possible job for xbatcher.

I haven't used xbatcher yet, but on the face of it, it seems like this might be the library to consolidate performance improvements for loading remotely sensed imagery from xarray objects, and maybe there could be a torch DataPipe built around xbatcher.

Done at weiji14/zen3geo#22, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/chipping.html. (Disclaimer: I'm actually an xbatcher dev).

But we need something that can support reprojection/resampling, which numpy/torch/xarray cannot. Maybe something like rioxarray is needed.

Done at weiji14/zen3geo#6, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/walkthrough.html#read-using-rioxarrayreader

Anyways, this DataPipe discussion really should be in #576, let's focus on the STAC discussion 🙂

@jamesvrt commented Sep 1, 2022

@adamjstewart Xarray comes as a requirement of stackstac and odc-stac, if we were to make pipelines using either of those packages (similar to what @nilsleh has done in this PR). They're not necessary by any means, but they may simplify the STAC Item --> Tensor piece. Agreed, rioxarray is an extra requirement for reprojection and general spatial awareness.

More STAC related... shall I make a start on the STACAPI DataPipe? Maybe we can use a draft PR to drive further discussion.

@nilsleh (Collaborator, Author) commented Sep 2, 2022

Apologies, I have been busy finishing my master's thesis until a couple of days ago. I was planning to pick this up again, but it seems like there is a lot of interest and expertise around this already. When I started this it was merely an exploration of one approach. I don't have a preference for how you proceed, so feel free to take this over @jamesvrt @weiji14 @KennSmithDS

@jamesvrt commented Sep 2, 2022

Sorry @nilsleh, I didn't mean to sidetrack your work here. Congratulations on finishing your master's. Happy to move this discussion elsewhere if that makes more sense to the organisers of this project.

I made a start at writing a DataPipe for a generic STAC API, but having done a little digging into torchdata, I'm not sure that DataPipes are intended for API calls, because they operate on iterables or mappable objects. Given that pystac_client already exists for the purpose of retrieving Items from a STAC API as an iterable, writing a DataPipe around it that is called only once might add redundancy and introduce an unnecessary requirement to torchgeo. I don't have a good grasp of torchdata yet, so please let me know if these findings are inaccurate.

If I'm correct about the above, maybe it makes more sense for the user to retrieve all of their desired STAC Items up front using their own method (probably pystac related) and feed these as an iterable into a class built on torchdata.datapipes.iter.IterableWrapper, like the one @weiji14 created in weiji14/zen3geo#46. Something like this:

from pystac_client import Client
from torchgeo.datapipes import STACItemReader  # proposed; does not exist yet

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

bbox_of_interest = [-122.27, 47.54, -121.96, 47.74]  # example bbox
time_of_interest = "2021-07-01/2021-08-01"  # example date range

search = catalog.search(
    collections=["landsat-c2-l2"],
    bbox=bbox_of_interest,
    datetime=time_of_interest,
    query={"eo:cloud_cover": {"lt": 10}},
)

items = list(search.get_items())
dp = STACItemReader(iterable=items)

This approach has the advantage of allowing users to use their own Item discovery methods for static or API-based STACs.

The difficult part, as I see it, is selecting the required Item Assets and providing an appropriate method for IO. For those who aren't familiar with STAC, a single Item can contain many Assets (e.g. individually stored bands, masks, thumbnails), with no requirement for these assets to be stored in the same location (e.g. S3) or format (e.g. Cloud Optimized GeoTIFF). Thankfully, torchdata already contains many DataPipes for different locations and formats (see pytorch/data#672 and the TorchData IO DataPipes).

In my mind, an ideal STACItemReader would:

  • Take an iterable of STAC Items (Iterable[pystac.Item] or Iterable[Dict])
  • Allow the user to specify which Assets to retrieve from each Item (e.g. list of bands)
  • Provide the correct IO for each Asset given its (location, format)
  • Capture the appropriate metadata from the STAC Item. For example, imagery stored in a non-georeferenced format will require the geometry attribute of the Item.

stackstac and odc-stac already do these in their own way. I don't know how well they will work in a DataPipe, given they use xarray and its lazy loading.

@weiji14 (Contributor) commented Sep 2, 2022

In my mind, an ideal STACItemReader would:

  • Take an iterable of STAC Items (Iterable[pystac.Item] or Iterable[Dict])
  • Allow the user to specify which Assets to retrieve from each Item (e.g. list of bands)
  • Provide the correct IO for each Asset given its (location, format)
  • Capture the appropriate metadata from the STAC Item. For example, imagery stored in a non-georeferenced format will require the geometry attribute of the Item.

You're on the right track 😄 This entire thing sounds more like a DataPipeLine than a single DataPipe, but I think a curated DataPipeLine like this (which chains together multiple DataPipes) is what torchgeo will end up providing (once the matter of optional dependencies in #662 is resolved).

While it's possible to create a DataPipe with several DataPipes inside, you only have to look at torchdata's code (e.g. https://github.com/pytorch/data/tree/v0.4.0/torchdata/datapipes/iter) to see that each official torch DataPipe focuses on one task only, e.g. one function with one dependency. So I recommend having STACItemReader do only STAC Item reading, and having a separate StackSTACReader read those STAC Item objects into an xarray.DataArray (while handling the Asset loading/filtering). This separation of concerns perhaps makes it less user friendly, but it's easier to write and maintain for developers and better suited to power users wanting full customization of their data pipeline.

P.S. I've started a page at weiji14/zen3geo#48 summarizing some of the ideas in this thread, feel free anyone to continue the discussion there!

P.P.S. Sorry @nilsleh for going off track from your PR, I'll keep quiet now 🙂
