Large Scale Geospatial Benchmarks #1545
Replies: 15 comments 36 replies
-
I am not sure what would be considered in scope for geospatial workloads, but in my experience large-scale spatial joins are some of the most unruly computations. I see all sorts of blogs/benchmarks/demos etc. using the NYC taxi dataset and the NYC neighborhoods as an example of scaling to large geospatial datasets. I find this example interesting but not helpful in many situations. This particular problem gets much harder when both sides of your join are large, because you can no longer broadcast your smaller dataset to all of the workers and run several smaller joins on each worker.

I would propose a spatial join with relatively large datasets on both sides of the join. Bonus points for doing polygons (and not just points) on each side of the join. I would certainly suggest one of the open source building footprint datasets, as these tend to be pretty large datasets of polygons. I am sure there are tons of point datasets you could join with this to make an interesting benchmark, but I would try to find another polygon-based dataset to join with. Possible polygon-based datasets to join to a building footprint dataset:

If you want to make it even more challenging, modifying this to be the closely related but much harder nearest-neighbor join on two large datasets would be a true stress test.

I am not sure how common these problems are in the "wild", but I have come across them multiple times in my career. I know this probably doesn't get to your multi-terabyte threshold, but I have found that the scalability of these types of joins gets painful even before you get to terabytes. Also, if I am missing some trick that makes these computations "just work", I'd love to hear about it. Feel free to reach out if you want to chat about the benchmark or about my experiences with these types of computations.
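A minimal sketch of what such a benchmark could look like with dask-geopandas (the bucket paths and dataset layout are placeholders, not real datasets):

```python
import dask_geopandas

# Two hypothetical large polygon datasets stored as partitioned GeoParquet
buildings = dask_geopandas.read_parquet("s3://bucket/building-footprints/")
parcels = dask_geopandas.read_parquet("s3://bucket/land-parcels/")

# Polygon-polygon spatial join. With two large inputs, neither side fits in
# memory, so the usual trick of broadcasting the small side is unavailable;
# without spatial partitioning information this compares every pair of
# partitions, which is exactly what makes the problem a good stress test.
joined = dask_geopandas.sjoin(buildings, parcels, how="inner", predicate="intersects")

# A simple reduction to force the full computation
n_matches = joined.map_partitions(len).sum().compute()
```

The nearest-neighbor variant mentioned above would replace the `intersects` predicate with a nearest-neighbor search, which is typically much harder to partition.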
-
In my experience, every dataset has its own specific patterns, and attempting to use real-time data for benchmarking is often limited to the moment it's collected. Data evolves constantly, and there is no one-size-fits-all solution for benchmarking across different real-time datasets. Customers tend to have unique data characteristics that vary over time, making it challenging to rely on a single dataset for accurate benchmarking in the long term. This dynamic nature of data calls for customized, adaptive approaches for each situation rather than static benchmarks based on a snapshot of real-time data.
-
Thanks @bahaugen! Large-scale spatial joins are a fun problem (I actually got this working in an early version of dask-geopandas. It was neat!). I think I should ask the following two questions before we invest much time here:
I would be excited if the answer to both questions is "yes!" because I think this would be a fun thing to work on.
-
Thanks @upbram! I agree 100% with you. The devil is always in the details. That being said, lots of problems do look pretty similar, and assuming that we continue to build general-purpose software that isn't tailor-made to very specific problems, I have moderately high confidence that work done to optimize a broad set of benchmarks will result in software with improved performance on novel problems. This has certainly been our experience with other similar efforts, like the TPC-H effort with Dask DataFrames.
-
A while back I did an investigation of the low-level libraries often used in geospatial Python workflows. We were just starting to move to S3, and there wasn't a lot of understanding of load-time performance in the cloud at the time. While I have not run it in a long while, it might still be useful, if only for historical context: https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md

@mrocklin if there is interest from Coiled, I would love to chat on the topic of geospatial Dask and our experience of using Dask in the cloud for large raster-based geospatial workloads.
-
Thanks to all who have engaged here so far. I'm going to convert this issue to a GitHub Discussion so we can use threads to better handle the multiple conversations going on.
-
A test could be upscaling or downscaling a dataset at scale.
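One possible interpretation of this as a benchmark: reduce the resolution of a gridded dataset by block-averaging with xarray's `coarsen`. The store paths and coarsening factors below are placeholders:

```python
import xarray as xr

# Hypothetical high-resolution gridded dataset
ds = xr.open_zarr("s3://bucket/high-res-grid.zarr", chunks={})

# Reduce resolution by averaging 4x4 blocks of grid cells
coarse = ds.coarsen(latitude=4, longitude=4, boundary="trim").mean()
coarse.to_zarr("s3://bucket/low-res-grid.zarr", mode="w")
```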
-
I'd like to propose a benchmark based on global, high-frequency remote sensing observations, typically Landsat 8 or Sentinel-2 datasets. The big advantage is that you can find them on most cloud platforms and in several private datacenters. A typical large-scale benchmark using those would be to compute and reduce some indices (NDVI, NDWI, NDSI, etc.) over a large temporal scale, a large spatial scale, or both. This can typically be done to plot some statistics (take the mean, or just count) of the evolution of water/vegetation/snow over a given area in a given period. You could start with a simple case, like plotting the evolution of water on a single Sentinel-2 tile by year/season over the 8 years of observations from the mission. This means about 400 observations, using two bands per index, or about 100 GB of data.
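A rough sketch of that simple case, assuming the Sentinel-2 tile has already been stacked into a Zarr store with "green" and "nir" bands (the path, band names, and dimension names are placeholders):

```python
import xarray as xr

# Hypothetical pre-stacked Sentinel-2 time series for a single tile
ds = xr.open_zarr("s3://bucket/sentinel2-tile-timeseries.zarr", chunks={})

# NDWI highlights open water: (green - nir) / (green + nir)
ndwi = (ds["green"] - ds["nir"]) / (ds["green"] + ds["nir"])
water = ndwi > 0  # crude water mask

# Fraction of time each pixel is classified as water, per season,
# then an area-averaged time series over the whole tile
seasonal_water = water.resample(time="QS-DEC").mean()
timeseries = seasonal_water.mean(["x", "y"]).compute()
```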
-
Climatology

A canonical workflow for weather/climate data is calculating a climatology, i.e., the average weather for a particular time of year/day, computed independently at each location. We have code for calculating climatology using Apache Beam in WeatherBench2, which we have run on datasets up to at least the ~100 TB scale using Google Cloud Dataflow: https://github.com/google-research/weatherbench2/blob/47d72575cf5e99383a09bed19ba989b718d5fe30/scripts/compute_climatology.py

The source dataset is described here. There are also larger variants, up to ~6 PB in size, in ARCO-ERA5 if you're looking for more of a challenge :). This dataset has dozens of variables with dimensions

At a high level, the climatology calculation in WeatherBench2 looks like this:
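Not the WeatherBench2/Beam implementation itself, but a rough xarray/Dask sketch of the same idea (the store paths are placeholders):

```python
import xarray as xr

# Hypothetical hourly ERA5-like Zarr store
ds = xr.open_zarr("gs://bucket/era5-hourly.zarr", chunks={})

# Average over all years for each day of year, independently at each grid point
clim = ds.groupby("time.dayofyear").mean("time")
clim.to_zarr("gs://bucket/climatology.zarr", mode="w")
```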
-
Regridding

Another canonical workflow for geospatial data is regridding/re-projection. We also have code for this in WeatherBench2, which we've run up to the PB scale: https://github.com/google-research/weatherbench2/blob/47d72575cf5e99383a09bed19ba989b718d5fe30/scripts/regrid.py#L140

The source data here is similar to the data used for the climatology calculation, but the calculation itself is a little simpler:

In a more advanced version, we simultaneously repeat steps 2-4 for a variety of desired output resolutions. This allows us to load the data only once, which results in significant cost savings.
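As a rough sketch of regridding with plain xarray (real pipelines often use a conservative regridder such as xESMF rather than simple interpolation; paths and the target resolution are placeholders):

```python
import numpy as np
import xarray as xr

# Hypothetical 0.25-degree source dataset
ds = xr.open_zarr("gs://bucket/era5-0p25deg.zarr", chunks={})

# Target 1-degree global grid
target_lat = np.arange(-90.0, 90.1, 1.0)
target_lon = np.arange(0.0, 360.0, 1.0)

# Bilinear-style interpolation onto the coarser grid
regridded = ds.interp(latitude=target_lat, longitude=target_lon, method="linear")
regridded.to_zarr("gs://bucket/era5-1deg.zarr", mode="w")
```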
-
Forecast evaluation

Yet another canonical workflow for weather/climate data is forecast evaluation, e.g., this code in WeatherBench2. Weather forecasts and ground-truth datasets typically have three "time" dimensions:

Forecast datasets have dimensions

The workflow then looks like:

Alternatively, omit averaging over either spatial or temporal dimensions in step (4), and write the result to Zarr instead.
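A rough sketch of the idea (not the WeatherBench2 code): assume the forecasts have init_time and lead_time dimensions, the ground truth has a plain time dimension, and valid time = init time + lead time. Paths and variable names are placeholders, and every valid time is assumed to exist in the ground-truth dataset:

```python
import numpy as np
import xarray as xr

forecast = xr.open_zarr("gs://bucket/forecasts.zarr", chunks={})
truth = xr.open_zarr("gs://bucket/era5.zarr", chunks={})

# Valid time = initialization time + lead time (broadcasts to 2-D)
valid_time = forecast["init_time"] + forecast["lead_time"]

# Vectorized selection of the matching ground-truth fields
truth_at_valid = truth["temperature"].sel(time=valid_time)

# RMSE as a function of lead time
error = forecast["temperature"] - truth_at_valid
rmse = np.sqrt((error**2).mean(["init_time", "latitude", "longitude"]))
rmse.compute()
```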
-
Transformed Eulerian Mean

A standard atmospheric circulation diagnostic:

```python
import xarray as xr

ds = xr.open_zarr(
    "gs://weatherbench2/datasets/era5/1959-2023_01_10-full_37-1h-0p25deg-chunk-1.zarr",
    chunks={},
)
ds = ds[
    ["u_component_of_wind", "v_component_of_wind", "temperature", "vertical_velocity"]
].rename(
    {
        "u_component_of_wind": "U",
        "v_component_of_wind": "V",
        "temperature": "T",
        "vertical_velocity": "W",
    }
)

zonal_means = ds.mean("longitude")
anomaly = ds - zonal_means
anomaly["uv"] = anomaly.U * anomaly.V
anomaly["vt"] = anomaly.V * anomaly.T
anomaly["uw"] = anomaly.U * anomaly.W

temdiags = zonal_means.merge(anomaly[["uv", "vt", "uw"]].mean("longitude"))

# This is incredibly slow, takes a while for flox to construct the graph
# daily = temdiags.resample(time="D").mean()

# Option 2: rechunk to make it a blockwise problem
# we should do this automatically
from xarray.groupers import TimeResampler

daily = temdiags.chunk(time=TimeResampler("D")).resample(time="D").mean()
daily.to_zarr(SOMEWHERE)
```

There is a regridding step in the middle that I skipped, will update when I am able to confirm some details of that step.
-
Vectorized functions (e.g.,
-
I would love to see more discussion of memmap-based solutions (including approaches built on top of them). Any geospatial dataset that can be written as a single memmap on SSD will get an order-of-magnitude speedup for slicing and similar access patterns, since memmaps avoid various file-related kernel calls. See also our AAAI paper on global weather forecasting: https://ojs.aaai.org/index.php/AAAI/article/view/17749/17556
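A toy illustration of the idea with numpy.memmap (the path, shape, and dtype are placeholders):

```python
import numpy as np

shape = (10_000, 1800, 3600)  # (time, lat, lon), illustrative only (~260 GB of float32)

# Write the array once as a flat binary file on SSD
data = np.memmap("/ssd/dataset.dat", dtype="float32", mode="w+", shape=shape)
# ... fill `data` incrementally here ...
data.flush()

# Later: open read-only; slicing only touches the pages backing the request,
# avoiding per-read open/seek/read calls on many small files
view = np.memmap("/ssd/dataset.dat", dtype="float32", mode="r", shape=shape)
tile = view[500, 200:300, 1000:1100]
```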
-
Greetings from particle physics! We face problems related to this workflow in our datasets (which are similarly large, except we don't do joins yet, but rather have complex task graphs atop large amounts of data). I would love to collaborate with y'all towards scaling well to ~billions of tasks in complex task graphs!
-
People love the Xarray/Dask/... software stack for geospatial workloads, but only up to about the terabyte scale. Beyond that point this stack can struggle, requiring expertise to work well and frustrating users and developers alike.
To address this, we want to build a large-scale geospatial benchmark suite of end-to-end examples to ensure that these tools operate smoothly up to the 100-TB scale.
We want your help to build a catalog of large-scale, end-to-end, representative benchmarks. What does this help look like? Even a plain-language description of a common workflow is useful, for example:
"People often need to take netCDF files that are generated hourly, rechunk them to be arranged spatially, and then update that dataset every day".
This is a big ask, we know, but we hope that if a few people can contribute something meaningful then we'll be able to contribute code changes that accelerate those workflows (and others) considerably.
We’d welcome contributions as comments on this issue.