
Do a global run of embeddings #277

Open · brunosan opened this issue Jun 11, 2024 · 19 comments
@brunosan (Member)

We've been using Clay v1 embeddings directly and via the Build/Explore apps, and we've done several types of partial benchmarking, so we are starting to feel comfortable with the quality of the model. We should therefore think about making large runs over existing open data to create embeddings: both for our own benefit, to keep learning about Clay, and to enable the community to leverage these open embeddings.

There are decisions we still need to make once we commit to large runs:

  • Instrument? Sentinel? NAIP? One instrument, or a couple?
  • Unit of schema? Should we do them at the file level, or by spatial reference?
  • Spatial resolution? We've seen that many applications need the highest possible spatial resolution. Hence, if Sentinel, a small tile size (but not so small that the embeddings lose quality). 128x128?
  • Locations, time? Wide coverage seems most important, but many users also request temporal change. So I suggest either only wide spatial coverage, or 80% of the run on wide coverage and the remaining 20% on many snapshots over time.
  • What format? I propose we wait and follow guidance from @cholmes on https://github.com/cloudnativegeo/geo-embeddings-survey
  • Hosted? source.coop
  • License? Open. Is CC-BY best? OpenRAIL-M?
  • What is the cost of creation? It would be great to come up with a number.

Ideally, we can wrap this code so it is easy to execute down the line, e.g. taking a STAC list and a spec file for chip_size, ... (a hypothetical sketch of such a spec follows).
Note: do not over-scope here, since we have the Build app.
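
For illustration only, a minimal sketch of what such a spec could look like; every field name here is a hypothetical assumption, not an existing format:

from dataclasses import dataclass, field

@dataclass
class EmbeddingRunSpec:
    # Hypothetical run spec: a STAC item list plus chipping parameters.
    stac_list: str                           # path or URL to a list of STAC items
    chip_size: int = 128                     # chip edge length in pixels
    bands: list = field(default_factory=lambda: ["B02", "B03", "B04", "B08"])
    output_uri: str = "embeddings.parquet"   # where the GeoParquet output lands

spec = EmbeddingRunSpec(stac_list="items.json")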

Probably out of scope, but the end-state at some point this year could be:

  • Sentinel-2 annual composites for the EU
  • Sentinel-2 Level-2 files for a deforestation basin in the Amazon, with as many dates as possible.
  • Same as above but with Sentinel-1 files, or Landsat composites.
  • NAIP for whole states, once.
  • NAIP for one state, for as many years as available.

Filing this early also to allow community requests, but we should aim to set a date for such a run, e.g. end of June.

@brunosan (Member, Author) commented Jul 2, 2024


I was talking with @konstantinklemmer and asked for his help in making decisions here. Also pinging @cholmes, @bengmstrong, @BradNeuberg and @Clay-foundation/all; please ping others.

We will need to make decisions within a month for the "Big Embeddings run". This implies lots of decisions that are cheap to make now and VERY EXPENSIVE to correct later.

My questions and my suggestions follow, though none of my suggestions are strongly held.

  1. What instruments?
    I suggest Sentinel-2 annual composite.
  2. What locations/time?
    I suggest starting somewhere in South America with the latest composite, plus the Amazon basin for all available years. Then increase as budget allows.
  3. What chip size?
    128x128, with small sections at 64x64 and 256x256.
  4. What output?
    Start with the average of the patch embeddings, and maybe a separate file with all patch embeddings and feature maps. See Saving raw embeddings + feature maps. #291
  5. What format?
    GeoParquet. I'd follow the Earth Index columns and metadata here: https://github.com/cloudnativegeo/geo-embeddings-survey/blob/main/data/earth_index/readme.md
    I'd also add losses: either the training loss, or a simple loss.
  6. Host/License
    Hosted on source.coop, with a CC-BY license.
  7. How much budget to put toward this?
    Let's start churning and see costs. If they deviate a lot from estimates, we rethink. Assuming a $1/hour g5.xlarge instance with an NVIDIA A10, processing batches of 10 Sentinel-2 inputs takes 10 seconds. Each 128x128 chip covers 2.5 square kilometers. This means we can process roughly 360 inputs per dollar. With a $10,000 budget to start with, that translates to a coverage of 9,000,000 square kilometers. Let's apply a 50% penalty just because, and it should give us enough for South America??

What are your thoughts @yellowcap @srmsoumya? How much effort would it take to pull this off on your side?
Should we continue training v1 first (#283)?

Let's aim to kick this compute off July 15th?

@yellowcap (Member)


If we use WorldCover, I would suggest a chip size of 100x100 or 200x200, so the chips fit nicely into their 10k x 10k source files. Maybe for Sentinel-2 we would use 100x100 to have a more fine-grained resolution. Not sure what kinds of features we hope to find based on the embeddings.

Regarding the feature-map output: there are 4 feature maps of 32x32 pixels for 768 embeddings, stored as float32. If we assume the input is 4 bands of Sentinel-2 imagery at uint16, then the feature maps are much heavier than the original data. So I would advise not storing the feature maps, and instead running the model at inference time when doing segmentation tasks (did I get this correctly @srmsoumya?)
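
To make that concrete, a rough size comparison, assuming a 256x256 px input chip (the input size is my assumption):

# 4 feature maps of 32x32 px with 768 dims each, stored as float32
feature_map_bytes = 4 * 32 * 32 * 768 * 4
# 4 Sentinel-2 bands at uint16 for a 256x256 px chip
input_bytes = 256 * 256 * 4 * 2
print(feature_map_bytes / 1e6)  # ~12.6 MB per chip
print(input_bytes / 1e6)        # ~0.5 MB per chip, so feature maps are ~24x heavier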

Regarding cost, we would have to do more test runs to understand it better. We were able to do US-level runs already on a reasonable budget, so I think doing some continental-scale or even global processing should be doable.

Note that the Sentinel-2 composites have limited quality in tropical areas: they are mostly cloud-free, but not without haze, and there are small nodata gaps here and there, at least for the WorldCover composites. Happy to look at other sources for composite imagery if people have suggestions.

Finally, I would add at least one NAIP run for all of the US to the wish-list as well.

@konstantinklemmer commented Jul 4, 2024


After discussing with @brunosan and thinking a bit more about it, here is my rough "wishlist":

  • Global coverage embedding map. Chip size is less relevant as long as the whole, continuous planet is covered.
  • Sentinel-2 would be the preferred sensor; it should of course be cloud-free.
  • Ideally two time steps for each location, e.g. January and July (to roughly cover seasonality), but that's secondary.
  • Major TOM Sentinel-2A might be an option: https://github.com/ESA-PhiLab/Major-TOM

For each observation, ideally we'd have the following data (roughly sketched out):
[chip_centroid_lon, chip_centroid_lat, timestamp, chip_thumbnail, clay_embedding, clay_loss]
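
A minimal sketch of writing one such record as GeoParquet with geopandas; the column names follow the wishlist above, while the values (and leaving out chip_thumbnail) are placeholder assumptions:

import geopandas as gpd
from shapely.geometry import Point

# One placeholder record following the wishlist schema.
record = {
    "chip_centroid_lon": [-62.2],
    "chip_centroid_lat": [-3.5],
    "timestamp": ["2023-07-01"],
    "clay_embedding": [[0.0] * 768],  # one 768-dim vector per chip
    "clay_loss": [0.042],
}
gdf = gpd.GeoDataFrame(record, geometry=[Point(-62.2, -3.5)], crs="EPSG:4326")
gdf.to_parquet("clay_embeddings.parquet")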

This "wishlist" is motivated mostly by me wanting to dissect Clay embeddings and see what it learns. Guiding questions are e.g. How does the complexity of embeddings change over space? How representative are embeddings of environmental and human-activity measures? Can Clay embeddings be used as geographic priors?

This would also create a dense embedding database to be used in arbitrary downstream tasks, allowing direct comparison to competitors like MOSAIKS or SatCLIP. The approach would be: download the Clay embedding with the lon/lat closest to the downstream location -> train a model y_lonlat = f(ClayEmbedding_lonlat) -> evaluate. (A sketch follows below.)
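
A sketch of that comparison loop, with random stand-ins for the embedding table and the downstream dataset (all variable names here are hypothetical):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
emb_lonlat = rng.uniform(-10, 10, (1000, 2))  # stand-in embedding centroids
emb_vecs = rng.normal(size=(1000, 768))       # stand-in Clay embeddings
task_lonlat = rng.uniform(-10, 10, (200, 2))  # downstream task locations
y = rng.normal(size=200)                      # downstream labels

# For each task location, take the embedding with the closest centroid,
# then fit y_lonlat = f(ClayEmbedding_lonlat) and evaluate.
dists = np.linalg.norm(task_lonlat[:, None, :] - emb_lonlat[None, :, :], axis=-1)
X = emb_vecs[dists.argmin(axis=1)]
print(cross_val_score(Ridge(), X, y, cv=5))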

@bengmstrong

Very cool that you're gearing up for a global run! I would love to pull and play with your embeddings. I agree that Sentinel-2 annual composites are the right starting point for global embeddings. To enable comparisons with other models, it would be nice to use the same public, free imagery. We've created and shared global Sentinel-2 L2A composites for 2023, which you are welcome to use (https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/), but they're a work in progress and do have some quality issues.

One other note @brunosan: I think you dropped a factor of 10 in your back-of-the-envelope math. Looks like you should be able to get through 3,600 inputs per dollar, right? (batch of 10 inputs / 10 seconds * 3600 sec/hour * 1 hr/$) So it might be more affordable than you think!!
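
Spelling out the corrected back-of-the-envelope math from the two comments above:

# $1/hour instance, batches of 10 inputs every 10 seconds
inputs_per_dollar = 10 / 10 * 3600   # = 3,600 inputs per dollar
km2_per_chip = 2.5                   # per 128x128 chip, as stated above
budget = 10_000                      # dollars
print(budget * inputs_per_dollar * km2_per_chip)  # 90,000,000 km^2, 10x the original estimate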

@brunosan brunosan changed the title Investigate global runs of embeddings Do a global run of embeddings Jul 5, 2024
@brunosan (Member, Author) commented Jul 5, 2024


Thanks everyone. I love that we are getting momentum here.

TL;DR so far:
I'm leaning toward

  1. a global run of the Sentinel-2 yearly composite, at 100 px chip size, most recent year available, using the EG Sentinel-2 all-bands composite.
  2. NAIP for CONUS. Latest year, with 100 px chip size too.
  3. maybe? Selected locations (the training set?) to enable temporal and cross-instrument studies.
  4. maybe? The Satellogic set.

Released as CC-BY (inheriting EG's CC-BY).

Format and which losses to add are still TBD.

Source imagery

Thanks @bengmstrong and the EG team for the data release. It seems to fit perfectly. Besides the files and the STAC endpoint, this blog post explains the method.

It meets the criteria of:

  • Fully open license (CC-BY).
  • Global. (There are mentions of "errors"; should we get a blacklist of these and re-run them when fixed? I've spot-checked and I only see the usual hard places, like permanently cloudy locations.)
  • Recent (2023) (the only global open composite this recent).

Notes:

  • This seems to be a median reduction of the "best" 16 scenes per location. What does "best" mean? (Least cloudy of the ~35 scenes/year?)

Chip Size

In my opinion it boils down to 50 px or 100 px.
Cost scales quadratically, since chips cover an area (halving the chip size quadruples the number of chips). Also, very small chips approach the 8 px patch size, which gives the self-attention fewer chances to learn about the surroundings.

Since we use the average, we can recreate embeddings at bigger chip sizes by just averaging the smaller ones. It won't be identical, since the patches computed with smaller chip sizes will not have paid "self-attention" to the patches outside of that small chip (see the sketch below).
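
As a sketch, averaging each 2x2 block of small-chip embeddings approximates the embedding of a chip twice the size (the grid shapes are illustrative):

import numpy as np

# A 10x10 grid of small-chip embeddings, 768 dims each (random stand-ins).
small = np.random.default_rng(0).normal(size=(10, 10, 768))
# Average each 2x2 block to get a 5x5 grid of double-sized chip embeddings.
big = small.reshape(5, 2, 5, 2, 768).mean(axis=(1, 3))
print(big.shape)  # (5, 5, 768)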

It worries me that an area of 1 km^2 is substantially big for many potential uses, limiting the usefulness of this large and expensive run, but doing smaller sizes is too expensive. We can do smaller chip sizes for selected places.

Cost estimates

From our "build" workers (the ones on the co-code app, which we might or might not use for this run), we see that in reality we are getting there ~10k chips/h/worker (we use H100 GPUs, so this the approach of a big GPU with a large batch than a cheaper GPU or CPUs). A worker costs $1800/month ($2.5$/h). Most of the time is spent on downloading, so chip size doesn't seem to be a strong factor.

This would mean 4k chips/$/h/worker.

Uncanny we get pretty much the same result that the napkin exercise (we should become consultants).

chip size (at 10 m/px)    coverage (km^2/$)    cost to run the world
50x50 px                  1,000                $510K
100x100 px                4,000                $127K

50 px is too expensive; 100 px is doable. I'm hopeful @yellowcap's optimism holds true for this run.
Let's just start and assess what coverage per dollar we get. (A derivation of the table above is sketched below.)
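
For reference, a sketch reproducing the table from the measured throughput; the ~510M km^2 figure for "the world" (Earth's total surface area) is an assumption inferred from the $510K number:

chips_per_dollar = 10_000 / 2.5   # ~10k chips/h/worker at ~$2.50/h
earth_km2 = 510e6                 # Earth's total surface area (assumption)
for px in (50, 100):
    km2_per_chip = (px * 10 / 1000) ** 2           # 10 m per pixel
    km2_per_dollar = chips_per_dollar * km2_per_chip
    print(px, km2_per_dollar, earth_km2 / km2_per_dollar)
# 50  -> 1,000 km^2/$ -> ~$510K for the world
# 100 -> 4,000 km^2/$ -> ~$127K for the world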

@BradNeuberg commented Jul 5, 2024 via email


@brunosan (Member, Author)

Update here.
We are going to do another v1 training run before the global embeddings run. Follow #283 for details.

@brunosan (Member, Author)

Some updates with @yellowcap. We are getting ready, building the pipelines and testing the Earth Genome Sentinel-2 composites data:

  • The data comes in Web Mercator projection, which is certainly great for map visualization, but it comes at a cost in terms of data to process. We trained Clay in projections that keep the GSD constant across the tile. In Web Mercator, a scene over e.g. Norway has about 77% more pixels than the same area near the equator: more pixels for the same feature. Not sure how much the embeddings will suffer when classifying the same object at high latitudes (something to check; see the scale-factor sketch after this list).
  • The projection change also implies that there is a nodata boundary around each scene, and that the scene edges are not exactly horizontal/vertical.

(screenshot: nodata boundary and tilted scene edges after reprojection)

  • The nominal pixel resolution is 9.8 m and 19.1 m for the 10 m and 20 m bands (i.e. the resolution of Web Mercator zoom levels 14 and 13). But this is not the real resolution away from the equator, hence the change in the number of pixels. So when using a 256x256 pixel image for an ML application, one is looking at differently sized areas in reality.
  • In the STAC items some properties of the proj extension are missing, for example the proj:shape property, which is required for stacchip. We can work around this. (CC @bengmstrong)
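
For reference, the Web Mercator linear scale factor relative to the equator is 1/cos(latitude); a small sketch of the resulting pixel inflation per side:

import math

# Web Mercator stretches distances by 1/cos(latitude) relative to the equator,
# so the same ground distance spans more pixels at higher latitudes.
for lat in (0, 45, 60, 70):
    print(lat, round(1 / math.cos(math.radians(lat)), 2))
# 0 -> 1.0, 45 -> 1.41, 60 -> 2.0, 70 -> 2.92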

@noahgolmant

Hi all, has there been validation of any version or checkpoint of this model on an existing benchmark suite such as GEO-Bench? If not, what are the major blockers? It seems valuable to do this prior to any global embeddings run, because the embeddings cannot be used to run those benchmarks post hoc, and if the benchmark metrics are poor, the embeddings would likely not be very good.

@brunosan (Member, Author) commented Oct 1, 2024


Related #326

@lauracchen (Member)


+1 to CC BY license, in alignment with the Digital Public Goods Alliance standards https://digitalpublicgoods.net/implement/

@lauracchen (Member)


Update:
For NAIP inference, we are planning to use 256x256 px chips.

@AdeelH commented Oct 18, 2024


Hi all! For NAIP, have you considered the problem of making sure that chips fully align with their counterparts from other years? This would be a desirable property to have for applications like change detection. Unfortunately, this property does not seem to hold by default because the extents of NAIP tiles from different years don't exactly match up.

For example, consider the following two NAIP images of the same grid tile from 2018 and 2021 respectively:

s3://naip-analytic/vt/2018/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20181123.tif
s3://naip-analytic/vt/2021/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20211019.tif

Their extents and shapes don't match up:

import rasterio as rio

with rio.Env():
    with rio.open('s3://naip-analytic/vt/2018/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20181123.tif') as ds:
        print(ds.bounds)
        print(ds.shape)
    with rio.open('s3://naip-analytic/vt/2021/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20211019.tif') as ds:
        print(ds.bounds)
        print(ds.shape)
# output
# BoundingBox(left=699184.8, bottom=4728693.6, right=705106.2, top=4736384.4)
# (12818, 9869)
# BoundingBox(left=699226.2, bottom=4728750.6, right=705104.3999999999, top=4736383.8)
# (12722, 9797)

This means that the i-th 256x256 px chip from one scene would not cover an identical geographical area to the i-th 256x256 px chip from the other scene.

This would suggest a need for some mosaicking and reprojection. What do you all think? Would love to hear thoughts on this!

@yellowcap (Member)


This is a great observation @AdeelH. It is one of the limitations of our current approach, where we are simply cutting up the imagery as it is. This will indeed lead to bounding boxes that are not 100% overlapping across the different images. The reason is that doing the re-projection is a huge lift, computationally and logistically. Aligning all this imagery is not a trivial task, and we are going for the simple approach first (dividing the image as-is into chips). In a future iteration we might be able to do the aligning and re-sampling.

@jasongilman

@yellowcap, @AdeelH just figured out how to do this for NAIP, as we were also working on generating vector embeddings at Element 84. He can share the approach and code we're using. Given the amount of inference you're going to run to generate vector embeddings for all of NAIP, it would be great if you could consider doing this first. We're currently working on a process to compare vector embeddings over time for the same area in order to do change detection, and we're seeing some really great initial results for a smaller area. It would be amazing to be able to do this across the entire US, which could be enabled if you have consistent chips across time. If the vector embeddings are put in a table, the chip identifiers can be used to do efficient spatial joins.

@AdeelH commented Nov 4, 2024


@yellowcap, I hear you on the technical challenges. I think the approach we're using looks promising. The main idea is to tie the chip bounding boxes to the AOI and do the reprojection and mosaicking in chunks. Here's how it currently works:

Given an AOI,

  1. Break it into a grid of chunks (24 km x 24 km in the image below).
    (image: the AOI divided into a grid of chunks)
  2. Then, for each chunk:
    • Fetch the intersecting STAC items and download their image files
    • Lazily warp each image file to EPSG:5070 via VRTs (basically instantaneous)
    • Build a new VRT that mosaics the reprojection VRTs (basically instantaneous)
    • Read chips from the mosaic VRT. The image below shows ~240 m x 240 m (~400 px x 400 px) chips in a single chunk that fall within the AOI.
      (image: chips within a single chunk, clipped to the AOI)
    • Create embeddings and save them in a GeoParquet file

The idea with the chunks is to be able to process them in parallel on multiple machines. The reprojection/warping to EPSG:5070 ensures each chip covers roughly the same amount of area.
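
A minimal sketch of the lazy warp-and-mosaic step with rasterio; the file list and chunk bounds are hypothetical, and this is a sketch under stated assumptions rather than the exact Element 84 code:

import rasterio as rio
from rasterio.merge import merge
from rasterio.vrt import WarpedVRT

# NAIP COGs intersecting one chunk (hypothetical list).
urls = [
    "s3://naip-analytic/vt/2018/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20181123.tif",
]
chunk_bounds = (1_700_000, 2_300_000, 1_724_000, 2_324_000)  # hypothetical EPSG:5070 extent

with rio.Env():
    srcs = [rio.open(u) for u in urls]
    # Lazily warp each source to EPSG:5070; no pixels are read yet.
    vrts = [WarpedVRT(src, crs="EPSG:5070") for src in srcs]
    # Mosaic the warped sources over the chunk extent; fixed-grid chips tied
    # to the AOI can then be sliced out of `mosaic`.
    mosaic, transform = merge(vrts, bounds=chunk_bounds)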

Since NAIP imagery for different states is not collected at the same time, I don't think it makes sense to treat the whole CONUS as a single AOI. Instead, doing it state by state makes more sense.

@yellowcap (Member) commented Nov 5, 2024


Thanks for the input and for sharing your algorithm @jasongilman and @AdeelH. At this point we will move forward with the simple approach, without the mosaicking. We have the algorithm ready to go and need to move due to time constraints.

Once we have the NAIP and Sentinel-2 embeddings created, we can revisit the technique and see if we can do something more advanced, like the approach you propose. I'd love to try it out, but we can't fit that into the timeline we have right now.

The code we are using for embedding generation sits here

@brunosan (Member, Author) commented Nov 5, 2024


Thanks again @jasongilman and @AdeelH. Sorry for not taking these latest improvements into this round, but as @yellowcap says, this is just the first iteration. Lots of things to improve for the next one, including your points. This issue was opened in June and I'm already kicking myself for not having done this inference run much earlier.

There are indeed many choices we had to make that we would love to revisit with wider feedback: the size of the images (the size of the patches!), overlaps, tiling/sliding, masking clouds or not, ... and whether we should also keep patch embeddings ...

The best way for us to take these comments is via PRs with the actual code to run, as we do with the documentation, so we can minimize overhead on our side.

@jasongilman

Thanks for the detailed answer. I understand the demands on the schedule, and the embeddings will still be really useful within a single NAIP release. I look forward to trying them out once they're done!
