Do a global run of embeddings #277
I was talking with @konstantinklemmer and asked for his help in making decisions here. Also pinging @cholmes, @bengmstrong, @BradNeuberg, and @Clay-foundation/all; please ping others as needed. We will need to make a decision within a month for the "Big Embeddings run". This implies lots of choices that are cheap to make now and VERY EXPENSIVE to correct later. Below are my questions and my suggestions; none of my suggestions are strongly held.
What are your thoughts @yellowcap @srmsoumya? How much effort would it take to pull this off on your side? Let's aim to kick this compute off July 15th? |
If we use WorldCover I would suggest a chip size of 100x100 or 200x200, so the chips fit nicely into their 10k x 10k source files. Maybe for Sentinel-2 we would use 100x100 to get a finer-grained resolution; not sure what kind of features we hope to find based on the embeddings.

Regarding the feature map output: there are 4 feature maps of 32x32 pixels for the 768-dimensional embeddings, stored as float32. If we assume the input is 4 bands of Sentinel-2 imagery at uint16, then the feature maps are much heavier than the original data. So I would not advise storing the feature maps; instead, rely on running the model at inference time when doing segmentation tasks (did I get this correctly @srmsoumya?).

Regarding cost, we would have to do more test runs to understand it better. We were able to do US-level runs already with a reasonable budget, so I think doing some continental-scale processing, or even global processing, should be doable. Note that the Sentinel-2 composites have limited quality in tropical areas: they are mostly cloud free, but not without haze, and there are small nodata gaps here and there, at least for the WorldCover composites. Happy to look at other sources for composite imagery if people have suggestions. Finally, I would add at least one NAIP run for all of the US to the wish-list as well.
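To make the storage argument concrete, here is a rough size comparison. This is only a sketch: the 256x256 px input chip and 8 px patch size (giving 32x32 patches) are assumptions for illustration, not settings confirmed in this thread.

```python
# Back-of-the-envelope size comparison per chip (assumed: 256x256 px input, 8 px patches).
input_bytes = 256 * 256 * 4 * 2              # 4 Sentinel-2 bands stored as uint16
feature_map_bytes = 4 * 32 * 32 * 768 * 4    # 4 maps x 32x32 patches x 768 dims, float32

print(f"input chip:   {input_bytes / 1e6:.2f} MB")
print(f"feature maps: {feature_map_bytes / 1e6:.2f} MB")
print(f"ratio:        {feature_map_bytes / input_bytes:.0f}x heavier")
```

Under those assumptions the feature maps come out roughly 24x heavier than the source chip, which supports recomputing them at inference time rather than storing them.
|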
After discussing with @brunosan and thinking a bit more about it, here is my rough "wishlist".
For each observation, ideally we'd have the following data (roughly sketched out).

This "wishlist" is motivated mostly by me wanting to dissect Clay embeddings and see what the model learns. Guiding questions are, e.g.: How does the complexity of embeddings change over space? How representative are embeddings of environmental and human-activity measures? Can Clay embeddings be used as geographic priors? This would also create a dense embedding database to be used in arbitrary downstream tasks, which allows direct comparison to competitors like MOSAIKS or SatCLIP. The approach would be as follows: download the Clay embedding with the lon/lat closest to the downstream location -> train a model y_lonlat = f(ClayEmbedding_lonlat) -> evaluate.
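A minimal sketch of that evaluation loop, under assumptions: it supposes the embeddings are released as a GeoParquet file with point geometries and an `embedding` array column (the file name and column names here are hypothetical), and uses a simple ridge regression as the downstream model f.

```python
import geopandas as gpd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical inputs: a GeoParquet of Clay embeddings and a set of labelled downstream points.
emb = gpd.read_parquet("clay_embeddings.parquet")       # columns: geometry, embedding
labels = gpd.read_file("downstream_labels.geojson")     # columns: geometry, y

# For each downstream location, grab the nearest embedding (lon/lat nearest neighbour).
joined = gpd.sjoin_nearest(labels.to_crs(emb.crs), emb, how="inner")
X = np.stack(joined["embedding"].tolist())
y = joined["y"].to_numpy()

# Train y_lonlat = f(ClayEmbedding_lonlat) and evaluate on held-out locations.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge().fit(X_train, y_train)
print("R^2 on held-out locations:", model.score(X_test, y_test))
```
|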
Very cool that you're gearing up for a global run! Would love to pull/play with your embeddings. I agree that Sentinel-2 annual composites are the right starting point for global embeddings. To enable comparisons with other models it would be nice to use the same public, free imagery. We've created/shared global Sentinel-2 L2A composites for 2023, which you are welcome to use (https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/), but they're a work in progress and do have some quality issues. One other note @brunosan: I think you dropped a factor of 10 in your back-of-the-envelope math. It looks like you should be able to get through 3600 inputs per dollar, right? (batch of 10 inputs / 10 seconds * 3600 sec/hour * 1 hr/$). So it might be more affordable than you think!! |
Thanks everyone. I love that we are getting momentum here.

TLDR; so far I'm leaning towards:
1. A global run of the Sentinel-2 yearly composite, at 100 px chip size, most recent year available, with the EG Sentinel-2 all-bands composite.
2. NAIP for CONUS. Latest, with 100 px chip size too.
3. Maybe? Selected locations (the training set?) to enable temporal and cross-instrument studies.
4. Maybe? The Satellogic set.

Released as CC BY (inheriting EG CC BY). Still TBD: format, and what losses we add.

Source imagery

Thanks @bengmstrong and the EG team for the data release. It seems to fit perfectly. Besides the files (https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/) and the STAC endpoint (https://stac.earthgenome.org/), this blog post explains the method: https://medium.com/radiant-earth-insights/announcing-public-access-to-our-global-cloud-free-imagery-archive-25b33dc675ec

It meets the criteria of:
- Fully open license (CC BY).
- Global (there are mentions of "errors"; should we get a black-list of these and re-run them when fixed? I've spot-checked and I only see the usual hard places, like locations with permanent clouds).
- Recent (2023), the only global open composite this recent.

Notes:
- This seems to be a median reduction of the "best" 16 scenes per location. What does "best" mean? (Least cloudy of the ~35 scenes/year?)

Chip size

Boils down to 50 px or 100 px in my opinion. Cost grows quadratically, since it's an area. Also, very small chips approach the patch size of 8 px, which means fewer chances for the self-attention to learn about the surroundings.

Since we use the average, we can recreate embeddings at bigger chip sizes just by averaging the smaller ones. It won't be exactly the same, since the patches computed with smaller chip sizes will not have paid "self-attention" to the patches outside of that small chip.

It worries me that an area of 1 km^2 is substantially big for many potential uses, limiting the usefulness of this large and expensive run, but doing smaller sizes is too expensive. We can do smaller chip sizes for selected places.

Cost estimates

From our "build" workers (the ones on the co-code app, which we might or might not use for this run), we see that in reality we are getting ~10k chips/h/worker (we use H100 GPUs, so this is the approach of a big GPU with a large batch rather than a cheaper GPU or CPUs). A worker costs $1800/month (~$2.5/h). Most of the time is spent on downloading, so chip size doesn't seem to be a strong factor. This works out to roughly 4k chips per dollar per worker. Uncanny that we get pretty much the same result as the napkin exercise (we should become consultants).

- 50x50 px chips: ~1,000 km^2/$ -> ~$510K to run the world
- 100x100 px chips: ~4,000 km^2/$ -> ~$127K to run the world

50 px is too expensive, 100 px is doable. I'm hopeful @yellowcap's optimism holds for this run. Let's just start and assess what coverage/$ we get.
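For reference, a short sketch of the same napkin math; the throughput, price, and Earth-surface figures are simply the assumptions stated above, not measurements.

```python
# Napkin cost math for the global run (numbers taken from the estimates above).
chips_per_hour = 10_000      # observed throughput per H100 worker
dollars_per_hour = 2.5       # ~$1800/month per worker
earth_area_km2 = 510e6       # total Earth surface, oceans included

chips_per_dollar = chips_per_hour / dollars_per_hour          # ~4,000 chips/$

for chip_px in (50, 100):
    km2_per_chip = (chip_px * 10 / 1000) ** 2                 # 10 m/px -> km^2 per chip
    km2_per_dollar = chips_per_dollar * km2_per_chip
    cost = earth_area_km2 / km2_per_dollar
    print(f"{chip_px}x{chip_px} px: {km2_per_dollar:,.0f} km^2/$ -> ~${cost:,.0f} to run the world")
```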
|
Some small suggestions:

Not sure if the scale is possible, but is there a monthly Sentinel-2 global composite available? If a Clay v2 model were trained on such a monthly basis, over several years for example, the model might learn strong seasonal and time-based correlations, which would be especially helpful for change detection problems.

In terms of chip size, can that be specified as a kind of metadata fed in, as is already done for sensor details? I see varying the chip size even for the same sensor as providing several advantages:
- At Planet we've noticed that small-object detection with embeddings is aided by having smaller chip sizes, for things such as small buildings or small forest-degradation areas. Being able to use a smaller chip size in urban areas, for example, would help with dealing with smaller objects over time.
- Varying the chip size acts as a regularizer that would force the model not to overfit to a particular chip size.
- Being able to use varying chip sizes in practice could be a powerful technique: use smaller chip sizes in known urban areas and larger chip sizes in relatively sparse areas to trade off accuracy vs. compute and storage costs.

In terms of storage, I agree GeoParquet is a good format, as well as storing the centroid latitude and longitude. At Planet we've also stored a geometry column that corresponds to the exact chip bounding box behind an embedding, which can be very helpful for knowing exactly where an embedding was generated from.

Another useful thing to optionally store is a visual product image chip for that embedding, as a preview URL stored along with the embedding. This is a chipped visual product for the underlying analytic imagery, stored as a PNG file in a Google bucket. It is very useful when presenting results to the user or showing things like clustering results. Not having a preview chip can make it much harder to deal with embeddings at scale.

At Planet we've been using 224x224 chips for our embeddings, with a 3 m GSD pixel size for PlanetScope. As you've found yourselves, going to smaller chip sizes can significantly increase compute and storage costs. Ultimately we've wanted to figure out a way to store something like a pyramid of different representations, something like Matryoshka embeddings, but that remains an R&D edge we haven't figured out yet.

Something else we store in our embedding GeoParquet files is quality information per chip, using a cloud and quality mask. This is very helpful to filter embeddings down based on quality, which is especially important for change detection problems over time. You might want to compute cloud and quality info using s2cloudless or Fmask and store it in a consistent way. We store percentages for haze, clouds, snow, null pixels, etc., and use GeoPandas to quickly filter embeddings.
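As an illustration of that kind of table, here is a minimal sketch; all column names and values are hypothetical, not Planet's or Clay's actual schema.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical schema: one row per chip, with the chip footprint, the embedding,
# an optional preview URL, and per-chip quality percentages.
gdf = gpd.GeoDataFrame(
    {
        "embedding": [[0.1] * 768, [0.2] * 768],   # toy vectors
        "preview_url": ["https://example.com/a.png", "https://example.com/b.png"],
        "pct_cloud": [2.0, 35.0],
        "pct_haze": [1.0, 10.0],
        "pct_nodata": [0.0, 0.5],
    },
    geometry=[box(-122.30, 47.60, -122.29, 47.61), box(-122.29, 47.60, -122.28, 47.61)],
    crs="EPSG:4326",
)
gdf.to_parquet("embeddings.parquet")

# Quality filtering before a change-detection workflow: keep only clean chips.
clean = gpd.read_parquet("embeddings.parquet").query("pct_cloud < 5 and pct_nodata == 0")
print(len(clean))
```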
|
Update here. |
An update with @yellowcap: we are getting ready, building the pipelines and testing the Earth Genome Sentinel-2 composite data:
|
Hi all, has there been validation of any version or checkpoint of this model on an existing benchmark suite such as GEO-Bench? If not, what are the major blockers? It seems valuable to do this prior to any global embeddings run, because the embeddings cannot be used to run those benchmarks post hoc, and if the benchmark metrics are poor then the embeddings would likely not be very good. |
Related #326 |
+1 to CC BY license, in alignment with the Digital Public Goods Alliance standards https://digitalpublicgoods.net/implement/ |
Update: |
Hi all! For NAIP, have you considered the problem of making sure that chips fully align with their counterparts from other years? This would be a desirable property to have for applications like change detection. Unfortunately, this property does not seem to hold by default because the extents of NAIP tiles from different years don't exactly match up. For example, consider the following two NAIP images of the same grid tile from 2018 and 2021 respectively:
Their extents and shapes don't match up:

```python
import rasterio as rio

with rio.Env():
    with rio.open('s3://naip-analytic/vt/2018/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20181123.tif') as ds:
        print(ds.bounds)
        print(ds.shape)
    with rio.open('s3://naip-analytic/vt/2021/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20211019.tif') as ds:
        print(ds.bounds)
        print(ds.shape)

# output
# BoundingBox(left=699184.8, bottom=4728693.6, right=705106.2, top=4736384.4)
# (12818, 9869)
# BoundingBox(left=699226.2, bottom=4728750.6, right=705104.3999999999, top=4736383.8)
# (12722, 9797)
```

This means that the i-th 256x256 px chip from one scene would not cover an identical geographical area to the i-th 256x256 px chip from the other scene, which suggests the need for some mosaicing and reprojection. What do you guys think? Would love to hear thoughts on this! |
This is a great observation @AdeelH. It is one of the limitations of our current approach, where we simply cut up the imagery as it is. This will indeed lead to bounding boxes that are not 100% overlapping across the different images. The reason for this is that doing the re-projection is a huge lift, computationally and logistically. Aligning all this imagery is not a trivial task, and we are going for the simple approach first (dividing the image as-is into chips). In a future iteration we might be able to do the aligning and re-sampling. |
@yellowcap, @AdeelH just figured out how to do this for NAIP, as we were also working on generating vector embeddings at Element 84. He could share the approach and code we're using. Given the amount of inference you're going to run to generate vector embeddings for all of NAIP, it would be great if you could consider doing this first. We're currently working on a process to compare vector embeddings over time for the same area in order to do change detection, and we're seeing some really great initial results for a smaller area. It would be amazing to be able to do this across the entire US, which could be enabled if you have consistent chips across time. If the vector embeddings are put in a table, the chip identifiers can be used to do efficient spatial joins. |
@yellowcap, I hear you on the technical challenges. I think the approach we're using looks promising. The main idea is to tie the chip bounding boxes to the AOI and do the reprojection and mosaicing in chunks. Here's how it currently works, given an AOI.

The idea with chunks is to be able to process them in parallel on multiple machines. The reprojection/warping to EPSG:5070 is to ensure each chip covers roughly the same amount of area. Since NAIP imagery for different states is not collected at the same time, I don't think it makes sense to treat the whole CONUS as a single AOI; doing it state by state makes more sense.
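This is not the actual Element 84 code, just a minimal sketch of the general idea under stated assumptions: a fixed chip grid derived from the AOI bounds in EPSG:5070, with each NAIP scene resampled onto that grid so the i-th chip always covers the same footprint. The chip size, resolution, and function names are illustrative, and mosaicking across scene boundaries and parallel chunking are left out.

```python
import numpy as np
import rasterio as rio
from rasterio.transform import from_bounds
from rasterio.warp import Resampling, reproject

CHIP = 256          # chip size in pixels (assumed)
RES = 0.6           # target resolution in metres (NAIP-like, assumed)
CRS = "EPSG:5070"   # equal-area CRS so every chip covers roughly the same area

def chip_grid(aoi_bounds):
    """Yield chip bounding boxes on a fixed grid anchored to the AOI, not to any scene.
    Partial edge chips are skipped for simplicity."""
    left, bottom, right, top = aoi_bounds
    step = CHIP * RES
    y = top
    while y - step >= bottom:
        x = left
        while x + step <= right:
            yield (x, y - step, x + step, y)
            x += step
        y -= step

def read_chip(scene_path, chip_bounds):
    """Reproject/resample one chip footprint from a NAIP scene onto the common grid,
    so chips from different years line up geographically."""
    with rio.open(scene_path) as src:
        bands = list(range(1, src.count + 1))
        dst = np.zeros((src.count, CHIP, CHIP), dtype=src.dtypes[0])
        reproject(
            source=rio.band(src, bands),
            destination=dst,
            dst_transform=from_bounds(*chip_bounds, CHIP, CHIP),
            dst_crs=CRS,
            resampling=Resampling.bilinear,
        )
        return dst
```
|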
Thanks for the input and for sharing your algorithm @jasongilman and @AdeelH. At this point we will move forward with the simple approach, without doing the mosaicking. We have the algorithm ready to go and need to move due to time constraints. Once we have the NAIP and Sentinel-2 embeddings created, we can revisit the technique and see if we can do something more advanced, like the approach you propose. I'd love to try this out, but we can't fit it into the timeline we have right now. The code we are using for embedding generation sits here. |
Thanks again @jasongilman and @AdeelH. Sorry for not taking these last improvements into this round, but as @yellowcap says, this is just the first iteration. Lots of things to improve for the next one, including your points. This issue was opened in June and I'm already kicking myself for not having done this inference run much earlier. There are indeed many choices we had to make that we would love to revisit with wider feedback, like the size of the images (and the size of the patches!), overlaps, tiling/sliding, masking clouds or not, ... even whether we should also keep patch embeddings. The best way for us to take these comments is with PRs with the actual code to run, like we have for the documentation, so we can minimize overhead on our side. |
Thanks for the detailed answer. I can understand the demands on the schedule and the embeddings will still be really useful within a single NAIP release. I look forward to trying them out once it's done! |
We've been using Clay v1 embeddings directly, and via the Build/Explore apps. We've also done several types of partial benchmarking, so we are starting to feel comfortable with the quality of the model. We should therefore think about making large runs over existing open data to create embeddings, both for our own benefit, to continue learning about Clay, and to enable the community to leverage these open embeddings.
We still need to make decisions once we decide to make large runs:
- Chip size: 128x128?
- Ideally, we can wrap this code so it is easy to execute down the line, e.g. taking a STAC list and a spec file for chip_size, ... (see the sketch after the note below).
Note: Do not over-scope here, since we have the build app.
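To make the "spec file" idea concrete, here is a minimal hypothetical sketch; the field names, file layout, and `run_embeddings` entry point are all made up for illustration, not existing Clay pipeline code.

```python
# Hypothetical run spec + entry point; nothing here is existing Clay code.
import json
from dataclasses import dataclass

@dataclass
class RunSpec:
    stac_items: str          # path to a newline-delimited list of STAC item URLs
    chip_size: int = 128     # chip size in pixels
    output: str = "embeddings.parquet"

def run_embeddings(spec_path: str) -> None:
    """Read the spec, then hand each STAC item to the (hypothetical) embedding pipeline."""
    spec = RunSpec(**json.loads(open(spec_path).read()))
    with open(spec.stac_items) as f:
        items = [line.strip() for line in f if line.strip()]
    print(f"Would embed {len(items)} items at {spec.chip_size}x{spec.chip_size} px -> {spec.output}")

# Example spec file (JSON): {"stac_items": "items.txt", "chip_size": 128, "output": "run1.parquet"}
```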
Probably out of scope, but the end-state at some point this year could be:
Filing this early to allow community requests, but we should aim to set a date for such a run, e.g. end of June.