Do a global run of embeddings #277
I was talking with @konstantinklemmer and asked for his help in making decisions here. Also pinging @cholmes, @bengmstrong, @BradNeuberg, and @Clay-foundation/all; please ping others as needed. We will need to make a decision within a month for the "Big Embeddings run". This implies lots of choices that are cheap to make now and VERY EXPENSIVE to correct later. Below are my questions and my suggestions; none of my suggestions are strongly held.
What are your thoughts @yellowcap @srmsoumya? How much effort would it take to pull this off on your side? Let's aim to kick this compute off July 15th? |
If we use WorldCover I would suggest a chip size of 100x100 or 200x200, so the chips fit nicely into their 10k x 10k source files. Maybe for Sentinel-2 we would use 100x100 to get a finer-grained resolution; not sure what kind of features we hope to find based on the embeddings.

Regarding the feature map output: there are 4 feature maps of 32x32 pixels for the 768-dimensional embeddings, stored as float32. If we assume the input is 4 bands of Sentinel-2 imagery at uint16, then the feature maps are much heavier than the original data. So I would not advise storing the feature maps; instead, rely on running the model at inference time when doing segmentation tasks (did I get this correctly @srmsoumya?).

Regarding cost, we would have to do more test runs to understand it better. We were able to do US-level runs already with a reasonable budget, so I think doing some continental-scale processing, or even global processing, should be doable. Note that the Sentinel-2 composites have limited quality in tropical areas: they are mostly cloud free, but not without haze, and there are small nodata gaps here and there, at least for the WorldCover composites. Happy to look at other sources for composite imagery if people have suggestions. Finally, I would add at least one NAIP run for all of the US to the wish-list as well.
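To make the storage argument concrete, here is a rough size comparison. This is only a sketch: the 256x256 px input chip and 8 px patch size (giving 32x32 patches) are assumptions for illustration, not settings confirmed in this thread.

```python
# Back-of-the-envelope size comparison per chip (assumed: 256x256 px input, 8 px patches).
input_bytes = 256 * 256 * 4 * 2              # 4 Sentinel-2 bands stored as uint16
feature_map_bytes = 4 * 32 * 32 * 768 * 4    # 4 maps x 32x32 patches x 768 dims, float32

print(f"input chip:   {input_bytes / 1e6:.2f} MB")
print(f"feature maps: {feature_map_bytes / 1e6:.2f} MB")
print(f"ratio:        {feature_map_bytes / input_bytes:.0f}x heavier")
```

Under those assumptions the feature maps come out roughly 24x heavier than the source chip, which supports recomputing them at inference time rather than storing them.
|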
After discussing with @brunosan and thinking a bit more about it, here is my rough "wishlist".
For each observation, ideally we'd have the following data (roughly sketched out).

This "wishlist" is motivated mostly by me wanting to dissect Clay embeddings and see what the model learns. Guiding questions are, e.g.: How does the complexity of embeddings change over space? How representative are embeddings of environmental and human-activity measures? Can Clay embeddings be used as geographic priors? This would also create a dense embedding database to be used in arbitrary downstream tasks, which allows direct comparison to competitors like MOSAIKS or SatCLIP. The approach would be as follows: download the Clay embedding with the lon/lat closest to the downstream location -> train a model y_lonlat = f(ClayEmbedding_lonlat) -> evaluate.
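A minimal sketch of that evaluation loop, under assumptions: it supposes the embeddings are released as a GeoParquet file with point geometries and an `embedding` array column (the file name and column names here are hypothetical), and uses a simple ridge regression as the downstream model f.

```python
import geopandas as gpd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical inputs: a GeoParquet of Clay embeddings and a set of labelled downstream points.
emb = gpd.read_parquet("clay_embeddings.parquet")       # columns: geometry, embedding
labels = gpd.read_file("downstream_labels.geojson")     # columns: geometry, y

# For each downstream location, grab the nearest embedding (lon/lat nearest neighbour).
joined = gpd.sjoin_nearest(labels.to_crs(emb.crs), emb, how="inner")
X = np.stack(joined["embedding"].tolist())
y = joined["y"].to_numpy()

# Train y_lonlat = f(ClayEmbedding_lonlat) and evaluate on held-out locations.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge().fit(X_train, y_train)
print("R^2 on held-out locations:", model.score(X_test, y_test))
```
|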
Very cool that you're gearing up for a global run! Would love to pull/play with your embeddings. I agree that Sentinel-2 annual composites are the right starting point for global embeddings. To enable comparisons with other models it would be nice to use the same public, free imagery. We've created/shared global Sentinel-2 L2A composites for 2023, which you are welcome to use (https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/), but they're a work in progress and do have some quality issues. One other note @brunosan: I think you dropped a factor of 10 in your back-of-the-envelope math. It looks like you should be able to get through 3600 inputs per dollar, right? (batch of 10 inputs / 10 seconds * 3600 sec/hour * 1 hr/$). So it might be more affordable than you think!! |
Thanks everyone. I love that we are getting momentum here.

TLDR; so far I'm leaning towards:
1. A global run of the Sentinel-2 yearly composite, at 100 px chip size, most recent year available, with the EG Sentinel-2 all-bands composite.
2. NAIP for CONUS. Latest, with 100 px chip size too.
3. Maybe? Selected locations (the training set?) to enable temporal and cross-instrument studies.
4. Maybe? The Satellogic set.

Released as CC BY (inheriting EG CC BY). Still TBD: format, and what losses we add.

Source imagery

Thanks @bengmstrong and the EG team for the data release. It seems to fit perfectly. Besides the files (https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/) and the STAC endpoint (https://stac.earthgenome.org/), this blog post explains the method: https://medium.com/radiant-earth-insights/announcing-public-access-to-our-global-cloud-free-imagery-archive-25b33dc675ec

It meets the criteria of:
- Fully open license (CC BY).
- Global (there are mentions of "errors"; should we get a black-list of these and re-run them when fixed? I've spot-checked and I only see the usual hard places, like locations with permanent clouds).
- Recent (2023), the only global open composite this recent.

Notes:
- This seems to be a median reduction of the "best" 16 scenes per location. What does "best" mean? (Least cloudy of the ~35 scenes/year?)

Chip size

Boils down to 50 px or 100 px in my opinion. Cost grows quadratically, since it's an area. Also, very small chips approach the patch size of 8 px, which means fewer chances for the self-attention to learn about the surroundings.

Since we use the average, we can recreate embeddings at bigger chip sizes just by averaging the smaller ones. It won't be exactly the same, since the patches computed with smaller chip sizes will not have paid "self-attention" to the patches outside of that small chip.

It worries me that an area of 1 km^2 is substantially big for many potential uses, limiting the usefulness of this large and expensive run, but doing smaller sizes is too expensive. We can do smaller chip sizes for selected places.

Cost estimates

From our "build" workers (the ones on the co-code app, which we might or might not use for this run), we see that in reality we are getting ~10k chips/h/worker (we use H100 GPUs, so this is the approach of a big GPU with a large batch rather than a cheaper GPU or CPUs). A worker costs $1800/month (~$2.5/h). Most of the time is spent on downloading, so chip size doesn't seem to be a strong factor. This works out to roughly 4k chips per dollar per worker. Uncanny that we get pretty much the same result as the napkin exercise (we should become consultants).

- 50x50 px chips: ~1,000 km^2/$ -> ~$510K to run the world
- 100x100 px chips: ~4,000 km^2/$ -> ~$127K to run the world

50 px is too expensive, 100 px is doable. I'm hopeful @yellowcap's optimism holds for this run. Let's just start and assess what coverage/$ we get.
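For reference, a short sketch of the same napkin math; the throughput, price, and Earth-surface figures are simply the assumptions stated above, not measurements.

```python
# Napkin cost math for the global run (numbers taken from the estimates above).
chips_per_hour = 10_000      # observed throughput per H100 worker
dollars_per_hour = 2.5       # ~$1800/month per worker
earth_area_km2 = 510e6       # total Earth surface, oceans included

chips_per_dollar = chips_per_hour / dollars_per_hour          # ~4,000 chips/$

for chip_px in (50, 100):
    km2_per_chip = (chip_px * 10 / 1000) ** 2                 # 10 m/px -> km^2 per chip
    km2_per_dollar = chips_per_dollar * km2_per_chip
    cost = earth_area_km2 / km2_per_dollar
    print(f"{chip_px}x{chip_px} px: {km2_per_dollar:,.0f} km^2/$ -> ~${cost:,.0f} to run the world")
```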
|
Some small suggestions:

Not sure if the scale is possible, but is there a monthly Sentinel-2 global composite available? If a Clay v2 model were trained on such a monthly basis, over several years for example, the model might learn strong seasonal and time-based correlations, which would be especially helpful for change detection problems.

In terms of chip size, can that be specified as a kind of metadata fed in, as is already done for sensor details? I see varying the chip size even for the same sensor as providing several advantages:
- At Planet we've noticed that small-object detection with embeddings is aided by having smaller chip sizes, for things such as small buildings or small forest-degradation areas. Being able to use a smaller chip size in urban areas, for example, would help with dealing with smaller objects over time.
- Varying the chip size acts as a regularizer that would force the model not to overfit to a particular chip size.
- Being able to use varying chip sizes in practice could be a powerful technique: use smaller chip sizes in known urban areas and larger chip sizes in relatively sparse areas to trade off accuracy vs. compute and storage costs.

In terms of storage, I agree GeoParquet is a good format, as well as storing the centroid latitude and longitude. At Planet we've also stored a geometry column that corresponds to the exact chip bounding box behind an embedding, which can be very helpful for knowing exactly where an embedding was generated from.

Another useful thing to optionally store is a visual product image chip for that embedding, as a preview URL stored along with the embedding. This is a chipped visual product for the underlying analytic imagery, stored as a PNG file in a Google bucket. It is very useful when presenting results to the user or showing things like clustering results. Not having a preview chip can make it much harder to deal with embeddings at scale.

At Planet we've been using 224x224 chips for our embeddings, with a 3 m GSD pixel size for PlanetScope. As you've found yourselves, going to smaller chip sizes can significantly increase compute and storage costs. Ultimately we've wanted to figure out a way to store something like a pyramid of different representations, something like Matryoshka embeddings, but that remains an R&D edge we haven't figured out yet.

Something else we store in our embedding GeoParquet files is quality information per chip, using a cloud and quality mask. This is very helpful to filter embeddings down based on quality, which is especially important for change detection problems over time. You might want to compute cloud and quality info using s2cloudless or Fmask and store it in a consistent way. We store percentages for haze, clouds, snow, null pixels, etc., and use GeoPandas to quickly filter embeddings.
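As an illustration of that kind of table, here is a minimal sketch; all column names and values are hypothetical, not Planet's or Clay's actual schema.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical schema: one row per chip, with the chip footprint, the embedding,
# an optional preview URL, and per-chip quality percentages.
gdf = gpd.GeoDataFrame(
    {
        "embedding": [[0.1] * 768, [0.2] * 768],   # toy vectors
        "preview_url": ["https://example.com/a.png", "https://example.com/b.png"],
        "pct_cloud": [2.0, 35.0],
        "pct_haze": [1.0, 10.0],
        "pct_nodata": [0.0, 0.5],
    },
    geometry=[box(-122.30, 47.60, -122.29, 47.61), box(-122.29, 47.60, -122.28, 47.61)],
    crs="EPSG:4326",
)
gdf.to_parquet("embeddings.parquet")

# Quality filtering before a change-detection workflow: keep only clean chips.
clean = gpd.read_parquet("embeddings.parquet").query("pct_cloud < 5 and pct_nodata == 0")
print(len(clean))
```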
|
Update here. |
An update with @yellowcap: we are getting ready, building the pipelines and testing the Earth Genome Sentinel-2 composite data:
|
Hi all, has there been validation of any version or checkpoint of this model on an existing benchmark suite such as GEO-Bench? If not, what are the major blockers? It seems valuable to do this prior to any global embeddings run, because the embeddings cannot be used to run those benchmarks post hoc, and if the benchmark metrics are poor then the embeddings would likely not be very good. |
Related #326 |
+1 to CC BY license, in alignment with the Digital Public Goods Alliance standards https://digitalpublicgoods.net/implement/ |
Update: |
Hi all! For NAIP, have you considered the problem of making sure that chips fully align with their counterparts from other years? This would be a desirable property to have for applications like change detection. Unfortunately, this property does not seem to hold by default because the extents of NAIP tiles from different years don't exactly match up. For example, consider the following two NAIP images of the same grid tile from 2018 and 2021 respectively:
Their extents and shapes don't match up:

```python
import rasterio as rio

with rio.Env():
    with rio.open('s3://naip-analytic/vt/2018/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20181123.tif') as ds:
        print(ds.bounds)
        print(ds.shape)
    with rio.open('s3://naip-analytic/vt/2021/60cm/rgbir_cog/42072/m_4207220_ne_18_060_20211019.tif') as ds:
        print(ds.bounds)
        print(ds.shape)

# output
# BoundingBox(left=699184.8, bottom=4728693.6, right=705106.2, top=4736384.4)
# (12818, 9869)
# BoundingBox(left=699226.2, bottom=4728750.6, right=705104.3999999999, top=4736383.8)
# (12722, 9797)
```

This means that the i-th 256x256 px chip from one scene would not cover an identical geographical area to the i-th 256x256 px chip from the other scene, which suggests the need for some mosaicing and reprojection. What do you guys think? Would love to hear thoughts on this! |
This is a great observation @AdeelH. It is one of the limitations of our current approach, where we simply cut up the imagery as it is. This will indeed lead to bounding boxes that are not 100% overlapping across the different images. The reason for this is that doing the re-projection is a huge lift, computationally and logistically. Aligning all this imagery is not a trivial task, and we are going for the simple approach first (dividing the image as-is into chips). In a future iteration we might be able to do the aligning and re-sampling. |
@yellowcap, @AdeelH just figured out how to do this for NAIP, as we were also working on generating vector embeddings at Element 84. He could share the approach and code we're using. Given the amount of inference you're going to run to generate vector embeddings for all of NAIP, it would be great if you could consider doing this first. We're currently working on a process to compare vector embeddings over time for the same area in order to do change detection, and we're seeing some really great initial results for a smaller area. It would be amazing to be able to do this across the entire US, which could be enabled if you have consistent chips across time. If the vector embeddings are put in a table, the chip identifiers can be used to do efficient spatial joins. |
@yellowcap, I hear you on the technical challenges. I think the approach we're using looks promising. The main idea is to tie the chip bounding boxes to the AOI and do the reprojection and mosaicing in chunks. Here's how it currently works, given an AOI.

The idea with chunks is to be able to process them in parallel on multiple machines. The reprojection/warping to EPSG:5070 is to ensure each chip covers roughly the same amount of area. Since NAIP imagery for different states is not collected at the same time, I don't think it makes sense to treat the whole CONUS as a single AOI; doing it state by state makes more sense.
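This is not the actual Element 84 code, just a minimal sketch of the general idea under stated assumptions: a fixed chip grid derived from the AOI bounds in EPSG:5070, with each NAIP scene resampled onto that grid so the i-th chip always covers the same footprint. The chip size, resolution, and function names are illustrative, and mosaicking across scene boundaries and parallel chunking are left out.

```python
import numpy as np
import rasterio as rio
from rasterio.transform import from_bounds
from rasterio.warp import Resampling, reproject

CHIP = 256          # chip size in pixels (assumed)
RES = 0.6           # target resolution in metres (NAIP-like, assumed)
CRS = "EPSG:5070"   # equal-area CRS so every chip covers roughly the same area

def chip_grid(aoi_bounds):
    """Yield chip bounding boxes on a fixed grid anchored to the AOI, not to any scene.
    Partial edge chips are skipped for simplicity."""
    left, bottom, right, top = aoi_bounds
    step = CHIP * RES
    y = top
    while y - step >= bottom:
        x = left
        while x + step <= right:
            yield (x, y - step, x + step, y)
            x += step
        y -= step

def read_chip(scene_path, chip_bounds):
    """Reproject/resample one chip footprint from a NAIP scene onto the common grid,
    so chips from different years line up geographically."""
    with rio.open(scene_path) as src:
        bands = list(range(1, src.count + 1))
        dst = np.zeros((src.count, CHIP, CHIP), dtype=src.dtypes[0])
        reproject(
            source=rio.band(src, bands),
            destination=dst,
            dst_transform=from_bounds(*chip_bounds, CHIP, CHIP),
            dst_crs=CRS,
            resampling=Resampling.bilinear,
        )
        return dst
```
|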
Thanks for the input and for sharing your algorithm @jasongilman and @AdeelH. At this point we will move forward with the simple approach, without doing the mosaicking. We have the algorithm ready to go and need to move due to time constraints. Once we have the NAIP and Sentinel-2 embeddings created, we can revisit the technique and see if we can do something more advanced, like the approach you propose. I'd love to try this out, but we can't fit it into the timeline we have right now. The code we are using for embedding generation sits here. |
Thanks again @jasongilman and @AdeelH. Sorry for not taking these last improvements into this round, but as @yellowcap says, this is just the first iteration. Lots of things to improve for the next one, including your points. This issue was opened in June and I'm already kicking myself for not having done this inference run much earlier. There are indeed many choices we had to make that we would love to revisit with wider feedback, like the size of the images (and the size of the patches!), overlaps, tiling/sliding, masking clouds or not, ... even whether we should also keep patch embeddings. The best way for us to take these comments is with PRs with the actual code to run, like we have for the documentation, so we can minimize overhead on our side. |
Thanks for the detailed answer. I can understand the demands on the schedule and the embeddings will still be really useful within a single NAIP release. I look forward to trying them out once it's done! |
We've been using Clay v1 embeddings directly, and via the Build/Explore apps. We've also done several types of partial benchmarking, so we are starting to feel comfortable with the quality of the model. We should therefore think about making large runs over existing open data to create embeddings, both for our own benefit, to continue learning about Clay, and to enable the community to leverage these open embeddings.
We still need to make decisions once we decide to make large runs:
- Chip size: 128x128?
- Ideally, we can wrap this code so it is easy to execute down the line, e.g. taking a STAC list and a spec file for chip_size, ... (see the sketch after the note below).
Note: Do not over-scope here, since we have the build app.
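To make the "spec file" idea concrete, here is a minimal hypothetical sketch; the field names, file layout, and `run_embeddings` entry point are all made up for illustration, not existing Clay pipeline code.

```python
# Hypothetical run spec + entry point; nothing here is existing Clay code.
import json
from dataclasses import dataclass

@dataclass
class RunSpec:
    stac_items: str          # path to a newline-delimited list of STAC item URLs
    chip_size: int = 128     # chip size in pixels
    output: str = "embeddings.parquet"

def run_embeddings(spec_path: str) -> None:
    """Read the spec, then hand each STAC item to the (hypothetical) embedding pipeline."""
    spec = RunSpec(**json.loads(open(spec_path).read()))
    with open(spec.stac_items) as f:
        items = [line.strip() for line in f if line.strip()]
    print(f"Would embed {len(items)} items at {spec.chip_size}x{spec.chip_size} px -> {spec.output}")

# Example spec file (JSON): {"stac_items": "items.txt", "chip_size": 128, "output": "run1.parquet"}
```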
Probably out of scope, but the end-state at some point this year could be:
Filing this early to allow community requests, but we should aim to set a date for such a run, e.g. end of June.