More work on getting Dataflow to run #9

Merged: 67 commits merged into main on Aug 23, 2024
Conversation

@jbusecke (Contributor) commented Jul 12, 2024

Dataflow does not like underscores in job names, and our machinery here (pangeo-forge-runner) derives the job name from the recipe_id. Until we have a more general fix upstream, we will have to fix this within each feedstock.

This is somewhat frustrating (I have stumbled over it many times); I am wondering if there is an easy way to check/validate the values of recipe_id automatically in the linting (cc @andersy005?).

@jbusecke (Contributor Author)

An important note! In order to properly deploy (and in particular write to the proper location), the catalog.yaml has to be edited! See 1ed8bba for the details. I think some of this should be checked automatically (see leap-stc/LEAP_template_feedstock#51 for tracking).

@jbusecke (Contributor Author)

Ughhh, I think Dataflow also does not like uppercase letters; this is annoying...
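(For the linting question above: a minimal sketch of a check that could run in CI. The regex below is my reading of the failures in this thread, i.e. lowercase letters, digits, and hyphens only, starting with a letter; it matches my understanding of Dataflow's job-name rules but should be verified against the docs before adopting it.)

```python
import re

# Assumed job-name rule: lowercase letters, digits, and hyphens, starting with a letter.
VALID_RECIPE_ID = re.compile(r"^[a-z][a-z0-9-]*$")

def check_recipe_id(recipe_id: str) -> None:
    """Raise if a recipe_id would produce a job name that Dataflow rejects."""
    if not VALID_RECIPE_ID.fullmatch(recipe_id):
        suggestion = re.sub(r"[^a-z0-9-]", "-", recipe_id.lower())
        raise ValueError(
            f"recipe_id {recipe_id!r} contains underscores/uppercase; "
            f"consider {suggestion!r} instead."
        )

try:
    check_recipe_id("eNATL60_BLBT02")
except ValueError as err:
    print(err)  # suggests 'enatl60-blbt02'
```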

@jbusecke (Contributor Author)

Got past the naming issue, but there was a problem with the requirements: the xarray requirement was not formatted correctly (I am wondering how that did not fail in the local tests, but no big deal for now).

@jbusecke (Contributor Author)

Getting another error here that should have been caught in a local test?

ImportError: cannot import name 'ConsolidateMetadata' from 'pangeo_forge_recipes.transforms'

I think this was due to depending on an old pangeo-forge-recipes version.
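(In case it helps: a tiny local check, run in the environment built from the feedstock requirements, would have caught both the malformed xarray pin and the stale recipes version before the Dataflow submission. The `__version__` attributes are an assumption about the package layout.)

```python
# Fail fast locally if the pinned environment is too old for the recipe's imports.
from pangeo_forge_recipes.transforms import ConsolidateMetadata  # noqa: F401

import pangeo_forge_recipes
import xarray

print("pangeo-forge-recipes:", pangeo_forge_recipes.__version__)
print("xarray:", xarray.__version__)
```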

@jbusecke mentioned this pull request on Jul 12, 2024
@jbusecke (Contributor Author) commented Jul 12, 2024

The Dataflow job is now running! I can see the console here, but I think none of you can. We should change this (tracking this here).

@jbusecke (Contributor Author)

Hmmm, the Dataflow job seems stalled... I'll try to switch from Dataflow Prime to specific high-memory workers.

@jbusecke (Contributor Author)

The run succeeded 🎉

But the Copy() stage is missing, and thus this does not work:

import xarray as xr

# Expected output location (per catalog.yaml); fails because the Copy() stage never ran.
path = "gs://leap-scratch/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
xr.open_dataset(path, engine='zarr', chunks={})

Looking at the temp storage location, we can take a look at the output:

import xarray as xr

# Temporary output location written by the run (note the job-specific subfolder).
path = "gs://leap-scratch/data-library/feedstocks/output/eNATL_feedstock/enatl60-blbt02-9908751732-1/eNATL60_BLBT02.zarr"
xr.open_dataset(path, engine='zarr', chunks={})

[screenshot of the dataset repr]

We should never get rid of the Copy Stage!!! I have added an ad-hoc comment to the recipe in leap-stc/LEAP_template_feedstock#53, but it would be very helpful to get feedback from @SammyAgrawal on where this concept would be best explained for new feedstock creators.
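(To make the Copy-stage point concrete for new feedstock creators, here is a hedged, structural sketch of the transform chain. The stage names mirror the Dataflow logs later in this thread; the import path for Copy, its signature, and the placeholder pattern/preprocess/catalog_store_urls definitions are assumptions based on the LEAP template feedstock, not verbatim from this recipe.)

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    ConsolidateDimensionCoordinates,
    ConsolidateMetadata,
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)
# Assumed import path: Copy lives next to the injected-attrs helpers in
# leap-data-management-utils (see the link further down in this thread).
from leap_data_management_utils.data_management_transforms import Copy

# Placeholders standing in for what the real recipe module defines.
urls = ["https://zenodo.org/records/10513552/files/eNATL60-BLBT02_y2009m07d18.1d_TSWm_60m.nc"]
pattern = pattern_from_file_sequence(urls, concat_dim="time_counter")
catalog_store_urls = {
    "enatl60-blbt02": "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
}

def preprocess(ds):
    # Stand-in for the real renaming/set_coords logic.
    return ds

# StoreToZarr writes to a temporary target; Copy, the final stage, is what moves
# the finished store to the canonical location declared in catalog.yaml. Dropping
# Copy leaves the data stranded under the scratch output path, as seen above.
recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray()
    | "Preprocess" >> beam.MapTuple(lambda k, ds: (k, preprocess(ds)))
    | StoreToZarr(store_name="eNATL60-BLBT02.zarr", combine_dims=pattern.combine_dim_keys)
    | ConsolidateDimensionCoordinates()
    | ConsolidateMetadata()
    | Copy(target=catalog_store_urls["enatl60-blbt02"])  # never drop this stage
)
```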

@jbusecke (Contributor Author)

Ok, now everything runs as expected and I can see the final output (at the moment still pruned in time) as follows:

import xarray as xr

# Final location in the persistent bucket, as configured in catalog.yaml.
path = "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

[screenshot of the dataset repr]

I'll remove the time pruning now and run the whole thing!
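(For readers new to this: "pruned in time" means the recipe was only fed a couple of time steps while testing. Whether this recipe uses FilePattern.prune or simply slices its URL list isn't shown in the thread; this is a hedged sketch of the usual pattern, with a hypothetical URL list for illustration.)

```python
from pangeo_forge_recipes.patterns import pattern_from_file_sequence

# Hypothetical URL list for illustration; the real recipe builds its own.
urls = [
    f"https://zenodo.org/records/10513552/files/eNATL60-BLBT02_y2009m07d{day:02d}.1d_TSWm_60m.nc"
    for day in range(1, 32)
]
pattern = pattern_from_file_sequence(urls, concat_dim="time_counter")

# While iterating on Dataflow, only process the first couple of time steps...
pattern = pattern.prune(nkeep=2)
# ...and delete the prune() call again for the full production run.
```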

@jbusecke (Contributor Author)

Ah, the full run just failed with:

```
aiohttp.client_exceptions.ClientResponseError: 429, message='TOO MANY REQUESTS', url=URL('https://zenodo.org/records/10513552/files/eNATL60-BLBT02_y2009m07d18.1d_TSWm_60m.nc') [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|Preprocess|StoreToZarr|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenURLWithFSSpec/MapWithConcurrencyLimit/open_url-ptransform-81']
```

I think we have seen this before. Zenodo is relatively strict about many parallel requests, so I added a [concurrency limit](https://github.com/leap-stc/eNATL_feedstock/pull/9/commits/6152eb638cbcb152061d77832d6fdc11dfe373df).
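(The relevant knob, for reference: the parameter name is my inference from the MapWithConcurrencyLimit stage visible in the traceback, and the exact value used in the linked commit isn't shown here.)

```python
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec

# Throttle the open_url stage so Zenodo stops answering with HTTP 429.
# Per the comment below, Zenodo effectively wants serial downloads, so 1 is the safe choice.
open_urls = OpenURLWithFSSpec(max_concurrency=1)
```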

@jbusecke (Contributor Author) commented Jul 12, 2024

WTF why is this failing all the time now... EDIT: Apparently because Zenodo does not allow parallel downloads...lame.

@jbusecke (Contributor Author)

I suspect that we ran out of memory in the last run:
[screenshot]

Let's see how this fares when we bump the worker size.

@jbusecke (Contributor Author) commented Jul 22, 2024

It worked, but something seems off...

import xarray as xr

ds = xr.open_dataset("gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr", engine='zarr', chunks={})

[screenshot of the dataset repr]

The time is just [0, 1, 0, 1, ...]; I am wondering if the time encoding is lost here. Might be related to #5? @SammyAgrawal, do you still have a 'raw' file xarray representation around so we can compare?

Also, this is just 31 time steps, which seems low?
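(One quick way to tell whether the time encoding was dropped or the values themselves were written wrong, using the same store path as above:)

```python
import xarray as xr

path = "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr"

# Open without decoding to inspect what was actually written for `time`.
raw = xr.open_dataset(path, engine='zarr', chunks={}, decode_times=False)
print(raw.time.values[:5])   # raw stored values, e.g. [0, 1, 0, 1, ...]
print(raw.time.attrs)        # missing/odd 'units' or 'calendar' means the encoding was lost
print(raw.sizes["time"])     # confirm the 31-step count
```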

@norlandrhagen (Contributor)

@jbusecke how do you feel about merging this branch into main? Contingent on how you feel about the eNATL output.

@jbusecke (Contributor Author)

A few comments that hopefully are not too much work.
Curious what you think about only merging once the dataset is actually fully built vs. running the 'last build' from main, @norlandrhagen?

@jbusecke changed the title from "Fix recipe_id naming" to "More work on getting Dataflow to run" on Aug 22, 2024
@@ -42,6 +42,7 @@ jobs:
# AT that point, screw it, not worth it.
run: |
jobname="${{ env.JOB_NAME }}"
echo "$JOB_NAME"

Contributor Author (@jbusecke):
Is this useful to bring over to the template feedstock?

c.DataflowBakery.use_dataflow_prime = False
c.DataflowBakery.machine_type = "e2-highmem-16"
c.DataflowBakery.disk_size_gb = 400
c.DataflowBakery.use_shuffle = False

Contributor Author (@jbusecke):
What does this do? I am actually just curious. Again it might be good to document this as a 'case' in the template feedstock.


Contributor:
I should cut it, since I had to create a fork of pangeo-forge-runner to add it. It disables Dataflow shuffle, which I thought had some disk-space limitations, but I think I was wrong, so we can use shuffle.

name: "The even cooler large Proto Dataset" # no pyramids
url: "gs://leap-scratch/data-library/feedstocks/proto_feedstock/large.zarr"
- id: "enatl60-blbt02"
name: "Needs a name"

Contributor Author (@jbusecke):
@auraoupa Can you help here? This name would show up in the LEAP catalog; see the marked portion here as an example:

[screenshot: LEAP catalog entry]

ds = ds.set_coords(["deptht", "depthw", "nav_lon", "nav_lat", "tmask"])

ds = ds.rename({"time_counter": "time"})
ds = ds.set_coords(("nav_lat", "nav_lon"))

Contributor Author (@jbusecke):
Where did t_mask go? See #8 (comment)


Contributor:
good question, I'll rerun a subset to see what was up. We might have to regen.

ds = ds.set_coords(("nav_lat", "nav_lon"))
ds.attrs["deptht"] = ds.deptht.values[0]
ds = ds.drop("deptht")
ds = ds[["vosaline", "votemper", "vovecrtz"]]

Contributor Author (@jbusecke):
Ah probably dropped here!

ds = ds.set_coords(["deptht", "depthw", "nav_lon", "nav_lat", "tmask"])

ds = ds.rename({"time_counter": "time"})
ds = ds.set_coords(("nav_lat", "nav_lon"))

Contributor Author (@jbusecke):
Suggested change:

- ds = ds.set_coords(("nav_lat", "nav_lon"))
+ ds = ds.set_coords(("nav_lat", "nav_lon", "t_mask"))

?


Contributor:
Ah I think I remember. I'm pretty sure some of the input netcdf files are missing "t_mask".

@norlandrhagen (Contributor)

> Few comments, that hopefully are not too much work? Curious what you think about only merging once the dataset is actually fully built vs running the 'last build' from main @norlandrhagen?

I kind of think we should do our 'prod' build from main, then iterate from that if we need updates? Also, I wonder if we should figure out how to incorporate a git commit into the dataset metadata?

@jbusecke (Contributor Author)

> Also, I wonder if we should figure out how to incorporate a git commit into the dataset metadata?

Already part of the injected attrs by default 😁

https://github.com/leap-stc/leap-data-management-utils/blob/5d1e80270ec0d6373c5e9947ea4653e90c507bd0/leap_data_management_utils/data_management_transforms.py#L150
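(If someone wants to verify this on an existing store, something like the following should surface the injected attrs; the attribute names I grep for are a guess on my part, and the linked data_management_transforms.py is the authoritative list.)

```python
import xarray as xr

ds = xr.open_dataset(
    "gs://leap-persistent/data-library/feedstocks/eNATL_feedstock/eNATL60-BLBT02.zarr",
    engine='zarr',
    chunks={},
)
# Grep the global attrs for anything that looks like injected provenance.
provenance = {
    k: v
    for k, v in ds.attrs.items()
    if any(s in k.lower() for s in ("git", "commit", "feedstock", "pangeo_forge"))
}
print(provenance)
```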

@norlandrhagen (Contributor)

Ah incredible, I forgot about this haha.

@jbusecke merged commit a412394 into main on Aug 23, 2024 (3 of 4 checks passed).
@jbusecke deleted the recipe-id-wo-underscore branch on August 23, 2024 at 14:34.