[use case demonstration] Kvikio Direct-to-gpu -> xarray -> xbatcher -> ml model #87
Comments
I like how you tagged NVIDIA hahaha. The RAPIDS folks (@jakirkham, @madsbk, @jacobtomlinson) were really interested in a blogpost about this stuff.
👍 for a blog post. I'd be happy to contribute to a draft blog post as @dcherian suggested at a recent Pangeo meeting for https://medium.com/pangeo (or https://medium.com/rapids-ai), but probably need to wait for pydata/xarray#6874 and zarr-developers/zarr-python#934 to get merged and new xarray and Zarr releases first. One issue with having this |
I love the idea of a blog post here. Perhaps we publish the post in a few places at once (xarray's blog would also work).
I think it's probably worth publishing a "cached" notebook here even though it won't be run by most folks. A strong disclaimer at the top stating the purpose will probably be sufficient to avoid confusion in the future.
OK, thanks for the prompt. I added a super brief intro blogpost here: xarray-contrib/xarray.dev#308 to get the word out. This blogpost could then just link to that for extra details.
At https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705, there are some ideas on how to run 'expensive' (read: GPU required) notebooks via the Pangeo Binder JupyterHub. It'll be more work than the caching solution, but it probably allows for easier reproducibility long-term for the wider community, especially if the GPU direct storage/kvikIO technology gets updated in the future and we need to re-run things for newer versions. Thoughts?
I think the eventual goal should be to build the examples that are 'expensive' and cross-cutting in terms of software (e.g., Kvikio Direct-to-gpu -> xarray -> xbatcher -> ml model) as part of the Project Pythia cookbooks and link to those cookbooks from the individual package docs (e.g., xbatcher). But, as discussed on that thread, some infrastructure developments are required before Project Pythia can support those examples. The notebook discussed here could be a great test case for the integration between JupyterHubs and JupyterBook and could be "cached" in xbatcher docs while that development happens.
Just on the infrastructure point, I noticed that GPU-enabled GitHub Actions is on the roadmap (github/roadmap#505), but I'm unsure if this will be limited to Teams/Enterprise plans only, as with https://github.blog/changelog/2022-09-01-github-actions-larger-runners-are-now-in-public-beta. In theory, this would allow us to store an uncached version of the notebook and run it from time to time (though it will probably cost some $$). Still, I think the Project Pythia cookbook method is worth pursuing, as the close integration with Pangeo Binder would allow users to actually run the example kvikIO notebook on the cloud. In practical terms, we could:
Now available in zarr-python 2.13.0a2 for testing.
Is there a cloud provider that has the necessary GDS stuff set up?
Tried running on Microsoft Planetary Computer (
Could perhaps try to get in a PR to install the necessary GPU direct storage and kvikIO packages; they're usually pretty responsive. Edit: opened issue at microsoft/planetary-computer-containers#51.
Oh, and if we do get GPU direct storage setup on Microsoft Planetary Computer (on Azure West Europe), I have an idea to get a demo working with the https://github.com/carbonplan/cmip6-downscaling dataset (since it's also on Azure West Europe?). This may or may not require the multi-resolution issue at #93 to be resolved, but it looked like a good Zarr machine learning dataset to play with. As a start, I did try this quickly: xr.open_dataset(
"https://cpdataeuwest.blob.core.windows.net/cp-cmip/version1/data/DeepSD/ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.DeepSD.pr.zarr",
engine="kvikio",
consolidated=False,
) but got a strange |
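Because `engine="kvikio"` only works where the GPU packages are installed, a shared notebook might guard the call so that it still opens the store on a CPU-only machine. The following is a minimal sketch under stated assumptions: that the `kvikio` engine needs both the `kvikio` and `cupy_xarray` packages to be importable, and that falling back to xarray's standard `zarr` engine is acceptable. The helper names `pick_engine` and `open_store` are hypothetical, and nothing here addresses the error mentioned above.

```python
import importlib.util


def pick_engine(preferred="kvikio", fallback="zarr"):
    """Return `preferred` only if the packages assumed to provide it are importable."""
    # Assumption: engine="kvikio" needs both of these packages installed.
    required = ("kvikio", "cupy_xarray")
    have_all = all(importlib.util.find_spec(name) is not None for name in required)
    return preferred if have_all else fallback


def open_store(url, **kwargs):
    """Open a Zarr store with whichever engine pick_engine() selects."""
    import xarray as xr  # imported lazily so pick_engine() works without xarray

    return xr.open_dataset(url, engine=pick_engine(), **kwargs)


engine = pick_engine()
print(f"Would open the store with engine={engine!r}")
```

With this in place, the call above would become `open_store(url, consolidated=False)`; on a machine without the GPU stack the data would simply land in host memory instead.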
Ok, looks like I've severely underestimated how long this is going to take 😅 Hoping to get some time to work on this in October 2023 🤞, but just gonna make a TODO list on things that need to happen:
Longer term, we'll also look into:
It may be a lot easier to experiment on NCAR systems once they can do it. @negin513 seems very interested in this kind of thing :)
Thanks for creating the to-do list @weiji14! As we discussed earlier today, I'll also have some time in October to contribute and am particularly interested in the kerchunk connections.
Starting with the name-brand CSPs is a reasonable first step. While lesser known, CoreWeave has been putting good effort into configuring hardware optimally. Though if you have your own system that you are planning to use long term, setting up there sounds good.
Cool, the idea is to enable more people to run kvikIO/NVIDIA GPUDirect Storage, either on a local GPU, or in the cloud if they don't have one. That's why I'd like to start with the documentation, and we could experiment on NCAR first to understand how involved the configuration would be. Once we've figured out the config settings, we can then expand to other HPC or commercial cloud systems. That CoreWeave offering does look nice, though I can't see on their webpage if they do support NVIDIA GDS (would like to hope that they do)!
Have managed to run some benchmark experiments on a WeatherBench2/ERA5 subset comparing Initial results are that |
@weiji14 can you please describe where these tests were run: on a local machine or in a cloud environment?
Hi @KiranModukuri, yes, these tests were run locally (using an NVIDIA RTX A2000 8GB GPU). I did try to set up a GCP container to run the benchmarks (WeatherBench2's ERA5 is at https://console.cloud.google.com/storage/browser/weatherbench2/datasets/era5), but was running into quota issues allocating GPUs on
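For anyone wanting to repeat this kind of comparison on their own hardware, a small timing harness could look like the sketch below. It is not the benchmark code used above: the names `time_callable` and `compare` are hypothetical, and the two lambdas are placeholder workloads standing in for real loaders that would read the same data subset with each engine.

```python
import statistics
import time


def time_callable(fn, repeats=5):
    """Return the median wall-clock seconds of `fn()` over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # a GPU loader should synchronize the device before returning
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(loaders, repeats=5):
    """Map each loader name to its median runtime in seconds."""
    return {name: time_callable(fn, repeats) for name, fn in loaders.items()}


# Placeholder workloads; real loaders would read the same subset per engine.
results = compare({
    "loader_a": lambda: sum(range(1_000)),
    "loader_b": lambda: sum(range(10_000)),
})
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s (median)")
```

The median is used rather than the mean so that a single slow first run (cold cache, lazy imports) does not dominate the result.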
What is your issue?
Recent developments by @NVIDIA and @dcherian are opening the door for direct-to-GPU data loading in Xarray. This could mean that when combined with Xbatcher and the TensorFlow or PyTorch data loaders, a complete workflow from Zarr all the way to ML model training could be accomplished without ever handling data on a CPU.
Here's a short illustration of the potential workflow:
This would be awesome to demonstrate in a single example. Perhaps as a second tutorial on Xbatcher's documentation site.
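The shape of the workflow described above can be sketched in plain NumPy. This is only a stand-in under stated assumptions, not the kvikIO, xarray, or Xbatcher API: `iter_batches` mimics slicing a dataset into batches along one dimension, `toy_train_step` is a placeholder for a real model update, and the `xp` parameter marks where a GPU array module could be swapped in. All three names are hypothetical.

```python
import numpy as np


def iter_batches(data, batch_size):
    """Yield consecutive slices along axis 0, dropping any ragged tail.

    Stand-in for what a batch generator does along a dimension such as
    "time"; this is not Xbatcher's implementation.
    """
    n_full = data.shape[0] // batch_size
    for i in range(n_full):
        yield data[i * batch_size:(i + 1) * batch_size]


def toy_train_step(batch, weight, lr=0.1):
    """One gradient step pulling `weight` toward the batch mean (placeholder model)."""
    grad = 2.0 * (weight - batch.mean())
    return weight - lr * grad


def run_pipeline(data, batch_size, xp=np):
    """Feed each batch to the toy model using array module `xp`.

    Passing xp=cupy (an assumption, untested here) would keep the
    arithmetic on the GPU, which is the point of the direct-to-GPU idea.
    """
    weight = 0.0
    n_batches = 0
    for batch in iter_batches(xp.asarray(data), batch_size):
        weight = float(toy_train_step(batch, weight))
        n_batches += 1
    return weight, n_batches


# Synthetic (time, lat, lon) cube standing in for a Zarr-backed variable.
cube = np.ones((25, 4, 4), dtype="float32")
final_weight, n_batches = run_pipeline(cube, batch_size=10)
print(n_batches, round(final_weight, 4))
```

In a real demonstration, the synthetic cube would be replaced by a dataset opened from Zarr and the toy step by a TensorFlow or PyTorch training loop.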
xref: xarray-contrib/cupy-xarray#10
cc @dcherian, @negin513, and @weiji14