Guidelines on running Pace v0.1.0 with GPU? #371
I think the above issue is caused by the OpenMPI installed from apt not being CUDA-aware, and potentially the cupy-cuda117 runtime not being compatible. I recompiled OpenMPI and cupy on a p3 AWS instance and it now makes progress. However, I am not sure how I can tell when compilation has finished.
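One way to test the CUDA-awareness suspicion (hedged: this relies on the standard `ompi_info` tool and the MCA parameter name used by recent Open MPI releases) is to ask Open MPI directly whether it was built with CUDA support:

```shell
# Report whether the installed Open MPI was built CUDA-aware.
# Prints the mpi_built_with_cuda_support value, or a fallback message.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --parsable --all | grep -m 1 'mpi_built_with_cuda_support:value' \
        || echo "mpi_built_with_cuda_support parameter not found"
else
    echo "ompi_info not found"
fi
```

An apt-installed Open MPI will typically report `false` here, which matches the behavior described above.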
Hi @miaoneng,
That information will be in the log. It can be the case that some ranks are still compiling code after the first timestep starts, but by the end of the first step you can be assured that all code is compiled. In a more concrete sense, all the compilation happens during the first timestep.
This is how we currently run it, but I believe it is probably not required to be that way. Let me know if you still run into issues getting going with it. Happy to help!
Hi @jdahm. Thank you for your kind response. I am running the c12 test on one p3.2xlarge node as a test, and I am not sure whether 16 GB of GPU memory is sufficient. I will let it compile overnight and see what it reports tomorrow. Thank you.
I left the machine overnight, and I think it completed compilation but failed at execution. I restarted it, and I think it reuses the cached binaries, so startup is much faster. However, it still failed at an early execution stage.
Could you provide any suggestions?
Apologies for the delay. Is this on a recent commit? There is a check for a "GPU" backend internally that does not catch every GPU backend name. Hope this helps!
It looks like this scalar is initialized using a dynamic numpy module, which should be cupy for GPU backends. The numpy module is selected by:

```python
if (
    stencil_factory.config.is_gpu_backend
    and stencil_factory.config.dace_config.is_dace_orchestrated()
):
    self.gfdl_cloud_microphys_init(namelist.dt_atmos, cp)
else:
    self.gfdl_cloud_microphys_init(namelist.dt_atmos, np)
```

We're looking at updating the code so that an exception is raised if an unexpected backend is used.
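The "pick numpy or cupy from the backend" idea can be sketched generically. The helper below is hypothetical (not Pace's actual API) and the backend-name sets are assumptions for illustration; it also raises on unknown names, which is the behavior the maintainers describe wanting to add:

```python
import numpy as np


def select_array_module(backend: str):
    """Return the array module: cupy for GPU backends, numpy for CPU ones.

    Raises ValueError for names it does not recognize, rather than silently
    falling back to numpy on a GPU run.
    """
    # Illustrative name sets only; the real list lives in the framework.
    cpu_backends = {"numpy", "gt:cpu_ifirst", "gt:cpu_kfirst", "dace:cpu"}
    gpu_backends = {"cuda", "gt:gpu", "dace:gpu"}
    if backend in gpu_backends:
        import cupy as cp  # requires a CUDA-capable environment
        return cp
    if backend in cpu_backends:
        return np
    raise ValueError(f"unexpected backend: {backend!r}")
```

With a scheme like this, a typo'd or unsupported backend fails loudly at startup instead of initializing GPU fields with numpy.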
Thank you for your valuable response. I just changed the backend as suggested. It usually takes a whole night to compile. I am currently using the v0.1 GMD-tagged version. Should I switch to main?
Best
Yes, compilation does take quite a while. For the full model I'd expect an hour or two, but it depends on your system. It should run much faster once you have a cache of compiled code. For this reason, we have systems in place to compile all configurations using only 9 ranks before we submit many-node performance runs. Switching to main or not is up to you; not knowing anything about your use of the code, I would suggest sticking to the tagged version. If you are mainly looking to reproduce the paper results and check out the model, this should work best. If you want to use this code more extensively for a project, I'd be happy to meet and discuss our current development plans and how best to keep you updated on our latest changes. The APIs are all still subject to change, but some features are more stable than others.
Looks like I am still hitting the same error. Here is the top section of the config.
I deleted the cache and reran. I put the whole log here: https://pastebin.com/WS4iafRS
That's odd; I'm starting to suspect this is a real bug and not an issue with how you're running the code. In a PR I am running into an issue with the same section of code for different reasons. I should be able to investigate this today and get back to you.
I wasn't able to reproduce your errors because of cupy not importing inside the docker image you provided, which I think is why you moved to working bare-metal. If you can get something running in Docker I can try to debug your issue directly. Otherwise, I'll let you know when this section of the code is updated and your problem will likely be solved (though I won't be able to confirm by testing it myself).
I can definitely work towards a Docker image.
Here is my Dockerfile.
I think I am able to run the application, and it stops at (probably) the same location.
I had to add an apt-get install of git.
If I follow what it says and
Do you have any ideas why this image is behaving differently on my machine? |
I am not 100% sure. Let me relinquish this instance, get a whole new p3d instance, and retry the docker image. Here is the env after entering the docker image:
To recap, I checked out the v0.1.0 tag:

```
(base) mcgibbon@jeremy-vm-gpu:~/python/pace$ git status
HEAD detached at v0.1.0
```

I modified requirements_dev.txt with:

```diff
diff --git a/requirements_dev.txt b/requirements_dev.txt
index 6a4386b8..63897b71 100644
--- a/requirements_dev.txt
+++ b/requirements_dev.txt
@@ -12,7 +12,7 @@ fv3config>=0.9.0
 dace>=0.14
 f90nml>=1.1.0
 numpy>=1.15
--e external/gt4py
+-e external/gt4py[cuda117]
 -e util
 -e stencils
 -e dsl
```

I modified the Dockerfile by replacing it with your copy. I modified baroclinic_c12.yaml with:

```diff
diff --git a/driver/examples/configs/baroclinic_c12.yaml b/driver/examples/configs/baroclinic_c12.yaml
index 6aa295b8..b1a2c6bc 100644
--- a/driver/examples/configs/baroclinic_c12.yaml
+++ b/driver/examples/configs/baroclinic_c12.yaml
@@ -1,6 +1,6 @@
 stencil_config:
   compilation_config:
-    backend: numpy
+    backend: gt:gpu
     rebuild: false
     validate_args: true
     format_source: false
```

I forced a rebuild of the image and entered the interactive docker environment with:

```
(base) mcgibbon@jeremy-vm-gpu:~/python/pace$ make _force_build enter
```

While doing this, the image failed to build on the third-last step, with this error tail:
I then added the following line to the Dockerfile just before that step:
And I re-ran the build. From there, I run:

```
root@jeremy-vm-gpu:/# mpirun -n 6 --oversubscribe python3 -m pace.driver.run /pace/driver/examples/configs/baroclinic_c12.yaml
Traceback (most recent call last):
  File "/pace/util/pace/util/utils.py", line 14, in <module>
    cp.cuda.runtime.deviceSynchronize()
AttributeError: module 'cupy' has no attribute 'cuda'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/pace/driver/pace/driver/__init__.py", line 1, in <module>
    from .comm import (
  File "/pace/driver/pace/driver/comm.py", line 7, in <module>
    import pace.dsl
  File "/pace/dsl/pace/dsl/__init__.py", line 3, in <module>
    from pace.util.mpi import MPI
  File "/pace/util/pace/util/__init__.py", line 1, in <module>
    from . import testing
  File "/pace/util/pace/util/testing/__init__.py", line 3, in <module>
    from .dummy_comm import ConcurrencyError, DummyComm
  File "/pace/util/pace/util/testing/dummy_comm.py", line 1, in <module>
    from ..local_comm import ConcurrencyError  # noqa
  File "/pace/util/pace/util/local_comm.py", line 6, in <module>
    from .utils import ensure_contiguous, safe_assign_array
  File "/pace/util/pace/util/utils.py", line 16, in <module>
    except cp.cuda.runtime.CUDARuntimeError:
AttributeError: module 'cupy' has no attribute 'cuda'
Traceback (most recent call last):
  File "/pace/util/pace/util/utils.py", line 14, in <module>
    cp.cuda.runtime.deviceSynchronize()
AttributeError: module 'cupy' has no attribute 'cuda'
```

With all of these differences, I don't think I am running the same docker image you're running. Are you positive what you pasted is exactly the same as what you built? If so, is there some way you can provide me with your pre-built image so I can run that instead?
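The AttributeError above comes from cupy being importable but unusable when no CUDA device is visible to the process. A defensive probe of this situation (a sketch of the same idea as in pace.util.utils, not its exact code) looks like:

```python
import numpy as np

try:
    import cupy as cp
    # This call fails (ImportError, AttributeError, or CUDARuntimeError)
    # when no usable CUDA device is visible, e.g. a container started
    # without the --gpus flag.
    cp.cuda.runtime.deviceSynchronize()
except Exception:
    cp = None  # callers fall back to numpy when cp is None
```

Catching the bare AttributeError matters here: the original code's `except cp.cuda.runtime.CUDARuntimeError` itself touches `cp.cuda`, which is what produced the confusing secondary traceback.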
OK, I was finally able to resolve the "no attribute 'cuda'" issue. It was because I didn't include your --gpus flag in my docker run command. I edited the Makefile:

```diff
diff --git a/Makefile b/Makefile
index e17d570f..81b8fff0 100644
--- a/Makefile
+++ b/Makefile
@@ -91,6 +91,7 @@ enter:
 	docker run --rm -it \
 		--network host \
 		$(VOLUMES) \
+		--gpus '"device=0"' \
 		$(PACE_IMAGE) bash

 dev:
```

And also removed the
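A quick way to confirm the --gpus flag took effect is to check GPU visibility inside the container (assuming the image ships `nvidia-smi`, which NVIDIA base images do):

```shell
# List GPUs visible inside the container; prints a hint if none are exposed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "no GPU visible to nvidia-smi"
else
    echo "nvidia-smi not found (was the container started with --gpus?)"
fi
```

If this prints nothing useful, cupy will hit exactly the "no attribute 'cuda'" failure mode above.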
@miaoneng sorry for the lack of progress on this over the past week, I came down with covid and have not gotten any work done. I'm leaving starting Thursday for essentially the remainder of AI2's participation in the Pace project (barring paper revisions), which is being handed over to GFDL at the end of this year. @jdahm will be taking charge of helping you with this issue as much as we can with the time available to us.
Hi @miaoneng,
Using the system setup, I ran:

```
$ make _force_build enter  # builds docker image, starts container, and attaches shell
container $ apt install -y --no-install-recommends git  # this was still required for some reason
container $ mpirun -n 6 --oversubscribe python3 -m pace.driver.run examples/configs/baroclinic_c12.yaml
```

Did that work for you?
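As an aside, the `-n 6` in the mpirun command comes from FV3's cubed-sphere grid, which has six tiles; the total rank count scales as 6 times the per-tile layout. A small illustration (the helper name is mine, not Pace's):

```python
def total_ranks(layout_x: int, layout_y: int) -> int:
    """Six cubed-sphere tiles, each decomposed layout_x * layout_y ways."""
    return 6 * layout_x * layout_y


# The c12 example uses one rank per tile, hence mpirun -n 6.
assert total_ranks(1, 1) == 6
```

Larger runs simply increase the layout, e.g. a 2x2 per-tile layout needs 24 ranks.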
Hi @jdahm. Could you kindly post the
@mcgibbon I am sorry to hear that and truly hope you have a fast recovery. Thank you for your time and I will follow your posts to recreate the environment and give it a try again. |
Closing this for now. Let us know if you still have questions!
Thank you, John. I actually didn't have time to work on it at all in the past couple of weeks. I will reopen in the future if I encounter any issues.
I am trying to set up a lab to replicate "Pace v0.1: A Python-based Performance-Portable Implementation of the FV3 Dynamical Core". Due to #355 I cannot access the docker environment as provided in the docs (i.e., `make dev` doesn't work), so I tried to start with the provided Dockerfile.
I am using the v0.1.0 release because I am assuming this is the version for the submission, and I modified `requirements_dev.txt` to install gt4py with cuda117 features, like `gt4py[cuda117]`.
For the sake of simplicity, I started with Nvidia's docker images. Here is my Dockerfile.
Then I modified `driver/examples/configs/baroclinic_c12.yaml` to change the backend to `cuda` (which I believe invokes cupy as the code generator). My modified top part is like the following.
Then I ran the command line inside the Docker image.
After the kernels were compiled, the program crashed as follows.
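For reference, a sketch of what that modified top section would look like (the `backend` value is the change described above; the other keys mirror the stock config and are assumptions here):

```yaml
stencil_config:
  compilation_config:
    backend: cuda        # changed from: numpy
    rebuild: false
    validate_args: true
    format_source: false
```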
BTW, the numpy backend works, but with no written guideline for running with the GPU backend, I am not sure if I am doing it the right way.
Could you help me triage the issue or provide any additional instructions?
Thank you.