HPC modifications from first run #22

Open

vanderwb opened this issue Feb 7, 2023 · 2 comments

vanderwb (Collaborator) commented Feb 7, 2023

After the first run of this tutorial, the following modifications seem useful in the HPC section:

  1. Make sure viewers can run through the example without YAML config files!
  2. Show a comparison of various spill ratios on Casper and give guidance on recommended values (a minimal sketch of setting these thresholds in code follows this list).
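
For item 2, here is a minimal sketch of setting the spill-related worker memory thresholds in code rather than YAML (consistent with item 1). The fractions below are just the distributed defaults used as placeholders, not the Casper recommendations the comparison would produce:

```python
import dask

# Worker memory thresholds, expressed as fractions of the per-worker
# memory limit. These values are placeholders (the distributed defaults),
# not tuned recommendations for Casper.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling managed data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # pause scheduling new tasks
    "distributed.worker.memory.terminate": 0.95,  # restart the worker
})
```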

It would also be good to have a new section on analyzing performance metrics in more depth (a case study of a real workflow).

More to come!

dcherian (Contributor) commented Feb 7, 2023

That was a great tutorial. Here are some notes I made:

  • On resource allocation, one thing to think about is thread-based parallelism with Numba and NumPy. It should be possible to request more cores (say 36), use 9 Dask workers with 1 thread each, and then set NUMBA_NUM_THREADS=4 to enable thread parallelism with Numba. I've done this with a LocalCluster on a cheyenne compute node to good effect (see the sketch after these notes). Perhaps this should be in an "Advanced Examples" section somewhere.

    • I see the Nanny now sets these variables to 1, so perhaps we should demo how to override that when you want thread-based parallelism with Numba or NumPy on each Dask worker.
    • I think I did this initially to avoid reading too much data into memory (by limiting the number of Dask workers), and then crunch through the read data quickly since I had so many cores lying idle. This may not be necessary anymore with the scheduling improvements.
  • It may be a good idea to resurrect ncar_jobqueue and add NCARCluster.analyze to make the memory/CPU-time plots you were showing, and NCARCluster.validate to check for known misconfigurations (e.g., more Dask threads than the number of cores requested in the resource spec).
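
A minimal sketch of the thread-parallelism pattern from the first note, assuming a recent dask.distributed where the Nanny's pre-spawn environment is configurable via distributed.nanny.pre-spawn-environ (older releases used the distributed.nanny.environ key instead); the worker and thread counts match the 36-core example above:

```python
import dask
from dask.distributed import Client, LocalCluster

# The Nanny pins thread-count variables (OMP_NUM_THREADS, MKL_NUM_THREADS,
# OPENBLAS_NUM_THREADS) to 1 before spawning each worker. Overriding the
# pre-spawn environment lets Numba kernels fan out within each worker.
dask.config.set({
    "distributed.nanny.pre-spawn-environ.NUMBA_NUM_THREADS": 4,
})

# 9 single-threaded Dask workers x 4 Numba threads each ~= 36 cores,
# i.e. one full cheyenne compute node.
cluster = LocalCluster(n_workers=9, threads_per_worker=1)
client = Client(cluster)
```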

vanderwb (Collaborator, author) commented Feb 7, 2023

Thanks for the notes, Deepak - these look like excellent suggestions. From follow-up questions we've received, it seems we would want to spend more time discussing the following too:

  1. The distinction between cores, ncpus, processes, and workers when running a batch cluster (see the first sketch after this list).
  2. Chunking that spans the time dimension across multiple files with Xarray.
  3. Using blocks that have ghost cells from neighboring blocks (e.g., via map_overlap) - see the second sketch below.
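
For item 1, a hedged sketch of how the dask-jobqueue PBSCluster terms map onto a PBS resource spec; the queue name, sizes, and walltime are placeholders:

```python
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=4,         # threads Dask may use per PBS job
    processes=4,     # Dask worker processes per job -> 1 thread per worker
    memory="20GB",   # memory per job, divided among its workers
    resource_spec="select=1:ncpus=4:mem=20GB",  # what PBS actually allocates
    queue="casper",             # placeholder queue name
    walltime="01:00:00",
)
cluster.scale(jobs=2)  # 2 PBS jobs x 4 processes = 8 single-threaded workers
```

For items 2 and 3, a sketch using a hypothetical file pattern and variable name. Note that the chunks argument to open_mfdataset is applied file by file, so rechunking after opening is what actually produces time chunks that span file boundaries:

```python
import dask.array as da
import xarray as xr

# Open a multi-file time series; rechunk afterwards so a single chunk can
# span the "time" dimension across file boundaries.
ds = xr.open_mfdataset("ocean_temp_*.nc", combine="by_coords")
ds = ds.chunk({"time": 365})

# map_overlap pads each block with `depth` ghost cells from its neighbors,
# applies the function, then trims the halo off the result.
smoothed = da.map_overlap(
    lambda block: block,   # placeholder for a real stencil computation
    ds["temp"].data,       # hypothetical (time, lat, lon) variable
    depth={1: 1, 2: 1},    # one ghost cell along each spatial axis
    boundary="nearest",
)
```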
