Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize reconstructions with submitit #477

Merged
merged 28 commits into from
Sep 17, 2024
Merged

Conversation

talonchandler
Copy link
Collaborator

@talonchandler talonchandler commented Jul 15, 2024

This PR uses submitit to parallelize reconstructions with the following features:

  • if slurm is available, submits batched jobs to nodes (one job per ome-zarr position)
  • if slurm is not available, submits jobs to individual processes (one PID per ome-zarr position)
  • estimates resources based on the size of the inputs, with a user-controlled parameter --ram-multiplier

This PR also features several nice-to-have features:

  • the CLI can now accept a -i plate.zarr, which it will expand into a list of positions
  • a simple monitor_jobs utility that makes well-job-node mappings very clear---handy for when things fail
Screenshot 2024-07-18 at 2 15 18 PM
  • reasonable failure messaging---if at least one job fails, the monitor prints the first failed job's STDOUT and STDERR.

@talonchandler talonchandler changed the title Prototype submitit parallelization Parallelize reconstructions with submitit Jul 18, 2024
Copy link

codecov bot commented Jul 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 9.54%. Comparing base (7687259) to head (58c094f).
Report is 2 commits behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main    #477   +/-   ##
=====================================
  Coverage   9.54%   9.54%           
=====================================
  Files         30      30           
  Lines       4591    4591           
=====================================
  Hits         438     438           
  Misses      4153    4153           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@talonchandler
Copy link
Collaborator Author

talonchandler commented Jul 18, 2024

submitit doesn't support 3.12. Dropping.

@talonchandler talonchandler marked this pull request as ready for review July 18, 2024 22:15
@edyoshikun
Copy link
Contributor

edyoshikun commented Jul 22, 2024

@talonchandler
I tried to use this but couldn't get it to work. Here is my script: /hpc/projects/tlg2/photom/20240719_3df_photom_countour_demo/1-reconstruction/reconstruct.sh

Here are some comments trying to use this:
Major:

  • In the script I set it to have any variable other than 1 it launches simultaneous processes but it ran 10 in parallel.
  • We should change the way the resources are allocated. Currently, depending on the amount of memory required, in this case, using the default -j and -rx flags, this job requests 49GB and 10CPU. This can be challenging when reserving memory for the jobs as we would have to wait for that much memory to be available, especially if we use the multiplier. If we could submit them in an array that would be great.
  • One cannot set the parameters for --rx when running recorder reconstruct

Minor

  • Can we have logs written to a subfolder based on the SLURM_JOB_ID so that it's easier to debug?
  • After I cancel a job or try to re-run the job, my terminal's font color swaps to green or red.

@talonchandler
Copy link
Collaborator Author

Thanks for testing @edyoshikun!

couldn't get it to work. Here is my script: /hpc/projects/tlg2/photom/20240719_3df_photom_countour_demo/1-reconstruction/reconstruct.sh

Hmmm...I couldn't get your script to fail after ~10 trials. We might need to pair on your failures. You got a SIGTERM...what terminal are you running from? Did you always get the same failure? (I saw one set of failures in /hpc/projects/tlg2/photom/20240719_3df_photom_countour_demo/1-reconstruction-submitit/logs/15519888_0_log.out and the other failures seem like parameter issues).

Major concerns:

You helped me find a bug that accidentally set the number of requested CPUs to the number of timepoints. This branch now requests the same number of CPUs as the -j flag. This fix addresses your first two major concerns---if you want to requests fewer resources you can make -j and -rx smaller.

One cannot set the parameters for --rx when running recorder reconstruct

Fixed! Thanks for flagging.

Can we have logs written to a subfolder based on the SLURM_JOB_ID so that it's easier to debug?

Hmmm...on a first pass I'm not seeing an easy way to do this (without a submitit PR or duplication of a large submitit class), and I'm trying to understand your debugging case where folders would help? Naively, the only difference between a folder and the current behavior is a / changed to a _?

After I cancel a job or try to re-run the job, my terminal's font color swaps to green or red.

I think I've fixed this by always printing white when the monitor exits. If this persists we might need to pair to debug this.

Thanks for your helpful review!

@edyoshikun
Copy link
Contributor

It ran now without any issues. I want to say the culprit is cpu-b-3 node because my jobs which were now allocated on cpu-b-6 completed without an error. For these reconstructions, can these be launched to GPUs? Currently, the default is cpus.

Hmmm...I couldn't get your script to fail after ~10 trials. We might need to pair on your failures. You got a SIGTERM...what terminal are you running from? Did you always get the same failure? (I saw one set of failures in

I was using the linux shell on the HPC launching the jobs from fry2 which should not be an issue.

@talonchandler
Copy link
Collaborator Author

@edyoshikun I just finished an end-to-end test, and I happy with this branch. I also committed a change that resets terminal colors when this is complete.

In my opinion this is ready to merge.

Copy link
Contributor

@edyoshikun edyoshikun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@talonchandler talonchandler added this pull request to the merge queue Sep 17, 2024
Merged via the queue into main with commit 6c70732 Sep 17, 2024
9 checks passed
@edyoshikun edyoshikun deleted the prototype-submitit branch September 17, 2024 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants