Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #6914 job_agent recover existing sbatch jobs #7404

Open
wants to merge 44 commits into
base: master
Choose a base branch
from

Conversation

robnagler
Copy link
Member

@robnagler robnagler commented Dec 20, 2024

  • Fix Remove SIREPO_FEATURE_CONFIG_UI_WEBSOCKET=0 test case #7308 ui_websocket default is True and removed False case from test.sh
  • Fix Remove supervisor _run task #7385 job_supervisor run returns immediately and is not a task
  • job_supervisor run_status_op pends until run or status watcher complete
  • run_status_update is new op that is sent asynchronously from agent to supervisor
  • job_agent separate out logic for run/state; reconnects to sbatch job
  • job_cmd restructured and more error handling
  • job_cmd centralized dispatch in _process_msg
  • job_cmd._do_compute more robust and supports separate run/status
  • job documents more ops and statuses
  • Added max_procs=4 to test.sh to parallelize tests
  • Fixed global state checks (mpiexec) to allow parallel test execution
  • Increased timeouts to allow for delays during parallel test execution
  • Improve arg validation in simulation_db.json_filename
  • sbatchLoginService commented out invalid state transitions
  • SIREPO.srlog includes time

- Fix #7308 ui_websocket default is True and removed False case from test.sh
- job_supervisor run returns immediately and is not a task
- job_supervisor run_status_op pends until run or status watcher complete
- run_status_update is new op that is sent asynchronously from agent to supervisor
- job_agent separate out logic for run/state; reconnects to sbatch job
- job_cmd restructured and more error handling
- job_cmd centralized dispatch in _process_msg
- job_cmd._do_compute more robust and supports separate run/status
- job documents more ops and statuses
- Added max_procs=4 to test.sh to parallelize tests
- Fixed global state checks (mpiexec) to allow parallel test execution
- Increased timeouts to allow for delays during parallel test execution
- Improve arg validation in simulation_db.json_filename
- sbatchLoginService commented out invalid state transitions
- SIREPO.srlog includes time
@robnagler
Copy link
Member Author

GitHub Actions speed up with max_procs=4 is 3x (8 vs 25 minutes). The docker pull, pip install, fmt, etc. take 2.5 minutes so the speed up is actually linear. I'm going to add SIREPO_MPI_CORES=2, because I think this will test the code better.

@robnagler robnagler requested a review from e-carlin December 20, 2024 16:56
@robnagler
Copy link
Member Author

@e-carlin I'm still testing. Good to get started on the review now, though.

@robnagler
Copy link
Member Author

SIREPO_MPI_CORES=2 doesn't change the speed. I think this makes GH action a better teset.

@robnagler
Copy link
Member Author

@e-carlin ready for a review. Tests pass, and seems to work on NERSC. I've done a lot more testing of NERSC than local. I didn't test docker, but I don't think I modified that.

Copy link
Member

@e-carlin e-carlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working my way through reviewing. Probably another day. I left some initial comments.

Some quirks I noticed

  • If I'm running a sim (doesn't need to be under sbatch) and I kill -9 it from the terminal I the GUI reports it as canceled. Seems like it should be error
  • If I kill -9 an agent (again doesn't need to be under sbatch) then the gui continues to report "running: awaiting output". Even after refresh.

sirepo/job_driver/__init__.py Show resolved Hide resolved
sirepo/job_supervisor.py Outdated Show resolved Hide resolved
sirepo/job_driver/__init__.py Outdated Show resolved Hide resolved
sirepo/job_driver/__init__.py Outdated Show resolved Hide resolved
@robnagler
Copy link
Member Author

Added to #7406 (comment).

Copy link
Member

@e-carlin e-carlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two errors while running simulations:

  • openmc > aurora > wait for volume extraction > visualization > vagrant cluster > login > start > error: [No such file or directory]: open('/var/tmp/vagrant/sirepo/user/ZSLW4c4Y/openmc/ZSLW4c4Y-VqEsWQZE-openmcAnimation/in.json', 'r')
  • flash > blast2 > run setup and compile > visualization > vagrant cluster > login > start > error: /home/vagrant/.pyenv/versions/py3/bin/python: can't open file '/var/tmp/vagrant/sirepo/user/6NiqqZff/flash/6NiqqZff-ykM8ISjL-animation/parameters.py': [Errno 2] No such file or directory

Copy link
Member

@e-carlin e-carlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed everything. Just a few more comments.

The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

sirepo/job_supervisor.py Show resolved Hide resolved
sirepo/pkcli/job_agent.py Outdated Show resolved Hide resolved
sirepo/pkcli/job_agent.py Outdated Show resolved Hide resolved
sirepo/pkcli/job_agent.py Show resolved Hide resolved
sirepo/pkcli/job_agent.py Outdated Show resolved Hide resolved
@robnagler
Copy link
Member Author

I've reviewed everything. Just a few more comments.

Thank you. I know it was a lot and very complicated.

The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

I appreciate the testing.

@robnagler robnagler requested a review from e-carlin January 2, 2025 15:36
Copy link
Member

@e-carlin e-carlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into the same openmc error

#7404 (review)

@robnagler
Copy link
Member Author

Fix #7404 had to write in.json. Refactored that code. openmc works now. I didn't test flash.

@robnagler robnagler requested a review from e-carlin January 4, 2025 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove supervisor _run task Remove SIREPO_FEATURE_CONFIG_UI_WEBSOCKET=0 test case
2 participants