Fix #6914 job_agent recover existing sbatch jobs #7404
base: master
Conversation
- Fix #7308 ui_websocket default is True; removed False case from test.sh
- job_supervisor run returns immediately and is not a task
- job_supervisor run_status_op pends until the run or status watcher completes (sketched below)
- run_status_update is a new op that is sent asynchronously from agent to supervisor
- job_agent separates out logic for run/state; reconnects to sbatch job
- job_cmd restructured with more error handling
- job_cmd centralized dispatch in _process_msg
- job_cmd._do_compute is more robust and supports separate run/status
- job documents more ops and statuses
- Added max_procs=4 to test.sh to parallelize tests
- Fixed global state checks (mpiexec) to allow parallel test execution
- Increased timeouts to allow for delays during parallel test execution
- Improved arg validation in simulation_db.json_filename
- sbatchLoginService commented out invalid state transitions
- SIREPO.srlog includes time
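A minimal asyncio sketch of the op flow above, using hypothetical stand-in names (`_RunState`, `_watch`) rather than Sirepo's actual classes: run returns immediately, run_status_op pends until the watcher completes, and the reply is a run_status_update-style message.

```python
import asyncio


class _RunState:
    """Illustrative stand-in; not Sirepo's actual supervisor class."""

    def __init__(self):
        self.status = "pending"
        self._done = asyncio.Event()

    def run(self):
        # Returns immediately and is not awaited; the watcher owns completion.
        self.status = "running"
        asyncio.create_task(self._watch())

    async def _watch(self):
        # Stand-in for polling the run/status of an sbatch job.
        await asyncio.sleep(0.1)
        self.status = "completed"
        self._done.set()

    async def run_status_op(self):
        # Pends until the run or status watcher completes, then replies.
        await self._done.wait()
        return {"op": "run_status_update", "status": self.status}


async def main():
    s = _RunState()
    s.run()                           # does not block
    print(await s.run_status_op())    # pends until the watcher finishes


asyncio.run(main())
```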
GitHub Actions speedup with max_procs=4 is 3x (8 vs. 25 minutes). The docker pull, pip install, fmt, etc. take 2.5 minutes, so the speedup is actually linear. I'm going to add SIREPO_MPI_CORES=2, because I think this will test the code better.
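As a rough illustration (not test.sh's actual mechanism): running independent test modules in a pool of four workers is why the speedup is near linear once the fixed ~2.5 minutes of setup is subtracted.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

MAX_PROCS = 4  # mirrors max_procs=4 in test.sh


def _run_one(path):
    # Each module gets its own process, so global state (e.g. the mpiexec
    # checks mentioned above) must be process-local to run in parallel.
    r = subprocess.run(["python", "-m", "pytest", path], capture_output=True)
    return path, r.returncode


def run_tests(paths):
    with ProcessPoolExecutor(max_workers=MAX_PROCS) as pool:
        return dict(pool.map(_run_one, paths))
```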
@e-carlin I'm still testing. Good to get started on the review now, though.
SIREPO_MPI_CORES=2 doesn't change the speed. I think this makes GH Actions a better test.
/bin/test does not exist, so just call test, which is a builtin. Took out printing of env for testing.
missing nextRequestSeconds
SlotProxy shows enter and exit from wait
fixed more status issues; can kill supervisor or agent on running, queued
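A minimal sketch of the kind of transition check involved, with status names assumed from the description above rather than taken from the PR:

```python
# Status names are assumptions, not the PR's actual tables.
_ALLOWED = {
    "pending": {"queued", "error", "canceled"},
    "queued": {"running", "error", "canceled"},
    "running": {"completed", "error", "canceled"},
}


def transition(cur, new):
    # A kill while running or queued must land in canceled/error,
    # never back in pending.
    if new not in _ALLOWED.get(cur, set()):
        raise AssertionError(f"invalid status transition {cur} -> {new}")
    return new
```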
@e-carlin ready for a review. Tests pass, and it seems to work on NERSC. I've done a lot more testing on NERSC than locally. I didn't test docker, but I don't think I modified that.
I'm working my way through reviewing. Probably another day. I left some initial comments.
Some quirks I noticed:
- If I'm running a sim (doesn't need to be under sbatch) and I kill -9 it from the terminal, the GUI reports it as canceled. Seems like it should be error (see the sketch below).
- If I kill -9 an agent (again, doesn't need to be under sbatch), then the GUI continues to report "running: awaiting output", even after a refresh.
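One way to distinguish the two cases (an assumption, not necessarily this PR's implementation): on POSIX, a process killed by a signal reports a negative returncode, so SIGKILL shows up as -9 and could map to error rather than canceled.

```python
import signal
import subprocess

p = subprocess.Popen(["sleep", "60"])
p.send_signal(signal.SIGKILL)
p.wait()
# A process killed by a signal reports -signum, so -9 here; a supervisor
# could treat any negative returncode as "error" instead of "canceled".
print(p.returncode)                      # -9
status = "error" if p.returncode < 0 else "completed"
print(status)                            # error
```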
Added to #7406 (comment).
Two errors while running simulations:
- openmc > aurora > wait for volume extraction > visualization > vagrant cluster > login > start > error: [No such file or directory]: open('/var/tmp/vagrant/sirepo/user/ZSLW4c4Y/openmc/ZSLW4c4Y-VqEsWQZE-openmcAnimation/in.json', 'r')
- flash > blast2 > run setup and compile > visualization > vagrant cluster > login > start > error: /home/vagrant/.pyenv/versions/py3/bin/python: can't open file '/var/tmp/vagrant/sirepo/user/6NiqqZff/flash/6NiqqZff-ykM8ISjL-animation/parameters.py': [Errno 2] No such file or directory
I've reviewed everything. Just a few more comments.
The code works well. There are a lot of changes and a lot of cases, so I'm sure there are some I didn't exercise.
Thank you. I know it was a lot and very complicated.
I appreciate the testing.
I ran into the same openmc error.
…pare Modularize access to run_dir_input file
Fix #7404 had to write in.json. Refactored that code. openmc works now. I didn't test flash.
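A hypothetical sketch of what "Modularize access to run_dir_input file" could look like; `write_run_dir_input` and `read_run_dir_input` are invented names, with the in.json filename taken from the error message above, not from the PR's actual code:

```python
import json
from pathlib import Path

_IN_FILE = "in.json"  # filename taken from the error message above


def write_run_dir_input(run_dir, msg):
    # Single writer: both the run and status paths go through here instead
    # of assuming the file already exists in the run_dir.
    d = Path(run_dir)
    d.mkdir(parents=True, exist_ok=True)
    p = d / _IN_FILE
    p.write_text(json.dumps(msg))
    return p


def read_run_dir_input(run_dir):
    return json.loads((Path(run_dir) / _IN_FILE).read_text())
```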
task #7385: job_supervisor run returns immediately and is not a task