Fix #6914 job_agent recover existing sbatch jobs #7404

robnagler · 2024-12-20T16:29:21Z

Fix #7308 (Remove SIREPO_FEATURE_CONFIG_UI_WEBSOCKET=0 test case): ui_websocket default is True and removed False case from test.sh
Fix #7385 (Remove supervisor _run task): job_supervisor run returns immediately and is not a task
job_supervisor run_status_op pends until run or status watcher complete
run_status_update is new op that is sent asynchronously from agent to supervisor
job_agent separate out logic for run/state; reconnects to sbatch job
job_cmd restructured and more error handling
job_cmd centralized dispatch in _process_msg
job_cmd._do_compute more robust and supports separate run/status
job documents more ops and statuses
Added max_procs=4 to test.sh to parallelize tests
Fixed global state checks (mpiexec) to allow parallel test execution
Increased timeouts to allow for delays during parallel test execution
Improve arg validation in simulation_db.json_filename
sbatchLoginService commented out invalid state transitions
SIREPO.srlog includes time
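
For readers unfamiliar with the pattern, here is a minimal sketch of one plausible reading of "run returns immediately" and "run_status_op pends" while status updates arrive asynchronously. It is not the sirepo implementation: the class `Supervisor`, its methods, and the status strings are all hypothetical names chosen for illustration.

```python
import asyncio


class Supervisor:
    """Hypothetical stand-in for the supervisor side; not sirepo code."""

    _TERMINAL = frozenset(("completed", "error", "canceled"))

    def __init__(self):
        self.status = "missing"
        self._changed = asyncio.Event()

    def run(self):
        # Returns immediately: no task is created to follow the run.
        self.status = "pending"
        return self.status

    def on_run_status_update(self, status):
        # Called whenever the agent sends an asynchronous status update.
        self.status = status
        self._changed.set()

    async def run_status(self):
        # Pends until the run reaches a terminal status.
        while self.status not in self._TERMINAL:
            self._changed.clear()
            await self._changed.wait()
        return self.status


async def demo():
    s = Supervisor()
    first = s.run()
    waiter = asyncio.ensure_future(s.run_status())
    # Simulate the agent reporting progress out of band.
    for status in ("running", "completed"):
        await asyncio.sleep(0)
        s.on_run_status_update(status)
    return first, await waiter
```

Calling `asyncio.run(demo())` exercises the sketch: `run` answers at once, and the waiter only returns after the second simulated update. The waiter rechecks the status on every wake-up, so it does not depend on the exact order in which updates and waits interleave.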

robnagler · 2024-12-20T16:56:32Z

GitHub Actions speedup with max_procs=4 is 3x (8 vs 25 minutes). The docker pull, pip install, fmt, etc. take 2.5 minutes, so excluding that fixed cost the tests themselves run about 4x faster (22.5 vs 5.5 minutes), and the speedup is actually linear. I'm going to add SIREPO_MPI_CORES=2, because I think this will test the code better.
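
The "linear" claim can be checked by subtracting the fixed setup time from both totals. This is only a small worked calculation using the rounded figures quoted above; the helper name `phase_speedup` is made up for illustration.

```python
SETUP_MINUTES = 2.5  # docker pull, pip install, fmt, etc.


def phase_speedup(serial_total, parallel_total, setup=SETUP_MINUTES):
    """Speedup of the test phase alone, excluding the fixed setup time."""
    return (serial_total - setup) / (parallel_total - setup)


overall = 25 / 8                    # end-to-end wall clock ratio
tests_only = phase_speedup(25, 8)   # 22.5 / 5.5
print(f"{overall:.1f}x overall, {tests_only:.1f}x tests only")
```

With max_procs=4, a tests-only ratio close to 4 is what linear scaling would look like.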

robnagler · 2024-12-20T16:57:33Z

@e-carlin I'm still testing. Good to get started on the review now, though.

robnagler · 2024-12-20T17:18:25Z

SIREPO_MPI_CORES=2 doesn't change the speed. I think this makes the GH action a better test.

robnagler · 2024-12-24T23:29:05Z

@e-carlin ready for a review. Tests pass, and it seems to work on NERSC. I've done a lot more testing on NERSC than locally. I didn't test docker, but I don't think I modified that.

e-carlin

I'm working my way through reviewing. Probably another day. I left some initial comments.

Some quirks I noticed:

If I'm running a sim (doesn't need to be under sbatch) and I kill -9 it from the terminal, the GUI reports it as canceled. Seems like it should be error.
If I kill -9 an agent (again, doesn't need to be under sbatch), then the GUI continues to report "running: awaiting output", even after a refresh.
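
One way the canceled and error cases could be told apart, sketched with the standard library only: on POSIX, `subprocess` reports a child that died from a signal as a negative return code, so a SIGKILL nobody asked for can be mapped to error rather than canceled. The helper `classify_exit` and the `cancel_requested` flag are hypothetical and not part of sirepo.

```python
import signal
import subprocess
import sys


def classify_exit(returncode, cancel_requested):
    """Map a child's return code to a status string (hypothetical helper)."""
    if returncode == 0:
        return "completed"
    if cancel_requested:
        # We sent the signal ourselves, so this is a real cancel.
        return "canceled"
    # Non-zero exit or death by an unexpected signal (negative on POSIX).
    return "error"


def demo():
    # Start a long-running child, then kill it the way `kill -9` would.
    p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    p.send_signal(signal.SIGKILL)
    p.wait()
    return p.returncode, classify_exit(p.returncode, cancel_requested=False)
```

The sketch only covers the first quirk. A killed agent is a different problem, since no process is left to report a return code.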

sirepo/job_driver/__init__.py

sirepo/job_supervisor.py

sirepo/job_driver/__init__.py

robnagler · 2024-12-28T18:52:07Z

Added to #7406 (comment).

e-carlin

Two errors while running simulations:

openmc > aurora > wait for volume extraction > visualization > vagrant cluster > login > start > error: [No such file or directory]: open('/var/tmp/vagrant/sirepo/user/ZSLW4c4Y/openmc/ZSLW4c4Y-VqEsWQZE-openmcAnimation/in.json', 'r')
flash > blast2 > run setup and compile > visualization > vagrant cluster > login > start > error: /home/vagrant/.pyenv/versions/py3/bin/python: can't open file '/var/tmp/vagrant/sirepo/user/6NiqqZff/flash/6NiqqZff-ykM8ISjL-animation/parameters.py': [Errno 2] No such file or directory

e-carlin

I've reviewed everything. Just a few more comments.

The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

sirepo/job_supervisor.py

sirepo/pkcli/job_agent.py

robnagler · 2025-01-02T15:36:50Z

> I've reviewed everything. Just a few more comments.

Thank you. I know it was a lot and very complicated.

> The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

I appreciate the testing.

e-carlin

I ran into the same openmc error

#7404 (review)


robnagler · 2025-01-04T00:44:02Z

Fix #7404 had to write in.json. Refactored that code. openmc works now. I didn't test flash.

robnagler added 5 commits December 20, 2024 16:15

fix srdbg and console.log

6932459

remove comment

701e261

fmt

db0a36c

cores=2 runs mpiexec

cc6ee18

robnagler requested a review from e-carlin December 20, 2024 16:56

robnagler and others added 21 commits December 20, 2024 23:25

need compute model for sbatch login exception

58a6d45

DEV_SRC_RADIASOFT_DIR must be str so not eval'ed on server

2cedf27

/bin/test does not exist so just call test, which is a builtin; took out printing of env for testing

run_dir needs to exist for run_status

5f40c63

undef variable

791deeb

jobCmd has to be set before calling _SbatchRunStatus

05a3057

need to setup _SbatchRunStatus better

555c7f2

incorrect attrs

47d70de

various attribute and exception issues

fa37016

more attr issues

2831d41

job agent runs sacct

4ab3cb5

run_status_op has to free run_dir_slot

010aba3

missing nextRequestSeconds; SlotProxy shows enter and exit from wait

fix missing status

696037c

send() returns false on socket error and clears _websocket

fbe9c34

fixed more status issues; can kill supervisor or agent on running, queued

fmt

668bea8

add more logging

5684755

make job_cancel_test more robust

89e6ecc

make tests more robust to time sensitivity

4992a00

remove pkdp

ec9d4e9

fmt

bd49fa7

too much asynchrony so be flexible about states

0d5cb1b

fixing state

82949a9

robnagler and others added 10 commits December 23, 2024 17:00

more error handling

4ce6f0f

undo 5408e5a srw cancel is clearer now; more fixes and error handling

c69c033

debug

1fd112c

debug

06bf7f5

debug

5985602

fixed fastcgi_destroy maybe

60b2bb4

fix f-string

37cd3b4

fix destroy; remove all debugging

d8e1806

remove debug

a9dfc9b

missing arg for non-sbatch run

f0c2538

e-carlin requested changes Dec 27, 2024

sirepo/job_driver/__init__.py (resolved)

sirepo/job_supervisor.py (outdated, resolved)

sirepo/job_driver/__init__.py (outdated, resolved)

sirepo/job_driver/__init__.py (outdated, resolved)

robnagler mentioned this pull request Dec 28, 2024

when runSimulation encounters an error, it's not displayed #7406

Open

review

6e0f1ee

e-carlin requested changes Dec 30, 2024

sirepo/job_supervisor.py (resolved)

sirepo/pkcli/job_agent.py (outdated, resolved)

sirepo/pkcli/job_agent.py (outdated, resolved)

sirepo/pkcli/job_agent.py (resolved)

sirepo/pkcli/job_agent.py (outdated, resolved)

robnagler added 2 commits December 30, 2024 23:31

Fix #7414 write_message binary=True

3d0638e

review

d0649f2

robnagler requested a review from e-carlin January 2, 2025 15:36

e-carlin requested changes Jan 3, 2025

robnagler added 5 commits January 3, 2025 22:49

Refactor simulation_db.prepare_simulation as sim_data.sim_run_dir_prepare

Modularize access to run_dir_input file

39c607b

pkdp

4661f2b

pkdp

994b0fa

must call sim_run_input_to_run_dir

4d3d886

deviance case is special

9beb8b2

robnagler requested a review from e-carlin January 4, 2025 00:44
