balsam job rm --all fails to delete some jobs with no warning #364

Open
s-sajid-ali opened this issue Jun 13, 2023 · 9 comments

@s-sajid-ali

s-sajid-ali commented Jun 13, 2023

On Theta, deleting all jobs from a site via balsam job rm --all sometimes fails silently, with no indication that the jobs were not removed:

(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
31897475   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
31897476   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
31897588   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
31897589   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
31897626   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
31897627   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job rm --all
THIS WILL DELETE ALL JOBS! CAUTION!
Really delete 6 jobs? [y/N]: y   
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
31897475   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
31897476   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
31897588   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
31897589   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
31897626   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
31897627   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 
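
Since the command prints nothing when the deletion fails, one way to catch the problem is to compare the job IDs listed before and after the rm call. This is a minimal sketch, not part of Balsam. It assumes only the tabular balsam job ls output shown above (each job row starts with a numeric ID; the header row and shell prompts do not). The helper names parse_job_ids and surviving_jobs are hypothetical.

```python
from typing import Set


def parse_job_ids(listing: str) -> Set[int]:
    """Extract job IDs from the text printed by `balsam job ls`.

    Assumes the table layout shown in this issue: each job row starts
    with a numeric ID, while the header row and shell prompts do not.
    """
    ids = set()
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            ids.add(int(fields[0]))
    return ids


def surviving_jobs(before: str, after: str) -> Set[int]:
    """Return IDs listed before the rm that are still listed after it."""
    return parse_job_ids(before) & parse_job_ids(after)


if __name__ == "__main__":
    before = """\
ID         Site                App               Workdir      State       Tags
31897475   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}
31897476   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}
"""
    after = before  # the silent-failure case: the listing is unchanged
    stuck = surviving_jobs(before, after)
    if stuck:
        print(f"{len(stuck)} job(s) were not deleted: {sorted(stuck)}")
```

In practice the two strings would be captured from balsam job ls before and after running balsam job rm; a non-empty result means the deletion did not take effect.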

The development version of balsam was installed via:

pip install balsam@git+https://github.com/argonne-lcf/balsam
@cms21
Contributor

cms21 commented Jun 15, 2023

Was the site active when you tried this?

@s-sajid-ali
Author

It might have been. Does that typically prevent jobs from being deleted?

@cms21
Contributor

cms21 commented Jun 21, 2023

Hm, I did some tests, and it looks like the site being active doesn't affect this after all. Are you able to delete them individually, using their job ids? Or does that fail as well?

@s-sajid-ali
Author

> Are you able to delete them individually, using their job ids? Or does that fail as well?

I ended up deleting the site as a workaround, so I can't answer that right now. If I hit this issue again, I'll try deleting the jobs individually and post the results.

@s-sajid-ali
Author

s-sajid-ali commented Jun 29, 2023

Hit the same bug again today, so here are the results of the approaches mentioned above:

On an active site:

(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam site ls
ID      Name                         Path                                       Active
542     split_test_balsam            ...sajid/icarus_hepnos/split_test_balsam   Yes 
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> pwd
/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
32178869   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
32178870   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}

balsam job rm --all fails:

(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job rm --all 
THIS WILL DELETE ALL JOBS! CAUTION!
Really delete 2 jobs? [y/N]: y
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
32178869   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
32178870   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 

> Are you able to delete them individually, using their job ids?

(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job rm --id 32178869
Really delete 1 jobs? [y/N]: y
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
32178869   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
32178870   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job rm --id 32178870
Really delete 1 jobs? [y/N]: y
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
32178869   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
32178870   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 

No, these jobs are strangely stuck and cannot be deleted without deleting the whole site.

Let me know if sharing the definitions of the applications/jobs would help.

@s-sajid-ali
Author

s-sajid-ali commented Jun 29, 2023

Another observation: the job deletion failure seems to be associated with the site daemon hanging.

The same failure mode reproduces on a different login node (thetalogin4):

(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
32178945   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
32178946   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job rm --all
THIS WILL DELETE ALL JOBS! CAUTION!
Really delete 2 jobs? [y/N]: y
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir      State       Tags  
32178945   split_test_balsam   HEPnOS_server     server       STAGED_IN   {}    
32178946   split_test_balsam   HEPnOS_list_dbs   connection   STAGED_IN   {}    

Here Balsam reports the site as active, but syncing it fails, which suggests that the daemon is unresponsive (hanging?):

(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam site ls
ID      Name                         Path                                       Active
544     split_test_balsam            ...sajid/icarus_hepnos/split_test_balsam   Yes 
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam site sync
Updated site.
Restarting Site /lus/theta-fs0/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam
Sent SIGTERM to Balsam site daemon [pid 6331]
Waiting for site /lus/theta-fs0/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam to shutdown...
  [#################################---]   91%  00:00:01
Usage: balsam site sync [OPTIONS]
Try 'balsam site sync --help' for help.

Error: Site daemon did not shut down gracefully on its own; please kill it manually and delete /lus/theta-fs0/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam/balsam-service.pid
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 

I'm unsure if this is related to #340.

@cms21
Contributor

cms21 commented Jun 30, 2023

A few questions:

  • Do you have a file in your site directory called balsam-service.pid? If so, in that file you'll see the name of the node you started the site on and the process id of the daemon process. Are you able to find the daemon process, for example by typing ps <process id>?
  • In the site directory there is a subdirectory called log. If you go into log, there should be files with names like service_<datetime>.log. Every time you restart the site the new daemon process creates a new service log that it writes to. Are any of your service logs being written to? In the most recent service log, do you see any errors?
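
The two checks above can be scripted. This is a rough sketch, not a Balsam utility. It assumes that balsam-service.pid holds the host name and the process id as whitespace-separated tokens (the exact file layout is not shown in this thread) and that service logs match log/service_*.log as described above. The function names are hypothetical, and the liveness check is POSIX-only and only meaningful when run on the host named in the pid file.

```python
import os
import socket
from pathlib import Path
from typing import Optional, Tuple


def read_pid_file(site_dir: Path) -> Optional[Tuple[str, int]]:
    """Return (hostname, pid) from balsam-service.pid, or None if absent.

    Assumption: the file contains a host name and an integer pid as
    whitespace-separated tokens, in either order.
    """
    pid_file = site_dir / "balsam-service.pid"
    if not pid_file.is_file():
        return None
    tokens = pid_file.read_text().split()
    pids = [int(t) for t in tokens if t.isdigit()]
    hosts = [t for t in tokens if not t.isdigit()]
    if not pids or not hosts:
        raise ValueError(f"unexpected pid file contents: {tokens!r}")
    return hosts[0], pids[-1]


def process_alive(pid: int) -> bool:
    """Check whether a process with this pid exists on the local host (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def newest_service_log(site_dir: Path) -> Optional[Path]:
    """Return the most recently modified log/service_*.log, if any."""
    logs = list((site_dir / "log").glob("service_*.log"))
    return max(logs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    site = Path.cwd()  # run from inside the site directory
    info = read_pid_file(site)
    if info is None:
        print("no balsam-service.pid found")
    else:
        host, pid = info
        # The recorded name may differ in form (short vs. fully qualified).
        if host == socket.gethostname():
            print(f"pid {pid} alive: {process_alive(pid)}")
        else:
            print(f"daemon was started on {host}; run the check there")
    print("newest service log:", newest_service_log(site))
```

If the pid file names a process that no longer exists, the file is stale; if the newest service log has an old modification time while the process is still alive, that would be consistent with a hung daemon.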

@cms21
Contributor

cms21 commented Jun 30, 2023

  • Also, I see in some of your messages above that sometimes you are on thetalogin4 and sometimes you are on thetalogin6. Do you know which node your Balsam process is running on? Also, in the log directory, do you have multiple service logs being updated?

@cms21
Contributor

cms21 commented Jun 30, 2023

Hey, I realize now that your comments on issue #340 answer some of these questions, so I'll move my discussion there.
