Add support for using htcondor to launch jobs #96

eacharles · 2024-09-03T23:25:22Z

No description provided.

ctslater · 2024-09-05T23:43:08Z

Let me know when this is finished and tested and I'll review it then.

eacharles · 2024-09-06T00:08:28Z

The micro test just finished, so I'd say it's ready to review.

src/lsst/cmservice/common/htcondor.py

src/lsst/cmservice/handlers/script_handler.py

ctslater

Comments are mostly cleanup items, plus a few bigger questions about error handling that I think will make this easier to run.

ctslater · 2024-09-17T00:17:36Z

src/lsst/cmservice/common/htcondor.py

@@ -0,0 +1,120 @@
+"""Utility functions for working with slurm jobs"""


s/slurm/htcondor. Missing license preamble.

I don't think we have license preambles in any files. There is a LICENSE file giving copyright to Stanford in the top level. Lawyers?

src/lsst/cmservice/common/htcondor.py

ctslater · 2024-09-17T00:20:59Z

src/lsst/cmservice/common/htcondor.py

+    htcondor_log: str
+        Path for the wrapper log
+
+    script_url: str


Are these really URLs? (starts with file://?)

I think the point is that they don't have to be paths to files. They are just something that the tool uses to find or identify the script. E.g., in panda they can just be the panda req id. I.e., it isn't really a URL, but it doesn't have to be a path either. Is there something you'd rather call this? If so, we can think about changing it, but it really shows up in a lot of places.

I usually call something like this a slug, but I don't know if that's widely understandable. Ok with keeping consistency though, agree that a piecewise rename wouldn't be better.

src/lsst/cmservice/common/htcondor.py

ctslater · 2024-09-17T04:33:12Z

src/lsst/cmservice/common/htcondor.py

+        Location of job log file to write
+    """
+    options = dict(
+        should_transfer_files="Yes",


Is htcondor transferring files? I would have thought the shared filesystem took care of that.

Not sure, but it seems safer not to rely on the shared file system, and this is what the example I got used. Should we follow up on this?
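
One way to keep both options open is to make the transfer behaviour a switch instead of a hard-coded value. The following is only a sketch: `build_submit_description` and its `shared_fs` flag are hypothetical names that do not appear in this PR, and the example assumes that a shared filesystem really is visible from the execute nodes. `should_transfer_files` and `when_to_transfer_output` are standard HTCondor submit commands.

```python
def build_submit_description(executable: str, log: str, *, shared_fs: bool) -> str:
    """Render a minimal HTCondor submit description as text.

    Hypothetical helper for illustration; not the function in htcondor.py.
    """
    options = {
        "executable": executable,
        "log": log,
        # Assumption: with a shared filesystem, no file transfer is needed.
        "should_transfer_files": "No" if shared_fs else "Yes",
    }
    if not shared_fs:
        # Only meaningful when file transfer is enabled.
        options["when_to_transfer_output"] = "ON_EXIT"
    lines = [f"{key} = {value}" for key, value in options.items()]
    lines.append("queue")
    return "\n".join(lines) + "\n"
```

Calling it with `shared_fs=True` writes `should_transfer_files = No` and omits the output-transfer line, so the current behaviour stays available with `shared_fs=False`.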

src/lsst/cmservice/common/htcondor.py

ctslater · 2024-09-17T04:53:17Z

src/lsst/cmservice/handlers/jobs.py

+        if htcondor_status == StatusEnum.accepted:
+            await script.update_values(session, status=StatusEnum.accepted)
+            wms_job_id = os.path.join(os.path.dirname(script.log_url), "submit")
+            await parent.update_values(session, wms_job_id=wms_job_id)


If a job has multiple BPS scripts, does this overwrite the wms_job_id upon running subsequent scripts?

I'm not understanding: how would a job have multiple bps scripts? Do you mean if you retry launching a job? Then it depends on how you are relaunching it. If you do a rescue then you will have a new job. If you reset the script then you will simply overwrite the first attempt. We could change that to copy and move away anything on disk related to the first attempt, or to include the attempt number in the submission directory, but we aren't doing that now. In any case, the only thing the DB needs is to know where the submit dir for the current job is.
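
For illustration, the attempt-number option could look roughly like the sketch below. It assumes the submit directory sits next to the script log, as in the hunk above; `submit_dir_for_attempt` and the `submit_NNN` naming are hypothetical and not something this PR implements.

```python
import os


def submit_dir_for_attempt(log_url: str, attempt: int) -> str:
    """Return a per-attempt submit directory next to the script log.

    Hypothetical helper: the PR itself always uses a single "submit" directory.
    """
    base = os.path.dirname(log_url)
    # Zero-pad so that directories sort in attempt order.
    return os.path.join(base, f"submit_{attempt:03d}")
```

With this layout a reset would write to a fresh directory and leave the files from the earlier attempt in place.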

I don't think this is a huge practical issue, but it does mean that some scripts have special properties w.r.t. the Job that runs them. I think it is conceptually more straightforward for the script to hold the wms_job_id, and if you need to access that from the Job, you can have a function that inspects the Job's scripts.

Scripts do have special properties w.r.t. the parent node that runs them. I.e., a script run as part of a job is different from a script run as part of a campaign. Thus the special properties live with the parent node. The alternative would be for scripts to carry around properties that are only relevant for a small subset of them, e.g., wms_job_id is only relevant for BpsSubmit, BpsReport and PipetaskReport scripts, not for any others.
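
To make the two positions concrete, here is a minimal sketch of the "Job inspects its scripts" alternative discussed in this thread. The `Script` and `Job` classes are simplified stand-ins rather than the cmservice database models, and `find_wms_job_id` is a hypothetical method name.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Script:
    """Stand-in for a script row; only the fields needed here."""

    name: str
    wms_job_id: Optional[str] = None


@dataclass
class Job:
    """Stand-in for a job row that owns a list of scripts."""

    scripts: List[Script] = field(default_factory=list)

    def find_wms_job_id(self) -> Optional[str]:
        """Return the wms_job_id of the most recent script that set one."""
        for script in reversed(self.scripts):
            if script.wms_job_id is not None:
                return script.wms_job_id
        return None
```

In this variant the scripts never write to their parent; the cost is that every script carries a field that only a few script types ever fill in, which is the objection raised above.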

src/lsst/cmservice/handlers/jobs.py

src/lsst/cmservice/handlers/script_handler.py

ctslater

Thanks for the updates.

ctslater · 2024-09-23T22:32:43Z

src/lsst/cmservice/handlers/jobs.py

+        if htcondor_status == StatusEnum.accepted:
+            await script.update_values(session, status=StatusEnum.accepted)
+            wms_job_id = os.path.join(os.path.dirname(script.log_url), "submit")
+            await parent.update_values(session, wms_job_id=wms_job_id)


I don't think this is a huge practical issue, but it does mean that some scripts have special properties w.r.t. the Job that runs them. I think it is conceptually more straightforward for the script to hold the wms_job_id, and if you need to access that from the Job, you can have a function that inspects the Job's scripts.

ctslater · 2024-09-23T22:43:11Z

src/lsst/cmservice/handlers/script_handler.py

@@ -384,6 +359,40 @@ async def _check_slurm_job(  # pylint: disable=unused-argument
            await script.update_values(session, status=status)
        return status

+    async def _check_htcondor_job(  # pylint: disable=unused-argument


On a small scale that's a tradeoff between future utility inside the function vs additional infrastructure required to support those parameters that aren't used. In the big picture, I think parents should be introspecting the status of their constituent Scripts/jobs/etc rather than Scripts modifying their parent.

eacharles · 2024-09-24T21:31:12Z

Yes. So, my take is that it is better to have the stuff living in the nodes that actually use it, e.g., wms_job_id living in Job and letting the scripts pass that info along to the parents, rather than having Scripts generically carrying around information that is only relevant for some scripts. Similarly, this avoids Jobs having to worry about the different types of scripts and knowing that particular things live in particular scripts.

eacharles requested a review from ctslater September 3, 2024 23:56

eacharles force-pushed the tickets/DM-46100 branch from 168e641 to 7beee52 Compare September 5, 2024 21:42

eacharles force-pushed the tickets/DM-46100 branch from 6f44a93 to 2f7fa29 Compare September 6, 2024 18:53

villarrealas reviewed Sep 6, 2024

src/lsst/cmservice/common/htcondor.py (outdated)

eacharles force-pushed the tickets/DM-46100 branch from 2f7fa29 to 4a2a32e Compare September 6, 2024 20:20

villarrealas reviewed Sep 6, 2024

src/lsst/cmservice/handlers/script_handler.py (outdated)

eacharles force-pushed the tickets/DM-46100 branch from 9354c3e to 90f0d67 Compare September 12, 2024 17:51

ctslater requested changes Sep 17, 2024

eacharles force-pushed the tickets/DM-46100 branch 2 times, most recently from d006294 to 4896781 Compare September 20, 2024 20:00

eacharles requested a review from ctslater September 20, 2024 22:40

eacharles and others added 17 commits September 23, 2024 13:57

Add support for using htcondor to launch jobs

e9e15a9

Added lsst/cmservice/common/htcondor.py

ad72955

mypy htcondor

1c19076

fix condor_q check logic

d012ff6

Fixes to htcondor submission

b74ffa8

Fix up interaction between envvars expansion and abspath in bps yaml config writing

b10e350

protect against empty log files

008643c

Make BPS related job scripts work with htcondor

d50e0d8

Fix path statement

dbd9ad1

Update htcondor.py to fix error message

6febd55

Update script_handler.py

67abad1

Update htcondor.py

6e2d5c4

fix up typing

3186b31

Update htcondor.py

b03e4f2

Update jobs.py to remove unneeded call to script.update_values

cd11ec0

Update enums.py to fix typo

0b9d24d

Update errors.py to fix typo

46503ea

eacharles and others added 6 commits September 23, 2024 14:04

Update htcondor.py

4ee8d3c

Update htcondor.py to fix typo.

3b4ddf7

Added CMSlurmCheckError and CMHTCondorCheckError

3ccd662

Added Error handling and fixed up awaits for slurm and htconder submit and check functions

12c95bd

change htcondor_script to htcondor_script_path and expand doc description of same

37ac245

Fix typo in htcondor.py

7dd3929

fritzm force-pushed the tickets/DM-46100 branch from fb83f23 to 7dd3929 Compare September 23, 2024 21:04

ctslater approved these changes Sep 23, 2024

eacharles merged commit 9107809 into main Sep 24, 2024
9 checks passed

eacharles deleted the tickets/DM-46100 branch September 24, 2024 21:31
