Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for using htcondor to launch jobs #96

Merged
merged 23 commits into from
Sep 24, 2024
Merged

Conversation

eacharles
Copy link
Collaborator

No description provided.

@ctslater
Copy link
Member

ctslater commented Sep 5, 2024

Let me know when this is finished and tested and I'll review it then.

@eacharles
Copy link
Collaborator Author

The micro test just finished, so I'd say it's ready to review.

Copy link
Member

@ctslater ctslater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are mostly cleanup items, plus a few bigger questions about error handling that I think will make this easier to run.

@@ -0,0 +1,120 @@
"""Utility functions for working with slurm jobs"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/slurm/htcondor. Missing license preamble.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have license preambles in any files. There is a LICENSE file giving copyright to Stanford in the top level. Lawyers?

src/lsst/cmservice/common/htcondor.py Outdated Show resolved Hide resolved
src/lsst/cmservice/common/htcondor.py Outdated Show resolved Hide resolved
htcondor_log: str
Path for the wrapper log

script_url: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these really URLs? (starts with file://?)

Copy link
Collaborator Author

@eacharles eacharles Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the point is that they don't have to be path to files. They are just something that the tool uses to find/ identify the script. E.g., in panda they can just be the panda req id. I.e., it isn't a really url, but it isn't doesn't have to be a path either. Is there something you'd rather call this. If so we can thing about changing it, but it really shows up in a lot of places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually call something like this a slug, but I don't know if that's widely understandable. Ok with keeping consistency though, agree that a piecewise rename wouldn't be better.

src/lsst/cmservice/common/htcondor.py Outdated Show resolved Hide resolved
Location of job log file to write
"""
options = dict(
should_transfer_files="Yes",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is htcondor transferring files? I would have thought the shared filesystem took care of that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but it seems safer to not rely on the shared file system, and this is what the example I got used. should we follow up on this.

src/lsst/cmservice/common/htcondor.py Outdated Show resolved Hide resolved
if htcondor_status == StatusEnum.accepted:
await script.update_values(session, status=StatusEnum.accepted)
wms_job_id = os.path.join(os.path.dirname(script.log_url), "submit")
await parent.update_values(session, wms_job_id=wms_job_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a job has multiple BPS scripts, does this overwrite the wms_job_id upon running subsequent scripts?

Copy link
Collaborator Author

@eacharles eacharles Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding, how would a job have multiple bps scripts? Do you mean if you retry to launch a job. Then it depends on how you are relaunching it. If you do a rescue then you will have a new job. If you reset the script then you will simply overwrite the first attempt. We could change that to copy and move away anything on disk related to the first attempt, or to include the attempt number in the submission directory, but we aren't doing that now. In any case, the only thing the DB needs is to know where the submit dir for the current job is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a huge practical issue, but it does mean that some scripts have special properties w.r.t to the Job that runs them. I think this is conceptually more straightforward for the script to hold the wms_job_id and if you need to access that from the Job, you can have a function that inspects the Job's scripts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scripts do have special properties w.r.t. the parent node that runs them. I.e., a script run as part of a job is different than a script run as part of a campaign. This the special properties live with the parent node. The alternative would be for scripts to carry around properties that are only relevant for a small subset of them, e.g., wms_job_id is only relevent for BpsSubmit, BpsReport and PipetaskReport scripts, not for any others.

src/lsst/cmservice/handlers/jobs.py Outdated Show resolved Hide resolved
src/lsst/cmservice/handlers/script_handler.py Show resolved Hide resolved
@eacharles eacharles force-pushed the tickets/DM-46100 branch 2 times, most recently from d006294 to 4896781 Compare September 20, 2024 20:00
Copy link
Member

@ctslater ctslater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

if htcondor_status == StatusEnum.accepted:
await script.update_values(session, status=StatusEnum.accepted)
wms_job_id = os.path.join(os.path.dirname(script.log_url), "submit")
await parent.update_values(session, wms_job_id=wms_job_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a huge practical issue, but it does mean that some scripts have special properties w.r.t to the Job that runs them. I think this is conceptually more straightforward for the script to hold the wms_job_id and if you need to access that from the Job, you can have a function that inspects the Job's scripts.

@@ -384,6 +359,40 @@ async def _check_slurm_job( # pylint: disable=unused-argument
await script.update_values(session, status=status)
return status

async def _check_htcondor_job( # pylint: disable=unused-argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a small scale that's a tradeoff between future utility inside the function vs additional infrastructure required to support those parameters that aren't used. In the big picture, I think parents should be introspecting the status of their constituent Scripts/jobs/etc rather than Scripts modifying their parent.

@eacharles
Copy link
Collaborator Author

Yes. So, my take is that it is better to have the stuff living in the nodes that actually use it, e.g., wms_job_id living in Job and letting the scripts pass that info along to the parents, rather than having Scripts generically carrying around information that is only relevant from some scripts. Similarly, this avoids Jobs having to worry about the different types of scripts and knowing that particular things live in particular scripts.

@eacharles eacharles merged commit 9107809 into main Sep 24, 2024
9 checks passed
@eacharles eacharles deleted the tickets/DM-46100 branch September 24, 2024 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants