Add support for using htcondor to launch jobs #96
Conversation
Force-pushed from 168e641 to 7beee52
Let me know when this is finished and tested and I'll review it then.
The micro test just finished, so I'd say it's ready to review.
Force-pushed from 6f44a93 to 2f7fa29
Force-pushed from 2f7fa29 to 4a2a32e
Force-pushed from 9354c3e to 90f0d67
Comments are mostly cleanup items, plus a few bigger questions about error handling that I think will make this easier to run.
@@ -0,0 +1,120 @@
"""Utility functions for working with slurm jobs"""
s/slurm/htcondor. Missing license preamble.
I don't think we have license preambles in any files. There is a LICENSE file giving copyright to Stanford in the top level. Lawyers?
htcondor_log: str
    Path for the wrapper log

script_url: str
Are these really URLs? (starts with file://?)
I think the point is that they don't have to be paths to files. They are just something that the tool uses to find/identify the script. E.g., in PanDA they can just be the PanDA request id. I.e., it isn't really a URL, but it doesn't have to be a path either. Is there something you'd rather call this? If so we can think about changing it, but it really shows up in a lot of places.
I usually call something like this a slug, but I don't know if that's widely understandable. OK with keeping consistency though; agreed that a piecewise rename wouldn't be better.
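For illustration, a hypothetical sketch of how such a locator might be dispatched; resolve_script and the panda: prefix are made-up names for this sketch, not code from this PR:

# Hypothetical sketch: script_url is an opaque locator rather than a true
# URL or filesystem path; how it is interpreted depends on the handler.
def resolve_script(script_url: str) -> str:
    if script_url.startswith("panda:"):
        # e.g. a PanDA request id such as "panda:12345" (assumed convention)
        return script_url.removeprefix("panda:")
    if script_url.startswith("file://"):
        return script_url.removeprefix("file://")
    # otherwise treat it as a plain path on disk
    return script_url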
    Location of job log file to write
"""
options = dict(
    should_transfer_files="Yes",
Is htcondor transferring files? I would have thought the shared filesystem took care of that.
Not sure, but it seems safer not to rely on the shared filesystem, and this is what the example I got used. Should we follow up on this?
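For context, a minimal sketch of what submitting with should_transfer_files looks like via the htcondor Python bindings; the executable and log names are placeholders, not this PR's actual submission path:

import htcondor  # HTCondor Python bindings

sub = htcondor.Submit(
    {
        "executable": "run_script.sh",  # placeholder
        "log": "wrapper.log",
        "output": "job.out",
        "error": "job.err",
        # move inputs/outputs through HTCondor's file transfer instead of
        # assuming a shared filesystem is mounted on the worker node
        "should_transfer_files": "YES",
        "when_to_transfer_output": "ON_EXIT",
    }
)
schedd = htcondor.Schedd()
result = schedd.submit(sub)  # returns a SubmitResult with the cluster id
print(result.cluster())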
if htcondor_status == StatusEnum.accepted:
    await script.update_values(session, status=StatusEnum.accepted)
    wms_job_id = os.path.join(os.path.dirname(script.log_url), "submit")
    await parent.update_values(session, wms_job_id=wms_job_id)
If a job has multiple BPS scripts, does this overwrite the wms_job_id upon running subsequent scripts?
I'm not understanding: how would a job have multiple BPS scripts? Do you mean if you retry launching a job? Then it depends on how you are relaunching it. If you do a rescue then you will have a new job. If you reset the script then you will simply overwrite the first attempt. We could change that to copy and move away anything on disk related to the first attempt, or to include the attempt number in the submission directory, but we aren't doing that now. In any case, the only thing the DB needs to know is where the submit dir for the current job is.
I don't think this is a huge practical issue, but it does mean that some scripts have special properties w.r.t. the Job that runs them. I think it is conceptually more straightforward for the script to hold the wms_job_id, and if you need to access that from the Job, you can have a function that inspects the Job's scripts.
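A hypothetical sketch of that alternative; the class and attribute names are illustrative, not this PR's actual data model:

# Hypothetical sketch: each Script row keeps its own wms_job_id, and the
# Job derives the current one by inspecting its scripts.
class Job:
    def __init__(self, scripts: list) -> None:
        self.scripts = scripts

    def wms_job_id(self) -> str | None:
        # return the id from the most recent script that recorded one
        for script in reversed(self.scripts):
            candidate = getattr(script, "wms_job_id", None)
            if candidate is not None:
                return candidate
        return None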
Scripts do have special properties w.r.t. the parent node that runs them. I.e., a script run as part of a job is different from a script run as part of a campaign. Thus the special properties live with the parent node. The alternative would be for scripts to carry around properties that are only relevant for a small subset of them; e.g., wms_job_id is only relevant for BpsSubmit, BpsReport and PipetaskReport scripts, not for any others.
Force-pushed from d006294 to 4896781
…t and check functions
Force-pushed from fb83f23 to 7dd3929
Thanks for the updates.
@@ -384,6 +359,40 @@ async def _check_slurm_job( # pylint: disable=unused-argument
    await script.update_values(session, status=status)
    return status


async def _check_htcondor_job( # pylint: disable=unused-argument
On a small scale that's a tradeoff between future utility inside the function vs. the additional infrastructure required to support parameters that aren't used. In the big picture, I think parents should be introspecting the status of their constituent Scripts/Jobs/etc. rather than Scripts modifying their parent.
Yes. So, my take is that it is better to have the stuff living in the nodes that actually use it, e.g., wms_job_id living in Job and letting the scripts pass that info along to the parents, rather than having Scripts generically carrying around information that is only relevant for some scripts. Similarly, this avoids Jobs having to worry about the different types of scripts and knowing that particular things live in particular scripts.
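As an aside, a rough sketch of what a status check along these lines could look like, polling condor_q for the job state; the helper name and the string mapping are assumptions, since the PR's actual implementation isn't shown here:

import asyncio

async def check_htcondor_job(wms_job_id: str) -> str:
    # query the schedd for the job's JobStatus attribute
    proc = await asyncio.create_subprocess_exec(
        "condor_q", wms_job_id, "-af", "JobStatus",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    # JobStatus codes per the HTCondor docs: 1=Idle, 2=Running, 4=Completed, 5=Held
    codes = {"1": "idle", "2": "running", "4": "completed", "5": "held"}
    return codes.get(stdout.decode().strip(), "unknown")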