sasascore module #862

Conversation
I would suggest renaming this module to `sasascore`
Code base: OK
still have to test it
Co-authored-by: Victor Reys <[email protected]>
```diff
@@ -214,10 +223,12 @@ def add_calc_accessibility_arguments(calc_accessibility_subcommand):
        'PNS': 78.11,
        }
    }
DEFAULT_PROBE_RADIUS = 1.4


def get_accessibility(
```
best to move this inside the `sasascore` module and call it here instead of the other way around, since these CLIs are marked for deletion
I can move it to the restraints library, imo a better place than the module file
```python
# Combine files
with open(output_fname, 'w') as out_file:
    for core in range(ncores):
        tmp_file = Path(path, f"{key}_" + str(core) + ".tsv")
        with open(tmp_file) as infile:
            out_file.write(infile.read())
        tmp_file.unlink()
```
This pattern here looks very similar to what you had before in the caprieval module and that we changed to produce fewer files. See the implementation I did there: https://github.com/haddocking/haddock3/blob/main/src/haddock/modules/analysis/caprieval/capri.py#L905
You can extract the values directly from the scheduler object; no need to have these intermediate files.
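The suggestion above could look roughly like this: a minimal sketch in which results stay on the task objects and are collected after the run, with no per-core temp files. `AccScoreTask` and its attributes are hypothetical stand-ins, not the real haddock3 classes.

```python
# Hypothetical sketch: each task stores its own result; after the
# scheduler runs the tasks, values are collected from the objects
# instead of being written to and merged from intermediate .tsv files.
class AccScoreTask:
    def __init__(self, model: str) -> None:
        self.model = model
        self.score: int | None = None

    def run(self) -> None:
        # placeholder computation; the real task would compute a
        # SASA-based score for the model
        self.score = len(self.model)


tasks = [AccScoreTask(m) for m in ("model_1", "model_22")]
for task in tasks:  # a scheduler would run these, possibly in parallel
    task.run()

# read the values straight off the objects; nothing touches the disk
scores = {t.model: t.score for t in tasks}
```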
but that holds only for the scheduler, doesn't it? And here there are far fewer intermediate files... OK though, will add this other call
Something is producing several files, and they are being merged here, right? The thing that is producing the several files is the one that should go through the scheduler, and then here, instead of merging the files, you just retrieve the values.
I am checking this over and over and I think this induces code duplication everywhere. Now caprieval has both the `rearrange_output` function (when `less_io = false`) and `extract_data_from_capri_class` (when `less_io = true`), both doing the same thing in a completely different manner.
Everything would be much easier if fewer output files were generated by each module.
It does, yes, but it's already part of our technological debt because of how people decided this workflow execution would be done. Unfortunately there's nothing we can do at this point except work around it 🤷🏽
But this pattern is in many places already, so just add a comment in the code and add it to this list here: #970
there is something that can be done, which is simply having far fewer files produced by each analysis module. The code duplication this `less_io` pattern introduces is not worth the advantage imo. Probably something to be discussed in a code meeting. I won't add anything to that PR for the time being.
I think I understand what you mean. Here it is the same case as #928 (comment): the execution mode is being overwritten on the fly, so there's no need to even consider this `less_io` parameter. The easier solution is to just set it to `local`, let the scheduler split the tasks, and refactor this implementation to extract directly from the class.
```python
# initialize jobs
sasascore_jobs: list[AccScoreJob] = []

for core in range(ncores):
```
I'm a bit confused here, why are you doing this per core and why not per model?
we don't want to launch hundreds/thousands of extremely short jobs, but rather concatenate them so that each core is dealing with a decent number of jobs
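The per-core batching described above could be sketched as follows. `chunk_models` is a hypothetical helper for illustration, not haddock3 code: it groups models so that each core handles one batch of short tasks instead of one job per model.

```python
# Minimal sketch of per-core batching: round-robin assignment of models
# to ncores batches, so each core runs a decent number of short tasks.
def chunk_models(models: list, ncores: int) -> list[list]:
    batches: list[list] = [[] for _ in range(ncores)]
    for i, model in enumerate(models):
        batches[i % ncores].append(model)  # round-robin assignment
    return [b for b in batches if b]  # drop empty batches


batches = chunk_models(list(range(10)), 3)
```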
This is already handled by the scheduler:

```python
# haddock3/src/haddock/libs/libparallel.py, line 148 (at commit 01eb9ae)
job_list = split_tasks(sorted_task_list, self.num_processes)
```

so it's not the responsibility of a scoring module to sort tasks; that's what the library is for heh. So here you can just loop over the models.
@rvhonorato I have been thinking about this. I don't think this way of handling the tasks is a good idea (related also to the caprieval code linked above). If the running mode is not local and `less_io` is not active (the if clause in the caprieval output), this still amounts to creating a huge number of files (for example 10k in the higher sampling scenarios), which can be avoided by simply parallelising internally over the number of cores (much lower in general). Additionally, for super-short tasks like this one, this probably makes things faster. What do you think?
I just found out that pattern is found in many other places in the code, see here: #970
So please add this one to the list there and I can handle them in another PR
We don't need to think about `less_io` unless it's a CNS-based module. What you are doing here is duplicating the work of the scheduler: even if you split the tasks here, when they arrive at the scheduler they are re-split, so there's no need to do it twice and it's better to let the lower-level function do it.
```python
assert a_viols == {"A": set(), "B": set()}


def test_prettiy_df():
```
typo here?
You are about to submit a new Pull Request. Before continuing make sure you read the contributing guidelines and that you comply with the following criteria:

- `tox` tests pass. Run the `tox` command inside the repository folder
- `-test.cfg` examples execute without errors. Inside `examples/` run `python run_tests.py -b`
Closes #861 by creating a `sasascore` module, which allows scoring PDB files against existing accessibility information. As an example, if some glycosylation sites on chain A (say residues 40 and 50) are known to be preserved upon complex formation, a penalty can be added if they are buried in the resulting model. At the same time, if some residues are known to be buried in the complex, we can impose a penalty if they are accessible.
An example application of the module:
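As a rough illustration only, a workflow step might look like the following sketch; the parameter names below are assumptions for the sake of the example and may differ from the actual module defaults.

```toml
# hypothetical haddock3 workflow step; parameter names are illustrative
[sasascore]
# chain A residues expected to stay accessible in the complex
resdic_accessible_A = [40, 50]
# chain B residues expected to be buried in the complex
resdic_buried_B = [15]
```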
This will create a `sasascore.tsv` file, analogous to the other scoring tsv files.
Here the score is the number of times the input information has not been satisfied (the lower the better). A file named `violations.tsv` is also produced, with a detailed picture of the violations:
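The counting idea can be sketched as follows; this is an illustrative toy, with made-up residue sets, and the variable names are not the module's actual internals.

```python
# Minimal sketch of the scoring logic described above: the score counts
# how many accessibility expectations a model violates.
accessible_expected = {"A": {40, 50}}  # residues expected accessible
buried_in_model = {"A": {50}, "B": {15}}  # residues buried in the model

# a violation is a residue expected accessible that the model buries
violations = {
    chain: accessible_expected[chain] & buried_in_model.get(chain, set())
    for chain in accessible_expected
}
score = sum(len(resids) for resids in violations.values())  # lower is better
```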