sasascore module #862

Conversation
I would suggest renaming this module to `sasascore`
Code base: OK
still have to test it
Co-authored-by: Victor Reys <[email protected]>
```diff
@@ -214,10 +223,12 @@ def add_calc_accessibility_arguments(calc_accessibility_subcommand):
        'PNS': 78.11,
        }
    }
DEFAULT_PROBE_RADIUS = 1.4


def get_accessibility(
```
best to move this inside the `sasascore` module and call it here instead of the other way around, since these CLIs are marked for deletion
I can move it to the restraints library, imo a better place than the module file
```python
# Combine files
with open(output_fname, 'w') as out_file:
    for core in range(ncores):
        tmp_file = Path(path, f"{key}_" + str(core) + ".tsv")
        with open(tmp_file) as infile:
            out_file.write(infile.read())
        tmp_file.unlink()
```
This pattern here looks very similar to what you had before in the caprieval module and that we changed to produce fewer files. See the implementation I did there: https://github.com/haddocking/haddock3/blob/main/src/haddock/modules/analysis/caprieval/capri.py#L905
You can extract the values directly from the scheduler object; no need to have these intermediate files.
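The suggestion above could look roughly like this: a minimal sketch in which results stay on the task objects and are collected after the run, with no per-core temp files. `AccScoreTask` and its attributes are hypothetical stand-ins, not the real haddock3 classes.

```python
# Hypothetical sketch: each task stores its own result; after the
# scheduler runs the tasks, values are collected from the objects
# instead of being written to and merged from intermediate .tsv files.
class AccScoreTask:
    def __init__(self, model: str) -> None:
        self.model = model
        self.score: int | None = None

    def run(self) -> None:
        # placeholder computation; the real task would compute a
        # SASA-based score for the model
        self.score = len(self.model)


tasks = [AccScoreTask(m) for m in ("model_1", "model_22")]
for task in tasks:  # a scheduler would run these, possibly in parallel
    task.run()

# read the values straight off the objects; nothing touches the disk
scores = {t.model: t.score for t in tasks}
```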
but that holds only for the scheduler, doesn't it? And here there are far fewer intermediate files... OK though, will add this other call
Something is producing several files, and they are being merged here, right? The thing that is producing the several files is the one that should go through the scheduler, and then here, instead of merging the files, you just retrieve the values.
I am checking this over and over and I think this induces code duplication everywhere. Now caprieval has both the `rearrange_output` function (when `less_io = false`) and `extract_data_from_capri_class` (when `less_io = true`), both doing the same thing in a completely different manner.
Everything would be much easier if fewer output files were generated by each module.
It does, yes, but it's already part of our technological debt because of how people decided this workflow execution would be done. Unfortunately there's nothing we can do at this point except work around it 🤷🏽
But this pattern is in many places already, so just add a comment in the code and add it to this list here: #970
there is something that can be done, which is simply having far fewer files produced by each analysis module. The code duplication this `less_io` pattern introduces is not worth the advantage imo. Probably something to be discussed in a code meeting. I won't add anything to that PR for the time being.
I think I understand what you mean. Here it is the same case as #928 (comment): the execution mode is being overwritten on the fly, so there's no need to even consider this `less_io` parameter. The easier solution is to just set it to `local`, let the scheduler split the tasks, and refactor this implementation to extract directly from the class.
```python
# initialize jobs
sasascore_jobs: list[AccScoreJob] = []

for core in range(ncores):
```
I'm a bit confused here, why are you doing this per core and why not per model?
we don't want to launch hundreds/thousands of extremely short jobs, but rather concatenate them so that each core is dealing with a decent number of jobs
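The per-core batching described above could be sketched as follows. `chunk_models` is a hypothetical helper for illustration, not haddock3 code: it groups models so that each core handles one batch of short tasks instead of one job per model.

```python
# Minimal sketch of per-core batching: round-robin assignment of models
# to ncores batches, so each core runs a decent number of short tasks.
def chunk_models(models: list, ncores: int) -> list[list]:
    batches: list[list] = [[] for _ in range(ncores)]
    for i, model in enumerate(models):
        batches[i % ncores].append(model)  # round-robin assignment
    return [b for b in batches if b]  # drop empty batches


batches = chunk_models(list(range(10)), 3)
```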
This is already handled by the scheduler:

```python
# haddock3/src/haddock/libs/libparallel.py, line 148 (at commit 01eb9ae)
job_list = split_tasks(sorted_task_list, self.num_processes)
```

so it's not the responsibility of a scoring module to sort tasks; that's what the library is for heh. So here you can just loop over the models.
@rvhonorato I have been thinking about this. I don't think this way of handling the tasks is a good idea (related also to the caprieval code linked above). If the running mode is not local and `less_io` is not active (the if clause in the caprieval output), this still amounts to creating a huge number of files (for example 10k in the higher sampling scenarios), which can be avoided by simply parallelising internally over the number of cores (much lower in general). Additionally, for super-short tasks like this one, this probably makes things faster. What do you think?
I just found out that pattern is found in many other places in the code, see here: #970
So please add this one to the list there and I can handle them in another PR
We don't need to think about `less_io` unless it's a CNS-based module. What you are doing here is duplicating the work of the scheduler: even if you split the tasks here, when they arrive at the scheduler they are re-split, so there's no need to do it twice and it's better to let the lower-level function do it.
```python
assert a_viols == {"A": set(), "B": set()}


def test_prettiy_df():
```
typo here?
You are about to submit a new Pull Request. Before continuing make sure you read the contributing guidelines and that you comply with the following criteria:

- `tox` tests pass. Run the `tox` command inside the repository folder
- `-test.cfg` examples execute without errors. Inside `examples/` run `python run_tests.py -b`
Closes #861 by creating a `sasascore` module, which allows scoring PDB files against existing accessibility information. As an example, if some glycosylation sites on chain A (say residues 40 and 50) are known to be preserved upon complex formation, a penalty can be added if they are buried in the resulting model. At the same time, if some residues are known to be buried in the complex, we can impose a penalty if they are accessible.
An example application of the module:
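As a rough illustration only, a workflow step might look like the following sketch; the parameter names below are assumptions for the sake of the example and may differ from the actual module defaults.

```toml
# hypothetical haddock3 workflow step; parameter names are illustrative
[sasascore]
# chain A residues expected to stay accessible in the complex
resdic_accessible_A = [40, 50]
# chain B residues expected to be buried in the complex
resdic_buried_B = [15]
```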
This will create a `sasascore.tsv` file, analogous to the other scoring tsv files.
Here the score is the number of times the input information has not been satisfied (the lower the better). A file named `violations.tsv` is also produced, with a detailed picture of the violations:
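The counting idea can be sketched as follows; this is an illustrative toy, with made-up residue sets, and the variable names are not the module's actual internals.

```python
# Minimal sketch of the scoring logic described above: the score counts
# how many accessibility expectations a model violates.
accessible_expected = {"A": {40, 50}}  # residues expected accessible
buried_in_model = {"A": {50}, "B": {15}}  # residues buried in the model

# a violation is a residue expected accessible that the model buries
violations = {
    chain: accessible_expected[chain] & buried_in_model.get(chain, set())
    for chain in accessible_expected
}
score = sum(len(resids) for resids in violations.values())  # lower is better
```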