selection syntax #52

Merged
merged 31 commits into from
Jun 3, 2024
Contributor

@lmiq lmiq commented May 27, 2024

@jgreener64 @timholy

This PR does not necessarily need to be merged (although it introduces something potentially useful and is not breaking at all).

How this syntax works may be useful for the discussions of #48.

I adapted the selection syntax as implemented in PDBTools for BioStructures.

With the PR, we can do (for example):

julia> pdb = read(BioStructures.TESTPDB, PDB)
ProteinStructure structure.pdb with 1 models, 5 chains (A,B,C,D,E), 19534 residues, 60570 atoms

julia> filter(sel"element H", pdb)
39817-element Vector{AbstractAtom}:
 Atom HA with serial 6, coordinates [-8.185, -15.947, -6.894]
 Atom HB1 with serial 8, coordinates [-10.428, -14.602, -7.6]
 ⋮
 Atom H1 with serial 62025, coordinates [13.218, -3.647, -34.453]
 Atom H2 with serial 62026, coordinates [12.618, -4.977, -34.303]

julia> filter(sel"chain A", pdb)
2541-element Vector{AbstractAtom}:
 Atom N with serial 1, coordinates [-9.229, -14.861, -5.481]
 Atom HT1 with serial 2, coordinates [-10.048, -15.427, -5.569]
 ⋮
 Atom H32 with serial 4011, coordinates [13.564, -16.517, 12.419]

julia> filter(sel"backbone and resname LYS", pdb)
8-element Vector{AbstractAtom}:
 Atom N with serial 307, coordinates [-8.851, 4.954, 6.211]
 Atom CA with serial 309, coordinates [-9.459, 5.331, 7.476]
 ⋮
 Atom C with serial 559, coordinates [2.983, 8.729, -0.894]
 Atom O with serial 560, coordinates [3.807, 8.269, -0.105]

The syntax is still limited. We do not support parentheses or other features that VMD syntax supports. Ideally we would be able to parse something like (resname ARG LYS) and (not backbone) and (chain A or model 2). Nevertheless, this is what we have in PDBTools right now, and it is useful for most of our purposes (we switch to standard Julia functions for more complicated things).
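To illustrate the fallback to standard Julia functions, here is a minimal sketch of the compound selection above written as a plain predicate. `ToyAtom`, `isbackbone`, `wanted`, and the backbone name list are all hypothetical stand-ins made up for this example; they are not the BioStructures or PDBTools types.

```julia
# Hypothetical stand-in for an atom record; not the BioStructures atom type.
struct ToyAtom
    name::String
    resname::String
    chain::String
    model::Int
end

# Backbone atom names assumed for this sketch.
const BACKBONE_NAMES = ("N", "CA", "C", "O")
isbackbone(at::ToyAtom) = at.name in BACKBONE_NAMES

# Equivalent of: (resname ARG LYS) and (not backbone) and (chain A or model 2)
wanted(at::ToyAtom) =
    at.resname in ("ARG", "LYS") &&
    !isbackbone(at) &&
    (at.chain == "A" || at.model == 2)

atoms = [
    ToyAtom("CA", "LYS", "A", 1),  # backbone atom, rejected
    ToyAtom("NZ", "LYS", "A", 1),  # side chain in chain A, kept
    ToyAtom("CZ", "ARG", "B", 2),  # side chain in model 2, kept
    ToyAtom("CZ", "ARG", "B", 1),  # neither chain A nor model 2, rejected
    ToyAtom("CB", "ALA", "A", 1),  # wrong residue name, rejected
]

selected = filter(wanted, atoms)
```

The grouping that the string syntax cannot express yet is carried here by ordinary Julia parentheses and operator precedence.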

This is related to the other discussion in the sense of what we want these selections to return. Right now they return vectors of atoms, that is, flat structures. We could instead (this would require some work) return a proper hierarchical structure that respects the model, chain, residue, etc., of the original tree.
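As a rough picture of the difference between the two return shapes, the sketch below regroups a flat vector of atoms into nested dictionaries keyed by model, chain, and residue number. `FlatAtom` and `regroup` are hypothetical names invented for this example, and nested `Dict`s are only an assumption standing in for a real hierarchical structure type.

```julia
# Hypothetical flat atom record carrying its own position in the hierarchy.
struct FlatAtom
    name::String
    resnum::Int
    chain::String
    model::Int
end

# Rebuild a model => chain => residue => atoms tree from a flat selection.
function regroup(atoms::AbstractVector{FlatAtom})
    tree = Dict{Int,Dict{String,Dict{Int,Vector{FlatAtom}}}}()
    for at in atoms
        # get! inserts an empty container the first time a key is seen.
        chains = get!(() -> Dict{String,Dict{Int,Vector{FlatAtom}}}(), tree, at.model)
        residues = get!(() -> Dict{Int,Vector{FlatAtom}}(), chains, at.chain)
        atomlist = get!(() -> FlatAtom[], residues, at.resnum)
        push!(atomlist, at)
    end
    return tree
end

flat = [
    FlatAtom("N", 10, "A", 1),
    FlatAtom("CA", 10, "A", 1),
    FlatAtom("N", 11, "A", 1),
    FlatAtom("N", 5, "B", 2),
]

tree = regroup(flat)
```

The flat vector is simpler to produce and to filter further; the tree keeps the context that the original structure had, at the cost of the extra bookkeeping shown here.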

Food for thought. Hope this helps.


codecov bot commented May 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.08%. Comparing base (04f6ffa) to head (f7ff3a6).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #52      +/-   ##
==========================================
+ Coverage   94.90%   95.08%   +0.18%     
==========================================
  Files           9       10       +1     
  Lines        1786     1852      +66     
==========================================
+ Hits         1695     1761      +66     
  Misses         91       91              


@lmiq
Contributor Author

lmiq commented May 27, 2024

(ps: now I see that it breaks 1.6 compatibility, but that is easily fixable).

@jgreener64
Member

Looks great. We do have an API for selecting atoms, collectatoms(struc, selector_function), so maybe this could be available as collectatoms(struc, sel"selection string") for consistency with that.

@lmiq
Contributor Author

lmiq commented May 28, 2024

@jgreener64

Added the collectatoms method and simplified the code.

One side question: Are you willing to use the TestItems framework here? I've added the tests for this part as a @testitem, which allows them to be run independently and easily from VSCode, and I can adapt the other tests to it. Otherwise I can remove the TestItems and move the tests of this part to the standard test file.

If you are seriously considering merging this, I can add more tests and write some documentation later this week.

@jgreener64
Member

I think this could be merged as it provides a nice interface.

There is already some filtering code for collectatoms at https://github.com/BioJulia/BioStructures.jl/blob/master/src/model.jl#L1198. I think if you overload Base.filter!(by::Select, atoms::AbstractVector{<:AbstractAtom}), which is one of your atom_filter methods, then you shouldn't need to write any collectatoms functions or functions that loop over a ProteinStructure.
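To make the suggestion concrete, here is a minimal sketch of a selection object plugged into `Base.filter!`. `SketchAtom`, `SketchSelect`, and `matches` are hypothetical names for illustration, not the `Select` or `AbstractAtom` types of the PR; the sketch only understands two-token queries such as "chain A", and it builds the selection with a constructor call rather than a `sel"..."` string literal.

```julia
# Hypothetical atom type and selection wrapper, for illustration only.
struct SketchAtom
    name::String
    element::String
    chain::String
end

# Holds the raw selection string, much as a sel"..." literal might.
struct SketchSelect
    query::String
end

# Evaluate a two-token "keyword value" query against one atom.
function matches(sel::SketchSelect, at::SketchAtom)
    keyword, value = split(sel.query)
    keyword == "name" && return at.name == value
    keyword == "element" && return at.element == value
    keyword == "chain" && return at.chain == value
    throw(ArgumentError("unsupported keyword: $keyword"))
end

# Argument order follows the signature suggested in the comment above.
# Only filter! is extended; the body reuses the generic function-based method.
Base.filter!(by::SketchSelect, atoms::AbstractVector{SketchAtom}) =
    filter!(at -> matches(by, at), atoms)

atoms = [
    SketchAtom("N", "N", "A"),
    SketchAtom("HA", "H", "A"),
    SketchAtom("CA", "C", "B"),
]

# Filter a copy so the original vector is left untouched.
chain_a = filter!(SketchSelect("chain A"), copy(atoms))
hydrogens = filter!(SketchSelect("element H"), copy(atoms))
```

With one such method in place, any code path that already ends in a `filter!` call over a vector of atoms can accept a selection object without a dedicated loop over the structure.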

Personally I'm not a fan of TestItems as I prefer the tests in one place.

@lmiq lmiq marked this pull request as draft May 29, 2024 14:20
@lmiq lmiq marked this pull request as ready for review May 29, 2024 14:51
@lmiq
Contributor Author

lmiq commented May 29, 2024

@jgreener64

I think this is ready for review. I won't be available for the next few days.

I removed the overload of filter and made all of this an option for how to run collectatoms.

What's missing from the selection syntax, relative to VMD:

  1. residue to mean the sequential residue index in the PDB file. This is available in VMD, but not here.
  2. segment
  3. Support for parentheses.
  4. Support for implicit or in selections, such as resname ARG GLU to mean resname ARG or resname GLU.
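Item 4 could be handled as a rewrite step before parsing. The sketch below is one possible approach, not the PR's parser: `expand_implicit_or` and the keyword and operator sets are assumptions made for this example. It wraps each expanded group in parentheses so that a surrounding `and` keeps its meaning, which means it relies on item 3 being solved as well.

```julia
# Keyword and operator sets are assumptions made for this sketch.
const VALUE_KEYWORDS = Set(["resname", "chain", "name", "element"])
const OPERATORS = Set(["and", "or", "not"])

# Rewrite "kw v1 v2" as "(kw v1 or kw v2)"; everything else passes through.
function expand_implicit_or(query::AbstractString)
    tokens = String.(split(query))
    out = String[]
    i = 1
    while i <= length(tokens)
        tok = tokens[i]
        i += 1
        if tok in VALUE_KEYWORDS
            # Collect every value that follows the keyword.
            vals = String[]
            while i <= length(tokens) &&
                  !(tokens[i] in OPERATORS) && !(tokens[i] in VALUE_KEYWORDS)
                push!(vals, tokens[i])
                i += 1
            end
            if isempty(vals)
                push!(out, tok)
            else
                terms = join(("$tok $v" for v in vals), " or ")
                push!(out, length(vals) > 1 ? "($terms)" : terms)
            end
        else
            push!(out, tok)
        end
    end
    return join(out, " ")
end

expanded = expand_implicit_or("chain A and resname ARG GLU")
```

Here `expanded` is "chain A and (resname ARG or resname GLU)", while a query with a single value per keyword comes back unchanged.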

In my opinion, if 3 and 4 are solved (someone who has implemented a real parser before could probably do that in a morning), it would be very nice to move the parser to another package of more general utility, because the way the keywords and operators are implemented now is very modular and could be used for other purposes.

I commented out, in the residue list, the non-standard residue names that are used in MD simulations.

There is some redundancy relative to the selectors already available in BioStructures, but I'm not sure if sharing everything is worth the trouble or the reduced modularity.

Member

@jgreener64 jgreener64 left a comment


Looks great, just a few comments.

> There is some redundancy relative to the selectors already available in BioStructures, but I'm not sure if sharing everything is worth the trouble or the reduced modularity.

I might have a look after merging this about sharing more of the functions so that everything available here is available as a selector function and vice versa.

@lmiq
Contributor Author

lmiq commented Jun 3, 2024

There's already BioSymbols.AminoAcid, so I changed the name to AminoAcidResidue. In fact, the masses of that table are the masses of amino acid residues, not of free amino acids.
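The distinction can be made concrete with a small calculation: a residue in a chain is the free amino acid minus one water molecule lost on forming the peptide bond. The sketch below uses rounded average molar masses, and `FREE_MASS` and `residue_mass` are hypothetical names, not the table or functions in the PR.

```julia
# Average molar masses in g/mol, rounded to three decimals.
const WATER_MASS = 18.015
const FREE_MASS = Dict("GLY" => 75.067, "ALA" => 89.094)

# Residue mass = free amino acid mass minus one water lost in the peptide bond.
residue_mass(resname::AbstractString) = FREE_MASS[resname] - WATER_MASS

gly = residue_mass("GLY")  # about 57.052
ala = residue_mass("ALA")  # about 71.079
```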

Probably such information is or should be in a more fundamental package.

@jgreener64 jgreener64 merged commit 4f80e81 into BioJulia:master Jun 3, 2024
9 checks passed
@jgreener64
Member

Great, thanks for this.

@lmiq lmiq deleted the selection_syntax branch June 3, 2024 15:11