selection syntax #52

Merged
merged 31 commits into from
Jun 3, 2024
Changes from 26 commits
1a2d136
adapting select to BioStructures
lmiq May 27, 2024
3282b77
all keywords and macrokeywords working
lmiq May 27, 2024
a5f4283
add tests
lmiq May 27, 2024
5674722
fix tests environments
lmiq May 27, 2024
b7711f8
fix backbone selector
lmiq May 27, 2024
ce446c4
fix 1.6 compatibility and add collectatoms(struc, sel"string") method.
lmiq May 28, 2024
d495d90
reduce size of test file
lmiq May 28, 2024
9fd089a
adjust tests for new test file
lmiq May 28, 2024
9cdcce1
simplify code
lmiq May 28, 2024
4469735
ignore all temporary vim files
lmiq May 28, 2024
74ac54c
remove loop over deprecated functional keywords
lmiq May 28, 2024
43c69cd
merge keywords and macrokeywords in same array
lmiq May 28, 2024
f7b185c
make all non-generic and reorder code
lmiq May 28, 2024
3593c88
remove Select docstring
lmiq May 28, 2024
6d557d7
use filter! to select atoms
lmiq May 28, 2024
762d3f6
throw ArgumentError and test it
lmiq May 28, 2024
187b92b
add disorder and test for it
lmiq May 28, 2024
744536f
remove TestItems, add tests
lmiq May 28, 2024
f61171b
add doc entry
lmiq May 28, 2024
fa424f1
fix internal hyperlink
lmiq May 28, 2024
74d3d6b
remove test pdb file, use downloaded file
lmiq May 29, 2024
f4896df
comment non-standard residues of MD simulations
lmiq May 29, 2024
f43c396
small doc change
lmiq May 29, 2024
9bc6e41
remove reference to TESTPDB
lmiq May 29, 2024
56fd1f8
small format change in code
lmiq May 29, 2024
6775737
use julia-repl syntax where appropriate
lmiq May 29, 2024
10b2ad2
Delete Test/Project.toml
lmiq May 31, 2024
aa222b8
remove using from tests
lmiq Jun 3, 2024
b535e8a
uncomment alternate protonation states
lmiq Jun 3, 2024
7f0f554
rename ProteinResidue to AminoAcid
lmiq Jun 3, 2024
f7ff3a6
rename AminoAcid to AminoAcidResidue
lmiq Jun 3, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ docs/site/
.DS_Store
benchmark/tune.json
Manifest.toml
*.swp
1 change: 1 addition & 0 deletions Test/Project.toml
@@ -0,0 +1 @@
[deps]
lmiq marked this conversation as resolved.
122 changes: 108 additions & 14 deletions docs/src/documentation.md
@@ -20,7 +20,7 @@ downloadpdb("1EN2")

To parse a PDB file into a Structure-Model-Chain-Residue-Atom framework:

```julia
```julia-repl
julia> struc = read("/path/to/pdb/file.pdb", PDB)
ProteinStructure 1EN2.pdb with 1 models, 1 chains (A), 85 residues, 754 atoms
```
@@ -29,7 +29,7 @@ mmCIF files can be read into the same data structure with `read("/path/to/cif/fi
The keyword argument `gzip`, default `false`, determines if the file is gzipped.
If you want to read an mmCIF file into a dictionary to query yourself (e.g. to access metadata fields), use [`MMCIFDict`](@ref):

```julia
```julia-repl
julia> mmcif_dict = MMCIFDict("/path/to/cif/file.cif")
mmCIF dictionary with 716 fields

@@ -144,6 +144,7 @@ This can be changed by setting `expand_disordered` to `true` in [`collectatoms`]

Selectors are functions passed as additional arguments to these functions.
Only elements that return `true` when passed to all the selectors are retained.
See also the selection syntax [described below](@ref String-selection-syntax).
For example:

| Command | Action | Return type |
@@ -199,7 +200,7 @@ collectatoms(struc, at -> x(at) < 0)
[`countatoms`](@ref), [`countresidues`](@ref), [`countchains`](@ref) and [`countmodels`](@ref) can be used to count elements with the same selector API.
For example:

```julia
```julia-repl
julia> countatoms(struc)
754

@@ -215,7 +216,7 @@ julia> countatoms(struc, expand_disordered=true)

The amino acid sequence of a protein can be retrieved by passing an element to [`LongAA`](@ref) with optional residue selectors:

```julia
```julia-repl
julia> LongAA(struc['A'], standardselector)
85aa Amino Acid Sequence:
RCGSQGGGSTCPGLRCCSIWGWCGDSEPYCGRTCENKCW…RCGAAVGNPPCGQDRCCSVHGWCGGGNDYCSGGNCQYRC
@@ -227,7 +228,7 @@ See [BioSequences.jl](https://github.com/BioJulia/BioSequences.jl) and [BioAlign
[`LongAA`](@ref) is an alias for `LongSequence{AminoAcidAlphabet}`.
For example, to see the alignment of CDK1 and CDK2 (this example also makes use of Julia's [broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting-1)):

```julia
```julia-repl
julia> struc1, struc2 = retrievepdb.(["4YC6", "1HCL"])
2-element Array{ProteinStructure,1}:
ProteinStructure 4YC6.pdb with 1 models, 8 chains (A,B,C,D,E,F,G,H), 1420 residues, 12271 atoms
@@ -270,6 +271,99 @@ PairwiseAlignmentResult{Int64, LongAA, LongAA}:
In fact, [`pairalign`](@ref) is extended to carry out the above steps and return the alignment by calling `pairalign(struc1["A"], struc2["A"], standardselector)` in this case.
`scoremodel` and `aligntype` are keyword arguments with the defaults shown above.

## String selection syntax

!!! compat
    The string-selection syntax was introduced in version 3.13.

`BioStructures.jl` exports the `sel` string macro, which provides a practical way to collect atoms from a structure
using a natural selection syntax. It must be used as:
```julia
collectatoms(struc, sel"selection string")
```
where `struc` is the input structure and `selection string` defines the atoms to be selected, with the
operators and keywords defined below.

For example:

```julia-repl
julia> struc = retrievepdb("4YC6")
[ Info: Downloading file from PDB: 4YC6
ProteinStructure 4YC6.pdb with 1 models, 8 chains (A,B,C,D,E,F,G,H), 1420 residues, 12271 atoms

julia> collectatoms(struc, sel"name CA and resnumber <= 5")
24-element Vector{AbstractAtom}:
Atom CA with serial 2, coordinates [17.363, 31.409, -27.535]
Atom CA with serial 10, coordinates [20.769, 32.605, -28.801]
Atom CA with serial 11096, coordinates [-8.996, 6.094, -29.097]
```

There are also macro-keywords to select groups of atoms with specific properties. For example:

```julia-repl
julia> ats = collectatoms(struc, sel"acidic and name N")
188-element Vector{AbstractAtom}:
Atom N with serial 9, coordinates [19.33, 32.429, -28.593]
Atom N with serial 18, coordinates [21.056, 33.428, -26.564]
Atom N with serial 11603, coordinates [-0.069, 21.516, -32.604]

julia> resname.(ats)
188-element Vector{SubString{String}}:
"GLU"
"ASP"
"GLU"
```

The currently supported operators are:

| Operators | Acts on |
| :-------------------------- | :---------------------------------- |
| `=`, `>`, `<`, `>=`, `<=`  | Properties with real values.        |
| `and`, `or`, `not`          | Selection subsets.                  |
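
The operators in the table can be chained within one selection string. The following is a minimal sketch rather than an excerpt from the package documentation: it assumes network access to download `4YC6` (the structure used in the examples above), and the variable names are purely illustrative. No output is shown, since the atom counts depend on the downloaded file.

```julia
using BioStructures

# Download and parse the structure used in the examples above (needs network access)
struc = retrievepdb("4YC6")

# Numeric comparisons joined with `and`: CA atoms of residues 10 to 20
ca_range = collectatoms(struc, sel"name CA and resnumber >= 10 and resnumber <= 20")

# `not` negates a condition: protein atoms that are not backbone atoms
non_backbone = collectatoms(struc, sel"protein and not backbone")

# `or` keeps atoms that satisfy either condition
ala_gly = collectatoms(struc, sel"resname ALA or resname GLY")
```

Each call returns a vector of atoms, so the results can be inspected by broadcasting accessor functions over them, as done with `resname.(ats)` above.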

The supported keywords are:


| Keyword | Input type | Selects for |
| :-------------------------- | :-----------:|:---------------------|
| `index` | `Int` | `serial` |
| `serial` | `Int` | `serial` |
| `resnumber` | `Int` | `resnumber` |
| `resnum` | `Int` | `resnumber` |
| `resid` | `Int` | `resid` |
| `occupancy` | `Real` | `occupancy` |
| `beta` | `Real` | `tempfactor` |
| `tempfactor` | `Real` | `tempfactor` |
| `model` | `Int` | `modelnumber` |
| `modelnumber` | `Int` | `modelnumber` |
| `name` | `String` | `atomname` |
| `atomname` | `String` | `atomname` |
| `segname` | `String` | `segname` |
| `resname` | `String` | `resname` |
| `chain` | `String` | `chainid` |
| `chainid` | `String` | `chainid` |
| `element` | `String` | `element` |
| `water` | | Water molecules |
| `protein` | | Protein atoms |
| `polar` | | Polar residues |
| `nonpolar` | | Non-polar residues |
| `basic` | | Basic residues |
| `acidic` | | Acidic residues |
| `charged` | | Charged residues |
| `aliphatic` | | Aliphatic residues |
| `aromatic` | | Aromatic residues |
| `hydrophobic` | | Hydrophobic residues |
| `neutral` | | Neutral residues |
| `backbone` | | Backbone atoms |
| `heavyatom` | | Heavy atoms |
| `disordered` | | Disordered atoms |
| `sidechain` | | Side-chain atoms |
| `all`                       |              | All atoms            |
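
Several keywords in the table map to the same property (for example `beta` and `tempfactor`), and each one names an accessor function that can also be used in an ordinary function selector. The sketch below illustrates that correspondence under some assumptions: network access to download `4YC6`, a threshold of `50.0` chosen arbitrarily, and hypothetical variable names. The final comparison is expected to hold if the keyword applies `tempfactor` directly, as the table indicates.

```julia
using BioStructures

# Download and parse the structure used in the examples above (needs network access)
struc = retrievepdb("4YC6")

# `beta` and `tempfactor` are listed as keywords for the same property
hot_beta = collectatoms(struc, sel"beta > 50.0")
hot_temp = collectatoms(struc, sel"tempfactor > 50.0")

# The same condition written as a function selector
hot_func = collectatoms(struc, at -> tempfactor(at) > 50.0)

# Compare the three selections by atom serial number
same_atoms = serial.(hot_beta) == serial.(hot_temp) == serial.(hot_func)
```

The string form is convenient for interactive use, while function selectors remain available for conditions that the keywords do not cover.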


## Spatial calculations

Various functions are provided to calculate spatial quantities for proteins:
@@ -304,7 +398,7 @@ The [`omegaangle`](@ref) and [`phiangle`](@ref) functions measure the angle betw
The [`psiangle`](@ref) function measures between the given index and the one after.
For example:

```julia
```julia-repl
julia> distance(struc['A'][10], struc['A'][20])
10.782158874733762

@@ -325,7 +419,7 @@ julia> rad2deg(psiangle(struc['A'], 50))
[`ContactMap`](@ref) can also be given two structural elements as arguments, in which case a non-symmetrical 2D array is returned showing contacts between the elements.
The underlying `BitArray{2}` for [`ContactMap`](@ref) `contacts` can be accessed with `contacts.data` if required.

```julia
```julia-repl
julia> contacts = ContactMap(collectatoms(struc['A'], cbetaselector), 8.0)
Contact map of size (85, 85)
```
@@ -387,7 +481,7 @@ The contacting elements in a molecular structure form a graph, and this can be r
This extends `MetaGraph` from [MetaGraphs.jl](https://github.com/JuliaGraphs/MetaGraphs.jl), allowing you to use all the graph analysis tools in [Graphs.jl](https://github.com/JuliaGraphs/Graphs.jl).
For example:

```julia
```julia-repl
julia> using Graphs, MetaGraphs

julia> mg = MetaGraph(collectatoms(struc["A"], cbetaselector), 8.0)
@@ -488,7 +582,7 @@ In this case download the mmCIF file or MMTF file instead.

To parse an existing PDB file into a Structure-Model-Chain-Residue-Atom framework:

```julia
```julia-repl
julia> struc = read("/path/to/pdb/file.pdb", PDB)
ProteinStructure 1EN2.pdb with 1 models, 1 chains (A), 85 residues, 754 atoms
```
@@ -506,15 +600,15 @@ Various options can be set through optional keyword arguments when parsing PDB/m

Use [`retrievepdb`](@ref) to download and parse a PDB file into a Structure-Model-Chain-Residue-Atom framework in a single line:

```julia
```julia-repl
julia> struc = retrievepdb("1ALW", dir="path/to/pdb/directory")
INFO: Downloading PDB: 1ALW
ProteinStructure 1ALW.pdb with 1 models, 2 chains (A,B), 346 residues, 2928 atoms
```

If you prefer to work with data frames rather than the data structures in BioStructures, the `DataFrame` constructor from [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl) has been extended to construct relevant data frames from lists of atoms or residues:

```julia
```julia-repl
julia> using DataFrames

julia> df = DataFrame(collectatoms(struc));
@@ -547,7 +641,7 @@ As with file writing disordered entities are expanded by default but this can be
You can read and write files containing multiple mmCIF data blocks (equivalent to a `MMCIFDict` in this package) with the [`readmultimmcif`](@ref) and [`writemultimmcif`](@ref) functions.
An example of such a file is the PDB's [chemical component dictionary](https://www.wwpdb.org/data/ccd).

```julia
```julia-repl
julia> ccd = readmultimmcif("components.cif.gz"; gzip=true);

julia> ccd["2W4"]
@@ -590,14 +684,14 @@ Multi-character chain IDs can be written to mmCIF and MMTF files but will throw
If you want the PDB record line for an [`AbstractAtom`](@ref), use [`pdbline`](@ref).
For example:

```julia
```julia-repl
julia> pdbline(at)
"HETATM 101 C A X B 20 10.500 20.123 -5.123 0.50 50.13 C1+"
```

If you want to generate a PDB record line from values directly, do so using an [`AtomRecord`](@ref):

```julia
```julia-repl
julia> pdbline(AtomRecord(false, 669, "CA", ' ', "ILE", "A", 90, ' ', [31.743, 33.11, 31.221], 1.00, 25.76, "C", ""))
"ATOM 669 CA ILE A 90 31.743 33.110 31.221 1.00 25.76 C "
```
1 change: 1 addition & 0 deletions src/BioStructures.jl
@@ -31,6 +31,7 @@ include("mmcif.jl")
include("mmtf.jl")
include("spatial.jl")
include("secondary.jl")
include("select.jl")

if !isdefined(Base, :get_extension)
include("../ext/BioStructuresDataFramesExt.jl")