failing to read PDB files generated by VMD #48
Comments
The offending PDB lines are
and
which both try to put an atom with name We treat this defensively since there are no or very few cases in the PDB that have such duplicate atoms. I reported a handful of cases to the PDB a while ago and they updated the records, so I believe it is considered a format violation. In this case you could use different chain IDs (the empty column 22) or keep incrementing the residue number. |
Since this file was written by VMD, I assume this should be reported to them? @lmiq, can you do that? |
Well, there is nothing wrong with that file for the use it was designed for, so I do not think there's anything to report there, upstream. The parsing of the residues in that file is simply done by incrementing the residue counter when the residue index changes (for more or less). Having limitations in the number of residues per chain, or number of chains, etc, is something that cannot be important in MD simulations PDB files. This is the choice to be made here: have or not the possibility of parsing non-standard files to some degree. I perfectly understand if Note that this limitation, specifically, is associated to trying to read the data into the hierarchical structure, so in some sense this is related to that initial choice of representation. Also, note the |
If it's true that it violates https://www.wwpdb.org/documentation/file-format (I haven't checked), I'd say that's clearly a problem. It's not OK that there's a workaround. It introduces ambiguities and issues, precisely as being discussed here. But if the format is ambiguous, then that's another issue entirely. |
It's certainly challenging to balance the considerations.
This is the key: BioStructures guarantees that every atom has "meaning" about what it is (i.e. on which residue and chain), which is useful in many contexts (such as file interchange) but in turn necessitates some complexities in representation and strictness in parsing. The only way the original case could work currently is if the existing atom is overwritten (or the new atom ignored) with a warning, which is unlikely to be desired behaviour even if it runs without error. I guess philosophically the aim of BioStructures is not so much to read structural files as to represent unambiguously the molecules within them. This affects other design decisions, such as why
Sadly, this seems to be the case. There is a lot of documentation on the column format, but as far as I know not so much on the row format (e.g. what duplicates mean). In general I prefer it when other tool authors use |
From the description, that file does not adhere to the standard because:
and there the chain identifier is blank. But the standard does not say anything about two residues in the same chain having the same number (which is actually the issue here). For actual bio-structures it is just assumed that that does not make sense. But when part of the system is thousands of water molecules, that is just unnecessarily limiting. |
Given that the format is ambiguous, then perhaps we do need to deal with ambiguity. I don't think this should be open-ended: it's kind of a disaster to take the strategy "can I find something that seems to make sense in column 5? No? OK, let's see if things seem more sensible if I substitute the value in column 11? No? OK, maybe I can infer a good value from column 13?" But if instead one passes |
is this part of the issue? That (legacy) PDB only lets you use a single character for encoding the chain? In which case shouldn't VMD be writing mmCIF instead? |
As a minimal example:
VMD, Pymol, Packmol, and other common simulation packages interpret this without problems. There are three water molecules there. The issue is that this does not fit into a hierarchical structure, unless the reader creates chains arbitrarily. That would come with a bunch of problems as well.
Well, in some sense yes, the PDB format is problematic. But that ship sailed decades ago in the MD field. Not that there aren't other formats that are more appropriate, particularly for the limitations of the coordinates fields, but the legacy PDB format is still widely used because of its readability. This is something we (MD simulation package authors) just have to live with. ps: MIToS is reading this if one adds occupancy and b-factor fields (added now), because its hierarchy is at the residue level only, and it simply does not care about the "meaning" of two residues with identical numbers in the same chain.
So maybe we need a |
Yes, but the question still remains about how to put the two atoms into the same hierarchical structure. We could assign an unused residue number or chain ID, but that feels clunky and runs into problems if the assigned value is found later during reading. We could have another data type that stores multiple coordinates for the same atom, but that raises concerns about how to write the file back out in the same order. We could assign duplicates into a new |
Maybe the flavor should be the type of hierarchy desired: |
I guess my thought is that we decide on concrete flavor-specific behavior and document it. Passing a flavor might require that the entire file be scanned first to identify "occupied" names/indices/whatever, which would be a bit of a performance hit, but perhaps the price of abusing a file format. |
I think it is a little more complicated than that. If we keep the hierarchy we are forced to attribute some of the fields arbitrarily. That will cause confusion. |
Isn't the hierarchy about how you assign bonds? Meaning, you have implicit hierarchy any time you want to link entries in rows together. |
The issue is that multiple atoms want the same "spot" in the hierarchy, because the file contains multiple objects which have the same name. So you either have to discard one atom, have multiple versions of the hierarchy, or store duplicate atoms in that spot (beyond disorder, which we already do). We could have a flag
as
but would still error on
Renumbering like this is a useful feature anyway that is available in a number of PDB packages. |
I'm not sure if I understand your comment. In the MD files, I think it is safe to assume that the "residue" is the object par excellence. But the "residue number", or "name", "chain", or any other property, are just meaningless labels that help one to draw or select different parts of the structure. Thus, if there was an option to just flatten the hierarchy to have "residues" at the top level, the MD files could be parsed without major issues (as MIToS does now, and PDBTools parses at the atom level but does provide an iterator over residues). What I can't say is whether having the file parsed in a different hierarchy would require all the other functions of BioStructures to have additional methods, thus effectively creating a parallel package within it. By the way, I noticed now that:
that per-se would not fit quite well in MD files, as many of them do not have any protein whatsoever. |
That would help, but I think it would still fall short, as the hierarchy on chains is not really meaningful in this context. Concerning this:
That specifically is more complicated. VMD and Pymol recognize that only because they compute the connectivity from the distances and deduce that there are 3 residues there. That I think is beyond what is expected for a reading package. Also, |
The name |
I think I begin to see your point. You're basically saying that it's OK with you if all the waters use the same label and you can't select them individually, right? That's a valid perspective if either (1) you don't actually care about bonds, or (2) there's an external program to assign bonds. I'd guess that few people are actually in camp 1, but camp 2 is presumably common. Is there any ambiguity, though, about assigning bonds simply by distance? If so, then the reader is still partially responsible for indicating hierarchy, like via the TER that Amber requires. So to me it seems that some kind of renumbering, based on an unambiguous termination indicator, might be a reasonable solution? |
Bonds (or more generally the topology of the molecules) are defined in different files in MD simulations. In the PDB files they can be written with the There are certainly problems in assigning bonds based on distances. Sometimes atoms from different molecules are too close and that causes errors. VMD throws errors/warnings in these cases, when it detects that some atom appears to have more bonds than allowed. Assigning bonds based on distances is useful, but only as a workaround when the topology files are not provided.
You could still select one individually by a residue counter (which might or not match the "residue number" written in the PDB file), exactly how it happens when such files are read by
The packages dissociate the residue and atom counts from the "residue number" and "atom index" as written in the PDB files. What we do, which I think is a general enough solution, is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started. |
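The counter rule described above can be sketched as follows. This is a minimal illustration and not the code of PDBTools or any other package: the `AtomRecord` type, its field names, and `residue_counters` are hypothetical names introduced only for this example.

```julia
# Hypothetical minimal atom record; field names are illustrative only.
struct AtomRecord
    name::String
    resname::String
    chain::String
    resnum::Int
    segname::String
end

# The labels that together identify the residue an atom was written under.
residue_labels(a::AtomRecord) = (a.resnum, a.chain, a.resname, a.segname)

# Assign an incremental residue counter that is independent of the resnum
# column: it advances whenever any label differs from the previous atom's.
function residue_counters(atoms::Vector{AtomRecord})
    counters = Vector{Int}(undef, length(atoms))
    current = 0
    previous = nothing
    for (i, atom) in enumerate(atoms)
        labels = residue_labels(atom)
        if labels != previous
            current += 1
            previous = labels
        end
        counters[i] = current
    end
    return counters
end

atoms = [
    AtomRecord("N",   "SER",  "A", 225, "-"),
    AtomRecord("CA",  "SER",  "A", 225, "-"),
    AtomRecord("N",   "ALA",  "A", 226, "-"),
    AtomRecord("OH2", "TIP3", "W", 1,   "-"),
    AtomRecord("OH2", "TIP3", "W", 2,   "-"),
    AtomRecord("OH2", "TIP3", "W", 1,   "-"),  # residue number reused later in the file
]
counters = residue_counters(atoms)
```

Under this rule a residue number that is reused later in the file still gets a fresh counter value, but consecutive residues whose labels are all identical are merged into one, so files of that kind would still need some other termination signal.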
I'd favor strict-by-default and requiring that you pass |
Just to check, the case in #48 (comment) doesn't fit that description, does it? Are you wishing that would work anyway or are you OK with requiring some kind of termination signal? |
This is what I had in mind with Note, this would mean you can't write the original PDB file back out as we would discard the residue numbers in the file. |
Uhm... I don't think residue renumbering is the correct approach. We do not want residue numbers to be changed when reading protein residues, for example. What is necessary is to dissociate the residue counter from the residue number as written in the PDB file. (And I would say also dissociate the atom counter from the atom index as written in the PDB file). Note: VMD has different attributes for each: I see now that the residues are stored in a Dict in I wonder if having the residues of each chain in a vector wouldn't be more appropriate - I'm not very comfortable with the order of the residues in the file being meaningless. For instance this is how the residues of

julia> pdb[1].chains["A"].residues
Dict{String, BioStructures.AbstractResidue} with 282 entries:
  "407" => Residue 407:A with name LYS, 9 atoms
  "371" => Residue 371:A with name ARG, 11 atoms
  "447" => Residue 447:A with name ILE, 8 atoms
  "335" => Residue 335:A with name ASN, 8 atoms

not having them in order is somewhat strange.
Just changing the dicts to

julia> pdb[1].chains["A"].residues
OrderedDict{String, AbstractResidue} with 282 entries:
  "225" => Residue 225:A with name SER, 6 atoms
  "226" => Residue 226:A with name ALA, 5 atoms
  "227" => Residue 227:A with name ASN, 8 atoms
  "228" => Residue 228:A with name GLU, 9 atoms

which then, if read with the "new residue approach", would implicitly store both the counter and the residue number. But I would probably prefer having a Or having to explicitly use something like
@timholy sorry, missed this comment. Yes, that case I think we can leave out. Maybe support |
Am I understanding right that the key in the I guess this could be solved with a vector type, but then what should be returned when you index into it with
The order is not stored directly in the object, but the residues are sorted when writing out. This means that different file representations of the same underlying molecule get written out to the same file, but you do lose the ordering of the input file. In the usual case of ascending residue numbers in the input file (possibly with gaps), this is preserved in the output file. |
I think @jgreener64's point is that this admirable goal makes things harder---if you only increment the counter when something is in conflict, what happens if a later structure in the file uses the number you "stole" in order to do the increment? Two possible solutions are (1) to parse the file twice to identify all conflicts in advance, or (2) transiently distinguish "unconflicted" ids and "conflicted" ids and then assign final ids after all unconflicted ones have been assigned. While I proposed (1) above, maybe (2) is the better choice.
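Option (2) can be sketched as below. This is only an illustration under simplifying assumptions: residues are reduced to their requested integer ids in file order, and `assign_ids` is a hypothetical helper, not a function of BioStructures.

```julia
# `requested` holds the residue numbers in file order, one entry per residue.
# First occurrences keep their id; repeats are deferred until the whole file
# has been seen, and only then receive ids that no residue has claimed.
function assign_ids(requested::Vector{Int})
    final = Vector{Int}(undef, length(requested))
    taken = Set{Int}()
    pending = Int[]                 # positions whose requested id was already taken
    for (i, id) in enumerate(requested)
        if id in taken
            push!(pending, i)       # conflicted: decide later
        else
            push!(taken, id)        # unconflicted: keep the id from the file
            final[i] = id
        end
    end
    next_free = 1
    for i in pending
        while next_free in taken
            next_free += 1
        end
        final[i] = next_free
        push!(taken, next_free)
    end
    return final
end

ids = assign_ids([1, 2, 1, 3, 2])
```

In this example the third residue cannot be handed id 3 on the spot, because a later residue legitimately asks for 3; deferring the decision avoids "stealing" that number without reading the file twice.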
I've learned that there's a separate field that gives the order of the keys, and you can access the dict with integer indexing as
👍 |
I think I need to take a step back and explain how we use the data here. From our perspective, all fields of the PDB are just labels. Apart from the data of the fields, in the process of reading the file, one annotates, incrementally, an independent residue counter and an independent atom index counter. At the end, we have a vector of atoms where each atom carries all that information:

julia> pdb = PDBTools.readPDB("/home/leandro/Downloads/6hn6.pdb")
Array{Atoms,1} with 2090 atoms with fields:
index name resname chain resnum residue x y z occup beta model segname index_pdb
1 N SER A 225 1 43.004 80.351 76.389 1.00 118.26 1 - 1
2 CA SER A 225 1 42.216 81.220 75.509 1.00 116.77 1 - 2
3 C SER A 225 1 42.952 81.518 74.177 1.00 118.83 1 - 3
⋮
2088 O HOH A 638 306 23.920 74.193 77.532 1.00 55.18 1 - 2089
2089 O HOH A 639 307 6.395 85.305 51.528 1.00 104.62 1 - 2090
2090 O HOH A 640 308 19.024 61.159 68.618 1.00 64.07 1 - 2091

There, for instance, Then, working with such a data structure consists in applying filters, which return vectors of atoms, or indexes:

julia> filter(sel"resnum < 300", pdb)
Array{Atoms,1} with 587 atoms with fields:
index name resname chain resnum residue x y z occup beta model segname index_pdb
1 N SER A 225 1 43.004 80.351 76.389 1.00 118.26 1 - 1
2 CA SER A 225 1 42.216 81.220 75.509 1.00 116.77 1 - 2
3 C SER A 225 1 42.952 81.518 74.177 1.00 118.83 1 - 3
⋮
585 CG1 ILE A 299 75 23.370 60.986 68.597 1.00 44.83 1 - 585
586 CG2 ILE A 299 75 23.842 62.784 70.389 1.00 46.63 1 - 586
587 CD1 ILE A 299 75 24.450 60.003 68.979 1.00 49.51 1 - 587
julia> findall(sel"residue = 1", pdb)
6-element Vector{Int64}:
1
2
3
4
5
6

We do not need these filtering operations to be particularly fast, or lazy. But we do need to be able to select subsets of the structure with great versatility, potentially using incremental or pdb-defined residue numbers (very common), or any other atom property. That data structure doesn't prevent us from iterating over residues, with an appropriate iterator:

julia> for res in eachresidue(filter(sel"residue <= 2", pdb))
println(resname(res)," ",resnum(res))
end
SER 225
ALA 226

and we could (but didn't yet) define similar iterators for models, chains, etc, of course. We just don't use those as often. This way of dealing with the data has no restrictions concerning duplicate residues, residue numbers, etc. If two residues have the same data, they will just be filtered together, as we would expect them to. Still, we can differentiate them by their incremental residue counter and incremental atom indices. There are no conflicts when reading the data, and no special issues associated with repeated fields. This way of storing and using the data is convenient for us. And I think that we must understand why or how it is not convenient for other people, to justify having different data structures, and to which extent. For us, using a syntax like
We've also gotten a lot of benefit from random-access indexing. For example we might pick out all the positively-charged residues and examine their spatial distribution. We also compute displacement vectors from the alpha carbon to the side chain center-of-mass to determine whether residues in a 7TM are "interior" or "exterior." And so on. I don't think the That said, I think BioStructures is doing something important: it focuses on structural representation rather than file format, and it can get multiple file formats into that same representation. It also seems to take the complexities of that process seriously. It's why I'm prepared to do a fair amount of work to port our own code over to BioStructures, if the technical issues (convenience, comprehensiveness, speed, heaviness of dependency, etc) can all be resolved. I for one am quite optimistic that this is the case. And if not, I think the goal is worthy even if we have to have a fresh start. Fundamentally, the current fragmentation of the representations for structures is very detrimental to the long-term health of the Julia ecosystem for questions that relate to structural biology. I have a conference deadline in a few weeks, so I am not sure how much I can do between now and then, but this is something I'm willing to put some work into to help make it happen. |
Overall, this discussion seems a bit like a
The last, of course, is something that can be enhanced any time there is an unmet need. But I don't think that all users would agree that I think this analogy is useful because it highlights that the three packages seem to pick different levels for their default representation:
This diversity highlights the fact that this choice is a bit arbitrary. To me that's a strong argument that there are many ways we can all get our work done, and that the most important thing to do is unify around a single representation, regardless of whether it's hierarchical or flat. We need a standard representation to build a large, high-quality stack in this space.
I'd propose that we distinguish the discussion about "hierarchical or flat" from the "strict vs permissive" discussion. While a Dict('a' => 1, 'a' => 2) you can always use Dict('a' => [1, 2]) as long as the rest of the code knows that the value field should be interpreted as a container of values. |
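The container-valued alternative can be sketched like this; `multidict` is a hypothetical helper name used only for illustration.

```julia
# Passing the same key twice to the Dict constructor keeps a single entry.
overwritten = Dict('a' => 1, 'a' => 2)

# Collect every value stored under the same key instead of overwriting.
function multidict(pairs)
    d = Dict{Char,Vector{Int}}()
    for (k, v) in pairs
        # get! returns the existing vector, or inserts and returns a new empty one
        push!(get!(d, k, Int[]), v)
    end
    return d
end

d = multidict(['a' => 1, 'a' => 2, 'b' => 3])
```

The cost is that every consumer has to treat the value as a collection, even for keys that have only one entry.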
Sort of, because PDBs have also "model" and "segment" identifiers. There is a lot of arbitrariness in the way the PDB format classifies parts of the structures. "chain" is a synonym of "molecule" from a chemical point of view, but if that was taken to heart then water molecules should split on different chains. IMHO the underlying representation of molecular structures should be at the "molecule" level, and then flat at the atom level. A "molecule" is effectively meaningful, but even the content of the molecule is subject to ambiguities (as multiple conformations or incomplete atomic positions). All the remaining information that a PDB (or other format) contains are just arbitrary annotations, which I think can only be represented generally enough if at the atom level with custom fields. If we had an underlying molecular structure format that represented molecules where atoms have optional fields, that could effectively be useful for interoperability of the various molecular-structure packages. Anticipating the possibility of storing the topology of the molecule in the same format is probably a must. ps: The only real constraint we have on the underlying format is concerning performance for specific operations. I think the choice of one specific format could be limited in view if we identify an important and common operation that needs to be very fast on very large data sets. But probably in those cases intermediate representations are needed anyway. |
CC @anton083. This discussion started on Slack (I invited @murrellb but a search didn't turn up a user handle for you), but most of the content is here now. The bottom line is that I'm trying to build momentum around the idea that currently we have 3 main packages for reading PDBs and that we really should settle on just one. This productive discussion is mostly about identifying any barriers that currently prevent this, and what can be done to fix them. |
Personally that seems like a good idea to me, if the one-character limit of PDB files weren't a constraint. Of course you'd want tools to "find/discard all waters" but that's basically sugar.
Conceptually there's a lot of value in that idea---you're right that 3×4 DataFrame
Row │ id symbol residue chain
│ Int64 String Missing Int64
─────┼───────────────────────────────
1 │ 1 C missing 1
2 │ 2 O missing 1
3 │ 3 O missing 1

than via

struct Chain
    residues::Vector{Residue}
end
struct Residue
    atoms::Vector{Atom}
end

For a hierarchical representation, we could fix that with

struct Chain
    residues::Union{Vector{Residue},Vector{Atom}}
end

or

struct Chain
    residues::Union{Vector{Residue}, Nothing} # specify either (not both!)
    atoms::Union{Vector{Atom}, Nothing}
end

but presumably that would be viable.
This is one place where hierarchical representations have advantages. If you have a completely flat representation, say with 20 chains, each with 300 residues, each with ~8 atoms, then any time you select anything from the "dataframe" you need to traverse 20 * 300 * 8 = 48000 rows. Whereas with a hierarchical representation, if you only need "chain D" then you immediately reduce down to 300 residues regardless of whether any other efficiencies are possible. Of course, hybrid strategies achieve hybrid performance. As an example, one thing we do in our analyses all the time is compute the center-of-mass of the each residue side-chain. This is easy and efficient with a hierarchical representation. With a flat representation of atoms, you'd need the equivalent of DataFrames' |
The problem here is that this is not up to us to decide. PDB files in which multiple molecules belong to the same chain are already in use. If for generality, I would rather use

struct Molecule
    atoms::Vector{Atom}
    bonds::Vector{Tuple{Int,Int}}
end

where
Yes, those things can be easier with the hierarchical approach. But, is that performance really important? And, at the same time, with the flat approach that specifically can also be done in O(n) (just mentioning because I think it is important to identify actual applications where one or other format is effectively limiting): julia> using PDBTools, StaticArrays
julia> cm_side_chains = SVector{3,Float64}[]
for residue in eachresidue(atoms) # lazy
side_chain = select(residue, by = issidechain) # select side chain atoms of this residue
length(side_chain) > 0 && push!(cm_side_chains, center_of_mass(side_chain))
end (yes, if I needed the CM of one specific residue, I still have to traverse the array). |
How we represent structures is/can be distinct from how they are encoded in the file. Several of our posts above concern renumbering schemes that ignore one or more tags in the file. To the extent that
That's getting a bit into MolecularGraph.jl territory. It's an excellent package and we use it heavily.
Yes, that's basically what I meant by "groupby"; if your code example is meant to run in accum = [zeros(3) for _ = 1:nresidues]
count = fill(0, nresidues)
for atom in list
idx = residueindex(atom)
accum[idx] += atom.coords
count[idx] += 1
end
centerofmass = accum ./ count but of course that's a bit algorithm-specific (it works for linear measures on the residue, less obviously well for nonlinear measures). |
It is because "chain" is thought of as "polymer chain", which makes sense for proteins or nucleic acids, or other polymers. Calling a water molecule a "chain" would be an abuse of notation, though. Yet, a molecule could be a vector of general "ChemicalUnit"s, for which a single molecule is one, and a polymer residue is also one.
then a water molecule would have a single chemical unit, a polymer could have many and map what we call "chain" in the specific chain of PDB files. Just throwing out an idea, I'm not claiming "ChemicalUnit" is necessarily a good name.
I find this useful for interactive structural biology applications, but as you say any representation can probably do this by overloading
I think we are all aware of this but to be clear, BioStructures allows the 3 types of multiple conformations that appear in the PDB: alternative atom (representing ambiguous experimental determination), alternative residue (ambiguous experimental determination or mutation) and
I agree with this, I am suggesting that even if the data structure allows it we should think very carefully before being permissive in this sense.
We could think about making this change, but would have to think about how it plays with
This is exactly the aim of AtomsBase.jl, though it hasn't taken off for the biomolecular case yet. They have some readers and writers too. It would be nice to have support for that in BioStructures.
Residue-level is also natural and useful for biopolymers, for example for assigning parameters for a molecular dynamics simulation which are parameterised by residue. |
But what's the label then? Two identical atoms belonging to different conformations of the same residue do not have a clear way to be distinguished by a label in the PDB. If we need to invent a label, I actually find more useful to just stick with the index of the atom in the "file" (or more generally, "MolecularSystem"), which in fact is a useful index. |
I'm still coming to grips with this myself, but isn't this what |
It is. I meant it is not useful as a unique identifier, unless some juggling is done.
Why is it not useful? The alt loc ID can be As I see it small-scale conformational variability is handled okay. The case we are missing might appear when you have >9999 water molecules, you can't write Another case would be where you have different conformations of the whole system, which could represent a MD trajectory, but |
I was just editing my comment above. Now I could check what BioStructures does, which is to store disordered atoms with a different type (that was the piece I was missing in terms of using the field for representing the multiple conformations):

julia> s["A"].residues["364"].atoms[" CB "]
DisorderedAtom CB with alt loc IDs A,B
julia> s["A"].residues["364"].atoms[" CB "]['A']
Atom CB with serial 1085, coordinates [25.31, 88.547, 65.041], alt loc ID A
julia> s["A"].residues["364"].atoms[" CB "]['B']
Atom CB with serial 1086, coordinates [25.318, 88.585, 65.034], alt loc ID B

I don't have any objection to that, in principle. Retrospectively, I think, I didn't use If a good solution to that is found, I might test if the typical files I have to handle are parsed and, then, I wouldn't mind moving the reading infrastructure and even the data formats of PDBTools to that. If someone smarter than me can implement a full-featured selection syntax like that of VMD (and which PDBTools implements only partially), I think PDBTools could be retired for good.
PDBTools (and VMD) switch to hexadecimal representations when that happens (also for the atom indices). ps: I'm still uneasy about not using the sequence of atoms as written in the files. While I see the appeal in the fact that that sequence is somewhat arbitrary and non-physical, the fact is that editing these molecular systems is very commonly a back-and-forth lookup on the actual files, and I would rather have a clear correlation between the representation of the molecular system and what is written. I do not see any advantage of not being able to readily access the atoms by their incremental indices in the PDB file. Maybe I could get accustomed to using |
Agree that a selection syntax would be great. MDAnalysis is another reference point there: https://userguide.mdanalysis.org/stable/selections.html.
We could probably extend to read this in.
We could store this as an extra field. In general I think it is just the atom serial though, which we do store. The order that residues are read in we don't store. |
Not necessarily, because the atom indices too frequently overflow the format. VMD just ignores them and counts the atoms sequentially. PDBTools stores the serial numbers of the PDB in the Concerning levels of classification: there is the "segment" level as well, which is actually important. For instance, we are dealing now with a virus particle. Viruses have a lot of symmetry, and the deposited PDB has only a minimal unit that has to be replicated to compose the complete virus capsid. Within that minimal unit there are of course different chains. Upon replication of the unit, we have then multiple repeated chains, and we do not want to change the name of these chains, because it is inconvenient for many sorts of analysis. The different replicas of the minimal unit are then differentiated by the their "segment names". Thus, in this case, we have, in the same MODEL, many repeated CHAIN identifiers, with repeated fields for everything else, except the "segment" name. |
Awesome, @lmiq. I'll be happy to help out once I get past my conference deadline. Regarding atom order, one thing to think about is the overhead of allocating a |
It is kind of amazing that folks (including me!) haven't adopted mmCIF, because these are the kinds of problems it was designed to solve. In my own case, the small step from "PDB" the database to "PDB" the file format made me assume "oh, PDB is the right format to download." I only just read enough about this in the past few days to realize that I should switch to mmCIF files, and how weird it is that we all still prefer a file format heavily tailored to the era when keyboards only had capital letters, monitors had black backgrounds with glowing green fonts, and terminals were fixed at 80-character width. It's a little like the story about how the gap between two horses pulling a chariot ended up affecting the design of the space shuttle, because the horse-gap set the wheel spacing, the wheel spacing determined the ruts in the ground and thus the road width, which ended up affecting the dimensions of tunnels in the modern US interstate highway system needed to transport assembled components over long distances. I wonder if they had named it "PDB+" if it would have wider adoption? There seems to be a lot in a name.
The alternative to store the coordinates in a matrix or I'm still, though, not convinced that not having an order for the atoms in the complete system, following the input file, is really an advantage. There's the argument that chains or molecules could be randomly written in the original file, but that is something I've never seen. (on the contrary, the order of the chains, heteroatoms, etc, is usually carefully thought in terms of their importance). Concerning the PDB format, there are objective advantages of having a format for which one can copy and paste chunks of data from one file to the other. (and it worries me a bit the fact that mmCIF can, as far as I understand, have other fields, with other levels of user-defined molecular architecture organization, which probably will appear as the systems become larger and larger - and a long-lasting reading format should probably be able to adapt to that). |
One way to store segment information would be to have a We do currently read all fields from mmCIF files into a |
Yep. There is a table format (expand the "styling plans"), one that exists now and one which seems planned (not sure what that means...). But it clearly isn't universally used. |
Found another detail about reading MD PDB files: in these files usually 4-letter residue names are accepted, using the character in position 21 as the fourth character (which does not have any function in the original PDB format). For instance, this is a very common way to represent water:
where This raises an additional complication because blank characters are currently meaningful as the residue or atom keys (IMO they should be stripped, but I don't know if this adds additional complications because of the variable length of each key). |
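To make the column issue concrete, here is a small sketch that reads the residue name both ways. The `ATOM` line is a made-up example built with `@sprintf` so that the columns are guaranteed to line up, and `strict_resname` and `md_resname` are hypothetical helper names, not functions of any package discussed here.

```julia
using Printf

# Column layout used below:
# record (1-6), serial (7-11), atom name (13-16), alt loc (17),
# residue name (18-20, plus 21 in the MD convention), chain (22), resnum (23-26)
line = @sprintf("%-6s%5d %-4s%1s%-4s%1s%4d%1s   %8.3f%8.3f%8.3f",
                "ATOM", 1, " OH2", " ", "TIP3", "W", 1, " ", 0.0, 0.0, 0.0)

# Strict reading: the residue name occupies columns 18-20 only.
strict_resname(l) = strip(l[18:20])

# Permissive reading: column 21 is taken as a fourth character.
md_resname(l) = strip(l[18:21])

chain_id = line[22]
```

Stripping here is only for display; whether the stored key should keep its padding is the separate question raised above.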
To me that seems like a clear format violation. I did consider whether to strip atom and residue names for storage, but decided against it as there are common cases like |
Yes, it is, but a common one. The question is should this package support that, or do we need another parser?
Ouch. I didn't know about that. I've only ever seen calcium as "CAL". Also, do you think that is standardized?
We could have it behind a flag I guess. Atom and residue names don't seem to be well standardised but if you store the spacing you can at least write it out the same way it was read in. |
Based on discussions here and on Slack a few changes have recently gone in:
Now is a good time to mention any other breaking changes of interest as the next version will be v4 (I will unify the selection functions a bit first). Regarding the heaviness of the package, for me
STRIDE_jll (and DSSP_jll) could be moved to extensions, I'm not sure how conventional that is though. On Julia 1.10.3 Zlib_jll and STRIDE_jll take considerably less time to load, though the overall load time is longer (0.26 s) since SparseArrays is not a weak dependency in Statistics. |
From the discussion here up to now, my impression is that we cannot escape having a new type ( |
The JLL load time is presumably something that should be fixed elsewhere. I wouldn't tie yourself in knots to work around the issue in this package. Nice progress on decreasing load time! |
One solution to the nonstandard PDB files would be to have an additional data type that serves the equivalent job of |
When trying to read this file (and other similar ones), I get:
This is one example VMD pdb file.
vmd.pdb.zip