Insufficient space group descriptions #416
Space group descriptions as merged in PR #405 suffer from a serious drawback: they do not list symmetry operators, and they do not provide for other space group symbols such as H-M. Symmetry operators provide a universal means to specify symmetry. They are easier to parse than Hall symbols, and are guaranteed to be able to convey every possible set of symmetry operators for any space group and setting. The standard is hopelessly incomplete if it does not allow symmetry operators to be specified. The H-M symbols, and the extended H-M symbols, are a human-readable description of the space group, including non-standard settings. They are easier to interpret than Hall symbols (if in doubt, read Hall1981). The Hall symbols, in contrast, depend on ad-hoc definitions to make space group operators derivable from the symbols themselves (does anybody know what 'w' is without looking it up in a table?), and need extensions to specify all settings (if they can do it at all), and the latest extensions seem to be equivalent to just specifying the same symmetry operators (this needs to be checked). Parsing them is complicated (in contrast to just space group operators like '-x,-y,z') and reading them is more difficult for humans, IMHO, than reading H-M symbols. If OPTIMADE is to be kept to the bare minimum, the space group operators (to give operator matrices) and the ITC number (which gives a mathematical group identity up to isomorphism) should be standardised. If Hall symbols are included, then all the other symbols (H-M, extended H-M and possibly Schoenflies) should be included as well. Having just Hall symbols is the worst situation of all – neither easy for humans nor easy to parse (and the ITC number does not help in this case). Since Hall symbols are already standardised, I suggest including also:
|
I am fine with including support for H-M symbols and symmetry operators, and I volunteer to prepare PR(s) whenever the details about them are sorted out. Citing coreCIF:
Do we need to establish our conventions on how to always choose the same H-M symbol for same space group type? Conventions for symmetry operators in OPTIMADE have been discussed in #35, but I do not think consensus has been reached for all the details. |
(I think this discussion belongs in #35. We may want to close this issue as a duplicate and move things there?) I'm in favor of more space group symbols, but I think we need to resolve the fundamental question of how to handle multiple "synonymous" data in some sane way so we don't end up with half of the databases providing H-M symbols but no Hall symbols and vice versa.
My understanding from those who argued for Hall over HM is that for, e.g., spacegroup 68 you have origin choices that are often not distinguished in the HM symbol corresponding to, e.g., Hall symbols Regarding the non-human readable aspect, couldn't a client implementation just encode the following list: http://cci.lbl.gov/sginfo/itvb_2001_table_a1427_hall_symbols.html and auto-translate back-and-forth to HM symbols? |
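As a rough illustration of the lookup-table idea, a client could hold the linked table in memory and translate in both directions. The sketch below is only that: the five rows are a hand-copied excerpt that should be checked against the linked table, and the names `SETTINGS`, `hall_to_hm` and `hm_to_hall` are made up for this example.

```python
# Hypothetical excerpt of a (IT number, H-M symbol, Hall symbol) table.
# A real client would load every row from the linked table instead.
SETTINGS = [
    (1, "P 1", "P 1"),
    (2, "P -1", "-P 1"),
    (4, "P 1 21 1", "P 2yb"),
    (14, "P 1 21/c 1", "-P 2ybc"),
    (225, "F m -3 m", "-F 4 2 3"),
]

HALL_TO_HM = {hall: hm for _, hm, hall in SETTINGS}
HM_TO_HALL = {hm: hall for _, hm, hall in SETTINGS}


def _normalise(symbol):
    """Collapse runs of whitespace so 'P  2yb ' and 'P 2yb' compare equal."""
    return " ".join(symbol.split())


def hall_to_hm(hall):
    """Return the H-M symbol for a Hall symbol, or None if it is not tabulated."""
    return HALL_TO_HM.get(_normalise(hall))


def hm_to_hall(hm):
    """Return the Hall symbol for an H-M symbol, or None if it is not tabulated."""
    return HM_TO_HALL.get(_normalise(hm))
```

Such a table only helps for settings that are actually tabulated; anything outside it still needs the operator list.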
After some offline discussion with @sauliusg and @merkys we came to the conclusion that there does not seem to be a widely accepted notation that would be both human-readable and unambiguous. Therefore, we suggest for now to remove/avoid all space group symbol fields altogether (Hall, HM, etc.) and rather express the symmetry using the space group IT number and a list of symmetry operations. A separate PR (issue?) will be filed on this topic. From what I gathered at the CECAM meeting, space group number is sufficient for most current queries and additional space group information can always be derived from the symmetry operation list if needed. However, maybe there are some actual use cases that I have missed? |
Are you opposed to IT number + an optional field That way I can more easily translate back and forth into Hall and HM symbols (by using that table) than I can via identification of the whole list of symmetry operators. |
I forgot to stress this in the original issue, but I think the crucial thing is, IMHO, to include a full list of symmetry operators for the space group. They alone permit encoding any possible space group or setting and deriving any symbol, are certainly unambiguous, and are in the end the information that is used for symmetry computations. (I've added that update to the original issue, sorry for the confusion). |
I'm quite strongly opposed to this combination – it does not solve the symmetry operator issue, is not sufficient and invents yet another (for what I know, non-standard) way to describe space groups (at least my copy of the ITC vol. A does not mention the notation like '2:c' for the ITC spacegroup number). |
Sure, however, for clients and servers that want to deal with symmetry in their preferred form (e.g., a database only storing the Hall symbol, and a client that always wants to display the H-M symbol to the end user in a UI) it is IMO preferable if OPTIMADE standardizes on a format where connections between the representations can, to as large a degree as possible, be made cheaply on-the-fly for large sets of entries, i.e., preferably via direct lookup tables. While I really like to follow the CIF standard when we can, I'm not overly enthusiastic over the Also, we have to avoid choosing a format that down the line causes a lot of trouble in representing non-3D periodicity, i.e., slabs, wires, and molecules; what we do needs to be extensible in that direction. The same goes for magnetism, etc. |
I think they are ambiguous in the sense that multiple Hall symbols denote exactly the same symmetry operators. Also, the original proposal seems not to be powerful enough to capture all standard space group settings and was extended in multiple ways. I'm now digging a bit into the decoding of the Hall symbols (decode-Hall-symbol, takes time, bear with me...). What I came across, and @BobHanson seems to confirm (please correct me if I am wrong), is the following:
Debated it can be, but for the H-M symbol, 'P 21' means a screw twofold, and this you can not avoid learning if you study crystallography. For Hall notation, it becomes 'P 2yb', and you can only learn what My conclusion: Hall symbols are a nice try and a useful gadget to play with, with certain utility in computer applications, but they are not to replace H-M notation for humans, nor are they suitable as a single replacement for a symmetry operator list. Refs.:
|
If anything is standardised, then
Symmetry operators allow all this and more, e.g. modulated structures. The point is that symmetry operators are just a concise (and rather human-readable ;) ) encoding of symmetry operator matrices. One can decode them by using just a formal grammar (while decoding Hall symbols needs ad-hoc lookup tables). Also, when using them you can be pretty sure that anything you can process using matrix algebra you can also encode as an operator. This is not so obvious for symbolic notation. |
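To make the "decode with a grammar, get a matrix" point concrete, here is a minimal sketch of such a decoder. It is a small regex-based reader written for this illustration, not a grammar proposed for OPTIMADE; the function name `parse_symop` and the accepted spellings are assumptions.

```python
import re
from fractions import Fraction

# One term: optional sign, optional number (integer, fraction or decimal),
# optional '*', optional axis letter. Example terms: 'x', '-z', '+1/2', '0.5'.
_TERM = re.compile(r"([+-]?)(?:(\d+(?:\.\d+)?)(?:/(\d+))?)?\*?([xyz]?)")


def parse_symop(text):
    """Decode e.g. 'x-1/2, y, -z' into (rotation rows, translation vector)."""
    rows, shifts = [], []
    for component in text.lower().replace(" ", "").split(","):
        row, shift, pos = [Fraction(0)] * 3, Fraction(0), 0
        while pos < len(component):
            match = _TERM.match(component, pos)
            if match.end() == pos:  # nothing consumed: unexpected character
                raise ValueError(f"cannot parse {component!r}")
            sign, num, den, axis = match.groups()
            if not num and not axis:  # a lone '+' or '-'
                raise ValueError(f"cannot parse {component!r}")
            value = Fraction(num) if num else Fraction(1)
            if den:
                value /= int(den)
            if sign == "-":
                value = -value
            if axis:
                row["xyz".index(axis)] += value  # coefficient of x, y or z
            else:
                shift += value  # pure translation term
            pos = match.end()
        rows.append(row)
        shifts.append(shift)
    if len(rows) != 3:
        raise ValueError("expected three comma-separated components")
    return rows, shifts
```

Because the result is held as exact fractions, differently spelled strings such as `-x+1/2` and `1/2-x` decode to the same matrix and translation.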
To sum up, I propose the following two options for the upcoming PR:
No symbols (neither Hall nor H-M nor Schönflies) are standardised in variant (1).
Keep open for other notations (orbifold, fibrifold, ... see https://en.wikipedia.org/wiki/Space_group). I am neutral as to whether (1) or (2) is selected, but symops MUST be present in any case. The (1) variant seems less work to do right now, and (2) can be added afterwards. So I would go for (1). |
There is also option 3:
This way the database implementers only have to worry about dealing with a single set of space group symbols and the conversion between different notations is delegated to the clients. Having multiple alternative representations as proposed in (2) seems to have a high risk of leading to internal contradictions. For now, I would also lean more towards option 1, since, as mentioned by @sauliusg, the specification can always be extended in the future. I fully understand the benefits of having a relatively simple space group symbol string that can be easily used in search queries; however, the existing notations do not seem to be sufficient (or sufficiently standardized). @rartino, does approach (1) seem reasonable for now? Alternatively, we could continue the discussion and try to agree on the use of a space group symbol notation that is not exhaustive, but covers most actual use cases. If I am not mistaken, one of the initial ideas that @merkys had was to provide Hall symbols only for entries with symop sets that are explicitly described in ITC tables (this would cover about 99% of the COD). However, it would be useful to first know what the current actual use cases are. |
My position:
- Hall symbols, as extended by Hall in later work (See current ITA Vol B
Appendix 1, attached) are unambiguous and complete. I believe they are
technically canonical, but not perhaps *stringwise* canonical in the sense
that the appended transformation matrix might be described slightly
differently.
- For anyone wanting to convert back from Cartesians to the original
fractional coordinates (alas!), truly the only way to do this easily is
with the Johns-Faithful (x-1/2, y, -z) notation. No, it is not canonical,
but no one would search for this anyway.
- Personally, I would prefer ITA number + H-M + operations. The ITA number
and H-M would both be valuable for searches, and the operations list gives
us what we need to convert back to fractional coordinates and handle the
symmetry properly.
- For 2D structures (slabs, surfaces), it might be nice to have the
symmetry, but honestly I don't know that anyone cares. P1 is probably
expected.
Bottom line for me:
Basically, for structures that are experimental determinations (most of COD
and most of ICSD), it just seems a bit odd to do all this conversion to
Cartesians for delivery only to have to convert back to fractional. I do
appreciate the consistency and accessibility of Cartesians. That is good.
And if there are decent pointers to the actual CIF file, probably it
doesn't matter anyway. Having at least the option to get fractional +
symmetry + space group would be the most sensible. If a database does not
support that, that's fine. Serving up Cartesians with symmetry operations
(necessarily fractional) seems odd to me.
Bob
|
Indeed it is odd; good point!
As I understand, the fractional coordinates were discussed but not yet described. Should we make an issue for this? |
I agree. Does this mean that we lean towards more complete solution (2) (which would also leave the currently merged Hall symbols, which there should be if we add H-M)? |
But what should we do if a client requests both H-M and Hall?
I think resolving contradictions is relatively easy: Mandate (suggest) the search order:
Of course the symbols SHOULD encode non-contradictory information (i.e. symops, ITC number and space group symbols SHOULD all point to the same space group). Seems unambiguous? |
If we explicitly choose a single space group symbol notation that all databases should implement (e.g. H-M) then such a query will simply be an invalid one.
I think that each of these fields serve a slightly different purpose and that we should not mandate any hierarchy/lookup order nor assume anything. From my point of view:
Under this approach H-M and Hall symbol have the same purpose and thus there is no need for the servers to provide them both, but only the one defined in the specification (which we have not chosen yet). A client, however, is free to implement a conversion between this and any other desired space group notation. Or do we want H-M and Hall symbol to serve different purposes? |
This was the first place where we realized that "just adding support also for fractional coordinates" was taking us in a direction where databases could choose to support either one, which would be terrible for clients. Let us just figure out how we best express a "support level" that says: "You may provide fractional coordinates in Does anyone see a drawback with supporting multiple fields with overlapping purposes if we add these kind of "dependencies"? If not, we just need to choose which is the "most" required field - which I think we all agree is some format that clearly can represent any list of symops.
A list of symmetry operations is surely needed, but why "only" the xyz notation? I recall finding it bulky to work with when I implemented xyz <-> matrices. I'm fairly sure I would have preferred Seitz symbols. Are there good arguments against Seitz symbols? The notation seems very standard, more normalized, and more human readable than the xyz-notation?
Note that there are potential issues beyond searching with having a very non-canonical primary format for symmetry information. For example: let's say quite a few databases only provide symmetry info via symops. Now, a client wants to always show an H-M symbol as part of the UI. It would in that case be nice (but, sure, not absolutely crucial) if it was fairly easy (e.g., via a lookup table) to map the list of symops returned by these databases into an H-M symbol. I also foresee finding myself fairly often in the position of asking "are these two entries I got from two different databases describing the same symmetry?", and then having to normalize the symop lists.
I'm pretty sure people who build databases of 2D materials care (I'm part of a collaboration that continues on that linked work...). But, after thinking a bit about the xyz format, I think this is not an issue with that format, right? It can express any symop, it just requires unusual coefficients? |
I agree with this:
- *Symmetry operations (symops)*. The most complete way of describing a
space group with a specific setting and origin. However, it cannot be
queried in a convenient way.
- *IT number*. The least specific way of describing a space group (no
setting, no origin), but still conveys a lot of important information
(e.g., is the molecule a racemate). It is extremely easy to query (single
number in the range of 1--230).
- *H-M/Hall symbol*. Probably falls somewhere in between symops and IT
number in regards to describing the space group. It is much easier to read
and query than symops, but may not convey certain aspects of the space
group (e.g., origin).
- if symops are given, use symops;
- else, if Hall symbol is given, use the Hall symbol to look up or
derive symops;
- else, if universal H-M symbol is given, use that symbol to look up
symops in your tables;
- else, use either H-M or ITC number to look up the space group, assume
standard setting.
I would just say that one should never assume the standard setting.
Bob
|
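The lookup order quoted above, together with the objection that the standard setting should never be assumed, can be sketched as a small selection routine. The field names and the `allow_standard_setting` switch are hypothetical and only serve this illustration; nothing here reflects agreed OPTIMADE property names.

```python
# Hypothetical field names, ordered from most to least specific.
PRECISE_FIELDS = ("symmetry_operations", "hall_symbol", "universal_hm_symbol")
# Fields that identify the space group type but not the setting or origin.
SETTING_FREE_FIELDS = ("hm_symbol", "it_number")


def pick_symmetry_source(entry, allow_standard_setting=False):
    """Return (field, value, standard_setting_assumed) for an entry dict."""
    for field in PRECISE_FIELDS:
        if entry.get(field):
            return field, entry[field], False
    for field in SETTING_FREE_FIELDS:
        if entry.get(field):
            if not allow_standard_setting:
                # The cautious position: refuse rather than guess the setting.
                raise ValueError(f"only {field} given; setting is unknown")
            return field, entry[field], True
    raise LookupError("entry carries no symmetry information")
```

With the default `allow_standard_setting=False` the routine follows the stricter position; switching it on reproduces the last step of the quoted order.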
I do not see it as terrible. Converting fractionals to cartesians is a simple task, something you can implement on a weekend. A lot of code around already does that. So I do not see any problem for clients in converting these two representations. And, as @BobHanson pointed out, working with symmetry is much more convenient in fractionals.
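For reference, the bare coordinate conversion is a single matrix product. The sketch below assumes the lattice is given as three Cartesian row vectors and uses a made-up function name; it deliberately leaves out everything else (symmetry expansion, duplicate removal, molecule reconstruction).

```python
def fractional_to_cartesian(frac, lattice):
    """Convert one fractional position to Cartesian coordinates.

    `lattice` holds the three lattice vectors a, b, c as rows in Cartesian
    coordinates, so the result is u*a + v*b + w*c for frac = (u, v, w).
    """
    return tuple(
        sum(frac[i] * lattice[i][k] for i in range(3))
        for k in range(3)
    )
```

For an orthorhombic cell with edges 4, 5 and 6, the position (1/2, 1/2, 1/4) maps to (2, 2.5, 1.5).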
I am against such "MUST" and singling out Already the current spec, which has only Cartesians, is bad enough. For COD it would have been so much easier to implement OPTIMADE if we could just return the fractional coordinates which we have. Now, essentially, the spec demands that we do an expensive calculation on the server side for the perceived benefit of the client, and in the end it turns out that for some clients this is not just unnecessary but actually counterproductive. To sum up:
Shall we move the further discussion to #206? |
To wrap the things up – do we agree on the following position:
Can we proceed with these ideas? |
If we do not, then having just an ITC number or a short H-M symbol in the response (which is permissible) would not allow computing symmetry-equivalent atoms, which would be perfectly possible, and is indeed intended by specifying default settings in the ITC vol. A. If we do not specify the default setting in the standard (by referring to the ITC vol. A), then the client will have to produce an error in a situation where it could perfectly well continue (which is indeed what most macromolecular crystallography software does); e.g. specifying space group 'P 21' implies 'P 1 21 1'. What is Jmol's behaviour if you get just a short H-M symbol? |
I agree with @sauliusg on this, but the main cost for the COD is not the fractional -> cartesian conversion, but symmetry reconstruction from an asymmetric unit. |
To clarify what @merkys brought up; @sauliusg, @BobHanson when you call for a field for fractional coordinates, do you mean a field to specify the coordinates of atoms in the full unit cell (which is the proposal in #206, and which only differs from cartesian_coordinates by a matrix multiplication with the lattice vectors), or the coordinates in only the asymmetric unit, i.e., what is available in CIF? Since @BobHanson mentioned needing the symops, I think you mean the latter? To add a new field for the asymmetric unit fractional coordinates makes a lot of sense along with all the other symmetry data fields we are discussing here. The reason we don't have it yet is because it would be meaningless without the symops or equivalent. Maybe we should drop #206 entirely and just add this? But I think in that case we need two new fields in the line of asym_fractional_coords and asym_species_at_site?
We've debated this before; so I'll keep it short here; but the aim to design the standardized fields so we avoid the need for clients to explore large amounts of fields that overlap (i.e., express the same data in different ways) for each individual database to find out precisely which ones that specific database supports, and then do all the conversions client-side is a hill I am prepared to die on. That would be the end of interoperability and would make common queries between databases impossible. This position is not the same as saying databases should be forced into expensive server-side conversions, the point is that we must carefully choose our "standard" fields so the necessary server-side conversions are comparatively cheap and straightforward to implement.
I'm not sure about "SHOULD" level for any of the symmetry info. Are we saying databases that do not care about symmetry and just want to give Cartesian coordinates of their, e.g., huge bio-molecules are SHOULD-violating? Also, does anyone have a link to the Johns-Faithful paper, or any other careful specification of the format? I've come up blank in my searches so far. I note that even the CIF definition of _symmetry_equiv_pos_as_xyz doesn't give a careful definition of the format, nor links a definition.
Why a MUST requirement on string query support on an optional field? It isn't something we've done before. It can only be a MUST for equality string match, since a database doing on-the-fly translation from, e.g., Hall symbols cannot efficiently support partial string matching. |
I have been traveling and have lost track of this thread in the flurry of
messages of the last few days.
How about if someone creates a little survey that asks the general
questions that have been discussed and lets us all summarize our positions
on them?
I have some responses below to the most recent questions.
*I would just say that one should never assume the standard setting. If
we do not, then having just an ITC number or a short H-M symbol in the
response (which is permissible) would not allow to compute symmetry
equivalent atoms, which would be perfectly possible, and indeed intended by
specifying default settings in the ITC vol. A.*
BH: Agreed. But I think we had this discussion at the meeting and concluded
that in general, searches for structure are more likely to focus on what
the space group was, not what the precise origin or lattice setting was.
*If we do not specify the default setting in the standard (by referring to
the ITC vol. A), then the client will have to produce an error in a
situation where it could perfectly well continue (and indeed is the
behaviour for most of the macromolecular crystallography software); e.g.
specifying space group 'P 21' implies 'P 1 21 1'.*
BH: I guess that depends upon what the client is designed to do. Defaults
are the devil's handiwork. The default for ITA is "setting two" -- go
figure. Note that PDB REMARK 290 also generally includes operators
explicitly, though I do not think those are required. This is probably an
RCSB normalization. I don't know.
BH: I think the bottom line is that the crystallographic community fully
recognizes the H-M/Hall/ITA settings issue and has solved this ages ago by
the general agreement to provide a full set of operators and (preferably)
also an ITA space group number. Anything in addition to that is gratis and
for human readability only.
*What is Jmol's behaviour if you get just a short H-M symbol ?*
BH: Jmol requires operators. Fortunately, in 15 years of operation, no one
has had a problem with that.
*To clarify what @merkys brought up; @sauliusg, @BobHanson when you call
for a field for fractional coordinates, do you mean a field to specify the
coordinates of atoms in the full unit cell (which is the proposal in #206,
and which only differs from cartesian_coordinates by a matrix
multiplication with the lattice vectors), or the coordinates in only the
asymmetric unit, i.e., what is available in CIF? Since @BobHanson mentioned
needing the symops, I think you mean the latter?*
BH: I was meaning just the asymmetric unit. It would not make sense to me
to use fractional coordinates to refer to a "complete" unit cell, as
"complete" is not well defined. (Do we include all eight atoms if the
fractional unit cell position is {0 0 0}? What about faces and edges? How
close to exactly on a face counts as "on" the face? etc.)
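One possible convention for the questions above can be sketched as follows: apply every operation to an asymmetric-unit site, wrap each image into the half-open interval [0, 1), and keep each distinct position once. This is an illustration under strong assumptions (exact rational coordinates, operations already given as matrix and translation, a made-up function name); real data with floating-point coordinates needs a tolerance, which is exactly the "how close counts as on the face" problem.

```python
from fractions import Fraction


def expand_site(site, symops):
    """Return the distinct images of a fractional site, wrapped into [0, 1).

    `symops` is a list of (rotation, translation) pairs, where rotation is a
    3x3 nested sequence and translation a length-3 sequence.
    """
    images = []
    for rot, trans in symops:
        image = tuple(
            (sum(rot[i][j] * site[j] for j in range(3)) + trans[i]) % 1
            for i in range(3)
        )
        if image not in images:  # exact comparison; fine for Fractions only
            images.append(image)
    return images
```

Under this convention an atom at {0 0 0} is listed once, not eight times, and nothing is ever placed at coordinate 1.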
*But I think in that case we need two new fields in the line of
asym_fractional_coords and asym_species_at_site?*
BH: My first thought was -- unnecessary. A site is a site. But that's an
interesting point. What if someone requests BOTH Cartesians and fractional.
Then what is a site? So, yes, I agree. We would need to distinguish them.
Good point.
*I do not see it as terrible. Converting fractionals to cartesians is a
simple task, something you can implement on a weekend. A lot of code around
already does that. So I do not see any problem for clients in converting
these two representations.*
BH: Well, maybe. You mean if you use a library. I certainly did not write
Jmol's fractional->cartesian code in a weekend (as Jmol does not use
external libraries). And even then, there are MANY nuances here. Avoiding
duplication; identifying faces; distinguishing and reconstructing molecular
systems; various options for what the expected representation is to
include.
*Symmetry operations (symops), in general position coordinates
(Johns-Faithful (x-1/2, y, -z) notation) SHOULD be supported in responses;
an EBNF grammar will be written in the OPTIMADE standard (I can do this);
queries MAY (but do not need to) be supported on this field;*
BH: I am not in favor of SHOULD here. I am not even sure I am in favor of
MAY. My overall concern is that the original format of the data (fractional
in the case of experimental XRD) may be lost in the current format, and it
cannot easily be recovered. If Cartesians are the common ground, that is
fine, in my opinion. But then if there is a CIF origin with fractional
coordinates, I would like to know about that so that I can discard or avoid
the cartesians and use those fractional coordinates that I would find in
the CIF link directly -- use OPTIMADE to find structures, but go back to
COD, for instance, for the actual position data I need.
*Also, does anyone have a link to the Johns-Faithful paper, or any other
careful specification of the format? I've come up blank in my searches so
far. I note that even the CIF definition of _symmetry_equiv_pos_as_xyz
doesn't give a careful definition of the format, nor links a definition.*
BH: I do not. All I know is that one has to be ready for +1/2-x or -x+1/2
or 1/2-x or 1/2 - x. Maybe even 0.5-x (though IMHO use of decimal numbers
is inappropriate).
*I'm aware of the investigation of @sauliusg and @BobHanson that concluded
that there are origin choices that cannot be represented as Hall symbols,
but that is not ambiguity.)*
BH: Actually, we concluded (or at least I did) that the extended Hall
system in ITA can, in fact, represent any space group setting. That was the
reason they did the extension. Same for H-M and "universal" H-M.
*I think they are ambiguous in the sense that multiple Hall symbols denote
exactly the same symmetry operators. *
BH: I believe you mean "not canonical"; "ambiguous" would mean that a
given Hall symbol could identify more than one space group setting. This is
not the case.
* My conclusion: Hall symbols are a nice try and a useful gadget to play
with, with certain utility in computer applications, but they are not to
replace H-M notation for humans.*
BH: My conclusion would be that "extended" Hall symbols and probably
"universal" H-M as well are sufficient for defining a space group in lieu
of a full operator list, but the reason all standard CIF files list the
operators explicitly is that no such abbreviated system is really
sufficient to be SURE that one has the right operators. Too many
opportunities for errors in transcription or translation.
Bob
|
This is something I have been thinking of for quite some time now, and is applicable for many other issues/PRs. I will look into technical means. |
So, is this where we are now?: Proposed symmetry-related fields in structure entries
|
My understanding is the following: yes it is, because no matter how you choose an origin and (affine) coordinate axes, there always exists a change-of-basis matrix that transforms your point coordinates from a "standard" setting to this new coordinate system, and by multiplying a "standard" symmetry operator by the change-of-basis matrix you will get a symmetry operator expression in this new basis. However, if you want simple symmetry operator expressions with rational coefficients, not all axes and not all origins are suitable. For example, if you describe 20 degree rotation in Cartesian frame, you will have an irrational This brings us to the comment of @BobHanson :
If we want to accommodate any origin with any precision, we need to allow arbitrary floating point numbers, e.g. The tables [1] say:
(emphasis on real is mine). Thus, the Tables seem to suppose that arbitrary floating point (aka real) numbers can be used. However, if OPTIMADE sticks to "permissible" origins, the ones listed in the Tables, then we can get away with standardising the symmetry operation strings so that only rational numbers are permitted. We can always extend later if needed. What is the general consensus: do we need general real translations (to specify any origin) or are we happy with only origins that can be expressed using rational translations? As far as I know, all crystallographic varieties are expressible in rational translations. |
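The effect of an origin choice on the operator strings can be worked out directly. If the new origin sits at position p in the old coordinates (so x' = x - p), an operation x -> R x + t becomes x' -> R x' + t + (R - I) p. The sketch below applies that formula and reduces the translation modulo 1; the function name and the sign convention for the shift are assumptions of this example.

```python
from fractions import Fraction


def shift_origin(rot, trans, shift):
    """Re-express the operation (rot, trans) after moving the origin to `shift`.

    Uses t' = t + (R - I) p, reduced modulo 1; the rotation part is unchanged.
    """
    new_trans = tuple(
        (trans[i] + sum(rot[i][j] * shift[j] for j in range(3)) - shift[i]) % 1
        for i in range(3)
    )
    return rot, new_trans
```

Moving the origin by (1/4, 0, 0) turns the inversion -x,-y,-z into -x+1/2,-y,-z, which is still rational. An arbitrary real shift gives an equally arbitrary real translation, which is the case that rational-only operator strings could not express.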
I see - smart way of looking at it, thanks for explaining.
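From the experimental side it probably seems silly to cater for these arbitrary origin choices. Nevertheless, structures generated by random assignments of coordinates (e.g., as done by Chris Pickard and others) and ML generated structures can easily end up arbitrarily translated. It would be nice to be able to report symmetry information for these structures (e.g., to make them searchable by space group number) without being forced to shift them. But, this takes me to a, perhaps subtle, question: I assume we mean to allow "under-reporting" symmetry. I.e., it is ok to miss symmetry operators in symmetry_operations as long as the reported operators, ITN, H-M, and Hall are all consistent, and when replicating the atoms in asym_fractional_coords using the operations in symmetry_operations one gets the same thing as in cartesian_coords. However, do we really mean to make it a MUST-level violation to "miss" some symmetry operations in symmetry_operations that would be possible given the ITN specified in space_group_it_number? Our discussion above seems to say this should be a violation (because otherwise symmetry_operations could be ambiguous), but especially when thinking of the "weird" symmetry operations that will be needed for arbitrary origins, I'm worried this will end up overly stringent. At the very least, maybe it would make sense to allow specifically giving only the ITN without the symmetry_operations, nor any of the other symmetry fields? |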
On Thu, Jun 30, 2022 at 10:36 AM Rickard Armiento ***@***.***> wrote:
do we need general real translations (to specify any origin) or are we
happy with only origins that can be expressed using rational translations?
For what I know, all crystallographic varieties are expressible in rational
translations.
From the experimental side it probably seems silly to cater for these
arbitrary origin choices. Nevertheless, structures generated by random
assignments of coordinates (e.g., as done by Chris Pickard and others) and
ML generated structures can easily end up arbitrarily translated. It would
be nice to be able to report symmetry information for these structures
(e.g., to make them searchable by space group number) without being forced
to shift them.
Interesting point. Good to have this perspective.
But, this takes me to a, perhaps subtle, question:
I assume we mean to allow "under-reporting" symmetry. I.e., it is ok to
miss symmetry operators in symmetry_operations as long as the reported
operators, ITN, H-M, and Hall are all consistent, and when replicating the
atoms in asym_fractional_coords using the operations in
symmetry_operations one gets the same thing as in cartesian_coords.
OOh. I would DEFINITELY assume we mean MUST be the complete list of
operators. From the CIF standard:
When a list of symmetry operations is given, it must contain
a *complete set* of coordinate representatives which generates
all the operations of the space group by the addition of
all primitive translations of the space group. *Such
representatives are to be found as the coordinates of
the general-equivalent position in International Tables for
Crystallography Vol. A (2002), to which it is necessary to
add any centring translations shown above the
general-equivalent position.*
[
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Ispace_group_symop_operation_xyz.html
]
I would stick with this requirement as the standard way of listing
operations. An alternative might be to list only the minimum number of
operations that, when mixed fully with each other, might be used to
*generate* all the operations associated with the general positions. For
example, [1] (which is from a paper on extending Hall symbols to magnetic
space groups [2]) lists just the minimum set of "generator" operations
necessary to generate all symmetry operations -- for example, just one of
the two C3 rotation operations. But I am not a big fan of that idea.
The thing is, of course, that with cubic groups one can have up to 192
operations. (#225 Fm-3m, http://img.chem.ucl.ac.uk/sgp/large/225az1.htm)
So, sure, that's a lot of seemingly redundant information. It's a trade
off.
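To illustrate the generator alternative mentioned above, the full operator list can be recovered from a few generators by repeatedly composing them until nothing new appears, with translations reduced modulo 1. This is a sketch with made-up function names; it assumes exact rational translations and uses the 192-operation figure quoted above only as a safety limit.

```python
from fractions import Fraction


def compose(a, b):
    """Return the operation 'apply b, then a', translation reduced modulo 1."""
    (ra, ta), (rb, tb) = a, b
    rot = tuple(
        tuple(sum(ra[i][k] * rb[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )
    trans = tuple(
        (sum(ra[i][k] * tb[k] for k in range(3)) + ta[i]) % 1
        for i in range(3)
    )
    return rot, trans


def generate_group(generators, limit=192):
    """Close a set of (rotation, translation) generators under composition."""
    identity = (((1, 0, 0), (0, 1, 0), (0, 0, 1)), (Fraction(0),) * 3)
    ops, frontier = {identity}, [identity]
    while frontier:
        op = frontier.pop()
        for gen in generators:
            new = compose(gen, op)
            if new not in ops:
                if len(ops) >= limit:
                    raise ValueError("more operations than expected")
                ops.add(new)
                frontier.append(new)
    return ops
```

For example, the screw axis -x,y+1/2,-z+1/2 and the inversion -x,-y,-z generate the four operations of P 1 21/c 1, including the glide x,-y+1/2,z+1/2. The cost is that every consumer has to run such a closure before the list is usable.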
However, do we really mean to make it a MUST-level violation to "miss"
some symmetry operations in symmetry_operations that would be possible
given the ITN specified in space_group_it_number?
Again - the ITA number is not *generally* definitive. That's why the CIF
standard requires ALL operators, not even just some arbitrary set of
"generators" (such as those listed in [1]).
Our discussion above seems to say this should be a violation (because
otherwise symmetry_operations could be ambiguous), but especially when
thinking of the "weird" symmetry operations that will be needed for
arbitrary origins, I'm worried this will end up overly stringent. At the
very least, maybe it would make sense to allow specifically giving *only*
the ITN without the symmetry_operations, nor any of the other symmetry
fields?
It would be fine by me if only the ITN were allowed to be given, but then
(a) no fractional coordinates should be given (as they would not be
actionable), and (b) recognizing that the ITN then would be only generally
useful in the context of searching, not in the context of using (F not R in
FAIR).
Bob
|
I am a bit in over my head lately, and this issue looks quite involved. It would be better if someone else steps up and drafts the PR. Remaining issues could be discussed per-point on the PR - I like PRs on GitHub better for that, as an issue is just a linear stream of messages and PR lets splitting off discussions per topic (=line of text). |
Indeed - the reason I think it is acceptable is that only giving the ITN is "obviously" ambiguous about the origin, so this ambiguity is less confusing than for, e.g., the usual H-M symbol. But we must indeed not allow giving asym_fractional_coords without a complete list of the symmetry operations.
I don't think I was clear enough on what I think we probably need to allow. Lets look at regular NaCl in the conventional cell:
I'd argue case (2) fulfills the CIF requirement, because all the information is still consistent and gives a complete representation of the atomic sites, it is just under-reporting the possible symmetry compared to case (1). I've seen many CIF files do this. Do we mean for case (2) to be a violation of the OPTIMADE standard? I think there are good arguments for that this must be allowed. |
Totally happy with P1 and four atoms there. Lots of programs would use P1
for calculations I think. We cannot control choice of space group. As long
as the file is self consistent. Likewise Na or Cl at the origin.
…On Fri, Jul 1, 2022, 9:48 AM Rickard Armiento ***@***.***> wrote:
It would be fine by me if only the ITN were allowed to be given, but then
(a) no fractional coordinates should be given (as they would not be
actionable), and (b) recognizing that the ITN then would be only generally
useful in the context of searching, not in the context of using (F not R in
FAIR).
Indeed - the reason I think it is acceptable is that *only* giving the
ITN is "obviously" ambiguous about the origin, so this ambiguity is less
confusing than for, e.g., the usual H-M symbol. But we must indeed not
allow giving asym_fractional_coords without a complete list of the
symmetry operations.
I assume we mean to allow "under-reporting" symmetry
OOh. I would DEFINITELY assume we mean MUST be the complete list of
operators. From the CIF standard: When a list of symmetry operations is
given, it must contain a *complete set* of coordinate representatives
which generates all the operations of the space group by the addition of
all primitive translations of the space group. [...]
I don't think I was clear enough on what I think we probably need to
allow. Let's look at regular NaCl in the conventional cell:
1.
The full symmetry info would be: ITN=225, H-M=Fm-3m, have
symmetry_operations list the 192 symmetry operations, and let
asym_fractional_coords be a list of 2 coordinates, one for Na and one
for Cl (which sits on an 'a' and 'b' Wyckoff position respectively).
2.
However, what if I instead say: ITN=1, H-M=P1, I only let
symmetry_operations list the identity operation, and let
asym_fractional_coords be a list of four coordinates for Na, and four
coordinates for Cl.
I'd argue case (2) fulfills the CIF requirement, because all the
information is still consistent and gives a complete representation of the
atomic sites, it is just under-reporting the possible symmetry compared to
case (1). I've seen many CIF files do this.
Do we mean for case (2) to be a violation of the OPTIMADE standard? I
think there are good arguments for that this must be allowed.
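
The two NaCl descriptions can be compared with a short sketch. It is illustrative only: instead of the 192 operations of Fm-3m it uses just the four face-centring translations, which is enough here because Na at (0,0,0) and Cl at (1/2,1/2,1/2) sit on special positions of multiplicity 4. The names `F_CENTRING`, `expand`, `case_1` and `case_2` are made up for this example and are not OPTIMADE fields.

```python
from fractions import Fraction as F

# The four face-centring translations of an F lattice. This is only a subgroup
# of the 192 operations of Fm-3m, chosen here to keep the sketch short.
F_CENTRING = [(F(0), F(0), F(0)), (F(0), F(1, 2), F(1, 2)),
              (F(1, 2), F(0), F(1, 2)), (F(1, 2), F(1, 2), F(0))]
IDENTITY = [(F(0), F(0), F(0))]  # the only operation of P1

def expand(sites, translations):
    """Apply every translation to every (species, xyz) site and wrap into [0, 1)."""
    return {(species, tuple((c + t) % 1 for c, t in zip(xyz, shift)))
            for species, xyz in sites for shift in translations}

# Case (1): two sites in the asymmetric unit, symmetry does the rest.
case_1 = [("Na", (F(0), F(0), F(0))), ("Cl", (F(1, 2), F(1, 2), F(1, 2)))]

# Case (2): P1, all eight sites of the conventional cell listed explicitly.
case_2 = [("Na", (F(0), F(0), F(0))), ("Na", (F(0), F(1, 2), F(1, 2))),
          ("Na", (F(1, 2), F(0), F(1, 2))), ("Na", (F(1, 2), F(1, 2), F(0))),
          ("Cl", (F(1, 2), F(1, 2), F(1, 2))), ("Cl", (F(1, 2), F(0), F(0))),
          ("Cl", (F(0), F(1, 2), F(0))), ("Cl", (F(0), F(0), F(1, 2)))]

same_sites = expand(case_1, F_CENTRING) == expand(case_2, IDENTITY)
print(same_sites)
```

Both descriptions expand to the same eight sites. They differ only in how much of the symmetry is reported, which is the point under discussion.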
|
On Thu, Jun 16, 2022 at 2:44 AM Rickard Armiento ***@***.***> wrote:
Basically, for structures that are experimental determinations (most of
COD and most of ICSD), it just seems a bit odd to do all this conversion to
Cartesians for delivery only to have to convert back to fractional.
There is already a PR #206 for
introduction of fractional coordinates.
This was the first place where we realized that "just adding support also
for fractional coordinates" was taking us in a direction where databases
could choose to support either one, which would be terrible for clients.
Let us just figure out how we best express a "support level" that says:
"You may provide fractional coordinates in fractional_site_positions,
however, if you do, you MUST also support cartesian_site_positions." Then
we can use the same mechanism for, e.g., allowing the Hall symbol but in
that case force symops to be given.
Yes, I understand, and I am completely in agreement with that use of MAY and
MUST. And I would add:
If fractional_site_positions are available, then
symmetry_operators_xyz MUST also be available and must return at the very
least the identity operator "x,y,z".
The point here is that when people return fractional coordinates, they are
going to naturally return only the asymmetric unit, not the full set of
operator-derived positions. And if they have fractional coordinates, one
would reasonably expect that they would also have easy access to the
operator list.
Does anyone see a drawback with supporting multiple fields with
overlapping purposes if we add these kind of "dependencies"?
If not, we just need to choose which is the "most" required field - which
I think we all agree is *some* format that clearly can represent *any*
list of symops.
- For anyone wanting to convert back from Cartesians to the original
fractional coordinates (alas!), truly the only way to do this easily is
with the Jones-Faithful (x-1/2, y, -z) notation. No, it is not canonical,
but no one would search for this anyway.
A list of symmetry operations is surely needed, but why "only" the xyz
notation? I recall finding it bulky to work with when I implemented xyz <->
matrices. I'm fairly sure I would have preferred
[Seitz symbols](https://www.cryst.ehu.es/html/cryst/help_pop-up/seitz_symbol.html).
Frankly, I have no idea how to handle Seitz symbols. Parsing Jones-Faithful
notation is straightforward and results in a simple 4x4 integer matrix in
all cases (other than magnetic space groups)
Are there good arguments against Seitz symbols? The notation seems very
standard, more normalized, and more human readable than the xyz-notation?
No objection other than that I have never seen them and don't know how to
convert them to matrices.
I think the only real reason for using Seitz notation is that it is humanly
easier to see that we are talking about a rotation of some sort rather than
a reflection. But from a machine-readable sense, surely the x,y,z notation
is preferable, as it immediately translates into a canonical 4x4 integer
matrix.
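
As a rough illustration of the x,y,z-to-matrix step, here is a minimal sketch. It assumes three comma-separated components with terms in x, y, z and rational constants. Entries are kept as `Fraction` so that translations such as 1/2 stay exact. The name `parse_symop` and the accepted grammar are assumptions for this example, not anything defined by OPTIMADE, CIF or Jmol.

```python
import re
from fractions import Fraction

# One term: optional sign, optional rational coefficient, optional axis letter.
TERM = re.compile(r"([+-]?)(\d+/\d+|\d+(?:\.\d+)?)?\*?([xyz])?")

def parse_symop(text):
    """Return a 4x4 matrix (lists of Fraction) for a string such as 'x-1/2,y,-z'."""
    parts = text.lower().replace(" ", "").split(",")
    if len(parts) != 3:
        raise ValueError(f"expected three components in {text!r}")
    matrix = []
    for part in parts:
        if not part:
            raise ValueError(f"empty component in {text!r}")
        row = [Fraction(0)] * 4  # coefficients of x, y, z, then the translation
        pos = 0
        while pos < len(part):
            m = TERM.match(part, pos)
            sign, number, axis = m.groups()
            # Reject a bare sign, and require a sign between consecutive terms.
            if not (number or axis) or (pos > 0 and not sign):
                raise ValueError(f"cannot parse {part!r}")
            value = Fraction(number) if number else Fraction(1)
            if sign == "-":
                value = -value
            row["xyz".index(axis) if axis else 3] += value
            pos = m.end()
        matrix.append(row)
    matrix.append([Fraction(0)] * 3 + [Fraction(1)])  # homogeneous bottom row
    return matrix

for row in parse_symop("x-1/2, y, -z"):
    print([str(v) for v in row])
```

The same function accepts the spelling variants that come up in this thread, such as `+1/2-x`, `-x+1/2`, `1/2 - x` and `0.5-x`, and maps them all to the same row.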
- Personally, I would prefer ITA number + H-M + operations. The ITA
number and H-M would both be valuable for searches, and the operations list
gives us what we need to convert back to fractional coordinates and handle
the symmetry properly.
Note that there are potential issues beyond searching with having a very
non-canonical primary format for symmetry information. For example: let's
say quite a few databases only provide symmetry info via symops. Now, a
client wants to always show an H-M symbol as part of the UI. It would in
that case be *nice* (but, sure, not absolutely crucial) if it was
fairly easy (e.g., via a lookup table) to map the list of symops returned
by these databases into an H-M symbol.
I would say that the design of that UI is flawed. Even CIF does not require
an H-M symbol, and without an ITA number and just having operators, it is a
VERY complex calculation to determine the space group, particularly for
nonstandard space group settings. There are services that do this (for
example,
https://www.cryst.ehu.es/cgi-bin/cryst/programs/checkgr.pl?tipog=gesp) but
I certainly do not know how to do it, and I would not know where to even
start to duplicate that service.
I also foresee finding myself fairly often in the position of asking "are
these two entries I got from two different databases describing the same
symmetry?", and then having to normalize the symop lists.
This is a well known and very difficult problem. Not sure how you would
"normalize" the symop lists, but I think you are really asking, "are these
two structures the same but just described differently?" See
https://www.cryst.ehu.es/cryst/compstru.html -- and even here the
structures have to first be put into standard settings, I think. I would
say this problem is out of scope. Same problem with any database (even
within a single one!) delivering multiple (potentially similar) structures.
- For 2D structures (slabs, surfaces), it might be nice to have the
symmetry, but honestly I don't know that anyone cares. P1 is probably
expected.
I'm pretty sure people who build
[databases of 2D materials](https://www.nature.com/articles/s41524-022-00730-w)
care (I'm part of a collaboration that continues on that linked
work...).
But, after thinking a bit about the xyz format, I think this is not an issue
with that format - right? It can express any symop, it just requires
unusual coefficients?
I was speaking out of my lane, there. I guess when you have symmetry in a
3D structure and you cleave along an arbitrary Miller plane, there must be
some way of determining what symmetry remains based on the new periodic u,v
axis system. I just have never had to deal with 2D space groups to date
other than to create them by cleaving along Miller planes and assigning the
symmetry to be P1.
Bob
|
[I've agreed implicitly 100% with every comment prior to this -- one mind
with Saulius]
On Thu, Jun 30, 2022 at 8:30 AM Saulius Gražulis ***@***.***> wrote:
BH: I do not. All I know is that one has to be ready for +1/2-x or
-x+1/2
or 1/2-x or 1/2 - x. Maybe even 0.5-x (though IMHO use of decimal numbers
is inappropriate).
If we want to accommodate *any* origin with any precision, we need to
allow arbitrary floating point numbers, e.g. X,Y,Z+3.1415926E-01.
The tables [1] say:
The change-of-basis operator V has the general form (v_x, v_y, v_z).
The vectors v_x, v_y and v_z are specified by
[image: image]
<https://user-images.githubusercontent.com/10668420/176688688-893ded1d-05f2-4c08-a102-f17131c3d326.png>
where r_{i,j} and t_i are fractions or *real* numbers.
OK, agreed. However, by "real" numbers they also mean *exact* numbers,
almost certainly. So, for example, 0.5, 0.25, but not 0.33 (where that is
somehow referring to 1/3). I think we can dispense with a bizarre case such
as +3.1415926E-01. I'm pretty sure the writers of ITA were not considering
such a representation as allowable, and I certainly am not going to
reprogram Jmol to allow for such! :)
What is the general consensus, do we need general real translations (to
specify any origin) or are we happy with only origins that can be
expressed using rational translations? As far as I know, all
crystallographic varieties are expressible in rational translations.
My opinion: Don't go there. "rational" means "can be expressed exactly by a
fraction involving two integers." Seeing as one cannot express a
non-rational real number in simple standard ASCII formats, it occurs to me
that the writers of ITA probably meant "rational" not "real" there. Thus,
3.1415926E-01 is 31415926/100000000 exactly. It would be appropriate to say
that Jones Faithful values SHOULD be expressed using explicit fractions
rather than decimal or scientific notation.
*Bob*
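
The point about decimal literals being exact rationals can be checked with Python's standard `fractions` module. This is only a small sketch. The variable names and the denominator bound of 12 are arbitrary choices made for the example.

```python
from fractions import Fraction

# A decimal or scientific-notation literal is itself an exact rational number.
pi_tenth = Fraction("3.1415926E-01")
print(pi_tenth == Fraction(31415926, 100000000))

# "0.3333" is exact too, but it is not the same number as 1/3.
third_ish = Fraction("0.3333")
print(third_ish == Fraction(1, 3))

# A client that wants 1/3 back has to guess, e.g. by bounding the denominator.
guessed = third_ish.limit_denominator(12)
print(guessed)
```

The guessing step is why explicit fractions are the safer way to write the translations: `1/3` needs no interpretation, while `0.3333` does.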
|
Liking what I see here; agreeing with Saulius. "rational numbers" meaning
fractions, I think. Or do you want to allow 0.5, but not 0.3333 (which is
also rational, but inexact)?
Bob
|
To summarize the above, I think we are ready for a PR here (which I have been too busy to write up myself; anyone is welcome to do it). But, in particular, please feel free to help clarify the two HM definitions. Symmetry-related fields in structure entries*
I think we have aligned on allowing all degrees of freedom in the xyz notation, with rational coefficients. Still, just to reply to these comments:
If one normalizes the degrees of freedom in the x,y,z notation (fix the order of operations, order of terms, sign of coefficients etc.), then each standard setting is identified by a unique set of symmetry operations. I.e., one can identify, e.g., the Hall symbol for the standard settings via a lookup table. At least I've seen this work in practice, but perhaps there are limitations for corner cases (?) (but in the context of an UI, it would just not show a H-M symbol if it couldn't make the identification.)
No, I did not mean the generalized problem of "are these structures 'physically' 'the same'" - indeed a very difficult problem. I really just mean: I have two OPTIMADE structures. I ask myself "are these sets of symmetry operations they specify representing the exact same symmetry?" But, this was not the most important point - if I care about this (and in this discussion it is so far just me) I will have to implement the canonicalization on the client side instead, which is fine. |
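
A client-side canonicalization of the kind mentioned above might look like the following sketch. It assumes both lists are complete, use the same setting and origin, and contain only terms in x, y, z with rational constants. It performs no input validation. `canonical_op` and `same_symmetry` are hypothetical names, not part of any OPTIMADE client library.

```python
import re
from fractions import Fraction

# Optional sign, optional rational coefficient, optional axis letter.
TERM = re.compile(r"([+-]?)(\d+(?:/\d+)?)?([xyz])?")

def canonical_op(symop):
    """Map one operation to a hashable key: three rows of (rx, ry, rz, t mod 1)."""
    rows = []
    for part in symop.lower().replace(" ", "").split(","):
        row = [Fraction(0)] * 4
        for sign, number, axis in TERM.findall(part):
            if not (number or axis):
                continue  # findall also yields empty matches; ignore them
            value = Fraction(number or 1) * (-1 if sign == "-" else 1)
            row["xyz".index(axis) if axis else 3] += value
        row[3] %= 1  # translations differing by whole cells are equivalent
        rows.append(tuple(row))
    return tuple(rows)

def same_symmetry(ops_a, ops_b):
    """Compare two operation lists regardless of ordering and spelling."""
    return {canonical_op(o) for o in ops_a} == {canonical_op(o) for o in ops_b}

p31_a = ["x,y,z", "-y,x-y,z+1/3", "-x+y,-x,z+2/3"]
p31_b = ["-X+Y, -X, Z-1/3", "+x,+y,+z", "-y,-y+x,1/3+z"]
print(same_symmetry(p31_a, p31_b))
```

Two lists that describe the same group in different settings or with different origins will still compare as different here. Handling that is the much harder problem discussed earlier in the thread.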
A small update based on today's web meeting. This is still partially blocked on the precise format for I did argue the benefits of alternative notations simpler to parse above. Nevertheless, if the only two options are:
I think I still come down on the side of the first one for the sake of compactness of notation. (On the other hand, a more compact but still direct representation of the matrices I could get behind, but perhaps let's not open that discussion again.) |
Not sure what I can contribute since I was not in the discussion, but since
I was CC'd here, I'll suggest:
- Matrices are fine, but realize that most of the space groups of interest
to materials work (in my limited experience) are cubic and have 48, 96, or
192 operations. That's a lot of matrix parsing of information that may or
may not be of interest. Strings are easier to ignore.
- Jones-Faithful is an exact description, while matrices are numerical
approximations.
- Q: What's the plan for magnetic space groups? (Does COD have any examples
of these?
http://webbdcrista1.ehu.es/magndata/ click [search]) Easy extension of
Jones-Faithful. Doesn't really fit the matrix mold.
Bob
|
(@BobHanson just to clarify: I post my comments in the GitHub issue/PR system. The emails you get are due to your GitHub settings to be notified about activity in threads you have participated in.) Thanks for the link on magnetism in cif. If I understand this correctly, the point is that one can add more comma separated items in the Jones-Faithful format to describe, e.g., a spin transformation for the operation (or really any linear transformation of a vector valued property on the sites). I assume we (in similarity with mcif) would add support for such symmetry operations by standardizing additional fields, e.g., I think the same extension is possible in mostly the same way in a matrix version of operations. But, to reiterate: I'd much prefer a reasonably compact format which a list of, e.g., 192 JSON-encoded matrices is not. So, unless someone comes forward with a more compact (while also exact) JSON-friendly matrix representation, I vote to PR the Jones-Faithful format. |
Thanks, Bob, for pointing out the potential problem with magnetic structures. Indeed, the largest number of symops in "classical" space groups would be 192 (groups No 225–228), but magnetic structures will add more. Also, modulated structures will add more operators (up to several thousands, see Stokes, 2011) with higher dimensionality (up to 6+1 dim.); quasicrystals can add at least the same, possibly even more. I think we need to be prepared for this. While any s.g. can be represented by matrices, the J-F notation is more compact, and, unlike space group symbols, is straightforward to interpret. But the classical "-X+Y+1/2,-X+1/2,Z" notation will need a moderately complicated grammar and an ad-hoc parser. I therefore feel that some pre-parsed form of J-F notation could be optimal for OPTIMADE. The idea is to have the symmetry operator string split into distinct grammatical components and presented as elements of the JSON array:
for P-1:
For P31 ("x,y,z", "-y,x-y,z+1/3", "-x+y,-x,z+2/3"):
The remaining tokens ("-x", "2/3") will be defined using regular expressions and can be easily parsed by regexp matching. From these, the matrices are easily built. The "+" symbol between the operator components in the JSON array is implied. The symop lists can be either transmitted within the response, or stored on a remote server and only an href link to that list transmitted with each structure, to minimise traffic. The client in this case will have a choice to either fetch the symop list (in JSON encoding) from the server, to use a cached value for the given space group and setting (the symops shouldn't ever change ;), or to decode the space group symbol from the href itself. The only drawback that I see in such representation is that it is unusual, which is cured by just starting to use it. The advantages would be:
What is your take on that? (@BobHanson , @rartino , @merkys , @vaitkus)? P.S. This reminds me of the LISP S-expressions... :) |
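
The pre-parsed form proposed above can be sketched as follows. The original JSON examples are not preserved in this thread, so the nesting shown here follows the form quoted in a later comment, `[["-y"], ["x", "-y"], ["z", "1/3"]]`. The token pattern and the function name `preparse` are assumptions made for illustration.

```python
import json
import re

# Hypothetical token pattern: a signed axis letter or a signed integer/fraction.
TOKEN = re.compile(r"[+-]?(?:[xyz]|\d+(?:/\d+)?)")

def preparse(symop):
    """Split an 'x,y,z'-style string into one token list per coordinate."""
    result = []
    for part in symop.lower().replace(" ", "").split(","):
        tokens = TOKEN.findall(part)
        # The tokens must cover the whole component, otherwise it is malformed.
        if not tokens or "".join(tokens) != part:
            raise ValueError(f"cannot tokenise {part!r}")
        # The "+" between tokens is implied by the array, so drop it.
        result.append([t.lstrip("+") for t in tokens])
    return result

p31 = ["x,y,z", "-y,x-y,z+1/3", "-x+y,-x,z+2/3"]
encoded = [preparse(op) for op in p31]
print(json.dumps(encoded))
```

Each inner token is then either a signed axis letter or a signed fraction, so a client can classify it with one regular-expression match per token.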
I would regard missing symmetry operators as an error, and a rather serious one. It is one thing to check the operators for a group, and quite another (more complicated) thing to reconstruct the group from the operators. Also, you can end up getting a subgroup of the original group if you throw out too many operators. Too bad. I think on this point I completely support what @BobHanson said in this thread. |
I like the recent @sauliusg proposal. It stands somewhere between full-matrix and J-F representations which each have their own drawbacks. The proposed representation is quite concise and not too difficult to parse. |
I agree that the case (2) should not violate the OPTIMADE standard, but it is reporting a different (lower symmetry) structure than the case (1), with all the consequences – more independent parameters, highly correlated parameters. |
Not convinced. How is defining a standard for [["-y"], ["x", "-y"], ["z", "1/3"]] any different from requiring a specific syntax for: "-y,x-y,z+1/3" ? There is no more or less information there, only punctuation. I'm pretty sure it would take fewer words in a standard to describe how to create well-formed JF strings than what it would take to describe an entirely new format. As for the full symmetry/P1 issue, my understanding is that plenty of computational packages just go with P1 for their calculations and don't bother with symmetry constraints (particularly if they are single-point calculations). So I would guess plenty of structures would be described as "P1" that certainly could have more symmetry. |
From the discussion today I've got impression that everyone is OK to go forward with this suggestion; I'll put it into the PR and upload. |
The bracketed notation, In contrast, the The structured string notation is not much shorter than a value array, so value array seems a good compromise IMHO since we re-use existing grammar(s) (of JSON, CIF2, XML or whatever carrier we use for the response). PS Possible (but still simplified) Regexp for the symmetry operation grammar could be:
Tested as:
This is already probably too complex for regexp, and in reality will become even longer if we want to capture cases like '1/2-x' and exclude cases like 'x+5/2' (we will basically have to list all allowed fractions, I guess). |
I experience the opposite: if we want to describe the full J-F notation, we will need a full-fledged grammar in EBNF, or a longish RE, to capture all permissible expressions and to block all unwanted expressions; and then come the superspace groups and magnetic groups with their complications. This is a long description, needs testing and a special parser for decent implementation. In contrast, if we go for JSON then we can simply say: "symmetry description MUST be an array with the elements satisfying the following constraints", then list the regexps that the elements MUST satisfy, and then explain the semantics, and we are done. Standard users that implement a client just need to rely on the parsed JSON (or whatever carrier format it is) and check the regexp matches using the regexp engine of their implementation platform (all platforms have the regexp subset that we will use). PS As I was mentioning, we do not want to be bound to JSON, but other carrier formats allow the same: CIF2 has arrays a-la JSON, and XML has nested elements, e.g.:
and so on... |
This might be true now, but we want to represent experimental crystallographic data, computational descriptions that are in the same setting as experimental data (e.g. for comparison), and calculations that do take symmetry into account, don't we? |
I had an offline discussion with @sauliusg, he is OK with not including the Schönflies symbol in this PR. I will close this issue now, feel free to re-open it if you see anything else that has not been properly addressed. |