-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New representation of atoms at unknown positions #279
New representation of atoms at unknown positions #279
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest a slight change in the wording, but otherwise it seems fine to me.
Thanks for the review @sauliusg, I've made the modifications suggested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the text one more today, it seem clear to me, so if other agree let's merge.
Shall we put an attached atom example (-CH3, -CH2-) in a separate PR, or is it already in? (I did not find it, but we can make it later).
Oh, I think I misunderstood you. I added the "example" of a "-CH3" to the description of species
But you mean to add an example to the list of species examples, right? Lets do that in this PR. Is this ok?:
|
Probably just "CH3" would be sufficient to give a computer-name like string, but of course "-CH3" would be more explicit that this is a methyl attached to something. I'm find with both ways the example would go (and assume a server can assign any string that is convenient). |
Alright, @sauliusg, @giovannipizzi, @CasperWA please re-review so we can get this merged. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks perfect now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- **Notes**: (for implementers) While this is unrelated to this OPTIMADE specification: If you decide to store internally the :property:`cartesian_site_positions` as a float array, you might want to represent :val:`null` values with :field-val:`NaN` values. | ||
The latter being valid float numbers in the IEEE 754 standard in `IEEE 754-1985 <https://doi.org/10.1109/IEEESTD.1985.82928>`__ and in the updated version `IEEE 754-2008 <https://doi.org/10.1109/IEEESTD.2008.4610935>`__. | ||
- It MUST be a list of length equal to the number of sites in the structure where every element is a list of the three Cartesian coordinates of a site expressed as float values, expressed in the unit angstrom (Å). | ||
- An entry MAY have multiple sites at the same Cartesian position (for a relevant use of this, see e.g., the property `assemblies`_). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this also be equivalent to multiple chemical_symbols
in a species?
I am slightly unaligned with the exact definition of site
from the property description, and what it entails for this sentence.
Note, I don't see a need to change the description, but rather its implication on this sentence, i.e., extending this sentence with a comment on what it means for multiple chemical_symbols
in a species; whether multiple sites at the same Cartesian position can also be set by adding multiple values to chemical_symbols
of a species
or if not (then one could add something like not to be confused with adding multiple values for "chemical_symbols" of the related "species"
, or something?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this also be equivalent to multiple
chemical_symbols
in a species?
I would say physically yes, but logically no.
If two sites have the same Cartesian coords, then I would understand that physically either one atom or another can occupy the same location, with probabilities described in group_probabilities
. Of course then the sites MUST belong to different groups. Bu the fact that their coordinates are the same is accidental; you can run a positional refinement and the sites will diverge.
If we have just one site with a species consisting of two atoms, then I understand that either one or another atom can be encountered at each site where these species is present (but there are no correlations between sites, right?); the probabilities are given in concentration
. vacancy
can be specified for a missing atom, with its own concentration. In this case atoms will (should?) never ever diverge from the given site.
I am slightly unaligned with the exact definition of
site
from the property description, and what it entails for this sentence.
Hmmm... I wold understand that a site
is a position in space (given by Cartesian coordinates) potentially occupied by an atom of one or another chemical kind; what atoms, in what proportions and in what correlations is described by species
, species_at_sites
, sites_in_groups
. Actually, the current spec gives an example of describing the same situation using mixed-element species and alternatively using assemblies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sauliusg, I think your understanding aligns perfectly with my understanding as well.
For the purpose of the spec, I would then add a mention of species
, but link to the comparison of assemblies
and species
to describe the exact meaning of multiple sites at the same Cartesian position
, or rather examples of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The precise point on how to interpret repeated coordinates, and how to formulate the text around it, was a tricky issue to agree on, so we should not edit this without multiple other people taking a look. If what you ask for is to add a further clarification on text that this PR does not modify (?), can you please post a separate PR with the suggested clarification so we can discuss that separately? I don't think this has anything to do with atoms at unknown coordinates?
On the actual point, my view is aligned with @sauliusg. Repeated coordinates represent different sites that happen to have the same numerical values as coordinates, but those coordinates could diverge with higher accuracy. There are very few reasons to ever use this in practice, aside assemblies. Multiple chemical_symbols
+ concentrations
on one site represent correlated disordered atoms that should sit on exactly the same site, correlated in such a way that two atoms never sit on that site at the same time. If you tried to represent the same thing with two sites with repeated coordinates in OPTIMADE with 50% atom + 50% vacancy, they would individually, and in an uncorrelated way, disorder at respective site as 50% atom/50% vacancy; meaning that sometimes both atoms would be sitting there at the same coordinates at the same time (which probably is not what you want).
It was an important point that there never should be a need on either client or server to "detect" repeated coordinates and prescribe a special meaning to that situation, compared to just treating them as two different sites.
optimade.rst
Outdated
- :val:`unknown_positions`: This flag MUST be present if at least one component of the :property:`cartesian_site_positions` list of lists has value :val:`null`. | ||
- :val:`assemblies`: This flag MUST be present if the property `assemblies`_ is present. | ||
- :val:`disorder`: this flag MUST be present if any one entry in the `species`_ list has a :field:`chemical_symbols` list that is longer than 1 element. | ||
- :val:`implicit_atoms`: this flag MUST be present if the structure contains atoms that are not assigned to sites via the property `species_at_sites`_ (e.g., because their positions are unknown). When this flag is present, the properties related to the chemical formula will likely not match the type and count of atoms represented by the `species_at_sites`_, `species`_, and `assemblies`_ properties. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would this be present without also having site_attachments
present?
Or phrasing the question in another way: How could I add implicit atoms without mentioning them somewhere else? Or is that the whole point of implicit atoms? That it should be present if the structure contains "species/atoms" which are not accounted for by any property, e.g., crystalline/bound water or other solvents or the like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How could I add implicit atoms without mentioning them somewhere else?
In OPTIMADE, you can not.
In CIFs (experimental crystallography), you can add a summary formula that specifes more atoms than are listed in the coordinates table; you can also add special _sqeeze data items that in my understanding describe electron density occupied by molecules that could not be resolved and modelled by discrete atoms because they are disordered. These molecules (e.g. solvent) will still be present in the formula, but absent from the atom list, and not attached to any particular atom. In that case, you would return for OPTIMADE attached_atoms == false && implicit_atoms == true
Or is that the whole point of implicit atoms?
Yes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for my two questions you've answered them with opposing answers, as I understand it.
While we in OPTIMADE do not allow the flag "implicit_atoms"
to be true
without also adding the atoms somewhere else (i.e., in another property), then that is still the flags purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for my two questions you've answered them with opposing answers, as I understand it.
While we in OPTIMADE do not allow the flag"implicit_atoms"
to betrue
without also adding the atoms somewhere else (i.e., in another property), then that is still the flags purpose?
I probably misunderstood your (last?) question.
Let me try to clarify (for myself and for discussion with all of us):
- As intended in the current OPTIMADE PR, as I understand it, the flags
implicit_atoms
andattached_atoms
are orthogonal and independent. All four combinations are permissible:
-
implicit_atoms == false && attached_atoms == false
: This is a structure where every atom that has been modelled has a correspondingspecies_at_sites
entry and the Cartesian coordinates; these are structures that computational material scientists are mostly after, but not the only structures that experimentalists produce :) ; -
implicit_atoms == true && attached_atoms == false
: we know that the material described contains more atoms than specified in the coordinates list, but we do not know where these atoms are attached. In experimental databases, this will manifest through the fact that a formula computed from atomic coordinates and the chemical formula declared by the authors do not match. We do not know, however, where these atoms are in the cell. -
implicit_atoms == false && attached_atoms == true
: there are atoms to which other atoms attached, and we know all such attachments. Most often the attached atoms will be hydrogens (but can probably be deuterium; although deuterium is usually determined by neutron diffraction and its position is know precisely). The calculated formula matches the declared formula if you take attached atoms into account; no other atoms in the unit cell are known/modelled/claimed.
For such structures you can reasonably well reconstruct the hydrogens; for example in a benzene ring a hydrogen atom attached to a ring carbon will sit in a plane of the ring on a bissectrice at a known distance. If you place all such atoms (you can do this automatically most often), then you will know coordinates of all atoms and you can proceed with the calculation. Sometimes the position of hydrogens will be ambiguous, e.g. a methyl group (-CH3) can be rotated around its bond; but hopefully the differences are small for the coordinate relaxation to converge (I wold be interested to try such calculation, actually...).
While placing hydrogens in this case is possible, the database will not do it for you ;)
- Finally,
implicit_atoms == true && attached_atoms == true
: it means that both situations of from cases 2 and 3 are present. Some atoms are attached and you can roughly reconstruct them; but some will still be missing. It's up to you what you do with such structure; most probably computationalists will discard it (as well as the case 2), but the model can be uselful in many different cases. Also, a big question for computational material scientists: is it possible to guess an approximate positions of the missing atoms and then compute them using DFT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So again, I believe I understand all of these scenarios and can come up with examples of my own.
My concern is more of clarification of whether the OPTIMADE structures
data structure needs to (or could) contain the information of the implicit atoms, e.g., one could add a species
which do not have a site, which could be the implicit atoms.
One may or may not even know the content of such species, how would this then be specified in the concentration
field?
My concern here is simply that the implicit_atoms
is not well defined in terms of what it requires otherwise in the OPTIMADE structures
data structure.
We may say that this is fine, and providers can do as they wish, however, it may be valuable to understand which species are considered implicit if possible, and hence recommend a certain way of supplying this information.
A concrete example may be crystalline water, where it's easier to provide a 3D density mapping of the probability of where to find the water. It may be catalytic compounds, where the active molecule/ion/atom shifts around in the structure/framework so much that it cannot be pinpointed to a specific location, but could then (surely) be attached (i.e., one sets the attached_atoms
flag and fill in the appropriate fields in species
), however, if it's not attached anywhere, or more realistically, it is unknown where part of it is attached, then where should one put this information?
In the examples given here, it's possible I could perfectly know the content of such species/atoms through gravimetric experiments or otherwise, but not a position. Hence they are implicit, but could be accounted for when matching up the list of species with the chemical formula...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, just re-reading your latest post @sauliusg, I think my main point is this:
We should come up with a recommendation on how to supply information on implicit atoms/species.
And our data structure / specification should be flexible enough to accommodate this. Otherwise, we're losing information using the OPTIMADE schema compared to the original database schema (relating to the implicit atoms), while at the same time knowingly stating that we are missing this information (through the implicit_atoms
flag) - that's just frustrating.
Edit: Also, your well-written examples in your comment above could be a great addition to clarifying what we mean by implicit_atoms
(as well as attached_atoms
). So it could/should be put into some documentation of some kind, if not directly in the specification (but then shortened).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this time there is no standardized formal way of specifying which atoms are implicit, except that they MAY be indicated in the formulas (but the specification presently does not make this mandatory). It is my impression from the lengthy discussions around this that we should not add any mandatory requirements to further specify the implicit atoms, as in the case of many cif files, it will be impossible to do so without inserting someones interpretation of the data.
I don't have anything against additional optional specifications of what the missing atoms are. Please post as a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean to add further mandatory requirements, but rather specification recommendations, i.e., RECOMMENDED instead of MUST or SHOULD.
However, to not delay this any longer, I will direct any suggested changes I may have into another issue/PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so what could be useful is some optional field for specifying the implicit atoms, which can go in another PR.
(I just note that as far as I read rfc2119, there is no difference between SHOULD and RECOMMENDED.)
Co-authored-by: Casper Welzel Andersen <[email protected]>
Co-authored-by: Casper Welzel Andersen <[email protected]>
I've commited @CasperWA's edits. For his larger points, I'd like to see them as separate PRs as I write in the comments. I don't think they are blocking v1.0.0-rc.2 (which we are already late for...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written in a comment response, my "larger" comments are not blocking for getting this PR through. They were merely suggestions to recommendations that could be made for how to deal with certain use cases.
I will try and make an issue/PR for these to move the discussion there.
Again, thanks for the extensive work on this @rartino and @sauliusg !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR follows discussions at the OPTIMADE 2020 workshop.
As per issues #260 and #258 (and also see #50, #93) there has been quite some discussion about how to handle unknown positions. It was previously decided that we would allow
null
in thecartesian_site_positions
property in thestructure
entry for this. However, since then it has come up that: (i) it is not desirable for databases to on-the-fly have to insert these null entries, (ii) it can be difficult for databases to provide reasonably efficient queries on whether suchnull
s will be present or not.Hence, after some discussion, this PR implements:
unknown_atoms
is completely removed.nulls
are no longer allowed in thecartesian_site_position
property.attached
andnattached
in thesites
list. There is a flagsite_attachments
instructure_features
to indicate when this feature is used so queries easily can exclude/include these structures.implicit atoms
. This means that not all atoms are actually included in the structure definition provided incartisian_site_positions
andspecies_at_sites
. This usually means there is a mismatch between the properties indicating the chemical formula and what is found in this structural definition. There is a flagimplicit_atoms
instructure_features
to indicate when this feature is used so queries can exclude/include these structures.(Other people prominently being part of the discussions shaping this PR: @sauliusg, @giovannipizzi, and @merkys).
Closes #260
Closes #258