Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New representation of atoms at unknown positions #279

Merged
merged 7 commits into from
Jun 14, 2020

Conversation

rartino
Copy link
Contributor

@rartino rartino commented Jun 12, 2020

This PR follows discussions at the OPTIMADE 2020 workshop.

As per issues #260 and #258 (and also see #50, #93) there has been quite some discussion about how to handle unknown positions. It was previously decided that we would allow null in the cartesian_site_positions property in the structure entry for this. However, since then it has come up that: (i) it is not desirable for databases to on-the-fly have to insert these null entries, (ii) it can be difficult for databases to provide reasonably efficient queries on whether such nulls will be present or not.

Hence, after some discussion, this PR implements:

  • The old notion of unknown_atoms is completely removed. nulls are no longer allowed in the cartesian_site_position property.
  • A species definition may now include attached atoms, e.g., one can define 'CH4' to be a Carbon with four attached Hydrogens. This feature is provided by keys attached and nattached in the sites list. There is a flag site_attachments in structure_features to indicate when this feature is used so queries easily can exclude/include these structures.
  • An implementation may serve structures with implicit atoms. This means that not all atoms are actually included in the structure definition provided in cartisian_site_positions and species_at_sites. This usually means there is a mismatch between the properties indicating the chemical formula and what is found in this structural definition. There is a flag implicit_atoms in structure_features to indicate when this feature is used so queries can exclude/include these structures.

(Other people prominently being part of the discussions shaping this PR: @sauliusg, @giovannipizzi, and @merkys).

Closes #260
Closes #258

@rartino rartino added PR/ready-for-review Add this flag if you are the author of the PR and you want it to be reviewed. Remove it when editing priority/high There is a consensus that resolving this should be prioritized. labels Jun 12, 2020
@rartino rartino added this to the 1.0 release milestone Jun 12, 2020
Copy link
Contributor

@sauliusg sauliusg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a slight change in the wording, but otherwise it seems fine to me.

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
@rartino rartino requested a review from sauliusg June 12, 2020 17:44
@rartino
Copy link
Contributor Author

rartino commented Jun 12, 2020

Thanks for the review @sauliusg, I've made the modifications suggested.

giovannipizzi
giovannipizzi previously approved these changes Jun 12, 2020
sauliusg
sauliusg previously approved these changes Jun 13, 2020
Copy link
Contributor

@sauliusg sauliusg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the text one more today, it seem clear to me, so if other agree let's merge.

Shall we put an attached atom example (-CH3, -CH2-) in a separate PR, or is it already in? (I did not find it, but we can make it later).

@rartino
Copy link
Contributor Author

rartino commented Jun 13, 2020

Oh, I think I misunderstood you. I added the "example" of a "-CH3" to the description of species

(frequently used to indicate hydrogen atoms attached to another element, e.g., a carbon with three attached hydrogens might represent a methyl group, -CH3).

But you mean to add an example to the list of species examples, right? Lets do that in this PR. Is this ok?:

[ {"name": "-CH3", "chemical_symbols": ["C"], "concentration": [1.0], "attached":"H", "nattached":3} ]

@rartino rartino dismissed stale reviews from sauliusg and giovannipizzi via 2330025 June 13, 2020 09:01
@sauliusg
Copy link
Contributor

[ {"name": "-CH3", "chemical_symbols": ["C"], "concentration": [1.0], "attached":"H", "nattached":3} ]

Probably just "CH3" would be sufficient to give a computer-name like string, but of course "-CH3" would be more explicit that this is a methyl attached to something. I'm find with both ways the example would go (and assume a server can assign any string that is convenient).

@rartino
Copy link
Contributor Author

rartino commented Jun 13, 2020

Alright, @sauliusg, @giovannipizzi, @CasperWA please re-review so we can get this merged. Thanks.

@sauliusg sauliusg self-requested a review June 13, 2020 09:06
sauliusg
sauliusg previously approved these changes Jun 13, 2020
Copy link
Contributor

@sauliusg sauliusg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks perfect now!

Copy link
Member

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @rartino and @sauliusg !

I have mostly just some questions, but also a couple of formatting change requests.

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
- **Notes**: (for implementers) While this is unrelated to this OPTIMADE specification: If you decide to store internally the :property:`cartesian_site_positions` as a float array, you might want to represent :val:`null` values with :field-val:`NaN` values.
The latter being valid float numbers in the IEEE 754 standard in `IEEE 754-1985 <https://doi.org/10.1109/IEEESTD.1985.82928>`__ and in the updated version `IEEE 754-2008 <https://doi.org/10.1109/IEEESTD.2008.4610935>`__.
- It MUST be a list of length equal to the number of sites in the structure where every element is a list of the three Cartesian coordinates of a site expressed as float values, expressed in the unit angstrom (Å).
- An entry MAY have multiple sites at the same Cartesian position (for a relevant use of this, see e.g., the property `assemblies`_).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this also be equivalent to multiple chemical_symbols in a species?
I am slightly unaligned with the exact definition of site from the property description, and what it entails for this sentence.
Note, I don't see a need to change the description, but rather its implication on this sentence, i.e., extending this sentence with a comment on what it means for multiple chemical_symbols in a species; whether multiple sites at the same Cartesian position can also be set by adding multiple values to chemical_symbols of a species or if not (then one could add something like not to be confused with adding multiple values for "chemical_symbols" of the related "species", or something?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this also be equivalent to multiple chemical_symbols in a species?

I would say physically yes, but logically no.

If two sites have the same Cartesian coords, then I would understand that physically either one atom or another can occupy the same location, with probabilities described in group_probabilities. Of course then the sites MUST belong to different groups. Bu the fact that their coordinates are the same is accidental; you can run a positional refinement and the sites will diverge.

If we have just one site with a species consisting of two atoms, then I understand that either one or another atom can be encountered at each site where these species is present (but there are no correlations between sites, right?); the probabilities are given in concentration. vacancy can be specified for a missing atom, with its own concentration. In this case atoms will (should?) never ever diverge from the given site.

I am slightly unaligned with the exact definition of site from the property description, and what it entails for this sentence.

Hmmm... I wold understand that a site is a position in space (given by Cartesian coordinates) potentially occupied by an atom of one or another chemical kind; what atoms, in what proportions and in what correlations is described by species, species_at_sites, sites_in_groups. Actually, the current spec gives an example of describing the same situation using mixed-element species and alternatively using assemblies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sauliusg, I think your understanding aligns perfectly with my understanding as well.

For the purpose of the spec, I would then add a mention of species, but link to the comparison of assemblies and species to describe the exact meaning of multiple sites at the same Cartesian position, or rather examples of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The precise point on how to interpret repeated coordinates, and how to formulate the text around it, was a tricky issue to agree on, so we should not edit this without multiple other people taking a look. If what you ask for is to add a further clarification on text that this PR does not modify (?), can you please post a separate PR with the suggested clarification so we can discuss that separately? I don't think this has anything to do with atoms at unknown coordinates?

On the actual point, my view is aligned with @sauliusg. Repeated coordinates represent different sites that happen to have the same numerical values as coordinates, but those coordinates could diverge with higher accuracy. There are very few reasons to ever use this in practice, aside assemblies. Multiple chemical_symbols + concentrations on one site represent correlated disordered atoms that should sit on exactly the same site, correlated in such a way that two atoms never sit on that site at the same time. If you tried to represent the same thing with two sites with repeated coordinates in OPTIMADE with 50% atom + 50% vacancy, they would individually, and in an uncorrelated way, disorder at respective site as 50% atom/50% vacancy; meaning that sometimes both atoms would be sitting there at the same coordinates at the same time (which probably is not what you want).

It was an important point that there never should be a need on either client or server to "detect" repeated coordinates and prescribe a special meaning to that situation, compared to just treating them as two different sites.

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated
- :val:`unknown_positions`: This flag MUST be present if at least one component of the :property:`cartesian_site_positions` list of lists has value :val:`null`.
- :val:`assemblies`: This flag MUST be present if the property `assemblies`_ is present.
- :val:`disorder`: this flag MUST be present if any one entry in the `species`_ list has a :field:`chemical_symbols` list that is longer than 1 element.
- :val:`implicit_atoms`: this flag MUST be present if the structure contains atoms that are not assigned to sites via the property `species_at_sites`_ (e.g., because their positions are unknown). When this flag is present, the properties related to the chemical formula will likely not match the type and count of atoms represented by the `species_at_sites`_, `species`_, and `assemblies`_ properties.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this be present without also having site_attachments present?
Or phrasing the question in another way: How could I add implicit atoms without mentioning them somewhere else? Or is that the whole point of implicit atoms? That it should be present if the structure contains "species/atoms" which are not accounted for by any property, e.g., crystalline/bound water or other solvents or the like?

Copy link
Contributor

@sauliusg sauliusg Jun 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could I add implicit atoms without mentioning them somewhere else?

In OPTIMADE, you can not.

In CIFs (experimental crystallography), you can add a summary formula that specifes more atoms than are listed in the coordinates table; you can also add special _sqeeze data items that in my understanding describe electron density occupied by molecules that could not be resolved and modelled by discrete atoms because they are disordered. These molecules (e.g. solvent) will still be present in the formula, but absent from the atom list, and not attached to any particular atom. In that case, you would return for OPTIMADE attached_atoms == false && implicit_atoms == true

Or is that the whole point of implicit atoms?

Yes!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for my two questions you've answered them with opposing answers, as I understand it.
While we in OPTIMADE do not allow the flag "implicit_atoms" to be true without also adding the atoms somewhere else (i.e., in another property), then that is still the flags purpose?

Copy link
Contributor

@sauliusg sauliusg Jun 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for my two questions you've answered them with opposing answers, as I understand it.
While we in OPTIMADE do not allow the flag "implicit_atoms" to be true without also adding the atoms somewhere else (i.e., in another property), then that is still the flags purpose?

I probably misunderstood your (last?) question.

Let me try to clarify (for myself and for discussion with all of us):

  • As intended in the current OPTIMADE PR, as I understand it, the flags implicit_atoms and attached_atoms are orthogonal and independent. All four combinations are permissible:
  1. implicit_atoms == false && attached_atoms == false: This is a structure where every atom that has been modelled has a corresponding species_at_sites entry and the Cartesian coordinates; these are structures that computational material scientists are mostly after, but not the only structures that experimentalists produce :) ;

  2. implicit_atoms == true && attached_atoms == false: we know that the material described contains more atoms than specified in the coordinates list, but we do not know where these atoms are attached. In experimental databases, this will manifest through the fact that a formula computed from atomic coordinates and the chemical formula declared by the authors do not match. We do not know, however, where these atoms are in the cell.

  3. implicit_atoms == false && attached_atoms == true: there are atoms to which other atoms attached, and we know all such attachments. Most often the attached atoms will be hydrogens (but can probably be deuterium; although deuterium is usually determined by neutron diffraction and its position is know precisely). The calculated formula matches the declared formula if you take attached atoms into account; no other atoms in the unit cell are known/modelled/claimed.

For such structures you can reasonably well reconstruct the hydrogens; for example in a benzene ring a hydrogen atom attached to a ring carbon will sit in a plane of the ring on a bissectrice at a known distance. If you place all such atoms (you can do this automatically most often), then you will know coordinates of all atoms and you can proceed with the calculation. Sometimes the position of hydrogens will be ambiguous, e.g. a methyl group (-CH3) can be rotated around its bond; but hopefully the differences are small for the coordinate relaxation to converge (I wold be interested to try such calculation, actually...).

While placing hydrogens in this case is possible, the database will not do it for you ;)

  1. Finally, implicit_atoms == true && attached_atoms == true: it means that both situations of from cases 2 and 3 are present. Some atoms are attached and you can roughly reconstruct them; but some will still be missing. It's up to you what you do with such structure; most probably computationalists will discard it (as well as the case 2), but the model can be uselful in many different cases. Also, a big question for computational material scientists: is it possible to guess an approximate positions of the missing atoms and then compute them using DFT?

Copy link
Member

@CasperWA CasperWA Jun 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So again, I believe I understand all of these scenarios and can come up with examples of my own.
My concern is more of clarification of whether the OPTIMADE structures data structure needs to (or could) contain the information of the implicit atoms, e.g., one could add a species which do not have a site, which could be the implicit atoms.
One may or may not even know the content of such species, how would this then be specified in the concentration field?

My concern here is simply that the implicit_atoms is not well defined in terms of what it requires otherwise in the OPTIMADE structures data structure.
We may say that this is fine, and providers can do as they wish, however, it may be valuable to understand which species are considered implicit if possible, and hence recommend a certain way of supplying this information.

A concrete example may be crystalline water, where it's easier to provide a 3D density mapping of the probability of where to find the water. It may be catalytic compounds, where the active molecule/ion/atom shifts around in the structure/framework so much that it cannot be pinpointed to a specific location, but could then (surely) be attached (i.e., one sets the attached_atoms flag and fill in the appropriate fields in species), however, if it's not attached anywhere, or more realistically, it is unknown where part of it is attached, then where should one put this information?

In the examples given here, it's possible I could perfectly know the content of such species/atoms through gravimetric experiments or otherwise, but not a position. Hence they are implicit, but could be accounted for when matching up the list of species with the chemical formula...

Copy link
Member

@CasperWA CasperWA Jun 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, just re-reading your latest post @sauliusg, I think my main point is this:

We should come up with a recommendation on how to supply information on implicit atoms/species.
And our data structure / specification should be flexible enough to accommodate this. Otherwise, we're losing information using the OPTIMADE schema compared to the original database schema (relating to the implicit atoms), while at the same time knowingly stating that we are missing this information (through the implicit_atoms flag) - that's just frustrating.

Edit: Also, your well-written examples in your comment above could be a great addition to clarifying what we mean by implicit_atoms (as well as attached_atoms). So it could/should be put into some documentation of some kind, if not directly in the specification (but then shortened).

Copy link
Contributor Author

@rartino rartino Jun 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this time there is no standardized formal way of specifying which atoms are implicit, except that they MAY be indicated in the formulas (but the specification presently does not make this mandatory). It is my impression from the lengthy discussions around this that we should not add any mandatory requirements to further specify the implicit atoms, as in the case of many cif files, it will be impossible to do so without inserting someones interpretation of the data.

I don't have anything against additional optional specifications of what the missing atoms are. Please post as a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to add further mandatory requirements, but rather specification recommendations, i.e., RECOMMENDED instead of MUST or SHOULD.
However, to not delay this any longer, I will direct any suggested changes I may have into another issue/PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so what could be useful is some optional field for specifying the implicit atoms, which can go in another PR.

(I just note that as far as I read rfc2119, there is no difference between SHOULD and RECOMMENDED.)

Co-authored-by: Casper Welzel Andersen <[email protected]>
Co-authored-by: Casper Welzel Andersen <[email protected]>
@rartino
Copy link
Contributor Author

rartino commented Jun 13, 2020

I've commited @CasperWA's edits. For his larger points, I'd like to see them as separate PRs as I write in the comments. I don't think they are blocking v1.0.0-rc.2 (which we are already late for...)

@rartino rartino requested review from sauliusg and CasperWA June 13, 2020 21:46
Copy link
Member

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written in a comment response, my "larger" comments are not blocking for getting this PR through. They were merely suggestions to recommendations that could be made for how to deal with certain use cases.
I will try and make an issue/PR for these to move the discussion there.

Again, thanks for the extensive work on this @rartino and @sauliusg !

@CasperWA CasperWA requested review from giovannipizzi and ml-evs June 13, 2020 23:52
Copy link
Contributor

@sauliusg sauliusg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine for me now; thanks @CasperWA , @rartino for taking care of this!

@rartino rartino merged commit 8481d21 into Materials-Consortia:develop Jun 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR/ready-for-review Add this flag if you are the author of the PR and you want it to be reviewed. Remove it when editing priority/high There is a consensus that resolving this should be prioritized.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Definition of an unknown position Query support of structure_features properties
4 participants