Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking: New method of POTCAR validation #3351

Merged
merged 47 commits into from
Sep 30, 2023

Conversation

esoteric-ephemera
Copy link
Contributor

Summary

Issues:
current method of POTCAR validation is too brittle to work well in high-throughput calculations. For example, hashes are sensitive to subtle changes in spaces (addition/deletion). However FORTRAN is space-insensitive, thus two POTCARs that differ by spaces are read identically by VASP.

Goals:
Develop less brittle method for validating POTCARs that reflects how POTCAR metadata and numeric values are read by FORTRAN.

Major changes:

  • is_valid function in pymatgen.io.vasp.inputs.PotcarSingle, which checks POTCAR metadata and values against a database of these
  • To respect copyright, only statistics on the numerical values are stored separately for the PSCTR / POTCAR header and pseudopotential numeric values
  • Keywords found within the PSCTR header and pseudopotential body are stored as well for comparison. For example, older POTCARs may have lacked the kinetic energy density (needed for meta-GGA calcs), and were later updated to include this. The check distinguishes the older and newer version of the same POTCAR
  • Tests have been updated (TestPotcarSingle class in tests/io/vasp/test_inputs.py)

Todos

  • While I generated POTCAR metadata from those available to our group, might be beneficial to expand the database of POTCAR metadata to older / newer releases

@mkhorton
Copy link
Member

Thanks for taking a fresh look at this problem @esoteric-ephemera, it's been unsatisfactory for a while. These warnings frequently confuse or alarm people unnecessarily, but the underlying motivation of rigorously checking pseudopotential compatibility is of course very important.

However FORTRAN is space-insensitive, thus two POTCARs that differ by spaces are read identically by VASP.

I imagine there's a way around this, to normalize the POTCAR file, but it might be slow and perhaps there'd still be brittleness in other ways so I like your suggestions.

When the original POTCAR hash database was assembled we tried to find as many examples as we could, but I'm still surprised there are files out there that might differ by e.g. spaces within the file itself, since I can't imagine why these edits would have been made.

return input_str

def is_valid(self, tol: float = 1.0e-6) -> bool:
"""Check that POTCAR matches reference metadata."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand this docstring to specify what is being checked (e.g., what attributes of PotcarSingle), and also what the tol is doing, and why this can be changed by the user? It seems like a fixed tolerance might be preferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Might have gone a little too verbose in explaining what's going on in is_valid(), but there's a better explanation there now.

For "tol", I left that adjustable for debugging - it's frozen now at the same value

@esoteric-ephemera
Copy link
Contributor Author

Thanks both @janosh and @mkhorton! Tried to address your comments.

It is possible just to completely strip a POTCAR of spaces and hash it that way, but I see one issue with that: in the PSCTR / header section of the POTCAR, there are comments. Normally, in fortran, in-line comments would be indicated by a leading "!", but these are omitted in the proprietary POTCAR format. One could add comments without changing any numeric values, and get the same POTCAR

There was actually a test of PotcarSingle in PMG that did just that - it basically added comments to a PSCTR tag. A hash would report these as different, but the numeric values are identical to vasp

Copy link
Member

@janosh janosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @esoteric-ephemera! 👍

I think we just need to rename test_identify_potcar and call is_valid in there.

if input_str.lower() in ["t", "f"] or input_str.lower() in ["true", "false"]:
return input_str[0].lower() == "t"

if (input_str.upper() == input_str.lower()) and input_str[0].isnumeric():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about floats like 1e3? I guess they never appear in POTCARs?

Copy link
Contributor Author

@esoteric-ephemera esoteric-ephemera Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortran-style scientific format always includes a decimal point on write, like 1.0E-06 or 0.5E+2, see this guide. While you can write 1E4, you can't format a float without a decimal point in fortran.

Probably best to rename this to _fortran_style_str_to_py_var

assert "Fe" in psingle.identify_potcar()[1]
matched_funcs = [refpsp["POTCAR_FUNCTIONAL"] for refpsp in psingle._matched_meta]
assert "PBE_54" in matched_funcs
assert "Fe" in psingle._matched_meta[0]["TITEL"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests generally shouldn't test an implementation detail like this private _matched_meta list. We need a test called test_is_valid that only calls the public PotcarSingle.is_valid property both for a case where it's True and False.

Copy link
Contributor Author

@esoteric-ephemera esoteric-ephemera Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! And thanks for cleaning up my messy python. The new test_is_valid operates on a working Fe POTCAR, and one where one variable is changed by 0.001. Should highlight that the check truly is sensitive to individual numeric values.

@Andrew-S-Rosen
Copy link
Member

Thank you, @esoteric-ephemera, for helping me sleep better at night with this PR.

Aaron Kaplan and others added 3 commits September 27, 2023 09:26
…fy_potcar to test_is_valid to ensure that PotcarSingle.is_valid is not sensitive to small numeric changes
Comment on lines 1041 to 1053

# corrupt the file
assert psingle.keywords["RCORE"] == 2.3
psingle.keywords["RCORE"] = 2.2
assert not psingle.is_valid

def test_potcar_hash_warning(self):
psingle.keywords.pop("RCORE")
assert not psingle.is_valid

psingle.data = psingle.data.replace("RCORE = 2.300", "RCORE = 2.200")
assert not psingle.is_valid

def test_unknown_potcar_warning(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rather than storing a new POTCAR file with just 1 changed line, it should be easy to invalidate a PotcarSingle instance on the fly. But actually this turned out to be pretty hard. You might say this is a contrived example since people shouldn't change PotcarSingle objects once they've been read from a file. But fwiw, I can currently reassign

psingle.keywords["RCORE"] = 2.2

remove it altogether

psingle.keywords.pop("RCORE")

and modify the underlying string data

psingle.data = psingle.data.replace("RCORE  =    2.300", "RCORE  =    2.200")

all without invalidating the PotcarSingle instance. See the new asserts I added. I think it would make sense for all of them to pass but they currently fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the header info is taken from self.PSCTR and not the parse_functions attributes of PotcarSingle. For each such attribute like RCORE (the ones in self.parse_functions), the key/value are stored both in self.keywords and self.PSCTR

So if you do instead self.PSCTR['RCORE'] = 2.2, the second psingle.is_valid will fail. Or, if you did
psingle.data = psingle.data.replace('local part','local port')
the second psingle.is_valid will also fail.

Hesitant to change how this is done (storing the same data in three different attributes of PotcarSingle) because it might break something that relies on it

Copy link
Member

@janosh janosh Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hesitant to change how this is done (storing the same data in three different attributes of PotcarSingle) because it might break something that relies on it

I don't think we need to be hesitant here. having redundant data in PotcarSingle.keywords and PotcarSingle.PSCTR is needlessly complicated, goes against the SSOT coding principle and now would be a great time to clean this up imo.
@shyuep @mkhorton Any concerns about reworking PotcarSingle to e.g. remove PSCTR and turn keywords into a @property so they're computed on the fly and always in sync with the underlying data string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, having redundant data doesn't make sense for the reasons you state. My only hesitation here is that I wouldn't want to impose scope creep on this PR for @esoteric-ephemera, since it's probably not strictly necessary to change it for this PR to achieve its goal of better validation. While the PotcarSingle should be immutable, I think the risk of someone mutating it in practice are quite low.

More broadly, I've never had a good understanding of the internals of the POTCAR file. If anyone did want to spend more time on the PotcarSingle class, I'd love to see more documented @property etc. for some of these keywords so that people can better understand what the POTCAR files they're using are actually specifying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged PotcarSingle.keywords and PotcarSingle.PSCTR, now just called PotcarSingle.keywords since PSCTR has no obvious meaning. Maybe we should rename PotcarSingle.keywords to PotcarSingle.header_keywords for added clarity?

@mkhorton
Copy link
Member

At the risk of asking a stupid question, I saw the MPRelaxSet defaults changed Yb_2 to Yb_3 recently (see #2968).

However, as far as I can tell, only Yb_2 is in the POTCAR database for the PBE functional (i.e., the MPRelaxSet default). The Yb_3 only exists for the PBE_52/PBE_54.

So my questions are, (1) is this correct?, and (2), if so, is it even possible to run MPRelaxSet with Yb-containing materials at present?

@janosh
Copy link
Member

janosh commented Sep 27, 2023

So my questions are, (1) is this correct?, and (2), if so, is it even possible to run MPRelaxSet with Yb-containing materials at present?

@mkhorton I think you forgot about #2972 which reverted the Yb_2 -> Yb_3 migration for MPRelaxSet:

@mkhorton
Copy link
Member

mkhorton commented Sep 27, 2023

Indeed I did, thanks @janosh! This was not reverted in atomate2 so the two sets are currently inconsistent (I realise this is an active topic of discussion right now).

Apologies to @esoteric-ephemera for bringing a tangent to this PR.

fixes TestTransformedStructure.test_get_vasp_input >           return self.keywords[attr.upper()]
E           KeyError: 'HASH'

(was broken in 983598)
@janosh janosh added io Input/output functionality vasp Vienna Ab initio Simulation Package breaking Breaking change api Application programming interface ux User experience labels Sep 29, 2023
warnings.warn(
f"POTCAR with symbol {self.symbol} and functional\n"
f"{self.functional} has a SHA256 hash defined,\n"
"but the computed hash differs.\n"
"YOUR POTCAR FILE HAS BEEN CORRUPTED AND SHOULD NOT BE USED!"
"YOUR POTCAR FILE HAS BEEN CORRUPTED AND SHOULD NOT BE USED!",
UnknownPotcarWarning,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esoteric-ephemera @mkhorton @Andrew-S-Rosen Are we sure we want to keep the old unknown POTCAR warning when the SHA256 hash provided by VASP doesn't match the one computed from the file? Isn't that what has been giving so many false positive warnings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR not make this more robust, reducing false positives?

The warning is definitely badly phrased if we do keep it, it’s inscrutable unless someone is very familiar with VASP and pymatgen. We could link to a docs page for explanation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR not make this more robust, reducing false positives?

The new warning that's supposed to be more robust comes just above these lines:

if not has_sha256 and not self.is_valid:
warnings.warn(
f"POTCAR data with symbol {self.symbol} is not known to pymatgen. Your "
"POTCAR may be corrupted or pymatgen's POTCAR database is incomplete.",
UnknownPotcarWarning,
)

Currently, when not self.is_valid is False, i.e. the POTCAR is valid according to the new validation, then we still check whether computed and in-file hash match and trigger the old warning if not (which has many false positives, I think):

elif has_sha256 and not sha256_pass:
    warnings.warn(
                f"POTCAR with symbol {self.symbol} and functional\n"
                ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SHA56 warning is not what's been giving us so many UnknownPotcarWarnings, that was the internal PMG hashing which has been removed. Some versions of VASP POTCARs have a SHA56 tag in the header that VASP computed on their side. We don't have any SHA56-tagged POTCARs for me to test that the current validator works well in their place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though honestly, given the comments on issue #2895, it seems like the SHA56 has similar fragility as PMG's hashes. I would lean towards removing the SHA56 check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the comments on issue #2895, it seems like the SHA56 has similar fragility as PMG's hashes

that was my understanding as well

I would lean towards removing the SHA56 check

same!

@janosh janosh changed the title New method of POTCAR validation Breaking: New method of POTCAR validation Sep 29, 2023
Copy link
Member

@janosh janosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is great! 2 things in this PR:

  1. less brittle way of checking for invalid POTCARs
  2. clean up of PotcarSingle class. now less confusing due to
    • no duplication between keywords and (the no longer existent) PSCTR
    • using @property methods for hashes instead of direct class attributes

Thanks @esoteric-ephemera!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Application programming interface breaking Breaking change io Input/output functionality ux User experience vasp Vienna Ab initio Simulation Package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants