-
Notifications
You must be signed in to change notification settings - Fork 874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Breaking: New method of POTCAR validation #3351
Conversation
Thanks for taking a fresh look at this problem @esoteric-ephemera, it's been unsatisfactory for a while. These warnings frequently confuse or alarm people unnecessarily, but the underlying motivation of rigorously checking pseudopotential compatibility is of course very important.
I imagine there's a way around this, to normalize the POTCAR file, but it might be slow and perhaps there'd still be brittleness in other ways so I like your suggestions. When the original POTCAR hash database was assembled we tried to find as many examples as we could, but I'm still surprised there are files out there that might differ by e.g. spaces within the file itself, since I can't imagine why these edits would have been made. |
pymatgen/io/vasp/inputs.py
Outdated
return input_str | ||
|
||
def is_valid(self, tol: float = 1.0e-6) -> bool: | ||
"""Check that POTCAR matches reference metadata.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand this docstring to specify what is being checked (e.g., what attributes of PotcarSingle
), and also what the tol
is doing, and why this can be changed by the user? It seems like a fixed tolerance might be preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Might have gone a little too verbose in explaining what's going on in is_valid(), but there's a better explanation there now.
For "tol", I left that adjustable for debugging - it's frozen now at the same value
Thanks both @janosh and @mkhorton! Tried to address your comments. It is possible just to completely strip a POTCAR of spaces and hash it that way, but I see one issue with that: in the PSCTR / header section of the POTCAR, there are comments. Normally, in fortran, in-line comments would be indicated by a leading "!", but these are omitted in the proprietary POTCAR format. One could add comments without changing any numeric values, and get the same POTCAR There was actually a test of PotcarSingle in PMG that did just that - it basically added comments to a PSCTR tag. A hash would report these as different, but the numeric values are identical to vasp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @esoteric-ephemera! 👍
I think we just need to rename test_identify_potcar
and call is_valid
in there.
pymatgen/io/vasp/inputs.py
Outdated
if input_str.lower() in ["t", "f"] or input_str.lower() in ["true", "false"]: | ||
return input_str[0].lower() == "t" | ||
|
||
if (input_str.upper() == input_str.lower()) and input_str[0].isnumeric(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about floats like 1e3
? I guess they never appear in POTCARs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fortran-style scientific format always includes a decimal point on write, like 1.0E-06
or 0.5E+2
, see this guide. While you can write 1E4
, you can't format a float without a decimal point in fortran.
Probably best to rename this to _fortran_style_str_to_py_var
tests/io/vasp/test_inputs.py
Outdated
assert "Fe" in psingle.identify_potcar()[1] | ||
matched_funcs = [refpsp["POTCAR_FUNCTIONAL"] for refpsp in psingle._matched_meta] | ||
assert "PBE_54" in matched_funcs | ||
assert "Fe" in psingle._matched_meta[0]["TITEL"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests generally shouldn't test an implementation detail like this private _matched_meta
list. We need a test called test_is_valid
that only calls the public PotcarSingle.is_valid
property both for a case where it's True
and False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! And thanks for cleaning up my messy python. The new test_is_valid
operates on a working Fe POTCAR, and one where one variable is changed by 0.001. Should highlight that the check truly is sensitive to individual numeric values.
Thank you, @esoteric-ephemera, for helping me sleep better at night with this PR. |
…fy_potcar to test_is_valid to ensure that PotcarSingle.is_valid is not sensitive to small numeric changes
tests/io/vasp/test_inputs.py
Outdated
|
||
# corrupt the file | ||
assert psingle.keywords["RCORE"] == 2.3 | ||
psingle.keywords["RCORE"] = 2.2 | ||
assert not psingle.is_valid | ||
|
||
def test_potcar_hash_warning(self): | ||
psingle.keywords.pop("RCORE") | ||
assert not psingle.is_valid | ||
|
||
psingle.data = psingle.data.replace("RCORE = 2.300", "RCORE = 2.200") | ||
assert not psingle.is_valid | ||
|
||
def test_unknown_potcar_warning(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rather than storing a new POTCAR file with just 1 changed line, it should be easy to invalidate a PotcarSingle
instance on the fly. But actually this turned out to be pretty hard. You might say this is a contrived example since people shouldn't change PotcarSingle
objects once they've been read from a file. But fwiw, I can currently reassign
psingle.keywords["RCORE"] = 2.2
remove it altogether
psingle.keywords.pop("RCORE")
and modify the underlying string data
psingle.data = psingle.data.replace("RCORE = 2.300", "RCORE = 2.200")
all without invalidating the PotcarSingle
instance. See the new assert
s I added. I think it would make sense for all of them to pass but they currently fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because the header info is taken from self.PSCTR
and not the parse_functions
attributes of PotcarSingle. For each such attribute like RCORE (the ones in self.parse_functions
), the key/value are stored both in self.keywords
and self.PSCTR
So if you do instead self.PSCTR['RCORE'] = 2.2
, the second psingle.is_valid
will fail. Or, if you did
psingle.data = psingle.data.replace('local part','local port')
the second psingle.is_valid
will also fail.
Hesitant to change how this is done (storing the same data in three different attributes of PotcarSingle) because it might break something that relies on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hesitant to change how this is done (storing the same data in three different attributes of PotcarSingle) because it might break something that relies on it
I don't think we need to be hesitant here. having redundant data in PotcarSingle.keywords
and PotcarSingle.PSCTR
is needlessly complicated, goes against the SSOT coding principle and now would be a great time to clean this up imo.
@shyuep @mkhorton Any concerns about reworking PotcarSingle
to e.g. remove PSCTR
and turn keywords
into a @property
so they're computed on the fly and always in sync with the underlying data
string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, having redundant data doesn't make sense for the reasons you state. My only hesitation here is that I wouldn't want to impose scope creep on this PR for @esoteric-ephemera, since it's probably not strictly necessary to change it for this PR to achieve its goal of better validation. While the PotcarSingle
should be immutable, I think the risk of someone mutating it in practice are quite low.
More broadly, I've never had a good understanding of the internals of the POTCAR file. If anyone did want to spend more time on the PotcarSingle
class, I'd love to see more documented @property
etc. for some of these keywords so that people can better understand what the POTCAR files they're using are actually specifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged PotcarSingle.keywords
and PotcarSingle.PSCTR
, now just called PotcarSingle.keywords
since PSCTR
has no obvious meaning. Maybe we should rename PotcarSingle.keywords
to PotcarSingle.header_keywords
for added clarity?
At the risk of asking a stupid question, I saw the MPRelaxSet defaults changed Yb_2 to Yb_3 recently (see #2968). However, as far as I can tell, only So my questions are, (1) is this correct?, and (2), if so, is it even possible to run MPRelaxSet with Yb-containing materials at present? |
@mkhorton I think you forgot about #2972 which reverted the pymatgen/pymatgen/io/vasp/MPRelaxSet.yaml Line 175 in 8e0b099
|
Indeed I did, thanks @janosh! This was not reverted in atomate2 so the two sets are currently inconsistent (I realise this is an active topic of discussion right now). Apologies to @esoteric-ephemera for bringing a tangent to this PR. |
fixes TestTransformedStructure.test_get_vasp_input > return self.keywords[attr.upper()] E KeyError: 'HASH' (was broken in 983598)
trying to fix tests
pymatgen/io/vasp/inputs.py
Outdated
warnings.warn( | ||
f"POTCAR with symbol {self.symbol} and functional\n" | ||
f"{self.functional} has a SHA256 hash defined,\n" | ||
"but the computed hash differs.\n" | ||
"YOUR POTCAR FILE HAS BEEN CORRUPTED AND SHOULD NOT BE USED!" | ||
"YOUR POTCAR FILE HAS BEEN CORRUPTED AND SHOULD NOT BE USED!", | ||
UnknownPotcarWarning, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@esoteric-ephemera @mkhorton @Andrew-S-Rosen Are we sure we want to keep the old unknown POTCAR warning when the SHA256 hash provided by VASP doesn't match the one computed from the file? Isn't that what has been giving so many false positive warnings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this PR not make this more robust, reducing false positives?
The warning is definitely badly phrased if we do keep it, it’s inscrutable unless someone is very familiar with VASP and pymatgen. We could link to a docs page for explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this PR not make this more robust, reducing false positives?
The new warning that's supposed to be more robust comes just above these lines:
pymatgen/pymatgen/io/vasp/inputs.py
Lines 1778 to 1783 in c329ff5
if not has_sha256 and not self.is_valid: | |
warnings.warn( | |
f"POTCAR data with symbol {self.symbol} is not known to pymatgen. Your " | |
"POTCAR may be corrupted or pymatgen's POTCAR database is incomplete.", | |
UnknownPotcarWarning, | |
) |
Currently, when not self.is_valid
is False
, i.e. the POTCAR is valid according to the new validation, then we still check whether computed and in-file hash match and trigger the old warning if not (which has many false positives, I think):
elif has_sha256 and not sha256_pass:
warnings.warn(
f"POTCAR with symbol {self.symbol} and functional\n"
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SHA56 warning is not what's been giving us so many UnknownPotcarWarnings
, that was the internal PMG hashing which has been removed. Some versions of VASP POTCARs have a SHA56 tag in the header that VASP computed on their side. We don't have any SHA56-tagged POTCARs for me to test that the current validator works well in their place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though honestly, given the comments on issue #2895, it seems like the SHA56 has similar fragility as PMG's hashes. I would lean towards removing the SHA56 check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given the comments on issue #2895, it seems like the SHA56 has similar fragility as PMG's hashes
that was my understanding as well
I would lean towards removing the SHA56 check
same!
rename potcar_file_hash to md5_computed_file_hash rename potcar_hash to md5_header_hash
…self.sha256_computed_file_hash prone to false positives
…lid example in test_is_valid instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is great! 2 things in this PR:
- less brittle way of checking for invalid POTCARs
- clean up of
PotcarSingle
class. now less confusing due to- no duplication between
keywords
and (the no longer existent)PSCTR
- using
@property
methods for hashes instead of direct class attributes
- no duplication between
Thanks @esoteric-ephemera!
Summary
Issues:
current method of POTCAR validation is too brittle to work well in high-throughput calculations. For example, hashes are sensitive to subtle changes in spaces (addition/deletion). However FORTRAN is space-insensitive, thus two POTCARs that differ by spaces are read identically by VASP.
Goals:
Develop less brittle method for validating POTCARs that reflects how POTCAR metadata and numeric values are read by FORTRAN.
Major changes:
Todos