Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

To what extent should MetPy try to ensure CF units/UDUNITS compatibility? #1362

Open
jthielen opened this issue Apr 21, 2020 · 14 comments
Open
Labels
Area: Units Pertains to unit information Type: Enhancement Enhancement to existing functionality

Comments

@jthielen
Copy link
Collaborator

This comment by @rsignell-usgs, combined with issues like #1134 and #1350, led me to wonder: what efforts should MetPy be taking to maintain compatibility with CF units/UDUNITS? Right now, it seems to have been just implementing a fix when a particular problem arises (which is a reasonable enough way to continue going forward), but I've also wondered whether it's worth it at some point to parse the UDUNITS xml files and either run tests on MetPy's Pint registry with respect to this database or use the database to build a custom Pint registry configuration file.

@jthielen jthielen added Type: Enhancement Enhancement to existing functionality Area: Units Pertains to unit information labels Apr 21, 2020
@dopplershift
Copy link
Member

dopplershift commented Apr 22, 2020

I think there's some value in it. Really, what we need to support are standards in our field's common data formats. With the exception of random text units in CSV file headers or free text fields in HDF5/netCDF files, the units string attribute in netCDF files conforming to the CF conventions is the only standard unit string I'm aware of us needing to parse. So it's important that we support it and do it right.

Luckily, thanks to @lesserwhirls and netCDF-Java, there's already a place where all the XML unit information from UDUnits2 is aggregated together. I don't think we want to be parsing that at runtime in MetPy, but I think it would be good to do some testing against it and see how much we need to add (and hopefully not modify or remove). One thing I will note is (that I just learned through digging into the "Celsius" issue) is that UDUnits-2 is case insensitive for names only, not symbols. That could present some challenges.

I think we should at least be somewhat pro-active here--at least by following this particular thread (e.g. "Celsius" and look at the XML unit database). I'm not saying we need to exhaustively test 10^5 different unit strings and make sure we get the right results. I bet just solving the two things mentioned here, on top of the mods you've already made, would get us 99% there.

@dopplershift
Copy link
Member

Out of curiosity, how much of Pint's parser can we override or outright rip out?

@jthielen
Copy link
Collaborator Author

Out of curiosity, how much of Pint's parser can we override or outright rip out?

I'm not too sure about that! The most extensible parts of Pint right now are unit definitions, constants, and preprocessors functions. Beyond that, I'm pretty sure we'd have to rely on monkey patching or implementing changes upstream to make currently unexposed APIs extensible.

@jthielen
Copy link
Collaborator Author

Also adding to the list here: "Meter" (from #1459)

Should there be a short-term fix for this capitalization issue until more substantial testing can be done? Or is this something we should comment on in the unit support docs?

@dopplershift
Copy link
Member

Well, can we make it so we only consider case for symbols? Hell, I don't even know if there are symbols that collide if you make everything lowercase.

@jthielen
Copy link
Collaborator Author

jthielen commented Aug 12, 2020

Well, can we make it so we only consider case for symbols? Hell, I don't even know if there are symbols that collide if you make everything lowercase.

Only considering case for symbols would not be easy to do exactly, but there may be ways around that.

Since Pint's parsing internals currently don't seem to distinguish between symbols and any other alias, this would at the very least require making low-level changes in Pint's unit parsing and adding a bunch of extra references back to underlying Definitions to check if it is a symbol, on top of carrying through the appropriate case sensitive options.

Symbols do collide if you make everything lowercase in a preprocessor (e.g., G for gauss vs g for gram and M for molar vs m for meter), as do prefixes (Mega- vs milli-), so a "make everything lowercase" will not work with the current API.

That being said, Pint does have proper case insensitive handling on its UnitRegistry.get_name and UnitRegistry.parse_expression (with case_sensitive=False) that handles these collisions properly, which could get us somewhere. However,

  • in cases where there isn't a symbol collision, the case insensitive option is overly lenient (e.g., "j" is still joules)
  • the case insensitive option isn't exposed on UnitRegistry.parse_units, and isn't used in the constructors for Unit and Quantity.

So, if we are okay with a bit of extra leniency on case-insensitive symbols, then this should actually be achievable by adding a case_sensitive_units registry-level setting in Pint, exposing case_sensitive on UnitRegistry.parse_units, and modifying the Unit and Quantity constructors to rely on the case_sensitive_units registry setting when calling the registry's parsing methods. All the hard work would go in Pint, MetPy would just change the case_sensitive_units option from its default to False (monkey-patching these long methods would be rather fragile).

If this all sounds reasonable, should we propose this upstream in Pint?

@dopplershift
Copy link
Member

Given that we're not a conformance checker and our users frequently are not in control of their data, we should also lean towards being lenient, provided we're not in an ambiguous situation--which we're not here. I think that's reasonable to propose your suggestion in Pint.

@jthielen
Copy link
Collaborator Author

jthielen commented Aug 16, 2020

Once hgrecco/pint#1147 is merged (and 0.15 released?), all we have to do is add the case_sensitive=False option when we create our registry. This will likely need to be done with a try/except block unless we want to bump minimum Pint version to 0.15. If we wanted, the except block could issue a warning about case sensitivity.

Update: hgrecco/pint#1147 has been merged, so this can proceed forward now with upstream dev or wait until 0.15 release.

@dopplershift
Copy link
Member

We'll have to ponder whether it's better to bump Pint. I'm doubtful of the existence of a user-base that is capable of installing a new MetPy but is unable to update their version of Pint.

@dopplershift
Copy link
Member

So for the record, 0.15 has been released and conda-forge packages are out. I'm going to wait for dependabot to pick it up and that should open us up to testing on it.

@jthielen
Copy link
Collaborator Author

I went ahead and enabled the case_sensitive=False option on our unit registry, and just as quickly discovered problems. Unfortunately, the case sensitive symbol handling is less robust than I was expecting...prefixes with symbols breaks it (so we get things like millibytes instead of millibarnes and kilomolar instead of kilometer). This will take some time to debug, and likely some further upstream changes in Pint, so I'm going to punt on it for now. If anybody's interested, my branch with the case_sensitive=False optional conditionally set is at https://github.com/jthielen/MetPy/tree/pint-case-insensitive.

@dopplershift
Copy link
Member

Le sigh.

@jthielen
Copy link
Collaborator Author

For the record, this dataset from #1917 contains an expression that failed our regex ((W m-2 um-1)-1). I hypothesized a fix in #1917 (comment) but didn't test it.

@dopplershift dopplershift modified the milestones: 1.1.0, 1.2.0 Aug 2, 2021
@dopplershift dopplershift removed this from the 1.2.0 milestone Oct 8, 2021
@jthielen
Copy link
Collaborator Author

xarray-contrib/xwrf#50 uncovered another expression that fails our regex: kg(-1), which is interpretted as -1 <Unit('kilogram')> instead of the proper 1.0 <Unit('1 / kilogram')>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Units Pertains to unit information Type: Enhancement Enhancement to existing functionality
Projects
None yet
Development

No branches or pull requests

2 participants