Character set permitted for variable and attribute names. #307

Closed
martinjuckes opened this issue Nov 16, 2020 · 9 comments

Comments

@martinjuckes
Contributor

The CF Convention does not impose any restrictions on the character set used in variable and attribute names. I have been, for a long time, working on the assumption that there were restrictions inherited from NetCDF, but that is no longer the case. The current NUG states that any UTF8 characters other than / can be used. This means that, for instance, a variable could be named Temperature (°C), as in the following code fragment:

import netCDF4
nc = netCDF4.Dataset( 'example_utf.nc', 'w' )
t = nc.createVariable( 'Temperature (°C)', 'f' )
nc.close()

Should CF impose some restriction, or is it OK to use the whole range of UTF8 in variable names?

@MaartenSneepKNMI

MaartenSneepKNMI commented Nov 16, 2020

As far as I can tell there is an implicit restriction on the use of spaces, as the ancillary_variables attribute (and other attributes as well) uses a space-separated list of variable names. That said, I think that restricting the names of variables to names that can be used as a variable name in (most) programming languages makes sense. As a regular expression: [A-Za-z][A-Za-z0-9_]*. In words: start with a letter (upper- or lowercase), followed by zero or more letters, digits or underscores. Note that the underscore is excluded at the start of a name here.
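As a rough illustration of the pattern above, here is a minimal Python sketch. The helper name `is_restricted_name` and the sample names are hypothetical, chosen only for this example; they are not part of CF or of any netCDF library.

```python
import re

# Pattern quoted in the comment above: one letter, then letters, digits or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_restricted_name(name: str) -> bool:
    """Return True only if the whole name matches the proposed pattern."""
    # fullmatch() requires the entire string to match, not just a prefix.
    return NAME_PATTERN.fullmatch(name) is not None


# Hypothetical sample names, for illustration only.
for candidate in ["air_temperature", "t2m", "_hidden", "Temperature (°C)"]:
    print(f"{candidate!r}: {is_restricted_name(candidate)}")
```

Using `fullmatch` rather than `match` matters here: `match` would accept `Temperature (°C)` because its leading `Temperature` satisfies the pattern.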

To indicate a unit, as in your example, the appropriate attribute should be used; CF provides the mechanisms for that. It is also a usability issue, as some characters may be hard to type, or easy to confuse with characters in this range (o or ο, spot the difference).
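The look-alike problem can be made visible with Python's standard `unicodedata` module. This is only a sketch; the variable names are invented for the example.

```python
import unicodedata

latin_o = "o"        # U+006F
greek_o = "\u03bf"   # U+03BF, renders almost identically in most fonts

# Print the code point and official Unicode name of each character.
for ch in (latin_o, greek_o):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The two characters compare unequal even though they look alike,
# so names that differ only in this character are different names.
print(latin_o == greek_o)
```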

@taylor13

I would agree that it would be nice if any variable name found in a CF-compliant file were able to be adopted without modification in the programming languages used in climate science. This would facilitate code documentation and generally make it easier for users, I think. So I support the view expressed by @MaartenSneepKNMI and would restrict the character set. Don't know if UTF8 is the right set.

@MaartenSneepKNMI

The bit pattern of UTF-8 and ASCII overlap for the characters that I list as available for variable names, so there is no distinction.

I don't think any choice other than UTF-8 should be made at this point in time.
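The overlap between ASCII and UTF-8 for the proposed character range can be checked directly in Python. This is a small sketch with made-up example names, not a statement about how any netCDF library stores names on disk.

```python
# A name built only from letters, digits and underscores (hypothetical example).
restricted_name = "air_temperature_2m"

# For such names the ASCII and UTF-8 encodings are byte-for-byte identical.
same_bytes = restricted_name.encode("ascii") == restricted_name.encode("utf-8")
print(same_bytes)

# A name with a non-ASCII character needs more than one byte for that character in UTF-8.
non_ascii_name = "Temperature (°C)"
utf8_bytes = non_ascii_name.encode("utf-8")
print(len(non_ascii_name), len(utf8_bytes))
```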

@DocOtak
Member

DocOtak commented Nov 16, 2020

UTF-8 is not a character set; it is an encoding for Unicode. How you actually store these names in the files is specified by the CDM Identifiers section and cannot be decided by CF.

Since CF doesn't support string attributes yet, and given how some libraries interact with string attributes (e.g. netcdf4 python will force a string attribute if the text attribute cannot be converted to ASCII), the implicit, in-practice restriction is that variable names are limited to Unicode code points no higher than U+007F (i.e. ASCII) if their name is going to appear in a CF standardized attribute. I think CF should only go so far as to warn about this limitation for names which will appear in these attributes, but not care beyond that.
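A warning of the kind suggested here could be driven by a simple check like the following sketch. The function name `non_ascii_characters` is hypothetical, and the check looks only at code points; it does not reproduce what any particular netCDF library does internally.

```python
def non_ascii_characters(name: str) -> list[str]:
    """Return the characters in name whose code point is above U+007F."""
    return [ch for ch in name if ord(ch) > 0x7F]


# Hypothetical variable names, for illustration only.
for name in ["quality_flag", "Temperature (°C)"]:
    offending = non_ascii_characters(name)
    if offending:
        print(f"{name!r}: non-ASCII characters {offending}")
    else:
        print(f"{name!r}: ASCII only")
```

For a plain yes/no answer, `str.isascii()` (Python 3.7 and later) gives the same result as checking that this list is empty.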

@martinjuckes
Contributor Author

Just to be clear, I wasn't suggesting that Temperature (°C) was something that I would want to use or encourage others to use, just wondering whether it should be considered as valid in a CF data file.

I tend to agree with @MaartenSneepKNMI and @taylor13: the point that inclusion of spaces in variable names would break other parts of the convention is a good one. I support the restriction proposed by @MaartenSneepKNMI above. This matches, I think, the approach used in all the CDL examples in the current convention.
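To see why spaces are a problem, consider how a reader would typically split a space-separated attribute such as ancillary_variables. The attribute value below is a made-up example, and the whitespace split is an assumption about how client code parses such lists.

```python
# Hypothetical attribute value listing two variables, one of which contains a space.
ancillary_variables = "Temperature (°C) quality_flag"

# Splitting on whitespace cannot tell where one name ends and the next begins.
names = ancillary_variables.split()
print(names)       # three tokens, although only two variables were intended
print(len(names))
```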

@sethmcg
Contributor

sethmcg commented Nov 16, 2020

I will also second @MaartenSneepKNMI's recommendation to restrict legal CF variable names to ones that could be used as variable names in important programming languages. It enables some very useful programming paradigms. I think it would also be good to mention that reasoning explicitly in the text describing the restriction.

@Dave-Allured
Contributor

@martinjuckes, CF section 2.3 says:

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

However, your opening statement seems to contradict:

The CF Convention does not impose any restrictions on the character set used in variable and attribute names.

What is your interpretation? Is section 2.3 advisory only, or did you miss that?

@martinjuckes
Contributor Author

@Dave-Allured: sorry, my mistake. Apologies for an unneeded discussion.

@Dave-Allured
Contributor

@martinjuckes, I refer you to #237, Remove restrictions on netCDF object names. Please support that proposal, and direct further discussion there.
