Keys should not allow arbitrary characters: they should be `[a-zA-Z_]\w*` only #65
Comments
That would just make it more complex (in terms of naming things, not parsing things), the current regex (as far as my understanding goes) is |
Keys should contain only letters, numbers or underscore (`_`) characters, and should start with a letter or underscore. See toml-lang#65 for discussion.
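The rule in that commit message can be sketched as a small check in Python. This is only an illustration: `KEY_PATTERN` and `is_valid_key` are hypothetical names, not part of any TOML parser, and the sketch assumes "letters" and "numbers" mean ASCII only.

```python
import re

# ASCII-only identifier pattern from the issue title.
# re.ASCII matters: without it, Python's \w also matches non-ASCII
# letters and digits, which is exactly what this proposal wants to exclude.
KEY_PATTERN = re.compile(r"[a-zA-Z_]\w*", re.ASCII)


def is_valid_key(key: str) -> bool:
    """Return True if the entire key is an ASCII identifier."""
    # fullmatch requires the pattern to cover the whole string,
    # so trailing spaces or newlines are rejected as well.
    return KEY_PATTERN.fullmatch(key) is not None
```

With this check, `is_valid_key("server_1")` passes while `is_valid_key("1server")` and `is_valid_key("clé")` are rejected.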
One disadvantage of allowing absolutely any non-whitespace character in identifiers is that assignment becomes whitespace sensitive (one of the major warts of most shell scripting languages). If arbitrary characters (including "=") are allowed in identifiers, the documentation should note that whitespace sensitivity. |
Diacritics are useless in English, but semantic in a lot of other languages. I don't think that TOML should be English-centric. A clean way to handle consecutive white space in keys would be to coalesce them into a single space after parsing (and convert tabs to space). |
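The whitespace coalescing suggested above could look roughly like this in Python. The function name `normalize_key_whitespace` is hypothetical, and trimming leading and trailing spaces is an extra assumption that the comment does not state.

```python
import re


def normalize_key_whitespace(key: str) -> str:
    """Collapse runs of spaces and tabs in a key into a single space."""
    # Tabs are converted to spaces as part of the same substitution.
    collapsed = re.sub(r"[ \t]+", " ", key)
    # Assumption: surrounding whitespace is not meaningful, so trim it.
    return collapsed.strip(" ")
```

Under this sketch, `"my \t  key"` and `"my key"` end up as the same key. Other Unicode space characters are deliberately left untouched.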
@pygy I'm not trying to hate on the French. I'm just saying that every example of a config file that's been mooted fits the form of keys with a) NO non-ascii characters, b) exactly one joining character ( Yes: the current spec allows for arbitrary content in keys. So, folks: please respond either:
|
@mrflip, there are more languages in the world than French and English. TOML is intended to be used by people, not machines, and it should IMO be as user friendly as possible. Technically I agree with 2, but I'd just blacklist a few potentially confusing characters, like |
Not at all. That is as simple as possible. Limiting the possible characters makes one need to think (or worse, look up the spec to double check) which characters are allowed. The only characters not allowed are the Even I would think allowing anything except
Ruby is my favorite, which allows identifiers in e.g. Japanese. This is valid:
I thought Perl had no limitations either. C# accepts UTF-8 characters in identifiers. PHP identifiers allow characters in the ASCII range 127-255. |
@lawrencepit just for completeness, PHP accepts pretty much anything that is not a reserved keyword as identifier, so the whole UTF8 range is available as well. Anyway I tend to agree that keys are quite ok as they are. Restricting control/non-printable chars might be a good thing but I would not go further. |
If you require ASCII-alphanum-only now, you can expand the character range later. Once you open the doors, you'll have to accept them for all time or deprecate them later. And nobody has yet brought forth an example of a configuration file in the wild that uses characters outside of `[\w-]`.
But if this moves forward with non-ASCII-alphanum allowed, then there are a few decisions to make at the outset. Python's PEP-3131 has a good discussion of the issues.
Decisions
If non-ASCII characters are permitted in identifiers, you need to make three main decisions: 1) what normalization is performed, if any; 2) is that required in advance, or are parsers required to do so; 3) what characters are allowed.
Normalization
From Unicode TR-15:
In practice:
(All of these languages can handle text in any normalization; this is offered to show the different decisions you can make.) NFC seems like the way to go.
Who Normalizes
In either of the last two options you're going to have subtle bugs and security holes. No user will understand how to do that conversion, and by design things in different normalizations appear identical outside of byte-level inspection. The robust choice is to say that parsers must perform normalization while parsing.
Character Ranges
I'm unconvinced that "It's simpler to go with a loose character range". You already have a parser making a decision based on a character range. The question is which one: kitchen-sink ( From Unicode TR-15:
In practice, languages vary; see rosettacode:
There's a set of really good reasons why the Unicode folks say not to allow any character in identifiers. Just a few:
If non-ASCII characters are allowed in identifiers, I would go with the Unicode recommendation.
Summary of Pros and Cons
Pros:
Cons (several taken from PEP-3131):
@mojombo we could really use your thoughts here. |
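To make the "parsers must perform normalization while parsing" option concrete, here is a rough Python sketch. `normalize_key` and `build_table` are hypothetical helpers, not taken from any TOML implementation, and treating keys that collide after NFC as an error is an assumption of the sketch.

```python
import unicodedata


def normalize_key(key: str) -> str:
    """Return the key in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", key)


def build_table(pairs):
    """Build a dict from (key, value) pairs, normalizing each key to NFC."""
    table = {}
    for raw_key, value in pairs:
        key = normalize_key(raw_key)
        # Assumption: two spellings of the same key are a duplicate-key error.
        if key in table:
            raise ValueError(f"duplicate key after NFC normalization: {key!r}")
        table[key] = value
    return table
```

With this approach, a file that writes `café` with a precomposed `é` and one that writes it with `e` plus a combining accent produce the same table.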
Just a quick reaction to "a config file in the wild", I saw in other Issues reported here e.g. a Chef config that looked like this format:
Translation files :
|
Are key values identifiers? To me a key value is anything you can push into the key part of a hash object. An identifier is something else. TOML maps to a hash. Question is whether any hash object should be serializable into TOML. |
@lawrencepit I guess that's the meat of it: does TOML describe a nested structure of configurable identifiers with arbitrary values, or does it model a generic data structure of arbitrary key-arbitrary value pairs? If the model is arbitrary key-value pairs, then not only should the character range be unrestricted, there must be a robust escaping mechanism -- saying "oh yeah, you can store any string you want, except not URLs because those have dots" isn't a great compromise. If keys are identifiers, then the character set should be restricted and you don't need an escaping facility within key names. Even now, an array can represent an ordered list of arbitrary key-value pairs, but adding sugar in the form of anonymous hashes would then make sense. When I look at the range of configuration files, I predominantly see namespaced identifiers with arbitrary lightweight contents. I already have formats for describing an arbitrary data structure; I want one that tastefully constrains and encourages readable configuration, and that's why I'm in favor of the identifier-value model. |
@mrflip The updated README and issue #27 seem to suggest TOML's intention is to be a config file format, not a simpler/safer/better readable replacement for YAML/JSON as a data exchange format. Then I tend to agree with you. Which characters to allow within keys is probably always contentious, but I'd go with your suggested
In your suggested commit it's not clear to me if you'd also apply this to keygroup names. If not, why not? |
Key group names should be identifiers, separated by dots, with no other characters.
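One way to read "identifiers, separated by dots, with no other characters" is as a single regular expression. The Python sketch below is illustrative only: `IDENT`, `KEYGROUP_PATTERN` and `is_valid_keygroup` are hypothetical names, and the identifier rule is assumed to be the ASCII one from the issue title.

```python
import re

# One identifier: a letter or underscore, then letters, digits or underscores.
IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"

# A key group name: one identifier, optionally followed by ".identifier" parts.
KEYGROUP_PATTERN = re.compile(rf"{IDENT}(?:\.{IDENT})*")


def is_valid_keygroup(name: str) -> bool:
    """Return True if name is a dot-separated list of ASCII identifiers."""
    return KEYGROUP_PATTERN.fullmatch(name) is not None
```

Under this reading, `servers.alpha` is accepted, while `servers..alpha`, `servers.` and `servers.alpha-1` are rejected.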
|
Don't forget about numbers.. If you look at hard_example.toml you will see a |
@emancu Lawrence On 4 Mar 2013, at 13:33, Emiliano Mancuso wrote:
Sent from my iWatch |
@lawrencepit oh you are right, I thought the subject was describing the whole regexp inside |
In my opinion, all ASCII characters should be supported, except for:
|
I just got hit by a "NO-BREAK SPACE" character at the end of a key in a config file. Invisible to the eye, it took a long time to track down this problem, which would not have happened at all if the TOML parser didn't allow non-breaking spaces in keys. Fwiw: I'm a non-native English speaker who currently works in China. I'm clearly aware of languages other than English with non-ASCII characters, but in this case I don't think Unicode brings any major advantages compared to its disadvantages. |
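A check along the following lines could surface that kind of problem early. It is only a sketch in Python: `find_suspicious_chars` is a hypothetical helper, and the set of Unicode general categories treated as suspicious (separators, control and format characters) is an assumption, not something the spec defines.

```python
import unicodedata

# Assumed "hard to see" categories: space/line/paragraph separators,
# control characters and format characters.
SUSPICIOUS_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def find_suspicious_chars(key: str):
    """Return (index, name) pairs for whitespace-like or invisible characters."""
    found = []
    for index, ch in enumerate(key):
        if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            # Control characters have no Unicode name, so fall back to U+XXXX.
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            found.append((index, name))
    return found
```

For a key such as `"port\u00a0"`, the helper reports the trailing NO-BREAK SPACE along with its position, which a parser or linter could turn into an error message.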
Invisible characters could be banned or normalized in a simple fashion. Some of them are apparently meaningful though, like the invisible
This is probably not exhaustive. Visible space could be normalized to a single space (0x20) and invisible,
-- Pierre-Yves
On Mon, Mar 24, 2014 at 5:06 AM, Victor wrote:
|
I definitely don't want to be imposing Unicode normalization on parsers. That seems like too much of a burden. I'm skeptical of the concerns mentioned. If introducing hidden Unicode characters into a config file could present a security concern, then I think the onus is on the application to catch that. From what I'm seeing, there are two "right" solutions:
I personally don't want to do either. @mojombo? |
I know I'm late to the discussion, but there is a 3rd option. Don't normalise. If someone mixes Unicode characters like that, they are asking for trouble. Just pass on the concern to the application. |
@mirhagk 👍 |
Another suggestion: simple normalisation mapping onto
Japanese could be normalised the same way too, though may not be so readable in normalised form. Implementations will need a normalisation dictionary; this can be copy+paste from a canonical version. Question: when Are name collisions going to be an issue? In my experience no, though others' experience may differ. As I see it, the only alternatives are to restrict to only a sub-set of ASCII, or to use a more complete version of Unicode normalisation (which has a few disadvantages: potentially complex to implement, users may not be able to enter characters, characters may not always render correctly). |
@mirhagk +1 |
Not allowing whitespace (spaces and |
Resolved by #283. |
I'm surprised that keys are allowed to be arbitrary characters. What is the configuration file use case for this? Can those cases be just as well handled as a literal hash, or as an array of pairs? Allowing funny characters seems to go against the "simple as possible" ethos.
I think of keys as living in the control path, not data path, and thus predictability should win over expressiveness (no matter how fun it would be to configure `☃.♛=シ`). I can't say for sure what would go wrong, but allowing nulls, vertical tab characters, funny unicode spaces and so forth sounds like an eventual security flaw. Unicode opens a lot of "there should only be one way to do anything" holes: for just one example, the strings `dīáçṙïtĭč` and `dīáçṙïtĭč` are semantically equivalent but not byte-comparable (one has combining diacritics, the other precomposed). Are two keys equal if they are character-comparable, or only if they are byte-comparable?
The spec should require keys to be identifier-like: they must start with a letter or underbar, and contain only letters, underbars or numbers. That is: `[a-zA-Z_]\w*`.
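The byte-versus-character question can be demonstrated with a short Python sketch. The sample strings and the helper `keys_equal` are made up for illustration. A simpler word than the one in the post is used, so the two encodings are explicit in the source.

```python
import unicodedata

# The same visible word, spelled two ways.
precomposed = "caf\u00e9"   # "é" as one code point (U+00E9)
combining = "cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT (U+0301)


def keys_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two keys after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Without normalization the strings differ, both as code points and as bytes.
differ_raw = precomposed != combining
differ_bytes = precomposed.encode("utf-8") != combining.encode("utf-8")

# With normalization they compare equal.
same_normalized = keys_equal(precomposed, combining)
```

A parser that compares keys byte-for-byte would treat these as two distinct keys. Restricting keys to ASCII identifiers avoids the question entirely.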