Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keys should not allow arbitrary characters: they should be [a-zA-Z_]\w* only #65

Closed
mrflip opened this issue Feb 24, 2013 · 26 comments
Closed

Comments

@mrflip
Copy link

mrflip commented Feb 24, 2013

I'm surprised that keys are allowed to be arbitrary characters. What is the configuration file use case for this? Can those cases be just as well handled as a literal hash, or as an array of pairs? Allowing funny characters seems to go against the "simple as possible" ethos.

I think of keys as living in the control path, not data path, and thus predictability should win over expressiveness (no matter how fun it would to configure ☃.♛=シ). I can't say for sure what would go wrong, but allowing nulls, vertical tab characters, funny unicode spaces and so forth sounds like an eventual security flaw. Unicode opens a lot of "there should only be one way to do anything" holes: for just one example, the strings dīáçṙïtĭč and dīáçṙïtĭč are semantically equivalent but not byte-comparable (one has combining diacritics, the other precomposed). Are two keys equal if they character comparable, or only if they are byte-comparable?

The spec should require keys to be identifier-like: they must start with a letter or underbar, and contain only letter, underbar or number. That is: [a-zA-Z_]\w*.

@C0mkid
Copy link
Contributor

C0mkid commented Feb 24, 2013

That would just make it more complex (in terms of naming things, not parsing things), the current regex (as far as my understanding goes) is \n[\s\t]*(.*?)[\s\t]*=

mrflip pushed a commit to mrflip/toml that referenced this issue Feb 24, 2013
Keys should contain only letters, numbers or underscore (`_`) characters, and should
start with a letter or underscore. See toml-lang#65 for discussion.
@hyperpape
Copy link

One disadvantage of allowing absolutely any non-whitespace character in identifiers is that assignment becomes whitespace sensitive (one of the major warts of most shell scripting languages).

If arbitrary characters (including "=") are allowed in identifiers, the documentation should note that whitespace sensitivity.

@pygy
Copy link
Contributor

pygy commented Feb 24, 2013

Diacritics are useless in English, but semantic in a lot of other languages. I don't think that TOML should be English-centric.

A clean way to handle consecutive white space in keys would be to coalesce them into a single space after parsing (and convert tabs to space).

@mrflip
Copy link
Author

mrflip commented Feb 25, 2013

@pygy I'm not trying to hate on the French. I'm just saying that every example of a config file that's been mooted fits the form of keys with a) NO non-ascii characters, b) exactly one joining character (_ most places, - in lisp), c) no whitespace. There's nothing english-centric about that, any more than your favorite programming language's identifier rules are.

Yes: the current spec allows for arbritrary content in keys.
This issue is saying "That is a bad idea! You should not allow arbritrary content in keys! Here is why: (see above)"

So, folks: please respond either:

  1. "Keys SHOULD allow arbitrary content. Here is why, in a config file, that is absolutely necessary: (... fill in...)"
  2. "Keys SHOULD NOT allow arbitrary characters, I agree. [optional: explain additional reasons if any]"
  3. or, as shorthand for (2), a simple "+1".

@pygy
Copy link
Contributor

pygy commented Feb 25, 2013

@mrflip, there are more languages in the world than French and English. TOML is intended to be used by people, not machines, and it should IMO be as user friendly as possible.

Technically I agree with 2, but I'd just blacklist a few potentially confusing characters, like =, [, and ], ", \n, ,, perhaps invisible characters (or see my proposition above), maybe a few more, and that's it. No need to restrict it to [0-9a-zA-Z_].

@lawrencepit
Copy link

Allowing funny characters seems to go against the "simple as possible" ethos.

Not at all. That is simplest as possible. Limiting the possible characters makes one need to think (or worse look up the spec to double check) which characters are allowed.

The only characters not allowed are the = and \n characters, which covers the obvious bit of TOML.

Even [, ], " can be allowed.

I would think allowing anything except = and \n also makes for the fastest possible parsers. The parser doesn't need to care about whatever is in front of the = character, whereas otherwise you need to dedicate CPU to checking whether the key is valid.

There's nothing english-centric about that, any more than your favorite
programming language's identifier rules are.

ruby is my favorite. Which allows identifiers in e.g. Japanese. This is valid:

>  持つ = 2
 => 2
>  持つ * 5
 => 10

I thought perl had no limitations either. C# accepts UTF8 characters in identifiers. PHP identifiers allow characters in the ascii range 127-255.

@Seldaek
Copy link

Seldaek commented Feb 26, 2013

@lawrencepit just for completeness, PHP accepts pretty much anything that is not a reserved keyword as identifier, so the whole UTF8 range is available as well.

Anyway I tend to agree that keys are quite ok as they are. Restricting control/non-printable chars might be a good thing but I would not go further.

@mrflip
Copy link
Author

mrflip commented Feb 26, 2013

If you require ASCII-alphanum-only now, you can expand the character range later. Once you open the doors, you'll have to accept them for all time or deprecate them later. And nobody has yet brought forth an example of a configuration file in the wild that uses characters outside of [\w-]

But if this moves forward with non-ASCII-aphanum allowed, then there are a few decisions to make at the outset. Python's PEP-3131 has a good discussion of the issues.

Decisions

If non-ASCII characters are permitted in identifiers, you need to make three main decisions. 1) what normalization is performed, if any; 2) is that required in advance, or are parsers required to do so; 3) what characters are allowed.

Normalization

From Unicode TR-15:

Normalization as described in this report can be used to avoid problems where apparently identical identifiers are not treated equivalently. To avoid such problems, ... normalize identifiers before storing or comparing them. ... [For] case-sensitive identifiers then Normalization Form C should be used.

In practice:

  • Python: "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
    • C++: "An identifier in a conforming program must be in the canonical format defined by Unicode Normalization Form C"
    • Ruby does not normalize.

(All of these languages can handle text in any normalization; this is offered to show the different decisions you can make.)

NFC seems like the way to go.

Who Normalizes

  • If you require a normalization form and require parsers do the normalization, then all parsers need a unicode library at hand.
  • If you require a normalization and specify that files must be in that normalization, then you invite ambiguity and bugs when a non-validating parser meets a non-normalized file.
  • If you require no normalization, then comparison must be done at byte level. Two identically-appearing documents will have different semantics. Browsers often normalize text at entry, so copy-pasting a document into a gist becomes ill-defined behavior.

In either of the last two options you're going to have subtle bugs and security holes. No user will understand how to do that conversion, and by design things in different normalizations appear identical outside of byte-level inspection.

The robust choice is to say that parsers must perform normalization while parsing.

Character Ranges

I'm unconvinced that "It's simpler to go with a loose character range". You already have a parser making a decision based on a character range. The question is which one: kitchen-sink ([^\[\]=\,\.\r\n]+), identifier-like ([a-zA-Z_]\w*), or extended-identifier-like (see below)

From Unicode TR-15:

The Unicode Standard provides a recommended syntax for identifiers for programming languages that allow the use of non-ASCII languages in code. It is a natural extension of the identifier syntax used in C and other programming languages:
<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
That is, the first character of an identifier can be an uppercase letter, lowercase letter, titlecase letter, modifier letter, other letter, or letter number. The subsequent characters of an identifier can be any of those, plus non-spacing marks, spacing combining marks, decimal numbers, connector punctuations, and formatting codes (such as right-left-mark). Normally the formatting codes should be filtered out before storing or comparing identifiers.

In practice, languages vary; see rosettacode:

  • Python: "uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, ... followed by all characters in ID_Start, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), and characters carryig the Other_ID_Continue property", closed under normalization.
  • C++: letter (Lu, Ll, Lt, Lm, Lo, or Nl) or underscore (_ only), followed by
    letter, decimal-digit (Nd), connecting-character (Mn or Mc), combining-character (Pc) or formatting-character (Cf).
  • Java: letter, followed by letter, currency, connecting punctuation, digit, numeric, combining, or non-spacing.

There's a set of really good reasons why the Unicode folks say not to allow any character in identifiers. Just a few:

  • If you allow a digit as the start character of an identifier, you are in a world where you're allowing things like { "1": "zeroth", "0": "first" }. In a config file. Yikes.
  • Allowing Bidi leads to really diabolical security holes.
  • Allowing the whole range of characters means you've only got compatibility with a subset of languages.
  • Allowing nulls is a disaster in several languages, esp. C.

If non-ascii characters are allowed in identifiers, I would go with the unicode recommendation.

Summary of Pros and Cons

Pro's:

  • People want the expressiveness of identifiers outside the ASCII range

Cons (several taken from PEP-3131):

  • You can allow them in a later version of the spec, you can't un-allow them.
  • Nobody has brought forth an example of a config file in the wild that uses, let alone needs, characters outside of [\w-].
  • Complicates the ability of libraries to provide method-like sugar on a config object.
  • lose the ability to make a reliable round trip to a human-readable display on screen or on paper.
  • TOML becomes vulnerable to a new class of security exploits; code and submitted patches will be much harder to inspect.
  • Humans will no longer be able to validate syntax.
  • The Unicode bidi algorithm yields an extremely confusing display order for RTL text when digits or operators are nearby.
  • A raft of subtle edge-cases with inter-language compatibility

@mojombo we could really use your thoughts here.

@lawrencepit
Copy link

Just a quick reaction to "a config file in the wild", I saw in other Issues reported here e.g. a Chef config that looked like this format:

hosts:
  a.b.c.example.com:
    ip: 1.2.3.4
    note: abc
  d.e.f.example.com:
    ip: 10.0.0.1
    note: efg

Translation files :

fr:
  "Add Task": "Ajouter Tâche"
  "First Name": "Prénom"

@lawrencepit
Copy link

Are key values identifiers? To me a key value is anything you can push into the key part of a hash object. An identifier is something else. TOML maps to a hash. Question is whether any hash object should be serializable into TOML.

@mrflip
Copy link
Author

mrflip commented Feb 27, 2013

@lawrencepit I guess that's the meat of it: does TOML describe a nested structure of configurable identifiers with arbitrary values, or does it model a generic data structure of arbitrary key-arbitrary value pairs?

If the model is arbitrary key-value pairs, then not only should the character range be unrestricted, there must be a robust escaping mechanism -- saying "oh yeah, you can store any string you want, except not URLs because those have dots" isn't a great compromise.

If keys are identifiers, then the character set should be restricted and you don't need an escaping facility within key names. Even now, an array can represent an ordered list of arbitrary key-value pairs, but adding sugar in the form of anonymous hashes would then make sense.

When I look at the range of configuration files, I predominantly see namespaced identifiers with arbitrary lightweight contents. I have already formats for describing an arbitrary data structure; I want one that tastefully constrains and encourages readable configuration, and that's why I'm in favor of the identifier-value model.

@lawrencepit
Copy link

@mrflip The updated README and issue #27 seem to suggest TOML's intention is to be a config file format, not a simpler/safer/better readable replacement for YAML/JSON as a data exchange format. Then I tend to agree with you.

Which characters to allow within keys is probably always contentious, but I'd go with your suggested [a-zA-Z_]\w* as a start, yes.

Name your key groups whatever crap you please, just don't use funny characters.

In your suggested commit it's not clear to me if you'd also apply this to keygroup names. If not, why not?

@mrflip
Copy link
Author

mrflip commented Feb 28, 2013

Key group names should be identifiers, separated by dots, with no other characters.

\[([a-zA-Z_]\w*)(\.[a-zA-Z_]\w*)*\]

@emancu
Copy link
Contributor

emancu commented Mar 4, 2013

Don't forget about numbers..

If you look at hard_example.toml you will see a test_array2 = ... that could be a perfect name for a keygroup too.

@lawrencepit
Copy link

@emancu \w matches numbers. (to be precise: matches letters, digits
and underscores, it's the same as [A-Za-z0-9_])

Lawrence

On 4 Mar 2013, at 13:33, Emiliano Mancuso wrote:

Don't forget about numbers..

If you look at
hard_example.toml
you will see a test_array2 = ... that could be a perfect name for a
keygroup too.


Reply to this email directly or view it on GitHub:
#65 (comment)

Sent from my iWatch

@emancu
Copy link
Contributor

emancu commented Mar 7, 2013

@lawrencepit oh you are right, I though the subject was describing the whole regexp inside [] I didn't see the last \w*. My bad.

@88Alex
Copy link

88Alex commented Jun 26, 2013

In my opinion, all ASCII characters should be supported, except for:

  • Unprintable characters (0x00 - 0x19)
  • Whitespace (space, tab, newline)
  • TOML syntax characters ( = [ ] " ' # )

@viblo
Copy link

viblo commented Mar 24, 2014

I just got hit by a "NO-BREAK SPACE" character in the end of a key in a config file. Invisible to the eye it took a long time to track down this problem which would not have happened at all if the toml parser didnt allow nonbreaking spaces in keys.

Fwiw: Im a non-native English speaker that currently works in China. Im clearly aware of other languages than English with non-ascii characters, but in this case I dont think unicode bring any major advantages compared to its disadvantages.

@pygy
Copy link
Contributor

pygy commented Mar 24, 2014

Invisible characters could be banned or normalized in i simple fashion.

Some of them are apparently meaningful though, like the invisible
multiplication sign or the invisible function application (these are IMO
rather esoteric). There's also an invisible vowel separator in Mongolian,
and 2^16 "variation signselectors" that specify script declinations.

This is probably not exhaustive.

Visible space could be normalized to a single space (0x20) and invisible,
zero-width characters could be dropped altogether, or the files may be
rejected as invalid.

-- Pierre-Yves

On Mon, Mar 24, 2014 at 5:06 AM, Victor [email protected] wrote:

I just got hit by a "NO-BREAK SPACE" character in the end of a key in a
config file. Invisible to the eye it took a long time to track down this
problem which would not have happened at all if the toml parser didnt allow
nonbreaking spaces in keys.

Fwiw: Im a non-native English speaker that currently works in China. Im
clearly aware of other languages than English with non-ascii characters,
but in this case I dont think unicode bring any major advantages compared
to its disadvantages.


Reply to this email directly or view it on GitHubhttps://github.com//issues/65#issuecomment-38410910
.

@BurntSushi
Copy link
Member

I definitely don't want to be imposing Unicode normalization on parsers. That seems like too much of a burden.

I'm skeptical of the concerns mentioned. If introducing hidden Unicode characters into a config file could present a security concern, then I think the onus is on the application to catch that.

From what I'm seeing, there are two "right" solutions:

  1. Keep it as is, but require Unicode normalization.
  2. Revert to traditional identifier characters.

I personally don't want to do either. @mojombo?

@mirhagk
Copy link

mirhagk commented Jul 28, 2014

I know I'm late to the discussion, but there is a 3rd option. Don't normalise. If someone mixes Unicode characters like that, they are asking for trouble. Just pass on the concern to the application.

@ChristianSi
Copy link
Contributor

@mirhagk 👍

@dhardy
Copy link

dhardy commented Oct 8, 2014

Another suggestion: simple normalisation mapping onto [a-zA-Z_], allowing only characters with defined normalisation rules. These can then be added bit-by-bit as necessary. Examples of possible rules:

- → _
@ → _at_
é → e
ö → oe

Janapense could be normalised the same way too, though may not be so readable in normalised form.

Implementations will need a normalisation dictionary; this can be copy+paste from a canonical version.

Question: when . and are used in quoted keys, should these map to_? Should. be its own character, even though it is only usable when quoted? Or maybe it is preferable to use normalisation only for non-ASCII characters?

Are name collisions going to be an issue? In my experience no, though others' experience may differ.

As I see it, the only alternatives are to restrict to only a sub-set of ASCII, or to use a more complete version of Unicode normalisation (which has a few disadvantages: potentially complex to implement, users may not be able to enter characters, characters may not always render correctly).

@cies
Copy link
Contributor

cies commented Oct 8, 2014

@mirhagk +1

@cies
Copy link
Contributor

cies commented Oct 8, 2014

Not allowing whitespace (spaces and \t\n\{etc}) would be, IMHO, great from a readablity perspective.

@mojombo
Copy link
Member

mojombo commented Jan 15, 2015

Resolved by #283.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests