Mixed-type arrays still legal, even though the intent appears to have been to forbid them #553
FWIW, the current README (https://github.com/toml-lang/toml#array) says:
Explicitly stating that arrays of mixed-type sub-arrays should be considered the same type (array of array, I presume). I don't know what Haskell will make of that. I don't have any skin in this game, so I'll leave it to you guys to hammer out. |
Yes, that's precisely the part of the README that I meant to quote in my issue description, but I failed to actually quote it. Thanks! |
I had actually implemented array typing in my parser before I re-read that section and realized they were actually allowed. That said, I'm not sure the typing argument holds much water, as tables themselves already allow a mixture of types for values - presumably parsers which can't support heterogeneous arrays would need to handle them with the same care that they take to handle tables which don't map cleanly to some type they've defined (whether that's a struct, or just a typed map). I doubt a language which does not permit arrays like that would permit heterogeneous maps either. Either way, this problem already exists with JSON for example, as far as I'm aware all of those languages have mechanisms to cope with the dynamic nature of the format - why would TOML be any different? |
What is the use-case for things like |
How about something simpler, which demonstrates why the definition of "type" is pretty loosely used in the spec: deps = [
{ name = "foo", path = "../foo" },
{ name = "bar", url = "http://....", port = 8080 },
] According to the spec, these are the same type, table; but from a strongly typed perspective, these are different types, similar for sure, but different. I would guess this type of config is quite common, or at least common enough. I don't really have a problem with the inconsistency, but it's hard to argue that this isn't at least a little confusing to people. |
Although I missed all of the past arguments leading to homogenous arrays, from the spec alone, the typing seems simple enough to understand. TOML typing is "strong but shallow," I'd call it. There's a fixed list of types (String, Integer, Float, Boolean, Datetime, Array, Inline Table) and every value has one of those types. As for arrays, all elements of a given array are the same type (call it the array's "subtype" if you wish), that is, one of the following: String, Integer, Float, Boolean, Datetime, Array, or Inline Table. So an Array of Arrays is properly homogenous in TOML v0.5.0, even if the subtypes of the inner arrays differ. The subtypes don't matter outside of the inner arrays. Not knowing Haskell, I don't know the burden that the TOML types bring upon the language, nor how they can be properly handled. But surely a specific config format demands specific types be used for a config file to be considered valid. |
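To make the "strong but shallow" reading concrete, here is a minimal Python sketch, assuming the TOML document has already been parsed into plain Python values. The helper names (toml_type, is_shallowly_homogeneous, deep_type, is_deeply_homogeneous) are hypothetical and not part of any TOML library; the "deep" rule is only one possible stricter interpretation, not something the spec defines.

```python
from datetime import date, datetime, time


def toml_type(value):
    """Return the shallow TOML type name of an already-parsed value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "String"
    if isinstance(value, (datetime, date, time)):
        return "Datetime"
    if isinstance(value, list):
        return "Array"
    if isinstance(value, dict):
        return "Inline Table"
    raise TypeError(f"not a TOML value: {value!r}")


def is_shallowly_homogeneous(arr):
    """True if every element has the same top-level TOML type."""
    return len({toml_type(item) for item in arr}) <= 1


def deep_type(value):
    """Type signature that also records the element types of nested arrays.

    This is a hypothetical stricter rule, used here only for comparison.
    """
    if isinstance(value, list):
        inner = sorted({deep_type(item) for item in value})
        return "Array[" + "|".join(inner) + "]"
    return toml_type(value)


def is_deeply_homogeneous(arr):
    """True if every element has the same signature, nested arrays included."""
    return len({deep_type(item) for item in arr}) <= 1


nested = [[1, 2], ["a", "b", "c"]]
shallow_ok = is_shallowly_homogeneous(nested)  # both elements are "Array"
deep_ok = is_deeply_homogeneous(nested)        # Array[Integer] vs Array[String]
mixed_ok = is_shallowly_homogeneous([1, 2.0])  # Integer vs Float
```

Under the shallow rule the nested example passes while [1, 2.0] is rejected, which is the asymmetry this thread is arguing about.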
If the purpose of homogeneous arrays in TOML is to prevent issues with strongly typed languages decoding TOML data structures, then allowing an array of arrays with different subtypes defeats that goal, as you either have to treat the type of the outer array as the common supertype of all the inner arrays, sacrificing the utility of types, or rely on whatever reflection mechanism the language provides, which usually incurs expensive overhead and is basically an escape hatch anyway. I don't think anyone is debating whether types in the spec are valuable, they definitely are (in my opinion), and are a strength of TOML over things like JSON. That said, there doesn't appear to be a good justification, from a type system perspective, why heterogeneous arrays are disallowed, but arrays of arrays with different subtypes are allowed - it's unsound. Let's take it at face value that arrays have a type of I think TOML should make a choice, post-1.0, to either commit to subtyping |
@bitwalker's example from #553 (comment) is a good starting point to explain why I'd like "array of mixed array types" to be forbidden. As I implement a TOML parser in F# (which, like C#, is strongly-typed), I have to choose how to represent arrays in the values I return to the user. Arrays of integers or strings will simply be Now, if the only thing I represent as Permitting heterogenous tables does present a few difficulties too, of course, but there are very good use cases for it. But there are no good use cases for heterogenous arrays given that tables exist; for any valid use for heterogenous arrays (such as Therefore, allowing arrays of arrays to have mixed types is not only a weird inconsistency in the spec, it also makes things more difficult for strongly-typed languages without a good reason for creating that difficulty. And so I'm arguing against having it in there. Also, both @mojombo and @BurntSushi appear to have intended to remove the "arrays of arrays of mixed types" example from the spec, but it seems to have fallen through the cracks, which is another reason I want to revive this discussion before 1.0 hits. |
I think that ship has sailed. The current spec says:
It's certainly humanly possible to leave Array of (mixed) Array in the spec, so unless the spec writers want to declare themselves liars, we are talking about a possible change to TOML 2.0. Personally, I think the current decision is quite reasonable. As @eksortso said, TOML's typing is "strong but shallow." It would be madness to force a type upon tables, so arrays of (untyped) inline tables must remain allowed, but allowing arrays of (arbitrary) tables while forbidding arrays of (arbitrary) arrays would feel weird and inconsistent to me. Also, as a Haskell programmer I can confirm that Haskell can deal with JSON's extremely weak typing just fine, so TOML's much stronger (though not perfectly strong) typing certainly won't present insurmountable obstacles. |
Except that since there's no use case that I can think of, and therefore nobody using it in practice, there's no downside to removing it. The goal was stability before 1.0, but a goal can be changed with a good enough reason: and "Oops, we meant to do this but forgot" is, IMHO, a good enough reason when there's zero downside and a significant benefit (ease of use in strongly-typed languages). Is anyone using this in practice? Please speak up if you are; that would be a VERY strong argument against removing this weird corner case, because if it's actually useful to someone then it's not a weird corner case. But if nobody is using it in practice, I'd argue that removing it comes at basically no cost, since in the languages where this matters it's actually harder to allow mixed arrays than it is to forbid them. |
In other languages such as Java it's more tedious to forbid them than to allow them. You'll have to check the class of the elements, which on a shallow level is fine but not on a deep level because lists all have the same type at run-time. Regarding your library: shouldn't an array of tables be something like |
Hi, For statically/strongly typed language, the representation of arrays of mixed values is a solved issue: it should be represented as a sum type. So you'd have a
You can treat tables as a There was an example above (#553 (comment)): deps = [
{ name = "foo", path = "../foo" },
{ name = "bar", url = "http://....", port = 8080 },
] From a parser point of view, this is an array of tables. From a consumer point of view this is an array of strongly typed resource references, a sum type between a local file and HTTP document: it makes no sense for TOML to try to guess the type of the items and enforce uniformity, so it defaults to a lower common denominator: But how different is this case from (for example) a list of tuples represented as arrays? Sure, some languages like Java are against tuples because they are not explicit, but on the other hand they allow packing a lot of information without noise. Preventing mixed-type arrays prevents me from using TOML to represent a group of entities as a list of tuples:
It's good that TOML has a few intrinsic types, but enforcing schema validation (even locally) should be out of scope of a data representation language. TOML should remain language agnostic and avoid this kind of homogeneity constraint. |
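As an illustration of the sum-type idea, here is a rough Python sketch of how a consumer might decode the deps array of tables into a tagged union. The class names (LocalDep, RemoteDep), the decode_dep helper, and the placeholder URL are all assumptions made for this example; the original example elides the URL, and nothing here comes from a real TOML library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LocalDep:
    name: str
    path: str


@dataclass(frozen=True)
class RemoteDep:
    name: str
    url: str
    port: int


# The consumer-side sum type: a dependency is one variant or the other.
Dep = Union[LocalDep, RemoteDep]


def decode_dep(table: dict) -> Dep:
    """Pick a variant based on which keys the parsed table carries."""
    keys = set(table)
    if keys == {"name", "path"}:
        return LocalDep(**table)
    if keys == {"name", "url", "port"}:
        return RemoteDep(**table)
    raise ValueError(f"unrecognized dependency table: {sorted(keys)}")


# What a parser would hand back for the deps example (URL is a placeholder).
deps_tables = [
    {"name": "foo", "path": "../foo"},
    {"name": "bar", "url": "http://example.invalid", "port": 8080},
]
deps = [decode_dep(table) for table in deps_tables]
```

The parser only sees "array of tables"; the decision about which shapes are acceptable lives in decode_dep, that is, in the application's schema rather than in TOML.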
@demurgos Arrays are logically different from tuples:
Also, there are named tuples where each of the values expected by a tuple has an explicit name (in regular tuples, each value only has an implicit name: someone must know what it represents, but that information is not part of the tuple). Every tuple can be converted to a named tuple by making the implicit information explicit. TOML doesn't have named tuples, but (as you know) it has tables – and, in particular, inline tables – which serve exactly the same purpose. Your example can also be represented nicely as an array of inline tables:
Inline tables have the advantage of being self-documenting, which also prevents the risk of undetected errors. #154 was a proposal to add (unnamed) tuples to TOML, but it was rejected in favor of inline tables – quite reasonably, I'd say. |
@ChristianSi To decode them from this TOML representation and retrieve their high-level type, you need context (what I call the schema). If we go on, TOML does not have sets, queues, graphs, etc. But I can use it to represent them. TOML provides some primitive types that I can use as an intermediate abstraction. The primitive "array of any legal TOML value" is more useful than "array of TOML values with the same TOML type". Another way to look at it is that mixed type arrays allow you to have uniform type arrays, but not the other way around. It allows users to choose the behaviour they want. For example in Python, checking for uniformity is a one-liner: assert all(type(item) == type(arr[0]) for item in arr) My main point is that TOML does not have enough information to decide to enforce this check or not. Forcing this check at the spec level artificially restricts the possible usages. |
I'd like to point out that tables can be used to represent arrays of non-uniform arrays.
This is exactly the reasoning used to include inline tables instead of tuples. If you have non-homogeneous data, each element of the data likely has a reasonable name that could be associated with it. Likewise, if you have an array of arrays, where each inner array needs a different data-type, I'd claim that each inner array has a reasonable name that could be associated with it. The shallow nature of TOML's type system is confusing. My (unimportant) opinion would be to either go all out (arrays are strongly typed based on their subtype) or avoid it altogether (arrays don't care about subtype). My preference would be for the first, because type validation can be very useful. As pointed out, this would make it incompatible with v0.5.0. But in the last 4 months, no one has piped up claiming to actually use arrays of non-uniform arrays. |
You lose ordering with this scheme (and it hurts semantics). If you don't need ordering it's fine, but it prevents you from differentiating the following situations: letters followed by numbers (
I agree with this observation. I am still not convinced that TOML is well suited for validation. Its type system is oriented toward ensuring that data is well-formed. Asking TOML to handle validation would require adding abstractions such as interfaces and would complicate implementations. I'd prefer TOML to focus only on data representation and leave validation to higher levels (such as IDLs). |
There will always be legal TOML configs that a program does not allow, like: lastName = "LongTeng Dao"
So, if a legal value is not supported by some program, the program can just specify a rule of its own... I can't see any difference between the illegal case and the error case. |
Do you have an example of where the ordering of non-uniform unnamed items matter? The closest example coming to my mind is in specifying a data-format, at which point a table is probably a better candidate anyway:
When do you find yourself wanting a clear ordering of unnamed items of different types? |
The most common situation I have seen is config files that allow mixing strings and tables, where the table corresponds to the parsed version of the string (or adds extra parameters).
[package]
contributors = [
"Foo bar <[email protected]>",
{ name = "Baz Qux", email = "[email protected]", url = "baz.qux.com"},
] Uplink servers in private package managers: [uplink]
servers = [
"https://srv1.uplink.com/",
{ url = "https://private.uplink.com/", auth_token = "..." },
] In these cases, ordering matters for core logic (in which order to print the contributors, which uplink server has priority in case of inconsistency), but ordering is very useful for seemingly unordered values when you want determinism (for example to run your test suite and compare the results). Regarding the example you posted with the tables, this is a common case. It currently works because of the shallow typing of TOML. This is also the kind of situation that I fear may break if you ask TOML to do some deep checks: how is the decoder supposed to know that these values are compatible? I just want to reiterate that your method is useful and may solve many issues, but not all. I just think that a more general solution is possible. |
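For what it's worth, here is a small Python sketch of how an application could consume such a mixed array while keeping its order: each string entry is parsed into the same table shape as the explicit entries. The normalize_contributor helper, the "Name <email>" pattern, and the sample addresses are assumptions for illustration only (the addresses in the examples above are redacted).

```python
import re

# Assumed compact form: "Display Name <email>".
_CONTACT = re.compile(r"^\s*(?P<name>[^<]+?)\s*<(?P<email>[^>]+)>\s*$")


def normalize_contributor(entry):
    """Turn a string or table entry into a plain dict with name/email keys."""
    if isinstance(entry, dict):
        return dict(entry)  # already in table form; copy it unchanged
    if isinstance(entry, str):
        match = _CONTACT.match(entry)
        if match is None:
            raise ValueError(f"cannot parse contributor string: {entry!r}")
        return {"name": match.group("name"), "email": match.group("email")}
    raise TypeError(f"unsupported contributor entry: {entry!r}")


# Parsed form of a mixed contributors array (addresses are placeholders).
contributors = [
    "Foo bar <foo@example.invalid>",
    {"name": "Baz Qux", "email": "baz@example.invalid", "url": "baz.qux.com"},
]
# A list comprehension keeps the original array order.
normalized = [normalize_contributor(entry) for entry in contributors]
```

The compatibility rule between the string form and the table form is application knowledge, which is why a TOML decoder on its own cannot know that these values belong together.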
AFAICT, @demurgos's examples (of arrays which mix strings and tables) violate the TOML spec, and if they currently work, it's because some TOML parsers aren't enforcing the spec. The spec says "Data types may not be mixed", and gives as an example of mixing data types: arr6 = [ 1, 2.0 ] # INVALID Since TOML considers ints and floats to be different data types, which may not be mixed in a single array, it therefore follows that strings and tables (which are far more different than ints and floats) are also different data types, and the The obvious way to rewrite those examples to bring them into spec compliance would be: [package]
contributors = [
{ name = "Foo bar", email = "[email protected]" },
{ name = "Baz Qux", email = "[email protected]", url = "baz.qux.com"},
] [uplink]
servers = [
{ url = "https://srv1.uplink.com/" },
{ url = "https://private.uplink.com/", auth_token = "..." },
] TOML does not (and should not) require that tables in an array all have exactly the same "shape", so these |
@rmunn I brought these up because I am advocating that the spec should be relaxed to make these examples legal. I think that the current spec sits in the middle with regard to validation: it's more than simple syntax checks but not enough to provide full validation. |
@demurgos what you're advocating for is ringing similar to the various tuples proposals, such as in #154. I claim that, thanks to inline tables, TOML has two very useful "group" datatypes right now: homogeneous arrays, where elements are unnamed but strongly ordered, and non-homogeneous tables, where elements are named but not strongly ordered. Combining these two constructs together can give you any combination of homogeneous, non-homogeneous, ordered, and unordered. A problem I see is that, if arrays are not strongly typed, you can do something like this:
Which looks like someone trying to hack around a problem. You claim that the best solution is to let them do what they wanted in the first place:
While I claim that the strictness of TOML should guide them to a less succinct, but more descriptive format:
Or, if the order matters:
A more strict TOML might lead to slightly more verbose data, but I claim that the names of elements are important when the data is non-homogeneous. |
I consider @demurgos 's examples useful use cases indicating that the "array elements must have the same type" restriction should indeed be dropped from the spec. The TOML spec should not force applications to use a specific modelling style, prohibiting anything else. If app writers want to allow their contributors (or whatever else a list is used for) to specify their credentials either compactly as An additional oddity of the current situation is that not even ints and floats might be mixed in arrays. For example, consider the 1–2–5 series of preferred numbers, written in the Wikipedia (after slight TOML-ification) as follows:
Currently, TOML parsers have to reject that array, though there is no good reason for this. Every mathematically inclined person knows that 1 and 1.0 are the same number, and (while this might not be true for their digital representations) nearly all programming languages will autoconvert ints to floats as needed. TOML probably shouldn't autoconvert, but it nevertheless should allow arbitrary lists of numbers. By allowing arbitrary values in arrays, we get that for free. |
From the perspective of the spec, changing arrays from single-type to heterogeneous is a backwards-compatible change. I'm personally okay with this change. But whatever basic validation was provided by existing parsers to prevent heterogeneous arrays will likely be going away. At the very least, older parsers that properly implemented that validation could make it an optional feature instead. For small applications, users could make a simple fix to parser calls to ensure that arrays stay single-typed. For bigger applications, schema validators could take up the slack and provide a way to specify only specific types within arrays. Heads up to #116, for instance. |
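A possible shape for that "simple fix" in a small application is sketched below in Python, assuming the parser returns ordinary dicts and lists and maps each TOML type to a distinct Python type. The function name find_mixed_arrays and the sample document are hypothetical; this is an opt-in check run after parsing, not a feature of any particular parser.

```python
def find_mixed_arrays(value, path="$"):
    """Yield the path of every array whose elements differ in Python type."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from find_mixed_arrays(child, f"{path}.{key}")
    elif isinstance(value, list):
        # Shallow rule: compare only the top-level type of each element.
        if len({type(item) for item in value}) > 1:
            yield path
        for index, child in enumerate(value):
            yield from find_mixed_arrays(child, f"{path}[{index}]")


# Hypothetical parsed document with two mixed arrays and one uniform one.
doc = {
    "series": [1, 2, 5, 0.1],
    "uplink": {"servers": ["https://srv1.uplink.com/", {"url": "..."}]},
    "ports": [8080, 8081],
}
mixed = list(find_mixed_arrays(doc))
```

An application that wants the old behaviour can reject the document when this list is non-empty; one that is happy with heterogeneous arrays simply skips the check.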
fwiw, since some "but what will Haskell do" questions came up about this and this is still being referenced, I feel I should add a data point: https://typeclasses.com/phrasebook/dynamic
Though I don't think this changes anything in particular inside the conversation. |
PR #663 intends to enforce homogeneous arrays in v1.0.0. If accepted, then mixed-type arrays will remain invalid until at least v1.0.1. Just so you know, I'm doing this only to get v1.0.0 done. |
I just bit the bullet and went the other way entirely -- allowing arrays to contain heterogeneous values as argued for in #665. That change makes 1.0.0 a superset of 0.5.0, so every 0.5.0 file is still a valid 1.0.0 file. I imagine that some implementations might need updating for this. I'm already planning to do a |
Closing since there's nothing actionable here now. :) |
@ChristianSi Just a thought about this solution. Imagine you have a usecase where you allow users, having no or little toml knowledge, to make their own toml file based on a reference file. I.e. in the example from above, users would change the id and the name in the inline tables, and add/delete inline tables if necessary.
He will get an error, since: |
@ViaFerrata It's true that config editors without a lot of experience may be confused by this. So, specifying a good format to copy from would be useful for them. For a simple list of users = [
{id=123, name="user1"},
{id=456, name="user2"},
{id=789, name="user3"},
] ... you can instead use the double-bracket [[users]]
id = 123
name = "user1"
[[users]]
id = 456
name = "user2"
[[users]]
id = 789
name = "user3" ... and those editors would be able to add a new entry to the list with little fuss. If an array of inline tables were used, they may try to split a table up and fail at it. Inline tables are for small, self-contained tables, and that's that. It's up to whoever sets the example for future editors to pick the smartest default choice for their purposes. |
Whether to allow mixed type data is contested. Since V2Ray does not use this kind of mix content in array by design, relaxing this test to avoid test break. toml-lang/toml#553
Creating a new issue for this because the history of this discussion is spread among multiple old (and now-closed) issues, and it's not at all clear which one should be used to re-open the discussion.
The TOML spec currently (2018-08-02) says that data types may not be mixed inside an array, but if the array contains nested sub-arrays then mixing data types is allowed. The example given is:
In #28 (comment), the comment was made that strongly-typed languages like Haskell are going to have a very difficult time mapping arr5 to a native data structure, because of the mixing of types in the inner arrays. That is, arr5 is an array that contains two different types: "array of ints" and "array of strings", and in Haskell those two data types are completely different and can't be combined in the same collection: arr5 is neither an "array of arrays of ints" nor an "array of arrays of strings" and is thus invalid.
After some discussion, @mojombo decided to forbid mixing data types in arrays, and moved the [ [ 1, 2 ], ["a", "b", "c"] ] example to the "things that are not allowed" section of the spec. He created #154 for this, which also included adding a syntax for tuples. The discussion on #154 went on for some time, with a lot of back-and-forth between @dahu (who wanted to allow heterogenous arrays) and @BurntSushi (who was in favor of keeping arrays homogenous and forbidding arrays like [ [ 1, 2 ], ["a", "b", "c"] ]).
In the end, #154 ended up being closed in favor of #235 (motivation found in #219). But what was lost in the move from #154 to #219 was the decision about homogenous arrays. #154 updated the spec to forbid [ [ 1, 2 ], ["a", "b", "c"] ], but #219 did not.
So despite the fact that both @mojombo and @BurntSushi appear to have been in favor of forbidding [ [ 1, 2 ], ["a", "b", "c"] ], the spec still allows that mixing today. This has caused some confusion (see #28 (comment) for example), which is why I think the discussion needs to be re-opened.
Is [ [ 1, 2 ], ["a", "b", "c"] ] legal syntax or not?
The consensus of the TOML maintainers in those comments appears to have been to forbid that mixed array, but that part of the spec change got lost in the move from one PR to another. Was that intentional, and a reversal of the previous decision? Or was the intention always to forbid mixed arrays, but the spec change just fell through the cracks?