Better define Substrait "required types" #747
Comments
I would answer 1. Some relations have required types. However, I do not believe this prevents systems with alternative representations from fully implementing Substrait. I ran into a similar problem when mapping the Arrow type system to Substrait. The problem I ran into is "when is X a variation / implementation of Y and when are X and Y different types". I decided on the following (non-canonical) definition:
In other words, an 8-bit integer is a valid variation of Therefore, I believe that an engine is free to use whatever physical implementation of Substrait's types it wants to use provided they are compatible variations as described above.
Again, if an engine is using 8-bit integers for booleans that is fine.
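To make the boolean case concrete, here is a minimal sketch of an engine-side representation that stores booleans as signed 8-bit integers. It assumes that "compatible" means every boolean value survives a round trip through the physical form unchanged, which is one reading of this comment and not a definition taken from the spec. The function names `encode_bool` and `decode_bool` are hypothetical.

```python
import struct


def encode_bool(value: bool) -> bytes:
    """Store a logical boolean as one signed 8-bit integer (hypothetical engine layout)."""
    return struct.pack("b", 1 if value else 0)


def decode_bool(raw: bytes) -> bool:
    """Read the 8-bit integer back as a logical boolean.

    Only 0 and 1 are accepted, so the physical type never exposes
    values that the logical boolean type does not have.
    """
    (stored,) = struct.unpack("b", raw)
    if stored not in (0, 1):
        raise ValueError(f"{stored} is not a valid boolean encoding")
    return stored == 1


# Every logical value round-trips unchanged.
for logical in (True, False):
    assert decode_bool(encode_bool(logical)) is logical
```

Rejecting anything other than 0 and 1 is a design choice in this sketch: it keeps the extra range of the 8-bit integer from leaking into observable results.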
Note: this is not just
If we define this as an i64 and an engine wants to use a However, if an engine wants to use
One more note on this. If an engine is using
That being said, I'm not all that bothered by option 2. I'd say you are starting to formalize dialects. We already have ways to express dialect-style variations in functions (options, different sets of supported functions, etc.) and it makes sense we'd have ways to support dialect-style variations in relations too. I think it's fine to say "some engines have a limit relation that supports i32 and some have a limit relation that supports i64" or something like that.
Let me throw a couple of ideas into the discussion.

With the exceptions of the Boolean in filter and the grouping set ID in aggregate, we leave types completely open, right? This means that a producer uses whatever types it sees fit and the consumer can decide what to do with them. Hopefully, producers use types that are as standard as possible, but they don't have to. If a consumer does not natively support a type in a plan it is given, it can (1) decide not to execute the plan or (2) try to reproduce the behavior of the types in the plan with the types it does natively support. I think the latter case corresponds to (or is an instance of) the mapping that @westonpace described above. In other words: the consumer does whatever it has to do to simulate the behavior of the type specified in the plan, such that the observable result is exactly as if the consumer actually supported the type. For me, that seems to be perfectly fine and in line with the rest of the spec. Consumers need to produce the behavior specified by the plan no matter how they achieve it. As another example, they have the freedom to execute an

In this spirit, I think one could argue for leaving the types in question up to the producer/consumer pair as well. For aggregate and fetch, we only have to require that it's a type suitable for counting. (We could even think about lifting the requirements in aggregate to allow for any type that supports an identity test.) If a producer/consumer pair shares the types -- no problem. If the consumer can simulate the types in the plan, that's also fine. Otherwise, that producer/consumer pair cannot exchange these plans, but that's something that can happen anyway. Concretely, for the three cases, I think this would mean:
In order to reduce divergence, the spec could also strongly suggest to use I think that this is better than forcing a particular type because (1) forcing does not help in situations where the producer and consumer do not share a type they could use, i.e., a forced type doesn't magically make incompatible producer/consumer pairs compatible, and (2) it would break producer/consumer pairs that can otherwise agree on a different type.
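The consumer's two options described in this comment (refuse the plan, or simulate the plan's type with a type it natively supports) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper `resolve_count_type` is hypothetical, types are identified by their Substrait integer names (`i8`, `i16`, `i32`, `i64`), and "can simulate" is simplified to "has a native integer type at least as wide".

```python
from typing import Optional, Set

# Bit widths of the Substrait signed integer types considered in this sketch.
INT_BITS = {"i8": 8, "i16": 16, "i32": 32, "i64": 64}


def resolve_count_type(plan_type: str, native_types: Set[str]) -> Optional[str]:
    """Pick the native type a consumer would use for a count in a plan.

    Returns the plan's own type if the consumer supports it, otherwise the
    narrowest native integer type that can hold every value of the plan's
    type, otherwise None (the consumer declines to execute the plan).
    """
    if plan_type in native_types:
        return plan_type  # shared type: nothing to simulate

    needed = INT_BITS.get(plan_type)
    if needed is None:
        return None  # not an integer type this sketch knows how to simulate

    wide_enough = [t for t in native_types if INT_BITS.get(t, 0) >= needed]
    if not wide_enough:
        return None  # no lossless stand-in: reject the plan
    return min(wide_enough, key=lambda t: INT_BITS[t])


# A consumer with only i64 can stand in for an i32 count, but not vice versa.
assert resolve_count_type("i32", {"i64"}) == "i64"
assert resolve_count_type("i64", {"i32"}) is None
```

A real consumer that substitutes a wider type would still have to reproduce the narrower type's observable behavior (for example on overflow), which this sketch does not attempt.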
[This came up in community sync today.]
There are some situations where Substrait defines expectations of specific types. Examples include:
This is somewhat non-intuitive since we generally don't have a formal requirement that a system support specific data types. We should evaluate whether: