-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use higher order approach to null lifting #71
Comments
I think we disagree on many points, but that's fine. Let's see what are the advantages and drawbacks of each strategy. |
I agree with most of this, until we get to
I don't really understand this. The idea of the higher-order lifting approach --- at least in principle --- is that you write one definition that says how to lift any function to nullables, and then you're done. There's not much room for inconsistency there. If you want a different lifting semantics, you write another definition. If you want N different lifting strategies, you write N definitions. If there are functions that need very different lifting behavior, it might actually make sense to add methods to them even if we use higher-order lifting in general. This might be justifiable since you'd presumably need to understand the differences anyway. After all, if we took this to the extreme and let every function have its own lifting behavior working with nullables would become quite a minefield. I believe you that we could eventually add however many methods are needed for the "whitelist" approach, it's just that when you need to do the same work over and over again, or have a top-level
💯 Yes, in fact that is so simple I think we should use it in all or most cases. (Note the distinction between simple and easy is important here.) |
I couldn't agree more with this. But until someone has found the right abstraction (and in my mind clearly no proposal on the table has it), it seems the lesser of two bad designs to me.
If that worked with the higher-lifting approaches discussed so far, it would be great. But it doesn't. With the current higher-order lifting approach, even if you define a new method that has a signature that has |
This is a straw-man issue to put down my thoughts why this is not a good approach. There are two parts. The first is a discussion of some technical aspects. Those could in principle be overcome by e.g. new language features. The second is a discussion of the semantics of lifting, and why I think a general approach is futile in that area.
Technical aspects
Nullable
s in a query framework. Those two issues are orthogonal. Expressions withNullable
s occur outside of queries as well, and any solution that only works within a query framework would not help with the situation outside of a query framework.@where
clause here:has a different semantics than what say
Nullable(2) + 4
would have in some non-query context. I think that kind of context-dependent semantics changes are highly confusing for users and generally a sign of a bad system design, because one can't reason about the semantics of the system based on first principles anymore, instead one needs to learn a list of rules how things work in different contexts.3. I think of queries as very similar to generator expressions. In fact, I want simple query expression that only project and filter to be isomorphic to the equivalent generator expression. For example, I want to make sure that
behaves exactly in the same way as
Then I want the following code to behave exactly in the same way:
The higher order lifting approach in queries violates all of these things, and I think that is too high of a cost.
Semantics of lifting
The current idea for any general lifting semantics story is the lifted version of a function would return a
null
value if any of its inputs arenull
. I think that makes sense for some functions, but so far I can think of at least two groups of functions where this is actually not the correct semantics. This makes the situation very different from the vectorization story, where it is clear that the application of the.
syntax should have the same semantics in all cases. Here are the two cases that in my mind should not follow a general lifting semantics:==
), but also functions likecontains
etc. This is a large group of functions. I think predicate functions that take aNullable
as an input in almost all cases should return aBool
and not aNullable{Bool}
. Why? Because otherwise one will constantly end up dealing with 3VL, which is unintuitive and a pain and plain and simply confusing.Nullable{Bool}
, then&
and|
should follow the 3VL rules, which are different from any generic lifting strategy.I think these two examples are actually enough to conclude that a generic lifting strategy here is not a good idea. With this list we already have at least three types of functions that should all get different semantics, and who knows, we might well stumble upon another group of functions with yet another natural lifting story.
The case for a white-list lifting approach
Various folks have suggested that a white-listing approach to lifting is bound to not work. I disagree. So far I've seen two arguments against it: 1) it is too much work to define lifted versions (variants of that are "how do we decide what to lift?", "these are too many methods", "we'll never cover all functions") and 2) user-defined functions can't use multi-dispatch if they want to work with a set of minimal lifted functions.
I think 1) is not a convincing argument at all. That argument can be made about pretty much anything. We'll never be able to implement all math functions in julia. Sure, but why should that stop us implementing as many as we can? I think the right response to this problem is to roll up our sleeves and write code. We might not get to a 100% coverage, but I'm sure we could get to a coverage level that is good enough for the vast number of data science use cases. And for the remaining cases there is a very simple solution for users. Just write
isnull(x) ? bla : foo(get(x))
.Number 2) is more difficult, but I think in practice also not that bad. There are two types of user-defined functions in my mind: the small anonymous functions that get written in frameworks like Query.jl and StructuredQuries.jl as e.g. the filter condition etc. The arguments of those are by default not typed, so I think things should work just fine there. The second type of user-defined function is one where the user defines the method properly and wants to dispatch on types. There is no good solution for that with the white-list approach, i.e. if the user wants his/her function to work with
Nullable
s, he/she will have to define a lifted method. Not ideal, but I still prefer that over all the inconsistencies that are introduced by the higher-order lifting approach. There might also be some clever way to get around this using traits.Summary
Long story short, in my mind the drawbacks of a higher-order lifting approach are 1) technical that they violate some basic principles that I think are key for a good language design and 2) I don't think there is a general lifting semantics that one can slap on all functions. The latter point seems to biggest difference to the vectorization story.
At the end of the day all of that in my mind suggests that for now we should stick with a white-list approach. If someone comes up with a convincing general lifting strategy (both from a technical and from a semantic point of view) we can revisit that decision.
The text was updated successfully, but these errors were encountered: