Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use higher order approach to null lifting #71

Open
davidanthoff opened this issue Oct 24, 2016 · 3 comments
Open

Use higher order approach to null lifting #71

davidanthoff opened this issue Oct 24, 2016 · 3 comments

Comments

@davidanthoff
Copy link
Member

This is a straw-man issue to put down my thoughts why this is not a good approach. There are two parts. The first is a discussion of some technical aspects. Those could in principle be overcome by e.g. new language features. The second is a discussion of the semantics of lifting, and why I think a general approach is futile in that area.

Technical aspects

  1. It looks like a design smell to me to try to solve the problem of operations on Nullables in a query framework. Those two issues are orthogonal. Expressions with Nullables occur outside of queries as well, and any solution that only works within a query framework would not help with the situation outside of a query framework.
  2. I'm a strong believer in the following principle: an expression that operates on some inputs should always have the same semantics, regardless where that expression occurs. The surroundings of an expression should, as much as possible, not change the semantics of the expression. So for example, I'm adamantly opposed that the expression in the @where clause here:
@from i in source begin
    @where i + 4
    @select i
end

has a different semantics than what say Nullable(2) + 4 would have in some non-query context. I think that kind of context-dependent semantics changes are highly confusing for users and generally a sign of a bad system design, because one can't reason about the semantics of the system based on first principles anymore, instead one needs to learn a list of rules how things work in different contexts.
3. I think of queries as very similar to generator expressions. In fact, I want simple query expression that only project and filter to be isomorphic to the equivalent generator expression. For example, I want to make sure that

@from i in source begin
    @where i>2
    @select i^2
    @collect
end

behaves exactly in the same way as

[i^2 for i in source if i>2]
  1. I think one should always be able to factor out an expression from a query into a function and still get the same result. For example, lets say one starts with this query:
@from i in source begin
    @select i^2
end

Then I want the following code to behave exactly in the same way:

foo(x)=x^2
@from i in source begin
    @select foo(i)
end
  1. I want the following code to work as one would expect it to work:
foo(x)=isnull(x) ? 0 : x^2
@from i in source begin
    @select foo(i)
end

The higher order lifting approach in queries violates all of these things, and I think that is too high of a cost.

Semantics of lifting

The current idea for any general lifting semantics story is the lifted version of a function would return a null value if any of its inputs are null. I think that makes sense for some functions, but so far I can think of at least two groups of functions where this is actually not the correct semantics. This makes the situation very different from the vectorization story, where it is clear that the application of the . syntax should have the same semantics in all cases. Here are the two cases that in my mind should not follow a general lifting semantics:

  1. Any predicat. These include things like comparison operators (==), but also functions like contains etc. This is a large group of functions. I think predicate functions that take a Nullable as an input in almost all cases should return a Bool and not a Nullable{Bool}. Why? Because otherwise one will constantly end up dealing with 3VL, which is unintuitive and a pain and plain and simply confusing.
  2. The operators of 3VL. Now this is slightly ironic, because my previous point of course argued that we should try to keep 3VL out of the system as much as possible, but still. If you do have a Nullable{Bool}, then & and | should follow the 3VL rules, which are different from any generic lifting strategy.

I think these two examples are actually enough to conclude that a generic lifting strategy here is not a good idea. With this list we already have at least three types of functions that should all get different semantics, and who knows, we might well stumble upon another group of functions with yet another natural lifting story.

The case for a white-list lifting approach

Various folks have suggested that a white-listing approach to lifting is bound to not work. I disagree. So far I've seen two arguments against it: 1) it is too much work to define lifted versions (variants of that are "how do we decide what to lift?", "these are too many methods", "we'll never cover all functions") and 2) user-defined functions can't use multi-dispatch if they want to work with a set of minimal lifted functions.

I think 1) is not a convincing argument at all. That argument can be made about pretty much anything. We'll never be able to implement all math functions in julia. Sure, but why should that stop us implementing as many as we can? I think the right response to this problem is to roll up our sleeves and write code. We might not get to a 100% coverage, but I'm sure we could get to a coverage level that is good enough for the vast number of data science use cases. And for the remaining cases there is a very simple solution for users. Just write isnull(x) ? bla : foo(get(x)).

Number 2) is more difficult, but I think in practice also not that bad. There are two types of user-defined functions in my mind: the small anonymous functions that get written in frameworks like Query.jl and StructuredQuries.jl as e.g. the filter condition etc. The arguments of those are by default not typed, so I think things should work just fine there. The second type of user-defined function is one where the user defines the method properly and wants to dispatch on types. There is no good solution for that with the white-list approach, i.e. if the user wants his/her function to work with Nullables, he/she will have to define a lifted method. Not ideal, but I still prefer that over all the inconsistencies that are introduced by the higher-order lifting approach. There might also be some clever way to get around this using traits.

Summary

Long story short, in my mind the drawbacks of a higher-order lifting approach are 1) technical that they violate some basic principles that I think are key for a good language design and 2) I don't think there is a general lifting semantics that one can slap on all functions. The latter point seems to biggest difference to the vectorization story.

At the end of the day all of that in my mind suggests that for now we should stick with a white-list approach. If someone comes up with a convincing general lifting strategy (both from a technical and from a semantic point of view) we can revisit that decision.

@nalimilan
Copy link

I think we disagree on many points, but that's fine. Let's see what are the advantages and drawbacks of each strategy.

@JeffBezanson
Copy link

I agree with most of this, until we get to

Not ideal, but I still prefer that over all the inconsistencies that are introduced by the higher-order lifting approach.
the drawbacks of a higher-order lifting approach are 1) technical that they violate some basic principles that I think are key for a good language design and 2) I don't think there is a general lifting semantics that one can slap on all functions

I don't really understand this. The idea of the higher-order lifting approach --- at least in principle --- is that you write one definition that says how to lift any function to nullables, and then you're done. There's not much room for inconsistency there. If you want a different lifting semantics, you write another definition. If you want N different lifting strategies, you write N definitions.

If there are functions that need very different lifting behavior, it might actually make sense to add methods to them even if we use higher-order lifting in general. This might be justifiable since you'd presumably need to understand the differences anyway. After all, if we took this to the extreme and let every function have its own lifting behavior working with nullables would become quite a minefield.

I believe you that we could eventually add however many methods are needed for the "whitelist" approach, it's just that when you need to do the same work over and over again, or have a top-level for loop do it, I begin to suspect we're using the wrong abstraction.

And for the remaining cases there is a very simple solution for users. Just write isnull(x) ? bla : foo(get(x)).

💯 Yes, in fact that is so simple I think we should use it in all or most cases. (Note the distinction between simple and easy is important here.)

@davidanthoff
Copy link
Member Author

I begin to suspect we're using the wrong abstraction.

I couldn't agree more with this. But until someone has found the right abstraction (and in my mind clearly no proposal on the table has it), it seems the lesser of two bad designs to me.

If there are functions that need very different lifting behavior, it might actually make sense to add methods to them even if we use higher-order lifting in general.

If that worked with the higher-lifting approaches discussed so far, it would be great. But it doesn't. With the current higher-order lifting approach, even if you define a new method that has a signature that has Nullables in it, that method won't be called from expression that occur in the higher-order lifting context. I just think that is so unintuitive and essentially breaks the semantics of the whole multi-dispatch system, that I don't want to add that kind of thing to Query.

@davidanthoff davidanthoff added this to the Without schedule milestone Oct 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants