A data-friendly alternative to Nullable
#132
Comments
I was actually thinking the same thing going through the base thread, that it may be worthwhile to pursue an entirely separate type for statistical missingness. Thanks for getting this discussion started, Andy, much appreciated! As we think about this, we should also take a hard look back at DataArrays, which was the initial attempt to do exactly this. What did and did not work well? I'll also cc @JeffBezanson and @StefanKarpinski here. |
I'll start with the obvious and say type stability was a "slight" issue... |
We should also be careful not to negate the massive amount of work many of the JuliaStats folks have put into making |
I like this idea a lot. In fact, the more I think about it, I think I'll just use that approach for now in Query.jl. There is a super simple way for me to make Query.jl still work with sources and sinks that are based on |
Sure. There's no reason we can't define On that note, the obvious bikeshedding - what is a good name for a stats-y nullable type?
|
Given that this type is meant to squarely target data scientists, it would be nice to stay as close as possible to names/concepts that folks in that area know. I think that speaks for something with |
I dislike |
I'm surprised |
I think a behavior like this could be nice, though in general I think I prefer using |
That is a good point... |
"Missable" is a word but it means something unrelated. |
Looking at some other languages (from wikipedia), there are only two common classes of naming:
Yes I don't think |
I've contemplated creating an alternative nullable-like type a few times, but honestly I'm not sure it would solve all problems.
|
I totally agree there will be (major!) difficulties, @nalimilan. And there are major issues of semantics that need to be discussed. In my opinion, I see two self-consistent, useful, sensible semantics for a nullable type. I think we should implement both.
In the future, as ideas progress and Julia evolves, I'm betting we'll find nicer ways of working with the former (e.g. using Even if the latter type is implemented outside of (f::Function)(x::Nullable) = hasvalue(x) ? f(x.value) : NA{return_type(f, Tuple{eltype(x)})}() There could be a million other helpful things we can't even imagine yet. Like @davidanthoff said quite well here, we shouldn't give up on trying just because it's a little disgusting or difficult or imperfect on our first try. |
I don't see anything in your proposal which says what using a new type would allow us to do right now. How would it help get type-stable nullable arrays/data frames ready for a release in the next few weeks (or even by the time Julia 1.0 is released)? The (See discussion at JuliaStats/NullableArrays.jl#85 about |
Call overloading on functions doesn't seem to be permitted:
But if it were, writing |
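As an editorial aside on the lifting idea above: since methods cannot be added to the abstract type Function, one alternative is an explicit higher-order wrapper. The following is only a minimal sketch. The name `lift` is hypothetical and not taken from any package in this thread, and it uses Base's `missing` (available in Julia 1.0 and later, so it postdates this discussion) as a stand-in for NA.

```julia
# `lift` is a hypothetical helper name; `missing` is Base's missing value (Julia >= 1.0).
# It wraps `f` so that any missing argument short-circuits to `missing`.
lift(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

safe_sqrt = lift(sqrt)

safe_sqrt(4.0)        # 2.0
safe_sqrt(missing)    # missing
lift(+)(1, missing)   # missing
```

The wrapper has to be applied explicitly per function, which is exactly the ergonomic cost that call overloading was meant to avoid.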
I'm converting Query.jl to use a new type This approach works well for Query.jl, even if the rest of the data universe doesn't adopt the Obviously this doesn't help with the old API stuff like I'd be happy to a) rename the type to something better if there is a consensus and b) move all of that into its own package eventually. But for now I just want to move forward with Query.jl, so I'll just have it there for now. |
It's certainly great to have someone charge forward on this approach. One of my biggest concerns was just finding someone to implement and maintain the code. I think it'll be great to see how things unfold with its usage and it will provide some good chances to learn as we go (separate from Base and with more flexibility to change/update things). |
This is an exciting thread. My use of Julia falls squarely into the data analyst camp, and being able to easily work with dataframes is my highest priority in these changes. The allure of DataArrays was significant -- they worked almost exactly like normal arrays in most cases, and @andyferris's second option or something along those lines seems like a good step forward, until |
Thanks @davidanthoff for going forward with this. I was going to tell @nalimilan that unfortunately I quite probably wouldn't have time to create an entirely new and complete system in a few weeks, but it seems that David will. :) Though, I would recommend splitting off |
I've been reading (and re-reading) about earlier discussions surrounding
However I'm not sure (I could be wrong) those languages had a well thought out system of Another thing - I saw a comment equating hashing and |
And sometimes they will want it to be
That's easy: don't hash the value of null. |
Here we go: https://github.com/davidanthoff/NAables.jl I also have a branch in Query.jl that uses Right now all the code in NAables.jl is also in Query.jl, but that is just until we've decided on a final name for NAables.jl and METADATA registration. I'm still somewhat nervous about the Query.jl switch, so I'll probably wait a little longer before I merge this. But right now it solves a huge issue, namely I can get rid of all type piracy, and I don't really see any downside... |
I think the answer to the question of the behavior of This is quite simply because Note that since |
No, it's definitely not that simple (read discussions I linked to above). For example, R uses a completely different approach. There are several legitimate strategies, and it's hard to know which one is best. There are also subsidiary questions, like do we want |
Having |
I don't know, I've worked with NaN's for a long time. In fact, before my current job I was a physicist, and since I dealt almost exclusively with doubles, I guess I have to concede that not everyone shares this experience, but I would like to make one further point: |
If you think R and Pandas are a niche thing, would you also qualify SQL as such (cf. Wikipedia on NULLs in SQL)? Then what's left in the realm of data tables management? |
Sorry, I was probably a bit overzealous in my language. I was just trying to make the point that floating point NaN is a very standard, universally understood thing in a way that certain conventions in R and pandas are not. I'm still arguing for using NaN as the paradigm, but completely understand that there are other, perfectly legitimate ways of doing things. |
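For readers less familiar with the NaN paradigm being argued for here, this short illustration shows the standard IEEE 754 behaviour of NaN in Julia. It only demonstrates floating point NaN itself and is not a proposal for how a missing value type should behave.

```julia
x = NaN

x + 1.0           # NaN: arithmetic propagates NaN
x == x            # false: IEEE 754 comparison with NaN is never true
isequal(x, x)     # true: isequal treats NaN as equal to itself
isnan(x)          # true
sum([1.0, NaN])   # NaN: a single NaN poisons the reduction
```

Note that `x == x` returns a plain `false` rather than something "unknown", which is one of the points where NaN semantics differ from the three-valued logic used by SQL NULL.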
Based on a Slack conversation, this is what I feel needs to be done for 1.0:
Things that don't need to be done for 1.0:
|
These statements are too strong and are probably where some of the frustration is stemming from in this discussion. As far as I know from this and other related discussions, there is only a single dissenter (@davidanthoff) on how a missing value type should behave, be defined, or be called. Everyone else who has chimed in has agreed that For me, I certainly understand the hesitance to consider including another "standard library" package at this point before a 1.0 release, but I also firmly believe that this is a topic that has already been iterated on for years in the julia ecosystem, with all design & iteration leading to current Nulls.jl. I also firmly believe that it would be a godsend for the data ecosystem to have a "blessed" representation of missingness, that packages could rely on and build against. It would go a long ways to making the "marketing event" of 1.0 more complete as a swath of data-related packages could also be released using a coherent strategy for missingness. I think Julia stands to "miss out" on more than it gains by leaving them out for 1.0. |
I should point out that any data-frames-like data analysis tooling package should absolutely not make assumptions about what kind of nulls are used. This was one of the major mistakes of the original DataFrames – the If you want a data column that can hold integers or missing values, its type should be Many people also don't want to have null support forced on them – it should be possible to have an |
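To make the "column type carries the missingness" idea concrete, here is a minimal sketch. It assumes Julia 1.0 or later and uses Base's `Missing` type, which postdates this thread; the type actually discussed above was the one from Nulls.jl, so treat `Missing` here as a stand-in.

```julia
# A column that may hold integers or missing values: the element type is a Union.
col = Union{Int, Missing}[1, missing, 3]

# A column with no missing-value support at all: a plain integer vector.
plain = [1, 2, 3]

sum(col)                     # missing: the missing value propagates
sum(skipmissing(col))        # 4: skip missing values explicitly
collect(skipmissing(col))    # [1, 3]
sum(plain)                   # 6
```

Generic code that only relies on the element type works on both columns, so a table package does not have to assume any particular kind of null.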
@quinnj From what I can see, the definition is |
@mlhetland, no, sorry, it was my mistake of saying Nulls.jl follows @StefanKarpinski's description of |
@quinnj's "marketing event" is exactly what I'm talking about - I think this would be a real declaration that the language is intended for use by life scientists, and I could personally use this to good effect tomorrow to argue that we ought to be thinking of a (long-term!) strategy of moving our department's research and ultimately our teaching over to a real programming language. I also believe that @ExpandingMan is completely wrong both about NaN-like behaviour being okay and about people not caring about inconsistencies. Missing data is everywhere in the life sciences, and behaves exactly how @quinnj and I have described. It's also often not floating point - factors are a big issue in particular. I've personally stopped using R in anger completely and replaced it with Julia, but I will never try to teach a language with contradictory and inconsistent syntax to my students because it's more trouble than it's worth. I'm not trying to put words in anyone's mouth (sorry, @StefanKarpinski!), I'm just telling you what I believe to be true. As everyone acknowledges, this is a really important thing to get right, but we know how to do it now, and so we should just bite the bullet and do it. PS None of this requires |
FWIW, I still haven't seen a strong argument for why Nulls.jl needs to be in Base. As it stands, I'd like to keep Base as minimal as possible, so the language can iterate independent of these datascience features. Is the issue simply not wanting to type |
You wouldn't even need to do |
@rofinn, technically, it's mainly about being able to have the syntax Another benefit I just thought of is the "proximity to the type system"; the I stand by my arguments from above though that it also sends out a strong message that we've finally settled on a sensible standard that everyone can rely on. |
There are a few other reasons one might want a built in |
Yeah, I'd be okay with that. I'm still concerned that we're conflating missing values in software development and statistics. I can understand wanting something like |
I'm not sure why you'd expect that. Adding a number to a missing value isn't an error, we just don't know what the answer is, surely? Or am I getting confused about what you're saying? |
Yes, there were many problems here... however the specific problem I was referring to was scarcity - that it was unique. It was the only macro operator, and it becomes a race to see who claims it first. I decided not to use it specifically because I could foresee that another popular package might be used in tandem with whatever it was I was playing with at the time (sorry, the details elude me right now). It seems to me that if there were different "nulls" using |
+100 to this. Not intentionally trying to muddy the waters, but I did start playing with As for |
@richardreeve Adding a number to a missing value could be an error depending on what the context of that missing value is (e.g., we don't know what |
@rofinn Okay, I was getting confused about what you were saying - I thought you were discussing using |
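The two behaviours being contrasted in this exchange (arithmetic that raises an error versus arithmetic that propagates an unknown) can be seen side by side in later versions of Julia. This is a sketch assuming Julia 1.0 or later, where Base's `nothing` and `missing` play roughly these two roles; neither was settled at the time of this discussion.

```julia
# Statistical missingness: the answer is unknown, so the result is `missing`.
1 + missing    # missing

# "No value" in the software-engineering sense: arithmetic is a MethodError.
threw_method_error = try
    1 + nothing
    false
catch err
    err isa MethodError
end

threw_method_error    # true
```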
At the moment I see these 4 as semantically clear and distinct possibilities. @rofinn, @richardreeve - it seems to me that you are discussing 3 vs 4.
While I appreciate that other languages have lamented the sheer number of different types of "nullable" types, I see here four semantically distinct things - it would be hard to accidentally use one of the 4 in place of another and not get errors relatively quickly. |
@andyferris And to be honest, although I understand the importance of the others, I only care about 4. And though there may be technical reasons for putting each of them in Base or not, I think all of them should go in partly for "marketing" reasons to make a "strong statement" about support, and partly because as we've just perfectly demonstrated this is very confusing so we have to have a completely clear core definition of each so there is no disagreement about which one is being talked about and used at any given time! |
I very much appreciate that sentiment, @richardreeve. If I might offer some interpretation here for you - I think to many (most) here, the v1.0 milestone marks the beginning of some stability in the language - that we won't be changing the syntax or making too many breaking changes to the types in Thus, the prerequisite to putting something completely new like in The nice thing is that we can still have definite plans to add case 4 to |
@andyferris Which other uses than no. 3 do you see for null pointers (i.e., no. 2)? And how is this different from the use of |
I do think there are cases in Base where it will be nice/ergonomic to handle null correctly out of the box, but I really, really don't want to do this. The problem is that since you don't know whether |
Yes, |
It may be a cheat, but … it may not be entirely unreasonable to just forbid basing control flow on it, as you don’t know what you’re supposed to do? I mean, if you have |
Eh. Right. Wrote my response in parallel with @nalimilan, there. |
Both. For |
Sure, not magical or crazy :-) But it does seem that the short-circuit operators are idiomatically used as conditionals in Julia, in which case the lines seem to blur, perhaps? Things like More destructive in cases like |
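For reference, this is how the control-flow question plays out with three-valued logic in Julia 1.0 and later, using Base's `missing` (an assumption here, since that type postdates this thread). The bitwise operators `&` and `|` follow three-valued logic, while `if` and the left-hand side of `&&` require a real `Bool`.

```julia
missing & false    # false: the result is known regardless of the missing operand
missing | true     # true
missing & true     # missing: the result depends on the unknown value

# A missing condition in a boolean context is an error rather than a silent choice.
errored = try
    missing && true
catch err
    err isa TypeError
end
errored            # true

# `coalesce` makes the choice explicit when control flow needs a Bool.
x = missing
coalesce(x > 0, false)    # false
```

This matches the suggestion above to forbid basing control flow directly on a null and to require the caller to say what should happen.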
That makes sense, though that's low priority compared with other features we would want to work with |
close now? |
At last! |
The semantics of Nullable have been under heavy discussion over in Julia base at JuliaLang/julia#19034 (comment).
Personally, I had been somewhat confused as to the core problems but @johnmyleswhite makes a very compelling argument which really solidifies (for me) why there has been tension regarding adding features to Nullable. I'll copy it here (I hope you don't mind, John).
To me, if we are going to make a way forward, we should really begin developing a data-friendly alternative to Nullable and leave Base.Nullable for the software engineers. I think this will allow for much more rapid progress. My proposal would be to create a type which behaves semantically as close as possible to Union{T, NA} while remaining type-stable, where NA is the missing value type in the DataArrays.jl package which behaves somewhat like NaN does for Float64.
However, this needs to be a discussion for the community, and it doesn't necessarily have to involve Base Julia at all. I'll ping a bunch of people and see what happens.
@johnmyleswhite @JuliaData (not sure if that works so I'll just add everyone @ararslan @davidagold @dmbates @kleinschmidt @nalimilan @quinnj @richardreeve @Scidom @shashi @simonbyrne )
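As an editorial illustration of what such a type might look like, here is a minimal sketch of a type-stable wrapper with NaN-like propagation. All names (`NAable`, `isna`) are hypothetical and chosen for illustration only; this is not the implementation of any package mentioned in the thread.

```julia
# Hypothetical wrapper: a flag plus a value slot, so the concrete type is always NAable{T}.
struct NAable{T}
    hasvalue::Bool
    value::T
    NAable{T}() where {T} = new{T}(false)      # NA: the value field is left unset
    NAable{T}(v) where {T} = new{T}(true, v)   # wraps a present value
end
NAable(v::T) where {T} = NAable{T}(v)

isna(x::NAable) = !x.hasvalue

# NaN-like propagation: the result is NA if either operand is NA.
function Base.:+(a::NAable{T}, b::NAable{T}) where {T}
    (isna(a) || isna(b)) ? NAable{T}() : NAable{T}(a.value + b.value)
end

a  = NAable(1)
b  = NAable(2)
na = NAable{Int}()

(a + b).value    # 3
isna(a + na)     # true
```

Because `+` returns an `NAable{T}` on both branches, the result type does not depend on whether a value is present, which is the type-stability property the proposal asks for.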