-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NullableArrays + New API Release Roadmap #1092
Comments
NullableArrays.jl could use some love before the new release -- either JuliaStats/NullableArrays.jl#148 or some subset of it. EDIT: Though I suppose this isn't strictly blocking, it would be good to settle where lifted operators should live before making public any APIs that depend on them. |
Would this be a good place to link to issues with other packages that rely on DataFrames or should that be handled in a separate issue? |
@IainNZ -- I saw you wrote a tool to look at package dependencies. Would you mind running it on DataFrames and DataArrays so we can look for actively-maintained packages that might be affected? (If there's a way I should be able to do this let me know). Thanks -A |
Thanks for your help getting the word out on the changes, @amellnik! If I'm not mistaken, |
Ah thanks Alex -- I didn't know about that Pkg command. I'll start looking through the list and trying to submit PRs for the low-hanging fruit now. |
Only for packages that you have installed though, right? |
I think it scans your local copy of |
You can also use JuliaPkgDb and do |
It won't look at test dependencies though. Those are worth checking as well. I don't recall being asked explicitly to run pkgeval, I was asked how one would go about doing it which I answered. |
Hey @tkelman, would you be willing to explicitly run PkgEval? 🙂 |
latest master here then? anyone have a particular metadata fork+branch with a provisional test tag? skip the git tag and pr part of it for now. |
Yep.
I don't but I can make one (much) later today if no one has beaten me to it. |
Does this look right, @tkelman? JuliaLang/METADATA.jl#6735 |
May want to rebase that right before I fire off the run. Been sick the last few days, bug me again if I haven't run this by Monday. |
Looks like you had "enable edits" checked, so I did the rebase and fired off PkgEval. Hopefully it'll work, check https://github.com/JuliaCI/pkg.julialang.org/branches in 10-12ish hours to see if it creates a new "dataframes" branch with the results. |
I should have run that differently since this version doesn't support Julia 0.4 and there's a bunch of unrelated SSL certificate failures on 0.6-dev right now, but focus on the 0.5 pulse changes here: https://htmlpreview.github.io/?https://raw.githubusercontent.com/JuliaCI/pkg.julialang.org/dataframes/pulse.html
Many of these, but not all, look like packages just making the now-wrong assumption that depending on DataFrames is enough to also ensure DataArrays will be installed. The easiest short term fix for those cases would be adding DataArrays to each of these packages' REQUIRE files (could even be done directly in METADATA without having to wait for a new tag), though there could easily still be remaining failures after doing that. |
What exactly is the plan here? We are not going to say that some query framework is the only way to interact with a |
I don't think we're going to remove the current API (at least not immediately, and not all of it), but even with lifted operators it's significantly more cumbersome to use after the port to |
Just as an aside and follow up to @ararslan's post on how you can query the package database with his JuliaPkgDb repo: I took his code and turned it into a data source that can be queried with Query.jl. For now you need the code here, but I'll probably move that into Query.jl at some point. His example could then be written like this in pure julia terms, without savings things into a SQL database first: using DataFrames, Query
include("packagedb.jl")
db = pkgdb()
@from i in pkgdb() begin
@from j in i.versions
@from k in j.requires
@where k.name=="DataFrames"
@select {i.name, j.version}
@collect DataFrame
end Should be fairly simple to also read information about test requirements and add them to the mix. @ararslan Let me know if you have any thoughts about this code, but maybe here queryverse/Query.jl#51. I just posted this info in the current issue in case it might be helpful for going through packages, any follow up discussion about the code we should probably have somewhere else. |
@davidanthoff That's neat, thanks for sharing. 🙂 |
I may have been premature in opening issues related to this change in other packages -- except in trivial cases like Mustache.jl, I don't anticipate any progress on those until we have a clear strategy for queries and lifting/conversion. If we do go with either StructuredQueries or Queries, I think we also need to make sure there's still a way to do very simple operations inline. |
I'd just like to provide some feedback here in regard to the query methods. I personally rather dislike Query.jl (no offense to the developers of course); though it's cleverly done, it involves a rather elaborate interface which gets too far away from the normal structure of the language. To me, the great thing about By contrast, I really quite like DataFramesMeta.jl. It's beautifully elegant and (aesthetically) efficient, and fits very comfortably and uniformly into the existing style and structure of the Julia language. To me, using macros to splice SQL into Julia doesn't make sense. Julia is a wonderful, incredibly well-thought-out language (things that I would never say about SQL), any query API for dataframes should take advantage of this, and embrace all the things that make Julia so much better than SQL. Furthermore, a query package is much more suitable for general, custom use if it doesn't hide the underlying structure of the dataframes or the queries themselves. This is just my opinion of course, hopefully a few others will agree with me. Of course I'm not trying to discourage work on Query.jl, those who prefer that style should go with it, and certainly there will be some cases where it is much easier to use that DataFramesMeta.jl, I'm just hoping it won't become a paradigm. It's likely that at some point I'll start working on DataFramesMeta.jl, or something like it. |
Both Query.jl and StructuredQueries.jl are new. As such, this is a good time to request features and describe use cases. For example, David has started coding some shortcuts in Query.jl that make it look more like DataFramesMeta and pandas. Feedback on that might be helpful. Unfortunately, I don't have a link handy where he showed an example. |
@ExpandingMan I totally agree that something like Query.jl shouldn't become the only way to interact with DataFrames. It is really powerful is you want to do something complicated, but for the small, simple manipulations is is overkill and too verbose. I've started to add another alternative user-facing API that is a little less heavy weight, and the underlying architecture is flexible enough to potentially allow additions of more alternative user facing APIs (e.g. something that looks more like dplyr could be added as another option). But that is all future talk. The more I think about it, the more convinced I am that the current transition plan for |
@ExpandingMan and @tshort Here is an issue discussing this new API queryverse/Query.jl#72. Feedback welcome! |
Was there ever any real choice about transitioning to For example, I think implementations of basic math functions for I know it'll be a lot of work, and I haven't been involved in this, so I don't want to sound like a back-seat-driver, but it seems to me like you guys made the right decision. |
I think there is a zero chance that we will be allowed to add a lifted version of things like If we don't want to give up on this kind of API, I think the only way to get there is to use a different type for missing values, see JuliaData/Roadmap.jl#3. I've actually migrated Query.jl to that kind of design, and everything works just fine: I get type stability, I can provide lifted versions of all the functions I like etc. |
I've now read many of these issues, and I am starting to see the point about why lifted methods should not be defined for First, let me just say that I still think type-stability is absolutely crucial. Even if performance were the only issue (and it isn't), there seems to be reason enough not to use something like Here's how I see it: there are two versions of null values which are currently in widespread use which should serve as pototypes: C++'s I think I agree with those who are saying that we need another type which resembles There's two other things I'd like to point out about lifting strategies:
Also, let's try not to lose sight of something: at the end of the day, when what's being stored in a dataframe is ultimately fed into a machine learning method, integrator, whatever, the null values must be dealt with. I still haven't entirely given up on the idea of having a new type down-graph from |
I vote to reign this thread back to it's original purpose, which is to plan for the release of the new DataFrames.jl master. The foregoing comments suggest that we might want to offer at least a minimal native API that doesn't rely on a query framework. I think we can accommodate as much. Higher-order functions such as julia> using NullableArrays
julia> immutable Lifted{F}
f::F
end
julia> lift(f) = Lifted(f)
lift (generic function with 1 method)
julia> @inline function (_f::Lifted{F}){F}(x)
f = _f.f
if Base.null_safe_op(f, typeof(x))
return Nullable(f(x.value), !isnull(x))
else
U = Core.Inference.return_type(f, Tuple{eltype(typeof(x))})
if isnull(x)
return Nullable{U}()
else
return Nullable(f(unsafe_get(x)))
end
end
end
julia> Base.broadcast(f::Function, X::NullableArray) = broadcast(Lifted(f), X)
julia> digamma.(NullableArray(rand(5), rand(Bool, 5)))
5-element Array{Nullable{Float64},1}:
#NULL
#NULL
#NULL
-1.6254
#NULL So perhaps we may as well offer those, even if I would encourage a query framework over their use in general. In any case, we should make very clear which APIs will work, which will not, and why we're moving away from the one's that we're breaking. I can see a compromise where we deprecate the old API by including a minimal set of lifted methods, to ease the transition. |
I think that's a great plan/idea. It also occurred to me that, due to potential reach of such a change, we could provide a limited-time "upgrade message" that is printed in the DataFrames |
Just to be clear, @davidagold's suggestion would have the following consequences: isnull.(df[:a]) would not work in the way anyone would expect it to work. That case could probably be fixed by something more elaborate, but I believe nothing could be done about the following case, which would also not work as one would expect: foo(x) = isnull(x) ? 0 : x
foo.(df[:a]) I seem to be in the minority on this, but I think both cases are really bad. The dot syntax is conceptually really simple, except that with this it all of a sudden it gets really complicated. I actually would much prefer if the difficulties with the |
I'm pretty sure that David's solution allows you to provide a custom definition for how |
Yes, f(x) = isnull(x) ? foo : bar
f.(X) cannot. Even I agree that such inconsistencies with the behavior for arrays is not ideal. Neither lifting approach is ideal. By any means. What would be ideal (from my limited perspective) would be to have two versions of |
In terms of timeline, I think we should be explicit about what we're thinking here and commit to something publicly. It gives the best chance for others to be aware and make changes and it also helps this package keep moving forward without being in this "limbo" state for too long (i.e. the difference between the latest release tag and master). Of course, I'd say the bulk of what needs to happen before a new release depends on non-DataFrames things: NullableArarys, Nullable operations in general, StructuredQueries.jl/Query.jl, etc. I'd like to propose (and include in John's original post at the top), that we aim to have the bulk of the non-DataFrames package work done by the first of January. That gives us over two months to help update the ecosystem and iron out any remaining design decisions, as well as allowing StructuredQueries/Query to mature. We can think of the first of January as being a the moment we want to aim for a "release-candidate" of sorts, allowing a few weeks of final bugs to sort out before the final public release and announcement. We should obviously try to come up with some kind of announcement that can be released soonish letting everyone know about the the plan. |
Another point in favor of @davidagold 's comment above: Metaprogramming tools are unbelievably complicated unless we have a null type that propagates. That seems to me like the only good solution to that particular problem. I've now spent some time looking at DataFramesMeta with the intention of making it work for |
@ExpandingMan so you're aware, I've been working on a query representation framework that automatically transforms user-specified calls EDIT: That's by way of saying, the metaprogramming involved in a generic, macro-based lifting approach is not intractable. That's good, because I highly doubt that the solution I proposed above will happen any time soon. (It certainly won't happen in the form I suggested above, in which the transformation is conditional upon the types of the arguments.) |
If I may add something to the API discussion, as someone who wasn't involved in its development: I'm currently getting a DataFrame as a result of an SQL query. I would rather not be forced to use another Query API to get my values out of that DataFrame a second time. And I totally agree that imitating SQL in Julia would be a step backwards. |
The Thanks, everyone! |
I'd like to have a single issue that identifies every issue that's blocking a release of DataFrames that replaces DataArrays.
At a high-level, my understanding is the following:
The text was updated successfully, but these errors were encountered: