package interactions #2025

Closed
StefanKarpinski opened this issue Jan 12, 2013 · 76 comments
Labels
packages Package management and loading

Comments

@StefanKarpinski
Member

Especially in the presence of multiple dispatch, there are situations where there exists glue code that you want to load only when using both of two packages. For example, the k-nearest neighbors algorithm makes perfect sense to apply just to plain old matrices – but of course one also wants it to work for data frames, data matrices and various other containers of data. Currently the only way to make this work is to have the kNN package depend on DataFrames and add the appropriate DataFrame-specific methods. This is going to get out of hand very quickly.

I can think of two solutions. One way is to write the kNN code in a more generic fashion so that it isn't coupled with the DataFrames package but uses an interface for containers of data which DataFrames happens to provide. This is generally a good idea, but I kind of suspect that it may be rather hard to make work in all cases. The other way is to provide a mechanism for loading glue code only when both kNN and DataFrames are loaded.

@johnmyleswhite
Member

One partial way to cope with this is to establish canonical types used as interfaces between packages: this is part of the reason that we created vector and matrix in DataFrame. Then you can write

knn(a::Any, b::Any) = knn(matrix(a), matrix(b))

The trouble is that many methods will need DataFrames as the canonical type if they are to be robust to missing data.
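A minimal sketch of this conversion-based pattern in current Julia syntax. `as_matrix` is a hypothetical stand-in for the `matrix` function mentioned above, and the toy 1-nearest-neighbour `knn` is only there to have something to dispatch on; neither is the actual kNN or DataFrames API.

```julia
# Hypothetical conversion hook: a container type opts in by adding a method.
as_matrix(x::AbstractMatrix) = x
as_matrix(x::AbstractVector{<:AbstractVector}) = permutedims(reduce(hcat, x))

# Core algorithm written once, against plain matrices (rows are observations).
# For each row of `queries`, return the index of the nearest row of `train`.
function knn(train::AbstractMatrix, queries::AbstractMatrix)
    map(1:size(queries, 1)) do i
        dists = [sum(abs2, train[j, :] .- queries[i, :]) for j in 1:size(train, 1)]
        argmin(dists)
    end
end

# Generic fallback: convert anything else to the canonical type first.
knn(a, b) = knn(as_matrix(a), as_matrix(b))

train = [[0.0, 0.0], [10.0, 10.0]]    # a vector of rows, not a Matrix
queries = [[1.0, 1.0], [9.0, 8.0]]
nearest = knn(train, queries)
```

The algorithm package never mentions the container type; a container package only has to supply an `as_matrix` method. This does not help when the algorithm needs container-specific behaviour such as missing-data handling.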

@JeffBezanson
Member

I think the only thing core julia can do to help with this situation is some kind of conditional loading (your second option).

@StefanKarpinski
Member Author

Right, but I'm thinking of a very particular kind of conditional loading: require("kNN") when DataFrames has already been loaded or require("DataFrames") when kNN has already been loaded both trigger the loading of the following two files if they exist:

  • kNN/glue/DataFrames.jl
  • DataFrames/glue/kNN.jl
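The lookup described above could be sketched roughly as follows. `glue_files` is a hypothetical helper, not an existing Julia function, and the directory layout is built in a temporary directory purely for the demonstration.

```julia
# Hypothetical helper: when `new_pkg` is being loaded, look for glue files that
# pair it with every already-loaded package, checking both directions.
function glue_files(new_pkg::AbstractString, loaded, root::AbstractString)
    found = String[]
    for other in loaded
        for (owner, partner) in ((new_pkg, other), (other, new_pkg))
            path = joinpath(root, owner, "glue", partner * ".jl")
            isfile(path) && push!(found, path)
        end
    end
    return found
end

# Demo layout: only kNN ships glue for DataFrames.
root = mktempdir()
mkpath(joinpath(root, "kNN", "glue"))
write(joinpath(root, "kNN", "glue", "DataFrames.jl"), "# glue code\n")

a = glue_files("kNN", ["DataFrames"], root)    # DataFrames was loaded first
b = glue_files("DataFrames", ["kNN"], root)    # kNN was loaded first
```

Because both directions are checked, the same glue file is found regardless of which package was required first.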

@StefanKarpinski
Member Author

This arrangement allows you to provide glue code for a package to make it work nicely with as many other packages as you want, without any of the packages depending on each other. If you happen to load both, you get the appropriate glue; if you only load one or the other, then you don't.

@johnmyleswhite
Member

This seems like it will put a big burden on DataFrames, no?

@JeffBezanson
Member

It should be more like an optional dependency, so only one of those glue directories is needed.

@StefanKarpinski
Member Author

Neither glue directory is required – they're only loaded if they exist. The main reason to look for both of them is so that the order in which requires occur doesn't affect what gets loaded. Afaict, "optional dependency" is an oxymoron.

@StefanKarpinski
Member Author

Typically for a foundational package like DataFrames, the other packages will provide the glue.

@timholy
Member

timholy commented Jan 12, 2013

What's wrong with a separate glue package?

Though see my last comment in #1809. If you need to override (not just extend) the behavior of another module to achieve what you want, I guess

evalfile(fname::String, mod::Module) = eval(mod, parse(readall(fname))[1])

might be useful.

@StefanKarpinski
Member Author

The issue with a separate glue package is that we support loading a third package when kNN or DataFrames is used, but not when both kNN and DataFrames are used, which is what you want for glue packages. Glue packages could modify existing code and they could be guaranteed to be loaded after both of the packages they connect.

@StefanKarpinski
Member Author

I suspect that all this points towards making requirements declarative rather than imperative.

@StefanKarpinski
Member Author

Let me elaborate on that. I think I've figured out what "optional dependency" means: if A is an optional dependency of B then if A and B are both required, A should be loaded before B. If we can arrange for that to happen, we don't need a special glue mechanism since B can simply check for the presence of A when it's loading and execute "glue code" conditionally. However, it seems to me that this entire notion implies that requirements must be declarative since otherwise you can't know if A is going to be required if B is loaded before A.
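A rough sketch of why the declarative form matters: once the full set of requested packages is known up front, a load order that respects optional dependencies can be computed. `load_order` and the `optional` table are hypothetical names for illustration, not part of any package manager.

```julia
# optional[B] lists packages that must load before B *if* they are also requested.
function load_order(requested::Vector{String}, optional::Dict{String,Vector{String}})
    order = String[]
    seen = Set{String}()
    function visit(pkg)
        pkg in seen && return
        push!(seen, pkg)
        for dep in get(optional, pkg, String[])
            # An optional dependency constrains the order only if it was requested too.
            dep in requested && visit(dep)
        end
        push!(order, pkg)
    end
    foreach(visit, requested)
    return order
end

optional = Dict("B" => ["A"])
both = load_order(["B", "A"], optional)   # A is placed before B
only_b = load_order(["B"], optional)      # A is never loaded
```

With imperative requires, the second argument to the first `require` call is simply not known yet, which is the problem described above.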

@diegozea
Contributor

This looks related to my current situation: https://groups.google.com/forum/?hl=es&fromgroups=#!topic/julia-users/wwxKj0QoKzM

I've been thinking about this too. When one package depends on another, you pay a load-time penalty. For example:

julia> @elapsed require("DataFrames")
3.3809280395507812       

The Benchmark package uses DataFrames (https://github.com/johnmyleswhite/Benchmark.jl):

julia> @elapsed require("Benchmark") 
3.6832501888275146       

The load time of this small package is huge because of the load time of DataFrames.

Maybe the better option is to compile only what is used, so that you don't compile the full DataFrames package if you only use one type and two methods from it.

When Julia becomes compilable, is this going to happen?
Or could a run-time check help avoid this, so that conditional dependencies are loaded only when they are needed?

@diegozea
Contributor

You can't use only a few things from a package/module...

julia> using DataFrames.DataFrame
invalid using statement: name exists but does not refer to a module
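For comparison, current Julia lets you bring selected names into scope with the colon form of `using`. The sketch below uses the `LinearAlgebra` standard library so that it runs without installing anything; the same form applies to any package. Note that this only limits which names are imported: the whole module is still loaded, so it does not reduce load time.

```julia
# Import a single name from a module; other exported names stay out of scope.
using LinearAlgebra: dot

result = dot([1, 2], [3, 4])
```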

@StefanKarpinski
Member Author

Diego, your focus on how fast Julia programs load hints to me that you may be doing something wrong. Why is starting Julia such a bottleneck for you? That said, the package interactions thing is clearly an issue (hence me opening this issue).

@StefanKarpinski
Member Author

You can only use a module as a module.

@diegozea
Contributor

I usually run scripts, many times over. I know I can't avoid this by running everything inside Julia. The problem I see isn't specific to Julia; it affects a lot of languages. For example, I wrote scripts in Python using Bio, for myself and to share with co-workers in my group. Doing this, I noticed the load time of modules. Yes, it's only seconds! But I run them in pipelines 188086 times (at one second each, that's a little more than two days spent only loading packages). I'm afraid of Julia going the same way. Maybe once it becomes compilable this won't be a problem, but at the moment I don't know whether I should design a faster-to-load package or not.

If the answer is to try to make packages a little faster to load (even for a compilable Julia), then interactions between packages make that hard to achieve.
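The load-time estimate quoted above can be checked with a few lines of arithmetic, assuming exactly one second of package loading per run (the figure given in the comment):

```julia
runs = 188_086              # number of pipeline invocations quoted above
seconds_per_load = 1.0      # assumed load time per invocation
total_days = runs * seconds_per_load / (24 * 60 * 60)
```

This comes to roughly 2.18 days, consistent with "a little more than two days".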

@StefanKarpinski
Member Author

Loading Julia programs will be fast when we have a compiler. Until then it will continue to be slow.

@diegozea
Contributor

Fast enough that I don't have to worry about load times and package size? Or is it still good to try to make packages smaller?

@StefanKarpinski
Member Author

It's always good to make things smaller. Honestly though, if you're starting a program 1 BILLION TIMES, you should really consider trying to run things in a single long-running process. Starting a C program that just exits is not instantaneous either.

@pao
Member

pao commented Jan 17, 2013

Julia is a general-purpose language with good support for running external programs--perhaps consider using Julia as the glue for your pipelines?

@StefanKarpinski
Member Author

+1. Julia is good at this kind of thing: http://docs.julialang.org/en/latest/manual/running-external-programs/
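As a small illustration of driving a pipeline from a single long-running Julia process, here is a sketch in current Julia syntax. It assumes a Unix-like system with `echo` and `tr` on the PATH; the commands themselves are placeholders for real pipeline steps.

```julia
# Capture the standard output of an external program as a String.
greeting = read(`echo hello`, String)

# Chain two programs without going through a shell.
shouted = read(pipeline(`echo hello`, `tr a-z A-Z`), String)

# Run the same external step many times from one Julia session, so Julia
# and its packages are loaded only once.
outputs = [chomp(read(`echo run $n`, String)) for n in 1:3]
```

Values interpolated with `$` are passed to the program as separate arguments rather than being re-parsed by a shell, so no manual quoting is needed.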

@diegozea
Contributor

I've read about that, but I haven't had a chance to try it yet. I'm used to using bash. It's a good idea, I'm going to try it ;)

@diegozea
Contributor

Getting back to the point of this issue (excuse me for the noise)

I think it would be great to be able to define a method for a DataFrame without importing the package, for example. Declaring methods for types without importing all of their packages would be useful.

@johnmyleswhite
Member

Are you proposing lazy loading of dependencies?

@diegozea
Contributor

I don't know if lazy loading is the right expression, and I don't know if Stefan means the same thing by declarative instead of imperative.

I'm saying that if you are going to use k-means on a matrix, you don't need to load DataFrames.
But if you load DataFrames, you can use k-means on DataFrames if the method is defined.

Maybe it's more like the ability to define methods for types you haven't loaded, so that Julia can use them once those types are loaded.

Maybe lazy loading (loading only when you need it) would be a good option too.

@mlubin
Member

mlubin commented Jan 23, 2013

+1 for allowing loading glue when a specific set (pair) of packages is installed

@carlobaldassi
Member

It seems to me that this "glue plan" calls for introducing a "CONFLICTS" file for packages alongside "REQUIRES": suppose packages A and B have some glue code stored within A, then B gets updated in an incompatible way, and the glue code in A doesn't work with the new version of B. Since the two packages do not explicitly depend on each other, the packaging system would have no way to know this, unless told explicitly somehow. Maybe there are better ways to deal with this situation than introducing conflicts, but this is the easiest I can think of.

BTW I'd also like to have the "glue code" feature.
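One way to picture the check such a "CONFLICTS" declaration would enable is the sketch below. The table, its format, and `glue_compatible` are all hypothetical; the thread does not specify what the file would contain.

```julia
# Hypothetical record kept by package A: its glue code is known to work with
# B versions in the half-open range [lower, upper).
glue_works_with = Dict("B" => (v"0.1.0", v"0.3.0"))

function glue_compatible(pkg::AbstractString, installed::VersionNumber)
    haskey(glue_works_with, pkg) || return true   # no glue for pkg, nothing to break
    lower, upper = glue_works_with[pkg]
    return lower <= installed < upper
end

ok = glue_compatible("B", v"0.2.5")
broken = glue_compatible("B", v"0.3.0")
```

A package manager with access to this information could refuse to install an incompatible B alongside A, even though neither package lists the other as a requirement.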

@aviks
Member

aviks commented Feb 7, 2013

There is now a request to add a JSON serialiser for DataFrames. aviks/JSON.jl#10

However, given the relative sizes, and use-cases, of the two packages, I am loath to add a dependency on DataFrames to JSON. Any thoughts on a way out at present?

@lobingera

Is there some visible development on this?

@JeffBezanson added the "triage" label (This should be discussed on a triage call) on Dec 31, 2017
@JeffBezanson
Member

I believe at this point this can and needs to be added as a feature in 1.x.

@bjarthur
Contributor

so the 30-sec wait until display of first plot in Gadfly waits until then?

@JeffBezanson
Member

I don't believe that delay is related to conditional dependencies? Am I missing something?

@ChrisRackauckas
Member

Yes, Gadfly doesn't even have conditional dependencies. It doesn't use Requires.jl and it doesn't @eval using statements (which is Plots.jl's issue). Gadfly's first time to plot is completely orthogonal to this and just due to precompilation not capturing most of what users "think it would/should".

@lobingera

@ChrisRackauckas Just for the record, Gadfly renders via Compose and Compose has some infrastructure like https://github.com/GiovineItalia/Compose.jl/blob/master/src/Compose.jl#L30-L47, so you're technically right that Gadfly doesn't have conditional dependencies, but it depends on it ...

@ChrisRackauckas
Member

Interesting. I didn't know Compose did that. Then it is the same problem as Plots.jl. Why does it have to be lazy though? It's the lazy loading part that makes it difficult.

@lobingera

lobingera commented Dec 31, 2017

Maybe we see (compared to Plots.jl) some convergence here: similar problems bounded by the same constraints lead to similar solutions. Compose actually manages two backends, a homegrown SVG and a link to Cairo for other formats.

@ChrisRackauckas
Member

The idea for fixing Plots.jl backends is much simpler than having some kind of Base hook for conditional deps. Instead, it's to pull in the backends with using and work off of that syntax, i.e. using PlotsGR instead of having gr() do that kind of thing, and then having it add new dispatches to core functions using some abstract type. I think that's a sane thing to do and it fixes the precompilation problem. It just requires a re-write of the backend code to do it.

@lobingera

lobingera commented Dec 31, 2017

"It just requires a re-write of the backend code to do it."

What do you mean by just? Isn't that just giving up on modularization (which is (imho) re-using code without changing it)?

@ChrisRackauckas
Member

No. It's putting the backend code into a separate package like PlotsGR and having that implement a documented function interface by implementing dispatch on a concrete subtype of some abstract backend dispatching type. It's more modular and allows more code re-use, at the cost of having to have the backend code in a separate repo. But if Pkg3 can handle separate submodules in the same package well (with precompilation), then it can be one repo.

@lobingera

Sorry, I'm lost. I thought that the backend code is already in a separate package (i.e. GR.jl). And in your example, isn't PlotsGR the above-mentioned glue package? And when is the decision taken to execute/precompile PlotsGR?

@ChrisRackauckas
Member

When the user calls using PlotsGR. That would be how a backend is chosen; the package's init call could then set a global in Plots so that the backend choice reflects the latest using. Then each plot call can have an optional argument passing through this global that says what the current backend is in terms of a type, and core functions can be overloaded for specific backends by new dispatches in PlotsGR. So the decision to execute PlotsGR code is made when the user calls using PlotsGR, and the code to precompile is the new dispatches.
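A minimal sketch of the design described above, with both "packages" written as modules in one file so that it runs on its own. `PlotsSketch`, `PlotsGRSketch`, `AbstractBackend`, `render`, and `current_backend` are illustrative names, not the actual Plots.jl or GR.jl API.

```julia
module PlotsSketch                       # stand-in for Plots
    abstract type AbstractBackend end

    # Global that records the most recently initialised backend.
    const current_backend = Ref{Union{Nothing,AbstractBackend}}(nothing)

    # Core function that backend packages extend with new methods.
    function render end

    # The backend travels as an optional argument defaulting to the global.
    plot(data; backend = current_backend[]) = render(backend, data)
end

module PlotsGRSketch                     # stand-in for the proposed PlotsGR
    import ..PlotsSketch

    struct GRBackend <: PlotsSketch.AbstractBackend end

    # New dispatch on the backend type; this is ordinary, precompilable code.
    PlotsSketch.render(::GRBackend, data) = "GR plot of $(length(data)) points"

    # Loading the backend package selects it as the current backend.
    __init__() = (PlotsSketch.current_backend[] = GRBackend())
end

explicit = PlotsSketch.plot([1, 2, 3]; backend = PlotsGRSketch.GRBackend())
```

Calling `PlotsSketch.plot(data)` with no keyword uses whichever backend module was initialised last. Nothing here is loaded conditionally, which is why the approach sidesteps the precompilation problem.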

@StefanKarpinski modified the milestones: 1.0, 1.x on Jan 4, 2018
@JeffBezanson removed the "triage" label (This should be discussed on a triage call) on Jan 4, 2018
@timholy
Member

timholy commented Jul 8, 2018

Now that we have Base.package_callbacks as a "blessed" interface, in my opinion JuliaPackaging/Requires.jl#46 seems like a non-objectionable solution to the other half of this problem. If that gets merged then perhaps we can close this.

@nalimilan
Member

@timholy That sounds great. Could you explain a bit what that PR does? Does it fix all issues with the approach currently adopted by Requires?

@timholy
Member

timholy commented Jul 10, 2018

At one time Requires did a lot of "sneaky stuff" (i.e., overwrite methods in Core and Base), but over time it has worked more harmoniously with base julia; in particular, the addition of Base.package_callbacks in 0.6 gave us an official interface for calling a function whenever a new package has been loaded. The interface for a package callback is f(id::Base.PkgId), thus passing information to the callback about which package just got loaded. The entire list of callbacks gets called every time you load a new package. You'll note a comment that the interface was marked as experimental, but it hasn't changed during the entire 0.7 cycle (lots of stuff about loading has changed, but not the package_callbacks interface) and since we're about to release I think we can consider it safe. At least Revise.jl uses the same interface, so Requires is not the only consumer.
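A minimal sketch of the callback interface described above. `Base.package_callbacks` is an unexported Base internal, so treat this as illustrative rather than as stable API; `loaded_log` and `record_load` are names made up for this example.

```julia
# Names of packages reported by Base after this callback is registered.
const loaded_log = String[]

# A package callback receives the identity of the package that was just loaded.
record_load(id::Base.PkgId) = push!(loaded_log, id.name)

# Register the callback; Base calls every entry each time a new package loads.
push!(Base.package_callbacks, record_load)
```

After this, loading a package that was not already loaded in the session should append its name to `loaded_log`. Packages that are already loaded do not trigger the callbacks again.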

All this is provided by Base; now, on to Requires. First, let me describe the state of Requires master, which is largely the work of @MikeInnes. Requires defines a single callback function and pushes it to Base.package_callbacks. Requires also maintains a Dict of thunks for use by its callback function; the Dict is indexed by PkgId, which does not require that the module itself exists (yet); the values stored in the Dict are just lists of functions (thunks) to call conditional on the loading of that package.
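The bookkeeping described above can be pictured with the following sketch. It is not the Requires.jl implementation: `listeners`, `on_load`, and `dispatch_load` are hypothetical names, and the load event is simulated by calling the callback by hand instead of registering it with `Base.package_callbacks`.

```julia
# Thunks to run once a given package is loaded, keyed by package identity.
# A PkgId can be constructed before the package's module exists.
const listeners = Dict{Base.PkgId, Vector{Function}}()

# Register `f` to run when `pkg` is loaded.
on_load(f::Function, pkg::Base.PkgId) = push!(get!(listeners, pkg, Function[]), f)

# The single callback that would be pushed to Base.package_callbacks.
function dispatch_load(pkg::Base.PkgId)
    for f in pop!(listeners, pkg, Function[])
        f()
    end
end

json = Base.PkgId(Base.UUID("682c06a0-de6a-54ab-a142-c8b1cf79cde6"), "JSON")
ran = Ref(false)
on_load(json) do
    ran[] = true        # stands in for include("morecode.jl")
end

dispatch_load(json)     # simulate Base reporting that JSON was loaded
```

This sketch drops the thunks after running them and ignores the case where the package is already loaded at registration time; a real implementation has to handle both.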

This Dict gets populated through @require calls. @require is a bit complicated in master, so now let me turn to my PR. In my PR, @require just does this (I've edited this heavily so it looks more like regular code):

julia> macroexpand(Main, quote
           @require JSON="682c06a0-de6a-54ab-a142-c8b1cf79cde6" include("morecode.jl")
       end)

        if !Requires.isprecompiling()
            Requires.listenpkg(Base.PkgId(Base.UUID("682c06a0-de6a-54ab-a142-c8b1cf79cde6"), "JSON")) do 
                Requires.withpath(@__DIR__) do 
                    Requires.err(@__MODULE__, "JSON") do 
                        const JSON = Base.require(JSON [682c06a0-de6a-54ab-a142-c8b1cf79cde6])
                        include("morecode.jl")
                    end
                end
            end
        end

All that basically does is register the following:

const JSON = Base.require(JSON [682c06a0-de6a-54ab-a142-c8b1cf79cde6])
include("morecode.jl")

to be executed (whenever JSON gets loaded) inside whichever module you used @require in (that's what the @__MODULE__ is about). The last touch is setting the path (via @__DIR__) for finding "morecode.jl". In my PR, this @require statement must occur inside the module's __init__ function, which means that we register this JSON-dependency at the time of module initialization.

In the master branch of Requires, @require does a little bit more stuff because it supports having a @require statement outside __init__; it basically stores all the @require calls in a module-global array __inits__ and then creates an __init__ function that iterates through the list and registers them. IMO this is a bad idea because it is exclusive with having a user-written __init__ function. (You can iterate over __inits__ yourself, but this appears to be undocumented, and the lack of error output about why it failed is problematic.) So I would describe this as the one dicey remaining feature in Requires, which is why I stripped it out. So on that branch Requires plays well with precompilation, custom initialization, and all the other fancy things we now know we need.

If that gets merged, I think it's fair to say that Requires is a clean and straightforward solution(*) to the problem of executing code that is dependent upon other modules having been loaded. That may not be the full list of ways we want to support interaction among packages, but it's the big one, and the one for which there aren't as good alternatives. Again, most of this progress has been from the work of @MikeInnes and those who designed the Base.package_callbacks interface; all I did was give this a nudge to fix a couple of bugs and strip out the last bit of problematic behavior.

Unfortunately, I don't think it's a deprecatable change, so it's pretty heavily breaking.


(*) some might object to monkeying with task_local_storage to set the path, but by my reading (I could be wrong) it's safe.

@StefanKarpinski
Member Author

That sounds great, @timholy! In the future it might be good to make this entire business a little nicer to use and more official, but for now it sounds like we have everything we need. I support making the breaking change now so that Requires becomes a "clean and straightforward solution".

@lobingera

@timholy You mention above that this is a solution to one half of the problem; what would be the other?

@timholy
Member

timholy commented Jul 10, 2018

@lobingera, meaning Base.package_callbacks makes it possible to do this correctly (it's "the backend") and Requires is what implements the specific logic ("the frontend").

@MikeInnes
Member

Tim's PR is merged and tagged; I concur that it's a strong and stable solution for package negotiations.
