-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A global, shared ExecutionContext
#824
Comments
Thanks for the suggestion. I do not think we should do this in Also, imo having a singleton like this in Rust library is an anti-pattern. |
@yjshen based on the sample code in your io source PR, it looks like you want the following API as a consumer of datafusion: let ctx = ExecutionContext::get(); // returns global context singleton
ExecutionContext::set(ctx); // updates global context singleton If so, I think this pattern can be implemented entirely within a self-contained crate that can be published in crates.io. The global singleton context crate could provide a get/set API like so: let ctx = global_df_ctx::get();
global_df_ctx::set(ctx); The end user experience would stay the same, but the code would be better decoupled and users can choose to opt-in to the feature by simply adding the extension crate as a dependency. |
If a singleton
|
If it's for memory control, perhaps we could extend @jorgecarleitao @andygrove @alamb @Dandandan for the context, @yjshen is working on a native spark executor using Datafusion: https://the-asf.slack.com/archives/C01QUFS30TD/p1621582734043500. Executor memory control would be very relevant to ballista as well, so I am curious if @andygrove has any opinion on what's the right abstraction here. |
Thanks a lot for the context. I think this takes 2 components:
With respect to the first item, what about adding a new argument to wrt to the second, since the execute is already |
By execution state, you are referring to |
I am also very interested in the ability to both track and limit memory usage for a datafusion plan -- there is some additional discussion / context related to this on #587. @jorgecarleitao 's implementation idea in #824 (comment) sounds like a good start. I don't think we can realistically try and limit memory usage before we can measure it. As for the single global context, it is also my preference that the code that is using DataFusion to manage the contexts. So specifically it would make sense to me if |
I would not want the execution plan traits to have a dependency on DataFusion's context. Ballista has its own context that also creates execution plans. |
I am very interested in this work. I have been talking about deterministic memory use as being one of the advantages of Rust over JVM for some time and it would be great to see this implemented. I like the idea of passing in some form of context state with a memory tracker. It would be good if this is not tied specifically to a DataFusion context, so that physical operators can be used in other contexts. I also think this gets us back into discussing scheduling and I have just added the following note to #587: We should also discuss creating a scheduler in DataFusion (see #64) since it is related to this work. Rather than try and run all the things at once, it would be better to schedule work based on the available resources (cores / memory). We would still need the ability to track/limit memory use within operators but the scheduler could be aware of this and only allocate tasks if there is memory budget available. |
The tying to datafusion is easily achived via a trait and E.g. in ballista memory is not even from a single node, so which node is being run changes what resources are available. |
Based on the discussion so far, I recommend closing this issue since it was originally created for global shared ExecutionContext and I think we have already reached a consensus on this particular topic. We can move all memory ExecutionPlan measurement and enforcement related discussions to #587 instead. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'd like to propose making an
ExecutionContext
static and publicly accessible. Several cases would benefit from a global context:Describe the solution you'd like
A single, global
ExecutionContext
that is inited on its first use and a getter to always returning the same instance.Describe alternatives you've considered
Passing around a shared state/context to each operator needs it. For example, table scan, join, hash agg, and sort.
Additional context
Similar to the shared
SparkEnv
in each process (driver/executor) of Spark.The text was updated successfully, but these errors were encountered: