Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate batch_size configuration in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig #1562

Merged
merged 5 commits into from
Jan 17, 2022

Conversation

yjshen
Copy link
Member

@yjshen yjshen commented Jan 14, 2022

Which issue does this PR close?

Closes #1565 .

Rationale for this change

  1. batch_size is only execution runtime related, should consolidate all into RuntimeConfig.
  2. Rename PhysicalPlanConfig to FileScanConfig to make it clear.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate labels Jan 14, 2022
@@ -246,8 +246,8 @@ impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode {
.collect::<Result<Vec<_>, _>>()?,
partition_count as usize,
),
PartitionMethod::RoundRobin(batch_size) => {
Partitioning::RoundRobinBatch(batch_size as usize)
PartitionMethod::RoundRobin(partition_count) => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch_size here is wrong, it's partition count

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh god, good find, 4096 partitions is not going to be fun.

@@ -92,8 +92,8 @@ mod roundtrip_tests {
.map_err(BallistaError::DataFusionError)?,
);

for batch_size in test_batch_sizes.iter() {
let rr_repartition = Partitioning::RoundRobinBatch(*batch_size);
for partition_count in test_partition_counts.iter() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@@ -178,7 +178,7 @@ mod tests {
async fn read_limit() -> Result<()> {
let runtime = Arc::new(RuntimeEnv::default());
let projection = Some(vec![0, 1, 2, 3]);
let exec = get_exec("aggregate_test_100.csv", &projection, 1024, Some(1)).await?;
let exec = get_exec("aggregate_test_100.csv", &projection, Some(1)).await?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR, most of the magic number 1024 is removed from the project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea of putting batch_size on the RuntimeConfig is 🏅 👍

@@ -901,14 +898,13 @@ pub struct ExecutionConfig {
/// Should Datafusion parquet reader using the predicate to prune data
parquet_pruning: bool,
/// Runtime configurations such as memory threshold and local disk for spill
pub runtime_config: RuntimeConfig,
pub runtime: RuntimeConfig,
Copy link
Member Author

@yjshen yjshen Jan 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to runtime since I think the word config is redundant. Consider the usage exec_config.runtime_config.batch_size for example. Everything in exec_config is config.

@yjshen yjshen marked this pull request as ready for review January 14, 2022 10:14
@alamb
Copy link
Contributor

alamb commented Jan 14, 2022

Thanks @yjshen -- I'll check this out over the weekend.

BTW are you planning on working on the next steps for #1526 (external sorting)? I would be interested in working on this area too, but I don't want to conflict with your plans.

I found #1526 (comment) -- sorry

@alamb alamb changed the title Consolidate configurations in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig Consolidate batch_size configuration in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig Jan 16, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a really nice improvement @yjshen -- thank you!

In fact it would probably be good to add some other common things to the RuntimeEnv that is currently plumbed through in other ways such as object_store

cc @rdettai who had some thoughts related to config on #1141

@@ -246,8 +246,8 @@ impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode {
.collect::<Result<Vec<_>, _>>()?,
partition_count as usize,
),
PartitionMethod::RoundRobin(batch_size) => {
Partitioning::RoundRobinBatch(batch_size as usize)
PartitionMethod::RoundRobin(partition_count) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find

@@ -178,7 +178,7 @@ mod tests {
async fn read_limit() -> Result<()> {
let runtime = Arc::new(RuntimeEnv::default());
let projection = Some(vec![0, 1, 2, 3]);
let exec = get_exec("aggregate_test_100.csv", &projection, 1024, Some(1)).await?;
let exec = get_exec("aggregate_test_100.csv", &projection, Some(1)).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea of putting batch_size on the RuntimeConfig is 🏅 👍

@yjshen
Copy link
Member Author

yjshen commented Jan 18, 2022

Thanks @alamb , the main reason I'm not moving object_store into RuntimeEnv is that it's used both during query planning and query execution. What do you think?

I was trying to consolidate more of the configs from the three Config as the initial name for this PR shows. But after a round of review of all the configurations, I find only one batch_size weird. Do you have any other config in mind?

@alamb
Copy link
Contributor

alamb commented Jan 18, 2022

Do you have any other config in mind?

No, object_store was the one I had in mind; If anything else occurs to me I will let you know

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consolidate various configurations options, remove unrelated batch_size
3 participants