Introduce LogicalPlan invariants, begin automatically checking them #13651

wiedld · 2024-12-05T00:10:55Z

Which issue does this PR close?

Rationale for this change

The original discussion included implicit changes which can cause problems when trying to upgrade DF. One class of bugs is related to user-constructed LPs and the mutations of these LPs. This PR is a first step toward programmatically enforcing the rules of what should, and should not, be done.

What changes are included in this PR?

- Define what must always be true for an LP.
  - This applies to all LPs, including those constructed by users (e.g. their own language frontends).
- Define the contract for semantic mutation.
  - The LP may not be executable before the Analyzer passes.
  - The LP must be executable after the Analyzer passes.
- Define what can further be mutated during the LP optimizer passes.
  - The LP must be executable before, and after, the optimizer passes.
  - The LP schema cannot be mutated by the optimizer.
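
As a rough illustration of the two tiers described above, here is a minimal sketch. The `InvariantLevel::Always` and `InvariantLevel::Executable` names come from later in this thread; `ToyPlan`, its fields, and the concrete rules are hypothetical stand-ins, not DataFusion's real types or checks.

```rust
// Toy model, not DataFusion's real types: a plan is just a list of output
// column names plus a flag saying whether every expression has been resolved.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyPlan {
    pub columns: Vec<String>,
    pub fully_resolved: bool,
}

/// Which set of rules to enforce. The variant names follow the PR discussion;
/// the rules attached to them below are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InvariantLevel {
    /// Must hold for every plan, including user-constructed ones.
    Always,
    /// Must additionally hold once the analyzer has run.
    Executable,
}

impl ToyPlan {
    pub fn check_invariants(&self, level: InvariantLevel) -> Result<(), String> {
        // Always: the plan must produce at least one column (placeholder rule).
        if self.columns.is_empty() {
            return Err("plan has no output columns".to_string());
        }
        // Executable: everything in Always, plus the plan must be resolved.
        if level == InvariantLevel::Executable && !self.fully_resolved {
            return Err("plan is not executable: unresolved expressions".to_string());
        }
        Ok(())
    }
}
```

In this sketch a plan fresh from a frontend would only need to pass `Always`, while anything handed to the optimizer would need to pass `Executable`.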

Are these changes tested?

By existing tests; the changes are also benchmarked for their impact on planning time.

Are there any user-facing changes?

There is a new public API, LogicalPlan::check_invariants.

alamb

Thank you @wiedld -- I really like this idea

I filed an issue for this idea here: #13652

@findepi @jonahgao or @Omega359 I wonder if you have any thoughts on this basic idea?

Omega359 · 2024-12-05T03:24:50Z

I love the idea of having additional testing like this! It's definitely out of my area of expertise but a few questions:

- Should invariants be pluggable? For example, InfluxDB might have additional invariants that another derivative of DF does not (or vice versa).
- Is the invariant testing overhead acceptable for production, or should there be the capability to disable it (enabled by default, of course)?
- Is there an equivalent set of invariants elsewhere in the core besides the LP (maybe the physical plan? Just guessing, as I'm definitely no expert here) where this approach could be duplicated?

wiedld · 2024-12-05T04:37:57Z

datafusion/optimizer/src/optimizer.rs

+    // verify invariant: equivalent schema across union inputs
+    // assert_unions_are_valid(check_name, plan)?;
+
+    // TODO: trait API and provide extension on the Optimizer to define own validations?


Here is a mention of the extensibility of invariants. Options include:

- for general invariants:
  - defined as being checked before/after each OptimizerRule, and applied here in check_plan() (or equivalent code)
  - we could provide Optimizer.invariants = Vec<Arc<dyn InvariantCheck>> for user-defined invariants
- for invariants specific to a given OptimizerRule:
  - we could provide OptimizerRule::check_invariants() such that certain invariants are only checked for a given rule (instead of all rules)
  - for a user-defined OptimizerRule, users can also check their own invariants

Ditto for the AnalyzerRule passes. Although I wasn't sure how much complexity and planning-time overhead this adds; as @Omega359 mentions, we could make it configurable (e.g. run it for CI and for debugging in downstream projects).

This WIP is about proposing different ideas of what we could do. 🤔
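
To make the `Optimizer.invariants = Vec<Arc<dyn InvariantCheck>>` option a bit more concrete, here is one possible shape. `InvariantCheck` is only proposed in this thread and is not an existing DataFusion trait; `ToyPlan`, `ToyOptimizer`, and `MaxColumns` are hypothetical names used purely for illustration.

```rust
use std::sync::Arc;

// Stand-in for a logical plan; the real LogicalPlan is far richer.
pub struct ToyPlan {
    pub columns: Vec<String>,
}

/// Hypothetical trait for a pluggable invariant; not an existing DataFusion API.
pub trait InvariantCheck {
    fn name(&self) -> &str;
    fn check(&self, plan: &ToyPlan) -> Result<(), String>;
}

/// Example user-defined invariant: cap the number of output columns.
pub struct MaxColumns(pub usize);

impl InvariantCheck for MaxColumns {
    fn name(&self) -> &str {
        "max_columns"
    }

    fn check(&self, plan: &ToyPlan) -> Result<(), String> {
        if plan.columns.len() > self.0 {
            return Err(format!("{} columns exceeds limit {}", plan.columns.len(), self.0));
        }
        Ok(())
    }
}

/// Toy optimizer that carries user-registered invariants.
#[derive(Default)]
pub struct ToyOptimizer {
    pub invariants: Vec<Arc<dyn InvariantCheck>>,
}

impl ToyOptimizer {
    /// Runs every registered invariant; meant to be called before and after each rule.
    pub fn check_plan(&self, plan: &ToyPlan) -> Result<(), String> {
        for inv in &self.invariants {
            inv.check(plan)
                .map_err(|e| format!("invariant `{}` failed: {e}", inv.name()))?;
        }
        Ok(())
    }
}
```

A downstream project would push its own checks onto `invariants` once at setup; the per-rule variant (`OptimizerRule::check_invariants()`) would instead hang a similar method off each rule.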

Maybe it can be controlled through environment variables, similar to RUST_LOG or RUST_BACKTRACE. Enable it for debugging when problems are encountered or during an upgrade.
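
A minimal sketch of what such an environment-variable switch could look like. The variable name `DATAFUSION_CHECK_INVARIANTS` and the accepted values are assumptions made up for this example; nothing in this PR defines them.

```rust
use std::env;

/// Interprets the value of a toggle variable.
/// Unset, empty, "0", and "false" (any case) mean off; anything else means on.
pub fn parse_toggle(value: Option<&str>) -> bool {
    match value.map(str::trim) {
        None | Some("") | Some("0") => false,
        Some(v) => !v.eq_ignore_ascii_case("false"),
    }
}

/// Reads a hypothetical `DATAFUSION_CHECK_INVARIANTS` variable.
/// The name is an assumption for illustration only.
pub fn invariant_checks_enabled() -> bool {
    parse_toggle(env::var("DATAFUSION_CHECK_INVARIANTS").ok().as_deref())
}
```

Splitting the parsing from the environment lookup keeps the toggle logic testable without touching process-wide state.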

Do you have an example use case for user-defined plan invariants?

We have some special invariants for our SortPreservingMerge replacement, ProgressiveEval (related to time ranges of parquet files) that would be great to be able to encode

> Maybe it can be controlled through environment variables, similar to RUST_LOG or RUST_BACKTRACE. Enable it for debugging when problems are encountered or during an upgrade.

We could also add it as a debug_assert! after each optimizer pass and call the real validation:

- After analyze
- After all the optimizer passes
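
One way to read the suggestion above, sketched with toy types: a `debug_assert!` after every optimizer pass (compiled out of release builds), plus an unconditional validation at the two stage boundaries. `ToyPlan`, `validate`, and the function names are hypothetical.

```rust
/// Toy plan: just the output column names.
pub struct ToyPlan {
    pub columns: Vec<String>,
}

/// Stand-in for the "real validation"; here it only rejects empty column names.
pub fn validate(plan: &ToyPlan) -> Result<(), String> {
    if plan.columns.iter().any(|c| c.is_empty()) {
        return Err("empty column name".to_string());
    }
    Ok(())
}

/// Called after each optimizer pass: checked in debug builds, skipped in release builds.
pub fn after_each_pass(plan: &ToyPlan) {
    debug_assert!(
        validate(plan).is_ok(),
        "plan invariant violated after optimizer pass"
    );
}

/// Called after the analyzer and after all optimizer passes: always checked,
/// and reported as an error rather than a panic.
pub fn after_stage(plan: &ToyPlan) -> Result<(), String> {
    validate(plan)
}
```

The trade-off in this shape is that a per-pass violation panics (debug only), while the stage-boundary checks return a normal error in every build.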

> We have some special invariants for our SortPreservingMerge replacement, ProgressiveEval (related to time ranges of parquet files) that would be great to be able to encode

Is this about LogicalPlan::Extension? I agree it makes sense to support validation of these if we validate the overall plan.

I've added this to the followup items list: #13652 (comment)

findepi · 2024-12-05T13:28:31Z

datafusion/optimizer/src/optimizer.rs

+/// This invariant is subject to change.
+/// refer: <https://github.com/apache/datafusion/issues/13525#issuecomment-2494046463>
+fn assert_unique_field_names(plan: &LogicalPlan) -> Result<()> {
+    plan.schema().check_names()?;


Is check_names also called whenever creating new DFSchema?

Yes, on every creation, but not after every merge. That should be OK as long as no bug is introduced in the merge, although I would prefer to add the check there.
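
For readers unfamiliar with the check being discussed, here is a toy version of the idea: reject duplicate field names on creation, and optionally re-run the same check after a merge. This is not the DFSchema implementation; as far as I understand, the real `check_names` also deals with qualified vs. unqualified names, which this sketch ignores, and `merge_checked` is a hypothetical helper.

```rust
use std::collections::HashSet;

/// Returns an error naming the first duplicated field, if any.
pub fn check_unique_names(fields: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::with_capacity(fields.len());
    for name in fields {
        // `insert` returns false when the name was already present.
        if !seen.insert(*name) {
            return Err(format!("duplicate field name: {name}"));
        }
    }
    Ok(())
}

/// Hypothetical merge that re-runs the uniqueness check on the combined fields,
/// i.e. the "add the check there" preference mentioned above.
pub fn merge_checked<'a>(left: &[&'a str], right: &[&'a str]) -> Result<Vec<&'a str>, String> {
    let merged: Vec<&'a str> = left.iter().chain(right.iter()).copied().collect();
    check_unique_names(&merged)?;
    Ok(merged)
}
```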

findepi · 2024-12-05T13:31:26Z

I like the verification. The only concern is the overhead of doing it, especially for large plans: every optimizer pass triggers a new verification of all plan nodes. I hope one day we move towards more iterative optimization, at which point we should be able to verify plan invariants locally. That would guarantee that the verification overhead is linearly proportional to the number of modifications applied to the plan. It would require, e.g., that get_type is a constant-time operation (#12604), since plan verification should also check that the types match.

alamb · 2024-12-06T20:53:29Z

I think it would also be great if we could consider moving the invariant check directly into LogicalPlan -- for example, LogicalPlan::check_invariants

This would make this function easier to discover and for others to understand what the invariants are.

wiedld · 2024-12-16T05:52:44Z

The updates focus on separating LP invariants from analyzer pass checks and from optimizer pass checks (including invariants listed as the analyzer's responsibility). I added reference links to the invariants in the docs, in an attempt to delineate an invariant from a valid plan, but I'm still unclear on this boundary.

I'll make it configurable (or debug only) after I get feedback on this^^. 🙏🏼

alamb

This is looking great @wiedld -- thank you

I think the basic structure of assert_invariants is looking good.

I had a suggestion for how to combine the two different invariant checks into a single API.

In terms of planning:

Perhaps we can begin creating an epic for "Enforcing Invariants" and list items there to follow up on (like "Define the invariants for Union plans" for example)

In terms of this PR, once we solidify the API the only remaining concern I have is with potential performance slowdown of re-checking the invariants. When we have the code ready we can run the sql_planning benchmark to make sure this PR doesn't slow down planning. If it does, we can perhaps only run the invariant checks in debug builds

wiedld · 2024-12-18T00:17:41Z

Ran all sql_planner benchmarks once, and then re-ran anything showing a 1.09 to 1.24 ratio difference (the largest change was 3.3ms, on a 17ms total).

all benchmarks

c3d-standard-8 debian-12

group                                         main                                   with_invariants_check
-----                                         ----                                   ---------------------
logical_aggregate_with_join                   1.00   1015.8±9.56µs        ? ?/sec    1.00  1015.9±10.30µs        ? ?/sec
logical_select_all_from_1000                  1.00      4.7±0.03ms        ? ?/sec    1.01      4.8±0.01ms        ? ?/sec
logical_select_one_from_700                   1.00    762.6±7.60µs        ? ?/sec    1.00    763.6±5.10µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    729.3±5.35µs        ? ?/sec    1.00    732.4±4.83µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    712.5±4.97µs        ? ?/sec    1.00    714.2±5.51µs        ? ?/sec
physical_intersection                         1.00   1729.6±9.00µs        ? ?/sec    1.08   1872.4±9.49µs        ? ?/sec
physical_join_consider_sort                   1.00      2.4±0.02ms        ? ?/sec    1.19      2.9±0.02ms        ? ?/sec
physical_join_distinct                        1.00    702.1±8.19µs        ? ?/sec    1.01    706.8±6.23µs        ? ?/sec
physical_many_self_joins                      1.00     13.8±0.09ms        ? ?/sec    1.24     17.1±0.16ms        ? ?/sec
physical_plan_clickbench_all                  1.00    187.3±0.61ms        ? ?/sec    1.00    187.4±0.46ms        ? ?/sec
physical_plan_clickbench_q1                   1.01      2.7±0.02ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q10                  1.01      3.6±0.04ms        ? ?/sec    1.00      3.6±0.03ms        ? ?/sec
physical_plan_clickbench_q11                  1.00      3.6±0.04ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
physical_plan_clickbench_q12                  1.00      3.8±0.02ms        ? ?/sec    1.00      3.8±0.02ms        ? ?/sec
physical_plan_clickbench_q13                  1.00      3.4±0.02ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec
physical_plan_clickbench_q14                  1.00      3.6±0.02ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
physical_plan_clickbench_q15                  1.00      3.5±0.02ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec
physical_plan_clickbench_q16                  1.00      3.0±0.02ms        ? ?/sec    1.00      3.0±0.02ms        ? ?/sec
physical_plan_clickbench_q17                  1.00      3.1±0.03ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_plan_clickbench_q18                  1.00      2.9±0.02ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
physical_plan_clickbench_q19                  1.00      3.6±0.02ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
physical_plan_clickbench_q2                   1.01      2.9±0.03ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q20                  1.00      2.7±0.02ms        ? ?/sec    1.01      2.7±0.02ms        ? ?/sec
physical_plan_clickbench_q21                  1.00      2.9±0.02ms        ? ?/sec    1.01      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q22                  1.00      3.7±0.02ms        ? ?/sec    1.01      3.7±0.02ms        ? ?/sec
physical_plan_clickbench_q23                  1.00      4.1±0.02ms        ? ?/sec    1.01      4.2±0.02ms        ? ?/sec
physical_plan_clickbench_q24                  1.00      4.9±0.02ms        ? ?/sec    1.04      5.0±0.02ms        ? ?/sec
physical_plan_clickbench_q25                  1.00      3.2±0.02ms        ? ?/sec    1.02      3.2±0.02ms        ? ?/sec
physical_plan_clickbench_q26                  1.00      2.9±0.01ms        ? ?/sec    1.02      3.0±0.02ms        ? ?/sec
physical_plan_clickbench_q27                  1.00      3.2±0.02ms        ? ?/sec    1.02      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q28                  1.00      3.8±0.02ms        ? ?/sec    1.02      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q29                  1.00      4.8±0.03ms        ? ?/sec    1.01      4.9±0.02ms        ? ?/sec
physical_plan_clickbench_q3                   1.02      2.9±0.04ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q30                  1.00     16.0±0.06ms        ? ?/sec    1.00     16.0±0.07ms        ? ?/sec
physical_plan_clickbench_q31                  1.00      3.9±0.02ms        ? ?/sec    1.01      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q32                  1.00      3.9±0.02ms        ? ?/sec    1.01      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q33                  1.00      3.5±0.02ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec
physical_plan_clickbench_q34                  1.00      3.1±0.01ms        ? ?/sec    1.01      3.1±0.02ms        ? ?/sec
physical_plan_clickbench_q35                  1.00      3.3±0.02ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q36                  1.01      4.4±0.03ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q37                  1.00      4.4±0.02ms        ? ?/sec    1.01      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q38                  1.00      4.4±0.02ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q39                  1.00      4.0±0.02ms        ? ?/sec    1.01      4.0±0.02ms        ? ?/sec
physical_plan_clickbench_q4                   1.02      2.7±0.03ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
physical_plan_clickbench_q40                  1.00      4.5±0.02ms        ? ?/sec    1.00      4.5±0.02ms        ? ?/sec
physical_plan_clickbench_q41                  1.00      4.3±0.02ms        ? ?/sec    1.00      4.3±0.02ms        ? ?/sec
physical_plan_clickbench_q42                  1.00      4.1±0.02ms        ? ?/sec    1.01      4.1±0.02ms        ? ?/sec
physical_plan_clickbench_q43                  1.00      4.1±0.02ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec
physical_plan_clickbench_q44                  1.00      2.8±0.02ms        ? ?/sec    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q45                  1.00      2.8±0.01ms        ? ?/sec    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q46                  1.01      3.3±0.02ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q47                  1.00      3.9±0.02ms        ? ?/sec    1.00      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q48                  1.00      4.4±0.02ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q49                  1.00      4.6±0.03ms        ? ?/sec    1.00      4.6±0.02ms        ? ?/sec
physical_plan_clickbench_q5                   1.01      2.9±0.02ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_clickbench_q6                   1.01      2.9±0.02ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
physical_plan_clickbench_q7                   1.02      3.4±0.03ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec
physical_plan_clickbench_q8                   1.01      3.1±0.03ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_plan_clickbench_q9                   1.02      3.4±0.03ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_tpcds_all                       1.00   1178.8±3.25ms        ? ?/sec    1.00   1183.7±0.95ms        ? ?/sec
physical_plan_tpch_all                        1.00     73.1±0.31ms        ? ?/sec    1.00     73.1±0.21ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.6±0.01ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
physical_plan_tpch_q10                        1.00      3.6±0.01ms        ? ?/sec    1.01      3.6±0.01ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.1±0.01ms        ? ?/sec    1.01      3.1±0.01ms        ? ?/sec
physical_plan_tpch_q12                        1.00      2.5±0.01ms        ? ?/sec    1.01      2.5±0.01ms        ? ?/sec
physical_plan_tpch_q13                        1.00   1833.9±8.83µs        ? ?/sec    1.00  1830.2±10.09µs        ? ?/sec
physical_plan_tpch_q14                        1.00      2.2±0.01ms        ? ?/sec    1.01      2.2±0.01ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.2±0.01ms        ? ?/sec    1.01      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q17                        1.00      2.9±0.01ms        ? ?/sec    1.01      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.2±0.01ms        ? ?/sec    1.00      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q19                        1.00      5.4±0.02ms        ? ?/sec    1.01      5.4±0.02ms        ? ?/sec
physical_plan_tpch_q2                         1.00      6.3±0.03ms        ? ?/sec    1.01      6.3±0.02ms        ? ?/sec
physical_plan_tpch_q20                        1.00      3.8±0.01ms        ? ?/sec    1.00      3.8±0.01ms        ? ?/sec
physical_plan_tpch_q21                        1.00      5.1±0.02ms        ? ?/sec    1.00      5.1±0.01ms        ? ?/sec
physical_plan_tpch_q22                        1.00      2.9±0.01ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q3                         1.00      2.5±0.01ms        ? ?/sec    1.01      2.6±0.01ms        ? ?/sec
physical_plan_tpch_q4                         1.00  1995.3±10.65µs        ? ?/sec    1.01      2.0±0.01ms        ? ?/sec
physical_plan_tpch_q5                         1.00      3.5±0.01ms        ? ?/sec    1.02      3.6±0.01ms        ? ?/sec
physical_plan_tpch_q6                         1.00  1364.7±12.81µs        ? ?/sec    1.01   1375.4±8.17µs        ? ?/sec
physical_plan_tpch_q7                         1.00      4.7±0.02ms        ? ?/sec    1.02      4.8±0.02ms        ? ?/sec
physical_plan_tpch_q8                         1.00      5.6±0.02ms        ? ?/sec    1.01      5.7±0.03ms        ? ?/sec
physical_plan_tpch_q9                         1.00      4.3±0.02ms        ? ?/sec    1.01      4.4±0.01ms        ? ?/sec
physical_select_aggregates_from_200           1.00     24.6±0.09ms        ? ?/sec    1.01     24.7±0.09ms        ? ?/sec
physical_select_all_from_1000                 1.00     44.8±0.44ms        ? ?/sec    1.03     46.3±0.27ms        ? ?/sec
physical_select_one_from_700                  1.00      2.5±0.01ms        ? ?/sec    1.15      2.9±0.02ms        ? ?/sec
physical_theta_join_consider_sort             1.00      2.7±0.01ms        ? ?/sec    1.17      3.2±0.02ms        ? ?/sec
physical_unnest_to_join                       1.00      2.5±0.01ms        ? ?/sec    1.09      2.7±0.02ms        ? ?/sec
with_param_values_many_columns                1.02    206.4±1.18µs        ? ?/sec    1.00    202.4±0.89µs        ? ?/sec

confirmation run

group                                         main                                   with_invariants_check
-----                                         ----                                   ---------------------
physical_join_consider_sort                   1.00      2.4±0.01ms        ? ?/sec    1.19      2.9±0.01ms        ? ?/sec
physical_many_self_joins                      1.00     13.9±0.09ms        ? ?/sec    1.23     17.0±0.11ms        ? ?/sec
physical_select_one_from_700                  1.00      2.5±0.01ms        ? ?/sec    1.15      2.9±0.02ms        ? ?/sec
physical_theta_join_consider_sort             1.00      2.7±0.02ms        ? ?/sec    1.15      3.2±0.01ms        ? ?/sec
physical_unnest_to_join                       1.00      2.4±0.01ms        ? ?/sec    1.10      2.7±0.02ms        ? ?/sec

The performance change was minimized because we didn't add many extra points at which checks run. Specifically:

- (new check): the subset of InvariantLevel::Always checks occurs once before the analyzer starts
- (existing/main check): the full InvariantLevel::Executable check occurs once after the analyzer is done
- (existing/main check): each optimizer pass only checks that the schema is not changed
- (new check): the full InvariantLevel::Executable check occurs once before all optimizer passes start
- (existing/main check): the full InvariantLevel::Executable check occurs once after all optimizer passes are done
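
The placement of these checks can be sketched as a toy pipeline. Only the `InvariantLevel` variant names come from this PR; `ToyPlan`, `Rule`, `check`, and `plan_query` are hypothetical, and in DataFusion the analyzer and optimizer are separate entry points rather than one function.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct ToyPlan {
    pub schema: Vec<String>,
    pub resolved: bool,
}

#[derive(Clone, Copy, PartialEq)]
pub enum InvariantLevel {
    Always,
    Executable,
}

/// Placeholder rules: a non-empty schema always, and a resolved plan when executable.
pub fn check(plan: &ToyPlan, level: InvariantLevel) -> Result<(), String> {
    if plan.schema.is_empty() {
        return Err("empty schema".into());
    }
    if level == InvariantLevel::Executable && !plan.resolved {
        return Err("plan not executable".into());
    }
    Ok(())
}

pub type Rule = fn(ToyPlan) -> ToyPlan;

pub fn plan_query(input: ToyPlan, analyzer: &[Rule], optimizer: &[Rule]) -> Result<ToyPlan, String> {
    // Always-level subset, once before the analyzer starts.
    check(&input, InvariantLevel::Always)?;
    let analyzed = analyzer.iter().fold(input, |p, rule| rule(p));
    // Full check once after the analyzer is done.
    check(&analyzed, InvariantLevel::Executable)?;

    // Full check once before the optimizer passes start (adjacent to the
    // previous check only because this toy merges both stages into one function).
    check(&analyzed, InvariantLevel::Executable)?;
    let mut plan = analyzed;
    for rule in optimizer {
        let before = plan.schema.clone();
        plan = rule(plan);
        // Per pass: only verify that the schema did not change.
        if plan.schema != before {
            return Err("optimizer rule changed the schema".into());
        }
    }
    // Full check once after all optimizer passes are done.
    check(&plan, InvariantLevel::Executable)?;
    Ok(plan)
}
```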

I propose that we add another full InvariantLevel::Executable check between each optimization pass, one that only runs in debug mode (since it will have a performance impact, as @findepi mentioned).

I can also make some of the checks listed above debug-only, if the current performance change is unacceptable.

wiedld · 2024-12-24T04:08:03Z

Fixed the performance regression. It wasn't where we thought it was.

The problem was a recursive check (down the LP) of check_fields within assert_unique_field_names(). I've removed the recursive nature of this check.
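
One plausible reading of why a recursive per-node check is costly, sketched with toy types: if the plan-wide traversal already visits every node, and each visit re-checks the whole subtree below it, deep nodes get checked many times. `Node`, `check_plan`, and `chain` are hypothetical and only count visits; this is not the code that was changed in the PR.

```rust
/// Toy plan node: its own field names plus child nodes.
pub struct Node {
    pub fields: Vec<String>,
    pub children: Vec<Node>,
}

/// Checks one node's own fields only; `visits` counts how many node checks run.
fn check_node(node: &Node, visits: &mut usize) -> Result<(), String> {
    *visits += 1;
    let mut seen = std::collections::HashSet::new();
    for f in &node.fields {
        if !seen.insert(f.as_str()) {
            return Err(format!("duplicate field name: {f}"));
        }
    }
    Ok(())
}

/// Checks a node and all of its descendants.
fn check_subtree(node: &Node, visits: &mut usize) -> Result<(), String> {
    check_node(node, visits)?;
    node.children.iter().try_for_each(|c| check_subtree(c, visits))
}

/// Plan-wide traversal. With `recursive_per_node` set, each visited node
/// re-checks its whole subtree, so deep nodes are checked repeatedly.
pub fn check_plan(root: &Node, recursive_per_node: bool, visits: &mut usize) -> Result<(), String> {
    if recursive_per_node {
        check_subtree(root, visits)?;
    } else {
        check_node(root, visits)?;
    }
    root.children
        .iter()
        .try_for_each(|c| check_plan(c, recursive_per_node, visits))
}

/// Builds a linear chain of `depth` nodes, e.g. a stack of projections.
pub fn chain(depth: usize) -> Node {
    let mut node = Node { fields: vec!["a".into()], children: vec![] };
    for _ in 1..depth {
        node = Node { fields: vec!["a".into()], children: vec![node] };
    }
    node
}
```

For a chain of n nodes the non-recursive variant performs n node checks, while the recursive variant performs n(n+1)/2, which would be consistent with deeper plans such as many self joins showing the largest slowdown.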

Final numbers, no regression

$ critcmp main invariants__no_recursive_namecheck__schema_check_per_optimizer_pass
group                                         invariants__no_recursive_namecheck__schema_check_per_optimizer_pass    main
-----                                         -------------------------------------------------------------------    ----
logical_aggregate_with_join                   1.00    987.0±5.66µs        ? ?/sec                                    1.02  1007.8±10.29µs        ? ?/sec
logical_select_all_from_1000                  1.04      5.0±0.04ms        ? ?/sec                                    1.00      4.8±0.04ms        ? ?/sec
logical_select_one_from_700                   1.00    739.4±5.38µs        ? ?/sec                                    1.01    743.5±5.70µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    703.2±5.00µs        ? ?/sec                                    1.02    718.2±4.64µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    685.1±5.28µs        ? ?/sec                                    1.02    699.6±3.40µs        ? ?/sec
physical_intersection                         1.00  1726.0±10.72µs        ? ?/sec                                    1.00  1726.6±15.43µs        ? ?/sec
physical_join_consider_sort                   1.00      2.4±0.02ms        ? ?/sec                                    1.02      2.5±0.01ms        ? ?/sec
physical_join_distinct                        1.00    677.2±5.94µs        ? ?/sec                                    1.02    689.2±7.02µs        ? ?/sec
physical_many_self_joins                      1.00     13.6±0.07ms        ? ?/sec                                    1.02     13.9±0.07ms        ? ?/sec
physical_plan_clickbench_all                  1.00    181.7±0.39ms        ? ?/sec                                    1.00    181.5±0.36ms        ? ?/sec
physical_plan_clickbench_q1                   1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q10                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_clickbench_q11                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_clickbench_q12                  1.00      3.6±0.01ms        ? ?/sec                                    1.01      3.7±0.22ms        ? ?/sec
physical_plan_clickbench_q13                  1.00      3.3±0.01ms        ? ?/sec                                    1.00      3.3±0.01ms        ? ?/sec
physical_plan_clickbench_q14                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_clickbench_q15                  1.00      3.4±0.01ms        ? ?/sec                                    1.00      3.4±0.01ms        ? ?/sec
physical_plan_clickbench_q16                  1.00      3.0±0.01ms        ? ?/sec                                    1.00      2.9±0.01ms        ? ?/sec
physical_plan_clickbench_q17                  1.00      3.0±0.01ms        ? ?/sec                                    1.00      3.0±0.01ms        ? ?/sec
physical_plan_clickbench_q18                  1.01      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q19                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.4±0.01ms        ? ?/sec
physical_plan_clickbench_q2                   1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q20                  1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q21                  1.01      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q22                  1.00      3.6±0.01ms        ? ?/sec                                    1.01      3.6±0.01ms        ? ?/sec
physical_plan_clickbench_q23                  1.00      4.0±0.01ms        ? ?/sec                                    1.00      4.0±0.02ms        ? ?/sec
physical_plan_clickbench_q24                  1.01      4.8±0.02ms        ? ?/sec                                    1.00      4.7±0.01ms        ? ?/sec
physical_plan_clickbench_q25                  1.01      3.2±0.02ms        ? ?/sec                                    1.00      3.1±0.01ms        ? ?/sec
physical_plan_clickbench_q26                  1.01      2.9±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q27                  1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_clickbench_q28                  1.00      3.7±0.01ms        ? ?/sec                                    1.00      3.7±0.02ms        ? ?/sec
physical_plan_clickbench_q29                  1.00      4.7±0.02ms        ? ?/sec                                    1.00      4.7±0.01ms        ? ?/sec
physical_plan_clickbench_q3                   1.01      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q30                  1.01     15.5±0.41ms        ? ?/sec                                    1.00     15.3±0.04ms        ? ?/sec
physical_plan_clickbench_q31                  1.00      3.8±0.02ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q32                  1.00      3.8±0.01ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q33                  1.00      3.4±0.01ms        ? ?/sec                                    1.00      3.4±0.01ms        ? ?/sec
physical_plan_clickbench_q34                  1.00      3.1±0.01ms        ? ?/sec                                    1.00      3.1±0.01ms        ? ?/sec
physical_plan_clickbench_q35                  1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_clickbench_q36                  1.00      4.2±0.01ms        ? ?/sec                                    1.00      4.2±0.02ms        ? ?/sec
physical_plan_clickbench_q37                  1.00      4.3±0.01ms        ? ?/sec                                    1.00      4.3±0.02ms        ? ?/sec
physical_plan_clickbench_q38                  1.00      4.3±0.02ms        ? ?/sec                                    1.00      4.3±0.01ms        ? ?/sec
physical_plan_clickbench_q39                  1.00      3.8±0.01ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q4                   1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q40                  1.00      4.3±0.01ms        ? ?/sec                                    1.00      4.3±0.02ms        ? ?/sec
physical_plan_clickbench_q41                  1.00      4.1±0.01ms        ? ?/sec                                    1.00      4.1±0.02ms        ? ?/sec
physical_plan_clickbench_q42                  1.00      4.0±0.02ms        ? ?/sec                                    1.00      4.0±0.01ms        ? ?/sec
physical_plan_clickbench_q43                  1.00      4.0±0.01ms        ? ?/sec                                    1.00      4.0±0.01ms        ? ?/sec
physical_plan_clickbench_q44                  1.00      2.7±0.01ms        ? ?/sec                                    1.00      2.7±0.01ms        ? ?/sec
physical_plan_clickbench_q45                  1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.7±0.01ms        ? ?/sec
physical_plan_clickbench_q46                  1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.3±0.01ms        ? ?/sec
physical_plan_clickbench_q47                  1.00      3.8±0.01ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q48                  1.00      4.3±0.01ms        ? ?/sec                                    1.00      4.3±0.01ms        ? ?/sec
physical_plan_clickbench_q49                  1.00      4.5±0.02ms        ? ?/sec                                    1.00      4.5±0.01ms        ? ?/sec
physical_plan_clickbench_q5                   1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.7±0.01ms        ? ?/sec
physical_plan_clickbench_q6                   1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q7                   1.01      3.3±0.01ms        ? ?/sec                                    1.00      3.3±0.01ms        ? ?/sec
physical_plan_clickbench_q8                   1.00      3.0±0.01ms        ? ?/sec                                    1.00      3.0±0.01ms        ? ?/sec
physical_plan_clickbench_q9                   1.00      3.2±0.02ms        ? ?/sec                                    1.00      3.3±0.02ms        ? ?/sec
physical_plan_tpcds_all                       1.00   1160.9±2.22ms        ? ?/sec                                    1.00   1161.8±5.74ms        ? ?/sec
physical_plan_tpch_all                        1.00     72.5±0.23ms        ? ?/sec                                    1.00     72.7±0.20ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_tpch_q10                        1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.1±0.01ms        ? ?/sec                                    1.01      3.2±0.17ms        ? ?/sec
physical_plan_tpch_q12                        1.00      2.5±0.01ms        ? ?/sec                                    1.00      2.5±0.01ms        ? ?/sec
physical_plan_tpch_q13                        1.00   1832.8±8.27µs        ? ?/sec                                    1.00   1832.2±7.35µs        ? ?/sec
physical_plan_tpch_q14                        1.00      2.2±0.01ms        ? ?/sec                                    1.00      2.2±0.01ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q17                        1.00      2.9±0.01ms        ? ?/sec                                    1.00      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q19                        1.00      5.3±0.01ms        ? ?/sec                                    1.00      5.3±0.02ms        ? ?/sec
physical_plan_tpch_q2                         1.01      6.4±0.03ms        ? ?/sec                                    1.00      6.3±0.02ms        ? ?/sec
physical_plan_tpch_q20                        1.00      3.9±0.01ms        ? ?/sec                                    1.00      3.8±0.02ms        ? ?/sec
physical_plan_tpch_q21                        1.00      5.1±0.01ms        ? ?/sec                                    1.00      5.1±0.02ms        ? ?/sec
physical_plan_tpch_q22                        1.00      2.9±0.01ms        ? ?/sec                                    1.00      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q3                         1.00      2.5±0.01ms        ? ?/sec                                    1.00      2.5±0.01ms        ? ?/sec
physical_plan_tpch_q4                         1.00   1994.6±9.01µs        ? ?/sec                                    1.00   1993.7±9.04µs        ? ?/sec
physical_plan_tpch_q5                         1.01      3.6±0.02ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_tpch_q6                         1.00   1350.1±7.99µs        ? ?/sec                                    1.00   1349.8±7.07µs        ? ?/sec
physical_plan_tpch_q7                         1.01      4.7±0.02ms        ? ?/sec                                    1.00      4.7±0.01ms        ? ?/sec
physical_plan_tpch_q8                         1.01      5.6±0.03ms        ? ?/sec                                    1.00      5.6±0.02ms        ? ?/sec
physical_plan_tpch_q9                         1.00      4.3±0.02ms        ? ?/sec                                    1.00      4.3±0.01ms        ? ?/sec
physical_select_aggregates_from_200           1.00     24.3±0.09ms        ? ?/sec                                    1.00     24.3±0.05ms        ? ?/sec
physical_select_all_from_1000                 1.01     43.2±0.35ms        ? ?/sec                                    1.00     42.6±0.41ms        ? ?/sec
physical_select_one_from_700                  1.00      2.5±0.01ms        ? ?/sec                                    1.02      2.5±0.01ms        ? ?/sec
physical_theta_join_consider_sort             1.00      2.7±0.02ms        ? ?/sec                                    1.02      2.8±0.01ms        ? ?/sec
physical_unnest_to_join                       1.00      2.4±0.01ms        ? ?/sec                                    1.02      2.5±0.01ms        ? ?/sec
with_param_values_many_columns                1.00    183.9±0.62µs        ? ?/sec                                    1.00    183.2±0.56µs        ? ?/sec

Release and debug builds differ in only one check: debug builds run a full InvariantLevel::Executable check after each optimizer pass, whereas release builds run it once, after all passes.
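A minimal, std-only sketch of the gating described above. This is not the DataFusion implementation: `Plan`, `Rule`, the toy invariants, and the `InvariantLevel::Always` variant are assumptions made for illustration. Only the `InvariantLevel::Executable` and `check_invariants` names come from this PR.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a logical plan: only its output column names.
#[derive(Debug, Clone, PartialEq)]
pub struct Plan {
    pub columns: Vec<String>,
}

/// How thorough a check should be (`Always` is an assumed name for the cheap level).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InvariantLevel {
    Always,
    Executable,
}

/// Toy invariants: column names are always unique; an executable plan has output.
pub fn check_invariants(plan: &Plan, level: InvariantLevel) -> Result<(), String> {
    let mut seen = HashSet::new();
    for column in &plan.columns {
        if !seen.insert(column.as_str()) {
            return Err(format!("duplicate column: {column}"));
        }
    }
    if level == InvariantLevel::Executable && plan.columns.is_empty() {
        return Err("plan produces no columns".to_string());
    }
    Ok(())
}

/// Hypothetical optimizer rule: a plain function from plan to plan.
pub type Rule = fn(Plan) -> Plan;

pub fn optimize(mut plan: Plan, rules: &[Rule]) -> Result<Plan, String> {
    for rule in rules {
        plan = rule(plan);
        // Cheap check, run in every build.
        check_invariants(&plan, InvariantLevel::Always)?;
        // Full check after each step, debug builds only.
        if cfg!(debug_assertions) {
            check_invariants(&plan, InvariantLevel::Executable)?;
        }
    }
    // Full check once at the end, in every build.
    check_invariants(&plan, InvariantLevel::Executable)?;
    Ok(plan)
}
```

With this shape, a release build pays for the expensive check once per `optimize` call, while a debug build reports the first rule that breaks the plan.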

…cks without impact: * assert_valid_optimization can run each optimizer pass * remove the recursive check_fields, which caused the performance regression * the full LP Invariants::Executable can only run in debug

alamb

Thank you @wiedld -- this looks like a great improvement

I left some comments, but I don't think they are required

In my opinion this is needed before merging:

I will wait 24 hours to allow time for anyone else to comment
I am also double-checking the benchmarks

If you have time I do think some of my comments would improve the usability of this PR as well (e.g. the message changes)

Also, shall we file some follow on tickets for follow on work?

datafusion/expr/src/logical_plan/invariants.rs

datafusion/expr/src/utils.rs

datafusion/optimizer/src/analyzer/mod.rs

datafusion/optimizer/src/optimizer.rs

alamb · 2024-12-24T12:53:13Z

datafusion/optimizer/src/optimizer.rs

+                        })?;
+
+                    // run LP invariant checks only in debug
+                    #[cfg(debug_assertions)]


I also double checked this is the right name: https://doc.rust-lang.org/reference/conditional-compilation.html#debug_assertions
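For readers unfamiliar with the attribute, here is a small sketch of how `#[cfg(debug_assertions)]` removes a statement from release builds at compile time. The function name and the check labels are hypothetical and are not taken from the optimizer code.

```rust
/// Returns the names of the checks that would run after one rule.
/// The labels are illustrative only.
#[allow(unused_mut)] // `ran` is never mutated when the block below is compiled out
pub fn checks_run_for_rule() -> Vec<&'static str> {
    let mut ran = vec!["optimizer-specific invariants"];

    // This block exists only when debug assertions are enabled
    // (the default for `cargo build`, off for `cargo build --release`).
    #[cfg(debug_assertions)]
    {
        ran.push("full executable invariants");
    }

    ran
}
```

Unlike the `cfg!(debug_assertions)` macro, which evaluates to a boolean and leaves both branches type-checked, the attribute form drops the guarded code entirely before compilation.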

datafusion/optimizer/src/optimizer.rs

alamb · 2024-12-24T12:56:24Z

datafusion/optimizer/src/optimizer.rs

@@ -529,7 +567,9 @@ mod tests {
        let err = opt.optimize(plan, &config, &observe).unwrap_err();
        assert_eq!(
            "Optimizer rule 'get table_scan rule' failed\n\
-            caused by\nget table_scan rule\ncaused by\n\
+            caused by\n\
+            check_optimizer_specific_invariants after optimizer pass: get table_scan rule\n\


that is nicer
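The layered message in the test above can be reproduced with a small context-wrapping error type. The sketch below is a toy, assuming a `Context(description, source)` variant that prints as `description`, then `caused by`, then the source. `PlanError`, `wrap_rule_failure`, and the innermost "schema changed" message are hypothetical.

```rust
use std::fmt;

/// Toy error type with a context chain.
#[derive(Debug)]
pub enum PlanError {
    Message(String),
    Context(String, Box<PlanError>),
}

impl PlanError {
    /// Wrap this error with a description of what was being attempted.
    pub fn context(self, description: impl Into<String>) -> Self {
        PlanError::Context(description.into(), Box::new(self))
    }
}

impl fmt::Display for PlanError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PlanError::Message(msg) => write!(f, "{msg}"),
            PlanError::Context(desc, source) => write!(f, "{desc}\ncaused by\n{source}"),
        }
    }
}

/// Add the two outer layers shown in the test: which check ran, and which rule failed.
pub fn wrap_rule_failure(rule_name: &str, cause: PlanError) -> PlanError {
    cause
        .context(format!(
            "check_optimizer_specific_invariants after optimizer pass: {rule_name}"
        ))
        .context(format!("Optimizer rule '{rule_name}' failed"))
}
```

Naming the check in its own layer tells the reader whether the rule itself returned an error or whether its output failed validation.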

datafusion/expr/src/logical_plan/invariants.rs

berkaysynnada · 2024-12-24T13:33:32Z

datafusion/optimizer/src/optimizer.rs

                .and_then(|tnr| {
-                    assert_schema_is_the_same(rule.name(), &starting_schema, &tnr.data)?;
+                    // run checks optimizer invariant checks, per pass


what did you mean exactly here?

The correct statement is: // run checks optimizer invariant checks, per optimizer rule applied. Multiple rules are then applied per pass. The statement is fixed. Thanks!
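To make the rule-versus-pass distinction concrete, here is a hedged sketch in which each pass applies every rule and the schema is compared after each individual rule. All types and names (`Field`, `Plan`, `Rule`, `assert_schema_unchanged`) are simplified stand-ins, not the DataFusion ones.

```rust
/// Simplified schema field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

/// Simplified plan: a schema plus a free-form description that rules may rewrite.
#[derive(Debug, Clone, PartialEq)]
pub struct Plan {
    pub schema: Vec<Field>,
    pub description: String,
}

/// Hypothetical rule: a name plus a plan-to-plan function.
pub struct Rule {
    pub name: &'static str,
    pub apply: fn(Plan) -> Plan,
}

/// Fails if a rule changed the plan's output schema.
pub fn assert_schema_unchanged(
    rule_name: &str,
    before: &[Field],
    after: &Plan,
) -> Result<(), String> {
    if before == after.schema.as_slice() {
        Ok(())
    } else {
        Err(format!("rule '{rule_name}' changed the schema"))
    }
}

pub fn optimize(mut plan: Plan, rules: &[Rule], max_passes: usize) -> Result<Plan, String> {
    for _pass in 0..max_passes {
        // One pass applies every rule; the check runs after each rule, not once per pass.
        for rule in rules {
            let starting_schema = plan.schema.clone();
            plan = (rule.apply)(plan);
            assert_schema_unchanged(rule.name, &starting_schema, &plan)?;
        }
    }
    Ok(plan)
}
```

Checking after each rule means the error names the exact rule that broke the contract rather than only the pass in which it happened.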

alamb · 2024-12-24T13:58:56Z

My performance benchmarks also show no difference ✅

++ critcmp main 13525_invariant-checking-for-implicit-LP-changes
group                                         13525_invariant-checking-for-implicit-LP-changes    main
-----                                         ------------------------------------------------    ----
logical_aggregate_with_join                   1.00  1498.7±38.07µs        ? ?/sec                 1.00  1494.0±18.89µs        ? ?/sec
logical_select_all_from_1000                  1.00      5.3±0.05ms        ? ?/sec                 1.00      5.3±0.05ms        ? ?/sec
logical_select_one_from_700                   1.00  1169.9±26.47µs        ? ?/sec                 1.01  1179.5±14.41µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00  1154.9±13.48µs        ? ?/sec                 1.00  1158.7±16.27µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00  1140.9±18.36µs        ? ?/sec                 1.00  1140.3±14.97µs        ? ?/sec
physical_intersection                         1.00      2.5±0.02ms        ? ?/sec                 1.00      2.5±0.02ms        ? ?/sec
physical_join_consider_sort                   1.00      3.3±0.02ms        ? ?/sec                 1.00      3.3±0.04ms        ? ?/sec
physical_join_distinct                        1.00  1130.5±14.83µs        ? ?/sec                 1.00  1133.2±15.97µs        ? ?/sec
physical_many_self_joins                      1.01     17.5±0.10ms        ? ?/sec                 1.00     17.4±0.08ms        ? ?/sec
physical_plan_clickbench_all                  1.00    227.8±1.70ms        ? ?/sec                 1.01    230.0±3.38ms        ? ?/sec

jonahgao

LGTM👍, thanks @wiedld

datafusion/expr/src/logical_plan/invariants.rs

datafusion/optimizer/src/optimizer.rs

wiedld · 2024-12-24T21:25:14Z

Thank you for the reviews.
I made the requested changes to the terminology and error messages.


Also, shall we file some follow on tickets for follow on work?

I have a task list here for follow-up items. I was planning to dig more into the code to assess viability (and I can file tickets then too) as I start implementing.

datafusion/sqllogictest/test_files/subquery.slt

…nges

alamb · 2024-12-25T12:10:20Z

Looks like CI is failing on some tests

alamb

Thank you @wiedld -- I think this is a nice step forward in helping avoid bugs in DataFusion ❤️

Thanks @jonahgao @findepi @Sl1mb0 and @berkaysynnada

I'll merge this PR now and we can address any other suggestions as a follow on PR

Improve`Signature` and `comparison_coercion` documentation (#13840) * Improve Signature documentation more * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> --------- Co-authored-by: Piotr Findeisen <[email protected]> * feat: support normalized expr in CSE (#13315) * feat: support normalized expr in CSE * feat: support normalize_eq in cse optimization * feat: support cumulative binary expr result in normalize_eq --------- Co-authored-by: Andrew Lamb <[email protected]> * Upgrade to sqlparser `0.53.0` (#13767) * chore: Udpate to sqlparser 0.53.0 * Update for new sqlparser API * more api updates * Avoid serializing query to SQL string unless it is necessary * Box wildcard options * chore: update datafusion-cli Cargo.lock * Minor: Use `resize` instead of `extend` for adding static values in SortMergeJoin logic (#13861) Thanks @Dandandan * feat(function): add `least` function (#13786) * start adding least fn * feat(function): add least function * update function name * fix scalar smaller function * add tests * run Clippy and Fmt * Generated docs using `./dev/update_function_docs.sh` * add comment why `descending: false` * update comment * Update least.rs Co-authored-by: Bruce Ritchie <[email protected]> * Update scalar_functions.md * run ./dev/update_function_docs.sh to update docs * merge greatest and least implementation to one * add header --------- Co-authored-by: Bruce Ritchie <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Improve SortPreservingMerge::enable_round_robin_repartition docs (#13826) * Clarify SortPreservingMerge::enable_round_robin_repartition docs * tweaks * Improve comments more * clippy * fix doc link * Minor: Unify `downcast_arg` method (#13865) * Implement `SHOW FUNCTIONS` (#13799) * introduce rid for different signature * implement show functions syntax * add syntax example * avoid duplicate join * fix clippy * show function_type instead of routine_type * add some doc and comments * 
Update bzip2 requirement from 0.4.3 to 0.5.0 (#13740) * Update bzip2 requirement from 0.4.3 to 0.5.0 Updates the requirements on [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) to permit the latest version. - [Release notes](https://github.com/trifectatechfoundation/bzip2-rs/releases) - [Commits](https://github.com/trifectatechfoundation/bzip2-rs/compare/0.4.4...v0.5.0) --- updated-dependencies: - dependency-name: bzip2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix test * Fix CLI cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Fix build (#13869) * feat(substrait): modular substrait consumer (#13803) * feat(substrait): modular substrait consumer * feat(substrait): include Extension Rel handlers in default consumer Include SerializerRegistry based handlers for Extension Relations in the DefaultSubstraitConsumer * refactor(substrait) _selection -> _field_reference * refactor(substrait): remove SubstraitPlannerState usage from consumer * refactor: get_state() -> get_function_registry() * docs: elide imports from example * test: simplify test * refactor: remove Arc from DefaultSubstraitConsumer * doc: add ticket for API improvements * doc: link DefaultSubstraitConsumer to from_subtrait_plan * refactor: remove redundant Extensions parsing * Minor: fix: Include FetchRel when producing LogicalPlan from Sort (#13862) * include FetchRel when producing LogicalPlan from Sort * add suggested test * address review feedback * Minor: improve error message when ARRAY literals can not be planned (#13859) * Minor: improve error message when ARRAY literals can not be planned * fmt * Update datafusion/sql/src/expr/value.rs Co-authored-by: Oleks V <[email protected]> --------- Co-authored-by: Oleks V <[email protected]> * Add documentation for `SHOW FUNCTIONS` (#13868) * 
Support unicode character for `initcap` function (#13752) * Support unicode character for 'initcap' function Signed-off-by: Tai Le Manh <[email protected]> * Update unit tests * Fix clippy warning * Update sqllogictests - initcap * Update scalar_functions.md docs * Add suggestions change Signed-off-by: Tai Le Manh <[email protected]> --------- Signed-off-by: Tai Le Manh <[email protected]> * [minor] make recursive package dependency optional (#13778) * make recursive optional * add to default for common package * cargo update * added to readme * make test conditional * reviews * cargo update --------- Co-authored-by: Andrew Lamb <[email protected]> * Minor: remove unused async-compression `futures-io` feature (#13875) * Minor: remove unused async-compression feature * Fix cli cargo lock * Consolidate Example: dataframe_output.rs into dataframe.rs (#13877) * Restore `DocBuilder::new()` to avoid breaking API change (#13870) * Fix build * Restore DocBuilder::new(), deprecate * cmt * clippy * Improve error messages for incorrect zero argument signatures (#13881) * Improve error messages for incorrect zero argument signatures * fix errors * fix fmt * Consolidate Example: simplify_udwf_expression.rs into advanced_udwf.rs (#13883) * minor: fix typos in comments / structure names (#13879) * minor: fix typo error in datafusion * fix: fix rebase error * fix: format HashJoinExec doc * doc: recover thiserror/preemptively * fix: other typo error fixed * fix: directories to dir_entries in catalog example * Support 1 or 3 arg in generate_series() UDTF (#13856) * Support 1 or 3 args in generate_series() UDTF * address comment * Support (order by / sort) for DataFrameWriteOptions (#13874) * Support (order by / sort) for DataFrameWriteOptions * Fix fmt * Fix import * Add insert into example * Update sort_merge_join.rs (#13894) * Update join_selection.rs (#13893) * Fix `recursive-protection` feature flag (#13887) * Fix recursive-protection feature flag * rename feature flag to be 
consistent * Make default * taplo format * Fix visibility of swap_hash_join (#13899) * Minor: Avoid emitting empty batches in partial sort (#13895) * Update partial_sort.rs * Update partial_sort.rs * Update partial_sort.rs * Prepare for 44.0.0 release: version and changelog (#13882) * Prepare for 44.0.0 release: version and changelog * update changelog * update configs * update before release * Support unparsing implicit lateral `UNNEST` plan to SQL text (#13824) * support unparsing the implicit lateral unnest plan * cargo clippy and fmt * refactor for `check_unnest_placeholder_with_outer_ref` * add const for the prefix string of unnest and outer reference column * fix case_column_or_null with nullable when conditions (#13886) * fix case_column_or_null with nullable when conditions * improve sqllogictests for case_column_or_null --------- Co-authored-by: zhangli20 <[email protected]> * Fixed Issue #13896 (#13903) The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used. 
* Introduce `UserDefinedLogicalNodeUnparser` for User-defined Logical Plan unparsing (#13880) * make ast builder public * introduce udlp unparser * add documents * add examples * add negative tests and fmt * fix the doc * rename udlp to extension * apply the first unparsing result only * improve the doc * seperate the enum for the unparsing result * fix the doc --------- Co-authored-by: Andrew Lamb <[email protected]> * Preserve constant values across union operations (#13805) * Add value tracking to ConstExpr for improved union optimization * Update PartialEq impl * Minor change * Add docstring for ConstExpr value * Improve constant propagation across union partitions * Add assertion for across_partitions * fix fmt * Update properties.rs * Remove redundant constant removal loop * Remove unnecessary mut * Set across_partitions=true when both sides are constant * Extract and use constant values in filter expressions * Add initial SLT for constant value tracking across UNION ALL * Assign values to ConstExpr where possible * Revert "Set across_partitions=true when both sides are constant" This reverts commit 3051cd470b0ad4a70cd8bd3518813f5ce0b3a449. * Temporarily take value from literal * Lint fixes * Cargo fmt * Add get_expr_constant_value * Make `with_value()` accept optional value * Add todo * Move test to union.slt * Fix changed slt after merge * Simplify constexpr * Update properties.rs --------- Co-authored-by: berkaysynnada <[email protected]> * chore(deps): update sqllogictest requirement from 0.23.0 to 0.24.0 (#13902) * fix RecordBatch size in topK (#13906) * ci improvements, update protoc (#13876) * Fix md5 return_type to only return Utf8 as per current code impl. * ci improvements * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Revert nextest change until action is approved. 
* Exclude requires workspace * Fixing minor typo to verify ci caching of builds is working as expected. * Updates from PR review. * Adding issue link for disabling intel mac build * improve performance of running examples * remove cargo check * Introduce LogicalPlan invariants, begin automatically checking them (#13651) * minor(13525): perform LP validation before and after each possible mutation * minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass * minor(13525): validate union after each optimizer passes * refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass * chore: add link to invariant docs * fix: add new invariants module * refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants * test: update test for slight error message change * fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause * refactor: move collect_subquery_cols() to common utils crate * refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass. 
* refactor: based upon performance tests, run the maximum number of checks without impact: * assert_valid_optimization can run each optimizer pass * remove the recursive check_fields, which caused the performance regression * the full LP Invariants::Executable can only run in debug * chore: update error naming and terminology used in code comments * refactor: use proper error methods * chore: more cleanup of error messages * chore: handle option trailer to error message * test: update sqllogictests tests to not use multiline * Correct return type for initcap scalar function with utf8view (#13909) * Set utf8view as return type when input type is the same * Verify that the returned type from call to scalar function matches the return type specified in the return_type function * Match return type to utf8view * Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs (#13905) * Implement maintains_input_order for AggregateExec (#13897) * Implement maintains_input_order for AggregateExec * Update mod.rs * Improve comments --------- Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: mertak-synnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]> * Move join type input swapping to pub methods on Joins (#13910) * doc-gen: migrate scalar functions (string) documentation 3/4 (#13926) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917) * Update sqllogictest requirement from 0.24.0 to 0.25.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.25.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> * Remove labels --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Consolidate Examples: memtable.rs and parquet_multiple_files.rs (#13913) * doc-gen: migrate scalar functions (crypto) documentation (#13918) * doc-gen: migrate scalar functions (crypto) documentation * doc-gen: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (datetime) documentation 1/2 (#13920) * doc-gen: migrate scalar functions (datetime) documentation 1/2 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * fix RecordBatch size in hash join (#13916) * doc-gen: migrate scalar functions (array) documentation 1/3 (#13928) * doc-gen: migrate scalar functions (array) documentation 1/3 * fix: remove unsed import, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 1/2 (#13922) * doc-gen: migrate scalar functions (math) documentation 1/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 2/2 (#13923) * doc-gen: migrate scalar functions (math) documentation 2/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 3/3 (#13930) * doc-gen: migrate scalar functions (array) documentation 3/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 2/3 (#13929) * doc-gen: migrate scalar functions (array) documentation 2/3 * fix: import doc and macro, fix typo and update function docs --------- 
Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (string) documentation 4/4 (#13927) * doc-gen: migrate scalar functions (string) documentation 4/4 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Support explain query when running dfbench with clickbench (#13942) * Support explain query when running dfbench * Address comments * Consolidate example to_date.rs into dateframe.rs (#13939) * Consolidate example to_date.rs into dateframe.rs * Assert results using assert_batches_eq * clippy * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" (#13945) * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" This reverts commit 0989649214a6fe69ffb33ed38c42a8d3df94d6bf. * add comment * Implement predicate pruning for `like` expressions (prefix matching) (#12978) * Implement predicate pruning for like expressions * add function docstring * re-order bounds calculations * fmt * add fuzz tests * fix clippy * Update datafusion/core/tests/fuzz_cases/pruning.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * doc-gen: migrate scalar functions (string) documentation 1/4 (#13924) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * consolidate dataframe_subquery.rs into dataframe.rs (#13950) * migrate btrim to user_doc macro (#13952) * doc-gen: migrate scalar functions (datetime) documentation 2/2 (#13921) * doc-gen: migrate scalar functions (datetime) documentation 2/2 * fix: fix typo and update function docs * doc: update function docs * doc-gen: remove slash --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests (#13936) * Fix md5 return_type to only return Utf8 as per current code impl. 
* Add support for sqlite test files to sqllogictest * Force version 0.24.0 of sqllogictest dependency until issue with labels is fixed. * Removed workaround for bug that was fixed. * Git submodule update ... err update, link to sqlite tests. * Git submodule update * Readd submodule --------- Co-authored-by: Andrew Lamb <[email protected]> * Supporting writing schema metadata when writing Parquet in parallel (#13866) * refactor: make ParquetSink tests a bit more readable * chore(11770): add new ParquetOptions.skip_arrow_metadata * test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement * refactor(11770): replace with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make the API apparant that you have to include the arrow schema or not * fix(11770): fix parallel ParquetSink to encode arrow schema into the file metadata, based on the ParquetOptions * refactor(11770): provide deprecation warning for TryFrom * test(11770): update tests with new default to include arrow schema * refactor: including partitioning of arrow schema inserted into kv_metdata * test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata * chore: avoid cloning in tests, and update code docs * refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions * refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration * chore: update configs.md * test: update tests to handle the (default) required arrow schema in the kv_metadata * chore: add reference to arrow-rs upstream PR * chore: Create devcontainer.json (#13520) * Create devcontainer.json * update devcontainer * remove useless features * Minor: consolidate ConfigExtension example into API docs (#13954) * Update examples 
README.md * Minor: consolidate ConfigExtension example into API docs * more docs * Remove update * clippy * Fix issue with ExtensionsOptions docs * Parallelize pruning utf8 fuzz test (#13947) * Add swap_inputs to SMJ (#13984) * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows (#13966) * added failing test * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows * Update datafusion/functions-nested/src/set_ops.rs Co-authored-by: Andrew Lamb <[email protected]> * Update set_ops.rs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update release instructions for 44.0.0 (#13959) * Update release instructions for 44.0.0 * update macros and order * add functions-table * Add datafusion python 43.1.0 blog post to doc. (#13974) * Include license and notice files in more crates (#13985) * Extract postgres container from sqllogictest, update datafusion-testing pin (#13971) * Add support for sqlite test files to sqllogictest * Removed workaround for bug that was fixed. * Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock. * Add missing license header. * Update rstest requirement from 0.23.0 to 0.24.0 (#13977) Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version. - [Release notes](https://github.com/la10736/rstest/releases) - [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md) - [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0) --- updated-dependencies: - dependency-name: rstest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Move hash collision test to run only when merging to main. 
(#13973) * Update itertools requirement from 0.13 to 0.14 (#13965) * Update itertools requirement from 0.13 to 0.14 Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix build * Simplify * Update CLI lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988) * Rename hash_collision.yml to extended.yml and add comments * Adjust schedule, add comments * Update job, rerun * doc-gen: migrate scalar functions (string) documentation 2/4 (#13925) * doc-gen: migrate scalar functions (string) documentation 2/4 * doc-gen: update function docs * doc: fix related udf order for upper function in documentation * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * doc-gen: update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Oleks V <[email protected]> * Update substrait requirement from 0.50 to 0.51 (#13978) Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. 
- [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update release README for datafusion-cli publishing (#13982) * Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980) - Updated LastValueAccumulator to include requirement satisfaction check before updating the last value. - Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios. * Improve deserialize_to_struct example (#13958) * Cleanup deserialize_to_struct example * prettier * Apply suggestions from code review Co-authored-by: Jonah Gao <[email protected]> --------- Co-authored-by: Jonah Gao <[email protected]> * Update docs (#14002) * Optimize CASE expression for "expr or expr" usage. (#13953) * Apply optimization for ExprOrExpr. * Implement optimization similar to existing code. * Add sqllogictest. 
* feat(substrait): introduce consume_rel and consume_expression (#13963) * feat(substrait): introduce consume_rel and consume_expression Route calls to from_substrait_rel and from_substrait_rex through the SubstraitConsumer in order to allow users to provide their own behaviour * feat(substrait): consume nulls of user-defined types * docs(substrait): consume_rel and consume_expression docstrings * Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981) * Consolidate csv_opener.rs and json_opener.rs into a single example (#13955) * Update datafusion-examples/examples/csv_json_opener.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion-examples/README.md Co-authored-by: Andrew Lamb <[email protected]> * Apply code formatting with cargo fmt --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * FIX : Incorrect NULL handling in BETWEEN expression (#14007) * submodule update * FIX : Incorrect NULL handling in BETWEEN expression * Revert "submodule update" This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3. 
* fix incorrect unit test * move sqllogictest to expr * feat(substrait): modular substrait producer (#13931) * feat(substrait): modular substrait producer * refactor(substrait): simplify col_ref_offset handling in producer * refactor(substrait): remove column offset tracking from producer * docs(substrait): document SubstraitProducer * refactor: minor cleanup * feature: remove unused SubstraitPlanningState BREAKING CHANGE: SubstraitPlanningState is no longer available * refactor: cargo fmt * refactor(substrait): consume_ -> handle_ * refactor(substrait): expand match blocks * refactor: DefaultSubstraitProducer only needs serializer_registry * refactor: remove unnecessary warning suppression * fix(substrait): route expr conversion through handle_expr * cargo fmt * fix: Avoid re-wrapping planning errors Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000) * fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err * test: add tests for error formatting during planning * feat: support `RightAnti` for `SortMergeJoin` (#13680) * feat: support `RightAnti` for `SortMergeJoin` * feat: preserve session id when using cxt.enable_url_table() (#14004) * Return error message during planning when inserting into a MemTable with zero partitions. (#14011) * Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012) * Refactor max_rows for join plan, made it easier to understand * Simplified max_rows for Union * Chore: update wasm-supported crates, add tests (#14005) * Chore: update wasm-supported crates * format * Use workspace rust-version for all workspace crates (#14009) * [Minor] refactor: make ArraySort public for broader access (#14006) * refactor: make ArraySort public for broader access Changes the visibility of the ArraySort struct from super to public. This allows broader access to the struct, enabling its use in other modules and promoting better code reuse. 
* clippy and docs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017) * Update sqllogictest requirement from =0.24.0 to =0.26.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * remove version pin and note --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Eduard Karacharov <[email protected]> * `url` dependency update (#14019) * `url` dependency update * `url` version update for datafusion-cli * Minor: Improve zero partition check when inserting into `MemTable` (#14024) * Improve zero partition check when inserting into `MemTable` * update err msg * refactor: make structs public and implement Default trait (#14030) * Minor: Remove redundant implementation of `StringArrayType` (#14023) * Minor: Remove redundant implementation of StringArrayType Signed-off-by: Tai Le Manh <[email protected]> * Deprecate rather than remove StringArrayType --------- Signed-off-by: Tai Le Manh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. 
(#14014) * Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995) * Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema * Implement aggregate functions with spill handling in tests * Add tests for aggregate functions with and without spill handling * Move test related imports into mod test * Rename spill pool test functions for clarity and consistency * Refactor aggregate function imports to use fully qualified paths * Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream * Update aggregate test to use AVG instead of MAX * assert spill count * Refactor partial aggregate schema creation to use create_schema function * Refactor partial aggregation schema creation and remove redundant function * Remove unused import of Schema from arrow::datatypes in row_hash.rs * move spill pool testing for aggregate functions to physical-plan/src/aggregates * Use Arc::clone for schema references in aggregate functions * Encapsulate fields of `EquivalenceProperties` (#14040) * Encapsulate fields of `EquivalenceGroup` (#14039) * Fix error on `array_distinct` when input is empty #13810 (#14034) * fix * add test * oops --------- Co-authored-by: Cyprien Huet <[email protected]> * Update petgraph requirement from 0.6.2 to 0.7.1 (#14045) * Update petgraph requirement from 0.6.2 to 0.7.1 Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version. - [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst) - [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.7.1) --- updated-dependencies: - dependency-name: petgraph dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> * Update datafusion-cli/Cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]> * Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037) * Complete encapsulating `OrderingEquivalenceClass` (make fields non pub) * fix doc * Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021) * chore: add test to verify that schema is inferred as expected * chore: add comment to method as suggested * chore: restructure to avoid need to clone * chore: fix flaw in rewrite * feat(optimizer): Enable filter pushdown on window functions (#14026) * feat(optimizer): Enable filter pushdown on window functions Ensures selections can be pushed past window functions similarly to what is already done with aggregations, when possible. * fix: Add missing dependency * minor(optimizer): Use 'datafusion-functions-window' as a dev dependency * docs(optimizer): Add example to filter pushdown on LogicalPlan::Window * Unparsing optimized (> 2 inputs) unions (#14031) * tests and optimizer in testing queries * unparse optimized unions * format Cargo.toml * format Cargo.toml * revert test * rewrite test to avoid cyclic dep * remove old test * cleanup * comments and error handling * handle union with lt 2 inputs * Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047) * Simplify error handling in case.rs (#13990) (#14033) * Simplify error handling in case.rs (#13990) * Fix issues causing GitHub checks to fail * Update datafusion/physical-expr/src/expressions/case.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800) 
* Add asynchronous catalog traits to help users that have asynchronous catalogs * Apply clippy suggestions * Address PR reviews * Remove allow_unused exceptions * Update remote catalog example to demonstrate new helper structs * Move schema_name / catalog_name parameters into resolve f…

* Handle alias when parsing sql(parse_sql_expr) (#12939) * fix: Fix parse_sql_expr not handling alias * cargo fmt * fix parse_sql_expr example(remove alias) * add testing * add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function * revert change on example `parse_sql_expr` * Improve documentation for TableProvider (#13724) * Reveal implementing type and return type in simple UDF implementations (#13730) Debug trait is useful for understanding what something is and how it's configured, especially if the implementation is behind dyn trait. * minor: Extract tests for `EXTRACT` AND `date_part` to their own file (#13731) * Support unparsing `UNNEST` plan to `UNNEST` table factor SQL (#13660) * add `unnest_as_table_factor` and `UnnestRelationBuilder` * unparse unnest as table factor * fix typo * add tests for the default configs * add a static const for unnest_placeholder * fix tests * fix tests * Update to apache-avro 0.17, fix compatibility changes schema handling (#13727) * Update apache-avro requirement from 0.16 to 0.17 --- updated-dependencies: - dependency-name: apache-avro dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> * Fix compatibility changes schema handling apache-avro 0.17 - Handle ArraySchema struct - Handle MapSchema struct - Map BigDecimal => LargeBinary - Map TimestampNanos => Timestamp(TimeUnit::Nanosecond, None) - Map LocalTimestampNanos => todo!() - Add Default to FixedSchema test * Update Cargo.lock file for apache-avro 0.17 --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marc Droogh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Minor: Add doc example to RecordBatchStreamAdapter (#13725) * Minor: Add doc example to RecordBatchStreamAdapter * Update datafusion/physical-plan/src/stream.rs Co-authored-by: Berkay Şahin <[email protected]> --------- Co-authored-by: Berkay Şahin <[email protected]> * Implement GroupsAccumulator for corr(x,y) aggregate function (#13581) * Implement GroupsAccumulator for corr(x,y) * feedbacks * fix CI MSRV * review * avoid collect in accumulation * add back cast * fix union serialisation order in proto (#13709) * fix union serialisation order in proto * clippy * address comments * Minor: make unsupported `nanosecond` part a real (not internal) error (#13733) * Minor: make unsupported `nanosecond` part a real (not internal) error * fmt * Improve wording to refer to date part * Add tests for date_part on columns + timestamps with / without timezones (#13732) * Add tests for date_part on columns + timestamps with / without timezones * Add tests from https://github.com/apache/datafusion/pull/13372 * remove trailing whitespace * Optimize performance of `initcap` function (~2x faster) (#13691) * Optimize performance of initcap (~2x faster) Signed-off-by: Tai Le Manh <[email protected]> * format --------- Signed-off-by: Tai Le Manh <[email protected]> * Minor: Add documentation explaining that initcap only works for ASCII (#13749) * Support sqllogictest --complete 
with postgres (#13746) Before the change, the request to use PostgreSQL was simply ignored when `--complete` flag was present. * doc-gen: migrate window functions documentation to attribute based (#13739) * doc-gen: migrate window functions documentation Signed-off-by: zjregee <[email protected]> * fix: update Cargo.lock --------- Signed-off-by: zjregee <[email protected]> * Minor: Remove memory reservation in `JoinLeftData` used in HashJoin (#13751) * Refactor JoinLeftData structure by removing unused memory reservation field in hash join implementation * Add Debug and Clone derives for HashJoinStreamState and ProcessProbeBatchState enums This commit enhances the HashJoinStreamState and ProcessProbeBatchState structures by implementing the Debug and Clone traits, allowing for easier debugging and cloning of these state representations in the hash join implementation. * Update to bigdecimal 0.4.7 (#13747) * Add big decimal formatting test cases with potential trailing zeros * Rename and simplify decimal rendering functions - add `decimal` to function name - drop `precision` parameter as it is not supposed to affect the result * Update to bigdecimal 0.4.7 Utilize new `to_plain_string` function * chore: clean up dependencies (#13728) * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Clean up dependencies * CI: Clean up dependencies * fix: Implicitly plan `UNNEST` as lateral (#13695) * plan implicit lateral if table factor is UNNEST * check for outer references in `create_relation_subquery` * add sqllogictest * fix lateral constant test to not expect a subquery node * replace sqllogictest in favor of logical plan test * update lateral join sqllogictests * add sqllogictests * fix logical plan test * Minor: improve the Deprecation / API health guidelines (#13701) * Minor: improve the Deprecation / API health policy * prettier * Update docs/source/library-user-guide/api-health.md Co-authored-by: Jonah 
Gao <[email protected]> * Add version guidance and make more copy/paste friendly * prettier * better * rename to guidelines --------- Co-authored-by: Jonah Gao <[email protected]> * fix: specify roottype in substrait fieldreference (#13647) * fix: specify roottype in fieldreference Signed-off-by: MBWhite <[email protected]> * Fix formatting Signed-off-by: MBWhite <[email protected]> * review suggestion Signed-off-by: MBWhite <[email protected]> --------- Signed-off-by: MBWhite <[email protected]> * Simplify type signatures using `TypeSignatureClass` for mixed type function signature (#13372) * add type sig class Signed-off-by: jayzhan211 <[email protected]> * timestamp Signed-off-by: jayzhan211 <[email protected]> * date part Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * taplo format Signed-off-by: jayzhan211 <[email protected]> * tpch test Signed-off-by: jayzhan211 <[email protected]> * msrc issue Signed-off-by: jayzhan211 <[email protected]> * msrc issue Signed-off-by: jayzhan211 <[email protected]> * explicit hash Signed-off-by: jayzhan211 <[email protected]> * Enhance type coercion and function signatures - Added logic to prevent unnecessary casting of string types in `native.rs`. - Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons. - Updated imports in `functions.rs` and `signature.rs` for better organization. - Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`. - Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`. These changes improve type handling and ensure more accurate function behavior in SQL expressions. * fix comment Signed-off-by: Jay Zhan <[email protected]> * fix signature Signed-off-by: Jay Zhan <[email protected]> * fix test Signed-off-by: Jay Zhan <[email protected]> * Enhance type coercion for timestamps to allow implicit casting from strings. 
Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds. * Refactor type coercion logic for timestamps to improve readability and maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps. * Fix SQL logic tests to correct query error handling for timestamp functions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file. * Enhance timestamp handling in TypeSignature to support timezone specification. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations. * Refactor date_part function: remove redundant imports and add missing not_impl_err import for better error handling --------- Signed-off-by: jayzhan211 <[email protected]> Signed-off-by: Jay Zhan <[email protected]> * Minor: Add some more blog posts to the readings page (#13761) * Minor: Add some more blog posts to the readings page * prettier * prettier * Update docs/source/user-guide/concepts-readings-events.md --------- Co-authored-by: Oleks V <[email protected]> * docs: update GroupsAccumulator instead of GroupAccumulator (#13787) Fixing `GroupsAccumulator` trait name in its docs * Improve Deprecation Guidelines more (#13776) * Improve deprecation guidelines more * prettier * fix: add `null_buffer` length check to `StringArrayBuilder`/`LargeStringArrayBuilder` (#13758) * fix: add `null_buffer` check for `LargeStringArray` Add a safety check to ensure that the alignment of buffers cannot be overflowed. This introduces a panic if they are not aligned through a runtime assertion. 
* fix: remove value_buffer assertion These buffers can be misaligned and it is not problematic, it is the `null_buffer` which we care about being of the same length. * feat: add `null_buffer` check to `StringArray` This is in a similar vein to `LargeStringArray`, as the code is the same, except for `i32`'s instead of `i64`. * feat: use `row_count` var to avoid drift * Revert the removal of reservation in HashJoin (#13792) * fix: restore memory reservation in JoinLeftData for accurate memory accounting in HashJoin This commit reintroduces the `_reservation` field in the `JoinLeftData` structure to ensure proper tracking of memory resources during join operations. The absence of this field could lead to inconsistent memory usage reporting and potential out-of-memory issues as upstream operators increase their memory consumption. * fmt Signed-off-by: Jay Zhan <[email protected]> --------- Signed-off-by: Jay Zhan <[email protected]> * added count aggregate slt (#13790) * Update documentation guidelines for contribution content (#13703) * Update documentation guidelines for contribution content * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> Co-authored-by: Oleks V <[email protected]> * clarify discussions and remove requirements note * prettier * Update docs/source/contributor-guide/index.md Co-authored-by: Piotr Findeisen <[email protected]> --------- Co-authored-by: Piotr Findeisen <[email protected]> Co-authored-by: Oleks V <[email protected]> * Add Round trip tests for Array <--> ScalarValue (#13777) * Add Round trip tests for Array <--> ScalarValue * String dictionary test * remove unnecessary value * Improve comments * fix: Limit together with pushdown_filters (#13788) * fix: Limit together with pushdown_filters * Fix format * Address new comments * Fix testing case to hit the problem * Minor: improve Analyzer docs (#13798) * Minor: cargo update in datafusion-cli (#13801) * Update datafusion-cli toml to pin home=0.5.9 * 
update Cargo.lock * Fix `ScalarValue::to_array_of_size` for DenseUnion (#13797) * fix: enable pruning by bloom filters for dictionary columns (#13768) * Handle empty rows for `array_distinct` (#13810) * handle empty array distinct * ignore * fix --------- Co-authored-by: Cyprien Huet <[email protected]> * Fix get_type for higher-order array functions (#13756) * Fix get_type for higher-order array functions * Fix recursive flatten The fix is covered by recursive flatten test case in array.slt * Restore "keep LargeList" in Array signature * clarify naming in the test * Chore: Do not return empty record batches from streams (#13794) * do not emit empty record batches in plans * change function signatures to Option<RecordBatch> if empty batches are possible * format code * shorten code * change list_unnest_at_level for returning Option value * add documentation take concat_batches into compute_aggregates function again * create unit test for row_hash.rs * add test for unnest * add test for unnest * add test for partial sort * add test for bounded window agg * add test for window agg * apply simplifications and fix typo * apply simplifications and fix typo * Handle possible overflows in StringArrayBuilder / LargeStringArrayBuilder (#13802) * test(13796): reproducer of overflow on capacity * fix(13796): handle overflows with proper max capacity number which is valid for MutableBuffer * refactor: use simple solution and provide panic * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema (#13750) * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema * clippy * fix csv and json tests * add testing for parquet * cleanup * fix parquet tests * document describe_partition, add back repartition options to one of the csv empty files tests * Support Null regex override in csv parser options. 
(#13228) Co-authored-by: Andrew Lamb <[email protected]> * Minor: Extend ScalarValue::new_zero() (#13828) * Update mod.rs * Update mod.rs * Update mod.rs * Update mod.rs * chore: temporarily disable windows flow (#13833) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 (#13806) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 * Fix test * Add test * Add test * Refine negative scales * Update comment * Refine bigint_to_i256 * UT for bigint_to_i256 * Add ut for parse_decimal * Replace `BooleanArray::extend` with `append_n` (#13832) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments (#13817) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> * improve docs --------- Co-authored-by: Piotr Findeisen <[email protected]> * [bugfix] ScalarFunctionExpr does not preserve the nullable flag on roundtrip (#13830) * [test] coalesce round trip schema mismatch * [proto] added the nullable flag in PhysicalScalarUdfNode * [bugfix] propagate the nullable flag for serialized scalar UDFS * Add example of interacting with a remote catalog (#13722) * Add example of interacting with a remote catalog * Update datafusion/core/src/execution/session_state.rs Co-authored-by: Berkay Şahin <[email protected]> * Apply suggestions from code review Co-authored-by: Jonah Gao <[email protected]> Co-authored-by: Weston Pace <[email protected]> * Use HashMap to hold tables --------- Co-authored-by: Berkay Şahin <[email protected]> Co-authored-by: Jonah Gao <[email protected]> Co-authored-by: Weston Pace <[email protected]> * Update substrait requirement from 0.49 to 0.50 (#13808) * Update substrait requirement from 0.49 to 0.50 Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. 
- [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.49.0...v0.50.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix compilation * Add expr test --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * typo: remove extraneous "`" in doc comment, fix header (#13848) * typo: extraneous "`" in doc comment * Update datafusion/execution/src/runtime_env.rs * Update datafusion/execution/src/runtime_env.rs --------- Co-authored-by: Oleks V <[email protected]> * typo: remove extra "`" interfering with doc formatting (#13847) * Support n-ary monotonic functions in ordering equivalence (#13841) * Support n-ary monotonic functions in `discover_new_orderings` * Add tests for n-ary monotonic functions in `discover_new_orderings` * Fix tests * Fix non-monotonic test case * Fix unintended simplification * Minor comment changes * Fix tests * Add `preserves_lex_ordering` field * Use `preserves_lex_ordering` on `discover_new_orderings()` * Add `output_ordering` and `output_preserves_lex_ordering` implementations for `ConcatFunc` * Update tests * Move logic to UDF * Cargo fmt * Refactor * Cargo fmt * Simply use false value on default implementation * Remove unnecessary import * Clippy fix * Update Cargo.lock * Move dep to dev-dependencies * Rename output_preserves_lex_ordering to preserves_lex_ordering * minor --------- Co-authored-by: berkaysynnada <[email protected]> * Replace `execution_mode` with `emission_type` and `boundedness` (#13823) * feat: update execution modes and add bitflags dependency - Introduced `Incremental` execution mode alongside existing modes in the DataFusion 
execution plan. - Updated various execution plans to utilize the new `Incremental` mode where applicable, enhancing streaming capabilities. - Added `bitflags` dependency to `Cargo.toml` for better management of execution modes. - Adjusted execution mode handling in multiple files to ensure compatibility with the new structure. * add exec API Signed-off-by: Jay Zhan <[email protected]> * replace done but has stackoverflow Signed-off-by: Jay Zhan <[email protected]> * exec API done Signed-off-by: Jay Zhan <[email protected]> * Refactor execution plan properties to remove execution mode - Removed the `ExecutionMode` parameter from `PlanProperties` across multiple physical plan implementations. - Updated related functions to utilize the new structure, ensuring compatibility with the changes. - Adjusted comments and cleaned up imports to reflect the removal of execution mode handling. This refactor simplifies the execution plan properties and enhances maintainability. * Refactor execution plan to remove `ExecutionMode` and introduce `EmissionType` - Removed the `ExecutionMode` parameter from `PlanProperties` and related implementations across multiple files. - Introduced `EmissionType` to better represent the output characteristics of execution plans. - Updated functions and tests to reflect the new structure, ensuring compatibility and enhancing maintainability. - Cleaned up imports and adjusted comments accordingly. This refactor simplifies the execution plan properties and improves the clarity of memory handling in execution plans. * fix test Signed-off-by: Jay Zhan <[email protected]> * Refactor join handling and emission type logic - Updated test cases in `sanity_checker.rs` to reflect changes in expected outcomes for bounded and unbounded joins, ensuring accurate test coverage. - Simplified the `is_pipeline_breaking` method in `execution_plan.rs` to clarify the conditions under which a plan is considered pipeline-breaking. 
- Enhanced the emission type determination logic in `execution_plan.rs` to prioritize `Final` over `Both` and `Incremental`, improving clarity in execution plan behavior. - Adjusted join type handling in `hash_join.rs` to classify `Right` joins as `Incremental`, allowing for immediate row emission. These changes improve the accuracy of tests and the clarity of execution plan properties. * Implement emission type for execution plans - Updated multiple execution plan implementations to replace `unimplemented!()` with `EmissionType::Incremental`, ensuring that the emission type is correctly defined for various plans. - This change enhances the clarity and functionality of the execution plans by explicitly specifying their emission behavior. These updates contribute to a more robust execution plan framework within the DataFusion project. * Enhance join type documentation and refine emission type logic - Updated the `JoinType` enum in `join_type.rs` to include detailed descriptions for each join type, improving clarity on their behavior and expected results. - Modified the emission type logic in `hash_join.rs` to ensure that `Right` and `RightAnti` joins are classified as `Incremental`, allowing for immediate row emission when applicable. These changes improve the documentation and functionality of join operations within the DataFusion project. * Refactor emission type logic in join and sort execution plans - Updated the emission type determination in `SortMergeJoinExec` and `SymmetricHashJoinExec` to utilize the `emission_type_from_children` function, enhancing the accuracy of emission behavior based on input characteristics. - Clarified comments in `sort.rs` regarding the conditions under which results are emitted, emphasizing the relationship between input sorting and emission type. - These changes improve the clarity and functionality of the execution plans within the DataFusion project, ensuring more robust handling of emission types. 
* Refactor emission type handling in execution plans - Updated the `emission_type_from_children` function to accept an iterator instead of a slice, enhancing flexibility in how child execution plans are passed. - Modified the `SymmetricHashJoinExec` implementation to utilize the new function signature, improving code clarity and maintainability. These changes streamline the emission type determination process within the DataFusion project, contributing to a more robust execution plan framework. * Enhance execution plan properties with boundedness and emission type - Introduced `boundedness` and `pipeline_behavior` methods to the `ExecutionPlanProperties` trait, improving the handling of execution plan characteristics. - Updated the `CsvExec`, `SortExec`, and related implementations to utilize the new methods for determining boundedness and emission behavior. - Refactored the `ensure_distribution` function to use the new boundedness logic, enhancing clarity in distribution decisions. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. * Refactor execution plans to enhance boundedness and emission type handling - Updated multiple execution plan implementations to incorporate `Boundedness` and `EmissionType`, improving the clarity and functionality of execution plans. - Replaced instances of `unimplemented!()` with appropriate emission types, ensuring that plans correctly define their output behavior. - Refactored the `PlanProperties` structure to utilize the new boundedness logic, enhancing decision-making in execution plans. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. * Refactor memory handling in execution plans - Updated the condition for checking memory requirements in execution plans from `has_finite_memory()` to `boundedness().requires_finite_memory()`, improving clarity in memory management. 
- This change enhances the robustness of execution plans within the DataFusion project by ensuring more accurate assessments of memory constraints. * Refactor boundedness checks in execution plans - Updated conditions for checking boundedness in various execution plans to use `is_unbounded()` instead of `requires_finite_memory()`, enhancing clarity in memory management. - Adjusted the `PlanProperties` structure to reflect these changes, ensuring more accurate assessments of memory constraints across the DataFusion project. - These modifications contribute to a more robust and maintainable execution plan framework, improving the handling of boundedness in execution strategies. * Remove TODO comment regarding unbounded execution plans in `UnboundedExec` implementation - Eliminated the outdated comment suggesting a switch to unbounded execution with finite memory, streamlining the code and improving clarity. - This change contributes to a cleaner and more maintainable codebase within the DataFusion project. * Refactor execution plan boundedness and emission type handling - Updated the `is_pipeline_breaking` method to use `requires_finite_memory()` for improved clarity in determining pipeline behavior. - Enhanced the `Boundedness` enum to include detailed documentation on memory requirements for unbounded streams. - Refactored `compute_properties` methods in `GlobalLimitExec` and `LocalLimitExec` to directly use the input's boundedness, simplifying the logic. - Adjusted emission type determination in `NestedLoopJoinExec` to utilize the `emission_type_from_children` function, ensuring accurate output behavior based on input characteristics. These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project, improving clarity and functionality in handling boundedness and emission types. 
* Refactor emission type and boundedness handling in execution plans - Removed the `OptionalEmissionType` struct from `plan_properties.rs`, simplifying the codebase. - Updated the `is_pipeline_breaking` function in `execution_plan.rs` for improved readability by formatting the condition across multiple lines. - Adjusted the `GlobalLimitExec` implementation in `limit.rs` to directly use the input's boundedness, enhancing clarity in memory management. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, improving the handling of emission types and boundedness. * Refactor GlobalLimitExec and LocalLimitExec to enhance boundedness handling - Updated the `compute_properties` methods in both `GlobalLimitExec` and `LocalLimitExec` to replace `EmissionType::Final` with `Boundedness::Bounded`, reflecting that limit operations always produce a finite number of rows. - Changed the input's boundedness reference to `pipeline_behavior()` for improved clarity in execution plan properties. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, enhancing the handling of boundedness in limit operations. * Review Part1 * Update sanity_checker.rs * addressing reviews * Review Part 1 * Update datafusion/physical-plan/src/execution_plan.rs * Update datafusion/physical-plan/src/execution_plan.rs * Shorten imports * Enhance documentation for JoinType and Boundedness enums - Improved descriptions for the Inner and Full join types in join_type.rs to clarify their behavior and examples. - Added explanations regarding the boundedness of output streams and memory requirements in execution_plan.rs, including specific examples for operators like Median and Min/Max. 
--------- Signed-off-by: Jay Zhan <[email protected]> Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]> * Preserve ordering equivalencies on `with_reorder` (#13770) * Preserve ordering equivalencies on `with_reorder` * Add assertions * Return early if filtered_exprs is empty * Add clarify comment * Refactor * Add comprehensive test case * Add comment for exprs_equal * Cargo fmt * Clippy fix * Update properties.rs * Update exprs_equal and add tests * Update properties.rs --------- Co-authored-by: berkaysynnada <[email protected]> * replace CASE expressions in predicate pruning with boolean algebra (#13795) * replace CASE expressions in predicate pruning with boolean algebra * fix merge * update tests * add some more tests * add some more tests * remove duplicate test case * Update datafusion/physical-optimizer/src/pruning.rs * swap NOT for != * replace comments, update docstrings * fix example * update tests * update tests * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * Update pruning.rs Co-authored-by: Chunchun Ye <[email protected]> * Update pruning.rs Co-authored-by: Chunchun Ye <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Chunchun Ye <[email protected]> * enable DF's nested_expressions feature by in datafusion-substrait tests to make them pass (#13857) fixes #13854 Co-authored-by: Arttu Voutilainen <[email protected]> * Add configurable normalization for configuration options and preserve case for S3 paths (#13576) * Do not normalize values * Fix tests & update docs * Prettier * Lowercase config params * Unify transform and parse * Fix tests * Rename `default_transform` and relax boundaries * Make `compression` case-insensitive * Comment to new line * Deprecate and ignore `enable_options_value_normalization` * Update datafusion/common/src/config.rs * fix typo --------- Co-authored-by: Oleks V <[email protected]> * 
Improve `Signature` and `comparison_coercion` documentation (#13840) * Improve Signature documentation more * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> --------- Co-authored-by: Piotr Findeisen <[email protected]> * feat: support normalized expr in CSE (#13315) * feat: support normalized expr in CSE * feat: support normalize_eq in cse optimization * feat: support cumulative binary expr result in normalize_eq --------- Co-authored-by: Andrew Lamb <[email protected]> * Upgrade to sqlparser `0.53.0` (#13767) * chore: Update to sqlparser 0.53.0 * Update for new sqlparser API * more api updates * Avoid serializing query to SQL string unless it is necessary * Box wildcard options * chore: update datafusion-cli Cargo.lock * Minor: Use `resize` instead of `extend` for adding static values in SortMergeJoin logic (#13861) Thanks @Dandandan * feat(function): add `least` function (#13786) * start adding least fn * feat(function): add least function * update function name * fix scalar smaller function * add tests * run Clippy and Fmt * Generated docs using `./dev/update_function_docs.sh` * add comment why `descending: false` * update comment * Update least.rs Co-authored-by: Bruce Ritchie <[email protected]> * Update scalar_functions.md * run ./dev/update_function_docs.sh to update docs * merge greatest and least implementation to one * add header --------- Co-authored-by: Bruce Ritchie <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Improve SortPreservingMerge::enable_round_robin_repartition docs (#13826) * Clarify SortPreservingMerge::enable_round_robin_repartition docs * tweaks * Improve comments more * clippy * fix doc link * Minor: Unify `downcast_arg` method (#13865) * Implement `SHOW FUNCTIONS` (#13799) * introduce rid for different signature * implement show functions syntax * add syntax example * avoid duplicate join * fix clippy * show function_type instead of routine_type * add some doc and comments * 
Update bzip2 requirement from 0.4.3 to 0.5.0 (#13740) * Update bzip2 requirement from 0.4.3 to 0.5.0 Updates the requirements on [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) to permit the latest version. - [Release notes](https://github.com/trifectatechfoundation/bzip2-rs/releases) - [Commits](https://github.com/trifectatechfoundation/bzip2-rs/compare/0.4.4...v0.5.0) --- updated-dependencies: - dependency-name: bzip2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix test * Fix CLI cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Fix build (#13869) * feat(substrait): modular substrait consumer (#13803) * feat(substrait): modular substrait consumer * feat(substrait): include Extension Rel handlers in default consumer Include SerializerRegistry based handlers for Extension Relations in the DefaultSubstraitConsumer * refactor(substrait) _selection -> _field_reference * refactor(substrait): remove SubstraitPlannerState usage from consumer * refactor: get_state() -> get_function_registry() * docs: elide imports from example * test: simplify test * refactor: remove Arc from DefaultSubstraitConsumer * doc: add ticket for API improvements * doc: link DefaultSubstraitConsumer to from_subtrait_plan * refactor: remove redundant Extensions parsing * Minor: fix: Include FetchRel when producing LogicalPlan from Sort (#13862) * include FetchRel when producing LogicalPlan from Sort * add suggested test * address review feedback * Minor: improve error message when ARRAY literals can not be planned (#13859) * Minor: improve error message when ARRAY literals can not be planned * fmt * Update datafusion/sql/src/expr/value.rs Co-authored-by: Oleks V <[email protected]> --------- Co-authored-by: Oleks V <[email protected]> * Add documentation for `SHOW FUNCTIONS` (#13868) * 
Support unicode character for `initcap` function (#13752) * Support unicode character for 'initcap' function Signed-off-by: Tai Le Manh <[email protected]> * Update unit tests * Fix clippy warning * Update sqllogictests - initcap * Update scalar_functions.md docs * Add suggestions change Signed-off-by: Tai Le Manh <[email protected]> --------- Signed-off-by: Tai Le Manh <[email protected]> * [minor] make recursive package dependency optional (#13778) * make recursive optional * add to default for common package * cargo update * added to readme * make test conditional * reviews * cargo update --------- Co-authored-by: Andrew Lamb <[email protected]> * Minor: remove unused async-compression `futures-io` feature (#13875) * Minor: remove unused async-compression feature * Fix cli cargo lock * Consolidate Example: dataframe_output.rs into dataframe.rs (#13877) * Restore `DocBuilder::new()` to avoid breaking API change (#13870) * Fix build * Restore DocBuilder::new(), deprecate * cmt * clippy * Improve error messages for incorrect zero argument signatures (#13881) * Improve error messages for incorrect zero argument signatures * fix errors * fix fmt * Consolidate Example: simplify_udwf_expression.rs into advanced_udwf.rs (#13883) * minor: fix typos in comments / structure names (#13879) * minor: fix typo error in datafusion * fix: fix rebase error * fix: format HashJoinExec doc * doc: recover thiserror/preemptively * fix: other typo error fixed * fix: directories to dir_entries in catalog example * Support 1 or 3 arg in generate_series() UDTF (#13856) * Support 1 or 3 args in generate_series() UDTF * address comment * Support (order by / sort) for DataFrameWriteOptions (#13874) * Support (order by / sort) for DataFrameWriteOptions * Fix fmt * Fix import * Add insert into example * Update sort_merge_join.rs (#13894) * Update join_selection.rs (#13893) * Fix `recursive-protection` feature flag (#13887) * Fix recursive-protection feature flag * rename feature flag to be 
consistent * Make default * taplo format * Fix visibility of swap_hash_join (#13899) * Minor: Avoid emitting empty batches in partial sort (#13895) * Update partial_sort.rs * Update partial_sort.rs * Update partial_sort.rs * Prepare for 44.0.0 release: version and changelog (#13882) * Prepare for 44.0.0 release: version and changelog * update changelog * update configs * update before release * Support unparsing implicit lateral `UNNEST` plan to SQL text (#13824) * support unparsing the implicit lateral unnest plan * cargo clippy and fmt * refactor for `check_unnest_placeholder_with_outer_ref` * add const for the prefix string of unnest and outer refernece column * fix case_column_or_null with nullable when conditions (#13886) * fix case_column_or_null with nullable when conditions * improve sqllogictests for case_column_or_null --------- Co-authored-by: zhangli20 <[email protected]> * Fixed Issue #13896 (#13903) The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used. 
* Introduce `UserDefinedLogicalNodeUnparser` for User-defined Logical Plan unparsing (#13880) * make ast builder public * introduce udlp unparser * add documents * add examples * add negative tests and fmt * fix the doc * rename udlp to extension * apply the first unparsing result only * improve the doc * seperate the enum for the unparsing result * fix the doc --------- Co-authored-by: Andrew Lamb <[email protected]> * Preserve constant values across union operations (#13805) * Add value tracking to ConstExpr for improved union optimization * Update PartialEq impl * Minor change * Add docstring for ConstExpr value * Improve constant propagation across union partitions * Add assertion for across_partitions * fix fmt * Update properties.rs * Remove redundant constant removal loop * Remove unnecessary mut * Set across_partitions=true when both sides are constant * Extract and use constant values in filter expressions * Add initial SLT for constant value tracking across UNION ALL * Assign values to ConstExpr where possible * Revert "Set across_partitions=true when both sides are constant" This reverts commit 3051cd470b0ad4a70cd8bd3518813f5ce0b3a449. * Temporarily take value from literal * Lint fixes * Cargo fmt * Add get_expr_constant_value * Make `with_value()` accept optional value * Add todo * Move test to union.slt * Fix changed slt after merge * Simplify constexpr * Update properties.rs --------- Co-authored-by: berkaysynnada <[email protected]> * chore(deps): update sqllogictest requirement from 0.23.0 to 0.24.0 (#13902) * fix RecordBatch size in topK (#13906) * ci improvements, update protoc (#13876) * Fix md5 return_type to only return Utf8 as per current code impl. * ci improvements * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Revert nextest change until action is approved. 
* Exclude requires workspace * Fixing minor typo to verify ci caching of builds is working as expected. * Updates from PR review. * Adding issue link for disabling intel mac build * improve performance of running examples * remove cargo check * Introduce LogicalPlan invariants, begin automatically checking them (#13651) * minor(13525): perform LP validation before and after each possible mutation * minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass * minor(13525): validate union after each optimizer passes * refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass * chore: add link to invariant docs * fix: add new invariants module * refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants * test: update test for slight error message change * fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause * refactor: move collect_subquery_cols() to common utils crate * refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass. 
* refactor: based upon performance tests, run the maximum number of checks without impact: * assert_valid_optimization can run each optimizer pass * remove the recursive check_fields, which caused the performance regression * the full LP Invariants::Executable can only run in debug * chore: update error naming and terminology used in code comments * refactor: use proper error methods * chore: more cleanup of error messages * chore: handle option trailer to error message * test: update sqllogictests tests to not use multiline * Correct return type for initcap scalar function with utf8view (#13909) * Set utf8view as return type when input type is the same * Verify that the returned type from call to scalar function matches the return type specified in the return_type function * Match return type to utf8view * Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs (#13905) * Implement maintains_input_order for AggregateExec (#13897) * Implement maintains_input_order for AggregateExec * Update mod.rs * Improve comments --------- Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: mertak-synnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]> * Move join type input swapping to pub methods on Joins (#13910) * doc-gen: migrate scalar functions (string) documentation 3/4 (#13926) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917) * Update sqllogictest requirement from 0.24.0 to 0.25.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.25.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> * Remove labels --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Consolidate Examples: memtable.rs and parquet_multiple_files.rs (#13913) * doc-gen: migrate scalar functions (crypto) documentation (#13918) * doc-gen: migrate scalar functions (crypto) documentation * doc-gen: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (datetime) documentation 1/2 (#13920) * doc-gen: migrate scalar functions (datetime) documentation 1/2 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * fix RecordBatch size in hash join (#13916) * doc-gen: migrate scalar functions (array) documentation 1/3 (#13928) * doc-gen: migrate scalar functions (array) documentation 1/3 * fix: remove unsed import, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 1/2 (#13922) * doc-gen: migrate scalar functions (math) documentation 1/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 2/2 (#13923) * doc-gen: migrate scalar functions (math) documentation 2/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 3/3 (#13930) * doc-gen: migrate scalar functions (array) documentation 3/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 2/3 (#13929) * doc-gen: migrate scalar functions (array) documentation 2/3 * fix: import doc and macro, fix typo and update function docs --------- 
Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (string) documentation 4/4 (#13927) * doc-gen: migrate scalar functions (string) documentation 4/4 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Support explain query when running dfbench with clickbench (#13942) * Support explain query when running dfbench * Address comments * Consolidate example to_date.rs into dateframe.rs (#13939) * Consolidate example to_date.rs into dateframe.rs * Assert results using assert_batches_eq * clippy * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" (#13945) * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" This reverts commit 0989649214a6fe69ffb33ed38c42a8d3df94d6bf. * add comment * Implement predicate pruning for `like` expressions (prefix matching) (#12978) * Implement predicate pruning for like expressions * add function docstring * re-order bounds calculations * fmt * add fuzz tests * fix clippy * Update datafusion/core/tests/fuzz_cases/pruning.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * doc-gen: migrate scalar functions (string) documentation 1/4 (#13924) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * consolidate dataframe_subquery.rs into dataframe.rs (#13950) * migrate btrim to user_doc macro (#13952) * doc-gen: migrate scalar functions (datetime) documentation 2/2 (#13921) * doc-gen: migrate scalar functions (datetime) documentation 2/2 * fix: fix typo and update function docs * doc: update function docs * doc-gen: remove slash --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests (#13936) * Fix md5 return_type to only return Utf8 as per current code impl. 
* Add support for sqlite test files to sqllogictest * Force version 0.24.0 of sqllogictest dependency until issue with labels is fixed. * Removed workaround for bug that was fixed. * Git submodule update ... err update, link to sqlite tests. * Git submodule update * Readd submodule --------- Co-authored-by: Andrew Lamb <[email protected]> * Supporting writing schema metadata when writing Parquet in parallel (#13866) * refactor: make ParquetSink tests a bit more readable * chore(11770): add new ParquetOptions.skip_arrow_metadata * test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement * refactor(11770): replace with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make the API apparant that you have to include the arrow schema or not * fix(11770): fix parallel ParquetSink to encode arrow schema into the file metadata, based on the ParquetOptions * refactor(11770): provide deprecation warning for TryFrom * test(11770): update tests with new default to include arrow schema * refactor: including partitioning of arrow schema inserted into kv_metdata * test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata * chore: avoid cloning in tests, and update code docs * refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions * refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration * chore: update configs.md * test: update tests to handle the (default) required arrow schema in the kv_metadata * chore: add reference to arrow-rs upstream PR * chore: Create devcontainer.json (#13520) * Create devcontainer.json * update devcontainer * remove useless features * Minor: consolidate ConfigExtension example into API docs (#13954) * Update examples 
README.md * Minor: consolidate ConfigExtension example into API docs * more docs * Remove update * clippy * Fix issue with ExtensionsOptions docs * Parallelize pruning utf8 fuzz test (#13947) * Add swap_inputs to SMJ (#13984) * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows (#13966) * added failing test * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows * Update datafusion/functions-nested/src/set_ops.rs Co-authored-by: Andrew Lamb <[email protected]> * Update set_ops.rs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update release instructions for 44.0.0 (#13959) * Update release instructions for 44.0.0 * update macros and order * add functions-table * Add datafusion python 43.1.0 blog post to doc. (#13974) * Include license and notice files in more crates (#13985) * Extract postgres container from sqllogictest, update datafusion-testing pin (#13971) * Add support for sqlite test files to sqllogictest * Removed workaround for bug that was fixed. * Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock. * Add missing license header. * Update rstest requirement from 0.23.0 to 0.24.0 (#13977) Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version. - [Release notes](https://github.com/la10736/rstest/releases) - [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md) - [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0) --- updated-dependencies: - dependency-name: rstest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Move hash collision test to run only when merging to main. 
(#13973) * Update itertools requirement from 0.13 to 0.14 (#13965) * Update itertools requirement from 0.13 to 0.14 Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix build * Simplify * Update CLI lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988) * Rename hash_collision.yml to extended.yml and add comments * Adjust schedule, add comments * Update job, rerun * doc-gen: migrate scalar functions (string) documentation 2/4 (#13925) * doc-gen: migrate scalar functions (string) documentation 2/4 * doc-gen: update function docs * doc: fix related udf order for upper function in documentation * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * doc-gen: update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Oleks V <[email protected]> * Update substrait requirement from 0.50 to 0.51 (#13978) Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. 
- [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update release README for datafusion-cli publishing (#13982) * Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980) - Updated LastValueAccumulator to include requirement satisfaction check before updating the last value. - Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios. * Improve deserialize_to_struct example (#13958) * Cleanup deserialize_to_struct example * prettier * Apply suggestions from code review Co-authored-by: Jonah Gao <[email protected]> --------- Co-authored-by: Jonah Gao <[email protected]> * Update docs (#14002) * Optimize CASE expression for "expr or expr" usage. (#13953) * Apply optimization for ExprOrExpr. * Implement optimization similar to existing code. * Add sqllogictest. 
* feat(substrait): introduce consume_rel and consume_expression (#13963) * feat(substrait): introduce consume_rel and consume_expression Route calls to from_substrait_rel and from_substrait_rex through the SubstraitConsumer in order to allow users to provide their own behaviour * feat(substrait): consume nulls of user-defined types * docs(substrait): consume_rel and consume_expression docstrings * Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981) * Consolidate csv_opener.rs and json_opener.rs into a single example (#13955) * Update datafusion-examples/examples/csv_json_opener.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion-examples/README.md Co-authored-by: Andrew Lamb <[email protected]> * Apply code formatting with cargo fmt --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * FIX : Incorrect NULL handling in BETWEEN expression (#14007) * submodule update * FIX : Incorrect NULL handling in BETWEEN expression * Revert "submodule update" This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3. 
* fix incorrect unit test * move sqllogictest to expr * feat(substrait): modular substrait producer (#13931) * feat(substrait): modular substrait producer * refactor(substrait): simplify col_ref_offset handling in producer * refactor(substrait): remove column offset tracking from producer * docs(substrait): document SubstraitProducer * refactor: minor cleanup * feature: remove unused SubstraitPlanningState BREAKING CHANGE: SubstraitPlanningState is no longer available * refactor: cargo fmt * refactor(substrait): consume_ -> handle_ * refactor(substrait): expand match blocks * refactor: DefaultSubstraitProducer only needs serializer_registry * refactor: remove unnecessary warning suppression * fix(substrait): route expr conversion through handle_expr * cargo fmt * fix: Avoid re-wrapping planning errors Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000) * fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err * test: add tests for error formatting during planning * feat: support `RightAnti` for `SortMergeJoin` (#13680) * feat: support `RightAnti` for `SortMergeJoin` * feat: preserve session id when using cxt.enable_url_table() (#14004) * Return error message during planning when inserting into a MemTable with zero partitions. (#14011) * Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012) * Refactor max_rows for join plan, made it easier to understand * Simplified max_rows for Union * Chore: update wasm-supported crates, add tests (#14005) * Chore: update wasm-supported crates * format * Use workspace rust-version for all workspace crates (#14009) * [Minor] refactor: make ArraySort public for broader access (#14006) * refactor: make ArraySort public for broader access Changes the visibility of the ArraySort struct fromsuper to public. allows broader access to the struct, enabling its use in other modules and promoting better code reuse. 
* clippy and docs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017) * Update sqllogictest requirement from =0.24.0 to =0.26.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * remove version pin and note --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Eduard Karacharov <[email protected]> * `url` dependancy update (#14019) * `url` dependancy update * `url` version update for datafusion-cli * Minor: Improve zero partition check when inserting into `MemTable` (#14024) * Improve zero partition check when inserting into `MemTable` * update err msg * refactor: make structs public and implement Default trait (#14030) * Minor: Remove redundant implementation of `StringArrayType` (#14023) * Minor: Remove redundant implementation of StringArrayType Signed-off-by: Tai Le Manh <[email protected]> * Deprecate rather than remove StringArrayType --------- Signed-off-by: Tai Le Manh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. 
(#14014) * Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995) * Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema * Implement aggregate functions with spill handling in tests * Add tests for aggregate functions with and without spill handling * Move test related imports into mod test * Rename spill pool test functions for clarity and consistency * Refactor aggregate function imports to use fully qualified paths * Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream * Update aggregate test to use AVG instead of MAX * assert spill count * Refactor partial aggregate schema creation to use create_schema function * Refactor partial aggregation schema creation and remove redundant function * Remove unused import of Schema from arrow::datatypes in row_hash.rs * move spill pool testing for aggregate functions to physical-plan/src/aggregates * Use Arc::clone for schema references in aggregate functions * Encapsulate fields of `EquivalenceProperties` (#14040) * Encapsulate fields of `EquivalenceGroup` (#14039) * Fix error on `array_distinct` when input is empty #13810 (#14034) * fix * add test * oops --------- Co-authored-by: Cyprien Huet <[email protected]> * Update petgraph requirement from 0.6.2 to 0.7.1 (#14045) * Update petgraph requirement from 0.6.2 to 0.7.1 Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version. - [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst) - [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.7.1) --- updated-dependencies: - dependency-name: petgraph dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> * Update datafusion-cli/Cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]> * Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037) * Complete encapsulatug `OrderingEquivalenceClass` (make fields non pub) * fix doc * Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021) * chore: add test to verify that schema is inferred as expected * chore: add comment to method as suggested * chore: restructure to avoid need to clone * chore: fix flaw in rewrite * feat(optimizer): Enable filter pushdown on window functions (#14026) * feat(optimizer): Enable filter pushdown on window functions Ensures selections can be pushed past window functions similarly to what is already done with aggregations, when possible. * fix: Add missing dependency * minor(optimizer): Use 'datafusion-functions-window' as a dev dependency * docs(optimizer): Add example to filter pushdown on LogicalPlan::Window * Unparsing optimized (> 2 inputs) unions (#14031) * tests and optimizer in testing queries * unparse optimized unions * format Cargo.toml * format Cargo.toml * revert test * rewrite test to avoid cyclic dep * remove old test * cleanup * comments and error handling * handle union with lt 2 inputs * Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047) * Simplify error handling in case.rs (#13990) (#14033) * Simplify error handling in case.rs (#13990) * Fix issues causing GitHub checks to fail * Update datafusion/physical-expr/src/expressions/case.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800) 
* Add asynchronous catalog traits to help users that have asynchronous catalogs * Apply clippy suggestions * Address PR reviews * Remove allow_unused exceptions * Update remote catalog example to demonstrate new helper structs * Move schema_name / catalog_name parameters into resolve function and out of trait * Custom scalar to sql overrides support for DuckDB Unparser dialect (#13915) * Allow adding custom scalar to sql overrides for DuckDB (#68) * Add unit test: custom_scalar_overrides_duckdb * Move `with_custom_scalar_overrides` definition on `Dialect` trait level * Improve perfomance of `reverse` function (#14025) * Improve perfomance of 'reverse' function Signed-off-by: Tai Le Manh <[email protected]> * Apply sugestion change * Fix typo --------- Signed-off-by: Tai Le Manh <[email protected]> * docs(ci): use up-to-date protoc with docs.rs (#14048) * fix (#14042) Co-authored-by: Cyprien Huet <[email protected]> * Re-export TypeSignatureClass from the datafusion-expr package (#14051) * Fix clippy for Rust 1.84 (#14065) * fix: incorrect error message of function_length_check (#14056) * minor fix * add ut * remove check for 0 arg * test: Add plan execution during tests for bounded source (#14013) * Bump `ctor` to `0.2.9` (#14069) * Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone (#14038) * Move collapse_lex_ordering to Lexordering::collapse * reduce diff * avoid clone, cleanup * Introduce LexRequirement::collapse * Improve performance of collapse, from @akurmustafa https://github.com/alamb/datafusion/pull/26 fix formatting * Revert "Improve performance of collapse, from @akurmustafa" This reverts commit a44acfdb3af5bf0082c277de6ee7e09e92251a49. * remove incorrect comment --------- Co-authored-by: Mustafa Akur <[email protected]> * Bump `wasm-bindgen` and `wasm-bindgen-futures` (#14068) * update (#14070) * fix: make get_valid_types handle TypeSignature::Numeric correctly (#14060) * fix get_valid_types with TypeSignature::Numeric * f…

wiedld added 3 commits December 4, 2024 15:24

minor(13525): perform LP validation before and after each possible mutation

6d43dc2

minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass

a855811

minor(13525): validate union after each optimizer passes

0163a40

github-actions bot added the optimizer Optimizer rules label Dec 5, 2024

wiedld commented Dec 5, 2024

datafusion/optimizer/src/analyzer/mod.rs (outdated, resolved)

wiedld commented Dec 5, 2024

datafusion/optimizer/src/optimizer.rs (outdated, resolved)

alamb mentioned this pull request Dec 5, 2024

Automatically check "invariants" #13652

Open

3 tasks

alamb reviewed Dec 5, 2024

datafusion/optimizer/src/optimizer.rs (outdated, resolved)

wiedld commented Dec 5, 2024

findepi reviewed Dec 5, 2024

refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass

bee7e92

github-actions bot added the logical-expr Logical plan and expressions label Dec 16, 2024

wiedld commented Dec 16, 2024

datafusion/optimizer/src/analyzer/mod.rs (outdated, resolved)

chore: add link to invariant docs

4eee9c4

wiedld commented Dec 16, 2024

datafusion/expr/src/logical_plan/plan.rs (outdated, resolved)

alamb reviewed Dec 16, 2024

fix: add new invariants module

a7d9770

alamb mentioned this pull request Dec 16, 2024

Minor: improve Analyzer docs #13798

Merged

wiedld added 4 commits December 17, 2024 12:39

Merge branch 'main' into 13525/invariant-checking-for-implicit-LP-changes

72718ad

refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants

2002b1a

test: update test for slight error message change

fbc9c46

fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause

e52187e

wiedld force-pushed the 13525/invariant-checking-for-implicit-LP-changes branch from 651f5d3 to e52187e Compare December 17, 2024 23:35

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 17, 2024

wiedld commented Dec 17, 2024

datafusion/expr/src/logical_plan/invariants.rs (resolved)

refactor: move collect_subquery_cols() to common utils crate

ad1a1f8

wiedld marked this pull request as draft December 23, 2024 20:15

wiedld force-pushed the 13525/invariant-checking-for-implicit-LP-changes branch from 10cd5d5 to 1164a7b Compare December 23, 2024 20:26

refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass.

1164a7b

wiedld marked this pull request as ready for review December 24, 2024 04:10

alamb approved these changes Dec 24, 2024

berkaysynnada reviewed Dec 24, 2024

datafusion/expr/src/logical_plan/invariants.rs (outdated, resolved)

berkaysynnada reviewed Dec 24, 2024

jonahgao approved these changes Dec 24, 2024

datafusion/expr/src/logical_plan/invariants.rs (outdated, resolved)

datafusion/optimizer/src/optimizer.rs (outdated, resolved)

wiedld force-pushed the 13525/invariant-checking-for-implicit-LP-changes branch from 1884564 to 9bca470 Compare December 24, 2024 21:24

wiedld commented Dec 24, 2024

datafusion/sqllogictest/test_files/subquery.slt (outdated, resolved)

wiedld added 5 commits December 24, 2024 14:50

chore: update error naming and terminology used in code comments

911d4b8

refactor: use proper error methods

810246d

chore: more cleanup of error messages

9842d19

Merge branch 'main' into 13525/invariant-checking-for-implicit-LP-changes

00700ae

chore: handle option trailer to error message

9bca470

test: update sqllogictests tests to not use multiline

529ac3e

alamb approved these changes Dec 26, 2024

alamb merged commit cf8f2f8 into apache:main Dec 26, 2024
24 checks passed

alamb mentioned this pull request Jan 1, 2025

Jan 1, 2025: This week(s) in DataFusion #13970

Closed

11 tasks

This was referenced Jan 2, 2025

Interface for physical plan invariant checking. #13986

Merged

Define extension API for user-defined invariants. #14029

Open

edmondop mentioned this pull request Jan 21, 2025

Automatically check "invariants" edmondop/arrow-datafusion#3

Open
