Track whether an optimization decision was cost-based #20990

Merged
merged 1 commit into from
Oct 10, 2023

@mlyublena mlyublena commented Sep 28, 2023

Some of Presto's optimizers are heuristic, while others are cost-based. This change allows tracking which optimizers were driven by a cost-based decision (independent of whether the cost was estimated or supplied by HBO). This information is added to PlanOptimizerInformation and can be seen in the explain plan when verbose_optimizer_info_enabled=true, for example:

presto:tpch> explain select lineitem.linenumber,count(*) from orders join lineitem on (lineitem.orderkey=orders.orderkey) group by linenumber;
                                   Query Plan
--------------------------------------------------------------------------------
 - Output[PlanNodeId 12][linenumber, _col1] => [linenumber:integer, count:bigint]
         _col1 := count (1:36)
...
 Triggered optimizers: [AddLocalExchanges, ApplyConnectorOptimization, HashGenerationOptimizer, PickTableLayoutWithoutPredicate, PruneJoinChildrenColumns, PruneJoinColumns, PruneProjectColumns, PruneTableScanColumns, PruneUnreferencedOutputs, PushPartialAggregationThroughExchange, RemoveRedundantDistinctAggregation, RemoveRedundantIdentityProjections, ReorderJoins, SetFlatteningOptimizer, SimplifyPlanWithEmptyInput, StatsRecordingPlanOptimizer, UnaliasSymbolReferences]
 Applicable optimizers: [AddNotNullFiltersToJoinNode, KeyBasedSampler, MergePartialAggregationsWithFilter, PushPartialAggregationThroughJoin]
 Cost-based optimizers: [PushPartialAggregationThroughExchange(CBO), ReorderJoins(CBO)]

Note: currently we don't track whether the cost came from HBO or from cost estimation, only that the decision was cost-driven. We already have a change that tracks the source of the cost estimate, so the two could be intersected to recover this information. Alternatively, I could track and log the cost source directly in this change.

== RELEASE NOTES ==

General Changes
* Print information about cost-based optimizers and the source of stats they use (CBO/HBO) in explain plans when session property verbose_optimizer_info_enabled=true

@mlyublena mlyublena marked this pull request as ready for review September 28, 2023 21:44
@mlyublena mlyublena requested a review from a team as a code owner September 28, 2023 21:44
  List<String> triggeredOptimizers = planOptimizerInfo.stream()
          .filter(x -> x.getOptimizerTriggered())
-         .map(x -> x.getOptimizerName()).collect(toList());
+         .map(x -> x.getOptimizerName()).distinct().sorted().collect(toList());
Contributor Author

Cleaned up the output so that it is sorted and de-duplicated.
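
For readers outside the codebase, here is a minimal, self-contained sketch of what the added `.distinct().sorted()` does to the triggered-optimizer list. `OptimizerInfo` is a hypothetical stand-in reduced to the two accessors used in the snippet above; it is not Presto's actual PlanOptimizerInformation class.

```java
import java.util.List;
import java.util.stream.Collectors;

class TriggeredOptimizersSketch
{
    // Hypothetical stand-in for PlanOptimizerInformation (assumption, not the real class).
    static final class OptimizerInfo
    {
        private final String optimizerName;
        private final boolean optimizerTriggered;

        OptimizerInfo(String optimizerName, boolean optimizerTriggered)
        {
            this.optimizerName = optimizerName;
            this.optimizerTriggered = optimizerTriggered;
        }

        String getOptimizerName()
        {
            return optimizerName;
        }

        boolean getOptimizerTriggered()
        {
            return optimizerTriggered;
        }
    }

    // Keeps only triggered optimizers, drops repeated names, and sorts alphabetically.
    static List<String> triggeredOptimizers(List<OptimizerInfo> infos)
    {
        return infos.stream()
                .filter(OptimizerInfo::getOptimizerTriggered)
                .map(OptimizerInfo::getOptimizerName)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }
}
```

An optimizer that fires several times during planning would otherwise appear once per firing, in firing order.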

@@ -201,6 +211,8 @@ private void addJoinsWithDifferentDistributions(JoinNode joinNode, List<PlanNode

private JoinNode getSyntacticOrderJoin(JoinNode joinNode, Context context, JoinDistributionType joinDistributionType)
{
isCostBased = false;
Contributor

I wonder if the caller could pass this information rather than setting it here?
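
One way to read this suggestion, sketched under assumptions: instead of a mutable `isCostBased` field that helper methods reset, the decision method returns the flag together with its result, so the caller decides what to record. `Decision`, `chooseDistribution`, and the PARTITIONED fallback are hypothetical names and choices for illustration only, not the code in this PR.

```java
import java.util.OptionalDouble;

class JoinDecisionSketch
{
    // Hypothetical result type: the chosen distribution plus how it was chosen.
    static final class Decision
    {
        final String distributionType;
        final boolean costBased;

        Decision(String distributionType, boolean costBased)
        {
            this.distributionType = distributionType;
            this.costBased = costBased;
        }
    }

    // An empty cost means the cost is unknown, so the choice cannot be cost-based.
    static Decision chooseDistribution(OptionalDouble partitionedCost, OptionalDouble replicatedCost)
    {
        if (!partitionedCost.isPresent() || !replicatedCost.isPresent()) {
            // Fallback when costs are unknown (the PARTITIONED default is an assumption).
            return new Decision("PARTITIONED", false);
        }
        boolean partitionedIsCheaper = partitionedCost.getAsDouble() <= replicatedCost.getAsDouble();
        return new Decision(partitionedIsCheaper ? "PARTITIONED" : "REPLICATED", true);
    }
}
```

Returning the flag keeps the rule stateless, at the cost of threading an extra value through each call site.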

if (isTriggered || isApplicable) {
session.getOptimizerInformationCollector().addInformation(new PlanOptimizerInformation(optimizerName, isTriggered, Optional.of(isApplicable), Optional.empty()));
boolean isCostBased = optimizer.isCostBased(session);
if (isTriggered || isApplicable || isCostBased) {
Contributor

@ClarenceThreepwood ClarenceThreepwood Oct 3, 2023

Wouldn't this disjunction cause all cost-based optimizers to be logged? Currently it does not matter, since none of the PlanOptimizers are cost-based, but if that changes, this would no longer be correct.

Contributor Author

I think you're right. I'll change it to log isCostBased only if the optimization triggered.
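
A minimal sketch of the behavior agreed on in this thread, assuming the cost-based flag is carried as an optional value next to the triggered and applicable flags. `Info` and `record` are hypothetical names; this is not the merged code.

```java
import java.util.Optional;

class CostBasedLoggingSketch
{
    // Hypothetical record of what gets logged for one optimizer.
    static final class Info
    {
        final String optimizerName;
        final boolean triggered;
        final boolean applicable;
        final Optional<Boolean> costBased;

        Info(String optimizerName, boolean triggered, boolean applicable, Optional<Boolean> costBased)
        {
            this.optimizerName = optimizerName;
            this.triggered = triggered;
            this.applicable = applicable;
            this.costBased = costBased;
        }
    }

    // Being cost-based alone never causes an entry to be logged;
    // the flag is attached only when the optimizer actually triggered.
    static Optional<Info> record(String optimizerName, boolean isTriggered, boolean isApplicable, boolean isCostBased)
    {
        if (!isTriggered && !isApplicable) {
            return Optional.empty();
        }
        Optional<Boolean> costBased = isTriggered ? Optional.of(isCostBased) : Optional.empty();
        return Optional.of(new Info(optimizerName, isTriggered, isApplicable, costBased));
    }
}
```

With the original disjunction `isTriggered || isApplicable || isCostBased`, the first case in which nothing triggered would still have produced a log entry.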

@@ -63,12 +64,20 @@ public class DetermineJoinDistributionType
private final CostComparator costComparator;
private final TaskCountEstimator taskCountEstimator;

// records whether distribution decision was cost-based
private boolean isCostBased;
Contributor

\n

@ClarenceThreepwood
Contributor

Add these as well?

DetermineSemiJoinDistributionType
TransformDistinctInnerJoinToRightEarlyOutJoin

Contributor

@pranjalssh pranjalssh left a comment

Looks good modulo the nits. Please fix the release notes as well.

@mlyublena mlyublena force-pushed the log-hbo-optimizer-info branch 6 times, most recently from 2ba92de to 4eb770b October 10, 2023 04:14
Some of Presto's optimizers are heuristic, while others are cost-based.
This change allows tracking which optimizers were driven by a cost-based decision (independent of whether the cost was estimated or supplied by HBO).
This information is added to PlanOptimizerInformation and can be seen in the explain plan when verbose_optimizer_info_enabled=true, for example:

presto:tpch> explain select lineitem.linenumber,count(*) from orders join lineitem on (lineitem.orderkey=orders.orderkey) group by linenumber;
                                   Query Plan
--------------------------------------------------------------------------------
 - Output[PlanNodeId 12][linenumber, _col1] => [linenumber:integer, count:bigint]
         _col1 := count (1:36)
     - RemoteStreamingExchange[PlanNodeId 297][GATHER] => [linenumber:integer, count:bigint]
         - Project[PlanNodeId 406][projectLocality = LOCAL] => [linenumber:integer, count:bigint]
             - Aggregate(FINAL)[linenumber][$hashvalue][PlanNodeId 7] => [linenumber:integer, $hashvalue:bigint, count:bigint]
                     count := "presto.default.count"((count_15)) (1:36)
                 - LocalExchange[PlanNodeId 355][HASH][$hashvalue] (linenumber) => [linenumber:integer, count_15:bigint, $hashvalue:bigint]
                     - RemoteStreamingExchange[PlanNodeId 361][REPARTITION][$hashvalue_16] => [linenumber:integer, count_15:bigint, $hashvalue_16:bigint]
                         - Aggregate(PARTIAL)[linenumber][$hashvalue_22][PlanNodeId 359] => [linenumber:integer, $hashvalue_22:bigint, count_15:bigint]
                                 count_15 := "presto.default.count"(*) (1:36)
                             - Project[PlanNodeId 405][projectLocality = LOCAL] => [linenumber:integer, $hashvalue_22:bigint]
                                     Estimates: {source: CostBasedSourceInfo, rows: 58490 (799.67kB), cpu: 7320844.01, memory: 270000.00, network: 1654025.00}
                                     $hashvalue_22 := combine_hash(BIGINT'0', COALESCE($operator$hash_code(linenumber), BIGINT'0')) (1:119)
                                 - InnerJoin[PlanNodeId 271][("orderkey_0" = "orderkey")][$hashvalue_17, $hashvalue_19] => [linenumber:integer]
                                         Estimates: {source: CostBasedSourceInfo, rows: 58490 (799.67kB), cpu: 6501977.37, memory: 270000.00, network: 1654025.00}
                                         Distribution: PARTITIONED
                                     - RemoteStreamingExchange[PlanNodeId 294][REPARTITION][$hashvalue_17] => [orderkey_0:bigint, linenumber:integer, $hashvalue_17:bigint]
                                             Estimates: {source: CostBasedSourceInfo, rows: 60175 (822.71kB), cpu: 3610500.00, memory: 0.00, network: 1384025.00}
                                         - ScanProject[PlanNodeId 1,403][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpch, tableName=lineitem, analyzePartitionValues=Optional.empty}', layout='Optional[tpch.lineitem{}]'}, projectLocality = LOCAL] => [orderkey_0:bigint, linenumber:integer, $hashvalue_18:bigint]
                                                 Estimates: {source: CostBasedSourceInfo, rows: 60175 (822.71kB), cpu: 842450.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: 60175 (822.71kB), cpu: 2226475.00, memory: 0.00, network: 0.00}
                                                 $hashvalue_18 := combine_hash(BIGINT'0', COALESCE($operator$hash_code(orderkey_0), BIGINT'0')) (1:62)
                                                 LAYOUT: tpch.lineitem{}
                                                 orderkey_0 := orderkey:bigint:0:REGULAR (1:62)
                                                 linenumber := linenumber:int:3:REGULAR (1:62)
                                     - LocalExchange[PlanNodeId 338][HASH][$hashvalue_19] (orderkey) => [orderkey:bigint, $hashvalue_19:bigint]
                                             Estimates: {source: CostBasedSourceInfo, rows: 15000 (205.08kB), cpu: 945000.00, memory: 0.00, network: 270000.00}
                                         - RemoteStreamingExchange[PlanNodeId 295][REPARTITION][$hashvalue_20] => [orderkey:bigint, $hashvalue_20:bigint]
                                                 Estimates: {source: CostBasedSourceInfo, rows: 15000 (205.08kB), cpu: 675000.00, memory: 0.00, network: 270000.00}
                                             - ScanProject[PlanNodeId 0,404][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpch, tableName=orders, analyzePartitionValues=Optional.empty}', layout='Optional[tpch.orders{}]'}, projectLocality = LOCAL] => [orderkey:bigint, $hashvalue_21:bigint]
                                                     Estimates: {source: CostBasedSourceInfo, rows: 15000 (205.08kB), cpu: 135000.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: 15000 (205.08kB), cpu: 405000.00, memory: 0.00, network: 0.00}
                                                     $hashvalue_21 := combine_hash(BIGINT'0', COALESCE($operator$hash_code(orderkey), BIGINT'0')) (1:50)
                                                     LAYOUT: tpch.orders{}
                                                     orderkey := orderkey:bigint:0:REGULAR (1:50)
 Triggered optimizers: [AddLocalExchanges, ApplyConnectorOptimization, HashGenerationOptimizer, PickTableLayoutWithoutPredicate, PruneJoinChildrenColumns, PruneJoinColumns, PruneProjectColumns, PruneTableScanColumns, PruneUnreferencedOutputs, PushPartialAggregationThroughExchange, RemoveRedundantDistinctAggregation, RemoveRedundantIdentityProjections, ReorderJoins, SetFlatteningOptimizer, SimplifyPlanWithEmptyInput, StatsRecordingPlanOptimizer, UnaliasSymbolReferences]
 Applicable optimizers: [AddNotNullFiltersToJoinNode, KeyBasedSampler, MergePartialAggregationsWithFilter, PushPartialAggregationThroughJoin]
 Cost-based optimizers: [PushPartialAggregationThroughExchange(CBO), ReorderJoins(CBO)]