perf(postprocessing): improve pivot postprocessing operation #23465
Conversation
Codecov Report
@@ Coverage Diff @@
## master #23465 +/- ##
==========================================
+ Coverage 65.76% 67.62% +1.85%
==========================================
Files 1908 1908
Lines 73726 73736 +10
Branches 7989 7988 -1
==========================================
+ Hits 48489 49861 +1372
+ Misses 23189 21829 -1360
+ Partials 2048 2046 -2
... and 90 files with indirect coverage changes
Awesome perf improvement! One comment about the PR description and the added comment in the diff, but other than that I feel like this is good to go if CI passes.
@@ -100,11 +100,9 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
         margins_name=marginal_distribution_name,
     )

+    # removes additional columns from pandas issue #18030 described above
In general I don't think we should place comments like this in the codebase, as over time it will cause a significant amount of clutter. People can always check `git blame` and take a look at the underlying PR if they need more context, and that's where info like this should be placed IMO (btw, the issue referenced here doesn't appear to be mentioned in the PR description). In addition, it was unclear to me what is meant by "described above" - there is no description directly above this block.
Sure, I can remove this if you all prefer to keep the comments to a minimum. The intent was to make it more obvious that the workaround for the pandas issue is a two-step operation.
I'll add a ref to the original PR for better context, in case this ever gets fixed in pandas.
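For context, a minimal sketch of what a two-step workaround of this kind looks like (the data and column names below are illustrative, not the actual Superset code): step one pivots with margins enabled, which per pandas issue #18030 can emit extra columns; step two removes anything outside the expected column set.

```python
# Illustrative sketch only; data and names are hypothetical, not Superset code.
import pandas as pd

df = pd.DataFrame(
    {"cat": ["a", "a", "b"], "sub": ["x", "y", "x"], "val": [1, 2, 3]}
)

# Step 1: pivot with margins enabled. Per pandas issue #18030, combining
# margins=True with dropna=False can produce spurious extra columns.
pivoted = df.pivot_table(
    index="cat",
    columns="sub",
    values="val",
    aggfunc="sum",
    margins=True,
    margins_name="Total",
    dropna=False,
)

# Step 2: keep only the expected columns, discarding anything extra that
# the margins computation may have added.
expected = {"x", "y", "Total"}
pivoted = pivoted[[col for col in pivoted.columns if col in expected]]
```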
Executing a pivot with `drop_missing_columns=False` and lots of resulting columns can increase the postprocessing time by seconds or even minutes for large datasets. The main culprit is the `df.drop(...)` operation in the for loop. We can refactor this slightly, without any change to the results, and push the postprocessing time down to seconds instead of minutes for large datasets (millions of columns). Fixes apache#23464
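To make the performance point concrete, here is a hedged sketch of the refactor idea (function names and shapes are illustrative, not the actual Superset code): dropping columns one at a time inside a loop copies the frame on every iteration, while deciding once and selecting once is a single linear pass.

```python
import pandas as pd

def keep_columns_slow(df: pd.DataFrame, wanted: set) -> pd.DataFrame:
    # Each df.drop(...) call returns a copy of the whole frame, so total
    # work grows roughly quadratically with the number of columns.
    for col in list(df.columns):
        if col not in wanted:
            df = df.drop(col, axis=1)
    return df

def keep_columns_fast(df: pd.DataFrame, wanted: set) -> pd.DataFrame:
    # Decide once, select once: one pass over the columns followed by a
    # single indexing operation, with identical results.
    return df[[col for col in df.columns if col in wanted]]
```

On frames with thousands or millions of columns, this kind of change is what turns minutes into seconds, matching the numbers quoted in the description.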
The branch was force-pushed from 8641834 to 20bc99a.
LGTM, but I'd be curious to hear @zhaoyongjie's thoughts, too.
@@ -87,7 +87,7 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
     if not drop_missing_columns and columns:
         for row in df[columns].itertuples():
             for metric in aggfunc.keys():
-                series_set.add(str(tuple([metric]) + tuple(row[1:])))
+                series_set.add(tuple([metric]) + tuple(row[1:]))
@zhaoyongjie do you have any reservations about this change? IMO we have pretty good test coverage, so if this change doesn't break existing tests I feel it's good to go (if it does cause a regression, that's just a sign that we needed more/better tests). So I feel this change is warranted, as it simplifies the code.
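For illustration, a small sketch of why keeping raw tuples in the set simplifies the comparison (the data here is hypothetical): a pandas MultiIndex yields its entries as tuples, so a set of tuples supports direct membership tests, whereas stringified keys would force a `str(...)` conversion on every lookup.

```python
import pandas as pd

# Hypothetical pivot output columns with a (metric, dimension) header.
columns = pd.MultiIndex.from_tuples([("sum", "a"), ("sum", "b"), ("sum", "c")])

# With tuples in the set, MultiIndex entries compare directly...
series_set = {("sum", "a"), ("sum", "c")}
kept = [col for col in columns if col in series_set]
print(kept)  # [('sum', 'a'), ('sum', 'c')]

# ...whereas stringified keys would need str(col) on every membership test:
series_set_str = {str(("sum", "a")), str(("sum", "c"))}
kept_str = [col for col in columns if str(col) in series_set_str]
```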
SUMMARY
See the linked issue for a description of how this can affect a Superset user.
Executing a pivot with `drop_missing_columns=False` and lots of resulting columns can increase the postprocessing time by seconds or even minutes for large datasets. The main culprit is the `df.drop(...)` operation in the for loop. We can refactor this slightly, without any change to the results, and push the postprocessing time down to seconds instead of minutes for large datasets (thousands or millions of columns).

TESTING INSTRUCTIONS
Luckily, this feature is already covered by unit tests.
Manual test to see the performance improvement:
- Dataset: `cleaned_sales_data`
- Dimensions: `contact_first_name`, `contact_last_name`, `phone`
- Chart type: Time-series Line Chart

Results on this branch: the chart loads within a few seconds.
Negative results (e.g. on `master`): the chart will time out or take a very long time.
ADDITIONAL INFORMATION
Related to #15975 - keeps the workaround and its tests, just changes the implementation very slightly.