perf(postprocessing): improve pivot postprocessing operation #23465
Conversation
Codecov Report
@@ Coverage Diff @@
## master #23465 +/- ##
==========================================
+ Coverage 65.76% 67.62% +1.85%
==========================================
Files 1908 1908
Lines 73726 73736 +10
Branches 7989 7988 -1
==========================================
+ Hits 48489 49861 +1372
+ Misses 23189 21829 -1360
+ Partials 2048 2046 -2
... and 90 files with indirect coverage changes
Awesome perf improvement! One comment about the PR description and the added comment in the diff, but other than that I feel like this is good to go if CI passes.
@@ -100,11 +100,9 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
         margins_name=marginal_distribution_name,
     )

+    # removes additional columns from pandas issue #18030 described above
In general I don't think we should place comments like this in the codebase, as over time it will cause a significant amount of clutter. People can always check `git blame` and take a look at the underlying PR if they need more context, and that's where info like this should be placed IMO (btw, the issue referenced here doesn't appear to be mentioned in the PR description). In addition, it was unclear to me what is meant by "described above" - there is no description directly above this block.
Sure, I can remove this if you all prefer to keep the comments to a minimum. The intent was to make it more obvious that the workaround for the pandas issue is a two-step operation.
I'll add a ref to the original PR for better context, in case this ever gets fixed in pandas.
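For context, a minimal sketch of what a two-step workaround of this kind looks like (the data and column names below are illustrative, not the actual Superset code): step one pivots with margins enabled, which per pandas issue #18030 can emit extra columns; step two removes anything outside the expected column set.

```python
# Illustrative sketch only; data and names are hypothetical, not Superset code.
import pandas as pd

df = pd.DataFrame(
    {"cat": ["a", "a", "b"], "sub": ["x", "y", "x"], "val": [1, 2, 3]}
)

# Step 1: pivot with margins enabled. Per pandas issue #18030, combining
# margins=True with dropna=False can produce spurious extra columns.
pivoted = df.pivot_table(
    index="cat",
    columns="sub",
    values="val",
    aggfunc="sum",
    margins=True,
    margins_name="Total",
    dropna=False,
)

# Step 2: keep only the expected columns, discarding anything extra that
# the margins computation may have added.
expected = {"x", "y", "Total"}
pivoted = pivoted[[col for col in pivoted.columns if col in expected]]
```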
Executing a pivot with `drop_missing_columns=False` and lots of resulting columns can increase the postprocessing time by seconds or even minutes for large datasets. The main culprit is the `df.drop(...)` operation in the for loop. We can refactor this slightly, without any change to the results, and push the postprocessing time down to seconds instead of minutes for large datasets (millions of columns). Fixes apache#23464
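To make the performance point concrete, here is a hedged sketch of the refactor idea (function names and shapes are illustrative, not the actual Superset code): dropping columns one at a time inside a loop copies the frame on every iteration, while deciding once and selecting once is a single linear pass.

```python
import pandas as pd

def keep_columns_slow(df: pd.DataFrame, wanted: set) -> pd.DataFrame:
    # Each df.drop(...) call returns a copy of the whole frame, so total
    # work grows roughly quadratically with the number of columns.
    for col in list(df.columns):
        if col not in wanted:
            df = df.drop(col, axis=1)
    return df

def keep_columns_fast(df: pd.DataFrame, wanted: set) -> pd.DataFrame:
    # Decide once, select once: one pass over the columns followed by a
    # single indexing operation, with identical results.
    return df[[col for col in df.columns if col in wanted]]
```

On frames with thousands or millions of columns, this kind of change is what turns minutes into seconds, matching the numbers quoted in the description.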
The branch was force-pushed from 8641834 to 20bc99a.
LGTM, but I'd be curious to hear @zhaoyongjie's thoughts, too.
@@ -87,7 +87,7 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
     if not drop_missing_columns and columns:
         for row in df[columns].itertuples():
             for metric in aggfunc.keys():
-                series_set.add(str(tuple([metric]) + tuple(row[1:])))
+                series_set.add(tuple([metric]) + tuple(row[1:]))
@zhaoyongjie do you have any reservations about this change? IMO we have pretty good test coverage, so if this change doesn't break existing tests I feel it's good to go (if it does cause a regression, that's just a sign that we needed more/better tests). So I feel this change is warranted, as it simplifies the code.
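For illustration, a small sketch of why keeping raw tuples in the set simplifies the comparison (the data here is hypothetical): a pandas MultiIndex yields its entries as tuples, so a set of tuples supports direct membership tests, whereas stringified keys would force a `str(...)` conversion on every lookup.

```python
import pandas as pd

# Hypothetical pivot output columns with a (metric, dimension) header.
columns = pd.MultiIndex.from_tuples([("sum", "a"), ("sum", "b"), ("sum", "c")])

# With tuples in the set, MultiIndex entries compare directly...
series_set = {("sum", "a"), ("sum", "c")}
kept = [col for col in columns if col in series_set]
print(kept)  # [('sum', 'a'), ('sum', 'c')]

# ...whereas stringified keys would need str(col) on every membership test:
series_set_str = {str(("sum", "a")), str(("sum", "c"))}
kept_str = [col for col in columns if str(col) in series_set_str]
```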
SUMMARY
See the linked issue for a description of how this can affect a Superset user.
Executing a pivot with `drop_missing_columns=False` and lots of resulting columns can increase the postprocessing time by seconds or even minutes for large datasets. The main culprit is the `df.drop(...)` operation in the for loop. We can refactor this slightly, without any change to the results, and push the postprocessing time down to seconds instead of minutes for large datasets (thousands or millions of columns).

TESTING INSTRUCTIONS
Luckily, this feature is already covered by unit tests.
Manual test to see the performance improvement:
- Dataset: `cleaned_sales_data`
- Dimensions: `contact_first_name`, `contact_last_name`, `phone`
- Chart type: Time-series Line Chart

Results on this branch: the chart loads within a few seconds.
Negative results (e.g. on `master`): the chart will time out or take a very long time.
ADDITIONAL INFORMATION
Related to #15975 - keeps the workaround and its tests, just changes the implementation very slightly.