-
Notifications
You must be signed in to change notification settings - Fork 14.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Crash caused by numpy.vectorize #21936
Conversation
|
||
with np.nditer(result, flags=["refs_ok"], op_flags=["readwrite"]) as it: | ||
for obj in it: | ||
obj[...] = stringify(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never come across the ...
in the context of Numpy before, but this logic was lifted from their official documentation.
Update result_set.py Update result_set.py
8c1f001
to
83d8739
Compare
Codecov Report
@@ Coverage Diff @@
## master #21936 +/- ##
==========================================
+ Coverage 66.85% 67.00% +0.15%
==========================================
Files 1807 1807
Lines 69190 69315 +125
Branches 7402 7402
==========================================
+ Hits 46258 46447 +189
+ Misses 21021 20957 -64
Partials 1911 1911
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - I assume this is stringifying a query result that contains lots of massive objects or something? Anyway, IMO In the long run this whole data passing-through logic should be refactored to stream through data in chunks using some proper binary format rather than processing the whole thing in memory and then dumping it all as one massive JSON blob. So until that happens (never? 😄) I think we need to prioritize stability over performance. If this does cause an unacceptable perf hit we can probably try to figure out a reasonable threshold that determines whether to use vectorization vs row iteration.
@villebro, per your comment,
Yes this was the case. The type in question was an array of string and it barfed when the row limit was increased from 10,000 to 100,000 records. |
(cherry picked from commit 059e53a)
(cherry picked from commit 059e53a)
SUMMARY
We (Airbnb) has a user report an error where in SQL Lab an aysnc query would run for infinitum when the row limit was increased. The issue was the Celery worker crashed with the following error:
It turns out the root cause was a call to
numpy.vectorize
function which per here can consume copious amounts of memory. The fix was merely to un-vectorize the logic using a iterator per the Numpy documentation, acknowledging that there may be a performance hit.Note this is the only place in the code base which uses the
numpy.vectorize
function.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
CI.
ADDITIONAL INFORMATION