This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Investigate exposing Arrow StringBuilder to numba #2

Closed
chmp opened this issue Mar 6, 2018 · 6 comments

Comments

@chmp
Collaborator

chmp commented Mar 6, 2018

The numba implementation should be much slower than Arrow's StringBuilder.

Question: what can numba interface with? The docs mention cffi.

@xhochy
Owner

xhochy commented Mar 7, 2018

What is the use case where we need the builder in Numba?

@chmp
Collaborator Author

chmp commented Mar 7, 2018

Whenever you want to return an array of strings and don't know what size the result will be.
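
To make the use case concrete, here is a minimal pure-Python sketch of the bookkeeping such a builder needs: one growable byte buffer plus an offsets buffer with one more entry than there are strings, which is the general shape of Arrow's variable-size string layout. The class `SketchStringBuilder` and the helpers `split_words` and `decode` are hypothetical names for illustration only. The method names mirror the `put_byte` / `finish_string` / `finish` calls quoted later in this thread, but this is not fletcher's or Arrow's implementation.

```python
from array import array


class SketchStringBuilder:
    """Illustrative only: not fletcher's or Arrow's implementation."""

    def __init__(self):
        self.data = bytearray()         # concatenated UTF-8 bytes of all strings
        self.offsets = array("i", [0])  # offsets[i] is where string i starts

    def put_byte(self, b):
        # Append one byte to the string currently being built.
        self.data.append(b)

    def finish_string(self):
        # Close the current string by recording where it ends.
        self.offsets.append(len(self.data))

    def finish(self):
        # Hand back immutable copies of both buffers.
        return bytes(self.data), list(self.offsets)


def split_words(text):
    """Split on spaces; the number of results is unknown until the scan ends."""
    sb = SketchStringBuilder()
    in_word = False
    for b in text.encode("utf-8"):
        if b == 0x20:  # ASCII space ends the current word, if any
            if in_word:
                sb.finish_string()
                in_word = False
        else:
            sb.put_byte(b)
            in_word = True
    if in_word:
        sb.finish_string()
    return sb.finish()


def decode(data, offsets):
    # String i occupies data[offsets[i]:offsets[i + 1]].
    return [data[s:e].decode("utf-8") for s, e in zip(offsets, offsets[1:])]
```

The caller never has to say up front how many strings or bytes will be produced; both buffers simply grow as the input is scanned.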

@xhochy
Owner

xhochy commented Mar 7, 2018

Seems like it is not easily possible to call C++ code from Numba. We have a GLib-based C API in Arrow that may be useful here: https://github.com/apache/arrow/blob/master/c_glib/arrow-glib/array-builder.h#L891
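
As a rough illustration of why a C API matters here: Python's foreign-function tools (ctypes, cffi) bind to functions with a plain C ABI and an explicitly declared signature, and Numba's documentation describes calling functions declared that way from nopython code. The sketch below only shows the declaration side, using the C standard library's `abs` as a stand-in. It does not touch Arrow or arrow-glib, and `c_abs` is a name chosen for this example.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature explicitly: int abs(int).
c_abs = libc.abs
c_abs.argtypes = [ctypes.c_int]
c_abs.restype = ctypes.c_int

print(c_abs(-7))
```

A C++ class such as a builder has no such stable C ABI, so it would first need a C wrapper (like the GLib one) before it could be bound this way.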

@alendit

alendit commented Dec 5, 2018

I've looked into this recently: with the current version of numba, fletcher's NumbaStringArrayBuilder is around 10 times slower than Arrow's StringBuilder for code like this:

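import numba
import fletcher as fr  # assumed alias; the original snippet only shows `fr`


@numba.njit(nogil=True)
def build_string(chars):
    # Build 10**6 strings of 10 bytes each from the flat `chars` buffer.
    sb = fr._numba_compat.NumbaStringArrayBuilder(10 ** 6, 10 ** 7)
    for str_idx in range(10 ** 6):
        for c_idx in range(10):
            sb.put_byte(chars[str_idx * 10 + c_idx])
        sb.finish_string()
    sb.finish()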

The slowdown comes from issues with inlining and optimization of numba's jitclasses.

The good news is that I'm working on a set of patches that will bring the runtime of the code above down from 400ms to 40ms on my machine, i.e. on par with the corresponding C++ code. The relevant issues are numba/numba#2166 and numba/numba#3305. The PR numba/numba#3531 fixes the inlining. The PR numba/llvmlite#429 to llvmlite will allow linking custom optimization passes, and finally there is local code which optimizes ref counting across the CFG.

TL;DR: It can take a while, but there is no reason why numba wouldn't be as fast as native C++ for this task.

@xhochy
Owner

xhochy commented Dec 5, 2018

@alendit That sounds promising. Thank you for taking a look into this!

@xhochy
Owner

xhochy commented Feb 22, 2023

This project has been archived as development has ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.

@xhochy xhochy closed this as completed Feb 22, 2023