Refactor: use Apache Arrow compute for string function #885

Merged
merged 3 commits into from
Aug 21, 2020


maartenbreddels
Member

This is a draft PR to check the status of Arrow compute with vaex. We will likely cherry-pick from this branch as Arrow makes new releases.

@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch 10 times, most recently from adf48db to da0a232 Compare July 3, 2020 14:49
@xhochy

xhochy commented Jul 3, 2020

FYI: If you have a bit of patience (like 24h of patience), you could use the arrow conda packages in the arrow-nightlies channel instead of building it yourself.

@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch from da0a232 to 229f9ce Compare July 3, 2020 18:38
@maartenbreddels
Member Author

Great, I didn't know it existed, and it was difficult to find. Thanks a lot!

@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch 2 times, most recently from 820f61c to d55a96a Compare July 10, 2020 17:05
@maartenbreddels
Member Author

@JovanVeljanoski it would be great if you could add/finish the str->boolean functions added in apache/arrow#7656.
There are a few new ones (see compute.py), and binary_isascii especially is something we may want to think about. Maybe we want to expose this under Expression.str.isascii(), and later also under Expression.binary.isascii() if we add that accessor.
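
For illustration only, here is a minimal pure-Python sketch of the semantics an element-wise `isascii` check could have for both string and binary values. The helper name `isascii` and the choice to pass missing values through as `None` are assumptions made for this example; this is not the Arrow kernel and not vaex's implementation.

```python
from typing import List, Optional, Union

# A value is a string, raw bytes, or missing (None).
Value = Optional[Union[str, bytes]]


def isascii(values: List[Value]) -> List[Optional[bool]]:
    """Hypothetical helper: element-wise ASCII check; missing stays missing."""
    result: List[Optional[bool]] = []
    for value in values:
        if value is None:
            # Assumption: missing values propagate instead of becoming False.
            result.append(None)
        elif isinstance(value, str):
            # str.isascii() is available from Python 3.7.
            result.append(value.isascii())
        else:
            # For bytes, every byte must be in the 7-bit range.
            result.append(all(byte < 128 for byte in value))
    return result


strings = ["vaex", "naïve", None]
binaries = [b"vaex", "naïve".encode("utf8"), None]

str_result = isascii(strings)
binary_result = isascii(binaries)
```

In this sketch the string and the UTF-8 encoded binary inputs give the same answer, since a UTF-8 string is ASCII exactly when all of its bytes are below 128. That is one reason the same check could plausibly sit behind both a `str` and a `binary` accessor.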

@maartenbreddels
Member Author

@JovanVeljanoski I think I want to merge this early and leave the rest for you to do in a different PR. We need some of this in #865, and I also want to merge that soon.
This means that master will very likely contain breaking changes, so the next release will be vaex v4. Do you agree?

@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch from dbc8a7b to 0f89f54 Compare July 16, 2020 06:43
@@ -134,6 +134,7 @@ def get_main_executor():
_doc_snippets['chunk_size_export'] = 'Number of rows to be written to disk in a single iteration'
_doc_snippets['evaluate_parallel'] = 'Evaluate the (virtual) columns in parallel'
_doc_snippets['array_type'] = 'Type of output array, possible values are None/"numpy" (ndarray), "xarray" for a xarray.DataArray, or "list" for a Python list'
_doc_snippets['ascii'] = 'Transform only ascii character (usually faster).'
Member

character -> characters

@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch from 0f89f54 to 62e8302 Compare July 16, 2020 08:32
@maartenbreddels maartenbreddels marked this pull request as ready for review July 16, 2020 08:36
@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch 9 times, most recently from 025988f to 2a85f7c Compare July 17, 2020 11:49
@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch 3 times, most recently from e24fc95 to 5ed5b5b Compare July 18, 2020 17:41
@maartenbreddels maartenbreddels force-pushed the refactor_use_arrow_compute branch from 5e50d73 to 394e70a Compare August 21, 2020 10:30
@maartenbreddels maartenbreddels merged commit 48531b5 into master Aug 21, 2020
@maartenbreddels
Member Author

Windows CI has become crazy slow btw; we'll have to trace back when/why that happened. It seems the conda env creation takes ages.

@maartenbreddels maartenbreddels deleted the refactor_use_arrow_compute branch August 21, 2020 14:05