This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

🚀 String Super-Issue #121

Closed
xhochy opened this issue Jun 5, 2020 · 1 comment

xhochy commented Jun 5, 2020

We want string (UTF-8 encoded) operations inside pandas to be as fast as possible, which requires implementing many functions. The work will be split across many issues that are hard to track with the issue search alone, so we list them all here.

We try to add the functionality in three stages:

  1. Implement the functionality using plain Python operations. This runs at the same speed as pandas.StringDtype but already exposes the API to fletcher users, so we can add faster implementations bit by bit while already providing a fully usable library.
    a) Also ensure that we have benchmarks set up to compare the pandas/object implementation to ours.
  2. If the algorithm isn't too complicated, we try to write an efficient implementation with numba. This gives us a fast algorithm with less implementation overhead than adding it to Apache Arrow.
  3. For all methods, add an efficient implementation to Apache Arrow if there is none yet.
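
To make the difference between stage 1 and stage 2 concrete, here is a minimal sketch for `endswith`. It is not fletcher's actual code: the function names `endswith_naive`, `to_buffers` and `endswith_buffers` are hypothetical, and plain Python lists stand in for Arrow's offsets, data and validity buffers. The second function is written as a byte-wise loop over buffers, which is roughly the shape of code that numba could compile, but no numba is used here.

```python
from typing import List, Optional, Tuple


def endswith_naive(values: List[Optional[str]], pat: str) -> List[Optional[bool]]:
    """Stage 1 style: plain Python, one str object per element."""
    return [None if v is None else v.endswith(pat) for v in values]


def to_buffers(values: List[Optional[str]]) -> Tuple[List[int], bytes, List[bool]]:
    """Pack strings into an Arrow-like layout (assumed for illustration):
    offsets into one contiguous UTF-8 buffer plus a validity list."""
    offsets = [0]
    data = bytearray()
    valid = []
    for v in values:
        if v is not None:
            data += v.encode("utf-8")
        valid.append(v is not None)
        offsets.append(len(data))
    return offsets, bytes(data), valid


def endswith_buffers(
    offsets: List[int], data: bytes, valid: List[bool], pat: str
) -> List[Optional[bool]]:
    """Stage 2 shape: compare raw UTF-8 bytes, no per-element str objects."""
    needle = pat.encode("utf-8")
    n = len(needle)
    out: List[Optional[bool]] = []
    for i in range(len(valid)):
        if not valid[i]:
            out.append(None)  # nulls propagate
            continue
        start, end = offsets[i], offsets[i + 1]
        if end - start < n:
            out.append(False)  # string is shorter than the pattern
            continue
        match = True
        for j in range(n):
            if data[end - n + j] != needle[j]:
                match = False
                break
        out.append(match)
    return out


values = ["fletcher", None, "arrow", "étagère"]
offsets, data, valid = to_buffers(values)
print(endswith_naive(values, "ère"))
print(endswith_buffers(offsets, data, valid, "ère"))
```

Both calls return the same result for the same input. Comparing bytes instead of characters is safe for a suffix check because both the pattern and the strings are valid UTF-8. A benchmark in the sense of 1a) would time both variants on the same data, for example with `timeit`.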
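
| function | meta issue | naïve implementation | numba implementation | pyarrow implementation |
| --- | --- | --- | --- | --- |
| capitalize | #124 | #200 | | |
| casefold | #125 | #200 | | |
| cat | #126 | #200 | | |
| center | #122 | #200 | | |
| contains (exact match) | #123 | #140 | #141 | ARROW-9160 / #151 |
| contains (other) | #123 | #200 | | |
| count | #127 | #200 | | |
| decode | | | | |
| encode | | | | |
| endswith | #130 | - | #131 | |
| extract | #137 | #200 | | |
| extractall | | #200 | | |
| find | | #200 | | |
| findall | | #200 | | |
| get | | #200 | | |
| index | | #200 | | |
| join | | | | |
| len | | #200 | | |
| ljust | | #200 | | |
| lower | #135 | #200 | | ARROW-9133 |
| lstrip | | #200 | | |
| match | | #200 | | |
| normalize | | #200 | | |
| pad | | #200 | | |
| partition | | #200 | | |
| repeat | | #200 | | |
| replace | #133 | #200 | | |
| rfind | | #200 | | |
| rindex | | #200 | | |
| rjust | | #200 | | |
| rpartition | | #200 | | |
| rstrip | | #200 | | |
| slice | #114 | #200 | | |
| slice_replace | | #200 | | |
| split | | #200 | | |
| rsplit | | #200 | | |
| startswith | #132 | - | #131 | |
| strip | #136 | | #160 | |
| swapcase | | #200 | | |
| title | | #200 | | |
| translate | | #200 | | |
| upper | | #200 | | ARROW-9133 |
| wrap | | #200 | | |
| zfill | #134 | #139 | | |
| isalnum | | #200 | | ARROW-9268 / #203 |
| isalpha | | #200 | | ARROW-9268 / #203 |
| isdigit | | #200 | | ARROW-9268 / #203 |
| isspace | | #200 | | ARROW-9268 / #203 |
| islower | | #200 | | ARROW-9268 (apache/arrow#7656) / #203 |
| isupper | | #200 | | ARROW-9268 / #203 |
| istitle | | #200 | | ARROW-9268 / #203 |
| isnumeric | | #200 | | ARROW-9268 / #203 |
| isdecimal | | #200 | | ARROW-9268 / #203 |
| get_dummies | | #200 | | |
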

xhochy commented Feb 22, 2023

This project has been archived, as development ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.

xhochy closed this as completed Feb 22, 2023