🚀 String Super-Issue #121

xhochy · 2020-06-05T15:00:26Z

We want string (UTF-8 encoded) operations inside pandas to be as fast as possible. This requires implementing a number of methods, and the work will be split across many issues that are hard to track with the issue search alone, so they are all listed here.

We try to add the functionality in three stages:

1. Implement the functionality using plain Python operations. This will run at the same speed as pandas.StringDtype but already provides the API to fletcher users, so we can add faster implementations bit by bit while already shipping a fully usable library.
   a) Also ensure that we have benchmarks set up to compare the pandas/object implementation with ours.
2. If the algorithm isn't too complicated, make an efficient implementation with numba. This gives us a fast algorithm with less implementation overhead than adding it to Apache Arrow.
3. For all methods, add an efficient implementation to Apache Arrow if there is none yet.
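
To make the difference between stage 1 and stage 2 concrete, here is a rough sketch for `endswith`. It is not the code from any of the linked issues. The helper `to_arrow_layout` and both function names are made up for illustration, and the lists and bytes objects only stand in for the offsets, data and validity buffers of an Arrow string array. The second function is written as a plain index loop of the kind that numba could presumably compile, but numba itself is not used here, to keep the example dependency-free.

```python
from typing import List, Optional, Tuple


def to_arrow_layout(
    values: List[Optional[str]],
) -> Tuple[List[int], bytes, List[bool]]:
    """Pack strings into a simplified Arrow-like layout (hypothetical helper).

    Returns (offsets, data, valid): element i occupies
    data[offsets[i]:offsets[i + 1]] and valid[i] is False for missing values.
    """
    offsets = [0]
    chunks = []
    valid = []
    for value in values:
        encoded = b"" if value is None else value.encode("utf-8")
        chunks.append(encoded)
        offsets.append(offsets[-1] + len(encoded))
        valid.append(value is not None)
    return offsets, b"".join(chunks), valid


def endswith_naive(values: List[Optional[str]], pat: str) -> List[Optional[bool]]:
    """Stage 1 style: plain Python, one str object per element."""
    return [None if v is None else v.endswith(pat) for v in values]


def endswith_buffers(
    offsets: List[int], data: bytes, valid: List[bool], pat: str
) -> List[Optional[bool]]:
    """Stage 2 style: byte-wise loop over the buffers, no per-element str."""
    needle = pat.encode("utf-8")
    n = len(needle)
    out: List[Optional[bool]] = []
    for i in range(len(valid)):
        if not valid[i]:
            out.append(None)  # propagate missing values
            continue
        start, end = offsets[i], offsets[i + 1]
        if end - start < n:
            out.append(False)  # element is shorter than the pattern
            continue
        match = True
        for j in range(n):
            if data[end - n + j] != needle[j]:
                match = False
                break
        out.append(match)
    return out


values = ["apple", "pineapple", None, "naïve", ""]
offsets, data, valid = to_arrow_layout(values)
print(endswith_naive(values, "ple"))
print(endswith_buffers(offsets, data, valid, "ple"))
# both print [True, True, None, False, False]
```

Comparing raw bytes is enough for a suffix test because the data is valid UTF-8: a byte-level match of an encoded pattern always starts on a character boundary. Having both variants side by side is also what the benchmarks in 1a) would compare, for example with `timeit`.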

function	meta issue	naïve implementation	`numba` implementation	`pyarrow` implementation
`capitalize`	#124	#200	…	…
`casefold`	#125	#200	…	…
`cat`	#126	#200	…	…
`center`	#122	#200	…	…
`contains` (exact match) ✅	#123	#140	#141	ARROW-9160 / #151
`contains` (other)	#123	#200	…	…
`count`	#127	#200	…	…
`decode`	…	…	…
`encode`	…	…	…
`endswith`	#130	-	#131	…
`extract`	#137	#200	…	…
`extractall`	…	#200	…
`find`	…	#200	…
`findall`	…	#200	…
`get`	…	#200	…
`index`	…	#200	…
`join`	…	…	…
`len`	…	#200	…
`ljust`	…	#200	…
`lower`	#135	#200	…	ARROW-9133
`lstrip`	…	#200	…
`match`	…	#200	…
`normalize`	…	#200	…
`pad`	…	#200	…
`partition`	…	#200	…
`repeat`	…	#200	…
`replace`	#133	#200	…	…
`rfind`	…	#200	…	…
`rindex`	…	#200	…	…
`rjust`	…	#200	…	…
`rpartition`	…	#200	…	…
`rstrip`	…	#200	…	…
`slice`	#114	#200	…	…
`slice_replace`	…	#200	…	…
`split`	…	#200	…	…
`rsplit`	…	#200	…	…
`startswith`	#132	-	#131	…
`strip`	#136	-	#160	…
`swapcase`	…	#200	…	…
`title`	…	#200	…	…
`translate`	…	#200	…	…
`upper`	…	#200	…	ARROW-9133
`wrap`	…	#200	…	…
`zfill`	#134	#139	…	…
`isalnum` ✅	…	#200	…	ARROW-9268 / #203
`isalpha` ✅	…	#200	…	ARROW-9268 / #203
`isdigit` ✅	…	#200	…	ARROW-9268 / #203
`isspace` ✅	…	#200	…	ARROW-9268 / #203
`islower` ✅	…	#200	…	ARROW-9268 (apache/arrow#7656) / #203
`isupper` ✅	…	#200	…	ARROW-9268 / #203
`istitle` ✅	…	#200	…	ARROW-9268 / #203
`isnumeric` ✅	…	#200	…	ARROW-9268 / #203
`isdecimal` ✅	…	#200	…	ARROW-9268 / #203
`get_dummies`	…	#200	…	…

xhochy · 2023-02-22T15:15:02Z

This project has been archived, as development ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.

xhochy pinned this issue Jul 7, 2020

xhochy added hackathon-2020-07 usecase-202003-qc labels Jul 8, 2020

xhochy mentioned this issue Jul 8, 2020

Plan for a native string dtype pandas-dev/pandas#35169

Closed

asfimport mentioned this issue Oct 14, 2021

[C++] String algorithm library for StringArray/BinaryArray apache/arrow#16192

Open

xhochy closed this as completed Feb 22, 2023
