Title: Optimize make_typos
Description
The `make_typos` noising function is notoriously slow. This PR makes it run 3-4x faster. There is almost certainly more optimization that can be done; I'm all ears if you've got any ideas!
Rajan and I messed around for a bit with speeding up `pd.Series.str::isin()` but couldn't find any appreciable gains. I poked around online a bit for an alternative to `isin`, but the stuff I found either didn't work well or seemed too complicated.
I previously tried implementing `numba` with no luck (refer to the Jira ticket for details). I might try to implement numba now that the function is much simpler with far fewer loops. Open to ideas here as well.
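For anyone who wants to experiment with `isin` alternatives, here is a minimal sketch of the kind of comparison we were doing. It is not the code in `make_typos`: the `tokens` Series and the `eligible` set are hypothetical stand-ins, and the set-lookup variant is just one candidate to time against `isin`, not something shown to be faster.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data; the real make_typos inputs are not shown here.
tokens = pd.Series(list("abcdefghij") * 1000)
eligible = {"a", "e", "i"}  # hypothetical set of characters eligible for typos

# Baseline: pandas' built-in membership test.
mask_isin = tokens.isin(eligible)

# Candidate alternative: a Python-level set lookup applied per element.
mask_map = tokens.map(eligible.__contains__)

# Any alternative has to produce the same mask before its timing matters.
assert mask_isin.tolist() == mask_map.tolist()

# Time both on the same data; results will vary by machine and pandas version.
t_isin = timeit.timeit(lambda: tokens.isin(eligible), number=20)
t_map = timeit.timeit(lambda: tokens.map(eligible.__contains__), number=20)
print(f"isin: {t_isin:.4f}s  map: {t_map:.4f}s")
```

Swapping in real column data and the real lookup set would give a more honest picture than this toy input.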
Testing
Pytests pass.