Optimize noise functions #333
Conversation
```python
return (
    data[column_name]
    .astype(str)
    .apply(ocr_corrupt, corrupted_pr=token_noise_level, rng=rng)
)
```
Oof, nice catches -- I'm slightly embarrassed these loops snuck through!
Loops aren't inherently bad! In fact, there is probably a loop that is slightly faster than this `apply`. The previous loop was a bottleneck because it used Pandas indexing within each iteration. Eliminating that makes this fast enough that I didn't see a need to optimize further.
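To illustrate the distinction being made here, the following is a minimal sketch (not the project's actual code; `ocr_corrupt` is a hypothetical stand-in) contrasting a loop that does label-based Pandas indexing on every iteration with a single `.apply` over the Series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def ocr_corrupt(token: str, corrupted_pr: float, rng) -> str:
    # Hypothetical stand-in: replace one character to simulate OCR noise.
    if token and rng.random() < corrupted_pr:
        i = rng.integers(len(token))
        return token[:i] + "#" + token[i + 1:]
    return token

data = pd.DataFrame({"first_name": ["Alice", "Bob", "Carol"]})

# Slow pattern: Pandas label-based indexing inside every loop iteration.
out = data["first_name"].copy()
for idx in out.index:
    out.loc[idx] = ocr_corrupt(str(out.loc[idx]), 0.5, rng)

# Faster: one .apply call over the Series; keyword arguments are
# forwarded to the function for each element.
noised = data["first_name"].astype(str).apply(
    ocr_corrupt, corrupted_pr=0.5, rng=rng
)
```

The `.apply` still calls a Python function per element, so it is not fully vectorized; the win comes from avoiding repeated `.loc` lookups and assignments inside the loop body.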
```python
# As long as noise is relatively rare, it will be faster to randomly select cells to
# noise rather than generating a random draw for every item eligible
if isinstance(noise_level, float) and noise_level < 0.2:
```
How did you settle on 0.2?
I used `timeit` to compare the runtime of the two methods at different noise levels. I observed much faster (~20x) performance for this method with the default 1% noise level for a shard of national data. It was better, but not dramatically better, for large proportions such as 0.5. It crossed over at about 0.8 and was substantially slower than the existing method as the noise level approached 100%.
Because I did not test it in all cases, and was concerned about making things slower, I "conservatively" decided to only use this method under 20%. Of course, in practice that will be nearly all the time; more noise than that should occur only when people are intentionally pushing things to their breaking point.
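The two methods being compared can be sketched as follows (a simplified illustration, not the library's actual implementation): drawing one uniform random number per eligible item versus drawing only how many cells to noise and then sampling that many positions directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000          # number of eligible items
noise_level = 0.01   # default 1% noise

# Method 1: one uniform draw for every eligible item.
mask = rng.random(n) < noise_level

# Method 2: draw the count of noised cells, then sample that many
# distinct positions. At low noise levels this generates far fewer
# random numbers; at high noise levels it loses its advantage.
n_to_noise = rng.binomial(n, noise_level)
chosen = rng.choice(n, size=n_to_noise, replace=False)
```

Both produce a set of roughly `n * noise_level` cells to noise with the correct marginal probability per cell; the crossover behavior described above comes from the cost of sampling without replacement growing as the selected fraction approaches 1.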
```python
# We only need to do this once, because noise does not introduce missingness,
# except for the leave_blank kind which is special-cased below
missingness = (dataset_data == "") | (dataset_data.isna())
```
So the speedup is calculating missingness once for the entire dataframe instead of ad hoc as needed per column? It really sped things up that much?
The old way was calculating it for each noise type for each column. Even with the profiling I did, it's a bit hard to directly measure the impact of this change since it changes the structure of things, but the total time spent in the `isna` method was cut in half.
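A small sketch of the caching pattern under discussion (illustrative only; the column names are made up): the missingness mask is computed once for the whole dataframe and then reused when each column is noised, instead of recomputing `isna` per noise type per column.

```python
import pandas as pd

dataset_data = pd.DataFrame({
    "first_name": ["Alice", "", None],
    "zipcode": ["90210", "12345", ""],
})

# Computed once up front: empty strings and NaN/None both count as missing.
missingness = (dataset_data == "") | (dataset_data.isna())

# Reused later for each column: only non-missing cells are eligible for noise.
eligible = dataset_data.loc[~missingness["first_name"], "first_name"]
```

This is safe only because (as the comment in the diff notes) noising never introduces new missingness, except for the leave-blank noise type, which must be special-cased.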
Great stuff here
Description
Speeds up the noising process by taking advantage of vectorization, avoiding unnecessary random number generation, and caching expensive information.
This is a large PR but I have broken it down into pretty atomic commits. You may want to review one at a time.
Testing
`pytest --runslow`

To get a pretty representative benchmark of pseudopeople performance, I noised the first two shards of each dataset in the full-USA data. On `main` this took about 15 minutes.

I profiled this benchmark with cProfile, and drilled down to what was taking the time in the noising process (within `noise_dataset`):

[profiler output image]

I then tried to address all the hotspots I saw in that benchmark. I didn't optimize all the noise functions, just the ones that seemed to be easily optimizable.
On this branch, the benchmark takes 7.5 minutes to run. Since a big chunk of that time is just loading the data, the noising itself sped up by significantly more than 50%. Here is what the same profile looks like after these changes:

[profiler output image]
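For reference, a minimal way to drive this kind of profiling with the standard library (the `noise_dataset` below is a hypothetical stand-in for the real entry point, which takes the dataset and configuration):

```python
import cProfile
import io
import pstats

def noise_dataset():
    # Hypothetical stand-in for pseudopeople's noising entry point.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
noise_dataset()
profiler.disable()

# Sort by cumulative time to surface the hotspots worth optimizing.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time is what makes it possible to "drill down" as described above: the top entries show which call trees dominate the total runtime.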