Optimize noise functions #333

Merged: 13 commits merged into main from optimize_noise_functions on Oct 25, 2023

Conversation

zmbc (Collaborator) commented Oct 19, 2023

Optimize noise functions

Description

  • Category: performance
  • JIRA issue: none

Speeds up the noising process by taking advantage of vectorization, avoiding unnecessary random number generation, and caching expensive information.

This is a large PR, but I have broken it down into fairly atomic commits. You may want to review them one at a time.

Testing

  • All tests pass (pytest --runslow)

To get a pretty representative benchmark of pseudopeople performance, I noised the first two shards of each dataset in the full-USA data. On main this took about 15 minutes.

I profiled this benchmark with cProfile, and drilled down to what was taking the time in the noising process (within noise_dataset):

[Screenshot: cProfile output on main, showing where time is spent within noise_dataset]
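
For reference, here is a minimal sketch of how a benchmark like this can be profiled with cProfile; the benchmark function below is a placeholder, not the actual script used for these numbers:

import cProfile
import pstats

def run_benchmark():
    # Placeholder workload: the real benchmark noised the first two shards
    # of each dataset in the full-USA data.
    sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
run_benchmark()
profiler.disable()

# Sort by cumulative time to drill down into the hotspots (e.g. within noise_dataset)
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)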

I then tried to address all the hotspots I saw in that benchmark. I didn't optimize all the noise functions, just the ones that seemed to be easily optimizable.

On this branch, the benchmark takes 7.5 minutes to run. Since a big chunk of the time is just loading the data, that means noising sped up by significantly more than 50%. Here is what the same profile looks like after these changes:

[Screenshot: cProfile output on this branch, after the optimizations]

    return (
        data[column_name]
        .astype(str)
        .apply(ocr_corrupt, corrupted_pr=token_noise_level, rng=rng)
    )

Contributor:
Oof, nice catches - I'm slightly embarrassed these loops snuck through!

zmbc (Collaborator, Author):

Loops aren't inherently bad! In fact there is probably a loop that is slightly faster than this apply. The previous loop was a bottleneck because it used Pandas indexing within each iteration. Eliminating that makes this fast enough that I didn't see a need to optimize further.
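
To illustrate the point, here is a rough sketch with made-up names (not the actual pseudopeople code): the slow pattern does label-based pandas indexing on every iteration, while the apply-based version avoids per-row indexing entirely.

import numpy as np
import pandas as pd

def corrupt_token(value, rng):
    # Stand-in for a per-cell corruption function like ocr_corrupt
    return value[::-1] if rng.random() < 0.1 else value

rng = np.random.default_rng(0)
column = pd.Series(["alpha", "beta", "gamma"] * 10_000)

# Slow: .loc indexing inside a Python loop, once per row
result_slow = column.copy()
for idx in column.index:
    result_slow.loc[idx] = corrupt_token(column.loc[idx], rng)

# Much faster: a single apply, with no per-row pandas indexing
result_fast = column.astype(str).apply(corrupt_token, rng=rng)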


# As long as noise is relatively rare, it will be faster to randomly select cells to
# noise rather than generating a random draw for every item eligible
if isinstance(noise_level, float) and noise_level < 0.2:

Collaborator:

How did you settle on 0.2?

zmbc (Collaborator, Author):

I used timeit to compare the runtime of the two methods at different noise levels. I observed much faster (~20x) performance for this method with the default 1% noise level for a shard of national data. It was better but not dramatically better for large proportions, such as 0.5. It crossed over at about 0.8 and was substantially slower than the existing method as the noise level approached 100%.

Because I did not test it in all cases, and was concerned about making things slower, I "conservatively" decided to only use this method under 20%. Of course in practice that will be nearly all the time; more noise than that should occur only when people are intentionally pushing things to their breaking point.
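
A loose sketch of that kind of timeit comparison (the two functions are illustrative stand-ins, not the actual pseudopeople internals):

import timeit
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1_000_000

def draw_per_row(noise_level):
    # Old approach: one uniform draw for every eligible row
    return rng.random(n_rows) < noise_level

def select_noised_rows(noise_level):
    # New approach: draw how many rows to noise, then sample that many indices
    n_noised = rng.binomial(n_rows, noise_level)
    return rng.choice(n_rows, size=n_noised, replace=False)

for level in (0.01, 0.2, 0.5, 0.8):
    t_old = timeit.timeit(lambda: draw_per_row(level), number=20)
    t_new = timeit.timeit(lambda: select_noised_rows(level), number=20)
    print(f"noise_level={level}: per-row draws {t_old:.2f}s, index selection {t_new:.2f}s")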


# We only need to do this once, because noise does not introduce missingness,
# except for the leave_blank kind which is special-cased below
missingness = (dataset_data == "") | (dataset_data.isna())

Contributor:

So the speedup is calculating missingness once for the entire dataframe instead of ad hoc as needed per column? It really sped things up that much?

zmbc (Collaborator, Author) commented Oct 24, 2023:

The old way calculated it for each noise type for each column. Even with the profiling I did, it's a bit hard to directly measure the impact of this change since it restructures things, but the total time spent in the isna method was cut in half.
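
Roughly the pattern involved (a simplified sketch; the column loop here is illustrative, not the actual noise_dataset implementation):

import pandas as pd

dataset_data = pd.DataFrame(
    {"first_name": ["Ada", "", None], "zipcode": ["98101", "98109", ""]}
)

# Computed once up front for the whole dataframe, instead of per noise type per column
missingness = (dataset_data == "") | dataset_data.isna()

for column_name in dataset_data.columns:
    eligible_index = dataset_data.index[~missingness[column_name]]
    # ...each noise type for this column now reuses the cached missingness...
    print(column_name, list(eligible_index))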

stevebachmeier (Contributor) left a comment:

Great stuff here

@zmbc requested review from pletale and a team as code owners on October 24, 2023 23:44
@zmbc force-pushed the optimize_noise_functions branch from 36ca537 to ec34445 on October 24, 2023 23:50
@zmbc force-pushed the optimize_noise_functions branch from ec34445 to abfd94f on October 24, 2023 23:51
@rmudambi merged commit 909f4a3 into main on Oct 25, 2023
@rmudambi deleted the optimize_noise_functions branch on October 25, 2023 00:02