Optimize noise functions #333
Conversation
```python
return (
    data[column_name]
    .astype(str)
    .apply(ocr_corrupt, corrupted_pr=token_noise_level, rng=rng)
)
```
Oof, nice catches -- I'm slightly embarrassed these loops snuck through!
Loops aren't inherently bad! In fact, there is probably a loop that is slightly faster than this `apply`. The previous loop was a bottleneck because it used Pandas indexing within each iteration. Eliminating that makes this fast enough that I didn't see a need to optimize further.
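To illustrate the distinction being made here, the following is a minimal sketch (not the project's actual code; `ocr_corrupt` is a hypothetical stand-in) contrasting a loop that does label-based Pandas indexing on every iteration with a single `.apply` over the Series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def ocr_corrupt(token: str, corrupted_pr: float, rng) -> str:
    # Hypothetical stand-in: replace one character to simulate OCR noise.
    if token and rng.random() < corrupted_pr:
        i = rng.integers(len(token))
        return token[:i] + "#" + token[i + 1:]
    return token

data = pd.DataFrame({"first_name": ["Alice", "Bob", "Carol"]})

# Slow pattern: Pandas label-based indexing inside every loop iteration.
out = data["first_name"].copy()
for idx in out.index:
    out.loc[idx] = ocr_corrupt(str(out.loc[idx]), 0.5, rng)

# Faster: one .apply call over the Series; keyword arguments are
# forwarded to the function for each element.
noised = data["first_name"].astype(str).apply(
    ocr_corrupt, corrupted_pr=0.5, rng=rng
)
```

The `.apply` still calls a Python function per element, so it is not fully vectorized; the win comes from avoiding repeated `.loc` lookups and assignments inside the loop body.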
```python
# As long as noise is relatively rare, it will be faster to randomly select cells to
# noise rather than generating a random draw for every item eligible
if isinstance(noise_level, float) and noise_level < 0.2:
```
How did you settle on 0.2?
I used `timeit` to compare the runtime of the two methods at different noise levels. I observed much faster (~20x) performance for this method with the default 1% noise level for a shard of national data. It was better, but not dramatically better, for large proportions such as 0.5. It crossed over at about 0.8 and was substantially slower than the existing method as the noise level approached 100%.
Because I did not test it in all cases, and was concerned about making things slower, I "conservatively" decided to only use this method under 20%. Of course, in practice that will be nearly all the time; more noise than that should occur only when people are intentionally pushing things to their breaking point.
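The two methods being compared can be sketched as follows (a simplified illustration, not the library's actual implementation): drawing one uniform random number per eligible item versus drawing only how many cells to noise and then sampling that many positions directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000          # number of eligible items
noise_level = 0.01   # default 1% noise

# Method 1: one uniform draw for every eligible item.
mask = rng.random(n) < noise_level

# Method 2: draw the count of noised cells, then sample that many
# distinct positions. At low noise levels this generates far fewer
# random numbers; at high noise levels it loses its advantage.
n_to_noise = rng.binomial(n, noise_level)
chosen = rng.choice(n, size=n_to_noise, replace=False)
```

Both produce a set of roughly `n * noise_level` cells to noise with the correct marginal probability per cell; the crossover behavior described above comes from the cost of sampling without replacement growing as the selected fraction approaches 1.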
```python
# We only need to do this once, because noise does not introduce missingness,
# except for the leave_blank kind which is special-cased below
missingness = (dataset_data == "") | (dataset_data.isna())
```
So the speedup is calculating missingness once for the entire dataframe instead of ad hoc as needed per column? It really sped things up that much?
The old way was calculating it for each noise type for each column. Even with the profiling I did, it's a bit hard to directly measure the impact of this change since it changes the structure of things, but the total time spent in the `isna` method was cut in half.
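A small sketch of the caching pattern under discussion (illustrative only; the column names are made up): the missingness mask is computed once for the whole dataframe and then reused when each column is noised, instead of recomputing `isna` per noise type per column.

```python
import pandas as pd

dataset_data = pd.DataFrame({
    "first_name": ["Alice", "", None],
    "zipcode": ["90210", "12345", ""],
})

# Computed once up front: empty strings and NaN/None both count as missing.
missingness = (dataset_data == "") | (dataset_data.isna())

# Reused later for each column: only non-missing cells are eligible for noise.
eligible = dataset_data.loc[~missingness["first_name"], "first_name"]
```

This is safe only because (as the comment in the diff notes) noising never introduces new missingness, except for the leave-blank noise type, which must be special-cased.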
Great stuff here
Description
Speeds up the noising process by taking advantage of vectorization, avoiding unnecessary random number generation, and caching expensive information.
This is a large PR but I have broken it down into pretty atomic commits. You may want to review one at a time.
Testing
`pytest --runslow`

To get a pretty representative benchmark of pseudopeople performance, I noised the first two shards of each dataset in the full-USA data. On `main` this took about 15 minutes.

I profiled this benchmark with cProfile, and drilled down to what was taking the time in the noising process (within `noise_dataset`):

[profiler output image]

I then tried to address all the hotspots I saw in that benchmark. I didn't optimize all the noise functions, just the ones that seemed to be easily optimizable.
On this branch, the benchmark takes 7.5 minutes to run. Since a big chunk of that time is just loading the data, the noising itself sped up by significantly more than 50%. Here is what the same profile looks like after these changes:

[profiler output image]
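For reference, a minimal way to drive this kind of profiling with the standard library (the `noise_dataset` below is a hypothetical stand-in for the real entry point, which takes the dataset and configuration):

```python
import cProfile
import io
import pstats

def noise_dataset():
    # Hypothetical stand-in for pseudopeople's noising entry point.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
noise_dataset()
profiler.disable()

# Sort by cumulative time to surface the hotspots worth optimizing.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time is what makes it possible to "drill down" as described above: the top entries show which call trees dominate the total runtime.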