Avoid df/str copies and unneeded iterables #55
base: main
Conversation
Related: #9
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #55      +/-   ##
==========================================
+ Coverage   39.65%   49.63%   +9.97%
==========================================
  Files          45       48       +3
  Lines        3929     4745     +816
  Branches      665      740      +75
==========================================
+ Hits         1558     2355     +797
+ Misses       2344     2340       -4
- Partials       27       50      +23
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
The latest commit is a small change to
My apologies for this issue. With the most recent commit, the linter now returns 10/10 for all three files changed in this PR.
@alexandermorgan Thanks for submitting this PR, and apologies for not reviewing it sooner - I didn't see it until late last week.
Do you have benchmarks on the runtime before and after these changes?
I did some quick benchmarks using pytest-benchmark and one of the tests (test_cli.test_cli.test_unpack_turbines_happy). I used the existing test dataset (tests/data/bespoke-supply-curve.csv), which is pretty small (10 rows), and only saw a very small speedup (25-30 ms). I also created a larger CSV (1000 rows), and in that case the updated code is a bit slower (but the differences are in the noise). Regardless, these were both pretty small datasets and I didn't test the added parallelization, so I was curious if you'd done more thorough benchmarking.
Overall, the actual code changes look great. I did have one problem (see the in-line comments on bespoke.py) that needs your attention before we can merge.
Thanks again!
reView/utils/bespoke.py
rdf["latitude"] = lats | ||
del rdf["x"] | ||
del rdf["y"] | ||
rdf[['longitude', 'latitude']] = transformer.transform(xs, ys) |
This line raises the following error for me: ValueError: Columns must be same length as key.
I'm not quite sure why this didn't cause the tests to fail. I've ensured that my versions of pandas and pyproj match those used in the GitHub Actions workflows, and this still fails.
Maybe we can just revert to:
lons, lats = transformer.transform(xs, ys)
rdf["longitude"] = lons
rdf["latitude"] = lats
I'd be surprised if this has any substantial effect on run time.
It looks like there were two issues here. Referencing the columns as you did resolves one, and the other was the ValueError that you found. That was because I was counting on the following line, rdf = rdf[self.df.columns], to remove the 'x' and 'y' columns. But that was a mistake and was causing the "Columns must be same length as key" issue. In the end, I also replaced the two del rdf['x'] (and 'y') calls with assignment of the values in your way, but to the 'x' and 'y' columns, then renaming those columns (in place) to "longitude" and "latitude". It's a small difference, but I saw it amounted to about a 1.5% speedup.
Thank you for the extra tests you did! I'm not sure why those passed earlier, but coming back to it now I was able to reproduce the same error you mentioned. Stepping back into the bespoke.py file, I noticed a way to refactor unpack_turbines to do a lot less individual copying and concatenating, and to assign values column-wise instead of element-wise. That was more significant, resulting in a ~10-12% decrease in runtime.
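A minimal sketch of the assign-then-rename approach described above (the rdf, xs, ys, and transformer names follow the snippets quoted in this thread; the exact code in bespoke.py may differ):

```python
# Write the transformed coordinates back into the existing 'x' and 'y' columns,
# then rename those columns in place. Assumes `transformer` is a pyproj
# Transformer and xs/ys hold the source coordinates.
rdf["x"], rdf["y"] = transformer.transform(xs, ys)
rdf.rename(columns={"x": "longitude", "y": "latitude"}, inplace=True)
```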
Thanks @alexandermorgan. I'll take another look, hopefully later this week. Will post my benchmarks on the latest code here as well, for posterity.
Benchmark:
Benchmark 2:
Thanks for the contributions @alexandermorgan. Based on the benchmarks I ran (posted in the comments above), the overall speed improvements are on the order of 10-20% for the unpack_turbines command.
@WilliamsTravis I've approved this PR. All tests are now passing, except for the one that was already failing on main.
I did not test the app manually, so you may want to do that before merging, but up to you.
Note that I do not know if this actually closes issue #9, which seems more focused on render performance than actual unpacking performance. Again, maybe that's something you could test manually in the app.
The most significant change here is in batch_unpack_from_supply_curve. I originally tried doing this:
sc_developable_df = sc_df[sc_df['capacity'] > 0].reset_index(drop=True)
That is fine, but the runtime is the same. A significant improvement comes from modifying sc_df in place (see the sketch after this item). This looks safe since the function is only called from cli.py, where the table is read in from a CSV file and only used to feed this batch_unpack_from_supply_curve function, so modifying it in place won't affect anything else. The larger the dataframe, the larger the share of this filter and reset_index runtime that is saved. A further considerable improvement could be gained by setting
pd.options.mode.copy_on_write = True
which will be the default in Pandas 3 anyway. But that is too large of a change to make without being more familiar with this repository, so this PR does not change that Pandas option.
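A minimal sketch of the two variants described in this item (sc_df and the 'capacity' column follow the snippet above; the exact drop logic in the PR may differ slightly):

```python
# Copying variant: builds a new, filtered dataframe plus a fresh index.
sc_developable_df = sc_df[sc_df["capacity"] > 0].reset_index(drop=True)

# In-place variant: drop the rows with no developable capacity from sc_df itself.
# This is only safe because cli.py reads the table from a CSV and feeds it solely
# into batch_unpack_from_supply_curve, so nothing else observes the mutation.
sc_df.drop(sc_df.index[sc_df["capacity"] <= 0], inplace=True)
sc_df.reset_index(drop=True, inplace=True)
```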
In to_wgs, the del calls are not needed since the series are already being filtered to the self.df column names anyway.
This change:
n_workers = min(cpu_count(), n_workers)
was recommended by your linter. And this call is also optionally routed through multiprocessing:
all_turbines_df['geometry'] = all_turbines_df.apply(
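As a rough sketch of what optionally routing a row-wise apply like that through a process pool can look like (the chunking helper, worker function, and geometry computation here are illustrative stand-ins, not the PR's actual code):

```python
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd


def _chunk_geometries(df_chunk):
    """Illustrative worker: compute a geometry-like value for each row of a chunk."""
    return df_chunk.apply(lambda row: (row["longitude"], row["latitude"]), axis=1)


def compute_geometries(all_turbines_df, n_workers=1):
    """Run the row-wise computation, optionally across several processes."""
    n_workers = min(cpu_count(), n_workers)
    if n_workers <= 1:
        return _chunk_geometries(all_turbines_df)
    chunks = np.array_split(all_turbines_df, n_workers)
    with Pool(n_workers) as pool:
        return pd.concat(pool.map(_chunk_geometries, chunks))
```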
In characterizations.py, this .keys() is removed because iterating over a dictionary already iterates over the keys, so dict.keys() is almost never needed in Python.
columns=[c for c in col_df.columns if c not in lkup.keys()]
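For illustration (lkup stands in for any dict, as in the quoted line): membership tests and iteration both work directly on the dictionary, so the .keys() view adds nothing.

```python
lkup = {"nlcd": "land_cover", "slope": "terrain_slope"}  # made-up mapping
cols = ["nlcd", "capacity", "slope"]

kept = [c for c in cols if c not in lkup]  # membership test on the dict itself
for key in lkup:                           # iteration yields the keys directly
    print(key)
```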
col_df.rename(
There is no need for the dictionary comprehension here, since we can operate directly on the columns with col_df.columns += "_area_sq_km".
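A small sketch of that direct-suffix pattern (the column names are made up):

```python
import pandas as pd

col_df = pd.DataFrame({"wetlands": [1.2], "forest": [3.4]})

# Instead of building a {old: new} mapping for col_df.rename(...), append the
# suffix to every column name via element-wise string concatenation on the Index.
col_df.columns += "_area_sq_km"
print(list(col_df.columns))  # ['wetlands_area_sq_km', 'forest_area_sq_km']
```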
This creates 3 lists, 3 sets, and 1 dict_keys object, but none of these are necessary:
characterization_cols = list(characterization_remapper.keys())
df_cols = supply_curve_df.columns.tolist()
cols_not_in_df = list(set(characterization_cols).difference(set(df_cols)))
Instead, iterate over the dictionary: no list is built at all if the check succeeds, and it fails as soon as it hits a missing key if it doesn't:
if any(key not in df.columns for key in characterization_remapper):
Though in the failure case, you do have to make the list for the error message.
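A minimal sketch of that check (variable names follow the description above; the raised error type and message are illustrative):

```python
# Short-circuits on the first missing column; builds the list of missing
# names only when it is needed for the error message.
if any(key not in supply_curve_df.columns for key in characterization_remapper):
    missing = [
        key for key in characterization_remapper
        if key not in supply_curve_df.columns
    ]
    raise KeyError(f"Columns missing from supply curve dataframe: {missing}")
```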
In functions.py, using df.rename without inplace=True currently copies the entire dataframe. And doing this inside a loop meant that several copies of the whole df could be made if multiple columns got renamed. So change
df = df.rename({col: ncol}, axis=1)
to keeping a list of column names going outside of the loop and only updating the column names, if needed, with df.columns = new_columns.
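A sketch of the collect-then-assign pattern, assuming the loop decides each new name one column at a time (the renaming rule here is a stand-in for whatever functions.py actually does):

```python
# Instead of df = df.rename({col: ncol}, axis=1) inside the loop (which copies
# the whole dataframe each time), collect the final names and assign them once.
new_columns = []
for col in df.columns:
    ncol = col.lower().replace(" ", "_")  # stand-in renaming rule
    new_columns.append(ncol)
df.columns = new_columns
```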
Finally, inserting an element at the zeroth position of a list forces every existing element to be shifted. So replace
values.insert(0, wkb)
with an iterable-unpacking version of building the values list: values = [wkb, *row.values]
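For illustration (wkb and row follow the names in the quoted lines; any row-like object with a .values attribute works):

```python
# insert(0, ...) shifts every existing element one slot to the right, on top of
# the O(n) copy of row.values into a list.
values = list(row.values)
values.insert(0, wkb)

# Unpacking builds the list in a single pass with wkb already at the front.
values = [wkb, *row.values]
```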