Avoid df/str copies and unneeded iterables #55
base: main
Conversation
Related: #9
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #55      +/-   ##
==========================================
+ Coverage   39.65%   49.63%   +9.97%
==========================================
  Files          45       48       +3
  Lines        3929     4745     +816
  Branches      665      740      +75
==========================================
+ Hits         1558     2355     +797
+ Misses       2344     2340       -4
- Partials       27       50      +23
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
The latest commit is a small change to
My apologies for this issue. With the most recent commit, the linter now returns 10/10 for all three files changed in this PR.
@alexandermorgan Thanks for submitting this PR, and apologies for not reviewing it sooner - I didn't see it until late last week.
Do you have benchmarks on the runtime before and after these changes?
I did some quick benchmarks using pytest-benchmark and one of the tests (test_cli.test_cli.test_unpack_turbines_happy). I used the existing test dataset (tests/data/bespoke-supply-curve.csv), which is pretty small (10 rows), and only saw a very small speedup (25-30 ms). I also created a larger CSV (1000 rows), and in that case the updated code is a bit slower (but the differences are in the noise). Regardless, these were both pretty small datasets and I didn't test the added parallelization, so I was curious if you'd done more thorough benchmarking.
Overall, the actual code changes look great. I did have one problem (see the in-line comments on bespoke.py) that needs your attention before we can merge.
Thanks again!
reView/utils/bespoke.py
rdf["latitude"] = lats | ||
del rdf["x"] | ||
del rdf["y"] | ||
rdf[['longitude', 'latitude']] = transformer.transform(xs, ys) |
This line raises the following error for me: ValueError: Columns must be same length as key.
I'm not quite sure why this didn't cause the tests to fail. I've ensured that my versions of pandas and pyproj match those used in the GitHub Actions workflows, and this still fails.
Maybe we can just revert to:
lons, lats = transformer.transform(xs, ys)
rdf["longitude"] = lons
rdf["latitude"] = lats
I'd be surprised if this has any substantial effect on run time.
It looks like there were two issues here. Referencing the columns as you did resolves one, and the other was the ValueError that you found. That was because I was counting on the following line, rdf = rdf[self.df.columns], to remove the 'x' and 'y' columns. But that was a mistake and was causing the "Columns must be same length as key" issue. In the end, I also replaced the two del rdf['x'] (and 'y') calls with assignment of the values in your way, but to the 'x' and 'y' columns, then renaming those columns (in place) to "longitude" and "latitude". It's a small difference, but I saw it amounted to about a 1.5% speedup.
Thank you for the extra tests you did! I'm not sure why those passed earlier, but coming back to it now I was able to reproduce the same error you mentioned. Stepping back into the bespoke.py file, I noticed a way to refactor unpack_turbines to do a lot less individual copying and concatenating, and to assign values column-wise instead of element-wise. That was more significant, resulting in a ~10-12% decrease in runtime.
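A minimal sketch of the assign-then-rename approach described above (the rdf, xs, ys, and transformer names follow the snippets quoted in this thread; the exact code in bespoke.py may differ):

```python
# Write the transformed coordinates back into the existing 'x' and 'y' columns,
# then rename those columns in place. Assumes `transformer` is a pyproj
# Transformer and xs/ys hold the source coordinates.
rdf["x"], rdf["y"] = transformer.transform(xs, ys)
rdf.rename(columns={"x": "longitude", "y": "latitude"}, inplace=True)
```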
Thanks @alexandermorgan. I'll take another look, hopefully later this week. Will post my benchmarks on the latest code here as well, for posterity.
Benchmark:
Benchmark 2:
Thanks for the contributions @alexandermorgan. Based on the benchmarks I ran (posted in the comments above), the overall speed improvements are on the order of 10-20% for the unpack_turbines command.
@WilliamsTravis I've approved this PR. All tests are now passing, except for the one that was already failing on main.
I did not test the app manually, so you may want to do that before merging, but up to you.
Note that I do not know if this actually closes issue #9, which seems more focused on render performance than actual unpacking performance. Again, maybe that's something you could test manually in the app.
The most significant change here is in batch_unpack_from_supply_curve. I originally tried doing this:
sc_developable_df = sc_df[sc_df['capacity'] > 0].reset_index(drop=True)
That is fine, but the runtime is the same. A significant improvement comes from modifying sc_df in place (see the sketch after this item). This looks safe since the function is only called from cli.py, where the table is read in from a CSV file and only used to feed this batch_unpack_from_supply_curve function, so modifying it in place won't affect anything else. The larger the dataframe, the larger the share of this filter and reset_index runtime that is saved. A further considerable improvement could be gained by setting
pd.options.mode.copy_on_write = True
which will be the default in Pandas 3 anyway. But that is too large of a change to make without being more familiar with this repository, so this PR does not change that Pandas option.
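A minimal sketch of the two variants described in this item (sc_df and the 'capacity' column follow the snippet above; the exact drop logic in the PR may differ slightly):

```python
# Copying variant: builds a new, filtered dataframe plus a fresh index.
sc_developable_df = sc_df[sc_df["capacity"] > 0].reset_index(drop=True)

# In-place variant: drop the rows with no developable capacity from sc_df itself.
# This is only safe because cli.py reads the table from a CSV and feeds it solely
# into batch_unpack_from_supply_curve, so nothing else observes the mutation.
sc_df.drop(sc_df.index[sc_df["capacity"] <= 0], inplace=True)
sc_df.reset_index(drop=True, inplace=True)
```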
In to_wgs, the del calls are not needed since the series are already being filtered to the self.df column names anyway.
This change:
n_workers = min(cpu_count(), n_workers)
was recommended by your linter. And this call is also optionally routed through multiprocessing:
all_turbines_df['geometry'] = all_turbines_df.apply(
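As a rough sketch of what optionally routing a row-wise apply like that through a process pool can look like (the chunking helper, worker function, and geometry computation here are illustrative stand-ins, not the PR's actual code):

```python
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd


def _chunk_geometries(df_chunk):
    """Illustrative worker: compute a geometry-like value for each row of a chunk."""
    return df_chunk.apply(lambda row: (row["longitude"], row["latitude"]), axis=1)


def compute_geometries(all_turbines_df, n_workers=1):
    """Run the row-wise computation, optionally across several processes."""
    n_workers = min(cpu_count(), n_workers)
    if n_workers <= 1:
        return _chunk_geometries(all_turbines_df)
    chunks = np.array_split(all_turbines_df, n_workers)
    with Pool(n_workers) as pool:
        return pd.concat(pool.map(_chunk_geometries, chunks))
```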
In characterizations.py, this .keys() is removed because iterating over a dictionary already iterates over the keys, so dict.keys() is almost never needed in Python.
columns=[c for c in col_df.columns if c not in lkup.keys()]
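For illustration (lkup stands in for any dict, as in the quoted line): membership tests and iteration both work directly on the dictionary, so the .keys() view adds nothing.

```python
lkup = {"nlcd": "land_cover", "slope": "terrain_slope"}  # made-up mapping
cols = ["nlcd", "capacity", "slope"]

kept = [c for c in cols if c not in lkup]  # membership test on the dict itself
for key in lkup:                           # iteration yields the keys directly
    print(key)
```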
col_df.rename(
There is no need for the dictionary comprehension here, since we can operate directly on the columns with col_df.columns += "_area_sq_km".
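A small sketch of that direct-suffix pattern (the column names are made up):

```python
import pandas as pd

col_df = pd.DataFrame({"wetlands": [1.2], "forest": [3.4]})

# Instead of building a {old: new} mapping for col_df.rename(...), append the
# suffix to every column name via element-wise string concatenation on the Index.
col_df.columns += "_area_sq_km"
print(list(col_df.columns))  # ['wetlands_area_sq_km', 'forest_area_sq_km']
```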
This creates 3 lists, 3 sets, and 1 dict_keys object, but none of these are necessary:
characterization_cols = list(characterization_remapper.keys())
df_cols = supply_curve_df.columns.tolist()
cols_not_in_df = list(set(characterization_cols).difference(set(df_cols)))
Instead, iterate over the dictionary: no list is built at all if the check succeeds, and it fails as soon as it hits a missing key if it doesn't:
if any(key not in df.columns for key in characterization_remapper):
Though in the failure case, you do have to make the list for the error message.
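A minimal sketch of that check (variable names follow the description above; the raised error type and message are illustrative):

```python
# Short-circuits on the first missing column; builds the list of missing
# names only when it is needed for the error message.
if any(key not in supply_curve_df.columns for key in characterization_remapper):
    missing = [
        key for key in characterization_remapper
        if key not in supply_curve_df.columns
    ]
    raise KeyError(f"Columns missing from supply curve dataframe: {missing}")
```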
In functions.py, using df.rename without inplace=True currently copies the entire dataframe. And doing this inside a loop meant that several copies of the whole df could be made if multiple columns got renamed. So change
df = df.rename({col: ncol}, axis=1)
to keeping a list of column names going outside of the loop and only updating the column names, if needed, with df.columns = new_columns.
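A sketch of the collect-then-assign pattern, assuming the loop decides each new name one column at a time (the renaming rule here is a stand-in for whatever functions.py actually does):

```python
# Instead of df = df.rename({col: ncol}, axis=1) inside the loop (which copies
# the whole dataframe each time), collect the final names and assign them once.
new_columns = []
for col in df.columns:
    ncol = col.lower().replace(" ", "_")  # stand-in renaming rule
    new_columns.append(ncol)
df.columns = new_columns
```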
Finally, inserting an element at the zeroth position of a list forces every existing element to be shifted. So replace
values.insert(0, wkb)
with an iterable-unpacking version of building the values list: values = [wkb, *row.values]
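For illustration (wkb and row follow the names in the quoted lines; any row-like object with a .values attribute works):

```python
# insert(0, ...) shifts every existing element one slot to the right, on top of
# the O(n) copy of row.values into a list.
values = list(row.values)
values.insert(0, wkb)

# Unpacking builds the list in a single pass with wkb already at the front.
values = [wkb, *row.values]
```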