Integrate part of spot check - simulated and historical data prep and plot by state #129

jenhagg · 2020-06-30T21:04:45Z

Purpose - Move logic from the spot check notebook into the code base.
What it does - Add functions to prepare data for analysis and one of the plots, and refactored code where possible for readability and/or using built in pandas methods (hopefully idiomatically). There are some unit tests for the data manipulation logic but could probably add more.

The main additions, in terms of what I'd expect to be used by the notebook, are the following:

# simulated generation based on a scenario
sum_generation_by_state(scenario_info):

# historical generation, based on a csv or excel file loaded by the user into hist_gen_raw
summarize_hist_gen(hist_gen_raw, all_resources):

# plot given state
plot_generation_sim_vs_hist(sim_gen, hist_gen, s_info, state, showmax=True):

Time to review - 30 mins (rough estimate). For testing I compared results of the original notebook with these functions, to spot check (pun intended) that we get the same results numerically, visually, etc. The unit tests cover some basic assumptions as well but I think it's a good point to get user feedback, or try it on other scenarios before adding a bunch more test coverage. I'm pretty sure the docstrings can be improved but wasn't sure what best practices are so just left for code review.

Questions - Would it be helpful to pair review? I can demo my notebook used for testing if that's useful. And how do we update the notebook script once we want to start using these?

postreise/analyze/generation/summarize.py

BainanXia · 2020-07-06T18:25:33Z

postreise/analyze/generation/summarize.py

+
+
+def get_generation(s_info):
+    def get_value_at(state, resource):


Not sure whether this is necessary here, given it simply calls the get_generation function via an instance of ScenarioInfo class, which can be put into a one-liner.

BainanXia · 2020-07-06T19:16:50Z

postreise/analyze/generation/summarize.py

+    to each (state, resource) pair to populate the resulting dataframe. This
+    function is meant to contain the core logic of summarize_by_state and is
+    parameterized to simplify testing.
+    """


I understand the reason of breaking this up into parts is for simplicity of testing. However, maybe it is worth creating a Mock ScenarioInfo object or improve the Mock Scenario object with time series attributes such that it can be loaded into a ScenarioInfo object, since ScenarioInfo is widely used in the spot check integration script, which will make the tests in the future much easier once we have a mock for that.

I was pondering the same thing. Think you're right about simplifying future code/testing, so I'll look into setting up a mock implementation. Leaning towards mocking ScenarioInfo itself, so the tests for spot check logic are totally independent of ScenarioInfo implementation, which should be tested separately against mock Scenario or State instances. At least, that's my current frame of mind, we'll see how it goes :)

BainanXia · 2020-07-06T19:30:10Z

postreise/analyze/generation/summarize.py

+    Use state as a dict key if index is a smaller region (e.g. Texas East),
+    otherwise use the given index.
+    """
+    for state in ("Texas", "New Mexico", "Montana"):


Thanks for handling this ugly setup :)

Is this tuple 'states that are split across interconnections'? If so, do you think this would fit better as a named global variable instead of hardcoded within this variable?

I'm not sure, but some examples from the raw data are "Montana Western", "Texas North Central", "Texas Far West" etc. Is that consistent with your description?

Yeah. The TAMU network mostly follows state lines when approximating the borders between interconnections (even though small bits of some plains states stretch over the lines), I believe Montana, Texas, and New Mexico are the only exceptions.

BainanXia · 2020-07-08T00:22:19Z

postreise/analyze/generation/summarize.py

+
+
+def summarize_hist_gen(hist_gen_raw, all_resources):
+    """


Once we start to finalize this PR, we should come up with better names of these intermediate functions together with other parameter documentations.

BainanXia · 2020-07-08T00:32:27Z

postreise/plot/plot_sim_vs_hist.py

+    width = 0.3  # width of bars
+    fontsize = 15
+    size_inches = (20, 10)
+


Organizing these plot parameters in a class is clean which is something in my mind for a while. However, setting up these parameters with hard coded constants is tricky given they might not fit with all the plots including the future ones we may have, specifically here, different bar plots may have adaptive bar width according the total number of bars within a plot. Let's think about a better way to handle general cases later.

Sounds good, I figured we'd need to do more here but left it as kind of a reminder.

BainanXia · 2020-07-08T00:36:01Z

postreise/plot/plot_sim_vs_hist.py

+    size_inches = (20, 10)
+
+
+def plot_generation_sim_vs_hist_all(sim_gen, hist_gen, s_info):


Not sure whether we need functions like this in the codebase. I think having a simple loop to call a function with parameters is acceptable in the user's script.

BainanXia · 2020-07-08T00:39:27Z

postreise/plot/plot_sim_vs_hist.py

+
+    def calc_max(gen_type):
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")


What kind of warnings we are trying to simplify here?

Ah, I see, it's from get_capacity in ScenarioInfo object. Basically, when we are trying to get capacities of a specific type of generation which does not exist in certain area, it returns 0 with warnings.

Ah, I didn't actually know that. We have suppression in the notebook, so I just switched it to use the context manager since that seemed right.

BainanXia · 2020-07-08T00:44:22Z

postreise/plot/plot_sim_vs_hist.py

+    ax.set_xticks(x)
+    ax.set_xticklabels(labels, fontsize=fontsize)
+    ax.legend(fontsize=fontsize)
+


When I was implementing these functions, I simply wrote down the code that appeared on the top of my head. Not sure whether there are cleaner ways to set up these parameters when prepare a plot. If it looks bad to you, free feel to refactor :)

Good to know :) I'm pretty amateur at matplotlib so played it safe for now, but I'll come back to it and see what we can do.

BainanXia · 2020-07-08T00:46:11Z

postreise/plot/plot_sim_vs_hist.py

+                ha="center",
+                va="bottom",
+                fontsize=8,
+            )


Same here, I simply found an example from StackoverFlow of labeling the bars in a bar chart and used it here.

BainanXia · 2020-07-24T19:08:02Z

postreise/analyze/generation/tests/test_summarize.py

+    interconnect = ["Western"]
+    s_info = MockScenarioInfo()
+    s_info.grid.interconnect = interconnect
+    monkeypatch.setattr(s_info, "get_available_resource", mock_resource)


I've heard of it but never used it before. I presume monkeypatch is defined in pytest.fixture, which makes it easy to mock function returns in a mock object.

BainanXia

I've gone through this again together with the previous PR on MockScenarioInfo, https://github.com/Breakthrough-Energy/PowerSimData/pull/227/files. I think the testing part is much cleaner with MockScenarioInfo. Thanks for the effort. I will approve it once our file system is ready and I could check it with some integration tests. In the meantime, it would be great if we can get a second eye on this given it is the first PR on moving stuff in spot check into code base, which sets an example of merging plot features into PostREISE. We will refactor AnalyzePG in the same fashion.

postreise/analyze/generation/summarize.py

postreise/plot/plot_sim_vs_hist.py

danielolsen · 2020-07-30T22:47:42Z

postreise/analyze/generation/summarize.py

+)
+
+
+colname_map = {


Could we correct the column names in the input CSV and avoid this dictionary?

Yeah that'd be cleaner, not sure how the csv columns are generated though, or if it's part of some process. Probably doesn't hurt to have this kind of mapping but could move it somewhere more intentional if there are other usages.

Good point, I don't remember where they come from.

I brought this up long time ago to @danlivengood who creates these input CSVs. At that time the argument was that "names on the left (keys in colname_map) make more sense for human reading, whereas names on the right (values in colname_map) make more sense for computer reading (what we defined in the grid DataFrame)."

danielolsen

Looks good. The spot check script quickly grew into a beast, I'm sorry that you got tasked with cleaning up our mess.

jenhagg · 2020-07-30T23:14:22Z

I've seen much worse of a mess, glad to do it (it's also informative). One thing I noticed though is the plot fails on Texas since we have "Texas" in the index for both state and interconnect. I'll look at a fix for that before merging.

BainanXia · 2020-07-30T23:40:27Z

One thing I noticed though is the plot fails on Texas since we have "Texas" in the index for both state and interconnect. I'll look at a fix for that before merging.

Good catch, previously it's fine since we were summarizing simulated results via ScenarioInfo and function area_to_loadzone took care of the confusion: Texas is treated as a state by default, unless the area_type is specified as 'interconnect'.

BainanXia

This is nice. Thanks.

jenhagg requested review from danielolsen and rouille June 30, 2020 21:04

jenhagg assigned BainanXia Jun 30, 2020

jenhagg added this to the Black Lives Matter milestone Jun 30, 2020

jenhagg linked an issue Jun 30, 2020 that may be closed by this pull request

Move plots and features in the spot check integration script into codebase #116

Closed

2 tasks

danielolsen requested a review from BainanXia June 30, 2020 21:44

danielolsen assigned jenhagg Jun 30, 2020

BainanXia reviewed Jul 6, 2020

View reviewed changes

postreise/analyze/generation/summarize.py Outdated Show resolved Hide resolved

BainanXia reviewed Jul 6, 2020

View reviewed changes

postreise/analyze/generation/summarize.py Show resolved Hide resolved

BainanXia reviewed Jul 6, 2020

View reviewed changes

BainanXia reviewed Jul 8, 2020

View reviewed changes

Lab-ITTeam unassigned BainanXia Jul 15, 2020

kasparm unassigned jenhagg Jul 15, 2020

Jon Hagg added 3 commits July 22, 2020 12:49

feat: spot check - simulated and historical data prep and plot by state

0d2eead

style: fix and sort imports

780bb3c

test: use in progress mock from powersimdata

fd26e69

jenhagg force-pushed the jon/spotcheck branch from f74c187 to e4124e4 Compare July 22, 2020 20:23

style: rename function and remove some boilerplate

a7a346d

jenhagg force-pushed the jon/spotcheck branch from e4124e4 to a7a346d Compare July 22, 2020 20:32

BainanXia self-requested a review July 24, 2020 17:15

BainanXia reviewed Jul 24, 2020

View reviewed changes