Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate part of spot check - simulated and historical data prep and plot by state #129

Merged
merged 8 commits into from
Aug 5, 2020

Conversation

jenhagg
Copy link
Collaborator

@jenhagg jenhagg commented Jun 30, 2020

Purpose - Move logic from the spot check notebook into the code base.
What it does - Add functions to prepare data for analysis and one of the plots, and refactored code where possible for readability and/or using built in pandas methods (hopefully idiomatically). There are some unit tests for the data manipulation logic but could probably add more.

The main additions, in terms of what I'd expect to be used by the notebook, are the following:

# simulated generation based on a scenario
sum_generation_by_state(scenario_info):

# historical generation, based on a csv or excel file loaded by the user into hist_gen_raw
summarize_hist_gen(hist_gen_raw, all_resources):

# plot given state
plot_generation_sim_vs_hist(sim_gen, hist_gen, s_info, state, showmax=True):

Time to review - 30 mins (rough estimate). For testing I compared results of the original notebook with these functions, to spot check (pun intended) that we get the same results numerically, visually, etc. The unit tests cover some basic assumptions as well but I think it's a good point to get user feedback, or try it on other scenarios before adding a bunch more test coverage. I'm pretty sure the docstrings can be improved but wasn't sure what best practices are so just left for code review.

Questions - Would it be helpful to pair review? I can demo my notebook used for testing if that's useful. And how do we update the notebook script once we want to start using these?



def get_generation(s_info):
def get_value_at(state, resource):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether this is necessary here, given it simply calls the get_generation function via an instance of ScenarioInfo class, which can be put into a one-liner.

to each (state, resource) pair to populate the resulting dataframe. This
function is meant to contain the core logic of summarize_by_state and is
parameterized to simplify testing.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the reason of breaking this up into parts is for simplicity of testing. However, maybe it is worth creating a Mock ScenarioInfo object or improve the Mock Scenario object with time series attributes such that it can be loaded into a ScenarioInfo object, since ScenarioInfo is widely used in the spot check integration script, which will make the tests in the future much easier once we have a mock for that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was pondering the same thing. Think you're right about simplifying future code/testing, so I'll look into setting up a mock implementation. Leaning towards mocking ScenarioInfo itself, so the tests for spot check logic are totally independent of ScenarioInfo implementation, which should be tested separately against mock Scenario or State instances. At least, that's my current frame of mind, we'll see how it goes :)

Use state as a dict key if index is a smaller region (e.g. Texas East),
otherwise use the given index.
"""
for state in ("Texas", "New Mexico", "Montana"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for handling this ugly setup :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this tuple 'states that are split across interconnections'? If so, do you think this would fit better as a named global variable instead of hardcoded within this variable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but some examples from the raw data are "Montana Western", "Texas North Central", "Texas Far West" etc. Is that consistent with your description?

Copy link
Contributor

@danielolsen danielolsen Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The TAMU network mostly follows state lines when approximating the borders between interconnections (even though small bits of some plains states stretch over the lines), I believe Montana, Texas, and New Mexico are the only exceptions.



def summarize_hist_gen(hist_gen_raw, all_resources):
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we start to finalize this PR, we should come up with better names of these intermediate functions together with other parameter documentations.

width = 0.3 # width of bars
fontsize = 15
size_inches = (20, 10)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Organizing these plot parameters in a class is clean which is something in my mind for a while. However, setting up these parameters with hard coded constants is tricky given they might not fit with all the plots including the future ones we may have, specifically here, different bar plots may have adaptive bar width according the total number of bars within a plot. Let's think about a better way to handle general cases later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I figured we'd need to do more here but left it as kind of a reminder.

size_inches = (20, 10)


def plot_generation_sim_vs_hist_all(sim_gen, hist_gen, s_info):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether we need functions like this in the codebase. I think having a simple loop to call a function with parameters is acceptable in the user's script.


def calc_max(gen_type):
with warnings.catch_warnings():
warnings.simplefilter("ignore")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of warnings we are trying to simplify here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, it's from get_capacity in ScenarioInfo object. Basically, when we are trying to get capacities of a specific type of generation which does not exist in certain area, it returns 0 with warnings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't actually know that. We have suppression in the notebook, so I just switched it to use the context manager since that seemed right.

ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=fontsize)
ax.legend(fontsize=fontsize)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was implementing these functions, I simply wrote down the code that appeared on the top of my head. Not sure whether there are cleaner ways to set up these parameters when prepare a plot. If it looks bad to you, free feel to refactor :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know :) I'm pretty amateur at matplotlib so played it safe for now, but I'll come back to it and see what we can do.

ha="center",
va="bottom",
fontsize=8,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I simply found an example from StackoverFlow of labeling the bars in a bar chart and used it here.

@BainanXia BainanXia self-requested a review July 24, 2020 17:15
interconnect = ["Western"]
s_info = MockScenarioInfo()
s_info.grid.interconnect = interconnect
monkeypatch.setattr(s_info, "get_available_resource", mock_resource)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've heard of it but never used it before. I presume monkeypatch is defined in pytest.fixture, which makes it easy to mock function returns in a mock object.

Copy link
Collaborator

@BainanXia BainanXia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through this again together with the previous PR on MockScenarioInfo, https://github.com/Breakthrough-Energy/PowerSimData/pull/227/files. I think the testing part is much cleaner with MockScenarioInfo. Thanks for the effort. I will approve it once our file system is ready and I could check it with some integration tests. In the meantime, it would be great if we can get a second eye on this given it is the first PR on moving stuff in spot check into code base, which sets an example of merging plot features into PostREISE. We will refactor AnalyzePG in the same fashion.

)


colname_map = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we correct the column names in the input CSV and avoid this dictionary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that'd be cleaner, not sure how the csv columns are generated though, or if it's part of some process. Probably doesn't hurt to have this kind of mapping but could move it somewhere more intentional if there are other usages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I don't remember where they come from.

Copy link
Collaborator

@BainanXia BainanXia Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought this up long time ago to @danlivengood who creates these input CSVs. At that time the argument was that "names on the left (keys in colname_map) make more sense for human reading, whereas names on the right (values in colname_map) make more sense for computer reading (what we defined in the grid DataFrame)."

@danielolsen danielolsen self-requested a review July 30, 2020 22:48
Copy link
Contributor

@danielolsen danielolsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. The spot check script quickly grew into a beast, I'm sorry that you got tasked with cleaning up our mess.

@jenhagg
Copy link
Collaborator Author

jenhagg commented Jul 30, 2020

I've seen much worse of a mess, glad to do it (it's also informative). One thing I noticed though is the plot fails on Texas since we have "Texas" in the index for both state and interconnect. I'll look at a fix for that before merging.

@BainanXia
Copy link
Collaborator

One thing I noticed though is the plot fails on Texas since we have "Texas" in the index for both state and interconnect. I'll look at a fix for that before merging.

Good catch, previously it's fine since we were summarizing simulated results via ScenarioInfo and function area_to_loadzone took care of the confusion: Texas is treated as a state by default, unless the area_type is specified as 'interconnect'.

Copy link
Collaborator

@BainanXia BainanXia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice. Thanks.

@jenhagg jenhagg merged commit 4b38402 into develop Aug 5, 2020
@jenhagg jenhagg deleted the jon/spotcheck branch August 5, 2020 00:25
@ahurli ahurli mentioned this pull request Mar 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move plots and features in the spot check integration script into codebase
3 participants