Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding PSI to diff report #688

Merged
merged 62 commits into from
Nov 2, 2022
Merged

Adding PSI to diff report #688

merged 62 commits into from
Nov 2, 2022

Conversation

taylorfturner
Copy link
Contributor

@taylorfturner taylorfturner commented Oct 13, 2022

Adding PSI to diff report as part of issue #633

  • adding PSI def _calculate_psi() and implement in NumericStatsMixin.diff()
  • updated unit testing to reflect psi output as part of diff()

@taylorfturner taylorfturner added Work In Progress Solution is being developed New Feature A feature addition not currently in the library labels Oct 13, 2022
@taylorfturner taylorfturner self-assigned this Oct 13, 2022
@taylorfturner taylorfturner changed the title WIP Adding PSI to diff report [WIP] Adding PSI to diff report Oct 13, 2022
@@ -349,6 +349,8 @@ def report(self, remove_disabled_flag: bool = False) -> Dict:
continue
elif profile_key == "times":
continue
elif profile_key == "psi":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add option for disabling flag psi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we combine this with times and create a in [...]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could... I kinda see times and psi as two distinct keys ... whereas stddev and variance together make sense becuase they are derivatives of each other

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are different, but the logic is the same. if the key is this, continue. whereas stddev / variance are dependent hence, if one is not there, we have to pop the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see what you're getting at ... fair critique of the code. I'm game for a elif profile_key in ['psi', 'times']

Copy link
Contributor Author

@taylorfturner taylorfturner Oct 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -391,6 +393,7 @@ def diff(self, other_profile: NumericStatsMixin, options: Dict = None) -> Dict:
other_profile.variance,
other_profile.match_count,
),
"psi": self._calculate_psi(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call static method in def diff()

Comment on lines 522 to 526
@staticmethod
def _calculate_psi():
raise NotImplementedError()

def _update_variance(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where PSI actually gets calculated

@taylorfturner taylorfturner changed the title [WIP] Adding PSI to diff report Adding PSI to diff report Oct 19, 2022
Comment on lines 598 to 599
self_count = sum1 / mean1
other_count = sum2 / mean2
Copy link
Contributor

@JGSweets JGSweets Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the intent is to get the count, we could just sum the bin_counts instead. otherwise we might lose floating point accuracies, no? Also:

        def mean(self) -> float:
        """Return mean value."""
        if self.match_count == 0:
            return 0
        return float(self.sum) / self.match_count

so could just use match_count as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perf! I'll use match_count -- less code and reuse what already exists

@taylorfturner taylorfturner added the Work In Progress Solution is being developed label Nov 1, 2022

if regenerate_histogram:
len_self_bin_counts = 0
if len(self._stored_histogram["histogram"]["bin_counts"]) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this if statement needed?

new_self_histogram["bin_counts"] = histogram["bin_counts"]
new_self_histogram["bin_edges"] = histogram["bin_edges"]

if len_self_bin_counts == num_psi_bins:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this if, could set it above the first if, and it will get reassigned given the criteria

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally fair -- updating now

new_other_histogram["bin_edges"] = histogram["bin_edges"]
new_other_histogram["bin_counts"] = histogram["bin_counts"]

if not histogram_edges_not_equal:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally fair -- updating now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can reduce 3 ifs down to one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely

@@ -369,6 +369,93 @@ def diff(self, other_profile: NumericStatsMixin, options: Dict = None) -> Dict:
"and '{}'".format(cls.__name__, other_profile.__class__.__name__)
)

new_self_histogram = {"bin_counts": None, "bin_edges": None}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, but one thing I would consider is the rest of the api functions in diff are purely padding the data with all the checks within.

To keep things similar, unless the histograms were used elsewhere, I'd consider putting all of this in calculate_psi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah that's good -- lemme see what I can do right quick and push up

)
invalid_stats = True

if invalid_stats:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of an extra if could we put it inside the previous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol true 🤦

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in next push

@taylorfturner taylorfturner removed the Work In Progress Solution is being developed label Nov 2, 2022
@JGSweets JGSweets enabled auto-merge (squash) November 2, 2022 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Medium Priority Significant improvement or bug / feature reducing overall performance New Feature A feature addition not currently in the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants