-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding PSI
to diff
report
#688
Conversation
PSI
to diff
report PSI
to diff
report
@@ -349,6 +349,8 @@ def report(self, remove_disabled_flag: bool = False) -> Dict: | |||
continue | |||
elif profile_key == "times": | |||
continue | |||
elif profile_key == "psi": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add option for disabling flag psi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we combine this with times
and create a in [...]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could... I kinda see times
and psi
as two distinct keys ... whereas stddev
and variance
together make sense becuase they are derivatives of each other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are different, but the logic is the same. if the key is this, continue. whereas stddev / variance are dependent hence, if one is not there, we have to pop the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see what you're getting at ... fair critique of the code. I'm game for a elif profile_key in ['psi', 'times']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -391,6 +393,7 @@ def diff(self, other_profile: NumericStatsMixin, options: Dict = None) -> Dict: | |||
other_profile.variance, | |||
other_profile.match_count, | |||
), | |||
"psi": self._calculate_psi(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call static method in def diff()
@staticmethod | ||
def _calculate_psi(): | ||
raise NotImplementedError() | ||
|
||
def _update_variance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is where PSI actually gets calculated
PSI
to diff
report PSI
to diff
report
self_count = sum1 / mean1 | ||
other_count = sum2 / mean2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the intent is to get the count, we could just sum the bin_counts instead. otherwise we might lose floating point accuracies, no? Also:
def mean(self) -> float:
"""Return mean value."""
if self.match_count == 0:
return 0
return float(self.sum) / self.match_count
so could just use match_count
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perf! I'll use match_count
-- less code and reuse what already exists
|
||
if regenerate_histogram: | ||
len_self_bin_counts = 0 | ||
if len(self._stored_histogram["histogram"]["bin_counts"]) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this if statement needed?
new_self_histogram["bin_counts"] = histogram["bin_counts"] | ||
new_self_histogram["bin_edges"] = histogram["bin_edges"] | ||
|
||
if len_self_bin_counts == num_psi_bins: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this if, could set it above the first if, and it will get reassigned given the criteria
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally fair -- updating now
new_other_histogram["bin_edges"] = histogram["bin_edges"] | ||
new_other_histogram["bin_counts"] = histogram["bin_counts"] | ||
|
||
if not histogram_edges_not_equal: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally fair -- updating now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can reduce 3 ifs down to one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely
@@ -369,6 +369,93 @@ def diff(self, other_profile: NumericStatsMixin, options: Dict = None) -> Dict: | |||
"and '{}'".format(cls.__name__, other_profile.__class__.__name__) | |||
) | |||
|
|||
new_self_histogram = {"bin_counts": None, "bin_edges": None} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, but one thing I would consider is the rest of the api functions in diff are purely padding the data with all the checks within.
To keep things similar, unless the histograms were used elsewhere, I'd consider putting all of this in calculate_psi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah that's good -- lemme see what I can do right quick and push up
) | ||
invalid_stats = True | ||
|
||
if invalid_stats: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of an extra if could we put it inside the previous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol true 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in next push
Adding PSI to
diff
report as part of issue #633def _calculate_psi()
and implement inNumericStatsMixin.diff()
psi
output as part ofdiff()