-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add t-test to numerical stats differences #355
Conversation
warnings.warn("Could not import necessary statistical packages. " | ||
"T-test will be incomplete.", RuntimeWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least should say pip install scipy
.
Best case scenario we abstract warn_on_profile
to be utilized for both profiling or any case where the user needs to install a set of requirements like dataprofiler[ml]
(or if we do dataprofiler[stats]
).
Head branch was pushed to by a user without write access
Head branch was pushed to by a user without write access
Head branch was pushed to by a user without write access
Head branch was pushed to by a user without write access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments
'p-value': 0.749287157907667 | ||
}, | ||
'welch': { | ||
'df': 3.6288111187629117, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
degrees of freedom
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we change to dof
? I've seen that used in numpy or scipy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Df stands for degrees of freedom, which determines the "curvature" of the t-distribution since the population variance is unknown: https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In scipy for the t-statistic they use df
: https://numpy.org/doc/stable/reference/random/generated/numpy.random.standard_t.html
dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py
Outdated
Show resolved
Hide resolved
} | ||
diff = profiler.diff(profiler2) | ||
diff = profiler1.diff(profiler2) | ||
self.assertDictEqual(expected_diff, diff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add another unit test with two completely different profiles, so we will get t-test results with p_value small, and two similar profiles to get p_value large
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need to ensure specific cases so long as we ensure the p-value and df are calculated correctly and the edge cases are handled correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is, currently I don't know how we get the values of t-test in the unit tests. So, either we provide the calculations of those in the form of some comments, or we can test the correctness (ish) easily using the cases that I suggested.
Head branch was pushed to by a user without write access
dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py
Outdated
Show resolved
Hide resolved
if np.isnan([mean1, mean2, var1, var2]).any() or \ | ||
None in [mean1, mean2, var1, var2]: | ||
warnings.warn("Null value(s) found in mean and/or variance values. " | ||
"T-test cannot be performed.", RuntimeWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* Add t-test to numerical stats diff * Update tests * Refactor t-test * Update tests * Change to None when insufficient sample size * Change insufficient sample size return type to dict * Add check for invalid mean/var * Update tests * Add specific warnings for invalid t-test * Refactor tests * Add comments on test calculation methods
Performs a two-sample, two-tailed t-test for difference of means between numerical columns.