Add t-test to numerical stats differences #355

Merged: 14 commits, Jul 26, 2021

Conversation

andrew-yin
Contributor

Performs a two-sample, two-tailed t-test for difference of means between numerical columns.
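For reference, a minimal sketch of how a two-sample t statistic and Welch's degrees of freedom can be computed from summary statistics alone (mean, sample variance, count). The function and argument names are illustrative assumptions, not the PR's actual implementation.

```python
import math


def welch_t_statistic(mean1, var1, n1, mean2, var2, n2):
    """Return (t, df) for Welch's unequal-variance t-test.

    var1 and var2 are sample variances (ddof=1); n1 and n2 are sample sizes.
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample.
    se1 = var1 / n1
    se2 = var2 / n2
    if se1 + se2 == 0:
        raise ValueError("both variances are zero; t is undefined")
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


t, df = welch_t_statistic(mean1=5.0, var1=4.0, n1=10,
                          mean2=3.0, var2=9.0, n2=12)
print(t, df)
```

With equal variances and equal sample sizes, the Welch degrees of freedom reduce to n1 + n2 - 2, which is a convenient sanity check.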

Comment on lines 353 to 354
warnings.warn("Could not import necessary statistical packages. "
              "T-test will be incomplete.", RuntimeWarning)
Contributor

@JGSweets JGSweets Jul 21, 2021

At the very least, the warning should say pip install scipy.
Best case scenario, we abstract warn_on_profile so it can be used both for profiling and for any case where the user needs to install a set of requirements, such as dataprofiler[ml] (or dataprofiler[stats], if we add it).
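A rough sketch of what such an abstraction could look like. Both helpers below are hypothetical names invented for illustration; they are not the existing warn_on_profile function, and the wording of the hint is an assumption.

```python
import importlib
import warnings


def warn_missing_requirements(feature, package, extra=None):
    """Hypothetical helper: warn with an actionable install hint."""
    hint = f"pip install {package}"
    if extra:
        hint += f" or pip install {extra}"
    warnings.warn(
        f"Could not import {package}. {feature} will be incomplete. "
        f"To enable it, run: {hint}",
        RuntimeWarning,
    )


def optional_import(module_name, feature, package, extra=None):
    """Return the imported module, or None (with a warning) if unavailable."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        warn_missing_requirements(feature, package, extra)
        return None


# scipy_stats is None when SciPy is not installed; callers can then skip
# the p-value and still report the statistics that need no extra package.
scipy_stats = optional_import("scipy.stats", "T-test", "scipy")
```

Keeping the package name and the optional extra as parameters lets the same helper serve any feature that depends on an optional install.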

@JGSweets JGSweets enabled auto-merge (squash) July 21, 2021 22:09
auto-merge was automatically disabled July 23, 2021 17:08

Head branch was pushed to by a user without write access

@JGSweets JGSweets enabled auto-merge (squash) July 23, 2021 17:47
auto-merge was automatically disabled July 23, 2021 17:47

Head branch was pushed to by a user without write access

@JGSweets JGSweets enabled auto-merge (squash) July 23, 2021 17:48
auto-merge was automatically disabled July 23, 2021 20:02

Head branch was pushed to by a user without write access

JGSweets
JGSweets previously approved these changes Jul 23, 2021
@JGSweets JGSweets enabled auto-merge (squash) July 23, 2021 20:06
@JGSweets JGSweets dismissed their stale review July 23, 2021 20:07

What happens if mean / variance is NaN?

auto-merge was automatically disabled July 23, 2021 20:14

Head branch was pushed to by a user without write access

@JGSweets JGSweets enabled auto-merge (squash) July 23, 2021 20:37
JGSweets
JGSweets previously approved these changes Jul 23, 2021
Contributor

@AnhTruong AnhTruong left a comment

some comments

        'p-value': 0.749287157907667
    },
    'welch': {
        'df': 3.6288111187629117,
Contributor

what's df?

Contributor

degrees of freedom

Contributor

should we change to dof? I've seen that used in numpy or scipy

Contributor Author

Df stands for degrees of freedom, which determines the "curvature" of the t-distribution since the population variance is unknown: https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
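To illustrate how df shapes the t-distribution, here is a small standard-library sketch that integrates the t density numerically. It is an illustration only, assuming Simpson's rule is accurate enough for these inputs; it is not how SciPy or this PR computes p-values, and the function names are made up for the example.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p_value(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|].

    steps must be even; accuracy is adequate for illustration only.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # probability that |T| <= |t|
    return max(0.0, 1.0 - central_mass)


# The same t statistic is less surprising when df is small (heavier tails).
for df in (3.6288111187629117, 30.0):
    print(df, two_tailed_p_value(2.0, df))
```

A smaller df gives heavier tails, so the same t statistic yields a larger p-value than it would with a larger df.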

dataprofiler/profilers/numerical_column_stats.py (outdated)
  }
- diff = profiler.diff(profiler2)
+ diff = profiler1.diff(profiler2)
  self.assertDictEqual(expected_diff, diff)
Contributor

Should we add unit tests with two completely different profiles, which should give a small p-value, and with two similar profiles, which should give a large p-value?

Contributor

I'm not sure we need to ensure specific cases so long as we ensure the p-value and df are calculated correctly and the edge cases are handled correctly.

Contributor

The thing is, I currently don't know how the expected t-test values in the unit tests were obtained. So either we document those calculations in comments, or we check approximate correctness using the cases I suggested.
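Both suggestions can be combined. The sketch below is one possible shape for such tests, assuming a hypothetical helper welch_t_and_df that works on raw samples; it does not use the profiler's diff API, and it checks the t statistic and df rather than the p-value.

```python
import math
import statistics
import unittest


def welch_t_and_df(sample1, sample2):
    """Hypothetical helper: Welch t statistic and degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    se1 = statistics.variance(sample1) / n1  # sample variance (ddof=1) / n
    se2 = statistics.variance(sample2) / n2
    t = ((statistics.mean(sample1) - statistics.mean(sample2))
         / math.sqrt(se1 + se2))
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


class TestWelchExpectedValues(unittest.TestCase):
    def test_hand_calculated_values(self):
        # sample1 = [1, 2, 3, 4, 5]:  mean 3, variance 2.5 -> 2.5 / 5 = 0.5
        # sample2 = [2, 4, 6, 8, 10]: mean 6, variance 10  -> 10 / 5 = 2.0
        # t  = (3 - 6) / sqrt(0.5 + 2.0) = -3 / sqrt(2.5)
        # df = 2.5**2 / (0.5**2 / 4 + 2.0**2 / 4) = 6.25 / 1.0625
        t, df = welch_t_and_df([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
        self.assertAlmostEqual(t, -3 / math.sqrt(2.5))
        self.assertAlmostEqual(df, 6.25 / 1.0625)

    def test_similar_vs_different_samples(self):
        base = [1, 2, 3, 4, 5]
        t_similar, _ = welch_t_and_df(base, [1, 2, 3, 4, 6])
        t_different, _ = welch_t_and_df(base, [101, 102, 103, 104, 105])
        # Nearly identical samples give a t close to zero; well-separated
        # samples give a t far from zero.
        self.assertLess(abs(t_similar), 1.0)
        self.assertGreater(abs(t_different), 10.0)
```

The tests can be run with python -m unittest. Writing the arithmetic in comments next to each expected value makes it possible to re-derive the numbers during review.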

auto-merge was automatically disabled July 26, 2021 17:00

Head branch was pushed to by a user without write access

@JGSweets JGSweets enabled auto-merge (squash) July 26, 2021 20:31
if np.isnan([mean1, mean2, var1, var2]).any() or \
        None in [mean1, mean2, var1, var2]:
    warnings.warn("Null value(s) found in mean and/or variance values. "
                  "T-test cannot be performed.", RuntimeWarning)
Contributor

👍
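One detail worth noting about the check quoted above: np.isnan is evaluated before the None test, and NumPy's isnan raises a TypeError on a list that contains None, so the None branch may never be reached. A minimal sketch of an order-safe variant follows, using only the standard library; the function names are hypothetical and this is not the merged code.

```python
import math
import warnings


def has_invalid_stats(*values):
    """Return True if any value is None or NaN.

    None is tested first because math.isnan(None) raises TypeError.
    """
    return any(v is None or math.isnan(v) for v in values)


def check_t_test_inputs(mean1, mean2, var1, var2):
    """Hypothetical guard mirroring the warning in the quoted snippet."""
    if has_invalid_stats(mean1, mean2, var1, var2):
        warnings.warn("Null value(s) found in mean and/or variance values. "
                      "T-test cannot be performed.", RuntimeWarning)
        return False
    return True
```

Because `or` short-circuits, putting the None test first means math.isnan is only called on values that are known not to be None.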

@JGSweets JGSweets merged commit 43e5f6b into capitalone:main Jul 26, 2021
stevensecreti pushed a commit to stevensecreti/DataProfiler that referenced this pull request Jun 15, 2022
* Add t-test to numerical stats diff

* Update tests

* Refactor t-test

* Update tests

* Change to None when insufficient sample size

* Change insufficient sample size return type to dict

* Add check for invalid mean/var

* Update tests

* Add specific warnings for invalid t-test

* Refactor tests

* Add comments on test calculation methods
4 participants