Add t-test to numerical stats differences #355

andrew-yin · 2021-07-21T20:59:36Z

Performs a two-sample, two-tailed t-test for difference of means between numerical columns.
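For readers following the thread, here is a minimal sketch of what a two-sample, two-tailed Welch t-test looks like when computed from summary statistics only. It is not the code in this PR. The function name `welch_t_test` and the `t-statistic` key are assumptions made for illustration; `df` and `p-value` mirror the keys visible in the test diff. It assumes sample variances (ddof=1), non-zero variance in at least one sample, and at least two observations per sample.

```python
import math


def welch_t_test(mean1, var1, n1, mean2, var2, n2):
    """Welch's t-test from summary statistics (hypothetical helper)."""
    # Squared standard error of each sample mean.
    se1 = var1 / n1
    se2 = var2 / n2

    # Test statistic for the difference of means.
    t = (mean1 - mean2) / math.sqrt(se1 + se2)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

    # The p-value needs the t-distribution, so scipy is treated as optional.
    try:
        from scipy import stats
        p_value = 2 * float(stats.t.sf(abs(t), df))  # two-tailed
    except ImportError:
        p_value = None

    return {"t-statistic": t, "df": df, "p-value": p_value}


result = welch_t_test(mean1=5.0, var1=4.0, n1=10, mean2=3.0, var2=4.0, n2=10)
```

With equal variances and equal sample sizes, the Welch degrees of freedom reduce to n1 + n2 - 2, which makes this case easy to check by hand.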

dataprofiler/tests/profilers/test_column_profile_compilers.py

dataprofiler/profilers/numerical_column_stats.py

JGSweets · 2021-07-21T21:28:29Z

dataprofiler/profilers/numerical_column_stats.py

+            warnings.warn("Could not import necessary statistical packages. "
+                          "T-test will be incomplete.", RuntimeWarning)


At the very least, the warning should tell the user to run pip install scipy.
Best case, we abstract warn_on_profile so it can be used both for profiling and for any other case where the user needs to install a set of requirements such as dataprofiler[ml] (or dataprofiler[stats], if we add it).
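A rough sketch of the kind of generic helper suggested here, assuming nothing about how warn_on_profile is actually implemented. The name `warn_if_missing` and its parameters are hypothetical; the point is only that the warning carries an install hint.

```python
import importlib
import warnings


def warn_if_missing(module_name, feature, install_hint):
    """Return True if module_name imports; otherwise warn with an install hint.

    Hypothetical helper, not part of dataprofiler.
    """
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        warnings.warn(
            f"Could not import {module_name}. {feature} will be incomplete. "
            f"To enable it, run: {install_hint}",
            RuntimeWarning,
        )
        return False
```

A caller could then write `warn_if_missing("scipy", "T-test", "pip install scipy")` and skip the p-value calculation when it returns False.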

dataprofiler/tests/profilers/test_column_profile_compilers.py

dataprofiler/tests/profilers/test_float_column_profile.py

dataprofiler/profilers/numerical_column_stats.py

What happens if mean / variance is NaN?

AnhTruong

some comments

AnhTruong · 2021-07-26T14:49:40Z

dataprofiler/tests/profilers/test_column_profile_compilers.py

+                             'p-value': 0.749287157907667
+                         },
+                         'welch': {
+                             'df': 3.6288111187629117,


degrees of freedom

Should we change this to dof? I've seen that used in numpy or scipy.

Df stands for degrees of freedom, which determines the "curvature" of the t-distribution since the population variance is unknown: https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)

Both NumPy and SciPy use df for the t-distribution's parameter, for example: https://numpy.org/doc/stable/reference/random/generated/numpy.random.standard_t.html

dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py

dataprofiler/profilers/numerical_column_stats.py

AnhTruong · 2021-07-26T14:59:50Z

dataprofiler/tests/profilers/test_text_column_profile.py

                         }
-        diff = profiler.diff(profiler2)
+        diff = profiler1.diff(profiler2)
        self.assertDictEqual(expected_diff, diff)


Should we add unit tests for two completely different profiles, so that the t-test gives a small p-value, and for two similar profiles, so that it gives a large p-value?

I'm not sure we need to ensure specific cases so long as we ensure the p-value and df are calculated correctly and the edge cases are handled correctly.

The thing is, I currently don't know how the t-test values in the unit tests were obtained. So either we document the calculations in comments, or we check correctness (roughly) using the cases I suggested.
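To make the suggested sanity check concrete, here is a small standard-library sketch, not the tests in this PR. The sample values and the helper name `welch_t` are made up for illustration. It only checks the t-statistic, relying on the fact that at a fixed df a larger |t| means a smaller two-tailed p-value.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Welch t-statistic from raw samples (hypothetical helper)."""
    se1 = statistics.variance(sample1) / len(sample1)  # sample variance, ddof=1
    se2 = statistics.variance(sample2) / len(sample2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    return diff / math.sqrt(se1 + se2)


base = [1, 2, 3, 4, 5]
similar = [1.1, 2.1, 2.9, 4.0, 5.1]   # nearly the same distribution as base
different = [101, 102, 103, 104, 105]  # same spread, mean shifted by 100

t_similar = welch_t(base, similar)
t_different = welch_t(base, different)
```

Here `t_similar` stays close to zero while `t_different` is large in magnitude, which is the qualitative behaviour the two proposed test cases would assert on.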

dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py

AnhTruong · 2021-07-26T20:36:10Z

dataprofiler/profilers/numerical_column_stats.py

+        if np.isnan([mean1, mean2, var1, var2]).any() or \
+                None in [mean1, mean2, var1, var2]:
+            warnings.warn("Null value(s) found in mean and/or variance values. "
+                          "T-test cannot be performed.", RuntimeWarning)
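One detail about the check quoted above: np.isnan raises a TypeError when the list contains None, because the array is built with object dtype, so the None test would need to run before the NaN test. A minimal sketch of that ordering, using only the standard library; the function name `stats_are_valid` is hypothetical and this is not the code merged in the PR.

```python
import math
import warnings


def stats_are_valid(*values):
    """Return False (with a warning) if any value is None or NaN."""
    # Check for None first: math.isnan(None) and np.isnan on a list
    # containing None both raise TypeError.
    if any(v is None for v in values):
        warnings.warn("Null value(s) found in mean and/or variance values. "
                      "T-test cannot be performed.", RuntimeWarning)
        return False
    if any(math.isnan(v) for v in values):
        warnings.warn("NaN value(s) found in mean and/or variance values. "
                      "T-test cannot be performed.", RuntimeWarning)
        return False
    return True
```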


* Add t-test to numerical stats diff
* Update tests
* Refactor t-test
* Update tests
* Change to None when insufficient sample size
* Change insufficient sample size return type to dict
* Add check for invalid mean/var
* Update tests
* Add specific warnings for invalid t-test
* Refactor tests
* Add comments on test calculation methods

Andrew Yin added 2 commits July 21, 2021 15:56

Add t-test to numerical stats diff

9f23fa3

Update tests

d9cea81

andrew-yin requested review from AnhTruong, ChrisWallace2020, grant-eden, JGSweets and lettergram as code owners July 21, 2021 20:59

Merge branch 'main' into t-tests

62463c8

ChrisWallace2020 reviewed Jul 21, 2021

dataprofiler/tests/profilers/test_column_profile_compilers.py (outdated)