Associations not returned for categorical variables #31

Closed
inesani opened this issue Feb 17, 2020 · 3 comments · Fixed by #45

inesani commented Feb 17, 2020

Hi, first of all thank you for this awesome package and the Medium article ;)

I am testing the associations function found in nominal.py with a mix of numerical and categorical variables. I provide below a sample (sample.csv) of the dataset that is returning an empty result.

2020-02-13 00:00:00.017,/131.161.10.118,GET,404,1830,569,1930
2020-02-13 00:00:00.183,/58.14.127.52,GET,406,1110,607,1210
2020-02-13 00:00:00.35,/93.40.70.85,GET,200,926,544,1026
2020-02-13 00:00:00.521,/93.40.70.85,GET,404,2229,502,2329
2020-02-13 00:00:01.02,/87.65.64.76,GET,404,2046,556,2146

My code is:

import pandas as pd
from dython.nominal import associations
data = pd.read_csv('sample.csv', header=None)
associations(data)

The numerical columns give sensible results, but I am not getting anything for the categorical ones. My result is:

[Image: resultAsso]

How is it that nothing is returned?

When testing this with other datasets that have a mix of variables, I have seen cases where everything is calculated just fine, cases where it isn't (like the one above), and cases where it isn't and this warning is thrown:
RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

Any help will be greatly appreciated, I can't wait to use this package more!

@shakedzy shakedzy added the bug Something isn't working label Feb 17, 2020
shakedzy (Owner) commented

So there are two issues here:

  • There's a bug in the code that this specific data triggers for some reason. It has to do with the pd.crosstab part of cramers_v. I'll try fixing it soon; in the meantime you can use theil_u=True in the associations method.

  • The problem with column 2 is that it has only a single value in it (GET, at least in this example). There's an underlying assumption that each column has at least two distinct values. I'll add an option to ignore single-value columns, and perhaps print a clearer warning.
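Until such an option is available, one possible workaround is to drop single-value columns yourself before calling associations. The following is only a minimal sketch using plain pandas on the five sample rows from the report; the names `constant_cols` and `filtered` are illustrative and not part of the package, and the call to associations is left as a comment.

```python
import io

import pandas as pd

# The five sample rows from the report, read without a header row,
# so pandas labels the columns 0..6.
sample = io.StringIO(
    "2020-02-13 00:00:00.017,/131.161.10.118,GET,404,1830,569,1930\n"
    "2020-02-13 00:00:00.183,/58.14.127.52,GET,406,1110,607,1210\n"
    "2020-02-13 00:00:00.35,/93.40.70.85,GET,200,926,544,1026\n"
    "2020-02-13 00:00:00.521,/93.40.70.85,GET,404,2229,502,2329\n"
    "2020-02-13 00:00:01.02,/87.65.64.76,GET,404,2046,556,2146\n"
)
data = pd.read_csv(sample, header=None)

# Count distinct values per column and collect the columns that
# have fewer than two (these cannot be associated with anything).
n_distinct = data.nunique()
constant_cols = n_distinct[n_distinct < 2].index.tolist()

# Drop them; `filtered` is what would then be passed on,
# e.g. associations(filtered).
filtered = data.drop(columns=constant_cols)

print("single-value columns:", constant_cols)
print("remaining shape:", filtered.shape)
```

On this sample only column 2 (the HTTP method) is dropped. Note that `nunique()` ignores missing values by default, so a column holding one value plus NaNs would also be treated as single-valued here.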

@shakedzy shakedzy added this to the Version 0.5.0 milestone Apr 16, 2020
shakedzy (Owner) commented

So there's a rare edge case here, where the bias correction of Cramer's V ends up with a denominator of 0. I added an option to disable the bias correction in version 0.5.0. This should prevent errors like these.

In the new version, the plotted heat map will look like this:

[Image: Screen Shot 2020-04-18 at 4 14 17]

Along with a clear warning:

RuntimeWarning: Unable to calculate Cramer's V using bias correction. Consider trying using bias_correction=False
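To see how the zero denominator can arise, here is a rough standard-library sketch of Cramér's V with the usual bias correction, mirroring the expression shown in the original warning (phi2corr, kcorr, rcorr). It is not the package's actual code: the function name, its signature, and the choice to return NaN when the statistic is undefined are assumptions made for illustration, and it assumes at least two observations.

```python
import math
from collections import Counter


def cramers_v(x, y, bias_correction=True):
    """Cramér's V for two equal-length sequences of categories.

    Hypothetical helper: returns NaN when the denominator is not positive.
    """
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_count in row_totals.items():
        for b, col_count in col_totals.items():
            expected = row_count * col_count / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    phi2 = chi2 / n

    if bias_correction:
        # Bias-corrected phi^2 and corrected row/column counts.
        phi2 = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        k = k - (k - 1) ** 2 / (n - 1)

    denom = min(k - 1, r - 1)
    if denom <= 0:
        return float("nan")
    return math.sqrt(phi2 / denom)


# Five rows with a unique value in every row (like the timestamp column)
# against the status codes from the sample.
timestamps = ["t1", "t2", "t3", "t4", "t5"]
status = [404, 406, 200, 404, 404]

# With r = n = 5 the corrected row count is 5 - 16/4 = 1,
# so the denominator min(kcorr - 1, rcorr - 1) is 0.
v_corrected = cramers_v(timestamps, status)
v_plain = cramers_v(timestamps, status, bias_correction=False)
```

In this sketch the corrected count reaches 1 whenever a column has as many distinct values as there are rows, and also when it has a single value, which is consistent with both symptoms described in this thread.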


sdjoko commented Nov 28, 2021

Can this package handle missing values?
