General approach to dealing with correlated columns #2462
Unanswered
gjthompson1 asked this question in Q&A
Replies: 0 comments
@RobinL first of all, thank you! This library is awesome. I have been thinking about this problem on and off for 7 years. (I had a client with 65M records that needed to be grouped down to about 2M people, and I implemented a simple linking model.) I always felt intuitively that this was a solved problem. When the "dedupe" issue started to become a bigger problem for the client, I did some more googling and found Splink! So thank you!
TLDR: What's the general recommendation for dealing with correlation?
I am struggling with
I understand that on the home page Splink says "Splink performs best with input data containing multiple columns that are not highly correlated.... Correlation is particularly problematic if all of your input columns are highly correlated"
It's not totally clear whether correlation is OK, and if so, how much.
I feed in columns like name, name_formatted, first_name, middle_name, last_name, generational_suffix, parsed from stuff like name = 'Ernest III & Susan Gibson Family Living Trust', along with location information, e.g. street, city, state, zip_code, latitude, longitude.
When I do, I am finding that the model generally seems to work, but the probabilities are all 1 or 0.99999. I am guessing this is user error based on the above and other statements such as "...A key assumption of the Fellegi Sunter model is that observations from different column/comparisons are independent of one another".
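To make the concern concrete, here is a minimal sketch of how independent evidence combines in a Fellegi-Sunter style model: the prior odds are multiplied by one Bayes factor (m/u ratio) per comparison. The prior and the Bayes factor below are made-up numbers for illustration, not values from my model, and `posterior_match_probability` is a hypothetical helper, not a Splink function.

```python
import math


def posterior_match_probability(prior, bayes_factors):
    """Combine a prior with per-comparison Bayes factors (m/u ratios),
    treating every comparison as independent evidence."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * math.prod(bayes_factors)
    return posterior_odds / (1 + posterior_odds)


# Hypothetical values: 1 in 10,000 candidate pairs is a true match, and an
# agreeing name is 100 times more likely among matches than non-matches.
prior = 1 / 10_000
name_bf = 100.0

# The name evidence counted once.
one_name_column = posterior_match_probability(prior, [name_bf])

# The same evidence counted three times, as if name, name_formatted and
# first_name + last_name were independent of one another.
three_name_columns = posterior_match_probability(prior, [name_bf] * 3)

print(f"{one_name_column:.4f}")     # about 0.0099
print(f"{three_name_columns:.4f}")  # about 0.9901
```

If the three columns really carry the same information, the second figure overstates the evidence by a factor of 10,000 in odds, which would be consistent with scores piling up near 1.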
Reading other discussions online, it seems like the best thing to do is generally to create "custom" "comparisons" that merge these correlated features into a single "comparison" with different "levels", e.g.
Location comparison
(I don't have lat/long for every record)
Entity comparison
...
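As a rough sketch of the kind of merged location comparison I mean: the dictionary below follows the comparison format used in Splink settings (an output_column_name plus ordered comparison_levels, each with a sql_condition), which I believe Splink accepts, but please check the docs for your version. The column names are the ones from my data, the `_l` / `_r` suffixes refer to the left and right record of a pair, and the level choices and the 0.001 degree threshold are arbitrary assumptions.

```python
# Hypothetical single comparison covering all correlated location columns.
# Levels are evaluated top to bottom; a pair lands in the first level whose
# condition is true.
location_comparison = {
    "output_column_name": "location",
    "comparison_levels": [
        {
            "sql_condition": "zip_code_l IS NULL OR zip_code_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": (
                "street_l = street_r AND city_l = city_r "
                "AND state_l = state_r AND zip_code_l = zip_code_r"
            ),
            "label_for_charts": "Exact match on full address",
        },
        {
            # Crude proximity check. If lat/long is missing on either side
            # the condition evaluates to NULL, so the pair falls through to
            # the next level instead of matching here.
            "sql_condition": (
                "abs(latitude_l - latitude_r) < 0.001 "
                "AND abs(longitude_l - longitude_r) < 0.001"
            ),
            "label_for_charts": "Lat/long within ~0.001 degrees",
        },
        {
            "sql_condition": "zip_code_l = zip_code_r",
            "label_for_charts": "Same zip code",
        },
        {
            "sql_condition": "city_l = city_r AND state_l = state_r",
            "label_for_charts": "Same city and state",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}
```

Because each pair falls into exactly one level, street, city, state, zip and lat/long contribute a single match weight rather than six separate ones.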
Is this the correct approach? Or is there a better way to do this? How much correlation is OK? I know you have mentioned elsewhere that first and middle names are often correlated. Phone numbers and addresses are correlated too (here in the US). I want my model to be as good as possible, so I want to add lots of features. But then there is a lot of correlation.
I know it's maybe lazy, but I am kinda used to just throwing 100 columns into XGBoost and letting it figure things out. It's not super intuitive how to handle real-world data like this, and it's unclear how bad the correlation is. It would be nice to have more guidance / materials / examples.
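One way I could imagine quantifying "how bad", sketched under my own assumptions: within pairs of a known status (here, non-matches), compare how often two columns agree together against what independence would predict. A ratio near 1 suggests the columns behave independently, while a ratio well above 1 suggests double counting. `agreement_lift` is a hypothetical helper and the eight pairs below are toy data, not real results.

```python
def agreement_lift(pairs, col_a, col_b):
    """Ratio of the observed joint agreement rate to the rate expected if
    agreement on col_a and col_b were independent."""
    n = len(pairs)
    p_a = sum(p[col_a] for p in pairs) / n
    p_b = sum(p[col_b] for p in pairs) / n
    p_ab = sum(p[col_a] and p[col_b] for p in pairs) / n
    return p_ab / (p_a * p_b)


# Toy agreement indicators (1 = values agree) for eight non-matching pairs.
non_match_pairs = [
    {"first_name": 1, "city": 1, "zip_code": 1},
    {"first_name": 0, "city": 1, "zip_code": 1},
    {"first_name": 1, "city": 0, "zip_code": 0},
    {"first_name": 1, "city": 0, "zip_code": 0},
    {"first_name": 1, "city": 0, "zip_code": 0},
    {"first_name": 0, "city": 0, "zip_code": 0},
    {"first_name": 0, "city": 0, "zip_code": 0},
    {"first_name": 0, "city": 0, "zip_code": 0},
]

city_zip_lift = agreement_lift(non_match_pairs, "city", "zip_code")
name_zip_lift = agreement_lift(non_match_pairs, "first_name", "zip_code")

print(city_zip_lift)  # 4.0: city and zip agree together far more than chance
print(name_zip_lift)  # 1.0: consistent with independence
```

The Fellegi-Sunter assumption is independence conditional on match status, so the same check would need to be run separately on matching pairs.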
Very much appreciate this work and this library.
Thanks
[Update] I am looking at this https://github.com/RobinL/uk_address_matcher/blob/main/uk_address_matcher/data/splink_model.json and I would assume there is a decent amount of correlation between, say, numeric_token_1 and original_address_concat (e.g. 123 Main Street matching 123 Main Street #120 would also imply 123 matches 123) and some of these other comparisons?
Useful links
Related convos