General approach to dealing with correlated columns #2462

gjthompson1 · 2024-10-10T03:40:02Z

@RobinL First of all, thank you! This library is awesome. I have been thinking about this problem on and off for 7 years. I had a client with 65M records that needed to be grouped down to about 2M people, and I implemented a simple linking model. I always felt intuitively that this was a solved problem. When the "dedupe" issue started to become a bigger problem for the client, I did some more googling and found Splink! So thank you!

TL;DR: What's the general recommendation for dealing with correlation?

I am struggling with two questions:

How is correlation affecting my model?
How do I deal with it?

I understand that the Splink home page says "Splink performs best with input data containing multiple columns that are not highly correlated.... Correlation is particularly problematic if all of your input columns are highly correlated".

It's not totally clear whether correlation is OK and, if so, how much.

I feed in name columns like name, name_formatted, first_name, middle_name, last_name and generational_suffix, parsed from values like name = 'Ernest III & Susan Gibson Family Living Trust', along with location information, e.g. street, city, state, zip_code, latitude and longitude. The model generally seems to work, but the probabilities are all 1 or 0.99999.

I am guessing this is user error, based on the above and on other statements such as "...A key assumption of the Fellegi Sunter model is that observations from different column/comparisons are independent of one another".
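To make the independence concern concrete, here is a minimal sketch of how a Fellegi-Sunter style model combines evidence: the prior odds are multiplied by one Bayes factor (m / u) per comparison. The prior and the m and u values below are made-up numbers for illustration, not estimates from any real model. If name and name_formatted are treated as independent, the same name evidence is effectively counted twice, which is one way scores get pushed toward 0.99999.

```python
from math import prod

def match_probability(prior, bayes_factors):
    """Posterior match probability when comparisons are assumed independent."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * prod(bayes_factors)
    return posterior_odds / (1 + posterior_odds)

# Hypothetical numbers, for illustration only.
PRIOR = 1 / 10_000      # chance that a random record pair is a match
NAME_BF = 0.9 / 0.001   # m / u for an exact match on name, roughly 900
CITY_BF = 0.8 / 0.05    # m / u for an exact match on city, roughly 16

once = match_probability(PRIOR, [NAME_BF, CITY_BF])
# Adding name_formatted as a separate comparison counts the name evidence twice.
twice = match_probability(PRIOR, [NAME_BF, NAME_BF, CITY_BF])

print(f"name counted once:  {once:.4f}")
print(f"name counted twice: {twice:.4f}")
```

With these assumed numbers, the pair moves from a borderline score to a near-certain one purely because of the duplicated column, not because of any new information.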

Reading other discussions online, it seems the best approach is generally to create custom comparisons that merge these correlated features into a single comparison with different levels, e.g.:

Location comparison

(I don't have lat/long for every record.)

Distance < 0.1km OR exact match on street
Distance < 5km OR exact match on zip_code
Distance < 10km OR exact match on city
Exact match on state
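The level logic listed above can be sketched in plain Python to check the ordering and the handling of missing lat/long before translating it into a Splink comparison. This is not Splink API: `haversine_km`, `same`, and `location_level` are hypothetical helper names, and records are assumed to be dicts keyed by the column names used in this post.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def same(a, b, field):
    """Exact match on a field; a missing value never counts as a match."""
    return a.get(field) is not None and a.get(field) == b.get(field)

def location_level(a, b):
    """Return the first (strongest) location level that a record pair satisfies."""
    coords = [a.get("latitude"), a.get("longitude"), b.get("latitude"), b.get("longitude")]
    # Lat/long is not available for every record, so the distance may be unknown.
    dist = haversine_km(*coords) if None not in coords else None

    def within(km):
        return dist is not None and dist < km

    if within(0.1) or same(a, b, "street"):
        return "street"
    if within(5) or same(a, b, "zip_code"):
        return "zip_code"
    if within(10) or same(a, b, "city"):
        return "city"
    if same(a, b, "state"):
        return "state"
    return "else"

# Hypothetical records without lat/long.
a = {"street": "1 Main St", "city": "Springfield", "state": "IL", "zip_code": "62701"}
b = {"street": "9 Oak Ave", "city": "Springfield", "state": "IL", "zip_code": "62704"}
print(location_level(a, b))
```

Because each pair lands in exactly one level, street, zip_code, city, state and distance contribute a single piece of evidence rather than five correlated ones.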

Entity comparison

Exact match on first, middle, and last name
Exact match on first name, on substr(middle, 1, 1), and on last name
...
Jaro Winkler > 0.7 on name
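As a rough sketch, the entity levels above could be written as a single comparison in the dictionary layout that saved Splink model files use (the same shape as the splink_model.json linked in this post). The key names and the `_l` / `_r` column suffixes are assumptions to check against your Splink version, `jaro_winkler_similarity` assumes a DuckDB backend, and `both` is a hypothetical helper. The levels elided by "..." in the list are left out rather than guessed.

```python
def both(*cols):
    """SQL condition requiring an exact match on every listed column."""
    return " AND ".join(f'"{c}_l" = "{c}_r"' for c in cols)

entity_comparison = {
    "output_column_name": "entity",
    "comparison_levels": [
        {
            "sql_condition": '"name_l" IS NULL OR "name_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": both("first_name", "middle_name", "last_name"),
            "label_for_charts": "Exact first, middle, last",
        },
        {
            "sql_condition": both("first_name", "last_name")
            + ' AND substr("middle_name_l", 1, 1) = substr("middle_name_r", 1, 1)',
            "label_for_charts": "Exact first, last; same middle initial",
        },
        # Intermediate levels from the list above are intentionally omitted.
        {
            "sql_condition": 'jaro_winkler_similarity("name_l", "name_r") > 0.7',
            "label_for_charts": "Jaro-Winkler > 0.7 on name",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}

for level in entity_comparison["comparison_levels"]:
    print(level["label_for_charts"], "=>", level["sql_condition"])
```

Levels are evaluated top to bottom, so the strictest conditions come first and the fuzzy match on the full name only applies when the parsed name parts disagree.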

Is this the correct approach, or is there a better way to do this? How much correlation is OK? I know you have mentioned elsewhere that first and middle name are often correlated. Phone numbers and addresses are also correlated (here in the US). I want my model to be as good as possible, so I want to add lots of features, but that introduces a lot of correlation.

I know it's maybe lazy, but I am used to just throwing 100 columns into XGBoost and letting it figure things out. It's not very intuitive how to handle real-world data like this, and it's unclear how bad the correlation is. It would be nice to have more guidance, materials, and examples.

Very much appreciate this work and this library.

Thanks

[Update] I am looking at https://github.com/RobinL/uk_address_matcher/blob/main/uk_address_matcher/data/splink_model.json and I would assume there is a decent amount of correlation between, say, numeric_token_1 and original_address_concat. For example, 123 Main Street matching 123 Main Street #120 would also imply that 123 matches 123, and similarly for some of the other comparisons?
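One rough way to put a number on this kind of dependence is to take a labelled sample of pairs and compare how often two comparisons agree together with how often they would agree together if they were independent. The sketch below is not a Splink feature: `agreement_lift` is a hypothetical helper and the counts are invented for illustration. A ratio near 1 suggests the two comparisons are close to independent; a large ratio suggests they are largely repeating the same evidence.

```python
def agreement_lift(pairs):
    """Observed joint agreement rate divided by the rate expected under independence."""
    n = len(pairs)
    p_a = sum(a for a, _ in pairs) / n
    p_b = sum(b for _, b in pairs) / n
    p_both = sum(a and b for a, b in pairs) / n
    return p_both / (p_a * p_b)

# Hypothetical (address_agrees, number_agrees) outcomes for 1,000 non-matching pairs.
pairs = (
    [(True, True)] * 10       # full address agrees, so the house number agrees too
    + [(False, True)] * 40    # house number agrees by chance, address differs
    + [(False, False)] * 950  # neither agrees
)

print(f"lift: {agreement_lift(pairs):.1f}")
```

Running the same check separately on known matches and known non-matches is more informative than pooling them, since the independence assumption is made conditional on match status.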

Useful links

Related convos


Replies: 0 comments
