Choosing a dataset for training a link_only model #1814
-
hi! i am working on a use case for Splink where the datasets look something like this:
the proposed matching workflow looks something like this:
my question is about training the model in this setup. hopefully that question makes sense! mostly looking for a gut-check here, if possible, as i realize the details are a bit fuzzy. happy to provide any clarification that i can.
-
Good questions - you're right, this is not straightforward. In a perfect world you'd have a large dataset on both the left and right sides. For the situation you describe:
This means there isn't really a perfect answer. I would suggest the following approach:
Another reasonable approach could be to train the model in dedupe only mode, and manually set the m values using expert judgement. The accuracy of linkage tends not to be that sensitive to the m values being a bit 'wrong', i.e. so long as they're in the right ballpark the model still tends to do a pretty good job.

One thing to watch out for is that if you train using the 'wrong' link type (i.e. you use dedupe only, or link and dedupe), your probability_two_random_records_match will probably be wrong. However, in your case this value may be known in advance. For example, if your right-hand dataset represents a 'census' of (say) 1 million people, and you know that each left-hand record must be somewhere in the census, this value is simply 1/1 million.

Hope that helps - appreciate it isn't the simplest answer in the world.
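For concreteness, here's a minimal sketch of what manually-set m values could look like in a Splink 3-style settings dictionary. The column names, SQL conditions, and every m/u value below are illustrative placeholders (not recommendations), and the exact settings format may differ in your Splink version:

```python
# Sketch only: a link_only settings dictionary with hand-set m/u values.
# Column names and probabilities are made up for illustration.
settings = {
    "link_type": "link_only",
    # If the right-hand dataset is a 'census' of ~1 million people and every
    # left-hand record is known to appear in it, the prior is simply 1/1,000,000.
    "probability_two_random_records_match": 1 / 1_000_000,
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                {
                    "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "first_name_l = first_name_r",
                    "label_for_charts": "Exact match",
                    "m_probability": 0.9,   # expert-judgement 'ballpark' value
                    "u_probability": 0.01,
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.1,
                    "u_probability": 0.99,
                },
            ],
        },
        # ... further comparisons (surname, dob, postcode, ...) in the same style
    ],
}
```

The key point is that these hand-set values only need to be roughly right, so round numbers based on domain knowledge are usually good enough.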
-
hey @RobinL, thanks for the thorough response and explanation! this is super helpful. i actually have several years' worth of the left dataset to refer back to, so i think i can readily use that as training data!

i have a semi-related follow-up question about evaluation/QA. i don't have any ground truth data, but i would like some insight into how well the model is performing at the task and a somewhat objective way to compare different versions of the model. my plan for this evaluation is to take our real-world "right" dataset and then sample, duplicate, and corrupt some records in order to create a (synthetic) labeled "left" dataset. we would then apply the model trained as you describe above to this synthetic dataset, using the built-in QA tools to evaluate various model performance metrics (ROC curve, accuracy, precision-recall). the hope is this will give us useful information that we can use to tweak and tune subsequent trainings of the model, blocking rules, comparison levels, etc. do you think this approach to evaluation makes sense at a high level?

if it's helpful, false positives (matches to the incorrect individual in the left dataset) are a particularly adverse outcome in my use case and could have serious organizational consequences. so appreciate your help and guidance!
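for reference, a rough pandas sketch of the sample/duplicate/corrupt idea described above. the file path, column names, sample sizes, and corruption rule are all hypothetical placeholders, not part of any Splink API:

```python
# Sketch only: build a synthetic, labelled 'left' dataset from the real 'right' dataset.
import random
import pandas as pd

def corrupt(value: str) -> str:
    """Introduce a simple typo by dropping one character (if long enough)."""
    if not isinstance(value, str) or len(value) < 3:
        return value
    i = random.randrange(len(value))
    return value[:i] + value[i + 1:]

right_df = pd.read_parquet("right_dataset.parquet")  # hypothetical path

# Sample records from the 'right' dataset to act as the ground-truth matches,
# keeping the original id so predictions can be scored later.
sampled = right_df.sample(n=1_000, random_state=42).copy()
sampled["true_match_id"] = sampled["unique_id"]

# Corrupt a subset of fields to simulate real-world noise in the 'left' dataset.
for col in ["first_name", "surname", "postcode"]:
    mask = sampled.sample(frac=0.3, random_state=1).index
    sampled.loc[mask, col] = sampled.loc[mask, col].map(corrupt)

synthetic_left = sampled  # now plays the role of a labelled 'left' dataset
```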
-
Hi @RobinL, I have created a link_only model. While trying to evaluate the model's accuracy, I realised that I need good-quality ground-truth labelled data. My source and target datasets are < 50k records, but my labelled dataset is very small (200 records) and doesn't cover all entities in the source and target. What is the best technique to create quality ground truth labels in this scenario?