Choosing a dataset for training a link_only model #1814
-
hi! i am working on a use case for Splink where the datasets look something like this:
the proposed matching workflow looks something like this:
my question is about training the model in this setup. hopefully that question makes sense! mostly looking for a gut-check here, if possible, as i realize the details are a bit fuzzy. happy to provide any clarification that i can.
-
Good questions - you're right, this is not straightforward. In a perfect world you'd have a large dataset on both the left and right sides. For the situation you describe:
This means there isn't really a perfect answer. I would suggest the following approach:
Another reasonable approach could be to train the model in dedupe only mode, and manually set the m values using expert judgement. The accuracy of linkage tends not to be that sensitive to the m values being a bit 'wrong', i.e. so long as they're in the right ballpark the model still tends to do a pretty good job.

One thing to watch out for is that if you train using the 'wrong' link type (i.e. you use dedupe only, or link and dedupe), your probability_two_random_records_match will probably be wrong. However, in your case this value may be known in advance. For example, if your right-hand dataset represents a 'census' of (say) 1 million people, and you know that each left-hand record must be somewhere in the census, this value is simply 1/1 million.

Hope that helps - appreciate it isn't the simplest answer in the world.
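For concreteness, here's a minimal sketch of what manually-set m values could look like in a Splink 3-style settings dictionary. The column names, SQL conditions, and every m/u value below are illustrative placeholders (not recommendations), and the exact settings format may differ in your Splink version:

```python
# Sketch only: a link_only settings dictionary with hand-set m/u values.
# Column names and probabilities are made up for illustration.
settings = {
    "link_type": "link_only",
    # If the right-hand dataset is a 'census' of ~1 million people and every
    # left-hand record is known to appear in it, the prior is simply 1/1,000,000.
    "probability_two_random_records_match": 1 / 1_000_000,
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                {
                    "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "first_name_l = first_name_r",
                    "label_for_charts": "Exact match",
                    "m_probability": 0.9,   # expert-judgement 'ballpark' value
                    "u_probability": 0.01,
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.1,
                    "u_probability": 0.99,
                },
            ],
        },
        # ... further comparisons (surname, dob, postcode, ...) in the same style
    ],
}
```

The key point is that these hand-set values only need to be roughly right, so round numbers based on domain knowledge are usually good enough.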
-
hey @RobinL, thanks for the thorough response and explanation! this is super helpful. i actually have several years' worth of the left dataset to refer back to, so i think i can readily use that as training data!

i have a semi-related follow-up question about evaluation/QA. i don't have any ground truth data, but i would like some insight into how well the model is performing at the task and a somewhat objective way to compare different versions of the model. my plan for this evaluation is to take our real-world "right" dataset and then sample, duplicate, and corrupt some records in order to create a (synthetic) labeled "left" dataset. we would then apply the model trained as you describe above to this synthetic dataset, using the built-in QA tools to evaluate various model performance metrics (ROC curve, accuracy, precision-recall). the hope is this will give us useful information that we can use to tweak and tune subsequent trainings of the model, blocking rules, comparison levels, etc. do you think this approach to evaluation makes sense at a high level?

if it's helpful, false positives (matches to the incorrect individual in the left dataset) are a particularly adverse outcome in my use case and could have serious organizational consequences. so appreciate your help and guidance!
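for reference, a rough pandas sketch of the sample/duplicate/corrupt idea described above. the file path, column names, sample sizes, and corruption rule are all hypothetical placeholders, not part of any Splink API:

```python
# Sketch only: build a synthetic, labelled 'left' dataset from the real 'right' dataset.
import random
import pandas as pd

def corrupt(value: str) -> str:
    """Introduce a simple typo by dropping one character (if long enough)."""
    if not isinstance(value, str) or len(value) < 3:
        return value
    i = random.randrange(len(value))
    return value[:i] + value[i + 1:]

right_df = pd.read_parquet("right_dataset.parquet")  # hypothetical path

# Sample records from the 'right' dataset to act as the ground-truth matches,
# keeping the original id so predictions can be scored later.
sampled = right_df.sample(n=1_000, random_state=42).copy()
sampled["true_match_id"] = sampled["unique_id"]

# Corrupt a subset of fields to simulate real-world noise in the 'left' dataset.
for col in ["first_name", "surname", "postcode"]:
    mask = sampled.sample(frac=0.3, random_state=1).index
    sampled.loc[mask, col] = sampled.loc[mask, col].map(corrupt)

synthetic_left = sampled  # now plays the role of a labelled 'left' dataset
```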
-
Hi @RobinL, I have created a link_only model. While trying to evaluate the model's accuracy, I realised that I need good-quality ground-truth labelled data. My source and target datasets are < 50k records, but my labelled dataset is very small (200 records) and doesn't cover all entities in the source and target. What is the best technique to create quality ground truth labels in this scenario?