Detect potentially duplicated Individual records #82

frizbog · 2015-09-27T19:46:16Z

A useful feature would be a way of evaluating how similar two Individual records are, and how confident one can be that they represent the same person. Such a function could be used to:

Identify duplicate records in a GEDCOM
Identify potential merge points between GEDCOMs

Using equals() will not do here except in the most trivial cases. There needs to be some sort of sameness or similarity function that reflects how likely it is that two Individual records are for the same person.

First thoughts on such a similarity function:

Based on the number of matching items between two individuals, with high weight being given to names, birth/death events, and SSN.
Names would also need some sort of "sameness" function. For example, a Soundex function may help with names. Surnames should get more weight than given names, which should have more weight than middle names. Suffixes should probably have more weight than prefixes.
Dates would need some sort of sameness function. For example, the following values are increasingly similar to 12 Dec 1989:
- Between 1985 and 1990
- 1988-1990
- 1989
- Nov 1989
- Dec 1989
- 1 Dec 1989
- 11 Dec 1989
- 12 Dec 1989
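One way the ordering in the list above could be reproduced is to treat every date value as an inclusive range of days and penalize both the width of the range and its distance from the target date. This is only a sketch under assumptions: `DateSimilarity`, `DateRange`, and the `1 / (1 + span + gap)` formula are hypothetical and not part of the library, and parsing of GEDCOM date strings is left out.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class DateSimilarity {

    /** Hypothetical helper: an imprecise date as an inclusive range of days. */
    static final class DateRange {
        final LocalDate start;
        final LocalDate end;

        DateRange(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        /** A fully specified date, e.g. 12 Dec 1989. */
        static DateRange exact(int year, int month, int day) {
            LocalDate d = LocalDate.of(year, month, day);
            return new DateRange(d, d);
        }

        /** A month with no day, e.g. Dec 1989. */
        static DateRange month(int year, int month) {
            LocalDate first = LocalDate.of(year, month, 1);
            return new DateRange(first, first.withDayOfMonth(first.lengthOfMonth()));
        }

        /** One or more whole years, e.g. 1989 or 1988-1990. */
        static DateRange years(int from, int to) {
            return new DateRange(LocalDate.of(from, 1, 1), LocalDate.of(to, 12, 31));
        }
    }

    /**
     * Returns a value in (0, 1]: 1.0 for an exact match, smaller values
     * for wider ranges and for ranges further away from the target.
     */
    static double similarity(DateRange candidate, LocalDate target) {
        // Width of the candidate range in days (0 for an exact date).
        long span = ChronoUnit.DAYS.between(candidate.start, candidate.end);
        // Distance in days from the target to the range (0 if the target lies inside it).
        long gap = 0;
        if (target.isBefore(candidate.start)) {
            gap = ChronoUnit.DAYS.between(target, candidate.start);
        } else if (target.isAfter(candidate.end)) {
            gap = ChronoUnit.DAYS.between(candidate.end, target);
        }
        return 1.0 / (1 + span + gap);
    }

    public static void main(String[] args) {
        LocalDate target = LocalDate.of(1989, 12, 12);
        System.out.println(similarity(DateRange.years(1985, 1990), target));
        System.out.println(similarity(DateRange.month(1989, 12), target));
        System.out.println(similarity(DateRange.exact(1989, 12, 12), target));
    }
}
```

The per-date value here happens to fall between 0 and 1, but it is meant to be multiplied by whatever weight a date match receives in the overall score, not read as a probability.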
Function should recursively consider parents and children
- Two individuals with the same name and dob are more likely to be the same person if they have the same parents, or the same children
- The similarity of the parents/children should not be weighted as heavily as the similarities of the individuals - perhaps at something like 1/4 of the weight of the individuals themselves?
- Recursing more than one generation to grandparents or grandchildren will probably introduce more noise than desirable to such a function
Each matching fact should contribute some weighted number towards the similarity score
Data points with source citations should get more weight than those without
Similarity would not be a normalized number - not a 0.0 to 1.0 probability of being the same person, but the larger the number, the more likely to be the same person
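The ideas above could be combined roughly as follows. This is a minimal sketch, not a proposed implementation: `Person` is a hypothetical stand-in for an Individual record, the weights are arbitrary placeholders, dates are compared as plain strings, and source citations, SSN, prefixes, and suffixes are not modeled. Names are compared with a small American Soundex routine, and parents and children contribute at 1/4 weight, one generation only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SimilaritySketch {

    /** Hypothetical stand-in for an Individual record; not the library's class. */
    static final class Person {
        final String surname, givenName, birth, death;
        final List<Person> parents = new ArrayList<>();
        final List<Person> children = new ArrayList<>();

        Person(String surname, String givenName, String birth, String death) {
            this.surname = surname;
            this.givenName = givenName;
            this.birth = birth;
            this.death = death;
        }
    }

    // Placeholder weights (assumptions): surnames outweigh given names.
    static final double SURNAME = 4, GIVEN = 2, BIRTH = 3, DEATH = 3;
    static final double RELATIVE_FACTOR = 0.25;

    /** American Soundex: first letter plus three digits, e.g. "Robert" gives "R163". */
    static String soundex(String name) {
        String codes = "01230120022455012623010202"; // digit for A..Z, '0' = not coded
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;       // ignore non-letters
            char code = codes.charAt(c - 'A');
            if (out.length() == 0) {                // keep the first letter as-is
                out.append(c);
                prev = code;
                continue;
            }
            if (c == 'H' || c == 'W') continue;     // H and W do not separate equal codes
            if (code != '0' && code != prev) out.append(code);
            prev = code;
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    static boolean soundsAlike(String a, String b) {
        if (a == null || b == null || a.trim().isEmpty() || b.trim().isEmpty()) return false;
        return soundex(a).equals(soundex(b));
    }

    static boolean same(String a, String b) {
        return a != null && a.equals(b);
    }

    /** Score from the two records' own facts, ignoring relatives. */
    static double ownScore(Person a, Person b) {
        double s = 0;
        if (soundsAlike(a.surname, b.surname)) s += SURNAME;
        if (soundsAlike(a.givenName, b.givenName)) s += GIVEN;
        if (same(a.birth, b.birth)) s += BIRTH;
        if (same(a.death, b.death)) s += DEATH;
        return s;
    }

    /** For each relative in xs, add its best ownScore against any relative in ys. */
    static double bestMatches(List<Person> xs, List<Person> ys) {
        double total = 0;
        for (Person x : xs) {
            double best = 0;
            for (Person y : ys) best = Math.max(best, ownScore(x, y));
            total += best;
        }
        return total;
    }

    /** Unnormalized score: larger means more likely the same person. */
    static double score(Person a, Person b) {
        return ownScore(a, b) + RELATIVE_FACTOR
                * (bestMatches(a.parents, b.parents) + bestMatches(a.children, b.children));
    }
}
```

Because relatives are compared with `ownScore` and not with `score`, the comparison stops after one generation, in line with the concern about noise from grandparents and grandchildren.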

Potential problems of this sort of idea:

Any weighting of some data points over others will be subjective
How do you test for this? That is, how can you verify that two records for the same person come back with an appropriate score, and that two similar-looking records for two different people do not?

frizbog added the enhancement label Sep 27, 2015
