A good feature to add would be to have a way of evaluating how similar two Individual records are, and how much confidence one might have that they are the same person. Such a function could be used to:
Identify duplicate records in a GEDCOM
Identify potential merge points between GEDCOMs
Using equals() will not do here except in the most trivial cases. There needs to be some sort of sameness or similarity function that reflects how likely it is that two Individual records are for the same person.
First thoughts on such a similarity function:
Based on the number of matching items between two individuals, with high weight being given to names, birth/death events, and SSN.
Names would also need some sort of "sameness" function. For example, a Soundex function may help with names. Surnames should get more weight than given names, which should have more weight than middle names. Suffixes should probably have more weight than prefixes.
Dates would need some sort of sameness function. For example, the following values are increasingly similar to 12 Dec 1989:
Between 1985 and 1990
1988-1990
1989
Nov 1989
Dec 1989
1 Dec 1989
11 Dec 1989
12 Dec 1989
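One way to get the ordering above is to treat every date value as a range of days and score by the worst-case distance between the two ranges. This is only a sketch: `DateRange` and its factory methods are hypothetical names, the `1 / (1 + days)` formula is an arbitrary choice, and parsing GEDCOM date strings into ranges is assumed to happen elsewhere.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: a date value expressed as an inclusive range of days.
final class DateRange {
    final LocalDate start;
    final LocalDate end;

    DateRange(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }

    // "12 Dec 1989" -> a single day
    static DateRange ofDay(int year, int month, int day) {
        LocalDate d = LocalDate.of(year, month, day);
        return new DateRange(d, d);
    }

    // "Dec 1989" -> first to last day of that month
    static DateRange ofMonth(int year, int month) {
        LocalDate first = LocalDate.of(year, month, 1);
        return new DateRange(first, first.withDayOfMonth(first.lengthOfMonth()));
    }

    // "1989" -> ofYears(1989, 1989); "Between 1985 and 1990" -> ofYears(1985, 1990)
    static DateRange ofYears(int fromYear, int toYear) {
        return new DateRange(LocalDate.of(fromYear, 1, 1), LocalDate.of(toYear, 12, 31));
    }

    // Largest possible gap, in days, between a day in this range and a day in the other.
    long worstCaseDays(DateRange other) {
        long thisEndToOtherStart = ChronoUnit.DAYS.between(other.start, end);
        long otherEndToThisStart = ChronoUnit.DAYS.between(start, other.end);
        return Math.max(thisEndToOtherStart, otherEndToThisStart);
    }

    // 1.0 for two identical single days, approaching 0 as the ranges widen or drift apart.
    double similarity(DateRange other) {
        return 1.0 / (1.0 + worstCaseDays(other));
    }
}
```

With 12 Dec 1989 as the target, the eight values listed above score in strictly increasing order under this measure, because a wide range is penalized even when it contains the target day.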
The function should recursively consider parents and children
Two individuals with the same name and dob are more likely to be the same person if they have the same parents, or the same children
The similarity of the parents/children should not be weighted as heavily as the similarities of the individuals - perhaps at something like 1/4 of the weight of the individuals themselves?
Recursing more than one generation to grandparents or grandchildren will probably introduce more noise than desirable to such a function
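A rough sketch of the one-generation idea follows. `Person` here is a stand-in type, not the library's Individual class, and the direct comparison is deliberately crude (exact name and birth date only). Relatives are compared with the direct score alone, so the recursion stops at parents and children, and their contribution is scaled by the 1/4 suggested above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an individual record.
final class Person {
    final String name;
    final String birthDate;
    final List<Person> parents = new ArrayList<>();
    final List<Person> children = new ArrayList<>();

    Person(String name, String birthDate) {
        this.name = name;
        this.birthDate = birthDate;
    }
}

final class FamilyAwareSimilarity {
    // Assumption taken from the proposal: relatives count for 1/4 of the individuals themselves.
    static final double RELATIVE_WEIGHT = 0.25;

    // Direct comparison of two records, ignoring their relatives.
    static double direct(Person a, Person b) {
        double score = 0.0;
        if (a.name.equalsIgnoreCase(b.name)) score += 1.0;
        if (a.birthDate.equals(b.birthDate)) score += 1.0;
        return score;
    }

    // For each person in xs, add the score of their best counterpart in ys.
    static double bestMatches(List<Person> xs, List<Person> ys) {
        double total = 0.0;
        for (Person x : xs) {
            double best = 0.0;
            for (Person y : ys) {
                best = Math.max(best, direct(x, y));
            }
            total += best;
        }
        return total;
    }

    // Direct score plus a down-weighted score for parents and children (one generation only).
    static double similarity(Person a, Person b) {
        double relatives = bestMatches(a.parents, b.parents) + bestMatches(a.children, b.children);
        return direct(a, b) + RELATIVE_WEIGHT * relatives;
    }
}
```

Note that `bestMatches` walks the first record's relatives, so the score is not guaranteed to be symmetric when the two records list different numbers of parents or children.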
Each matching fact should contribute some weighted number towards the similarity score
Data points with source citations should get more weight than those without
Similarity would not be a normalized number - not a 0.0 to 1.0 probability of being the same person, but the larger the number, the more likely to be the same person
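To make the name part of this concrete, here is a minimal sketch of a weighted, unnormalized name score. The weight values, the 0.5 factor for sound-alike matches, and the 1.5 bonus for cited facts are all placeholders; the proposal only fixes their ordering. The Soundex here is a small hand-written American Soundex so the example stands alone.

```java
import java.util.Locale;

final class NameSimilarity {
    // Placeholder weights: surname > given > middle, as proposed above.
    static final double SURNAME_WEIGHT = 4.0;
    static final double GIVEN_WEIGHT = 2.0;
    static final double MIDDLE_WEIGHT = 1.0;
    static final double SOUNDS_ALIKE_FACTOR = 0.5; // assumption: Soundex match counts half
    static final double CITED_FACTOR = 1.5;        // assumption: bonus when both facts are sourced

    // American Soundex: first letter plus three digits, e.g. "Smith" and "Smyth" share a code.
    static String soundex(String s) {
        String codes = "01230120022455012623010202"; // digit for each letter A..Z, '0' = uncoded
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char raw : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;      // skip anything that is not A..Z
            char code = codes.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw);                       // keep the first letter as-is
                prev = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') continue;    // H and W do not separate equal codes
            if (code != '0' && code != prev) out.append(code);
            prev = code;                               // vowels reset prev, so repeats across them count
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Score one name part: full weight if equal, partial weight if it only sounds alike.
    static double part(String a, String b, double weight, boolean bothCited) {
        if (a == null || b == null || a.isEmpty() || b.isEmpty()) return 0.0;
        double factor;
        if (a.equalsIgnoreCase(b)) {
            factor = 1.0;
        } else if (soundex(a).equals(soundex(b))) {
            factor = SOUNDS_ALIKE_FACTOR;
        } else {
            return 0.0;
        }
        return weight * factor * (bothCited ? CITED_FACTOR : 1.0);
    }

    // Names are passed as {surname, given, middle}; the result is a sum, not a probability.
    static double score(String[] a, String[] b, boolean bothCited) {
        return part(a[0], b[0], SURNAME_WEIGHT, bothCited)
             + part(a[1], b[1], GIVEN_WEIGHT, bothCited)
             + part(a[2], b[2], MIDDLE_WEIGHT, bothCited);
    }
}
```

Scores for other facts (birth, death, SSN) could be added to the same running total in the same way, each with its own weight.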
Potential problems of this sort of idea:
Any weighting of some data points over others will be subjective
How do you test for this? That is, how can you verify that two records for the same person come back with an appropriate score, and that two similar-looking records for two different people do not?