Detect potentially duplicated Individual records #82

Open
frizbog opened this issue Sep 27, 2015 · 0 comments

frizbog commented Sep 27, 2015

A good feature to add would be to have a way of evaluating how similar two Individual records are, and how much confidence one might have that they are the same person. Such a function could be used to:

  • Identify duplicate records in a GEDCOM
  • Identify potential merge points between GEDCOMs

Using equals() will work only for the most trivial cases. There needs to be some sort of similarity function that reflects how likely it is that two Individual records represent the same person.

First thoughts on such a similarity function:

  • Based on the number of matching items between two individuals, with high weight given to names, birth/death events, and SSN.
  • Names would also need their own "sameness" function. For example, a Soundex comparison may help match names that are spelled differently. Surnames should get more weight than given names, which should get more weight than middle names. Suffixes should probably get more weight than prefixes.
  • Dates would need some sort of sameness function. For example, the following values are increasingly similar to 12 Dec 1989:
    • Between 1985 and 1990
    • 1988-1990
    • 1989
    • Nov 1989
    • Dec 1989
    • 1 Dec 1989
    • 11 Dec 1989
    • 12 Dec 1989
  • Function should recursively consider parents and children
    • Two individuals with the same name and dob are more likely to be the same person if they have the same parents, or the same children
    • The similarity of the parents/children should not be weighted as heavily as the similarities of the individuals - perhaps at something like 1/4 of the weight of the individuals themselves?
    • Recursing more than one generation to grandparents or grandchildren will probably introduce more noise than desirable to such a function
  • Each matching fact should contribute some weighted number towards the similarity score
  • Data points with source citations should get more weight than those without
  • The similarity score would not be normalized: it is not a 0.0 to 1.0 probability of being the same person, but the larger the number, the more likely the two records are for the same person
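
As a rough illustration of the date "sameness" idea above, here is a minimal sketch in Java. It assumes each date value has already been parsed into an inclusive range of possible days; the `DateRange` type, its factory methods, and the scoring rule (one divided by the number of days in the smallest span covering both ranges) are all hypothetical choices made for illustration, not part of the library. Under that rule, the example values listed above come out in the same increasing order of similarity to 12 Dec 1989.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class DateSimilaritySketch {

    /** Hypothetical value type: an inclusive range of days a date value could mean. */
    static final class DateRange {
        final LocalDate start;
        final LocalDate end;

        DateRange(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        /** An exact date such as "12 Dec 1989". */
        static DateRange day(int year, int month, int day) {
            LocalDate d = LocalDate.of(year, month, day);
            return new DateRange(d, d);
        }

        /** A month-precision date such as "Dec 1989". */
        static DateRange month(int year, int month) {
            LocalDate first = LocalDate.of(year, month, 1);
            return new DateRange(first, first.withDayOfMonth(first.lengthOfMonth()));
        }

        /** A year or span of years such as "1989" or "1988-1990". */
        static DateRange years(int from, int to) {
            return new DateRange(LocalDate.of(from, 1, 1), LocalDate.of(to, 12, 31));
        }
    }

    /**
     * Assumed scoring rule: 1 / (days in the smallest span covering both ranges).
     * Two identical exact dates score 1.0; wider or more distant ranges score lower.
     */
    static double similarity(DateRange a, DateRange b) {
        LocalDate start = a.start.isBefore(b.start) ? a.start : b.start;
        LocalDate end = a.end.isAfter(b.end) ? a.end : b.end;
        long days = ChronoUnit.DAYS.between(start, end) + 1;
        return 1.0 / days;
    }

    public static void main(String[] args) {
        DateRange target = DateRange.day(1989, 12, 12);
        DateRange[] candidates = {
            DateRange.years(1985, 1990),
            DateRange.years(1988, 1990),
            DateRange.years(1989, 1989),
            DateRange.month(1989, 11),
            DateRange.month(1989, 12),
            DateRange.day(1989, 12, 1),
            DateRange.day(1989, 12, 11),
            DateRange.day(1989, 12, 12),
        };
        // Prints one score per candidate, in the order listed in the issue.
        for (DateRange c : candidates) {
            System.out.println(c.start + " .. " + c.end + " -> " + similarity(target, c));
        }
    }
}
```

A real implementation would also have to decide how such a score feeds into the overall weighted total, and how to handle approximate or otherwise imprecise date values; this sketch leaves both open.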

Potential problems of this sort of idea:

  • Any weighting of some data points over others will be subjective
  • How do you test for this? That is, how can you verify that two records for the same person come back with an appropriate score, and that two similar-looking records for two different people do as well?
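
One possible angle on the testing question, since the score is not meant to be normalized: rather than asserting an absolute value, a test could assert relative ordering, i.e. that a known duplicate pair scores higher than a known look-alike pair. The sketch below is only an illustration under stated assumptions. It uses a plain `Map<String, String>` as a stand-in for an Individual record, and the field names and weights are invented for the example.

```java
import java.util.Map;

class SimilarityOrderingSketch {

    // Hypothetical, subjective weights; not taken from the project.
    static final Map<String, Double> WEIGHTS =
            Map.of("surname", 4.0, "given", 3.0, "birth", 3.0);

    /** Sums the weight of every field that is present in both records and matches, ignoring case. */
    static double score(Map<String, String> a, Map<String, String> b) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : WEIGHTS.entrySet()) {
            String left = a.get(w.getKey());
            String right = b.get(w.getKey());
            if (left != null && left.equalsIgnoreCase(right)) {
                total += w.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> person =
                Map.of("surname", "Smith", "given", "John", "birth", "12 Dec 1989");
        Map<String, String> duplicate =
                Map.of("surname", "SMITH", "given", "John", "birth", "12 Dec 1989");
        Map<String, String> lookalike =
                Map.of("surname", "Smith", "given", "John", "birth", "3 Mar 1950");

        double same = score(person, duplicate);
        double different = score(person, lookalike);

        // The check is about ordering, not about any particular score value.
        if (same <= different) {
            throw new AssertionError("duplicate pair should outscore look-alike pair");
        }
        System.out.println("duplicate=" + same + ", look-alike=" + different);
    }
}
```

Ordering checks like this do not remove the subjectivity of the weights, but they stay valid when the weights are retuned, as long as the hand-labeled pairs still rank the way a human reviewer would expect.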