Link types when comparing new records to a pre-existing golden record #1502
Good question. I've been playing with this a little this evening to see what's possible. I still want to do some thinking, but here's a partial response, posted now partly to ensure I've actually understood what you're after. You're right that this use case is not explicitly envisaged by the current API. You're also right that it's possible to use blocking rules to achieve what I believe you're after. In the example below, I deliberately create data where you need to compare intra-new records in order to get the right answer:

```python
import duckdb
import pandas as pd
from splink.duckdb.comparison_library import exact_match, levenshtein_at_thresholds
from splink.duckdb.linker import DuckDBLinker
con = duckdb.connect(":memory:")
data_golden = [
{"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1980-01-01"},
{"unique_id": 2, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_golden = pd.DataFrame(data_golden)
data_new = [
{"unique_id": 3, "first_name": "John", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 4, "first_name": "Jon", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 5, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_new = pd.DataFrame(data_new)
br = "(not (l.source_dataset = 'golden' and r.source_dataset = 'golden'))"
settings = {
"link_type": "link_and_dedupe",
"blocking_rules_to_generate_predictions": [br],
"comparisons": [
exact_match("first_name"),
exact_match("surname"),
levenshtein_at_thresholds("dob", 1),
],
}
linker = DuckDBLinker(
[df_golden, df_new], settings, input_table_aliases=["golden", "new"], connection=con
)
df_predict = linker.predict()
df_predict.as_pandas_dataframe()
```

However, this approach is a bit inefficient in that it has to evaluate comparisons before discarding the ones where it sees it is comparing one golden record against another. There is a more efficient way that avoids these 'intra-golden' comparisons being created in the first place, but it's a bit fiddly/hacky.
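If it helps, you can sanity check what a candidate blocking rule generates before running `predict()`. A minimal sketch, reusing the `linker` and `br` defined above and assuming the Splink 3 helper `count_num_comparisons_from_blocking_rule` (the `"1=1"` rule is the usual trick for "no blocking at all"):

```python
# With 5 records there are 10 unordered pairs; excluding the single
# golden-golden pair (ids 1 and 2) should leave 9.
n_with_rule = linker.count_num_comparisons_from_blocking_rule(br)
n_unblocked = linker.count_num_comparisons_from_blocking_rule("1=1")
print(f"{n_with_rule} comparisons with the rule, {n_unblocked} without")
```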
Here's how to do it:

```python
import duckdb
import pandas as pd
from splink.duckdb.comparison_library import exact_match, levenshtein_at_thresholds
from splink.duckdb.linker import DuckDBLinker
con = duckdb.connect(":memory:")
data_golden = [
{"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1980-01-01"},
{"unique_id": 2, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_golden = pd.DataFrame(data_golden)
data_new = [
{"unique_id": 3, "first_name": "John", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 4, "first_name": "Jon", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 5, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_new = pd.DataFrame(data_new)
# Step 1: link-only predictions between the golden and new records
# (link_only generates no intra-dataset comparisons)
settings = {
"link_type": "link_only",
"blocking_rules_to_generate_predictions": [],
"comparisons": [
exact_match("first_name"),
exact_match("surname"),
levenshtein_at_thresholds("dob", 1),
],
}
linker_link_only = DuckDBLinker(
[df_golden, df_new], settings, input_table_aliases=["golden", "new"], connection=con
)
df_1 = linker_link_only.predict(threshold_match_probability=0.5)
# Step 2: dedupe within the new records only
settings = {
"link_type": "link_and_dedupe",
"blocking_rules_to_generate_predictions": [],
"comparisons": [
exact_match("first_name"),
exact_match("surname"),
levenshtein_at_thresholds("dob", 1),
],
}
linker_dedupe = DuckDBLinker(df_new, settings, input_table_aliases=["new"], connection=con)
df_2 = linker_dedupe.predict(threshold_match_probability=0.5)
# Stack the golden-new links and the new-new dupes into one table of predictions
sql = f"""
select *
from {df_1.physical_name}
UNION ALL
select *
from {df_2.physical_name}
"""
df_union_all = linker_link_only.query_sql(sql)
df_union_all
# Register the combined table so it can be passed to clustering
df_predict = linker_link_only.register_table_predict(df_union_all, overwrite=True)
linker_link_only.debug_mode = True
linker_link_only.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.5
).as_pandas_dataframe().sort_values("cluster_id")
```
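One fiddly detail worth flagging: the `UNION ALL` only works if both prediction tables have the same columns in the same order, which I believe is why the dedupe step uses `link_and_dedupe` rather than `dedupe_only` (so the `source_dataset` columns appear in both outputs). A quick defensive check, as a sketch using the `df_1` and `df_2` objects from above:

```python
# Confirm the two prediction tables line up column-for-column before
# stacking them with UNION ALL.
cols_1 = list(df_1.as_pandas_dataframe().columns)
cols_2 = list(df_2.as_pandas_dataframe().columns)
assert cols_1 == cols_2, f"column mismatch: {cols_1} vs {cols_2}"
```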
TL;DR

I have been looking for a 'link_and_dedupe_right_but_not_left' way to use `splink`, as it is a use case I have personally encountered and think is very likely to have happened to others too. I have found a way to mimic this behaviour by altering my model's prediction blocking rules, and wondered if that was the original intent all along. The full story is given below.

Hi everyone,
I started using `splink` a couple of months ago at work and the experience has overall been exquisite. Still, some situations, or use cases if you will, make me wonder whether I fully understand how the `link_type` parameter of the `Linker` object's `settings_dict` argument is meant to be used.

When the project started, I was given an extract of all LOBs in scope (a single flat file, which I read first as a `spark.DataFrame` object and then converted to `pandas.DataFrame` format). As I was looking to find/eliminate dupes within every LOB, as well as link clients across different LOBs, I set the `link_type` parameter to `'dedupe_only'` -- something I think is counter-intuitive, as far as labels go, but is most likely the result of how my data was set up in the first place.

Once the clustering phase was over, I was able to build a "golden record" which "elected" a single client per cluster according to various business rules not worth mentioning here.
Some time later, a new batch of clients came in, which I was meant to reduce to the lowest number possible by: (a) removing dupes within the batch; and (b) linking to pre-existing clients within the golden record.
Now in possession of two lists, I reloaded the model `JSON` artifact and altered its `link_type` parameter to `'link_and_dedupe'` before instantiating a new `Linker` object with it. This has given me pause, and rightfully so, methinks.
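For concreteness, the kind of thing I mean, as a sketch using plain `json` rather than any particular Splink helper (`model.json` is a placeholder path, and the two dataframes stand in for my golden record and new batch):

```python
import json

from splink.duckdb.linker import DuckDBLinker

# Load the previously trained model and override only the link type;
# everything else (comparisons, trained m/u values) is reused as-is.
with open("model.json") as f:  # placeholder path
    settings = json.load(f)

settings["link_type"] = "link_and_dedupe"

linker = DuckDBLinker(
    [df_golden, df_new], settings, input_table_aliases=["golden", "new"]
)
```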
I am not interested in deduping the golden record, only the new batch. While I understand that, by virtue of reusing the `splink` model, no such dupes should be produced/identified, I do not think this line of reasoning truly addresses the issue: if you alter the `blocking_rules_to_generate_predictions` parameter, you might very well find new dupes.

I suppose this point can and should be criticized for being a case of "altering the model nonetheless", but my practical experience/reality is that it's almost impossible not to. The new batches of clients can (and do, in my case) be associated with new LOBs governed by their own sets of business rules, which makes relying on one or more client attributes an impossibility. For instance, I would only compare pairs of clients born no more than a year apart (thus accounting for the most frequent kinds of typos occurring in the month or day part of the dob -- most frequent in my organization, that is); something like the rule sketched below. That's not possible when the new batch has a dob missingness rate of 99.2%. Like I said, the same holds for other attributes, but you get the picture.
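For concreteness, the dob rule I have in mind looks roughly like this in DuckDB SQL (a sketch, not my production rule; `dob` is assumed to be a date-like column):

```python
# Compare only pairs of clients born no more than a year apart. With
# 99.2% of dob missing, this rule would silently drop almost every
# pair unless nulls are explicitly let through.
br_dob = """
    abs(datediff('day', l.dob::date, r.dob::date)) <= 366
    or l.dob is null
    or r.dob is null
"""
```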
I went ahead anyway, and then had to work backward: every time I found a cluster mapping to at least one record of the golden record, but not all, I would not consider its other records new clients, and would assign them the golden record's key. Conversely, when I found no link back to the golden record, I would "elect" a new client from the cluster, generate a new unique key, and append it to the golden record.
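In code, that backward pass looks roughly like this (a pandas sketch over hypothetical columns `cluster_id`, `unique_id`, and `source_dataset`, not my production code):

```python
import pandas as pd

def assign_keys(df_clusters: pd.DataFrame) -> pd.DataFrame:
    """One row per record, keyed back to the golden record where possible."""
    out = []
    for cluster_id, grp in df_clusters.groupby("cluster_id"):
        golden = grp[grp["source_dataset"] == "golden"]
        if len(golden) > 0:
            # Cluster touches the golden record: reuse its existing key
            # for every record in the cluster.
            key = golden["unique_id"].iloc[0]
        else:
            # Brand-new client: mint a fresh key for the whole cluster.
            key = f"new-{cluster_id}"
        out.append(grp.assign(client_key=key))
    return pd.concat(out, ignore_index=True)
```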
Anyhoo, it "worked", so why am I here writing and telling you guys about it? The answer is that I recently found a decent workaround -- so much so that I wondered if that was the intended design all along, and if not, whether something should be done about it.
I changed the prediction blocking rules (i.e., `blocking_rules_to_generate_predictions`) so that they would ignore pairs of clients from the golden record. In a gist, it fixed everything; a sketch of the rule follows.
So, there it is, the end of a so-long-I-must-immediately-apologize ramble. Any help, constructive criticism, or simply insightful comments are highly appreciated. Hopefully my story might help others experiencing similar challenges.
Cheers!
P.S. "LOB" means "Line-of-Business"; "dob" means "date of birth"; "model artifact" refers to a serialized predictive application, often in `pickle` format, but in this case, `JSON`.

P.P.S. Just in case my post doesn't convey the right tone: I am extremely appreciative of `splink` and its dedicated team of developers. Period.