Link types when comparing new records to a pre-existing golden record #1502
Good question. I've been playing with this a little this evening to see what's possible. I still want to do some thinking, but here's a partial response, posted now partly to ensure I've actually understood what you're after. You're right that this use case is not explicitly envisaged by the current API. You're also right that it's possible to use blocking rules to achieve what I believe you're after. In the example below, I deliberately create data where you need to compare intra-new records in order to get the right answer:

```python
import duckdb
import pandas as pd
from splink.duckdb.comparison_library import exact_match, levenshtein_at_thresholds
from splink.duckdb.linker import DuckDBLinker
con = duckdb.connect(":memory:")
data_golden = [
{"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1980-01-01"},
{"unique_id": 2, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_golden = pd.DataFrame(data_golden)
data_new = [
{"unique_id": 3, "first_name": "John", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 4, "first_name": "Jon", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 5, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_new = pd.DataFrame(data_new)
br = "(not (l.source_dataset = 'golden' and r.source_dataset = 'golden'))"
settings = {
"link_type": "link_and_dedupe",
"blocking_rules_to_generate_predictions": [br],
"comparisons": [
exact_match("first_name"),
exact_match("surname"),
levenshtein_at_thresholds("dob", 1),
],
}
linker = DuckDBLinker(
[df_golden, df_new], settings, input_table_aliases=["golden", "new"], connection=con
)
df_predict = linker.predict()
df_predict.as_pandas_dataframe()
```

However, this approach is a bit inefficient in that it has to evaluate comparisons before discarding the ones where it sees it is comparing one golden record against another. There is a more efficient way that avoids these 'intra-golden' comparisons being created in the first place, but it's a bit fiddly/hacky.
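If it helps, you can sanity check what a candidate blocking rule generates before running `predict()`. A minimal sketch, reusing the `linker` and `br` defined above and assuming the Splink 3 helper `count_num_comparisons_from_blocking_rule` (the `"1=1"` rule is the usual trick for "no blocking at all"):

```python
# With 5 records there are 10 unordered pairs; excluding the single
# golden-golden pair (ids 1 and 2) should leave 9.
n_with_rule = linker.count_num_comparisons_from_blocking_rule(br)
n_unblocked = linker.count_num_comparisons_from_blocking_rule("1=1")
print(f"{n_with_rule} comparisons with the rule, {n_unblocked} without")
```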
Here's how to do it:

```python
import duckdb
import pandas as pd
from splink.duckdb.comparison_library import exact_match, levenshtein_at_thresholds
from splink.duckdb.linker import DuckDBLinker
con = duckdb.connect(":memory:")
data_golden = [
{"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1980-01-01"},
{"unique_id": 2, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_golden = pd.DataFrame(data_golden)
data_new = [
{"unique_id": 3, "first_name": "John", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 4, "first_name": "Jon", "surname": "Smith", "dob": "1980-01-02"},
{"unique_id": 5, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
]
df_new = pd.DataFrame(data_new)
# Step 1: link-only predictions between the golden and new records
# (link_only generates no intra-dataset comparisons)
settings = {
"link_type": "link_only",
"blocking_rules_to_generate_predictions": [],
"comparisons": [
exact_match("first_name"),
exact_match("surname"),
levenshtein_at_thresholds("dob", 1),
],
}
linker_link_only = DuckDBLinker(
[df_golden, df_new], settings, input_table_aliases=["golden", "new"], connection=con
)
df_1 = linker_link_only.predict(threshold_match_probability=0.5)
# Step 2: dedupe within the new records only
settings = {
"link_type": "link_and_dedupe",
"blocking_rules_to_generate_predictions": [],
"comparisons": [
exact_match("first_name"),
exact_match("surname"),
levenshtein_at_thresholds("dob", 1),
],
}
linker_dedupe = DuckDBLinker(df_new, settings, input_table_aliases=["new"], connection=con)
df_2 = linker_dedupe.predict(threshold_match_probability=0.5)
# Stack the golden-new links and the new-new dupes into one table of predictions
sql = f"""
select *
from {df_1.physical_name}
UNION ALL
select *
from {df_2.physical_name}
"""
df_union_all = linker_link_only.query_sql(sql)
df_union_all
# Register the combined table so it can be passed to clustering
df_predict = linker_link_only.register_table_predict(df_union_all, overwrite=True)
linker_link_only.debug_mode = True
linker_link_only.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.5
).as_pandas_dataframe().sort_values("cluster_id")
```
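One fiddly detail worth flagging: the `UNION ALL` only works if both prediction tables have the same columns in the same order, which I believe is why the dedupe step uses `link_and_dedupe` rather than `dedupe_only` (so the `source_dataset` columns appear in both outputs). A quick defensive check, as a sketch using the `df_1` and `df_2` objects from above:

```python
# Confirm the two prediction tables line up column-for-column before
# stacking them with UNION ALL.
cols_1 = list(df_1.as_pandas_dataframe().columns)
cols_2 = list(df_2.as_pandas_dataframe().columns)
assert cols_1 == cols_2, f"column mismatch: {cols_1} vs {cols_2}"
```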
TL;DR

I have been looking for a 'link_and_dedupe_right_but_not_left' way to use `splink`, as it is a use case I have personally encountered and think is very likely to have happened to others too. I have found a way to mimic this behaviour by altering my model's prediction blocking rules, and wondered if that was the original intent all along. The full story is given below.

Hi everyone,
I started using `splink` a couple of months ago at work and the experience has overall been exquisite. Still, some situations, or use cases if you will, make me wonder whether I fully understand how the `link_type` parameter of the `Linker` object's `settings_dict` argument is meant to be used.

When the project started, I was given an extract of all LOBs in scope (a single flat file, which I read first as a `spark.DataFrame` object and then converted to `pandas.DataFrame` format). As I was looking to find/eliminate dupes within every LOB, as well as link clients across different LOBs, I set the `link_type` parameter to `'dedupe_only'` -- something I think is counter-intuitive, as far as labels go, but is most likely the result of how my data was set up in the first place.

Once the clustering phase was over, I was able to build a "golden record" which "elected" a single client per cluster according to various business rules not worth mentioning here.
Some time later, a new batch of clients came in, which I was meant to reduce to the lowest number possible by: (a) removing dupes within the batch; and (b) linking to pre-existing clients within the golden record.
Now in possession of two lists, I reloaded the model `JSON` artifact and altered its `link_type` parameter to `'link_and_dedupe'` before instantiating a new `Linker` object with it. This has given me pause, and rightfully so, methinks.
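For concreteness, the kind of thing I mean, as a sketch using plain `json` rather than any particular Splink helper (`model.json` is a placeholder path, and the two dataframes stand in for my golden record and new batch):

```python
import json

from splink.duckdb.linker import DuckDBLinker

# Load the previously trained model and override only the link type;
# everything else (comparisons, trained m/u values) is reused as-is.
with open("model.json") as f:  # placeholder path
    settings = json.load(f)

settings["link_type"] = "link_and_dedupe"

linker = DuckDBLinker(
    [df_golden, df_new], settings, input_table_aliases=["golden", "new"]
)
```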
I am not interested in deduping the golden record, only the new batch. While I understand that, by virtue of reusing the `splink` model, no such dupes should be produced/identified, I do not think this line of reasoning truly addresses the issue: if you alter the `blocking_rules_to_generate_predictions` parameter, you might very well find new dupes.

I suppose this point can and should be criticized for being a case of "altering the model nonetheless", but my practical experience/reality is that it's almost impossible not to. The new batches of clients can (and do, in my case) be associated with new LOBs governed by their own sets of business rules, which makes relying on one or more client attributes an impossibility. For instance, I would only compare pairs of clients born no more than a year apart (thus accounting for the most frequent kinds of typos occurring in the month or day part of the dob -- most frequent in my organization, that is); something like the rule sketched below. That's not possible when the new batch has a dob missingness rate of 99.2%. Like I said, the same holds for other attributes, but you get the picture.
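For concreteness, the dob rule I have in mind looks roughly like this in DuckDB SQL (a sketch, not my production rule; `dob` is assumed to be a date-like column):

```python
# Compare only pairs of clients born no more than a year apart. With
# 99.2% of dob missing, this rule would silently drop almost every
# pair unless nulls are explicitly let through.
br_dob = """
    abs(datediff('day', l.dob::date, r.dob::date)) <= 366
    or l.dob is null
    or r.dob is null
"""
```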
I went ahead anyway, and then had to work backward: every time I found a cluster mapping to at least one record of the golden record, but not all, I would not consider its other records new clients, and would assign them the golden record's key. Conversely, when I found no link back to the golden record, I would "elect" a new client from the cluster, generate a new unique key, and append it to the golden record.
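In code, that backward pass looks roughly like this (a pandas sketch over hypothetical columns `cluster_id`, `unique_id`, and `source_dataset`, not my production code):

```python
import pandas as pd

def assign_keys(df_clusters: pd.DataFrame) -> pd.DataFrame:
    """One row per record, keyed back to the golden record where possible."""
    out = []
    for cluster_id, grp in df_clusters.groupby("cluster_id"):
        golden = grp[grp["source_dataset"] == "golden"]
        if len(golden) > 0:
            # Cluster touches the golden record: reuse its existing key
            # for every record in the cluster.
            key = golden["unique_id"].iloc[0]
        else:
            # Brand-new client: mint a fresh key for the whole cluster.
            key = f"new-{cluster_id}"
        out.append(grp.assign(client_key=key))
    return pd.concat(out, ignore_index=True)
```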
Anyhoo, it "worked", so why am I here writing and telling you guys about it? The answer is that I recently found a decent workaround -- so much so that I wondered if that was the intended design all along, and if not, whether something should be done about it.
I changed the prediction blocking rules (i.e., `blocking_rules_to_generate_predictions`) so that they would ignore pairs of clients from the golden record. In a gist, it fixed everything; a sketch of the rule follows.
So, there it is, the end of a so-long-I-must-immediately-apologize ramble. Any help, constructive criticism, or simply insightful comments are highly appreciated. Hopefully my story might help others experiencing similar challenges.
Cheers!
P.S. "LOB" means "Line-of-Business"; "dob" means "date of birth"; "model artifact" refers to a serialized predictive application, often in `pickle` format, but in this case, `JSON`.

P.P.S. Just in case my post doesn't convey the right tone: I am extremely appreciative of `splink` and its dedicated team of developers. Period.