Time complexity of estimate_u_using_random_sampling in max_pairs / and low-frequency match levels #2535
Unanswered
econandrew asked this question in Q&A
Replies: 1 comment, 1 reply
-
How does estimate_u_using_random_sampling() scale with max_pairs? Intuitively I would expect linear scaling, but it seems much worse than that, so maybe I'm misunderstanding things. (FWIW, I'm comparing e.g. max_pairs of 1e9 vs 2e9 on DuckDB on an M4 Mac Pro, so it could certainly be resource constraints kicking in...)

[Motivated by:] I have certain match levels that I have added to my match rules based on ad hoc examination of linking errors. They make complete logical sense (say, accidental DD/MM swapping in date entry), but they don't actually occur that often. Typically that means I need a very high max_pairs to get u estimates for them.

On one hand it's nice to capture these levels - they're real, and they shed light on the record-generating process - but on the other hand, if I can't get u estimates even with a pretty high max_pairs, perhaps they aren't important enough to bother with. Any practical advice on this?

(P.S. Fantastic package - kudos for whatever it took to get this out into the world.)
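A minimal sketch of how one might check the scaling question empirically, assuming Splink 4's API (`Linker`, `DuckDBAPI`, `SettingsCreator`) - the comparisons, the `max_pairs` values, and the use of the bundled `fake_1000` demo dataset are illustrative only; on real data you would sweep the actual range of interest (e.g. 1e9 vs 2e9):

```python
# Time estimate_u_using_random_sampling() at a few max_pairs values and
# look at elapsed / max_pairs: a roughly constant ratio suggests linear
# scaling, while a rising ratio at large max_pairs points to spilling to
# disk or memory pressure rather than algorithmic cost.
import time

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

# Tiny demo dataset so the sketch runs as-is; far too small to stress
# the billion-pair regime from the question - substitute your own data.
df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[block_on("surname")],
)

for max_pairs in [1e6, 2e6, 4e6, 8e6]:
    # Fresh linker each time so earlier runs don't carry over trained values.
    linker = Linker(df, settings, db_api=DuckDBAPI())
    start = time.perf_counter()
    linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs)
    elapsed = time.perf_counter() - start
    print(f"max_pairs={max_pairs:.0e}  elapsed={elapsed:.1f}s  "
          f"ratio={elapsed / max_pairs:.2e}")
```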
-
It should usually be linear, but I guess there may be some point where it has to spill to disk or something similar that causes a nonlinearity. For highly unusual values, you could probably just set the u value manually. See #2379.
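A hedged sketch of the "set u manually" suggestion: Splink's settings dictionary lets you supply a `u_probability` on an individual comparison level, so a rare level like the DD/MM swap doesn't depend on an enormous `max_pairs`. The column names, the swap SQL (which assumes dates stored as `YYYY-MM-DD` strings), and the pinned 1e-6 value are assumptions for illustration, not a recipe from the thread or the docs:

```python
# Pin u_probability for a rare "DD/MM swapped" level directly in the
# settings, instead of estimating it via random sampling.
dob_comparison = {
    "output_column_name": "dob",
    "comparison_levels": [
        {
            "sql_condition": "dob_l IS NULL OR dob_r IS NULL",
            "label_for_charts": "null",
            "is_null_level": True,
        },
        {
            "sql_condition": "dob_l = dob_r",
            "label_for_charts": "exact match",
        },
        {
            # Day and month transposed, assuming 'YYYY-MM-DD' strings.
            "sql_condition": (
                "substr(dob_l, 1, 4) = substr(dob_r, 1, 4) "
                "AND substr(dob_l, 6, 2) = substr(dob_r, 9, 2) "
                "AND substr(dob_l, 9, 2) = substr(dob_r, 6, 2)"
            ),
            "label_for_charts": "dd/mm swapped",
            # Illustrative value, e.g. from domain knowledge or a
            # back-of-envelope calculation of how often two *non-matching*
            # records would land in this level by chance.
            "u_probability": 1e-6,
        },
        {"sql_condition": "ELSE", "label_for_charts": "all other"},
    ],
}

settings = {
    "link_type": "dedupe_only",
    "comparisons": [dob_comparison],
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
}
```

It's worth verifying against the current Splink docs how a manually supplied `u_probability` interacts with later training calls before relying on it.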