Time complexity of estimate_u_using_random_sampling in max_pairs / and low-frequency match levels #2535
Unanswered
econandrew asked this question in Q&A
Replies: 1 comment, 1 reply
-
How does estimate_u_using_random_sampling() scale with max_pairs? Intuitively I would expect linear scaling, but it seems much worse than that, so maybe I'm misunderstanding things. (FWIW, I'm comparing e.g. max_pairs of 1e9 vs 2e9 on DuckDB on an M4 Mac Pro, so it could certainly be resource constraints kicking in...)

[Motivated by:] I have certain match levels that I have added to my match rules based on ad hoc examination of linking errors. They make complete logical sense (say, accidental DD/MM swapping in date entry), but they don't actually occur that often. Typically that means I need a very high max_pairs to get u estimates for them.

On one hand it's nice to capture these levels - they're real, and they shed light on the record-generating process - but on the other hand, if I can't get u estimates even with a pretty high max_pairs, perhaps they aren't important enough to bother with. Any practical advice on this?

(P.S. Fantastic package - kudos for whatever it took to get this out into the world.)
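A minimal sketch of how one might check the scaling question empirically, assuming Splink 4's API (`Linker`, `DuckDBAPI`, `SettingsCreator`) - the comparisons, the `max_pairs` values, and the use of the bundled `fake_1000` demo dataset are illustrative only; on real data you would sweep the actual range of interest (e.g. 1e9 vs 2e9):

```python
# Time estimate_u_using_random_sampling() at a few max_pairs values and
# look at elapsed / max_pairs: a roughly constant ratio suggests linear
# scaling, while a rising ratio at large max_pairs points to spilling to
# disk or memory pressure rather than algorithmic cost.
import time

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

# Tiny demo dataset so the sketch runs as-is; far too small to stress
# the billion-pair regime from the question - substitute your own data.
df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[block_on("surname")],
)

for max_pairs in [1e6, 2e6, 4e6, 8e6]:
    # Fresh linker each time so earlier runs don't carry over trained values.
    linker = Linker(df, settings, db_api=DuckDBAPI())
    start = time.perf_counter()
    linker.training.estimate_u_using_random_sampling(max_pairs=max_pairs)
    elapsed = time.perf_counter() - start
    print(f"max_pairs={max_pairs:.0e}  elapsed={elapsed:.1f}s  "
          f"ratio={elapsed / max_pairs:.2e}")
```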
-
It should usually be linear, but I guess there may be some point where it has to spill to disk or something similar that causes a nonlinearity. For highly unusual values, you could probably just set the u value manually. See #2379.
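A hedged sketch of the "set u manually" suggestion: Splink's settings dictionary lets you supply a `u_probability` on an individual comparison level, so a rare level like the DD/MM swap doesn't depend on an enormous `max_pairs`. The column names, the swap SQL (which assumes dates stored as `YYYY-MM-DD` strings), and the pinned 1e-6 value are assumptions for illustration, not a recipe from the thread or the docs:

```python
# Pin u_probability for a rare "DD/MM swapped" level directly in the
# settings, instead of estimating it via random sampling.
dob_comparison = {
    "output_column_name": "dob",
    "comparison_levels": [
        {
            "sql_condition": "dob_l IS NULL OR dob_r IS NULL",
            "label_for_charts": "null",
            "is_null_level": True,
        },
        {
            "sql_condition": "dob_l = dob_r",
            "label_for_charts": "exact match",
        },
        {
            # Day and month transposed, assuming 'YYYY-MM-DD' strings.
            "sql_condition": (
                "substr(dob_l, 1, 4) = substr(dob_r, 1, 4) "
                "AND substr(dob_l, 6, 2) = substr(dob_r, 9, 2) "
                "AND substr(dob_l, 9, 2) = substr(dob_r, 6, 2)"
            ),
            "label_for_charts": "dd/mm swapped",
            # Illustrative value, e.g. from domain knowledge or a
            # back-of-envelope calculation of how often two *non-matching*
            # records would land in this level by chance.
            "u_probability": 1e-6,
        },
        {"sql_condition": "ELSE", "label_for_charts": "all other"},
    ],
}

settings = {
    "link_type": "dedupe_only",
    "comparisons": [dob_comparison],
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
}
```

It's worth verifying against the current Splink docs how a manually supplied `u_probability` interacts with later training calls before relying on it.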