-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose top level stats in scans #227
Conversation
20355e6
to
f4445b6
Compare
kernel/src/scan/state.rs
Outdated
pub struct Stats { | ||
pub num_records: u64, | ||
#[serde(default = "default_true")] | ||
pub tight_bounds: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it help an engine to know about tight bounds? I guess it tells whether the logical rowcount could be less than the reported physical rowcount, but I wouldn't ever expect a deletion vector to suppress more than a small fraction of rows?
Also: Where do we draw the line on what stats to expose? I guess we stop here because the min/max/nullcount fields are (a) difficult to expose because schema-specific; and (b) arguably it's kernel's job to deal with them, not engine's?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I had just started looking at the stats stuff when I wrote this PR, and thought that tight_bounds
had implications for the value of this stat, but it doesn't. Rather, the presence of a deletion vector matters.
So I've removed that field, and added a has_vector() -> bool
method on DvInfo
, and noted what the stat means in a doc comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call it has_dv
or has_deletion_vector
to be self documenting?
Co-authored-by: Ryan Johnson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a "good enough" start -- we can always iterate later as use cases proliferate.
const DvInfo* dv_info, | ||
const CStringMap* partition_values) { | ||
printf("file: %.*s\n", (int)path.len, path.ptr); | ||
printf("file: %.*s (size: %" PRId64 ", num_records:", (int)path.len, path.ptr, size); | ||
if (stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: should this code also print dv info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering cffi-test.c "deprecated". As soon as #203 merges, I'll remove cffi-test.c and switch our CI over to run the more complete example there, which does look at the dv info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
EngineData
we produce during scansOnly supports
numRecords
for the moment, but I've kept this as a struct so we can add future stats without breaking apis