rework segment metadata query "size" analysis #7124

Open
clintropolis opened this issue Feb 22, 2019 · 4 comments

@clintropolis
Member

Motivation

Currently, the segment metadata query has a size analysis type that the documentation describes as "estimated byte size for the segment columns if they were stored in a flat format" and "estimated total segment byte size if it was stored in a flat format", but as far as I can tell this doesn't have any practical value. I think this value is confusing and should be replaced with a size value that reports the actual segment and column sizes for mapped segments, or the estimated in-memory size for incremental segments.

This would make the size analysis useful for finding which columns are the heavy hitters in terms of overall segment size, observing fluctuations in segment size over time, and aiding capacity planning.

Alternatively, if this doesn't seem useful enough, or seems like more trouble than it's worth, I would instead propose that the value just be removed completely, because it's confusing and expensive to run.

Proposed changes

Making the value meaningful likely doesn't take much effort. For mapped columns, it would just mean preserving the byte size of each column when segments are initially loaded and making it accessible through BaseColumn or a related interface (if it isn't already and I just missed it). For incremental indexes, it would involve modifying the size functions to report the estimated in-memory size of the values where necessary. No surface-level changes would be necessary for this approach, but we would need to call out in the release notes that the meaning of the value has changed, and update the documentation accordingly.
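
As a rough illustration of the mapped-column side (a sketch only, not existing Druid API; the method name and default below are assumptions):

```java
import java.io.Closeable;

// Sketch only: getSerializedSizeBytes() is a hypothetical addition, not an
// existing method on Druid's BaseColumn interface.
public interface BaseColumn extends Closeable
{
  // ... existing accessor methods elided ...

  /**
   * Byte size of this column as stored in the segment, captured when the
   * column was initially mapped. Returns -1 if the size is unknown.
   */
  default long getSerializedSizeBytes()
  {
    return -1;
  }
}
```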

Rationale

We already need to know this size information to properly map columns from the 'smoosh' file, so preserving it and offering it up for segment metadata queries should be rather straightforward.
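
As a minimal sketch of where that number is already available at load time (assuming the column data is mapped through SmooshedFileMapper; the surrounding wiring is omitted and the helper below is hypothetical):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.druid.java.util.common.io.smoosh.SmooshedFileMapper;

// Sketch: the smoosh mapper already hands back a ByteBuffer whose length is the
// column's stored size, so "preserving" it is just a matter of recording the
// value at the time the column is mapped.
final class ColumnSizeCapture
{
  static long storedSizeBytes(SmooshedFileMapper mapper, String columnName) throws IOException
  {
    ByteBuffer buffer = mapper.mapFile(columnName);
    return buffer.remaining();
  }
}
```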

Operational impact

The size analysis as it stands looks rather expensive. If we switch to the column's "in segment" size, the analysis becomes constant-time for mapped segments, resulting in cheaper segment metadata queries overall when this analysis is requested on mapped segments. I expect little change for the incremental index case.
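
For context, this is the kind of query that would pick up the new semantics; a hypothetical example assuming a broker at localhost:8082 and a datasource named "wikipedia":

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical usage sketch: POST a segmentMetadata query with the "size"
// analysis to a broker. Host, datasource, and interval are example values.
public class SizeAnalysisQuery
{
  public static void main(String[] args) throws Exception
  {
    String query = "{"
        + "\"queryType\": \"segmentMetadata\","
        + "\"dataSource\": \"wikipedia\","
        + "\"intervals\": [\"2019-01-01/2019-03-01\"],"
        + "\"analysisTypes\": [\"size\"]"
        + "}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8082/druid/v2"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(query))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

    // The response is a JSON array of per-segment analyses; with the proposed
    // change, the top-level and per-column "size" values would reflect stored byte sizes.
    System.out.println(response.body());
  }
}
```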

@drcrallen
Contributor

Can such a feature allow comparisons to https://cloud.google.com/bigquery/pricing ?

@clintropolis
Member Author

> Can such a feature allow comparisons to https://cloud.google.com/bigquery/pricing ?

I'm not sure; I think it would depend on whether "size the data requires to be loaded on historicals" or "size the data requires at rest in deep storage" is the more appropriate metric to compare against the BigQuery pricing model. My suggestion would provide a means to get the former value, which I think is maybe the better one to use in the Druid world, since data needs to be loaded to be useful, so ... yes, I guess? 😅 That said, you might need both values if you also wanted to meter segments that are in deep storage and not loaded, or are trying to bill for all costs.

@gianm
Contributor

gianm commented Feb 25, 2019

I do think it makes sense to have segmentMetadata return the size in bytes instead of the fake-o weird size it reports today. Maybe it makes sense, though, to change the name? i.e. deprecate and retire "size" and introduce a new thing that does make sense. That's just to avoid any potential confusion with people querying "size" on older versions, thinking they're getting the newer behavior, and getting wrong numbers back.

@paul-rogers
Contributor

+1 for this feature. As noted, without this I cannot really tell how much space a column consumes and whether it is worth the cost. I suppose I could infer this by creating a new table without the column and comparing the sizes, but doing so is clearly a bit of a hassle.

The number returned should account for all the space dedicated to the column, including any dictionary overhead, run-length encoding, and so on. It would be wonderful to have separate numbers for in-memory and on-disk, if they are vastly different for some reason.

The key bit we want to know is the cost of column X relative to the overall table size. So, as long as the in-memory and on-disk sizes are proportional, having one size is good enough (if it is accurate).

A good check would be that the sum of column sizes (per segment) should more-or-less equal the segment size, aside from any segment overhead.
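
A small sketch of that cross-check, assuming the segmentMetadata response keeps its current shape (a JSON array of per-segment analyses with a top-level "size" and per-column "size" entries):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch of the proposed sanity check: the sum of per-column sizes should be
// close to the per-segment size, modulo segment-level overhead.
final class SegmentSizeCheck
{
  static void check(String responseJson) throws Exception
  {
    JsonNode analyses = new ObjectMapper().readTree(responseJson);
    for (JsonNode segment : analyses) {
      long segmentSize = segment.path("size").asLong();
      long columnTotal = 0;
      for (JsonNode column : segment.path("columns")) {
        columnTotal += column.path("size").asLong();
      }
      System.out.printf("%s: segment=%d, sum(columns)=%d%n",
          segment.path("id").asText(), segmentSize, columnTotal);
    }
  }
}
```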
