-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a nullSafe mode in ArrayOfStringsSerDe #375
Conversation
There are some sketches where we don't allow One potential issue is that it'll introduce an outright incompatibility with C++, where the provided serde uses unsigned values. It seems fairly unlikely that anyone has produced serialized images with string lengths >= 2^31, but it might be better to use a separate null-aware class. |
Thanks for the review, @jmalkin !
Yes, I tried testing
The nullSafe mode in the modified classes is optional and existing code will continue to use the non nullSafe mode unless explicitly changed, but I guess that won't be enough for C++ compatibility. |
@kfaraz I'd wait for either @leerho or @AlexanderSaydakov to weigh in with a second opinion before refactoring. The max non-negative integer is large enough it may well be a non-issue even if it's a technical incompatibility. IIRC, java can't allocate an array with >= 2^31 bytes anyway, so there's some compatibility weirdness either way. |
This is an interesting use case but I have a few questions:
The As Jon points out, the current implementations of the numeric and boolean SerDes assume that there are no nulls in the input boxed element arrays. For symmetry of the API across all these SerDes, it might make sense to create separate "null safe" SerDe classes. This way, it would be possible, in the future, as required, to create "null safe" capability for all the SerDes, including the numeric and boolean ones without any backward compatibility issues. But, as stated above, the simplest "null safe" implementation for strings would be an empty substitution. I would strongly encourage you to consider this as well as a separate "null safe" SerDe class: |
Thanks for the feedback, @leerho !
That's correct, the underlying sketch is a quantiles
This would not be desirable as in most of our datasets,
This could be done but if the user needs to specify this, it would perhaps be an additional cognitive burden. If hardcoded into the system as a specific String, there would still be a little chance of that string occurring in a dataset. If you believe there is a string we can use for this purpose, would it make sense to do the conversion from
None of the Java classes would have to be changed. But I am not sure about how the C++ compat would be affected. I have added both a nullSafe and a non nullSafe mode in the modified classes such that the existing sketch classes can continue to rely on the non nullSafe behaviour without any change at all. They would have to handle nulls only if they explicitly start constructing the SerDe as Code Snippet: private final boolean nullSafe;
/**
* Creates an instance of {@code ArrayOfStringsSerDe} which does not handle
* null values.
*/
public ArrayOfStringsSerDe() {
this(false);
}
/**
* Creates an instance of {@code ArrayOfStringsSerDe}.
*
* @param nullSafe true if null values should be serialized/deserialized safely.
*/
public ArrayOfStringsSerDe(boolean nullSafe) {
this.nullSafe = nullSafe;
} Please let me know if this would be insufficient and if it would be better to separate the nullSafe mode into a different class altogether. |
Null handling for all of our "container" type sketches which include the classic quantiles, KLL, and REQ, as well as the Frequent-Items sketches and Sampling sketches is a long-term goal. And given time and resources we hope to get around to it. Nonetheless, in all cases users have the option to do their own null-substitutions before submitting values to the sketch. And for those sketches that allow a user-defined comparator (such as the classic Generic Quantiles, the user can determine where the user-supplied null-substitute would appear in a sorted list. With that in mind, I would like to close this issue, which is a minor subset of the above goal. Thank you for prodding us to think about this important issue! |
Description
In cases of sparse data, occurrences of
null
are very common. Sketch estimationsshould treat
null
as a valid key, so that the estimated occurrences are more accurate.This is particularly useful in case of String columns, where null is more likely to occur.
As seen in apache/druid#11973, when the key is composed of multiple (nullable)
dimensions, the probability of a data row having a null dimension value becomes much
higher and thus the estimates are poorer, because any row containing a null dimension
value would simply have to be ignored.
This PR does the initial work required for this by allowing nulls in ArrayOfStringsSerDe.
Changes
ArrayOfStringsSerDe
andArrayOfUtf16StringsSerDe
So, a serialized null String would require the same number of bytes as a serialized empty String.
Testing
mvn clean test
Follow up work
ArrayOfNumbersSerDe
as it already uses a bytemarker for length. That same marker could hold a different flag value to denote nulls.
ArrayOfDoublesSerDe
orArrayOfLongsSerDe
would require changingthe way the array is serialized as it currently includes no marker byte. This would break
compatibility and might not add a lot of value.
ItemsSketch
would also have to be updated to have anullSafe mode.