Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a nullSafe mode in ArrayOfStringsSerDe #375

Closed
wants to merge 2 commits into from

Conversation

kfaraz
Copy link

@kfaraz kfaraz commented Nov 24, 2021

Description

In cases of sparse data, occurrences of null are very common. Sketch estimations
should treat null as a valid key, so that the estimated occurrences are more accurate.
This is particularly useful in case of String columns, where null is more likely to occur.

As seen in apache/druid#11973, when the key is composed of multiple (nullable)
dimensions, the probability of a data row having a null dimension value becomes much
higher and thus the estimates are poorer, because any row containing a null dimension
value would simply have to be ignored.

This PR does the initial work required for this by allowing nulls in ArrayOfStringsSerDe.

Changes

  • Add a null-safe mode in both ArrayOfStringsSerDe and ArrayOfUtf16StringsSerDe
  • In null-safe mode, null Strings are serialized simply as a negative length marker.
    So, a serialized null String would require the same number of bytes as a serialized empty String.
  • In non null-safe mode, serialization/deserialization continues to work as before.
  • Add tests for the above

Testing

  • Done mvn clean test

Follow up work

  • A similar change could also be made in ArrayOfNumbersSerDe as it already uses a byte
    marker for length. That same marker could hold a different flag value to denote nulls.
  • Similar change in ArrayOfDoublesSerDe or ArrayOfLongsSerDe would require changing
    the way the array is serialized as it currently includes no marker byte. This would break
    compatibility and might not add a lot of value.
  • The actual sketch classes such as ItemsSketch would also have to be updated to have a
    nullSafe mode.

@jmalkin
Copy link
Contributor

jmalkin commented Nov 24, 2021

There are some sketches where we don't allow null as input, but maybe that's not a concern since this would have nothing to do.

One potential issue is that it'll introduce an outright incompatibility with C++, where the provided serde uses unsigned values. It seems fairly unlikely that anyone has produced serialized images with string lengths >= 2^31, but it might be better to use a separate null-aware class.

@kfaraz
Copy link
Author

kfaraz commented Nov 25, 2021

Thanks for the review, @jmalkin !

There are some sketches where we don't allow null as input, but maybe that's not a concern since this would have nothing to do.

Yes, I tried testing ItemsSketch and I had to make some changes to allow nulls. But those changes are not in this PR. If needed, that can be done in a follow-up PR.

One potential issue is that it'll introduce an outright incompatibility with C++, where the provided serde uses unsigned values. It seems fairly unlikely that anyone has produced serialized images with string lengths >= 2^31, but it might be better to use a separate null-aware class.

The nullSafe mode in the modified classes is optional and existing code will continue to use the non nullSafe mode unless explicitly changed, but I guess that won't be enough for C++ compatibility.
So, I will go ahead and create a nullSafe version of these SerDes rather than modifying the existing ones.

@jmalkin
Copy link
Contributor

jmalkin commented Nov 25, 2021

@kfaraz I'd wait for either @leerho or @AlexanderSaydakov to weigh in with a second opinion before refactoring. The max non-negative integer is large enough it may well be a non-issue even if it's a technical incompatibility. IIRC, java can't allocate an array with >= 2^31 bytes anyway, so there's some compatibility weirdness either way.

@leerho
Copy link
Contributor

leerho commented Nov 29, 2021

This is an interesting use case but I have a few questions:

  • I assume from the discussion in the referenced Druid PR, that the underlying sketch is a quantiles sketch. Is this correct?
  • Given an array of strings, why is a null string element interpreted differently from an empty string element?
  • Could null be made equivalent to empty? For example, an empty string be substituted for every occurrence of a null element? This would eliminate the need for a magic null symbol, which does require changing the sketch code for all sketches that might use that class.
  • Could a null be converted to a magic string, specified by the user and based on the use case?

The ArrayOfStringsSerDe class can be used by the frequency sketches, sampling sketches as well as the quantile sketches. This PR would require changing sketch code of all these sketches to interpret nulls differently and either allow or disallow them, depending on the sketch and the use case, which increases the complexity of these sketches.

As Jon points out, the current implementations of the numeric and boolean SerDes assume that there are no nulls in the input boxed element arrays. For symmetry of the API across all these SerDes, it might make sense to create separate "null safe" SerDe classes. This way, it would be possible, in the future, as required, to create "null safe" capability for all the SerDes, including the numeric and boolean ones without any backward compatibility issues.

But, as stated above, the simplest "null safe" implementation for strings would be an empty substitution. I would strongly encourage you to consider this as well as a separate "null safe" SerDe class: ArrayOfNullSensitiveStringsSerDe.

@kfaraz
Copy link
Author

kfaraz commented Dec 3, 2021

Thanks for the feedback, @leerho !

I assume from the discussion in the referenced Druid PR, that the underlying sketch is a quantiles sketch. Is this correct?

That's correct, the underlying sketch is a quantiles ItemsSketch.

Given an array of strings, why is a null string element interpreted differently from an empty string element?
Could null be made equivalent to empty? For example, an empty string be substituted for every occurrence of a null element? This would eliminate the need for a magic null symbol, which does require changing the sketch code for all sketches that might use that class.

This would not be desirable as in most of our datasets, null typically signifies "No Data" whereas an empty String would be valid data explicitly set to empty by a user. There are often queries just to find out rows which have null values (i.e. "No Data") for a particular column.

Could a null be converted to a magic string, specified by the user and based on the use case?

This could be done but if the user needs to specify this, it would perhaps be an additional cognitive burden. If hardcoded into the system as a specific String, there would still be a little chance of that string occurring in a dataset.

If you believe there is a string we can use for this purpose, would it make sense to do the conversion from null to the magic string and vice-versa in ArrayOfStringsSerDe itself?

This PR would require changing sketch code of all these sketches to interpret nulls differently and either allow or disallow them, depending on the sketch and the use case, which increases the complexity of these sketches.

None of the Java classes would have to be changed. But I am not sure about how the C++ compat would be affected. I have added both a nullSafe and a non nullSafe mode in the modified classes such that the existing sketch classes can continue to rely on the non nullSafe behaviour without any change at all. They would have to handle nulls only if they explicitly start constructing the SerDe as new ArrayOfStringsSerDe(true) instead of the existing new ArrayOfStringsSerDe()

Code Snippet:

  private final boolean nullSafe;

  /**
   * Creates an instance of {@code ArrayOfStringsSerDe} which does not handle
   * null values.
   */
  public ArrayOfStringsSerDe() {
    this(false);
  }

  /**
   * Creates an instance of {@code ArrayOfStringsSerDe}.
   *
   * @param nullSafe true if null values should be serialized/deserialized safely.
   */
  public ArrayOfStringsSerDe(boolean nullSafe) {
    this.nullSafe = nullSafe;
  }

Please let me know if this would be insufficient and if it would be better to separate the nullSafe mode into a different class altogether.

@leerho
Copy link
Contributor

leerho commented Feb 2, 2022

Null handling for all of our "container" type sketches which include the classic quantiles, KLL, and REQ, as well as the Frequent-Items sketches and Sampling sketches is a long-term goal. And given time and resources we hope to get around to it. Nonetheless, in all cases users have the option to do their own null-substitutions before submitting values to the sketch. And for those sketches that allow a user-defined comparator (such as the classic Generic Quantiles, the user can determine where the user-supplied null-substitute would appear in a sorted list.

With that in mind, I would like to close this issue, which is a minor subset of the above goal.

Thank you for prodding us to think about this important issue!

@leerho leerho closed this Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants