Accessing private parts of datasketches #12261

AlexanderSaydakov · 2022-02-15T01:54:25Z

We have noticed this recent change e648b01

This change introduced a piece of code which bypasses public API of DataSketches and accesses a private part of implementation.

druid/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/SketchAggregator.java

Line 60 in 033989e

SKETCH_FIELD = Class.forName("org.apache.datasketches.theta.UnionImpl")

We believe that it is problematic and can lead to wrong results.
We have this in our documentation:
/**

Although the gadget object is initially an UpdateSketch, in the context of a Union it is used
as a specialized buffer that happens to leverage much of the machinery of an UpdateSketch.
However, in this context some of the key invariants of the sketch algorithm are intentionally
violated as an optimization. As a result this object can not be considered as an UpdateSketch
and should never be exported as an UpdateSketch. It’s internal state is not necessarily
finalized and may contain garbage. Also its internal concept of “nominal entries” or “k” can
be meaningless. It is private for very good reasons.
*/

The text was updated successfully, but these errors were encountered:

kfaraz · 2022-02-15T17:57:07Z

Thanks a lot for pointing this out, @AlexanderSaydakov ! I had missed this while making the changes in the mentioned commit.

What I really intended to do was get a better estimate of the memory used by a org.apache.datasketches.theta.Union.
As the Union interface does not expose a getCurrentBytes() or similar method, I had to resort to the reflection.

Please advise what would be a better way to achieve the same.

leerho · 2022-02-15T20:06:35Z

@kfaraz @AlexanderSaydakov @gianm @cheddar

PLEASE, engage with us and help us understand what it is you are trying to do. If it means adding a public API to accomplish it, we are more than willing to do that.
Whatever you do, never access internal private methods of these sketches.
We placed a detailed, and explicit warning Javadoc on the private gadget member precisely because we were worried that someone might try to do what you did.
Warning Comment
I don't know how we could have been more clear.

The memory currently being used by a Union or Sketch is rather complex:

Do you mean on-heap or off-heap memory, or both?
The largest chunk of memory is the array of hashes which is configured as a hashTable, where the fraction of that hashTable space consumed by valid hashes can vary enormously based on the input stream, and the exact state of the union state machine. --- So, are you seeking to understand the size of the total space consumed by the sketch, regardless of what fraction is actively being used at that moment? Unfortunately, this is dynamic, non-deterministic variable. And if your application makes use of reset(), it will reset the space used down to a minimum.
How do you intend to use this information? Are you trying to forecast memory requirements in the future, based on some aggregate statistic across many sketches?

We have done quite a bit of work on modeling aggregate sketch memory usage where thousands or millions of sketches are concurrent in memory. Perhaps this may be useful to you.

We like to use Druid's use of DataSketches as an exemplary model and we point folks to your website and your code base so that they can understand how DataSketches can be integrated properly. We certainly don't want other platforms mimicking your accessing sketch internal private fields or methods.

No one understands the internals of these sketches better than our DataSketches team, and we are very much interested in making sketches work optimally in the Druid environment. But we cannot help you if you don't engage with us.

In conclusion, it is my strong recommendation that you revert this commit (before it gets frozen in a release), and work with us to help you find a better, more public solution.

gianm · 2022-02-15T20:52:43Z

PLEASE, engage with us and help us understand what it is you are trying to do. If it means adding a public API to accomplish it, we are more than willing to do that.

Please don't take it personally that nobody reached out to you about the original problem. The Druid project is a big place with a lot of dependencies, and not all contributors are aware of how willing those various dependencies' maintainers are to work with us. Your willingness to work with us over the years is very much appreciated, as is the fact that you took the time to write up your concerns about this particular case.

The memory currently being used by a Union or Sketch is rather complex.

The commit is from #12073, #12022 which are trying to improve memory estimation in the on-heap indexing implementation. The goal of the patch is to decide when to spill an on-heap index to disk. To do this properly, we need a way to estimate the memory usage of the index. The patch works by adding an aggregateWithSize() method to on-heap aggregators that (1) does an aggregation and (2) returns the amount of additional memory use caused by that particular aggregation. Each aggregator is associated with a particular row in the index, so we then sum up the memory used by all aggregators in all rows to get the memory footprint of the entire index.

The private field in this case is being used by SketchAggregator to implement the memory-estimation piece of aggregateWithSize().

We shouldn't be using private fields like that, so we should find another way to solve the original problem.

One obvious solution I can think of is to use a simple formula, like something that is based only on the size parameter and the number of items added so far. I'm not sure how accurate this would be since I'm not super familiar with how the sketches internally manage their on-heap memory. But maybe it'd be close enough? If we do go this way, an implementation note: at ingestion time, it's likely that a small number of items get added to each sketch. (Like in the 1 to 50 range.) I think some sketches have a special mode they use when the number of added items is low, and whenever that is true, we'd want to take that into account to avoid overcounting memory usage.

@AlexanderSaydakov or @leerho do you have any other, better suggestions? & @kfaraz could you please work with them to implement a better solution when we come up with one?

imply-cheddar · 2022-02-15T22:37:12Z

As gian mentioned, the code is reaching in and grabbing the _gadget as a short-term solution to grab the number of bytes used. It would absolutely be preferable for the UnionSketch to also expose the same .getCurrentBytes() method.

In terms of the implementation, the sketch that was grabbed is only being used to call .getCurrentBytes() and is used for no other purpose. The memory counting structure works with deltas, so if reset was called and the number got smaller, there would then be a negative delta in memory usage and the memory consumption values would be updated accordingly.

The intention was always to open a ticket to request that the public API be extended. We wanted to have a concrete example of why we needed the method rather than just asking for a random method, so we wanted the PR to actually exist before doing the request. In the end, the request back to the datasketches got lost in the shuffle of getting the interface changes worked out, so that's our bad.

leerho · 2022-02-16T04:52:42Z

I have added a PR, mentioned just above, to directly address this issue. It will be released with the next Java release which should be relatively soon since we also want to release a new KllDoublesSketch, plus a few other things.

I do respectfully request that you do not lock your "short-term" solution in a formal release.

Please note: the getCurrentBytes() method implemented here as well as in your "short-term" solution will only report a different value after the internal gadget goes through a resize when the current internal hash table is full. In between these resize events this method will return the value from the previous resize event. If you serialize the union via toByteArray() the length of the byte array will exactly be the value returned by this method. It also represents (approximately) how much RAM the union is using.

However, if you do union.toByteArray().getResult(), the size of the resulting CompactSketch will generally be much smaller as the union has been "pulled back to K" and compacted.

We don't recommend that you actually serialize the union to either store to disk or to transport to another machine because it is so much larger than the compact sketch you get when you getResult().

imply-cheddar · 2022-02-16T10:34:22Z

The value is currently being used to estimate the total size in memory of the onheap Aggregator object, so the semantics that you mentioned are exactly what we want. It's never used to determine or allocate the number of bytes actually required to store, persist or transport the sketch.

It's also perfectly fine that it doesn't always produce a new value. That's understood and accounted for.

Thanks for such a quick turnaround on an update to the base library. As soon as that's release, let us know and we will update the code to remove the reflection.

leerho · 2022-04-22T00:48:23Z

@gianm @imply-cheddar @kfaraz

This is to let you know that datasketches-java 3.2.0-RC1 is up for vote. This release contains two important features requested by Druid or Druid users:

This release extends the quantiles KLL sketch to include both doubles
and float types as well as full updatable operation off-heap in contiguous memory.
This release also adds two new methods to Theta Union based on the
discussion on Druid issue # 12261:
- getCurrentBytes() This behaves similarly to the Theta Sketch method
  by the same name.
- getMaxUnionBytes() This behaves similarly to the static Theta / Sketches
  method by the same name, but is non-static and returns the maximum stored
  bytes for this union given its nominal entries configuration.

If you would like to vote, please do :)

Thanks,
Lee.

leerho · 2022-04-22T00:50:56Z

Also, @AlexanderSaydakov is working to extend the DataSketches Druid adaptor to accommodate the new KllSketch.

leerho · 2022-04-22T18:14:20Z

@gianm @imply-cheddar @kfaraz

Question:
With the new datasketches-java 3.2.0-RC1 that is in vote right now, it is not possible to merge a KllFloatsSketch into a KllDoublesSketch as it will throw an error. Even if this were possible, you would naturally lose some accuracy (at the primitive level) going from a float to a double. Nevertheless, it would be relatively straightforward to create a "converter tool" that would create a KllDoublesSketch from a KllFloatsSketch that one could use before a merge. Is such a capability of interest in the Druid community? If so, could it wait until the next datasketches-java release?

gianm · 2022-05-09T15:57:33Z

With the new datasketches-java 3.2.0-RC1 that is in vote right now, it is not possible to merge a KllFloatsSketch into a KllDoublesSketch as it will throw an error. Even if this were possible, you would naturally lose some accuracy (at the primitive level) going from a float to a double. Nevertheless, it would be relatively straightforward to create a "converter tool" that would create a KllDoublesSketch from a KllFloatsSketch that one could use before a merge. Is such a capability of interest in the Druid community? If so, could it wait until the next datasketches-java release?

If we're going to support using both at ingestion time, then I think a converter would be useful, since that would allow people to switch from the float sketch to the double sketch without creating a new column. However, I don't think it's essential enough to block datasketches-java releases. We will just need to document that at the current time, if you want to switch KLL sketch implementations, you should create a new column, and you can't merge sketches of two different types.

gianm · 2022-05-09T15:58:31Z

@abhishekagarwal87 fyi, I tagged this with 0.23.0 due to the discussion above. I hope we can include this change.

abhishekagarwal87 · 2022-05-10T12:21:15Z

Sure thing, @gianm.

abhishekagarwal87 · 2022-05-17T08:26:49Z

This is fixed by #12522 #12509

leerho mentioned this issue Feb 16, 2022

Get union size apache/datasketches-java#388

Merged

gianm added this to the 0.23.0 milestone May 9, 2022

abhishekagarwal87 closed this as completed May 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accessing private parts of datasketches #12261

Accessing private parts of datasketches #12261

AlexanderSaydakov commented Feb 15, 2022

kfaraz commented Feb 15, 2022

leerho commented Feb 15, 2022

gianm commented Feb 15, 2022

imply-cheddar commented Feb 15, 2022

leerho commented Feb 16, 2022 •

edited

Loading

imply-cheddar commented Feb 16, 2022

leerho commented Apr 22, 2022

leerho commented Apr 22, 2022

leerho commented Apr 22, 2022

gianm commented May 9, 2022

gianm commented May 9, 2022

abhishekagarwal87 commented May 10, 2022

abhishekagarwal87 commented May 17, 2022

Accessing private parts of datasketches #12261

Accessing private parts of datasketches #12261

Comments

AlexanderSaydakov commented Feb 15, 2022

kfaraz commented Feb 15, 2022

leerho commented Feb 15, 2022

gianm commented Feb 15, 2022

imply-cheddar commented Feb 15, 2022

leerho commented Feb 16, 2022 • edited Loading

imply-cheddar commented Feb 16, 2022

leerho commented Apr 22, 2022

leerho commented Apr 22, 2022

leerho commented Apr 22, 2022

gianm commented May 9, 2022

gianm commented May 9, 2022

abhishekagarwal87 commented May 10, 2022

abhishekagarwal87 commented May 17, 2022

leerho commented Feb 16, 2022 •

edited

Loading