Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accessing private parts of datasketches #12261

Closed
AlexanderSaydakov opened this issue Feb 15, 2022 · 13 comments
Closed

Accessing private parts of datasketches #12261

AlexanderSaydakov opened this issue Feb 15, 2022 · 13 comments
Milestone

Comments

@AlexanderSaydakov
Copy link
Contributor

We have noticed this recent change e648b01

This change introduced a piece of code which bypasses public API of DataSketches and accesses a private part of implementation.

We believe that it is problematic and can lead to wrong results.
We have this in our documentation:
/**

  • Although the gadget object is initially an UpdateSketch, in the context of a Union it is used
  • as a specialized buffer that happens to leverage much of the machinery of an UpdateSketch.
  • However, in this context some of the key invariants of the sketch algorithm are intentionally
  • violated as an optimization. As a result this object can not be considered as an UpdateSketch
  • and should never be exported as an UpdateSketch. It’s internal state is not necessarily
  • finalized and may contain garbage. Also its internal concept of “nominal entries” or “k” can
  • be meaningless. It is private for very good reasons.
    */
@kfaraz
Copy link
Contributor

kfaraz commented Feb 15, 2022

Thanks a lot for pointing this out, @AlexanderSaydakov ! I had missed this while making the changes in the mentioned commit.

What I really intended to do was get a better estimate of the memory used by a org.apache.datasketches.theta.Union.
As the Union interface does not expose a getCurrentBytes() or similar method, I had to resort to the reflection.

Please advise what would be a better way to achieve the same.

@leerho
Copy link
Contributor

leerho commented Feb 15, 2022

@kfaraz @AlexanderSaydakov @gianm @cheddar

  1. PLEASE, engage with us and help us understand what it is you are trying to do. If it means adding a public API to accomplish it, we are more than willing to do that.
  2. Whatever you do, never access internal private methods of these sketches.
  3. We placed a detailed, and explicit warning Javadoc on the private gadget member precisely because we were worried that someone might try to do what you did.
    Warning Comment
    I don't know how we could have been more clear.

The memory currently being used by a Union or Sketch is rather complex:

  • Do you mean on-heap or off-heap memory, or both?
  • The largest chunk of memory is the array of hashes which is configured as a hashTable, where the fraction of that hashTable space consumed by valid hashes can vary enormously based on the input stream, and the exact state of the union state machine. --- So, are you seeking to understand the size of the total space consumed by the sketch, regardless of what fraction is actively being used at that moment? Unfortunately, this is dynamic, non-deterministic variable. And if your application makes use of reset(), it will reset the space used down to a minimum.
  • How do you intend to use this information? Are you trying to forecast memory requirements in the future, based on some aggregate statistic across many sketches?

We have done quite a bit of work on modeling aggregate sketch memory usage where thousands or millions of sketches are concurrent in memory. Perhaps this may be useful to you.

We like to use Druid's use of DataSketches as an exemplary model and we point folks to your website and your code base so that they can understand how DataSketches can be integrated properly. We certainly don't want other platforms mimicking your accessing sketch internal private fields or methods.

No one understands the internals of these sketches better than our DataSketches team, and we are very much interested in making sketches work optimally in the Druid environment. But we cannot help you if you don't engage with us.

In conclusion, it is my strong recommendation that you revert this commit (before it gets frozen in a release), and work with us to help you find a better, more public solution.

@gianm
Copy link
Contributor

gianm commented Feb 15, 2022

PLEASE, engage with us and help us understand what it is you are trying to do. If it means adding a public API to accomplish it, we are more than willing to do that.

Please don't take it personally that nobody reached out to you about the original problem. The Druid project is a big place with a lot of dependencies, and not all contributors are aware of how willing those various dependencies' maintainers are to work with us. Your willingness to work with us over the years is very much appreciated, as is the fact that you took the time to write up your concerns about this particular case.

The memory currently being used by a Union or Sketch is rather complex.

The commit is from #12073, #12022 which are trying to improve memory estimation in the on-heap indexing implementation. The goal of the patch is to decide when to spill an on-heap index to disk. To do this properly, we need a way to estimate the memory usage of the index. The patch works by adding an aggregateWithSize() method to on-heap aggregators that (1) does an aggregation and (2) returns the amount of additional memory use caused by that particular aggregation. Each aggregator is associated with a particular row in the index, so we then sum up the memory used by all aggregators in all rows to get the memory footprint of the entire index.

The private field in this case is being used by SketchAggregator to implement the memory-estimation piece of aggregateWithSize().

We shouldn't be using private fields like that, so we should find another way to solve the original problem.

One obvious solution I can think of is to use a simple formula, like something that is based only on the size parameter and the number of items added so far. I'm not sure how accurate this would be since I'm not super familiar with how the sketches internally manage their on-heap memory. But maybe it'd be close enough? If we do go this way, an implementation note: at ingestion time, it's likely that a small number of items get added to each sketch. (Like in the 1 to 50 range.) I think some sketches have a special mode they use when the number of added items is low, and whenever that is true, we'd want to take that into account to avoid overcounting memory usage.

@AlexanderSaydakov or @leerho do you have any other, better suggestions? & @kfaraz could you please work with them to implement a better solution when we come up with one?

@imply-cheddar
Copy link
Contributor

As gian mentioned, the code is reaching in and grabbing the _gadget as a short-term solution to grab the number of bytes used. It would absolutely be preferable for the UnionSketch to also expose the same .getCurrentBytes() method.

In terms of the implementation, the sketch that was grabbed is only being used to call .getCurrentBytes() and is used for no other purpose. The memory counting structure works with deltas, so if reset was called and the number got smaller, there would then be a negative delta in memory usage and the memory consumption values would be updated accordingly.

The intention was always to open a ticket to request that the public API be extended. We wanted to have a concrete example of why we needed the method rather than just asking for a random method, so we wanted the PR to actually exist before doing the request. In the end, the request back to the datasketches got lost in the shuffle of getting the interface changes worked out, so that's our bad.

@leerho
Copy link
Contributor

leerho commented Feb 16, 2022

I have added a PR, mentioned just above, to directly address this issue. It will be released with the next Java release which should be relatively soon since we also want to release a new KllDoublesSketch, plus a few other things.

I do respectfully request that you do not lock your "short-term" solution in a formal release.

Please note: the getCurrentBytes() method implemented here as well as in your "short-term" solution will only report a different value after the internal gadget goes through a resize when the current internal hash table is full. In between these resize events this method will return the value from the previous resize event. If you serialize the union via toByteArray() the length of the byte array will exactly be the value returned by this method. It also represents (approximately) how much RAM the union is using.

However, if you do union.toByteArray().getResult(), the size of the resulting CompactSketch will generally be much smaller as the union has been "pulled back to K" and compacted.

We don't recommend that you actually serialize the union to either store to disk or to transport to another machine because it is so much larger than the compact sketch you get when you getResult().

@imply-cheddar
Copy link
Contributor

The value is currently being used to estimate the total size in memory of the onheap Aggregator object, so the semantics that you mentioned are exactly what we want. It's never used to determine or allocate the number of bytes actually required to store, persist or transport the sketch.

It's also perfectly fine that it doesn't always produce a new value. That's understood and accounted for.

Thanks for such a quick turnaround on an update to the base library. As soon as that's release, let us know and we will update the code to remove the reflection.

@leerho
Copy link
Contributor

leerho commented Apr 22, 2022

@gianm @imply-cheddar @kfaraz

This is to let you know that datasketches-java 3.2.0-RC1 is up for vote. This release contains two important features requested by Druid or Druid users:

  • This release extends the quantiles KLL sketch to include both doubles
    and float types as well as full updatable operation off-heap in contiguous memory.
  • This release also adds two new methods to Theta Union based on the
    discussion on Druid issue # 12261:
    • getCurrentBytes() This behaves similarly to the Theta Sketch method
      by the same name.
    • getMaxUnionBytes() This behaves similarly to the static Theta / Sketches
      method by the same name, but is non-static and returns the maximum stored
      bytes for this union given its nominal entries configuration.

If you would like to vote, please do :)

Thanks,
Lee.

@leerho
Copy link
Contributor

leerho commented Apr 22, 2022

Also, @AlexanderSaydakov is working to extend the DataSketches Druid adaptor to accommodate the new KllSketch.

@leerho
Copy link
Contributor

leerho commented Apr 22, 2022

@gianm @imply-cheddar @kfaraz

Question:
With the new datasketches-java 3.2.0-RC1 that is in vote right now, it is not possible to merge a KllFloatsSketch into a KllDoublesSketch as it will throw an error. Even if this were possible, you would naturally lose some accuracy (at the primitive level) going from a float to a double. Nevertheless, it would be relatively straightforward to create a "converter tool" that would create a KllDoublesSketch from a KllFloatsSketch that one could use before a merge. Is such a capability of interest in the Druid community? If so, could it wait until the next datasketches-java release?

@gianm
Copy link
Contributor

gianm commented May 9, 2022

With the new datasketches-java 3.2.0-RC1 that is in vote right now, it is not possible to merge a KllFloatsSketch into a KllDoublesSketch as it will throw an error. Even if this were possible, you would naturally lose some accuracy (at the primitive level) going from a float to a double. Nevertheless, it would be relatively straightforward to create a "converter tool" that would create a KllDoublesSketch from a KllFloatsSketch that one could use before a merge. Is such a capability of interest in the Druid community? If so, could it wait until the next datasketches-java release?

If we're going to support using both at ingestion time, then I think a converter would be useful, since that would allow people to switch from the float sketch to the double sketch without creating a new column. However, I don't think it's essential enough to block datasketches-java releases. We will just need to document that at the current time, if you want to switch KLL sketch implementations, you should create a new column, and you can't merge sketches of two different types.

@gianm gianm added this to the 0.23.0 milestone May 9, 2022
@gianm
Copy link
Contributor

gianm commented May 9, 2022

@abhishekagarwal87 fyi, I tagged this with 0.23.0 due to the discussion above. I hope we can include this change.

@abhishekagarwal87
Copy link
Contributor

Sure thing, @gianm.

@abhishekagarwal87
Copy link
Contributor

This is fixed by #12522 #12509

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants