-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accessing private parts of datasketches #12261
Comments
Thanks a lot for pointing this out, @AlexanderSaydakov ! I had missed this while making the changes in the mentioned commit. What I really intended to do was get a better estimate of the memory used by a Please advise what would be a better way to achieve the same. |
@kfaraz @AlexanderSaydakov @gianm @cheddar
The memory currently being used by a Union or Sketch is rather complex:
We have done quite a bit of work on modeling aggregate sketch memory usage where thousands or millions of sketches are concurrent in memory. Perhaps this may be useful to you. We like to use Druid's use of DataSketches as an exemplary model and we point folks to your website and your code base so that they can understand how DataSketches can be integrated properly. We certainly don't want other platforms mimicking your accessing sketch internal private fields or methods. No one understands the internals of these sketches better than our DataSketches team, and we are very much interested in making sketches work optimally in the Druid environment. But we cannot help you if you don't engage with us. In conclusion, it is my strong recommendation that you revert this commit (before it gets frozen in a release), and work with us to help you find a better, more public solution. |
Please don't take it personally that nobody reached out to you about the original problem. The Druid project is a big place with a lot of dependencies, and not all contributors are aware of how willing those various dependencies' maintainers are to work with us. Your willingness to work with us over the years is very much appreciated, as is the fact that you took the time to write up your concerns about this particular case.
The commit is from #12073, #12022 which are trying to improve memory estimation in the on-heap indexing implementation. The goal of the patch is to decide when to spill an on-heap index to disk. To do this properly, we need a way to estimate the memory usage of the index. The patch works by adding an The private field in this case is being used by SketchAggregator to implement the memory-estimation piece of We shouldn't be using private fields like that, so we should find another way to solve the original problem. One obvious solution I can think of is to use a simple formula, like something that is based only on the @AlexanderSaydakov or @leerho do you have any other, better suggestions? & @kfaraz could you please work with them to implement a better solution when we come up with one? |
As gian mentioned, the code is reaching in and grabbing the _gadget as a short-term solution to grab the number of bytes used. It would absolutely be preferable for the UnionSketch to also expose the same In terms of the implementation, the sketch that was grabbed is only being used to call The intention was always to open a ticket to request that the public API be extended. We wanted to have a concrete example of why we needed the method rather than just asking for a random method, so we wanted the PR to actually exist before doing the request. In the end, the request back to the datasketches got lost in the shuffle of getting the interface changes worked out, so that's our bad. |
I have added a PR, mentioned just above, to directly address this issue. It will be released with the next Java release which should be relatively soon since we also want to release a new KllDoublesSketch, plus a few other things. I do respectfully request that you do not lock your "short-term" solution in a formal release. Please note: the getCurrentBytes() method implemented here as well as in your "short-term" solution will only report a different value after the internal gadget goes through a resize when the current internal hash table is full. In between these resize events this method will return the value from the previous resize event. If you serialize the union via toByteArray() the length of the byte array will exactly be the value returned by this method. It also represents (approximately) how much RAM the union is using. However, if you do We don't recommend that you actually serialize the union to either store to disk or to transport to another machine because it is so much larger than the compact sketch you get when you getResult(). |
The value is currently being used to estimate the total size in memory of the onheap Aggregator object, so the semantics that you mentioned are exactly what we want. It's never used to determine or allocate the number of bytes actually required to store, persist or transport the sketch. It's also perfectly fine that it doesn't always produce a new value. That's understood and accounted for. Thanks for such a quick turnaround on an update to the base library. As soon as that's release, let us know and we will update the code to remove the reflection. |
This is to let you know that datasketches-java 3.2.0-RC1 is up for vote. This release contains two important features requested by Druid or Druid users:
If you would like to vote, please do :) Thanks, |
Also, @AlexanderSaydakov is working to extend the DataSketches Druid adaptor to accommodate the new KllSketch. |
Question: |
If we're going to support using both at ingestion time, then I think a converter would be useful, since that would allow people to switch from the float sketch to the double sketch without creating a new column. However, I don't think it's essential enough to block datasketches-java releases. We will just need to document that at the current time, if you want to switch KLL sketch implementations, you should create a new column, and you can't merge sketches of two different types. |
@abhishekagarwal87 fyi, I tagged this with 0.23.0 due to the discussion above. I hope we can include this change. |
Sure thing, @gianm. |
We have noticed this recent change e648b01
This change introduced a piece of code which bypasses public API of DataSketches and accesses a private part of implementation.
druid/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/SketchAggregator.java
Line 60 in 033989e
We believe that it is problematic and can lead to wrong results.
We have this in our documentation:
/**
*/
The text was updated successfully, but these errors were encountered: