-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12934][SQL] Count-min sketch serialization #10893
Conversation
692f93e
to
e97d7f9
Compare
Test build #49967 has finished for PR 10893 at commit
|
public static CountMinSketch readFrom(InputStream in) { | ||
throw new UnsupportedOperationException("Not implemented yet"); | ||
public static CountMinSketch readFrom(InputStream in) throws IOException { | ||
DataInputStream dis = new DataInputStream(in); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put the implementation in CountMinSketchImpl, and simply delegate to a static method there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also somewhere in CountMinSketchImpl we should document the serialization format.
@@ -55,6 +57,25 @@ | |||
*/ | |||
abstract public class CountMinSketch { | |||
/** | |||
* Version number of the serialized binary format. | |||
*/ | |||
public enum Version { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this out so that we can also use it in bloom filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be count min sketch specific, because the two binary protocols can and should evolve separately.
5e0c225
to
4abc4e0
Compare
Test build #50023 has finished for PR 10893 at commit
|
Test build #50025 has finished for PR 10893 at commit
|
Test build #50020 has finished for PR 10893 at commit
|
* - Row 1 (width * 64 bit) | ||
* - ... | ||
* - Row depth - 1 (width * 64 bit) | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move this comment near writeTo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I couldn't decide whether to put it before writeTo
or readFrom
, and then ended up here...
LGTM overall |
DataInputStream dis = new DataInputStream(in); | ||
|
||
// Ignores version number | ||
dis.readInt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add some check here to throw an exception if the version number is not 1.
Going to merge this. Thanks. Please address my (one) feedback in your next pr. |
public CMSMergeException(String message) { | ||
super(message); | ||
public static CountMinSketchImpl readFrom(InputStream in) throws IOException { | ||
DataInputStream dis = new DataInputStream(in); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be closed before returning from the method, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But DataInputStream.close()
also closes the underlying input stream, which isn't our expected behavior, is it?
This PR adds serialization support for
CountMinSketch
.A version number is added to version the serialized binary format.