-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Future-proof serialization #87
Comments
You can overload Java serialization process to save what you want, for example in my example, methods getBytes and fromBytes already implements in your Digest classes. In this example, fields of object are ignored and only declared "digestEncodeWithBytes" field is declared : public abstract class TDigest implements Serializable {
// For serialisation, overload ObjectInput/OutputStream reflection
private static final ObjectStreamField[] serialPersistentFields = {
new ObjectStreamField("digestEncodeWithBytes", byte[].class)
};
// Or use getBytes(ByteBuffer) instead
public abstract byte[] getBytes();
// Serialization create the object which implements this abstract class, so call a method to init it
protected abstract void initWithBytes(byte[] b);
/**
* Overload {@link ObjectOutputStream#defaultWriteObject()}
*
* @param oos object output stream use for write
* @throws IOException error during write process
*/
private void writeObject(ObjectOutputStream oos) throws IOException {
oos.putFields().put("digestEncodeWithBytes", getBytes());
oos.writeFields();
}
/**
* Overload {@link ObjectInputStream#defaultReadObject()}
*
* @param ois object input stream use for read
* @throws IOException error during read process
* @throws ClassNotFoundException class use in serialised stream unknown
*/
private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
Object o = ois.readFields().get("digestEncodeWithBytes", null);
if (o != null) {
byte[] b = (byte[]) o;
initWithBytes(b);
}
}
} You can also overwrite deserialisation with : link class HackedObjectInputStream extends ObjectInputStream {
public HackedObjectInputStream(InputStream in) throws IOException {
super(in);
}
@Override
protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
ObjectStreamClass resultClassDescriptor = super.readClassDescriptor();
if (resultClassDescriptor.getName().equals("oldpackage.Clazz"))
resultClassDescriptor = ObjectStreamClass.lookup(newpackage.Clazz.class);
return resultClassDescriptor;
}
} |
Isn't it possible to add a Transcoder interface which has You can provide a default transcoder which does serialization similar to how it's done today (with bugfixes), while keeping the flexibility of accepting PRs from others that implement transcoders for other formats. If this line of thought makes sense to you and more importantly if we can come up with a common model to represent all digests, I will be happy to contribute a PR! A little background: we have a project that does jvm profiling for clusters and all our components talk Protobuf. We deal with latencies, scheduler delays and other metrics and streaming quantiles make a lot of sense. Hence the interest in adopting tdigest but it does not have protobuf support today nor the design is friendly enough to submit a PR for the same |
Are you suggesting that we have different serialization models for, say,
protibuf and any other format?
…On Jun 28, 2017 11:29 AM, "Ankur Jain" ***@***.***> wrote:
Isn't it possible to add a Transcoder interface which has byte[]
asBytes(SerializationModel) and SerializationModel fromBytes(byte[])
methods? SerializationModel here being a java object that is good enough to
represent any digest. TDigest interface can now have a create method that
takes SerializationModel as input which all digests implement.
You can provide a default transcoder which does serialization similar to
how it's done today (with bugfixes), while keeping the flexibility of
accepting PRs from others that implement transcoders for other formats. If
this line of thought makes sense to you and more importantly if we can come
up with a common model to represent all digests, I will be happy to
contribute a PR!
A little background: we have a project
<https://github.com/Flipkart/fk-prof> that does jvm profiling for
clusters and all our components talk Protobuf. We deal with latencies,
scheduler delays and other metrics and streaming quantiles make a lot of
sense. Hence the interest in adopting tdigest but it does not have protobuf
support today nor the design is friendly enough to submit a PR.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSetRnVRmB9xUdO1FkVyYNZFR3i6w0ks5sIpt8gaJpZM4NClUZ>
.
|
No. Basically, I wanted to enable bring your own serialization logic with some sane ones provided by the lib. By SerializationModel, I am implying one java class something on the lines of: class SerializationModel {
int encodingType; double min; double max; float compression; // ... and other members
}
class MergingDigest {
// ...
static MergingDigest fromSerializationModel(SerializationModel);
}
interface Transcoder {
byte[] asBytes(SerializationModel);
SerializationModel fromBytes(byte[]);
}
class ProtobufTranscoder implements Transcoder {
// ...
} This keeps the byte magic out of digest implementations and they are responsible for generating a uniform on-heap representation (which is SerializationModel, I know this name sucks) If you have any other ideas on how tdigest can support multiple serialization formats (and hopefully you consider that as a feature to be added in the first place), would love to hear about them. |
I think your design is fine. It bottlenecks the essential qualities of the
t-digest.
How about we find a better name for the SerializationModel (it might have
other uses, even). Here are some suggestions that are probably all
problematic, but might get your thoughts going:
DigestEssence
DigestInfo
DigestBasics
NopDigest
I am happy to see a pull request on this.
…On Wed, Jun 28, 2017 at 2:12 PM, Ankur Jain ***@***.***> wrote:
No. Basically, I wanted to enable *bring your own serialization logic*
with some sane ones provided by the lib. By SerializationModel, I am
implying *one* java class something on the lines of:
class SerializationModel {
int encodingType; double min; double max; float compression; // ... and other members
}
class MergingDigest {
// ...
static MergingDigest fromSerializationModel(SerializationModel);
}
interface Transcoder {
byte[] asBytes(SerializationModel);
SerializationModel fromBytes(byte[]);
}
class ProtobufTranscoder implements Transcoder {
// ...
}
This keeps the byte magic out of digest implementations and they are
responsible for generating a uniform on-heap representation (which is
SerializationModel, I know this name sucks)
Users can instantiate any out of the box transcoder or write their own. A
way to maintain backward compatibility would be for MergingDigest, etc to
optionally take a transcoder as constructor param and use that (or default
one if not provided) to keep existing asBytes, fromBytes method working.
If you have any other ideas on how tdigest can support multiple
serialization formats (and hopefully you consider that as a feature to be
added in the first place), would love to hear about them.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSesAY04FzicNDHZsPsrRBi9sfOaSHks5sIsHAgaJpZM4NClUZ>
.
|
Or DigestModel? I see this as a dto representing digests. |
You mention interchangeability of digest implementations. Presently, MergingDigest and AvlTreeDigest serialize different values so their serialized representation is not equivalent. Will need your inputs on structure of DigestModel and what fields of MergingDigest/AvlTreeDigest map to DigestModel fields. |
DigestModel sounds fine!
…On Thu, Jun 29, 2017 at 1:59 AM, Ankur Jain ***@***.***> wrote:
Or DigestModel? I see this as a dto representing digests.
Sure, I have some free time next week so expect a PR then.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSep3HR4T2pY3XN19AhSJZeFZkiNnUks5sI2drgaJpZM4NClUZ>
.
|
@tdunning Need your inputs on what DigestModel looks like. As mentioned in previous comment, presently different fields are serialized across digest implementations |
Yes, currently different fields are serialized, but we should synchronize
that.
The minimum set includes:
- compression factor
- min and max values observed
- centroid positions
- centroid weights
Implementations that use point by point insertion may also have special
fields to maintain sort ordering of centroids with the same location (due
to repeated data points, generally). This isn't done yet, but I think it
may get added soon. I am pretty sure that this additional information does
not need to be included in the serialized form if we make the assumption
that centroids are serialized in the internal ordering.
The MergingDigest also stores an extra sizing parameter which determines
the size of the input buffer. This can be dealt with by adding a parameter
to the DigestModel constructor or by just using a default value. Recent
speedups have made this parameter much less important.
Do you see other variations?
…On Wed, Jul 5, 2017 at 2:43 AM, Ankur Jain ***@***.***> wrote:
@tdunning <https://github.com/tdunning> Need your inputs on what
DigestModel looks like. As mentioned in previous comment, presently
different fields are serialized across digest implementations
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSetbdm61jI_Yp5jkcmyR-rXf2YYy3ks5sK1q_gaJpZM4NClUZ>
.
|
Apart from bufferSize, MergingDigest serializes Also, oddly MergingDigest#asBytes serializes tempMean.length but MergingDigest#fromBytes (verbose) does not expect it to be present in buffer. Is it a known issue or am I missing something there? |
In that context, `lastUsedCell` is just the number of active centroids. The
rest of the array is definitely zero filled (and this is even tested, I
believe).
…On Sat, Jul 8, 2017 at 8:05 AM, Ankur Jain ***@***.***> wrote:
Apart from bufferSize, MergingDigest serializes lastUsedCell. For verbose
encoding, this is deserialized and set as a property in digest. For small
encoding, lastUsedCell additionally represents the values to read and
assign in mean/weight array. Rest of the array seems to be zero-filled
since n(array size) is part of serde as well
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSeibIamsN_PyYXr4Dlj3l_7ik5J8Eks5sL5rFgaJpZM4NClUZ>
.
|
@anvinjain Do you have a PR coming? We need to cut the new major release. If you can't get this in soon, it will have to wait. |
No, PR will take some time. Have paused work on this for few weeks.
…________________________________
From: Ted Dunning <[email protected]>
Sent: Wednesday, August 2, 2017 5:16:10 AM
To: tdunning/t-digest
Cc: Ankur Jain; Mention
Subject: Re: [tdunning/t-digest] Future-proof serialization (#87)
@anvinjain<https://github.com/anvinjain> Do you have a PR coming?
We need to cut the new major release. If you can't get this in soon, it will have to wait.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#87 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AITW_tmwQrLZSQmX7pOKU20z0IJc3aEaks5sT7jCgaJpZM4NClUZ>.
|
OK.
I will leave the current serialization in place but will leave a note that
we may add constructors and methods dealing with this in the future as a
non-breaking API change.
…On Tue, Aug 1, 2017 at 7:43 PM, Ankur Jain ***@***.***> wrote:
No, PR will take some time. Have paused work on this for few weeks.
________________________________
From: Ted Dunning ***@***.***>
Sent: Wednesday, August 2, 2017 5:16:10 AM
To: tdunning/t-digest
Cc: Ankur Jain; Mention
Subject: Re: [tdunning/t-digest] Future-proof serialization (#87)
@anvinjain<https://github.com/anvinjain> Do you have a PR coming?
We need to cut the new major release. If you can't get this in soon, it
will have to wait.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://github.com/
tdunning/t-digest#87#issuecomment-319527755>, or mute the thread<
https://github.com/notifications/unsubscribe-auth/AITW_
tmwQrLZSQmX7pOKU20z0IJc3aEaks5sT7jCgaJpZM4NClUZ>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSeovor2NcFgBgj40EPC3BvuStkWw_ks5sT-I7gaJpZM4NClUZ>
.
|
This is being pushed beyond the 3.2 release. |
@tdunning I have raised a PR around this. |
Awesome. I will get to this in a few days.
…On Mon, Jan 1, 2018 at 12:57 AM, Ankur Jain ***@***.***> wrote:
@tdunning <https://github.com/tdunning> I have raised a PR around this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSeiXpmjc1tXRKGip1G9b6qwKwIHKPks5tGJ4VgaJpZM4NClUZ>
.
|
@tdunning Did you get a chance to look at the PR? |
No. I completely spaced looking at it.
Thanks for the ping. Very helpful.
…On Wed, Feb 14, 2018 at 3:17 AM, Ankur Jain ***@***.***> wrote:
@tdunning <https://github.com/tdunning> Did you get a chance to look at
the PR?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSes8xyIWB6juUO7VsD2d4uBMacSx_ks5tUsC3gaJpZM4NClUZ>
.
|
@tdunning looks like the PR was never merged. This is of interest to us too, so if there are any review comments we can take a look at addressing them too. Thanks in advance. |
So ... let's review this again with a lot of experience since last time.
Some thoughts:
- there will be new implementations of digests. It would be nice to be able
to deprecate old formats and use the binary data for new implementations.
What does that imply?
- some digests would like to add special extra data. how does that conflict
with the first point?
- how does this flexibility best play API-wise?
…On Thu, Jan 14, 2021 at 10:35 AM scheler ***@***.***> wrote:
@tdunning <https://github.com/tdunning> looks like the PR was never
merged. This is of interest to us too, so if there are any review comments
we can take a look at addressing them too. Thanks in advance.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#87 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6X5HQNV4PCIZQPWIXDSZ42OLANCNFSM4DIKKUMQ>
.
|
Here's my proposal: #162
Example for tdigest.serializerClass=org.example.CustomSerializer
someProperty=someValue
someOtherProperty=someOtherValue This is the current implementation of TDigestDefaultSerializer public class TDigestDefaultSerializer extends AbstractTDigestSerializer<TDigest, byte[]> {
public TDigestDefaultSerializer(Properties properties) {
super(properties);
}
@Override
public byte[] serialize(TDigest tDigest) throws TDigestSerializerException {
ByteArrayOutputStream baos = new ByteArrayOutputStream(5120);
try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
out.writeObject(tDigest);
return baos.toByteArray();
} catch (IOException e) {
throw new TDigestSerializerException(e);
}
}
@Override
public TDigest deserialize(byte[] object) throws TDigestSerializerException {
try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(object))) {
return (TDigest) in.readObject();
} catch (ClassCastException | ClassNotFoundException | IOException e) {
throw new TDigestSerializerException(e);
}
}
} This is the code for AbstractTDigestSerializer public abstract class AbstractTDigestSerializer<T extends TDigest, O extends Object> implements TDigestSerializer<T, O> {
private Properties properties;
public AbstractTDigestSerializer(Properties properties) {
this.properties = properties;
}
public Properties getProperties() {
return properties;
}
} Hope this is being seen as future proof. There's still some work to do:
|
Currently, we depend on Java's serialization which is not a good strategy.
It would be much better to do our own and make all the different kinds of Digest use interchangeable formats.
The text was updated successfully, but these errors were encountered: