Address key protobuf performance issues #1054

Open · louiscryan opened this issue Sep 18, 2015 · 43 comments
@louiscryan
Contributor

This is something of a catch-all issue and should be factored out into separate issues for each individual piece of work. Listed here are a number of areas where the interaction between protobuf and gRPC can be improved to raise overall performance and reduce memory pressure on the system.

In rough order of priority....

  • Implement some basic protobuf benchmarks independent of gRPC (scrape the web or extract Google's) as a framework for experimenting with optimizations
  • Provide the ability to have a zero-copy approach for writing out large binary sequences (ByteBuf, FileRegion, etc.) to the transport
  • Provide the ability to read protobufs so that they can be written out again with zero-copy for large sequences. This includes writes to the transport and to files
  • Document a pattern for transferring a large binary sequence as chunks, plus utility functions to chunk & dechunk them for application code (see the sketch after this list). File transfer is a common use-case here
  • Consider supporting deserialized protobufs backed by pooled buffers (speculative). Would require the notion of a releasable protobuf
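
For illustration, a sketch of the chunk/dechunk utilities from the fourth bullet (names hypothetical; ByteString.copyFrom still copies today, which is exactly what the zero-copy work would remove):

import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class Chunker {
  // Splits a large payload into fixed-size pieces, one per streamed message.
  static List<ByteString> chunk(ByteBuffer payload, int chunkSize) {
    List<ByteString> chunks = new ArrayList<>();
    ByteBuffer buf = payload.duplicate();
    while (buf.hasRemaining()) {
      int len = Math.min(chunkSize, buf.remaining());
      ByteBuffer slice = buf.slice();
      slice.limit(len);
      chunks.add(ByteString.copyFrom(slice)); // copies today; zero-copy is the goal above
      buf.position(buf.position() + len);
    }
    return chunks;
  }

  // Reassembles on the receiving side. ByteString.concat builds a rope,
  // so the chunks are not recopied into one flat array.
  static ByteString dechunk(List<ByteString> chunks) {
    ByteString out = ByteString.EMPTY;
    for (ByteString c : chunks) {
      out = out.concat(c);
    }
    return out;
  }
}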

@nmittler @ejona86 - Feel free to pile on / cc folks

@louiscryan
Contributor Author

@xfxyjwf FYI

@saintstack

Yeah, can we have all the above please (smile).

Over in the project I work on (Apache HBase):

  • The lack of zero-copy is a killer. Give us the rope so we can hang ourselves! We actually rely heavily on this, https://hbase.apache.org/xref/org/apache/hadoop/hbase/util/ByteStringer.html class, where we have hacked java pb so we could do some zero-copy (it is wrapped in the ByteStringer class because our hackery has us up against classloader issues when pb jars get loaded by a classloader other than the application's).
  • RPCing, we implement the old pb generated stubs, so pb is used to describe which method to invoke as well as the parameters passed back and forth. Because pb'ing all data was untenable -- especially as the size and count of elements goes up -- we hacked up an ugly follow-behind 'sidecar' mechanism so we can pass data without having to envelope it all in pb. If the data is small -- describing coordinates, say -- the pb will just carry the parameter, but if large, then the pb notes that there is data following behind and includes meta attributes such as encoding and/or compression. Could the framework do this for us? It's like large binary transfer but not necessarily in chunks (we chunk at a higher level -- not that we wouldn't appreciate the rpc doing this for us). The zero-copy to the transport would help here too.

Thanks.

@louiscryan
Contributor Author

If we give you zero-copy, could the side-car move into the protobuf as a simple binary string? If so, I think we can pretty easily meet your needs.

@saintstack

It could.

We're on pb 2.5 currently. It'd require an incompatible upgrade? Thanks Louis.

@louiscryan
Contributor Author

What are your compatibility gaps with the proto3 runtime? (awaits answer in fear)

@saintstack

Figuring out what the gap is remains a TODO.

@tsuna

tsuna commented Oct 12, 2015

The gap is annoying to deal with at best (I tried to make that switch at some point within Arista and we gave up after going deep enough in that rabbit hole and then deciding we had yaks in a more urgent need of a shave).

@louiscryan
Contributor Author

Would a shaded proto3 runtime help? If you recall where yak shaving was necessary we can take a look.


ejona86 added this to the 1.0 milestone Jan 27, 2016
@saintstack

"Would a shaded proto3 runtime help ?"

We could shade, yeah. Would our 2.5 clients be able to talk to a 3.0 server?

Any movement here or something to try?

Our new stumbling block is CodedOutputStream. We have a DirectByteBuffer whose data is off-heap and we want it to stay off-heap. COS does not support ByteBuffer in 2.5. In 2.6 it does, but it insists on copying data from the off-heap DBB onto the heap.

We are considering forking PB.
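
For context, a sketch of the write path being asked for here. CodedOutputStream.newInstance(ByteBuffer) exists in 2.6, but as noted above it copies the direct buffer's contents onto the heap internally rather than writing off-heap:

import com.google.protobuf.CodedOutputStream;
import com.google.protobuf.MessageLite;
import java.io.IOException;
import java.nio.ByteBuffer;

final class OffHeapWrite {
  // Desired: serialize a message straight into an off-heap buffer.
  static void writeTo(MessageLite message, ByteBuffer direct) throws IOException {
    CodedOutputStream cos = CodedOutputStream.newInstance(direct);
    message.writeTo(cos); // with true zero-copy support, the data never lands on-heap
    cos.flush();
  }
}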

@tsuna

tsuna commented Feb 2, 2016

Forking, ouch. What sorts of arguments does the open-source community need to put forward to convince the protobuf-java maintainers that this issue needs to be tackled upstream?

@nmittler
Member

nmittler commented Feb 2, 2016

@saintstack work is underway to support zero-copy for ByteBuffer. The first part is just adding wrapping support, which is already available on master.

Adding zero-copy support for ByteBuffer to CodedOutputStream is coming soon. Stay tuned.

Agree with @tsuna ... yeah, let's not fork if we can avoid it :)

@saintstack

That doesn't look bad, @nmittler. Would it be proto3 only?

@nmittler
Member

nmittler commented Feb 2, 2016

@saintstack I don't believe so. UnsafeByteOperations should be in the next release. @pherl does that sound right?

@liujisi
Member

liujisi commented Feb 2, 2016

Yes. UnsafeByteOperations will be in the next release, and is currently available in master already.
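
For anyone following along, usage is a one-liner; the safety contract is the caller's problem (hence "unsafe"). A minimal sketch:

import com.google.protobuf.ByteString;
import com.google.protobuf.UnsafeByteOperations;
import java.nio.ByteBuffer;

final class WrapExample {
  // No copy: the ByteString aliases the buffer. The caller must ensure the
  // buffer is neither mutated nor reclaimed while the ByteString (or any
  // message built from it) is still in use.
  static ByteString wrapNoCopy(ByteBuffer direct) {
    return UnsafeByteOperations.unsafeWrap(direct);
  }
}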

@enis

enis commented Feb 2, 2016

UnsafeByteOperations is great.

At least one issue that Stack describes above is that we want our RPC buffers to be off-heap DBBs. The data in the sidecar can be decoded by us nicely, but the RPC call header etc. are in PB, so we have to decode those as well using CodedInputStream. In this case, the CIS backed by NIOByteString still allocates a byte[] on-heap and does a full copy:
https://github.com/google/protobuf/blob/master/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L118

@nmittler are you saying that CIS will be changed to not do this extra-copy? If so, that will give us what we want.

@saintstack

What @enis said.

It'd be a CIS that can read from the DBB w/o bringing it on heap (and that does not do byte-at-a-time reads out of the backing DBB but reads longs and words at a time when it can). We'd need the same in COS.

I can try it out np.

@nmittler
Member

nmittler commented Feb 2, 2016

@enis ... Nothing right now is planned for CodedInputStream (only CodedOutputStream). However, we do plan to look into zero-copy reads after we finish up zero-copy writes.

PRs are always welcome if you don't want to wait :)

@enis

enis commented Feb 2, 2016

Great. I think we (in HBase community) can come up with the PR for CIS over BB / BS.

@anoopsjohn

Thanks. Yes, we need COS to work with DBB with zero copy as well. Happy to work on the CIS part.

@nmittler
Member

nmittler commented Feb 3, 2016

@enis @anoopsjohn sgtm. I'm hoping to have the COS work sorted out soon, which might help provide some direction for CIS ... we'll see :)

@ramkrish86

@nmittler Thanks for the inputs/discussions. We can work on CIS based on the direction of the COS work.

hsaliak added the P2 label Apr 26, 2016
hsaliak modified the milestones: Unscheduled, 1.0 Apr 26, 2016
@hsaliak
Contributor

hsaliak commented Apr 26, 2016

Not targeting a release, but actively being worked on.

@nmittler
Member

FYI, I've begun work for giving CIS full direct ByteBuffer support. Stay tuned.
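
A sketch of what that should enable. CodedInputStream.newInstance already accepts a ByteBuffer (per the copy in the link above); the work is making the direct-buffer path copy-free. MyMessagePB is a hypothetical generated type:

import com.google.protobuf.CodedInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

final class OffHeapRead {
  // Parse directly from an off-heap buffer instead of copying it on-heap first.
  static MyMessagePB parse(ByteBuffer direct) throws IOException {
    CodedInputStream cis = CodedInputStream.newInstance(direct);
    return MyMessagePB.parseFrom(cis);
  }
}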

@pgrosu

pgrosu commented Aug 16, 2016

Thanks Nathan :)

@anoopsjohn

Thanks Nathan.
In fact I have worked on PB trunk and refactored CIS (like the COS work), so we have byte[]-based and InputStream-based CIS implementations. I also added a ByteInput-based CIS, as we would need that; a DBB-based CIS I did not add. Can you please let me know how/whether I can contribute my change?

@nmittler
Member

Hey @anoopsjohn, would you mind just creating a PR for protobuf and then cc'ing me? I'd be happy to take a look and see if we can absorb those changes for a future protobuf release. Thanks!

@wesm

wesm commented May 14, 2017

Do you know the status of providing a zero-copy read of the message payload in the gRPC Java and C libraries?

I have a use case (sending Apache Arrow streams with gRPC) where I would use a proto message like:

message ArrowMessagePB {
  required bytes serialized_data = 1;
}

I would like to avoid extra memory copies on the server side and on the client side. If this is already supported, or if there is a recommended workaround, I'm very interested.

I was also talking with @lukecwik about this recently. As an aside, in the gRPC docs it says (http://www.grpc.io/docs/guides/concepts.html)

By default, gRPC uses protocol buffers as the Interface Definition Language (IDL) for describing both the service interface and the structure of the payload messages. It is possible to use other alternatives if desired.

Along the lines of this question (zero-copy sends and receives -- e.g. ZeroMQ has pretty good support for this: http://zeromq.org/blog:zero-copy), I haven't been able to find more detail about what skipping the protobuf serde step would involve.

cc @leifwalsh

@wesm

wesm commented May 14, 2017

Related: Apache Kudu built the notion of a zero copy "sidecar" into its RPC subsystem https://github.com/apache/kudu/blob/master/src/kudu/rpc/rpc_sidecar.h#L29

@carl-mastrangelo
Contributor

@wesm If you don't want the protobuf overhead, you can use InputStream directly and skip the copy. It's more complicated, but there is an example of how to send raw ByteBufs around here:

protected CountDownLatch startFlowControlledStreamingCalls(int callsPerChannel,
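
A minimal sketch of that idea, using only the stock io.grpc.MethodDescriptor.Marshaller API (the class name here is made up):

import io.grpc.MethodDescriptor;
import java.io.InputStream;

// Pass-through marshalling: no protobuf step, the application sees the raw bytes.
final class RawBytesMarshaller implements MethodDescriptor.Marshaller<InputStream> {
  @Override
  public InputStream stream(InputStream value) {
    return value; // drained onto the wire by the transport
  }

  @Override
  public InputStream parse(InputStream stream) {
    return stream; // the application must consume this promptly
  }
}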

@jacques-n

I've finally made some progress on this:
https://github.com/jacques-n/arrow/tree/flight/java/flight

Still need to test more, but we should look for a way to incorporate some of the behavior of these two reflection classes directly into gRPC (somehow... some initial thoughts in the classes).

https://github.com/jacques-n/arrow/tree/flight/java/flight/src/main/java/org/apache/arrow/flight/grpc

@cbornet

cbornet commented May 13, 2020

Is there any progress on that? I'm trying to replace a custom Netty protocol with gRPC, but I fear the lack of zero-copy and object pooling will kill performance.

@ejona86
Member

ejona86 commented May 13, 2020

@cbornet, I'd say right now, yes, absolutely, it is possible to make something faster using Netty directly. But that's not something everyone can do, it requires a lot of energy to create and maintain, and there is the question of how much benefit you will gain. What percentage of performance are you willing to lose for ease of use and maintenance? 0.1%, 1%, 10%, 50%?

If you just want raw bytes it isn't so bad, but if you want any serialization format that can easily evolve over time it gets more complicated. A major issue with this work is there is no lifetime for protobuf messages, but that's almost universal with serialization implementations in Java.

If you have a message format you like in your custom approach, we can probably adapt gRPC to be efficient with it. So if you are happy working with Iterable<ByteBuf>, we could make that efficient without too much effort. But I expect most systems need a nicer message format.

FileRegion in general is harder, because we would need to cut it into frames. We'd also lose its benefits when using TLS, and we're not interested in going out of our way to optimize plaintext only. So if we had Kernel TLS support, it might be a more interesting discussion.

For sending proto, I'll note that com.google.protobuf.UnsafeByteOperations appears like it can help. There would still be a copy to Netty buffers, although that is a tractable problem. But the lifetime issues make it hard for a real-life user, as it basically requires you not to use a buffer pool, which kills the benefit of direct memory.

For receiving proto, a lifetime must be established to make it useful, as gRPC has the data in Netty ByteBufs which have lifetimes. And nobody has good ideas as to how to accomplish that. The only idea I have is to have a mode that will require the message to be consumed synchronously within the onMessage() callback; if you retain the message past that point using it may produce corruption and could crash the JVM.

@cbornet

cbornet commented May 14, 2020

@ejona86 thanks for the detailed answer.
To give more context, what I'm doing is developing a gRPC interface as an alternative to the Apache Pulsar binary protocol, which is Protobuf over TCP.
For the message payload, which can be quite big, they don't include it in the protobuf message but wrap it in their own framing structure and send it raw on the Netty channel just after the protobuf message. Server-side, the received ByteBuf payload is at some point duplicated with retainedDuplicate. This seems to handle the Netty ByteBuf lifetime, or is there an issue doing it this way?

In my gRPC adapter, I had to put the payload in a protobuf ByteString, and as I understand it, even if I do:

// send being a protobuf message received in an onNext()
ByteString headersAndPayload = send.getHeadersAndPayload();
ByteBuffer buffer = headersAndPayload.asReadOnlyByteBuffer();
ByteBuf payload = Unpooled.wrappedBuffer(buffer);

there's still a copy from Netty to ByteString, is that correct?

For serialization, I did

ByteBuf metadataAndPayload = entry.getDataBuffer();
ByteString payload = UnsafeByteOperations.unsafeWrap(metadataAndPayload.nioBuffer());

// ... code to call observer's onNext with message and payload

entry.release(); // will release the metadataAndPayload ByteBuf

Does it provide 0-copy? Am I doing something wrong with the lifetimes?

@ejona86
Member

ejona86 commented May 14, 2020

This seems to handle the Netty ByteBuf lifetime, or is there an issue doing it this way?

No, that approach is fine. The ManagedLedger has a lifetime and is ByteBuf-aware, so it can call release() when appropriate (at the very least for ledger.close()/delete()).

there's still a copy from Netty to ByteString, is that correct?

Yes. The problem is ByteBuffer. It has no explicit lifetime. gRPC cannot use UnsafeByteOperations.unsafeWrap() because it would have to hold a reference on the ByteBuf until the ByteString is no longer used. And gRPC has no idea when it is no longer used by the application.

Does it provide 0-copy? Am I doing something wrong with the lifetimes?

If gRPC optimized reading ByteBuffers from the protobuf (which I believe is feasible), then that would be zero-copy.

But the lifetime is wrong! It isn't safe to call entry.release() until payload/ByteBuffer is no longer being used, and gRPC will totally still be using it when onNext() returns, if doing zero-copy. So yeah, that would allow corruption and crashes.


So you should see a pattern: ByteBuf allows zero-copy with pooling. ByteBuffer can only do zero-copy without pooling. Explicit object lifetimes are essential for pooling. Object lifetimes are "viral" in that anything holding an object with a lifetime must have a lifetime. Protobuf messages have no lifetime, thus you either have to copy out of a pooled object or figure out a side-band method for communicating the lifetime.
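
To make the "viral" part concrete, a minimal sketch: any object holding a ref-counted buffer has to grow a lifetime of its own.

import io.netty.buffer.ByteBuf;

final class PayloadHolder implements AutoCloseable {
  private final ByteBuf buf;

  PayloadHolder(ByteBuf buf) {
    this.buf = buf.retain(); // take a reference for this holder's lifetime
  }

  ByteBuf payload() {
    return buf;
  }

  @Override
  public void close() {
    buf.release(); // hand the reference back to the pool
  }
}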

@cbornet

cbornet commented May 15, 2020

gRPC cannot use UnsafeByteOperations.unsafeWrap() because it would have to hold a reference on the ByteBuf until the ByteString is no longer used. And gRPC has no idea when it is no longer used by the application.

Maybe this could be proposed as an expert mode where the application would have to release the ByteBuf itself? A bit like how it's done in Netty's ChannelInboundHandlerAdapter. Or you do the release in gRPC, but the application has the possibility to retain/release the ByteBuf on its side.

But the lifetime is wrong! It isn't safe to call entry.release() until payload/ByteBuffer is longer being used, and gRPC will totally be using it still when onNext() returns, if doing zero-copy

Indeed! Would it be possible then to register a callback or listener to get notified when gRPC has handled the message and it's safe to release the resources?

@ejona86
Member

ejona86 commented May 15, 2020

I spoke some with @cbornet on Gitter about using pooled buffers when sending, specifically how the entry.release(); after onNext() is unsafe. This seemed important to dump here:

Given that you can continue sending even after the call is dead, I don't think there is a simple way [to know when release() is possible] with the current API. It would probably need to be a per-message callback, that could occur even after ClientCall.Listener.onClose()/ServerCall.Listener.onCancel().

Hmm... you'd probably need some custom state to attach the callback, or you pass a different callback for each message. Sorta seems like we'd need a new onNext(Message, Runnable onRelease) method
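
Usage of that hypothetical method would look something like:

// Release the pooled buffer only when gRPC signals it is done with the
// message, which may be after onClose()/onCancel().
observer.onNext(createMsg(byteBuf), byteBuf::release);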

@cbornet

cbornet commented May 15, 2020

Big +1 for onNext(Message, Runnable onRelease).

@ejona86
Member

ejona86 commented May 19, 2020

With my API proposal, I think some people may think this is "solved" now. There are plenty of details remaining, though; I'm dumping some of my thoughts on them here.

onNext(Message, Runnable) actually creates a bit of a mess for interceptors. It seems like it'd break any interceptor looking at the message. Now, it would only impact you if you called the new method, but it would be a pain for interceptor implementors. So concretely it's probably too invasive, but conceptually it seemed fair.

I realized we sort of already have such a callback: the InputStream.close() after the message is serialized. However, that isn't directly useful, because gRPC may serialize multiple times for retry. It also doesn't help us with Protobuf specifically because we need to go "around" the protobuf API (although for something like Pulsar that isn't a problem). It could maybe be used for receiving instead of sending, or maybe we could change how retries are implemented.

It may be fair to mirror what we did for compression: add a method to change the current "message releaser", and let you change that each message if you'd like. It would not be thread-safe, and the "current" "releaser" when onNext() is called will be the releaser called when the message is no longer used.

// Not a great name, but fine for demonstration purposes
abstract class MessageReleaser<T> {
  public abstract void release(T msg);
}

// For protobuf
so.setReleaser(new MyReleaser(bytebuf));
so.onNext(createMsg(bytebuf));
so.setReleaser(new MyReleaser(bytebuf2));
so.onNext(createMsg(bytebuf2));

// For something like Pulsar with built-in lifecycle
so.setReleaser(new PulsarReleaser());
so.onNext(msg1);
so.onNext(msg2);

While that works, I think it does expose a flaw in the approach: Pulsar-like-formats should "just work" without the need to set the releaser. Which would imply it maybe should be part of the Marshaller. I don't know if we make this part of the Marshaller as well or exclusively, along with a shim to allow it for Protobuf. For client-side I can see how we could have an interceptor shim that would wrap the Marshaller and you'd call interceptor.setReleaser() methods instead of on the StreamObserver (you'd use a new interceptor each RPC). Server-side is less clear, unless we use Context which would be functional but feels abusive.

In general, I'm also seeing trouble with interceptors. Some interceptors may "drop messages on the floor." Those would be incompatible with this approach, but there are probably few enough of those interceptors you could probably "fix them." We'll need to think of how "natural" it will be for them to implement proper semantics. The unfortunate part is it would be hard to discover interceptors that were leaking messages.

Internally, we will need NoopClientStream, RetriableStream, and okhttp, netty, and inprocess streams to support this. Note that RetriableStream will need to do coordination between the sub-streams it creates. It may make sense for this to be explicit (like onNext(Message, Runnable)) on our stream API.

Since this call can happen after the RPC is closed, we can't use the normal executor for this, except maybe if we had a fallback executor. We generally don't want to run callbacks from our network thread, since it is trivial for applications to accidentally block within them, but maybe in this case we make an exception. Unclear.

@emkornfield

So you should see a pattern: ByteBuf allows zero-copy with pooling. ByteBuffer can only do zero-copy without pooling. Explicit object lifetimes are essential for pooling. Object lifetimes are "viral" in that anything holding an object with a lifetime must have a lifetime. Protobuf messages have no lifetime, thus you either have to copy out of a pooled object or figure out a side-band method for communicating the lifetime.

This is a naive question, but would the approach of adding a different mode to protobuf that generates classes that support lifecycle be a feasible way of handling this? A quick sketch of what I mean:

message MyMessage {
  required bytes bytes_field = 1;
}

could generate something like:

class MyMessage extends Message implements AutoCloseable {
    private ByteBuf bytesField;

    // Callers are allowed to increment reference counts to this object but are
    // responsible for releasing it.
    ByteBuf getBytesFieldByteBuf() {
        return bytesField;
    }

    @Override
    public void close() {
        bytesField.release();
    }
}

@ejona86
Member

ejona86 commented Jul 13, 2020

would the approach of adding a different mode to protobuf that generates class that support lifecycle be a feasible way of handling this

Any change along these lines is highly likely to be an API-breaking change. The problem is that even if only one message had this support, all messages must transitively gain the support as well. If a message contains that "one special message," then it needs to have a lifetime as well. Since protobuf requires that adding a field to a message is backward-compatible, that means that all messages are potential containers for a lifetime-constrained message and would need to have their own lifecycles ("just in case").

There are some approaches to try to reduce the impact of such a change, like by requiring each message to explicitly enable the behavior via an option. So you could only have fields of lifecycle-based messages in a lifecycle-based message. But that would produce cross-language compatibility problems. It would also cause trouble with generic Message APIs, that don't care about the precise protobuf type.

And remember that ByteBuffer in Java does not have a lifetime. ByteBuf in Netty does, but Protobuf can't easily depend on Netty. So you'd need to provide a Factory for creating instances and a function to release instances.

@emkornfield

@ejona86 thanks for the detailed response.

Any change along these lines is highly likely to be an API breaking change. The problem is that even if only one message had this support, all messages must transitively gain the support as well. If a message contains that "one special message,"

Agreed. I think this would have to be an all-or-nothing compilation mode, plus potentially the additional options that were mentioned. I think the question is: would it be possible to decouple this sufficiently and reduce code duplication from the API change? Or is the callback approach outlined above preferable? From gRPC core, it seems like it would need to understand an API which can transfer raw memory buffers to a deserializer?

But that would produce cross-language compatibility problems.

I don't understand this. I thought we were specifically discussing Java (or at least that was what I had cached in my head)? What other languages would this impact? Other JVM based languages?

It would also cause trouble with generic Message APIs, that don't care about the precise protobuf type

This likely depends on the flow of ownership but I agree it is a concern. Are there specific APIs related to gRPC that you are concerned with?

And remember that ByteBuffer in Java does not have a lifetime. ByteBuf in Netty does, but Protobuf can't easily depend on Netty. So you'd need to provide a Factory for creating instances and a function to release instances.

Since we don't seem close to a solution, it seems like JEP-370 might be a way to make this generic across future JVMs?

@ejona86
Member

ejona86 commented Jul 15, 2020

I think this would have to be an all-or-nothing compilation mode, plus potentially the additional options that were mentioned. I think the question is: would it be possible to decouple this sufficiently and reduce code duplication from the API change?

I don't think all-or-nothing is on the table, as that is basically a new separate protobuf library. You'd need to allow both styles to exist in the same binary (to avoid diamond dependency problems) and so it'd have to be in a new package. You couldn't mix generated code.

But that would produce cross-language compatibility problems.

I don't understand this. I thought we were specifically discussing Java (or at least that was what I had cached in my head)? What other languages would this impact? Other JVM based languages?

If someone modified a proto and wasn't aware of the Java-isms, they could add a field in a non-lifecycle'd message to a lifecycle'd message. That would build fine in all non-Java languages. But now Java is broken and can't communicate with other languages.

It would also cause trouble with generic Message APIs, that don't care about the precise protobuf type

... Are there specific APIs related to gRPC that you are concerned with?

It's a general concern. But gRPC would definitely be impacted. If you use an older gRPC with a newer protobuf (which is fully allowed) then ProtoUtils.marshaller(Message) wouldn't behave correctly and gRPC would totally leak memory.

Since we don't seem close a solution, it seems like JEP-370 might be a way to make this generic across future JVMs?

It'll be helpful... in a decade? Maybe 5 years if we are lucky? Protobuf and gRPC still support Java 7 and it is really hard to tell when it is "acceptable" to drop support for older versions. There are some ways to allow it to be used on newer JVMs while remaining compatible with older JVMs, but you pay a complexity cost for them.

Even with JEP-370 you'll probably still want a factory so you can use a pool of buffers, so using ByteBuffer instead is probably not that big of a difference. Also, JEP-370 just has a single owner of memory; we frequently see the need for full reference counting. There's probably ways to hide that... blah, blah. It still won't be easy. Easier, but not easy.

@ejona86
Member

ejona86 commented Aug 26, 2020

@njhill in #7330 (comment) said:

... I had a thought about a simple way to support ownership transfer - how about a "ParseTakesOwnership" marker interface on the Marshaller? If implemented then it becomes mandatory rather than optional for parse to close the InputStream.

I'm also curious about outbound ownership transfer. What if a message is written which has some backing resource that requires releasing? This could be done when the parsed InputStream is closed but IIUC messages might be serialized multiple times in some cases like retries (or am I wrong about that)?

ParseTakesOwnership marker would be easy to implement. Although if it is unsupported (say, an interceptor has mucked with the marshaller) there would be no indication. Something along that direction could be nice.

Yes, outbound ownership transfer has to be handled separately. And yes, today a message may be marshalled multiple times which means there is no way to know when memory can be released other than via the GC. I'll note there might be difficulty as well if the message is never marshalled.
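
A sketch of how the marker could look, with a trivial byte[] marshaller honoring the contract (names hypothetical; ByteStreams is Guava, already a gRPC dependency):

import com.google.common.io.ByteStreams;
import io.grpc.MethodDescriptor;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Marker: a Marshaller implementing this promises that parse() always
// closes the stream, so the transport can transfer buffer ownership to it.
interface ParseTakesOwnership {}

final class OwningBytesMarshaller
    implements MethodDescriptor.Marshaller<byte[]>, ParseTakesOwnership {
  @Override
  public InputStream stream(byte[] value) {
    return new ByteArrayInputStream(value);
  }

  @Override
  public byte[] parse(InputStream stream) {
    try (InputStream in = stream) { // mandatory close: ownership was transferred
      return ByteStreams.toByteArray(in);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}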
