
pubsub: not acked quickly #684

Closed
junghoahnsc opened this issue Jun 27, 2017 · 23 comments
Labels: api: pubsub, priority: p1, type: bug

Comments

junghoahnsc commented Jun 27, 2017

Hello,

After I upgraded the google-cloud-go library (from 686a2d1 to f5c3fe2), Pub/Sub messages no longer seem to be acked as quickly as before. All messages are acked within a second, but "Oldest Unacknowledged Message" went up from less than 2 min to 30+ min, and "Undelivered Messages" went up from 1~2k to 10k.

Does 'ReceiveSettings.MaxOutstandingMessages' affect this?
When I decreased it to around 256, those metrics went down a bit.

The pulled (received) messages are handled concurrently, so they can be acked in a different order.

Is there any change that might affect this?

Thanks,

@jba jba self-assigned this Jun 27, 2017
@jba jba added the type: bug, priority: p1, and api: pubsub labels Jun 27, 2017

jba commented Jun 27, 2017

There have been some major changes.

I assume you switched from Pull to Receive? Can we see your Receive call?

junghoahnsc (Author) commented:

Yes, I switched from Pull to Receive.
Here is the snippet.

if maxOutstandingMessages > 0 {
    sub.ReceiveSettings.MaxOutstandingMessages = maxOutstandingMessages
}
// The handler only enqueues the message; separate workers ack or nack it later.
handler := func(_ xcontext.Context, m *pubsub.Message) { f.queue <- m }
if err := sub.Receive(ctx, handler); err != nil {
    ...
}

And there are some workers that read the messages from the queue and call Ack() or Nack() after handling each message.
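
For context, a minimal sketch of that worker setup (startWorkers, numWorkers, and handleMessage are placeholder names, not from the actual code):

import "cloud.google.com/go/pubsub"

// startWorkers is a hypothetical worker pool matching the setup described above:
// each worker drains the queue and acks or nacks once handling is done.
func startWorkers(queue <-chan *pubsub.Message, numWorkers int, handleMessage func(*pubsub.Message) error) {
    for i := 0; i < numWorkers; i++ {
        go func() {
            for m := range queue {
                if err := handleMessage(m); err != nil {
                    m.Nack() // let Pub/Sub redeliver later
                    continue
                }
                m.Ack()
            }
        }()
    }
}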


pongad commented Jun 28, 2017

@junghoahnsc

All messages are acked within a second

Do you mean you call m.Ack() within a second?

I have a couple of conjectures.

  1. The Receive method already processes messages concurrently, but no more than MaxOutstandingMessages at a time. It acquires a semaphore before calling the handler and releases it after the handler returns. So when the semaphore is released, signalling that more messages can be pulled, the old message may still be waiting to be processed. Depending on how the channel is buffered and how many consumers there are, this can increase the ack time.

  2. We pull messages in batches. Previously the default batch size was 100; now it's controlled by MaxOutstandingMessages and defaults to 1000.

Together, these might explain the "Oldest Unacknowledged Message" metric. Is it possible for you to try processing the messages within the handler itself? You should get about the same concurrency, since each call to the handler happens in its own goroutine. If your machine cannot handle processing so many messages at once, reducing MaxOutstandingMessages should help.

I'm not exactly sure what "Undelivered Messages" reports. The increase (though I'm not sure about its magnitude) makes sense if it also counts messages being redelivered. Pub/Sub automatically extends the message deadline, eventually giving up after ReceiveSettings.MaxExtension, which defaults to 10 minutes. So if many messages are taking 30 minutes to ack, I can see this being the result. If my conjecture is right, processing messages in the handler should help.

Please let us know if this helps!
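
For reference, a minimal sketch of processing inside the handler, per the suggestion above (receiveInHandler and handleMessage are hypothetical names, not part of the library):

import (
    "context"

    "cloud.google.com/go/pubsub"
)

// receiveInHandler processes each message inside the Receive callback instead of
// forwarding it to a separate worker channel, so MaxOutstandingMessages bounds
// both the outstanding messages and the concurrent handler goroutines.
func receiveInHandler(ctx context.Context, sub *pubsub.Subscription, handleMessage func(*pubsub.Message) error) error {
    sub.ReceiveSettings.MaxOutstandingMessages = 256 // tune to what the machine can process at once
    return sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        if err := handleMessage(m); err != nil {
            m.Nack()
            return
        }
        m.Ack() // the flow-control slot is freed when this callback returns
    })
}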

junghoahnsc (Author) commented:

Do you mean you call m.Ack() within a second?

Yes, right.

Is it possible for you to try processing the messages within the handler itself?

Yes, I can try that, at least for testing. Is each message passed to the handler in its own goroutine?
Are all messages, up to MaxOutstandingMessages, handled concurrently?
Do you expect different behavior than handling the messages in our own fixed number of goroutines?
(We want to control the number of active handlers.)

I'm not exactly sure what "Undelivered Messages" reports

I can expect this to increase if messages are not acked properly. But all messages are acked by calling Ack() within a second. That's why I guess the acks might not be delivered properly.

Thanks!


pongad commented Jun 28, 2017

Is each message passed to the handler in its own goroutine?
Are all messages, up to MaxOutstandingMessages, handled concurrently?

The answer to both is yes. Can you control the number of active handlers with MaxOutstandingMessages (MOM for short)? Receive guarantees that there are at most MOM goroutines running handlers at once.

I expect some changes in behavior. Receive limits the number of concurrent calls to the handler. Your handler puts the message on the channel and exits, so Receive mistakenly assumes that you've already finished with the message and pulls more. Now new messages have to sit and wait, since your worker goroutines are probably still busy with the old messages. I think processing the message in the handler should alleviate this.

That said, maybe Receive should wait until you call m.Ack or Nack before releasing the semaphore. @jba, was this considered? I can pick up the task if you're busy.

junghoahnsc (Author) commented:

Thanks! I'll try it and update with the results.

BTW, one more quick question: if I want to delay the re-delivery of a nacked message as much as possible, what is the best way? Is a message re-delivered as soon as Nack is delivered? Should I skip calling Nack to delay re-delivery? Or does it not matter?

junghoahnsc (Author) commented:

I just tried processing the messages within the handler itself.
After that, the Acknowledge Requests rate became relatively stable instead of spiking,
but it didn't help reduce Undelivered Messages or Oldest Unacknowledged Message.
When I increased MaxOutstandingMessages, e.g. to 4K, they seemed to get even worse.

@pongad pongad self-assigned this Jun 29, 2017

pongad commented Jun 29, 2017

Is a message re-delivered as soon as Nack is delivered?

The client library doesn't do anything immediately after you call Nack. For performance, we try to batch a few calls together to amortize the RPC overhead. In the current implementation, we wait for at most 1/10th of the subscription's AckDeadline before we send the batch. The behavior could change in the future, though. Once the library actually contacts the server to nack, the message should become immediately available for redelivery.

If I want to delay the re-delivery of a nacked message as much as possible, what is the best way?

I think there are a couple of options. If you use Pub/Sub without the client library, your message will expire and be redelivered after AckDeadline. However, if you use the library, it assumes that you're still busy processing the message and automatically extends the deadline for you, up to MaxExtension. After that, your message will expire and be redelivered as normal. So if you want it redelivered at some point (not immediately) in the future, this might work for you.

If you need to nack at a specific time, I think you could call Nack from time.AfterFunc.

A potential issue with both approaches is that the message could be kept in memory for a while. If there are too many messages like this, your memory consumption might be high.

Could you let us know why you need to delay delivery? We would very much like to know if it's a common problem.
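
A minimal sketch of the time.AfterFunc idea (delayNack and the delay d are hypothetical; the message stays in memory, and the library keeps extending its deadline up to MaxExtension, until the timer fires):

import (
    "time"

    "cloud.google.com/go/pubsub"
)

// delayNack holds on to the message and nacks it after roughly d,
// so it becomes available for redelivery only then.
func delayNack(m *pubsub.Message, d time.Duration) {
    time.AfterFunc(d, func() { m.Nack() })
}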


pongad commented Jun 29, 2017

@junghoahnsc Unfortunately I couldn't reproduce this problem. I tested the library against the load-test framework and it showed better throughput and lower Undelivered Messages and Oldest Unacknowledged Message than the old library.

Do you know of a way to reproduce this? The handler function in the load test is rather trivial, so that might be causing the difference.

junghoahnsc (Author) commented:

@pongad Thanks for the detailed explanation and looking into this.

Could you let us know why you need to delay delivery? We would very much like to know if it's a common problem.

One of our services depends on an external service. When it fails to handle a message because of that external service, we'd like to delay the reprocessing and give higher priority to new messages.
But we observed that those old, failed messages are sometimes re-delivered quickly.

Do you know of a way to reproduce this? The handler function in load test is rather trivial, so that might be causing the difference.

Yes, I guess it depends on the handler. Other jobs work fine after upgrading the library; I can see this with only one job right now. I'm investigating why this job shows worse ack behavior.

BTW, all of the jobs showed a weird behavior after upgrading: they stopped pulling from one of the subscriptions after a few hours, while pulling from other subscriptions kept working fine. This Pub/Sub topic is different from the others in that it receives a bulk of messages periodically. At the time I was passing grpc.WithKeepaliveParams (with 30 seconds) to pubsub.NewClient(). After I removed this option, it works fine.

junghoahnsc (Author) commented:

I added a new subscription to the topic we had the issue with and ran a job with a dummy handler.
Surprisingly, it showed the same behavior: the same 'Acknowledge Requests' rate and 'Undelivered Messages'...

I'm looking into why only this Pub/Sub topic shows this behavior.

junghoahnsc (Author) commented:

Update: I tried a dummy handler that just calls Ack, but it showed exactly the same behavior as the original one, so I guessed it's not a handler issue. Then I tried changing the Acknowledgment Deadline of the subscription. It was 120s before; after I changed it to 60s, 30s, and then 10s, it showed a much higher pulling and ack rate than before. I didn't touch the job when I made the changes.

Is this expected? I'm wondering why the Acknowledgment Deadline affects the pulling (or ack) rate.


pongad commented Jul 3, 2017

@junghoahnsc This is very interesting information. I actually thought about something like this. Our implementation batches acks together and sends them every (D-5)/2 seconds, where D is the message deadline in seconds. If D is set longer, we ack less often.
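
For example, with your subscription's original 120 s deadline, acks would be flushed roughly every (120 - 5)/2 ≈ 57 s, whereas at a 10 s deadline they go out roughly every 2.5 s, which would explain the much higher ack rate you saw after lowering the deadline.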

However, this logic has been around since well before commit 686a2d1. If it were causing problems, it shouldn't have gotten worse when you updated. I assume you did not change the ack deadline when you updated the library? @jba, does this new info ring any bells for you?

FWIW, the Java implementation sends acks every 100ms. Should we do the same? @kamalaboulhosn (Sorry I keep spamming the two of you.)

junghoahnsc (Author) commented:

@pongad Yes, I didn't change the ack deadline when I updated the library.

kamalaboulhosn (Contributor) commented:

@pongad I think following the Java implementation makes sense. Quicker acks are generally better because, in the event the client shuts down, the buffered acks could be lost and messages could be redelivered.

junghoahnsc (Author) commented:

The testing job that just acks in the handler stopped pulling after showing this error:

A  I0704 11:46:33.796467       1 glogger.go:43] transport: http2Client.notifyError got notified that the client transport was broken EOF.

I have hit this issue twice in the last 3~4 days.
I think the Receive() call should return an error or reconnect automatically, but it doesn't seem to do either.


pongad commented Jul 5, 2017

I believe this is not a problem with the client library but with gRPC. It feels suspiciously similar to grpc/grpc-go#954. You said this topic only receives batches of messages periodically? That makes me wonder if it's a gRPC problem.

In some other languages, we had a bug where a gRPC request would fail if the connection stayed idle for too long. I'm not sure whether this affects Go. Is your gRPC up to date?

If updating doesn't help, is it possible for you to share a gRPC trace with us? In your test application:

import _ "golang.org/x/net/trace" // registers the /debug/events and /debug/requests handlers

grpc.EnableTracing = true // already defaults to true; if you turn it off, re-enable it before making any calls
go http.ListenAndServe("localhost:8080", nil) // serve the debug pages in the background, on whichever port you want

Then you can find traces at http://localhost:8080/debug/events and http://localhost:8080/debug/requests. I hope this will tell us what's going on.

Thank you for your patience.

gopherbot pushed a commit that referenced this issue Jul 5, 2017
See discussions in #684 (comment).
Load testing shows no performance regression.

Change-Id: I45761540d2b248292b52118894d37712c0bfa654
Reviewed-on: https://code-review.googlesource.com/14472
Reviewed-by: kokoro <[email protected]>
Reviewed-by: Jonathan Amsterdam <[email protected]>

pongad commented Jul 5, 2017

@junghoahnsc This commit makes us ack more often. In theory, I think it should improve your metrics the same way reducing the ack deadline did. Could you let us know if it helps you?

junghoahnsc (Author) commented:

@pongad After I restarted the job, it received batches of messages and then stopped again.
The gRPC lib is up to date (we're on 3c33c26290b747350f8650c7d38bcc51b42dc785).

I'll add tracing to the test job and see when this happens again.
Also, I'll try the commit that acks more often.

Thanks!

junghoahnsc (Author) commented:

@pongad The commit that acks more often shows a noticeable difference in our case. We no longer see spikes in Undelivered Messages, and Oldest Unacknowledged Message dropped to less than 1 sec.


pongad commented Jul 6, 2017

Fantastic news! Are the Undelivered Messages and Oldest Unacknowledged Message metrics good enough that we can close that portion of the issue? I won't close this issue yet, so we can keep tracking the notifyError problem here.

We are also in the process of reworking some of the library's internals. Load testing so far shows that it beats the current implementation on both metrics. So if your metrics look acceptable but not completely satisfactory, I think we could wait a little for the rework to land so we don't duplicate effort.

junghoahnsc (Author) commented:

Yes, those metrics are good enough for us.


pongad commented Aug 8, 2017

Since the main concern here is addressed, I'll close this for now. Please reopen if you're still running into problems.

@pongad pongad closed this as completed Aug 8, 2017