pubsub: not acked quickly #684
Comments
There have been some major changes. I assume you switched from Pull to Receive? Can we see your Receive call?
Yes, I switched from Pull to Receive.
And there are some workers that read the messages from a queue and call Ack() or Nack() after handling the message.
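For reference, here is a minimal sketch of the setup being described, assuming the cloud.google.com/go/pubsub Receive API; the project ID, subscription name, worker count, buffer size, and handle function are placeholders, not taken from the original report. The handler only hands each message to a channel, and separate worker goroutines call Ack() or Nack() later.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// handle stands in for the application's own processing logic.
func handle(m *pubsub.Message) error {
	// ... process the message ...
	return nil
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription") // placeholder subscription name

	// Buffered queue feeding a worker pool: the Receive handler only enqueues
	// messages; these goroutines do the real work and ack/nack afterwards.
	work := make(chan *pubsub.Message, 1000) // illustrative buffer size
	for i := 0; i < 8; i++ {
		go func() {
			for m := range work {
				if err := handle(m); err != nil {
					m.Nack()
					continue
				}
				m.Ack()
			}
		}()
	}

	// The handler returns as soon as the message is handed off.
	if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		work <- m
	}); err != nil {
		log.Fatal(err)
	}
}
```

This is the arrangement the following comments refer to when they talk about the handler returning before the message is actually processed.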
Do you mean you call Ack() or Nack() from those worker goroutines, outside the handler you pass to Receive? I have a couple of conjectures.
Together, this might explain the "Oldest Unacknowledged Message" metric. Is it possible for you to try processing the messages within the handler itself? You should get about the same concurrency, since each call to the handler happens in its own goroutine. If your machine cannot handle processing so many messages at once, reducing MaxOutstandingMessages should help.

I'm not exactly sure what "Undelivered Messages" reports. The increase (not sure about the magnitude of the increase though) makes sense if it also counts messages being redelivered. Pubsub automatically extends the message deadline, eventually giving up after MaxExtension.

Please let us know if this helps!
Yes, right.
Yes, I can try that, at least for testing. Is each message passed to the handler in its own goroutine?
I can expect this to increase if messages are not acked properly. But all of the messages are acked by calling Ack() within a second. That's why I suspect the acks might not be delivered properly. Thanks!
The answer to both is yes. Can you control the number of active handlers with MaxOutstandingMessages? I expect some changes in behavior. Receive limits the number of concurrent calls to the handler. Your handler puts the message on a channel and exits, so Receive mistakenly assumes that you've already finished with the message and pulls more. Now, new messages have to sit and wait, since your worker goroutines are probably still busy with the old messages. I think processing the message in the handler should alleviate this. That said, maybe we should change Receive to wait until you call Ack or Nack.
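To make that suggestion concrete, here is a rough sketch of processing the message inside the handler itself, with MaxOutstandingMessages capping concurrency; the names and the value 256 are illustrative, not from the thread.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// handle stands in for the application's own processing logic.
func handle(m *pubsub.Message) error { return nil }

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}

	sub := client.Subscription("my-subscription") // placeholder subscription name
	// Cap how many messages are processed at once; 256 is only an example.
	sub.ReceiveSettings.MaxOutstandingMessages = 256

	// Do the work inside the handler itself, so Receive knows when the
	// message is actually finished.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		if err := handle(m); err != nil {
			m.Nack()
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Because Receive runs each handler call in its own goroutine, the work still happens concurrently, but flow control now reflects how many messages are genuinely in flight.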
Thanks! I'll try it and report the result. BTW, one more quick question: if I want to delay the re-delivery of a nacked message as much as possible, what is the best way? Is a message re-delivered as soon as Nack() is called?
I just tried processing the messages within the handler itself.
The client library doesn't do anything immediately after you call Nack. For performance, we try to batch a few calls together to amortize RPC overhead. In the current implementation, we wait for at most 1/10th of the subscription's AckDeadline before we send the batch. The behavior could change in the future though. Once the library actually contacts the server to nack, the message should become immediately available for redelivery.
I think there are a couple of options. If you use pubsub without the client library, your message will expire and be redelivered after the AckDeadline. However, if you use the library, the library assumes that you're still busy processing the message, and it automatically extends the deadline for you until MaxExtension. After this, your message will expire and be redelivered as normal. So, if you want it redelivered at some time (not immediately) in the future, this might work for you. If you need to nack at a specific time, I think you could call Nack with time.AfterFunc?

A potential issue with both is that the message could be kept in memory for a while. If there are too many messages like this, your memory consumption might be large. Could you let us know why you need to delay delivery? We would very much like to know if it's a common problem.
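As a concrete illustration of the time.AfterFunc idea above, the sketch below defers the Nack for a fixed delay; the delay value and the placeholder names are assumptions, and, as noted in the comment, the message stays in memory until the timer fires.

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

// handle stands in for the application's own processing logic.
func handle(m *pubsub.Message) error { return nil }

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription") // placeholder subscription name

	const nackDelay = 30 * time.Second // illustrative delay; tune to your workload

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		if err := handle(m); err != nil {
			// Defer the nack; the message is held in memory until the timer fires.
			time.AfterFunc(nackDelay, m.Nack)
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```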
@junghoahnsc I unfortunately couldn't reproduce this problem. I tested the library against the load test framework and it showed better throughput and smaller latency. Do you know of a way to reproduce this? The handler function in the load test is rather trivial, so that might be causing the difference.
@pongad Thanks for the detailed explanation and looking into this.
One of our services depends on an external service. When it fails to handle a message because of the external service, we'd like to delay the reprocessing and give higher priority to new messages.
Yes, I guess it depends on the handler. Other jobs work fine after upgrading the library; I see this from only one job now. I'm investigating why this job shows worse ack behavior.

BTW, all of the jobs showed a weird behavior after upgrading. They stopped pulling from one of the subscriptions after a few hours, while pulling from other subscriptions works fine. This pubsub topic is different from the others in that it gets a bulk of messages periodically. At that time, I give
I added a new subscription to the topic we had an issue with and ran a job with a dummy handler. I'm looking into why only this pubsub topic shows this behavior.
Update: I tried using a dummy handler, which just calls Ack(). Is this expected? I'm wondering why the acks are delayed.
@junghoahnsc This is very interesting information. I actually thought about something like this. Our implementation batches ack messages together and sends them out at most every 1/10th of the ack deadline. However, this logic has been around since well before commit 686a2d1. If this was causing problems, it shouldn't get worse when you update. I assume you did not change the ack deadline when you updated the library?

@jba does this new info ring any bell for you?

FWIW, the Java implementation sends ack messages every 100ms. Should we do the same? @kamalaboulhosn (Sorry I keep spamming the two of you.)
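As an illustration of the fixed-interval approach being discussed (not the library's actual implementation), the sketch below buffers ack IDs and flushes them on a ticker; the ackBatcher type and its send callback are invented for the example.

```go
package ackflush

import (
	"context"
	"sync"
	"time"
)

// ackBatcher buffers ack IDs and flushes them on a fixed interval,
// similar in spirit to sending acks every 100ms.
type ackBatcher struct {
	mu      sync.Mutex
	pending []string
	send    func(ackIDs []string) // sends one acknowledge RPC; assumed to be supplied by the caller
}

// Add buffers a single ack ID for the next flush.
func (b *ackBatcher) Add(ackID string) {
	b.mu.Lock()
	b.pending = append(b.pending, ackID)
	b.mu.Unlock()
}

// Run flushes the pending batch every flushInterval until ctx is cancelled.
func (b *ackBatcher) Run(ctx context.Context, flushInterval time.Duration) {
	t := time.NewTicker(flushInterval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			b.mu.Lock()
			batch := b.pending
			b.pending = nil
			b.mu.Unlock()
			if len(batch) > 0 {
				b.send(batch)
			}
		}
	}
}
```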
@pongad Yes, I didn't change the ack deadline when I updated the library. |
@pongad I think following the Java implementation makes sense. Quicker acks are generally better because, in the event the client shuts down, the buffered acks could be lost and messages could be redelivered. |
The testing job that just calls Ack() with the dummy handler also stopped pulling messages. I had this issue twice in the last 3~4 days.
I believe this is not a problem with the client library, but with gRPC. It feels suspiciously similar to grpc/grpc-go#954. You said this topic only receives batches of messages periodically? This makes me wonder if it's a gRPC problem. In some other languages, we had a bug where gRPC requests would fail if the connection stayed idle for too long. I'm not sure whether this affects Go or not. Is your gRPC up to date?

If updating doesn't help, is it possible for you to share a gRPC trace with us? In your test application:

```go
import _ "golang.org/x/net/trace"

grpc.EnableTracing = true                  // it already defaults to true; if you turn it off, turn it back on before making any calls
http.ListenAndServe("localhost:8080", nil) // or whichever port you want
```

Then you can find the traces at http://localhost:8080/debug/requests.

Thank you for your patience.
See discussions in #684 (comment). Load testing shows no performance regression.

Change-Id: I45761540d2b248292b52118894d37712c0bfa654
Reviewed-on: https://code-review.googlesource.com/14472
Reviewed-by: kokoro <[email protected]>
Reviewed-by: Jonathan Amsterdam <[email protected]>
@junghoahnsc This commit makes us ack more often. In theory, I think this should improve your metrics the same way reducing the ack deadline did. Could you let us know if this helps you?
@pongad After I restarted the job, it received batches of messages and then stopped again. I'll add tracing to the test job and see when this happens again. Thanks!
@pongad The commit that acks more often shows a noticeable difference in our case. We can't see spikes in those metrics anymore.
Fantastic news! Are the metrics good enough for you now?

We are also in the process of reworking some of the library's internals. Load testing so far shows that it beats the current implementation in both metrics. So, if your metrics look acceptable but not completely satisfactory, I think we could wait a little for the rework to land so we don't duplicate effort.
Yes, those metrics are good enough for us.
Since the main concern here is addressed, I'll close this for now. Please reopen if you're still running into problems. |
Hello,
After I upgraded the google-cloud-go library (from 686a2d1 to f5c3fe2), pubsub messages don't seem to be acked as quickly as before. All of the messages are acked within a second, but the "Oldest Unacknowledged Message" metric went up from less than 2 min to 30+ min. Also, "Undelivered Messages" went up from 1~2k to 10k.
And does 'ReceiveSettings.MaxOutstandingMessages' affect this?
When I tried decreasing it to something like 256, those metrics went down a bit.
The pulled (received) messages are handled concurrently and the messages can be acked in a different order.
Is there any change that might affect this?
Thanks,