
pubsub: StreamingPull api errors hidden or ignored by the sdk #1460

Closed
dwalker-va opened this issue Jun 14, 2019 · 16 comments
Labels
api: pubsub Issues related to the Pub/Sub API. status: investigating The issue is under investigation, which is determined to be non-trivial.

@dwalker-va

dwalker-va commented Jun 14, 2019

Client

Pubsub

Describe Your Environment

golang 1.12 on GKE

Expected Behavior

If the pubsub API is returning errors, I'd expect the SDK to surface them, or to provide a way to inject a logger, or to automatically re-establish broken connections.
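(For what it's worth, the closest existing hook I'm aware of is grpc-go's own logger - a minimal sketch, assuming the standard google.golang.org/grpc/grpclog package; it only surfaces what grpc-go logs internally, not errors from this library:)

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Route grpc-go's internal logging (including transport-level errors from
	// broken StreamingPull streams) to stderr. This must run before any gRPC
	// activity, so do it at the very start of the program.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))

	// ... create the pubsub client and call Receive as usual ...
}
```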

Actual Behavior

Errors seem to be ignored and subscription receivers get into a bad state. We can usually resolve the issue by restarting our pods/instances, which sets up the subscription receivers again as part of instance bootup.

Ultimately the behaviour we see is basically #1444 - a deadlock.

This graph and table are from https://console.cloud.google.com/apis/api/pubsub.googleapis.com/metrics

[screenshot]

So the Pub/Sub API says that close to 100% of our StreamingPull calls are failing. We use this client exclusively for StreamingPull, and in none of our services can I find any hints about what exactly the problem is. I should also mention that we're way, way under quota for StreamingPull open requests, open connections, subscriber throughput, etc. according to https://console.cloud.google.com/apis/api/pubsub.googleapis.com/quota

Is there a way I can help further diagnose the issue?

@jeanbza
Contributor

jeanbza commented Jun 14, 2019

Could you please provide:

  • Your SubscriptionConfig.
  • Your Receive callback function.
  • The version of this library you're using (e.g. latest is v0.40.0).
  • A goroutine dump would be useful, too, and any other diagnostic information you could provide.
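(If it helps, a goroutine dump can be captured in-process with the standard runtime/pprof package - a minimal sketch; the SIGUSR1 trigger is just one convenient option, an HTTP pprof endpoint would work just as well:)

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpGoroutinesOnSignal writes a full goroutine dump to stderr whenever the
// process receives SIGUSR1, which is handy for inspecting a hung Receive.
func dumpGoroutinesOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}

func main() {
	dumpGoroutinesOnSignal()
	// ... start the subscriber as usual; then `kill -USR1 <pid>` to dump ...
	select {}
}
```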

@jeanbza jeanbza added api: pubsub Issues related to the Pub/Sub API. status: investigating The issue is under investigation, which is determined to be non-trivial. needs more info This issue needs more information from the customer to proceed. labels Jun 14, 2019
@dwalker-va
Author

dwalker-va commented Jun 14, 2019

SubscriptionConfig: We use the default values
Receive Callback function: We notice the issue across dozens of services using different receive callbacks. I will try to reproduce the issue with a simple receive function (roughly the sketch at the end of this comment), but because it's not something I can reproduce on-demand, it may take a while.
Version: v0.40.0 (although we've experienced this issue for a long time with previous versions too)
Diagnostics (open them to make them readable):
I've captured diagnostics for the same service running at 2 different points in time (during normal operation and during an outage caused by this issue). There's a big contrast in CPU. This was the only contrast I noticed across all of the Stackdriver metrics (CPU, Heap, Contention, Allocated Heap, Threads/Goroutines).

CPU time during normal (good) behaviour:
[screenshot]

CPU time during bad behaviour (not pulling messages):
[screenshot]
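(For reference, the shape of the simple receive function I'd use to try to reproduce this - a minimal sketch with default settings; the project and subscription IDs are placeholders:)

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}

	// Default SubscriptionConfig and ReceiveSettings, as in our services.
	sub := client.Subscription("my-subscription") // placeholder subscription ID
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		log.Printf("got message: %s", m.Data)
		m.Ack()
	})
	if err != nil {
		log.Fatalf("Receive returned: %v", err)
	}
}
```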

@hongalex hongalex removed the needs more info This issue needs more information from the customer to proceed. label Jun 17, 2019
@dwalker-va
Author

I have some experiments going - one is just a forked version of this repo using a logger to investigate which error codes are being returned by the pubsub API when this state happens. I'm mostly interested in whether it is ResourceExhausted: #1166 (comment)

The other experiment I'm running is using unary pull ("synchronous" in the pull options) instead of streaming pull.
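(Concretely, that experiment is just flipping the synchronous flag on ReceiveSettings before calling Receive - a sketch building on the subscriber above; I believe the field is ReceiveSettings.Synchronous in current versions of the library:)

```go
// Same subscriber as in the earlier sketch, but asking the client to use
// unary Pull under the hood instead of StreamingPull.
sub := client.Subscription("my-subscription") // placeholder
sub.ReceiveSettings.Synchronous = true
err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
	m.Ack()
})
```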

I'll post any findings.

@hongalex
Member

Thanks! From the other example, I have been able to occasionally replicate the issue with the Receive call hanging indefinitely. Not exactly sure what conditions trigger it, but will also report back once we know more.

@Izopi4a

Izopi4a commented Jun 18, 2019

My logging says exactly the same thing: 99% errors. My app is super simple. I'm not using the latest version - I was about to update and check, but I assume it's the same.

Since I'm a noob, if you need something from me as well I'll need a bit of guidance on what you'd like me to show you. I can post my entire code, btw - it's only 50-60 lines, I think.

@dwalker-va
Author

dwalker-va commented Jun 18, 2019

The entire day of pubsub errors (some of these would be triggered by context canceled/pod shutdown on our side, although not many as we only restarted once during the day):
[screenshot]

During a single hour during which we had an outage/receive methods hanging:
[screenshot]

I think the most I can conclude from this is that it doesn't seem to be related to #1166 (at least not in the sense that ResourceExhausted is special).

@hongalex
Member

@dwalker-va How often are you making the requests? It would be helpful to know what your error rate is.

@dwalker-va
Author

The metrics page in the Cloud Console allows me to filter by credentials (very nice), so I'm able to get stats for just the one service whose error codes I was tracking.

This is for the past day
[screenshot]

Does that help or were you asking for something else?

@hongalex
Member

Yeah that helps! So actually, I think the high error rate you're seeing is intended. StreamingPull ends in an error (usually from the stream being broken). From the streaming pull docs:

StreamingPull has a 100% error rate (this is to be expected)

StreamingPull streams are always terminated with an error code. Note that unlike in regular RPCs, the error here is simply an indication that a stream has been broken, not that requests are failing. Therefore, while the StreamingPull API may have a seemingly surprising 100% error rate, this is by design. To diagnose StreamingPull errors, we recommend focusing on StreamingPull message operation metrics, rather than StreamingPull request metrics.

This actually reflects your metrics well, since your Ack calls are going through properly with no errors.

@dwalker-va
Author

@hongalex I'm confused about why this has been closed. Ultimately this isn't a question about the error rate of the StreamingPull API; it's a question about why the Receive method hangs indefinitely. I only mentioned the StreamingPull API error rate because it seemed anomalous, but it's not relevant to the actual problem with the SDK.

@hongalex
Member

Sorry about that, I misread your original issue amid the references to the errors. This does seem fairly similar to #1444, as you said, but I'll leave both issues open for the time being.

Just to be sure, are you always ack'ing or nack'ing all messages as they come in? From your last image, it seems like about 10% of messages have been published but not ack'ed. That can potentially lead to hanging StreamingPull calls over time.
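In other words, every callback invocation should end in exactly one Ack or Nack - roughly like this sketch, where sub, ctx, and handle stand in for your own subscription, context, and processing logic:

```go
err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
	if err := handle(ctx, m); err != nil {
		// Nack so the message is redelivered promptly, instead of holding a
		// flow-control slot until the ack deadline expires.
		m.Nack()
		return
	}
	m.Ack()
})
```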

@hongalex hongalex reopened this Jun 18, 2019
@dwalker-va
Author

Thanks @hongalex, I'm taking a look at changing the behaviour there. Also, about unary pull: it still runs into the same issue, so I think what you're suggesting is likely the root cause. I'll confirm ASAP.

@dwalker-va
Author

dwalker-va commented Jun 21, 2019

This is for the last day. The numbers are within 0.5% of each other, which should be reasonable to chalk up to processing lag (the system is currently processing messages it hasn't acked/nacked yet).
[screenshot]
I also haven't had an outage since making this change. Do you have any intuition about what happens in the client when people let 10% of messages time out? People were letting messages time out when they didn't want to try reprocessing them immediately - I guess trying to emulate a task-queue backoff. This was still a problem when using the synchronous option.

@hongalex
Member

Essentially, the resources held by those messages are never released - specifically the flow-control limits MaxOutstandingMessages and MaxOutstandingBytes, part of subscription.ReceiveSettings. The semaphores held by these messages are released only when the message is acked or nacked. If you'd like to implement your own form of flow control, you can set these values to -1 as described in #870. There are other ways to get task-queue-like behavior, but the client library itself wasn't written to support it, so they end up being somewhat hacky.
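For example (a sketch; with these limits disabled the client no longer bounds outstanding work, so you'd be handling your own flow control):

```go
sub := client.Subscription("my-subscription") // placeholder
// Negative values disable the corresponding client-side flow-control limit,
// so un-acked messages no longer hold semaphore slots.
sub.ReceiveSettings.MaxOutstandingMessages = -1
sub.ReceiveSettings.MaxOutstandingBytes = -1
```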

@dwalker-va
Author

dwalker-va commented Jun 21, 2019

Thanks @hongalex, I think that explains everything, and this issue can be closed.
I see that issue produced a PR - I'm interested to see what happens with it. We used this mechanism to delay processing in our old Python applications; it appears to be possible via the API, but the client library effectively punishes using the API that way - which might not be the wrong thing to do, but it is pretty subtle. I know we love to use Pub/Sub because of the amazing throughput - task queues (Cloud Tasks, for example) don't offer nearly the same QPS.

@hongalex
Member

Great, glad your issue was mostly resolved. At the moment we aren't planning to add that functionality to the client library; see here. With that said, we are looking to add delayed retries as a native feature of Pub/Sub, since this feature request has been brought up multiple times.
