
pubsub: StreamingPull api errors hidden or ignored by the sdk #1460

Closed
dwalker-va opened this issue Jun 14, 2019 · 16 comments
Labels
api: pubsub Issues related to the Pub/Sub API. status: investigating The issue is under investigation, which is determined to be non-trivial.

@dwalker-va

dwalker-va commented Jun 14, 2019

Client

Pubsub

Describe Your Environment

golang 1.12 on GKE

Expected Behavior

If the pubsub API is returning errors, I'd expect the SDK to surface them, or to provide a way to inject a logger, or to automatically re-establish broken connections.
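(For what it's worth, the closest existing hook I'm aware of is grpc-go's own logger - a minimal sketch, assuming the standard google.golang.org/grpc/grpclog package; it only surfaces what grpc-go logs internally, not errors from this library:)

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Route grpc-go's internal logging (including transport-level errors from
	// broken StreamingPull streams) to stderr. This must run before any gRPC
	// activity, so do it at the very start of the program.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))

	// ... create the pubsub client and call Receive as usual ...
}
```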

Actual Behavior

Errors seem to be ignored and subscription receivers get into a bad state. We can usually resolve the issue by restarting our pods/instances, which sets up the subscription receivers again as part of instance bootup.

Ultimately the behaviour we see is basically #1444 - a deadlock.

This graph and table are from https://console.cloud.google.com/apis/api/pubsub.googleapis.com/metrics

[screenshot]

So the Pub/Sub API says that close to 100% of our StreamingPull calls are failing. We use this client exclusively for StreamingPull, and in none of our services can I find any hints about what exactly the problem is. I should also mention that we're way, way under quota for StreamingPull open requests, open connections, subscriber throughput, etc. according to https://console.cloud.google.com/apis/api/pubsub.googleapis.com/quota

Is there a way I can help further diagnose the issue?

@jeanbza
Contributor

jeanbza commented Jun 14, 2019

Could you please provide:

  • Your SubscriptionConfig.
  • Your Receive callback function.
  • The version of this library you're using (e.g. latest is v0.40.0).
  • A goroutine dump would be useful, too, and any other diagnostic information you could provide.
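(If it helps, a goroutine dump can be captured in-process with the standard runtime/pprof package - a minimal sketch; the SIGUSR1 trigger is just one convenient option, an HTTP pprof endpoint would work just as well:)

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpGoroutinesOnSignal writes a full goroutine dump to stderr whenever the
// process receives SIGUSR1, which is handy for inspecting a hung Receive.
func dumpGoroutinesOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}

func main() {
	dumpGoroutinesOnSignal()
	// ... start the subscriber as usual; then `kill -USR1 <pid>` to dump ...
	select {}
}
```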

@jeanbza jeanbza added api: pubsub Issues related to the Pub/Sub API. status: investigating The issue is under investigation, which is determined to be non-trivial. needs more info This issue needs more information from the customer to proceed. labels Jun 14, 2019
@dwalker-va
Author

dwalker-va commented Jun 14, 2019

SubscriptionConfig: We use the default values
Receive Callback function: We notice the issue across dozens of services using different receive callbacks. I will try to reproduce the issue with a simple receive function (roughly the sketch at the end of this comment), but because it's not something I can reproduce on-demand, it may take a while.
Version: v0.40.0 (although we've experienced this issue for a long time with previous versions too)
Diagnostics (open them to make them readable):
I've captured diagnostics for the same service running at 2 different points in time (during normal operation and during an outage caused by this issue). There's a big contrast in CPU. This was the only contrast I noticed across all of the Stackdriver metrics (CPU, Heap, Contention, Allocated Heap, Threads/Goroutines).

CPU time during normal (good) behaviour:
[screenshot]

CPU time during bad behaviour (not pulling messages):
[screenshot]
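(For reference, the shape of the simple receive function I'd use to try to reproduce this - a minimal sketch with default settings; the project and subscription IDs are placeholders:)

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}

	// Default SubscriptionConfig and ReceiveSettings, as in our services.
	sub := client.Subscription("my-subscription") // placeholder subscription ID
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		log.Printf("got message: %s", m.Data)
		m.Ack()
	})
	if err != nil {
		log.Fatalf("Receive returned: %v", err)
	}
}
```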

@hongalex hongalex removed the needs more info This issue needs more information from the customer to proceed. label Jun 17, 2019
@dwalker-va
Author

I have some experiments going - one is just a forked version of this repo using a logger to investigate which error codes are being returned by the pubsub API when this state happens. I'm mostly interested in whether it is ResourceExhausted: #1166 (comment)

The other experiment I'm running is using unary pull ("synchronous" in the pull options) instead of streaming pull.
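(Concretely, that experiment is just flipping the synchronous flag on ReceiveSettings before calling Receive - a sketch building on the subscriber above; I believe the field is ReceiveSettings.Synchronous in current versions of the library:)

```go
// Same subscriber as in the earlier sketch, but asking the client to use
// unary Pull under the hood instead of StreamingPull.
sub := client.Subscription("my-subscription") // placeholder
sub.ReceiveSettings.Synchronous = true
err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
	m.Ack()
})
```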

I'll post any findings.

@hongalex
Member

Thanks! From the other example, I have been able to occasionally replicate the issue with the Receive call hanging indefinitely. Not exactly sure what conditions trigger it, but will also report back once we know more.

@Izopi4a

Izopi4a commented Jun 18, 2019

My logging says exactly the same thing: 99% errors. My app is super simple. I'm not using the latest version - I was about to update and check, but I assume it's the same.

Since I'm a noob, if you need something from me as well I'll need a bit of guidance on what you'd like me to show you. I can post my entire code, btw - it's only 50-60 lines, I think.

@dwalker-va
Author

dwalker-va commented Jun 18, 2019

The entire day of pubsub errors (some of these would be triggered by context canceled/pod shutdown on our side, although not many as we only restarted once during the day):
[screenshot]

During a single hour during which we had an outage/receive methods hanging:
[screenshot]

I think the most I can conclude from this is that it doesn't seem to be related to #1166 (at least not in the sense that ResourceExhausted is special).

@hongalex
Member

@dwalker-va How often are you making the requests? It would be helpful to know what your error rate is.

@dwalker-va
Author

The metrics page in the Cloud Console allows me to filter by credentials (very nice), so I'm able to get stats for just the one service whose error codes I was tracking.

This is for the past day
[screenshot]

Does that help or were you asking for something else?

@hongalex
Member

Yeah that helps! So actually, I think the high error rate you're seeing is intended. StreamingPull ends in an error (usually from the stream being broken). From the streaming pull docs:

StreamingPull has a 100% error rate (this is to be expected)

StreamingPull streams are always terminated with an error code. Note that unlike in regular RPCs, the error here is simply an indication that a stream has been broken, not that requests are failing. Therefore, while the StreamingPull API may have a seemingly surprising 100% error rate, this is by design. To diagnose StreamingPull errors, we recommend focusing on StreamingPull message operation metrics, rather than StreamingPull request metrics.

This actually reflects your metrics well, since your Ack calls are going through properly with no errors.

@dwalker-va
Author

@hongalex I'm confused about why this has been closed. Ultimately this isn't a question about the error rate of the StreamingPull API; it's a question about why the Receive method hangs indefinitely. I only mentioned the StreamingPull API error rate because it seemed anomalous, but it's not relevant to the actual problem with the SDK.

@hongalex
Member

Sorry about that, I misread your original issue amid the references to the errors. This does seem fairly similar to #1444, as you said, but I'll leave both issues open for the time being.

Just to be sure, are you always ack'ing or nack'ing all messages as they come in? From your last image, it seems like about 10% of messages have been published but not ack'ed. That can potentially lead to hanging StreamingPull calls over time.
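In other words, every callback invocation should end in exactly one Ack or Nack - roughly like this sketch, where sub, ctx, and handle stand in for your own subscription, context, and processing logic:

```go
err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
	if err := handle(ctx, m); err != nil {
		// Nack so the message is redelivered promptly, instead of holding a
		// flow-control slot until the ack deadline expires.
		m.Nack()
		return
	}
	m.Ack()
})
```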

@hongalex hongalex reopened this Jun 18, 2019
@dwalker-va
Author

Thanks @hongalex, I'm taking a look at changing the behaviour there. Also, about unary pull: it still runs into the same issue, so I think what you're suggesting is likely the root cause. I'll confirm ASAP.

@dwalker-va
Author

dwalker-va commented Jun 21, 2019

This is for the last day. The numbers are within 0.5% of each other, which should be reasonable to chalk up to processing lag (the system is currently processing messages it hasn't acked/nacked yet).
[screenshot]
I also haven't had an outage since making this change. Do you have any intuition about what happens in the client when people let 10% of messages time out? People were letting messages time out when they didn't want to try reprocessing them immediately - I guess trying to emulate a task-queue backoff. This was still a problem when using the synchronous option.

@hongalex
Member

Essentially, the resources held by those messages are never released - specifically the flow-control limits MaxOutstandingMessages and MaxOutstandingBytes, part of subscription.ReceiveSettings. The semaphores held by these messages are released only when the message is acked or nacked. If you'd like to implement your own form of flow control, you can set these values to -1 as described in #870. There are other ways to get task-queue-like behavior, but the client library itself wasn't written to support it, so they end up being somewhat hacky.
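For example (a sketch; with these limits disabled the client no longer bounds outstanding work, so you'd be handling your own flow control):

```go
sub := client.Subscription("my-subscription") // placeholder
// Negative values disable the corresponding client-side flow-control limit,
// so un-acked messages no longer hold semaphore slots.
sub.ReceiveSettings.MaxOutstandingMessages = -1
sub.ReceiveSettings.MaxOutstandingBytes = -1
```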

@dwalker-va
Author

dwalker-va commented Jun 21, 2019

Thanks @hongalex, I think that explains everything, and this issue can be closed.
I see that issue produced a PR - I'm interested to see what happens with it. We used this mechanism to delay processing in our old Python applications; it appears to be possible via the API, but the client library effectively punishes using the API that way - which might not be the wrong thing to do, but it is pretty subtle. I know we love to use Pub/Sub because of the amazing throughput - task queues (Cloud Tasks, for example) don't offer nearly the same QPS.

@hongalex
Member

Great, glad your issue was mostly resolved. At the moment we aren't planning to add that functionality to the client library; see here. With that said, we are looking to add delayed retries as a native feature of Pub/Sub, since this feature request has been brought up multiple times.
