pubsub: StreamingPull api errors hidden or ignored by the sdk #1460
Comments
Could you please provide:
SubscriptionConfig: We use the default values
I have some experiments going - one is just a forked version of this repo using a logger to investigate which error codes are being returned by the pubsub API when this state happens. I'm mostly interested in whether it is ResourceExhausted: #1166 (comment). The other experiment I'm running is using unary pull ("synchronous" in the pull options) instead of streaming pull. I'll post any findings.
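For what it's worth, here's a minimal sketch of the unary-pull experiment using the Go client, with hypothetical project and subscription IDs; ReceiveSettings.Synchronous switches the subscriber from StreamingPull to unary Pull:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// Hypothetical project and subscription IDs, for illustration only.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("pubsub.NewClient: %v", err)
	}
	defer client.Close()

	sub := client.Subscription("my-sub")
	// Use unary Pull ("synchronous" mode) instead of StreamingPull.
	sub.ReceiveSettings.Synchronous = true

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		log.Printf("got message %s", m.ID)
		m.Ack()
	})
	if err != nil {
		log.Printf("Receive returned: %v", err)
	}
}
```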
Thanks! From the other example, I have been able to occasionally replicate the issue with the
My logging says exactly the same: 99% error logging. My app is super simple. I am not using the latest version; I was about to update and check, but I assume it's the same. Since I am a noob, if you need something from me, I will need a slight guide on what you want me to show you. I can post my entire code, by the way; it's about 50-60 lines, I think.
The entire day of pubsub errors (some of these would be triggered by context canceled/pod shutdown on our side, although not many, as we only restarted once during the day): [error-code chart not reproduced]. The single hour in which we had an outage and Receive methods hanging: [error-code chart not reproduced]. I think the most I can conclude from this is that it doesn't seem to be related to #1166 (at least in the sense that ResourceExhausted is special).
@dwalker-va How often are you making the requests? It would be helpful to know what your error rate is.
The metrics page on cloud console allows me to filter by credentials (very nice) so I'm able to get stats for just the one service whose error codes I was tracking. Does that help or were you asking for something else?
Yeah, that helps! So actually, I think the high error rate you're seeing is intended. StreamingPull ends in an error (usually from the stream being broken). The streaming pull docs note that StreamingPull streams always terminate with a non-OK status, and that, unlike regular RPCs, this status only indicates the stream has been broken, not that requests are failing.
This actually reflects your metrics well, since your Ack calls are going through properly with no errors. |
@hongalex I'm confused as to why this has been closed. Ultimately this isn't a question about the error rate of the StreamingPull API; it is a question about why the Receive method hangs indefinitely. I only mentioned the StreamingPull API error rate because it seemed anomalous, but it's not relevant to the actual problem with the SDK.
Sorry about that, I misunderstood your original issue amid the references to the errors. This does seem fairly similar to #1444 as you said, but I'll leave both issues open for the time being. Just to be sure, are you always ack'ing or nack'ing all messages as they come in? From your last image, it seems like about 10% of messages have been published but not ack'ed. That potentially leads to the issue of hanging StreamingPull calls over time.
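To illustrate, a minimal sketch of a Receive callback that settles every message on every code path (handle is a hypothetical application handler); each delivery is either Ack'ed or Nack'ed, so the client releases the capacity it holds for that message:

```go
package subscriber

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// handle is a hypothetical placeholder for application-specific processing.
func handle(m *pubsub.Message) error {
	log.Printf("processing %s", m.ID)
	return nil
}

// receiveAll settles every delivery: each message is Ack'ed or Nack'ed on
// every code path, so none of them remains outstanding indefinitely.
func receiveAll(ctx context.Context, sub *pubsub.Subscription) error {
	return sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		if err := handle(m); err != nil {
			m.Nack() // let Pub/Sub redeliver later instead of leaving it outstanding
			return
		}
		m.Ack()
	})
}
```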
Thanks @hongalex, I'm taking a look at changing the behaviour there. Also, about unary pull: it still runs into the same issue, so I think what you are suggesting is likely the root cause. I'll confirm ASAP.
Essentially, the resources held by those messages are never released, specifically from
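Presumably the truncated reference above is to the client-side flow control limits; a minimal sketch (hypothetical project and subscription IDs) of where those limits live in the Go client's ReceiveSettings:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatalf("pubsub.NewClient: %v", err)
	}
	defer client.Close()

	sub := client.Subscription("my-sub") // hypothetical subscription ID

	// Flow control: the subscriber stops pulling new messages once this many
	// messages (or bytes) are outstanding, i.e. delivered to the callback but
	// not yet Ack'ed or Nack'ed. Messages that are never settled keep
	// occupying this capacity, which can eventually starve the receiver.
	sub.ReceiveSettings.MaxOutstandingMessages = 1000 // roughly the library default
	sub.ReceiveSettings.MaxOutstandingBytes = 1e9     // roughly the library default

	if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		m.Ack()
	}); err != nil {
		log.Printf("Receive: %v", err)
	}
}
```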
Thanks @hongalex, that explains everything I think, and this issue can be closed.
Great, glad your issue was mostly resolved. At the moment, we aren't planning to add the functionality to the client library; see here. With that said, we are looking to add delayed retries as a native feature of Pub/Sub, since this feature request has been brought up multiple times.
Client
Pubsub
Describe Your Environment
golang 1.12 on GKE
Expected Behavior
Errors returned by the SDK when the pubsub API is returning errors, or the ability to inject a logger, or automatic re-establishment of broken connections.
Actual Behavior
Errors seem to be ignored and subscription receivers get into a bad state. We can usually resolve the issue by restarting our pods/instances, which sets up the subscription receiver again as part of instance bootup.
Ultimately the behaviour we see is basically #1444 - a deadlock.
This graph and table are from https://console.cloud.google.com/apis/api/pubsub.googleapis.com/metrics
So the Pub/Sub API says that close to 100% of our StreamingPull calls are failing. We use this client exclusively for StreamingPull, and in none of our services can I find any hints about what exactly the problem is. I should also mention that we're way, way under quota for StreamingPull open requests, open connections, subscriber throughput, etc., according to https://console.cloud.google.com/apis/api/pubsub.googleapis.com/quota
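For illustration, a minimal sketch (hypothetical project and subscription IDs) of the kind of StreamingPull subscriber described above, logging the gRPC status code of any error that Receive returns; in our experience nothing surfaces here even while the metrics page reports errors:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	"google.golang.org/grpc/status"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatalf("pubsub.NewClient: %v", err)
	}
	defer client.Close()

	sub := client.Subscription("my-sub") // hypothetical subscription ID
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		m.Ack()
	})
	if err != nil {
		// Surface whatever the SDK reports, including the gRPC status code.
		log.Printf("Receive ended: code=%s err=%v", status.Code(err), err)
	}
}
```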
Is there a way I can help further diagnose the issue?