Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: NACK causes resource to be considered non-existing #8657

Closed
ejona86 opened this issue Nov 3, 2021 · 6 comments
Closed

xds: NACK causes resource to be considered non-existing #8657

ejona86 opened this issue Nov 3, 2021 · 6 comments
Assignees
Labels
Milestone

Comments

@ejona86
Copy link
Member

ejona86 commented Nov 3, 2021

I'm dealing with a server returning broken configuration. There's three breakages.

Log:

2021-11-03 08:13:44.018 GMT [port:8080] UNAVAILABLE: Credentials failed to obtain metadata
2021-11-03 08:13:44.019 GMT [xds-client<6>] Retry ADS stream in 21,552,486,369 ns
2021-11-03 08:14:05.573 GMT [xds-client<6>] Sending LDS request for resources: [grpc/server?xds.resource.listening_address=0.0.0.0:8080]
2021-11-03 08:14:05.573 GMT [xds-client<6>] ADS stream started
2021-11-03 08:14:05.575 GMT Sent DiscoveryRequest <snip>
2021-11-03 08:14:05.773 GMT [xds-client<6>] Received LDS response: <snip>
2021-11-03 08:14:06.120 GMT [xds-client<6>] Received LDS Response version 1635927232752380394 nonce 1. Parsed resources: [grpc/server?xds.resource.listening_address=0.0.0.0:8080]
2021-11-03 08:14:06.120 GMT [xds-client<6>] Failed processing LDS Response version 1635927232752380394 nonce 1. Errors:
LDS response Listener 'grpc/server?xds.resource.listening_address=0.0.0.0:8080' validation error: HttpConnectionManager contains invalid HttpFilter: Invalid filter config for HttpFilter [envoy.filters.http.rbac]: Encountered error parsing policy: com.google.re2j.PatternSyntaxException: error parsing regexp: missing argument to repetition operator: `+`
2021-11-03 08:14:06.120 GMT [xds-client<6>] Sending NACK for LDS update, nonce: 1, current version: 
2021-11-03 08:14:06.137 GMT [xds-client<6>] Sent DiscoveryRequest <snip; it was the NACK>
2021-11-03 08:14:20.575 GMT [xds-client<6>] LDS resource grpc/server?xds.resource.listening_address=0.0.0.0:8080 initial fetch timeout
2021-11-03 08:14:20.576 GMT [xds-client<6>] Conclude LDS resource grpc/server?xds.resource.listening_address=0.0.0.0:8080 not exist
2021-11-03 08:14:20.579 GMT Exception in thread "main" java.io.IOException: io.grpc.StatusException: UNAVAILABLE: Listener grpc/server?xds.resource.listening_address=0.0.0.0:8080 unavailable
	at io.grpc.xds.XdsServerWrapper.start(XdsServerWrapper.java:168)
	at io.grpc.testing.integration.XdsTestServer.start(XdsTestServer.java:184)
	at io.grpc.testing.integration.XdsTestServer.main(XdsTestServer.java:97)
Caused by: io.grpc.StatusException: UNAVAILABLE: Listener grpc/server?xds.resource.listening_address=0.0.0.0:8080 unavailable
	at io.grpc.Status.asException(Status.java:543)
	at io.grpc.xds.XdsServerWrapper$DiscoveryState$3.run(XdsServerWrapper.java:428)
	at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
	at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
	at io.grpc.xds.XdsServerWrapper$DiscoveryState.onResourceDoesNotExist(XdsServerWrapper.java:421)
	at io.grpc.xds.ClientXdsClient$ResourceSubscriber.onAbsent(ClientXdsClient.java:2226)
	at io.grpc.xds.ClientXdsClient$ResourceSubscriber$1ResourceNotFound.run(ClientXdsClient.java:2172)
	at io.grpc.SynchronizationContext$ManagedRunnable.run(SynchronizationContext.java:182)
	at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
	at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
	at io.grpc.SynchronizationContext$1.run(SynchronizationContext.java:155)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
2021-11-03 08:14:20.582 GMT Shutting down
2021-11-03 08:14:20.583 GMT java.lang.NullPointerException
	at io.grpc.testing.integration.XdsTestServer.stop(XdsTestServer.java:211)
	at io.grpc.testing.integration.XdsTestServer.access$000(XdsTestServer.java:49)
	at io.grpc.testing.integration.XdsTestServer$1.run(XdsTestServer.java:91)

The first is that the resource is considered not to exist. That is not right. The watcher should have been delivered an error and the resource wait timer cancelled. I'll note that in this case we had previously gotten a lot of UNAVAILABLE: Credentials failed to obtain metadata failures, and this is the first response to arrive (not included because that log was painful to copy).

Even assuming that the resource is properly determined to not exist, it shouldn't cause start() to fail. From A36 xDS for Servers:

XdsServer's start must not fail due to transient xDS issues, like missing xDS configuration from the xDS server.

And then there's a bug in XdsTestServer if start() throws, since server was never assigned.

CC @sanjaypujare

@ejona86 ejona86 added this to the Next milestone Nov 3, 2021
@ejona86 ejona86 added the bug label Nov 3, 2021
@sanjaypujare
Copy link
Contributor

...
The first is that the resource is considered not to exist. That is not right. The watcher should have been delivered an error and the resource wait timer cancelled.

If the XdsClient never received a valid response (within the timeout period) after the initial DiscoveryRequest for a resource then it treats it as "resource does not exist". So that part is probably working as designed. What you are saying is that in this case the control plane sent an LDS response which was rejected/NACKed because of validation issues then we cannot treat it as "resource does not exist". When the timer fires we should then differentiate between the cases of "no response ever arrived" (which means "resource does not exist") and "invalid response arrived" (which means "resource does exist" but there's an error) and accordingly deliver "resource not found" or "timeout" to the watcher.

Even assuming that the resource is properly determined to not exist, it shouldn't cause start() to fail. From A36 xDS for Servers:

XdsServer's start must not fail due to transient xDS issues, like missing xDS configuration from the xDS server.

Correct. This should be fixed in XdsServerWrapper to not set the exception in the initialStartFuture.

And then there's a bug in XdsTestServer if start() throws, since server was never assigned.

CC @sanjaypujare

@YifeiZhuang
Copy link
Contributor

Ah i remember this particular issue, we somehow treated initial start specially, so if it is identified as resource not exist, initial start fails. The previous UNAVAILABLE was correct because it was treated as an error.
Your suggested behavior interprets RFC better:

If XdsServer's start blocks waiting for xDS configuration an application can use the serving state callback to be notified of issues preventing startup progress.

Note that we have the same issue in CDS.

@ejona86
Copy link
Member Author

ejona86 commented Nov 4, 2021

If the XdsClient never received a valid response (within the timeout period) after the initial DiscoveryRequest for a resource then it treats it as "resource does not exist".

I'm familiar with that timeout, but it should be for when we don't receive a response for the resource. If it is an invalid resource, then we should just notify the watcher of the error, not the resource is missing. After we see the bad resource, why would a timer even make sense? There's no expectation that the control plane would fix the resource before the timer expires.

@sanjaypujare
Copy link
Contributor

...
After we see the bad resource, why would a timer even make sense? There's no expectation that the control plane would fix the resource before the timer expires.

Yes, makes sense. Initially I thought not failing start() but canceling the timer sounds contradictory but getting a response (even with an invalid resource) means the timer has served its purpose. But you still think start() should not fail because the control plane will somehow send a valid resource in a reasonable amount of time?

@YifeiZhuang
Copy link
Contributor

But you still think start() should not fail because the control plane will somehow send a valid resource in a reasonable amount of time?

A36 didn't say start() has a timeout. It sounds start will be blocked forever until valid resource available:

XdsServer's start must not fail due to transient xDS issues, like missing xDS configuration from the xDS server. If XdsServer's start blocks waiting for xDS configuration an application can use the serving state callback to be notified of issues preventing startup progress

@ejona86
Copy link
Member Author

ejona86 commented Nov 4, 2021

But you still think start() should not fail because the control plane will somehow send a valid resource in a reasonable amount of time?

I say nothing about a reasonable amount of time. The design of start() called for it not to fail based on the control plane. It could fail if the local port couldn't be opened, but it shouldn't fail due to remote I/O, as remote I/O can vary in whether it is erroring or whether the resource temporarily didn't exist, and we don't want a server to randomly not-start.

@dapengzhang0 dapengzhang0 modified the milestones: Next, 1.43 Nov 30, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 1, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants