SSL23_GET_SERVER_HELLO:unknown protocol for openssl command from a different service in the same mesh #200
Comments
Same problem here. Connecting to a virtual node with TLS termination enabled from outside the mesh correctly presents the cert to the client. Connecting from one virtual node to another in the same mesh (backends, TLS termination, and client policy configured correctly) gives me the error.
Hey @aboutbelka and @gdowmont, is it possible that your client applications are also trying to negotiate TLS in addition to the proxy? TLS in App Mesh (today) supports the proxies negotiating TLS between themselves, while the applications speak plain text to the proxies. We have a roadmap item for allowing the downstream (client) applications to negotiate TLS instead of the proxy (#162). But keep in mind: if the application is negotiating TLS with its upstream service, you'll lose Layer 7 metrics and routing (since the traffic will be encrypted to the client proxy). I'm moving this to our roadmap repo for better tracking.
Hi @bcelenza, I tried changing the server application to HTTP on port 80. Without TLS, the wget command from the client node returns a 404 (which is expected).
@aboutbelka So I can reproduce this behavior, and I can walk you through why this is the case.
Reproduction
I set up a 2-node mesh like you had, with a client hosting SSH and an upstream server hosting a simple HTTP service that echoes back a 200 response code. My resources look like this (in Kubernetes format, but the spec should look familiar):
Client:
Server:
When I shell into my client node, and issue an HTTP request to the mockingbird service, it succeeds:
But an HTTPS request, on the other hand:
And when I verify with openssl:
So why is this happening?
I'm going to borrow a diagram from our TLS blog post to help explain this. The current App Mesh TLS model configures the client and server proxies to negotiate TLS. So when I send the HTTP (not HTTPS) request from curl above, the communication path looks like:
This works because the Client Envoy is configured to intercept HTTP requests on port 80, negotiate TLS upstream with the Server Envoy, and send the traffic that way. The HTTPS request, by comparison, looks like this:
The reason the request fails is this: the Client Envoy is not expecting to negotiate TLS with the curl application. Because the curl application is attempting to negotiate TLS with the Client Envoy, the TLS session cannot be established and the request fails. The same failure can be observed by
Where to go from here?
In effect, this is the intended behavior of the current TLS implementation. If the application (curl or wget, for example) wants to negotiate TLS, then the Client Envoy would either need to terminate TLS and re-negotiate with the Server Envoy, or proxy the traffic as pure TCP. The former option adds complexity in setup, and the latter means you'd lose all of the benefits of Layer 7 request routing, observability, etc.
That said, would your expectation be to have the client application originate the TLS session instead? If so, we have a roadmap item for this and I'd love to have your +1 and feedback on it: #162. I hope this explanation of the behavior helped, and let me know what your ideal communication path between the applications and proxies looks like.
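The failure described in this comment can be approximated locally without a mesh. The sketch below is only an illustration under stated assumptions: a toy plaintext listener on localhost stands in for the Client Envoy's plaintext HTTP listener (real Envoy behavior may differ in detail), and the helper names `start_plaintext_listener`, `plain_request`, and `tls_handshake` are hypothetical. A plain HTTP request succeeds, while a TLS handshake against the same port fails because the listener answers the ClientHello with plaintext bytes.

```python
import socket
import ssl
import threading

OK = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
BAD = b"HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"


def start_plaintext_listener():
    """Start a toy plaintext HTTP listener on an ephemeral localhost port.

    Assumption: this stands in for a proxy listener that only speaks plaintext.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 0))
    server.listen(5)

    def serve():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listener socket was closed
            with conn:
                try:
                    data = conn.recv(4096)
                    # Anything that is not a plaintext GET (for example a TLS
                    # ClientHello) gets a plaintext 400 response.
                    conn.sendall(OK if data.startswith(b"GET ") else BAD)
                except OSError:
                    pass

    threading.Thread(target=serve, daemon=True).start()
    return server, server.getsockname()[1]


def plain_request(port):
    """Send a plaintext HTTP request and return the response status line."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        sock.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        buf = b""
        while b"\r\n" not in buf:
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
        return buf.split(b"\r\n", 1)[0].decode()


def tls_handshake(port):
    """Attempt a TLS handshake; return (succeeded, detail)."""
    context = ssl.create_default_context()
    context.check_hostname = False  # no real certificate is involved here
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname="localhost"):
                return True, "handshake succeeded"
    except ssl.SSLError as exc:
        return False, f"TLS error: {exc.reason}"
    except OSError as exc:
        return False, f"socket error: {exc}"


if __name__ == "__main__":
    _server, port = start_plaintext_listener()
    print(plain_request(port))
    print(tls_handshake(port))
```

The exact error text depends on the TLS library version, so the sketch reports whatever reason the library gives rather than matching a specific string such as the `unknown protocol` message in the original report.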
Thank you for clarifying the expected behaviour @bcelenza. We have a very similar configuration to your example, but in ECS Fargate. The server is configured on port 80 with HTTP. Our first approach was to use the typical HTTPS flow: However, after your previous message we found that this wasn't correct, and we have configured it so: With AppMesh TLS termination disabled we get the correct response when connecting from the client. However, as soon as we enable TLS termination (regardless of client policy) we get a 503 back from the server. Nothing is logged in the application, so the error looks like it is coming from Envoy. This error only occurs when communicating with the service from a client inside the mesh. If we use a client outside the mesh we get the expected response from the server. I think that is correct, as we don't have a virtual gateway, so a client outside the mesh is not going to follow mesh routing. Recycling the containers does not make any difference, and even on a fresh start, with Envoy cleanly pulling the config, the same issue occurs. If relevant we currently use
@gdowmont Interesting, I wonder if the 503 you're seeing is actually from the client Envoy because it's failing to validate the certificate on the server. To confirm this, could you do the following:
If this isn't the case, could you provide the logs from the client and server Envoys? That might help root-cause this issue.
Hi @bcelenza, we tried the steps you mentioned. I couldn't get any consistent behaviour, and I couldn't see any explanation in the Envoy logs. I was still getting a mixture of 404s (expected) and 503s with no clear pattern, and something was forcing the service to crash periodically (again, with no reasonable explanation in the logs).
Hi @bcelenza, I had a chance to do more testing:
Hey @aboutbelka, it sounds like there may be a few things going on. One question up front: are you calling UpdateVirtualNode to toggle TLS on and off for your services? We generally recommend traffic shifting from non-TLS to TLS Virtual Nodes (and likewise from TLS to non-TLS) via a Virtual Router and Route. The reason for this is that each proxy may receive the TLS configuration at a slightly different time (within seconds), which can result in endpoints being considered unhealthy for a period of time due to eventual consistency. Additionally, we recommend a series of best practices to avoid 503s in communication paths, which are often the result of scaling and/or other eventual consistency concerns. The name resolution issues are a bit more interesting to me. Name resolution via DNS should have the same outcome regardless of TLS (since it happens well before a TLS session is negotiated). That might be the issue to prioritize looking at first, as it could be a sign of a larger issue in your mesh.
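The traffic-shifting approach recommended above can be sketched as a series of route specs with weighted targets. This is a minimal illustration, not a definitive rollout procedure: the virtual node names `server-plain` and `server-tls`, the helper names, and the step percentages are all hypothetical, and the dictionaries only follow the general shape of an App Mesh HTTP route spec (`httpRoute.action.weightedTargets`).

```python
import json


def weighted_route_spec(plain_node, tls_node, tls_weight):
    """Build an HTTP route spec that splits traffic between two virtual nodes.

    plain_node and tls_node are hypothetical virtual node names; tls_weight is
    the share of traffic (0-100) sent to the TLS-enabled node.
    """
    if not 0 <= tls_weight <= 100:
        raise ValueError("tls_weight must be between 0 and 100")
    return {
        "httpRoute": {
            "match": {"prefix": "/"},
            "action": {
                "weightedTargets": [
                    {"virtualNode": plain_node, "weight": 100 - tls_weight},
                    {"virtualNode": tls_node, "weight": tls_weight},
                ]
            },
        }
    }


def shift_plan(plain_node, tls_node, steps=(10, 50, 100)):
    """Return one route spec per step, gradually moving traffic to the TLS node."""
    return [weighted_route_spec(plain_node, tls_node, weight) for weight in steps]


if __name__ == "__main__":
    for spec in shift_plan("server-plain", "server-tls"):
        print(json.dumps(spec, indent=2))
```

Each spec would be applied through a route update, one step at a time, with a pause between steps to watch error rates. Because the TLS setting never changes on a node that is already serving traffic, the proxies do not pass through a window in which they disagree about whether TLS is expected.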
@bcelenza I don't think eventual consistency or DNS is the problem here. We have tried it with a single Fargate container that was started fresh after TLS had been enabled on the virtual node. We have been using AppMesh (without TLS) in production for about 6-7 months now, and we only see this problem when enabling TLS termination.
@bcelenza I retested the same services yesterday. No changes have been made since Friday, and both services were started automatically this morning as part of the auto scaling process and are running without any problems. I performed the same wget command and got consistent 200 responses.
A brief status update on this issue since it's been a while. We've been working with the customer via internal support to help analyze and root-cause the issues described in @aboutbelka's latest comment. One issue we've noticed is that terminated endpoints related to Cloud Map-enabled Virtual Nodes remain routable when active health checking is configured, until the active health checks meet the unhealthy threshold. We've created a separate issue to track this (#213). We've also identified that whether or not TLS is enabled does not appear to be a factor in the stabilization time -- we've been able to reproduce with and without TLS. Research is ongoing and I'll report back when we have additional information.
Final status update -- the second impacting issue we've discovered is related to the negative caching TTL of Route 53 when used with Cloud Map. I've cut a separate issue on our roadmap for that: #221. I believe this covers the issues reported and discovered as part of this issue. I'm going to close this issue in favor of the other two, which we've cut for better tracking. Please feel free to reopen this issue (or open a new one) for any additional concerns or problems.
I've configured TLS on the server node using a Private certificate from PCA and set a client policy to enforce TLS on the client.
Both the server and the client nodes have the following permissions added to the policy:
"acm:ExportCertificate",
"acm-pca:GetCertificateAuthorityCertificate",
"appmesh:StreamAggregatedResources",
"acm-pca:DescribeCertificateAuthority"
When I run openssl s_client -connect my-service-name.co.uk:443/ I get the correct response
When I run the same from a different service inside the same mesh I get:
```
CONNECTED(00000003)
140065377040032:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:s23_clnt.c:795:
no peer certificate available
No client certificate CA names sent
SSL handshake has read 7 bytes and written 295 bytes
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
```
Attaching my JSON configs for the server and the client (applied through aws-cli on existing ECS Fargate services):
github.zip