
SSL23_GET_SERVER_HELLO:unknown protocol for openssl command from a different service in the same mesh #200

Closed
aboutbelka opened this issue May 13, 2020 · 13 comments

@aboutbelka

I've configured TLS on the server node using a private certificate from ACM Private CA and set a client policy to enforce TLS on the client.
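Roughly, the two relevant pieces of the virtual node specs look like this (a sketch with placeholder ARNs, not the exact attached configs):

Server node listener:

"listeners": [
  {
    "portMapping": { "port": 443, "protocol": "http" },
    "tls": {
      "mode": "STRICT",
      "certificate": {
        "acm": { "certificateArn": "arn:aws:acm:REGION:ACCOUNT:certificate/XXXX" }
      }
    }
  }
]

Client node backend defaults:

"backendDefaults": {
  "clientPolicy": {
    "tls": {
      "validation": {
        "trust": {
          "acm": { "certificateAuthorityArns": ["arn:aws:acm-pca:REGION:ACCOUNT:certificate-authority/XXXX"] }
        }
      }
    }
  }
}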

Both the server and the client nodes have the following permissions added to the policy:
"acm:ExportCertificate",
"acm-pca:GetCertificateAuthorityCertificate",
"appmesh:StreamAggregatedResources",
"acm-pca:DescribeCertificateAuthority"

When I run openssl s_client -connect my-service-name.co.uk:443 I get the correct response.
When I run the same command from a different service inside the same mesh I get:

CONNECTED(00000003)
140065377040032:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:s23_clnt.c:795:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 295 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
Attaching my JSON configs for the server and the client (applied through the AWS CLI on existing ECS Fargate services):
github.zip

@gdowmont

Same problem here.

Connecting to a virtual node with TLS termination enabled from outside the mesh correctly presents the cert to the client.

Connecting from one virtual node to another in the same mesh (backends, TLS termination, and client policy configured correctly) gives me the error.

@bcelenza
Contributor

Hey @aboutbelka and @gdowmont, is it possible that your client applications are also trying to negotiate TLS in addition to the proxy?

TLS in App Mesh (today) supports the proxies negotiating TLS between themselves, while the applications speak plain-text to the proxies.

We have a roadmap item for allowing the downstream (client) applications to negotiate TLS instead of the proxy (#162). But keep in mind: if the application negotiates TLS with its upstream service, you'll lose Layer 7 metrics and routing (since the traffic will be encrypted to the client proxy).

I'm moving this to our roadmap repo for better tracking.

@bcelenza transferred this issue from aws/aws-app-mesh-examples May 18, 2020
@bcelenza self-assigned this May 18, 2020
@aboutbelka
Author

aboutbelka commented May 20, 2020

Hi @bcelenza,
Thanks for coming back to me.

I tried changing the server application to HTTP on port 80.

Without TLS - the wget command from the client node returns a 404 (which is expected).
With TLS - I get a 503.
I tried both with and without the client policy on the client node.
It still works fine from the bastion with HTTP on port 80 instead of HTTPS on 443.
Looks like something is wrong with the mesh?

@jamsajones added the Bug and Priority: Medium labels May 20, 2020
@bcelenza
Contributor

bcelenza commented May 20, 2020

@aboutbelka So I can reproduce this behavior, and I can walk you through why this is the case.

Reproduction

I set up a 2-node mesh like yours, with a client node hosting SSH and an upstream server hosting a simple HTTP service that echoes back a 200 response code. My resources look like this (in Kubernetes format, but the spec should look familiar):

Client:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualNode
metadata:
  name: client
  namespace: aviary
spec:
  meshName: aviary
  backends:
    - virtualService:
        virtualServiceName: mockingbird.aviary.svc.cluster.local
  backendDefaults:
    clientPolicy:
      tls:
        validation:
          trust:
            acm:
              certificateAuthorityArns:
                - arn:aws:acm-pca:us-east-1:XXXXXXX:certificate-authority/XXXXXXX
  logging:
    accessLog:
      file:
        path: /dev/stdout

Server:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualNode
metadata:
  name: mockingbird-v1
  namespace: aviary
spec:
  meshName: aviary
  listeners:
    - portMapping:
        port: 80
        protocol: http
      tls:
        mode: STRICT
        certificate:
          acm:
            certificateArn: arn:aws:acm:us-east-1:XXXXXXXX:certificate/XXXXXXXX
  serviceDiscovery:
    dns:
      hostName: mockingbird.aviary.svc.cluster.local
---
apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualService
metadata:
  name: mockingbird.aviary.svc.cluster.local
  namespace: aviary
spec:
  meshName: aviary
  virtualRouter:
    listeners:
      - portMapping:
          port: 80
          protocol: http
  routes:
    - name: default-route
      http:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeName: mockingbird-v1
              weight: 1

When I shell into my client node, and issue an HTTP request to the mockingbird service, it succeeds:

$ curl -v http://mockingbird.aviary.svc.cluster.local/echo
*   Trying 10.100.125.95...
* Connected to mockingbird.aviary.svc.cluster.local (10.100.125.95) port 80 (#0)
> GET /echo HTTP/1.1
> Host: mockingbird.aviary.svc.cluster.local
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< date: Wed, 20 May 2020 21:01:01 GMT
< content-length: 0
< x-envoy-upstream-service-time: 4
< server: envoy
<
* Connection #0 to host mockingbird.aviary.svc.cluster.local left intact

But an HTTPS request, on the other hand:

$ curl -vk https://mockingbird.aviary.svc.cluster.local:80/echo
*   Trying 10.100.125.95...
* Connected to mockingbird.aviary.svc.cluster.local (10.100.125.95) port 80 (#0)
* error reading ca cert file /etc/ssl/certs/ca-certificates.crt (Error while reading file.)
* found 0 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* gnutls_handshake() failed: An unexpected TLS packet was received.
* Closing connection 0
curl: (35) gnutls_handshake() failed: An unexpected TLS packet was received.

And when I verify with openssl:

$ openssl s_client -connect mockingbird.aviary.svc.cluster.local:80
CONNECTED(00000003)
139747333252760:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:s23_clnt.c:794:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 305 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1590006493
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

So why is this happening?

I'm going to borrow a diagram from our TLS blog post to help explain this.

TLS diagram

The current App Mesh TLS model configures the client and server proxies to negotiate TLS. So when I send the HTTP (not HTTPS) request from curl above, the communication path looks like:

Client Application (curl) ---Plain Text---> Client Envoy ---TLS---> Server Envoy ---Plain Text---> Server Application

This works because the Client Envoy is configured to intercept HTTP requests on port 80, negotiate TLS upstream with the Server Envoy, and send the traffic along that path.

The HTTPS request, by comparison, looks like this:

Client Application (curl) ---XXX connection fails---> Client Envoy

The request fails because the Client Envoy is not expecting to negotiate TLS with the curl application; when curl attempts the TLS handshake with the Client Envoy, no TLS session can be established.

The same failure can be observed with openssl, because the Client Envoy intercepts the traffic in both cases and fails to negotiate TLS.

Where to go from here?

In effect, this is the intended behavior of the current TLS implementation. If the application (curl or wget for example) wants to negotiate TLS, then the Client Envoy would either need to terminate TLS and re-negotiate with the Server Envoy, or proxy the traffic as pure TCP. The former option adds complexity in setup, and the latter means you'd lose all of the benefits of Layer 7 request routing, observability, etc.

That said, would your expectation be to have the client application originate the TLS session instead? If so, we have a roadmap item for this and I'd love to have your +1 and feedback on it: #162.

I hope this explanation of the behavior helped, and let me know what your ideal communication path between the applications and proxies looks like.

@gdowmont

Thank you for clarifying the expected behaviour @bcelenza.

We have a very similar configuration to your example, but in ECS Fargate. The server is configured on port 80 with HTTP.

Our first approach was to use the typical HTTPS flow:
Client Application (curl) ---XXX connection fails---> Client Envoy

However after your previous message we found that it wasn't correct and we have configured it so:
Client Application (curl) ---Plain Text---> Client Envoy ---TLS---> Server Envoy ---Plain Text---> Server Application

With App Mesh TLS termination disabled we get the correct response when connecting from the client. However, as soon as we enable TLS termination (regardless of client policy) we get a 503 back from the server. Nothing is logged in the application, so the error looks like it is coming from Envoy.

This error only occurs when communicating to the service from a client inside the mesh.

If we use a client outside the mesh we get the expected response from the server. I think that is correct, as we don't have a virtual gateway, so a client outside the mesh is not going to follow mesh routing.

Recycling the containers does not make any difference; even on a fresh start, with Envoy cleanly pulling its config, the same issue occurs.

If relevant, we currently use 840364872350.dkr.ecr.eu-west-1.amazonaws.com/aws-appmesh-envoy:v1.12.3.0-prod as our Envoy image.

@bcelenza
Contributor

@gdowmont Interesting, I wonder if the 503 you're seeing is actually from the client Envoy because it's failing to validate the certificate on the server.

To confirm this, could you do the following:

  1. Enable debug logging for both the client and server envoy: https://docs.aws.amazon.com/app-mesh/latest/userguide/troubleshooting-best-practices.html#ts-bp-enable-envoy-debug-logging
  2. Check and see if you get the error mentioned in this troubleshooting topic: https://docs.aws.amazon.com/app-mesh/latest/userguide/troubleshooting-security.html#ts-security-tls-client-policy

If that's not the case, providing the logs from the client and server Envoys might help root cause this issue.
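For step 1 on ECS, that should just be a matter of setting an environment variable on the Envoy container in your task definition, along these lines (fragment only; container name and image are just examples):

{
  "name": "envoy",
  "image": "840364872350.dkr.ecr.eu-west-1.amazonaws.com/aws-appmesh-envoy:v1.12.3.0-prod",
  "environment": [
    { "name": "ENVOY_LOG_LEVEL", "value": "debug" }
  ]
}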

@aboutbelka
Author

Hi @bcelenza ,

We tried the steps you mentioned, but I couldn't get any consistent behaviour. I couldn't see any explanation in the Envoy logs; I was still getting a mixture of 404s (expected) and 503s with no clear pattern, and something was forcing the service to crash periodically (again with no reasonable explanation in the logs).
For 503s I was getting the following:
[2020-05-22 13:34:33.196][20][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:198] [C295] handshake error: 5
Unfortunately, my 30-day certificate trial is about to expire, so I have to freeze any further testing and won't be able to provide responses.

@aboutbelka
Author

Hi @bcelenza,

I had a chance to do more testing:
Both services work fine without TLS. Once TLS is enabled, I get a mix of 200s, 503s, and "Name or service not known" errors with no clear pattern. The Envoy logs don't show any errors and the services remain healthy in ECS; however, Cloud Map shows the service becoming unhealthy for a short period of time, though the reason is not clear (and it stays healthy in ECS).

@bcelenza
Contributor

bcelenza commented Jun 1, 2020

Hey @aboutbelka, it sounds like there may be a few things going on.

One question up front: are you calling UpdateVirtualNode to toggle TLS on and off for your services? We generally recommend traffic shifting from non-TLS to TLS Virtual Nodes (and likewise from TLS to non-TLS) via a Virtual Router and Route, as in the sketch below. The reason is that each proxy may receive the TLS configuration at a slightly different time (within seconds), which can result in endpoints being considered unhealthy for a period of time due to eventual consistency.
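As a rough sketch of what that looks like (hypothetical virtual node names), you would point a Route at both the non-TLS and TLS variants of the node and shift the weights over time, rather than updating a single node in place:

{
  "httpRoute": {
    "match": { "prefix": "/" },
    "action": {
      "weightedTargets": [
        { "virtualNode": "my-service-v1", "weight": 90 },
        { "virtualNode": "my-service-v1-tls", "weight": 10 }
      ]
    }
  }
}

Once the TLS-enabled node is taking traffic cleanly, you shift the weights to 0/100 and retire the old node.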

Additionally, we recommend a series of best practices to avoid 503s in communication paths, which are often the result of scaling and/or other eventual consistency concerns.

On the name resolution issues, that is a bit more interesting to me. Name resolution via DNS should have the same outcome regardless of TLS (since it happens well before a TLS session is negotiated). That might be the issue to prioritize looking at first, as it could be a sign of a larger problem in your mesh.

@gdowmont

gdowmont commented Jun 2, 2020

@bcelenza I don't think eventual consistency or DNS is the problem here. We have tried it with a single Fargate container that was started fresh after TLS had been enabled on the virtual node.
Once launched, it goes into a loop of 503s, correct responses, and DNS errors, switching between them every 30 seconds or so. The container stays up the whole time and the health check does not terminate it.

We have been using AppMesh (without TLS) in production for about 6-7 months now and we only see this problem when enabling TLS termination.

@aboutbelka
Author

aboutbelka commented Jun 3, 2020

@bcelenza I retested the same services yesterday. No changes had been made since Friday; both services were started automatically this morning as part of the auto scaling process and were running without any problems. I ran the same wget command and got consistent 200 responses.
However, when I manually restarted the container for the server, I started experiencing the same problems as before: a mixture of 200, 503, and "unable to resolve host". I tested it a few times, and it looks like it takes around 15 minutes after a restart for the service to become stable and return consistent 200s:
12:54 - stopped the container
12:56:39 - 503
12:57:21 - unable to resolve host
12:59:20 - a mix of 200/503 alternating for some 20 seconds
12:59:42 - unable to resolve host
13:00:44 - 503
13:00:47 - 200
13:01:20 - 503
13:02:31 - 200
13:02:33 - 503
13:02:40 - 200
13:02:43 - 503
13:02:53 - 200
13:03:06 - unable to resolve host
13:04:01 - 200
13:05:05 - unable to resolve host
13:06:13 - 200
13:07:14 - unable to resolve host
13:09:14 - 200 and onwards
I don't think any 503 or "unable to resolve host" responses should appear after the first 200, and 15 minutes is too long for the service to become stable; other non-TLS services recover within a minute or two of stopping the container.
From the ECS service perspective, the service was up within 3 minutes and reached a healthy state.

@bcelenza
Contributor

A brief status update on this issue since it's been a while.

We've been working with the customer via internal support to help analyze and root cause the source of the issues described in @aboutbelka's latest comment.

One issue we've noticed is that terminated endpoints related to Cloud Map-enabled Virtual Nodes remain routable when active health checking is configured, until the active health checks meet the unhealthy threshold. We've created a separate issue to track this (#213).

We've also identified that whether or not TLS is enabled does not appear to be a factor in the stabilization time -- we've been able to reproduce with and without TLS.

Research is on-going and I'll report back when we have additional information.

@bcelenza
Contributor

Final status update -- the second impacting issue we've discovered is related to the negative caching TTL of Route 53 when used with Cloud Map. I've cut a separate issue on our roadmap for that: #221.

I believe this covers the issues reported and discovered as part of this thread. I'm going to close this issue in favor of the other two, which we've cut for better tracking. Please feel free to re-open this issue (or open a new one) for any additional concerns or problems.
