
[Streaming][bugfix] handle TLS signalisation when TLS is disabled on client side #9494

Closed

Conversation

@pierresouchay (Contributor)

This will fix #9474

wrapper(server.Datacenter, conn) could return the same connection. In that case, the Consul client incorrectly marked the connection as RPCTLS even though the connection was not encrypted.

This causes messages like:

[ERROR] agent.server.rpc: failed to read byte: conn=from=10.236.195.48:51572 error="tls: first record does not look like a TLS handshake"

The bug can be highlighted very easily by running Consul in dev mode with this config:

consul agent -dev -hcl 'rpc{enable_streaming=true}' -hcl use_streaming_backend=true -hcl 'ca_file="/etc/consul/ca.pem", cert_file="/etc/consul/cert.pem", key_file="/etc/consul/key.pem"'

Then, the bug is triggered with:

curl 'http://localhost:8500/v1/health/service/consul?index=2&stale'

After this fix, such a configuration works correctly.
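
To make the failure mode concrete, here is a minimal Go sketch (the identifiers and the byte value are illustrative, not the real Consul ones): the dialer picks the byte-type marker from the server's advertised TLS support alone, while the wrapper may silently return the connection unchanged.

package main

import (
	"fmt"
	"net"
)

// Illustrative value only, not the real pool constant.
const rpcTLSMarker byte = 2

// TLSWrapper has the shape of Consul's outgoing RPC wrapper: it may return
// the connection unchanged when the client has no usable TLS config.
type TLSWrapper func(dc string, conn net.Conn) (net.Conn, error)

// buggyDial shows the failure mode: the marker byte is chosen from the
// server's advertised TLS support alone, before we know whether the
// wrapper will really start TLS on the client side.
func buggyDial(dc, addr string, serverUseTLS bool, wrap TLSWrapper) (net.Conn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	if serverUseTLS {
		// Tell the server "a TLS handshake follows" ...
		if _, err := conn.Write([]byte{rpcTLSMarker}); err != nil {
			conn.Close()
			return nil, err
		}
		// ... but if wrap is a no-op (e.g. verify_outgoing=false), the next
		// bytes are plaintext gRPC and the server logs
		// "first record does not look like a TLS handshake".
		return wrap(dc, conn)
	}
	return conn, nil
}

func main() { fmt.Println("buggyDial illustrates the mismatch described above") }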

@dnephin (Contributor) left a comment

Thank you for the PR! I was hoping to have a look at the bug report soon. This definitely seems like the right place for the fix.

I'm going to look into this fix some more tomorrow and see why the test is failing.

It seems a bit strange to me that the wrapper could return the same connection unmodified. I'm wondering if we don't need to send the RPCTLS byte at all from here because that wrapping happens elsewhere, if we should be checking something other than server.UseTLS, or if there is something else I don't understand here.

agent/grpc/client.go (review thread resolved)
The previous patch was using conn to write the magic number, but
since the TLS connection had already been started, it corrupted the
stream when TLS was active.

To make it work, `TLSWrapper` was changed into another indirection that
takes the datacenter as a parameter, so whether TLS is enabled is
known in advance.

This involves some refactoring, but it ensures the conditions used to
decide whether TLS is enabled are identical.
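
A rough sketch of the shape this commit describes, with guessed identifier names rather than the ones actually used in the patch: the wrapper is resolved per datacenter and reports up front whether TLS is enabled, so the dialer decides before writing anything on the wire.

package main

import (
	"fmt"
	"net"
)

// TLSWrapper wraps a single connection in TLS.
type TLSWrapper func(conn net.Conn) (net.Conn, error)

// DCWrapper is the extra indirection: given a datacenter it returns the
// wrapper to use and whether TLS is enabled there at all, so the dialer
// can pick the byte-type marker before touching the wire.
type DCWrapper func(dc string) (TLSWrapper, bool)

func dial(dc, addr string, wrapperFor DCWrapper) (net.Conn, error) {
	wrap, useTLS := wrapperFor(dc)

	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	if !useTLS {
		// Plaintext: never advertise TLS to the server.
		return conn, nil
	}
	// Illustrative TLS marker value; the real constant lives in the pool package.
	if _, err := conn.Write([]byte{2}); err != nil {
		conn.Close()
		return nil, err
	}
	return wrap(conn)
}

func main() { fmt.Println("dial() sketches the datacenter-keyed indirection") }
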
@pierresouchay pierresouchay force-pushed the streaming_fix_grpc_tls branch from 8de406a to deeb577 Compare January 5, 2021 17:01
@pierresouchay pierresouchay requested a review from dnephin January 5, 2021 17:14
@dnephin (Contributor) commented Jan 5, 2021

Thank you for tracking down the bug! @kyhavlov and I were looking at this, and I think it should be possible to fix without changing the signature of OutgoingRPCWrapper.

As you found, the source of the problem is verify_outgoing=false. I think that's because of this line:

if c.UseTLS(dc) {

Which gets here:

consul/tlsutil/config.go, lines 558 to 561 at 19baf4b:

// if CAs are provided or VerifyOutgoing is set, use TLS
if c.base.VerifyOutgoing {
return false
}

verify_outgoing seems like the wrong thing to be checking in this case. What we want to confirm is that:

  1. the server will accept TLS (already handled by newDialer with if server.UseTLS)
  2. the client is capable of using TLS (has the CA config)

verify_outgoing is neither of those; it is a way of saying "must use TLS". Even if verify_outgoing is false, I think we should opt to use TLS when the server supports it.

I think we can accomplish that by changing OutgoingRPCWrapper to call MutualTLSCapable instead of UseTLS on the line I referenced above.

We took a quick look at the two other callers of OutgoingRPCWrapper and both already explicitly call UseTLS themselves, so there should be no impact on the existing callers. MutualTLSCapable seems like it is the right check here.

UseTLS should probably be renamed to something more appropriate for what it does, but we don't need to change that to fix the bug.

If we make that change I guess we also need to handle the case where MutualTLSCapable() returns false, but verify_outgoing=true. I'm not sure if that gets handled during config loading, or if that is something we'll need to check here. It seems like it should be possible to handle that during config loading, but I haven't looked to see if we do.
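
To make the difference between the two checks concrete, here is a toy model of the semantics described above (this is not the real tlsutil.Configurator, and the field and method bodies are invented simplifications):

package main

import "fmt"

// toyTLSConfig stands in for the relevant bits of the TLS config; the real
// Configurator carries far more state than this.
type toyTLSConfig struct {
	caConfigured   bool // ca_file / ca_path present
	certConfigured bool // cert_file / key_file present
	verifyOutgoing bool
}

// useTLS models the existing check the dialer relies on: TLS is only
// reported when verify_outgoing forces it.
func (c toyTLSConfig) useTLS() bool { return c.verifyOutgoing }

// mutualTLSCapable models the suggested check: the client can speak TLS as
// soon as it has a CA (and a client certificate), regardless of whether
// verify_outgoing demands it.
func (c toyTLSConfig) mutualTLSCapable() bool { return c.caConfigured && c.certConfigured }

func main() {
	// The configuration from the bug report: CA/cert/key set,
	// verify_outgoing left at its default of false.
	c := toyTLSConfig{caConfigured: true, certConfigured: true, verifyOutgoing: false}
	fmt.Println("useTLS:", c.useTLS())                     // false -> wrapper is a no-op
	fmt.Println("mutualTLSCapable:", c.mutualTLSCapable()) // true  -> TLS could be used
}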

@pierresouchay (Contributor, Author)

@dnephin Thank you for looking. Well, to me the big problem is that getAreaForPeerDatacenterUseTLS(dc), called by c.UseTLS(dc), offers no guarantee that it will return the same value as server.UseTLS (I mean, myServ.dc1 could return UseTLS=true while getAreaForPeerDatacenterUseTLS("dc1") could return false). That is why I had to introduce dc, to avoid changing the behaviour too much.

I am also quite afraid of silently activating TLS for all Consul users in such a case, because for now, having verify_outgoing=false actually disabled TLS by default over RPC (and enabling it might have a significant performance impact).

What do you think?

@dnephin (Contributor) commented Jan 5, 2021

I agree we shouldn't change the existing behaviour for normal RPC, but I think we can make the default behaviour for gRPC requests do the right thing. If someone doesn't want TLS at all, they omit the ca_file. Since the existing callers already call UseTLS before calling OutgoingRPCWrapper, I think it should be safe to change OutgoingRPCWrapper.

@pierresouchay (Contributor, Author)

@dnephin What would you recommend then? Because there would be two different code paths, I would have to duplicate OutgoingRPCWrapper: one for gRPC over RPC, the other for RPC.

Still, I don't really get the purpose of getAreaForPeerDatacenterUseTLS() in this code path...

I am a bit skeptical: having different behaviour for gRPC over RPC and other RPC with the same parameters is weird. TLS is hard enough/scary for admins; I would not make the behaviour any more complex if possible.

I see and applaud the goal of using TLS as much as possible, but in that case I would rather change this behaviour for all RPC, not only gRPC over RPC, in the next major release.

@dnephin (Contributor) commented Jan 6, 2021

Ya, I think you are probably right. We should consider doing that for 1.10 but we don't need to do it in this PR.

Another option for fixing this would then be to change the grpc/client.NewClientConnPool to accept the full tlsutil.Configurator instead of only the wrapper. That way in newDialer we could use if server.UseTLS && configurator.UseTLS(dc), which is basically what the other code paths do.
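
A sketch of where that check could live, assuming a stand-in interface for the configurator and an illustrative marker byte (neither matches the real NewClientConnPool signature exactly):

package main

import (
	"fmt"
	"net"
)

// configurator is a stand-in for the subset of tlsutil.Configurator the
// dialer would need under this proposal.
type configurator interface {
	UseTLS(dc string) bool
	OutgoingRPCWrapper() func(dc string, conn net.Conn) (net.Conn, error)
}

// serverMeta carries what the server advertises via gossip.
type serverMeta struct {
	Datacenter string
	UseTLS     bool
}

const rpcTLSMarker byte = 2 // illustrative value only

// newDialer receives the whole configurator, so it can combine the
// server-side and client-side checks like the other RPC code paths:
// server.UseTLS && configurator.UseTLS(dc).
func newDialer(c configurator) func(srv serverMeta, addr string) (net.Conn, error) {
	return func(srv serverMeta, addr string) (net.Conn, error) {
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			return nil, err
		}
		if srv.UseTLS && c.UseTLS(srv.Datacenter) {
			// Only now send the TLS byte-type marker and wrap the connection.
			if _, err := conn.Write([]byte{rpcTLSMarker}); err != nil {
				conn.Close()
				return nil, err
			}
			return c.OutgoingRPCWrapper()(srv.Datacenter, conn)
		}
		return conn, nil // plaintext gRPC
	}
}

func main() { fmt.Println("newDialer sketches the server.UseTLS && UseTLS(dc) check") }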

@pierresouchay (Contributor, Author) commented Jan 6, 2021

You decide; tell me what to do and I'll make the change quickly, because it blocks our release of 1.9.x in production (we try to push changes to production only once they are merged upstream).

I took this approach because:

  • the TLS wrapper was lying (i.e. not creating TLS); having a bool is explicit and lets the caller get the result cleanly
  • even in case of refactoring, we are sure the check deciding whether or not to use TLS lives in a single place.

@pierresouchay (Contributor, Author)

@dnephin Implemented as you suggested in #9512

dnephin added a commit that referenced this pull request Jan 6, 2021
[Streaming][bugfix] handle TLS signalisation when TLS is disabled on client side (alternative to #9494)
@dnephin (Contributor) commented Jan 6, 2021

replaced by #9512

@dnephin dnephin closed this Jan 6, 2021
hashicorp-ci pushed a commit that referenced this pull request Jan 6, 2021
[Streaming][bugfix] handle TLS signalisation when TLS is disabled on client side (alternative to #9494)
pierresouchay added a commit to criteo-forks/consul that referenced this pull request Jan 7, 2021
criteoconsul pushed a commit to criteo-forks/consul that referenced this pull request Feb 4, 2021

Successfully merging this pull request may close these issues.

[streaming] weird errors when watching services with streaming on Consul 1.9.1