Fail to bootstrap Fleet Server with self-signed certificates in 7.17 #3435

Closed
jsoriano opened this issue Apr 5, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@jsoriano
Member

jsoriano commented Apr 5, 2024

Fleet Server has started failing to bootstrap with self-signed certificates in the 7.17 branch. It was initially thought not to affect any released version, but it affects 7.17.20.

It fails with:

Error: fail to enroll: fail to execute request to fleet-server: remote error: tls: bad certificate

The very same certificates work with 7.17.19 and any other version tested, including 8.14.0-SNAPSHOT.

The only significant change since 7.17.19 is #3391, which updates the Beats dependency from 7.11.2 to 7.17.18.

One change included in Beats between these versions completely rewrites certificate validation: https://github.com/elastic/beats/pull/22495/files#diff-6225b0191feaa98542380b95e652c6f1f6805ee9a141330788011681cce0e487R153

There is a change in elastic-agent that fixes a certificate validation issue during bootstrap, and it was not backported to 7.x. Not sure if it could be related: https://github.com/elastic/elastic-agent/pull/1867/files#diff-7efa04d4079650519d21a6e8ba217a7874130cecf8adc92dc2a78d4cfd2aee09R556

For confirmed bugs, please report:

  • Version: 7.17 branch (doesn't affect 7.17.19)
  • Steps to Reproduce:
    • Can be reproduced with elastic-package stack up -v -d --version 7.17-SNAPSHOT. Under the hood, this:
      • Generates self-signed certificates.
      • Installs the CA system-wide in the containers (in /etc/ssl/certs).
      • Bootstraps Fleet Server using these certificates.
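
For readers who want to see what a working self-signed setup looks like from the client's side, the following Go sketch mirrors the steps above under stated assumptions. It creates a throwaway CA, issues a server certificate for a hypothetical host name (`fleet-server.example`), and verifies the chain with the standard library only. It is not what elastic-package does internally, and the helper names (`newCert`, `buildChain`, `verify`) are made up for illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCert creates a key pair and a certificate for tmpl. With a nil parent the
// certificate is self-signed; otherwise it is signed with parentKey.
func newCert(tmpl, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// buildChain returns a self-signed CA and a server certificate issued for host.
func buildChain(host string) (*x509.Certificate, *x509.Certificate, error) {
	now := time.Now()
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example test CA"},
		NotBefore:             now.Add(-time.Hour),
		NotAfter:              now.Add(24 * time.Hour),
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	ca, caKey, err := newCert(caTmpl, nil, nil)
	if err != nil {
		return nil, nil, err
	}
	leafTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host}, // Go matches host names against SANs, not the CN
		NotBefore:    now.Add(-time.Hour),
		NotAfter:     now.Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	leaf, _, err := newCert(leafTmpl, ca, caKey)
	return ca, leaf, err
}

// verify checks that leaf chains to ca and is valid as a server cert for host.
func verify(ca, leaf *x509.Certificate, host string) error {
	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     roots,
		DNSName:   host,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	})
	return err
}

func main() {
	ca, leaf, err := buildChain("fleet-server.example")
	if err != nil {
		panic(err)
	}
	fmt.Println("matching host:", verify(ca, leaf, "fleet-server.example"))
	fmt.Println("other host:", verify(ca, leaf, "other.example"))
}
```

If a chain built this way verifies on the client but the handshake still fails, the rejection is more likely coming from the other side of the connection, which is consistent with the `remote error` prefix in the message above.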
@jlind23
Contributor

jlind23 commented Apr 8, 2024

@jsoriano Looks like it happened due to b4de6f5 as stated by @cmacknz here - https://github.com/elastic/dev/issues/2547#issuecomment-2040631630

I'll add this to one of our upcoming sprints.

@jillguyonnet
Contributor

jillguyonnet commented Apr 15, 2024

Hi @michel-laterman 👋 A customer is experiencing this issue after upgrading from 7.17.18 to 7.17.20 (https://github.com/elastic/sdh-beats/issues/4608) - could you confirm whether they should roll back until elastic/beats#38785 is merged?

Edit: thank you @jlind23 for https://github.com/elastic/sdh-beats/issues/4608#issuecomment-2056255255

@cmacknz
Member

cmacknz commented Apr 15, 2024

There needs to be a known issue about this. 7.17 doesn't have release notes in ingest-docs yet, so probably we need to put this in both possible places to ensure people catch it:

https://github.com/elastic/ingest-docs/tree/main/docs/en/ingest-management/release-notes

@cmacknz
Member

cmacknz commented Apr 15, 2024

Never mind, the release note already exists (elastic/ingest-docs#1006); I was looking at the main branch for docs on 7.17.

@michel-laterman
Contributor

It does not look like elastic/beats#38785 by itself will resolve the issue.

I've tried to see if it's an issue with tlscommon by updating the module in beats for the 7.17 branch with whatever is in the elastic-agent-libs repo and building an elastic-agent, and altering my fleet-server to build with it as well (branch for beats here).

I've altered my fleet-server to also report its config when the bootstrap message is updated so we can see it on the command line.
Note that the certs/keys/CAs below are self-signed for the elastic-agent instance in a VM (safe to expose).

# fleet-server debugging
2024-04-15T23:30:07.196Z	INFO	cmd/enroll_cmd.go:788	Fleet Server - Running on policy with Fleet Server integration: 365721b0-fb41-11ee-b608-5bc48b3ca2f3; missing config fleet.agent.id (expected during bootstrap process), server.ssl: &{Enabled:<nil> VerificationMode:full Versions:[] CipherSuites:[] CAs:[] Certificate:{Certificate:-----BEGIN CERTIFICATE-----
MIIDSjCCAjKgAwIBAgICBnowDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh
c3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTI0MDQxNTIzMjk1OVoX
DTM0MDQxNTIzMjk1OVowMzEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDEZMBcGA1UE
AxMQZmxlZXQtc2VydmVyLWRldjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC
ggEBANoNELKiFda/y0PYtMrQj4ec0e7XFCT9b15I4YzMiu+6ZCJjQ1ivxGyoqDek
Y1/5PI82P8wDOy/wBRm0X9OOJ1EBuek3UM+Z9oY4WdKGfDCu2SyJgtq54Bo7JgfO
r4wB3UTeB4LFK66K3lz3MgfrTkiRN00cwmpC4wy5MhZd8ytJ9nKNCMvw4XmHOmgs
ZDv/WwvNrB18qNmh8WPL6kmp4LQ7fACoY1onmAjwctgyvpzQBsHd8vqmbpdYj0NE
uKVH5DJ0yOIVEi2WEETNQVvadVSPFcGWkUoXYJ4wQ7OFGNajsXiMjlPOpF2r/CCd
Drifszm4G312TOmTW39cg6WPGEMCAwEAAaNvMG0wDgYDVR0PAQH/BAQDAgeAMB0G
A1UdJQQWMBQGCCsGAQUFBwMCBggrBgEFBQcDATAfBgNVHSMEGDAWgBQPjDnP1DPX
S/qqOGb+FNGciVNwNDAbBgNVHREEFDASghBmbGVldC1zZXJ2ZXItZGV2MA0GCSqG
SIb3DQEBCwUAA4IBAQBVG3exeOydiomKUwOCPqtxrU/d+iF/naNhCzRU+CrMyDmI
fNz82ybxs6Vrbq1bW0sq76aUeBbvEcKdJRdaflJhln41DYrRrXyZEyCPrIhNJzgW
0SWJ8Fv8ObQYMMu33U/qAJDvM2y9gBTW6NCDvX4Pn1kk4bnK5pwXlgxAFV/U1t5N
uGpw6L05WdutF6Qg7cKrVp45tlF7sqVYqWTPZj0vjHxS0Sa6UW7yadZSFhPg404x
re/rkiwfetf1jT1eazWiPdA8burrKQUg0p28z/e8Eon7ZQKdGNn2PfEtoDUFupXg
1ejvPjVjP5wA+Pj5cu70Oqh5yATV/bQY7l4fzbuH
-----END CERTIFICATE-----
 Key:-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEA2g0QsqIV1r/LQ9i0ytCPh5zR7tcUJP1vXkjhjMyK77pkImND
WK/EbKioN6RjX/k8jzY/zAM7L/AFGbRf044nUQG56TdQz5n2hjhZ0oZ8MK7ZLImC
2rngGjsmB86vjAHdRN4HgsUrroreXPcyB+tOSJE3TRzCakLjDLkyFl3zK0n2co0I
y/DheYc6aCxkO/9bC82sHXyo2aHxY8vqSangtDt8AKhjWieYCPBy2DK+nNAGwd3y
+qZul1iPQ0S4pUfkMnTI4hUSLZYQRM1BW9p1VI8VwZaRShdgnjBDs4UY1qOxeIyO
U86kXav8IJ0OuJ+zObgbfXZM6ZNbf1yDpY8YQwIDAQABAoIBADSnxwp8Ha34Lsu5
fx8i8iYbdo6onZK5KLWp/92SX1K4vgmX0uGNwG9E4ypcpiq88yTaQtmvRhGzcmfc
qO9bep8TPaPV2cvoMCIFZZtzInZXItagdlr/W+C5u9tSzA2RPa+ttj6cAoppunzU
rN5AsmzPtH0InuIuOMoPxsqj8V7YUWxZaxR+jQzWND2q5k5B7SAKCkBLztB/OugV
Ws+CGYMj3d0callJATUKQPmOcLmTK8MWY/nLOgc86tiwRf189uC4zz81q6uY0eA2
Ych61zhgINDlY17/ov7QjAcuSZpZLq8pgIL4uZGpLXA5xmLaegZQgDeN6TVC8J2d
jTUWb+ECgYEA5/5NsEPv+kpQOsscCgVEsaTFxP6FNq95hchw9dSlkAHdrvTBYe0I
d9mg1kowdIYOhUv5lvyhpdM7Pr6gHop4EEHCXGR8yKtYMMDzGUu5f/Rdz0IAWsqq
/8ylaWVMNZZyvxnV2s8aWib8jws2DlbevKrqnVVC1sDoV/b3ueVaETsCgYEA8J1q
7tbZ1FVOFIwX03toh8obDOcHgcDaGzD6b1kEjro86k8f/GJb395gHFj2Ht3Zx8HB
LPtpPVJ6xw/FIw3Ke+I1WiWVQ6HwRQS0kFGp2AuaXgBoj2TxeUTqmbxXvRkAHHRB
yXpEWAWH39hd/274wXzvnYmT9Gi45Qxi7kQDpJkCgYB49+yQlcxDaY6OKayUOQ0J
yE5rmv/hdPxb2xmzxc8S2TY77VoM8ukwfWVVd9fuWpyluukJZu/vJMbGv+WEJ1XV
vERZovhCNr5EpcfdD9RJOSXVVagTr4wc2BwEahKj+rAYn6MYdldaXOvitsjYD0oT
fNfbdELm8i30+E1SPJqLUwKBgE5sAB44CGccJoar4lgbMMaRKJ/b7KZtpKiYHgeM
i9+484GqqFIp/KfKYqjald0ZkZF5pOx0RKin6TxX93ilVglqgNkQxsV0UkssbW1c
MG8p2PYqS+nwjINp4syYhkArlc2wVoDESOIna1GZw4ktMgZeIfrGjGJsf1an4tal
dEqBAoGBALVwqXDOuxeQcOoOfNmhcsI3lUgrhFewAqNZ24X78mLSCG1vULnjuunV
ZaT4qrekvhpCdIg+AOv7TbolAUzWfQmy/NVNa3sG2PwySdCt9SiOE+fm84fkW7Eh
sxmtjoh7lKhy1ZpWmCDFXO9Bqz0DI+KIhnAWcpnUk0MfpLU1uCxX
-----END RSA PRIVATE KEY-----
 Passphrase: PassphrasePath:} CurveTypes:[] ClientAuth:<nil> CASha256:[]}, host: 0.0.0.0

# elastic-agent debugging
2024-04-15T23:30:07.810Z	INFO	cmd/enroll_cmd.go:458	Starting enrollment to URL: https://fleet-server-dev:8220/
2024-04-15T23:30:07.919Z	INFO	cmd/enroll_cmd.go:485	Attempting to diagnose tls issues	{"host": "fleet-server-dev:8220"}
2024-04-15T23:30:07.924Z	INFO	cmd/enroll_cmd.go:498	Remote cert found	{"issuer_name": "CN=localhost,O=elastic-fleet", "expiry": "2034-April-15", "common_name": "localhost", "dns_names": ["fleet-server-dev"], "ips": null, "issuer": "MCwxFjAUBgNVBAoTDWVsYXN0aWMtZmxlZXQxEjAQBgNVBAMTCWxvY2FsaG9zdA==", "sig": "VRt3sXjsnYqJilMDgj6rca1P3fohf52jYQs0VPgqzMg5iHzc/Nsm8bOla26tW1tLKu+mlHgW7xHCnSUXWn5SYZZ+NQ2K0a18mRMgj6yITSc4FtElifBb/Dm0GDDLt91P6gCQ7zNsvYAU1ujQg71+D59ZJOG5yuacF5YMQBVf1NbeTbhqcOi9OVnbrRekIO3Cq1aeObZRe7KlWKlkz2Y9L4x8UtEmulFu8mnWUhYT4ONOMa3v65IsH3rX9Y09Xms1oj3QPG7q6ykFINKdvM/3vBKJ+2UCnRjZ9j3xLaA1BbqV4NXo7z41Yz+cAPj4+XLu9DqoecgE1f20GO5eH827hw=="}
2024-04-15T23:30:07.924Z	INFO	cmd/enroll_cmd.go:504	CA info	{"cas": ["-----BEGIN CERTIFICATE-----\nMIIDSzCCAjOgAwIBAgICBnUwDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh\nc3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTI0MDQxNTIzMjk1OVoX\nDTM0MDQxNTIzMjk1OVowLDEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDESMBAGA1UE\nAxMJbG9jYWxob3N0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA6NRw\n0N7k5InOEJh6z33aVubEZfrsRdISifVX9jHAE76MDMxZtLBCGwBTa7Cgg+xfMFFK\nTDN952n8wC0PZcTkOLFmmBtMUNIAiEEAvvQ7/ubhj5wJjuBsOCRspcNb1itOSLfw\n6TpMdVm6J2D7R/tuIMdFqbGYCoMRGUffrtGulXEVliie8ChfTDwZp8d+R7ifqsB4\n+jDQ+GZFFMOpupmVVqvR5YXkALNOGHF9p93TvD4r4VOEaJzPd4cHF2QEHLFEVMvq\nzuV4cLdUp4aI+zlHGhydY7S3R82DQzmgsVqxWIgNM5LkuPhsUORZG/mjeTFAklJ+\nQpNkfznI4fod1GBROwIDAQABo3cwdTAOBgNVHQ8BAf8EBAMCAoQwHQYDVR0lBBYw\nFAYIKwYBBQUHAwIGCCsGAQUFBwMBMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYE\nFA+MOc/UM9dL+qo4Zv4U0ZyJU3A0MBQGA1UdEQQNMAuCCWxvY2FsaG9zdDANBgkq\nhkiG9w0BAQsFAAOCAQEAxDG5opOJt4Ux2PGOifa/unt2R44YAEyelIraWeHotouh\norsFnobC7VX0l0BKUQXYblmD0i1x/zG1npZw5WfsPDkQJIBhvzCep7pMroGLrrJO\nR8krOUb+tRGh8kSKuxHehTP4Pr9NplzBSTs7nCi557KPDSrDwnU7zxP4ZwisLqfq\nBUu+7bV/WrZPUntTG6a4O1IjvyQSxymemlX0OibqoFp8OJntHOheWN4D9/GU9dRG\n0YT8uya17ZBFWw/2Gtb0E1sOfWjl0s1cfRZORPjaOZ14dBTEBQWacATjlgpT+QrM\nL+4EEDFL4McSGzhPpO4T0UgybL7RIwnsOoCA8khZlw==\n-----END CERTIFICATE-----\n"]}

I'll investigate more tomorrow
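
The "Remote cert found" log line above reports a few fields of the certificate the server presents. A minimal Go sketch of that kind of diagnostic follows, assuming only the standard library: it connects without verifying the chain and summarizes the leaf certificate. This is not the elastic-agent implementation; `certInfo`, `inspect`, and `inspectLocal` are hypothetical names, and the throwaway `httptest` server stands in for a real Fleet Server endpoint.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// certInfo holds a few fields of the certificate a server presents.
type certInfo struct {
	Issuer     string
	CommonName string
	DNSNames   []string
	Expiry     string
}

// inspect connects to addr without verifying the chain (for diagnostics only,
// never for real traffic) and summarizes the leaf certificate it receives.
func inspect(addr string) (certInfo, error) {
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return certInfo{}, err
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return certInfo{}, errors.New("server presented no certificate")
	}
	leaf := certs[0]
	return certInfo{
		Issuer:     leaf.Issuer.String(),
		CommonName: leaf.Subject.CommonName,
		DNSNames:   leaf.DNSNames,
		Expiry:     leaf.NotAfter.Format("2006-January-02"),
	}, nil
}

// inspectLocal starts a throwaway TLS server on loopback and inspects it.
func inspectLocal() (certInfo, error) {
	srv := httptest.NewTLSServer(http.NotFoundHandler())
	defer srv.Close()
	return inspect(srv.Listener.Addr().String())
}

func main() {
	info, err := inspectLocal()
	if err != nil {
		fmt.Println("inspect failed:", err)
		return
	}
	fmt.Printf("%+v\n", info)
}
```

Pointing `inspect` at a real `host:port` shows what the server presents even when normal verification would abort the handshake, which helps separate "wrong certificate served" from "certificate rejected".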

@michel-laterman
Contributor

I can recreate the issue by updating the fleet-server beats import to v7.12.0 which was the next tagged release after v7.11.2.
I'm sure that the cert, key + ca that are passed are all valid.
I think it has something to do with the upgrade to Go v1.15 (the Beats PR that updated this is elastic/beats#22495).
More details on my current testing are in a gist

I think the makeVerifyServerConnection method in tlscommon might be responsible. I need to compare how our v8 releases create the TLS server config with how v7.12.0 does it.
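
To illustrate how a server-side verification callback can produce this class of client error, here is a self-contained Go sketch. It is not the tlscommon code and does not claim to be the root cause of this issue. `selfSigned`, `requireClientCert`, and `handshake` are hypothetical names. It relies on `tls.Config.VerifyConnection`, which was added to `crypto/tls` in Go 1.15. The client trusts the server certificate, yet the handshake fails because the server's callback rejects the connection; the client is then expected to see an error mentioning `bad certificate`.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"io"
	"math/big"
	"net"
	"time"
)

// selfSigned returns a throwaway self-signed certificate for "localhost".
func selfSigned() (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "localhost"},
		DNSNames:              []string{"localhost"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	leaf, err := x509.ParseCertificate(der)
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key, Leaf: leaf}, err
}

// requireClientCert is a hypothetical server-side check that rejects any
// connection on which the client presented no certificate.
func requireClientCert(cs tls.ConnectionState) error {
	if len(cs.PeerCertificates) == 0 {
		return errors.New("no client certificate presented")
	}
	return nil
}

// handshake starts a loopback TLS server that uses verify as its
// VerifyConnection callback and returns the error seen by the client.
func handshake(verify func(tls.ConnectionState) error) error {
	cert, err := selfSigned()
	if err != nil {
		return err
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	go func() {
		raw, err := ln.Accept()
		if err != nil {
			return
		}
		defer raw.Close()
		srv := tls.Server(raw, &tls.Config{Certificates: []tls.Certificate{cert}, VerifyConnection: verify})
		_ = srv.Handshake()
		// Wait for the client to hang up so the alert is not lost to a reset.
		_, _ = io.Copy(io.Discard, raw)
	}()

	roots := x509.NewCertPool()
	roots.AddCert(cert.Leaf)
	conn, err := tls.Dial("tcp", ln.Addr().String(), &tls.Config{
		RootCAs:    roots,
		ServerName: "localhost",
		// With TLS 1.2 a server-side rejection surfaces during the client handshake.
		MaxVersion: tls.VersionTLS12,
	})
	if err != nil {
		return err
	}
	conn.Close()
	return nil
}

func main() {
	fmt.Println("no callback:", handshake(nil))
	fmt.Println("server requires client cert:", handshake(requireClientCert))
}
```

The point of the sketch is that a `remote error` on the client says nothing about the client's own trust store: the same certificate, key, and CA can be valid while a change in how the server builds its TLS config makes the server abort the handshake.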

@maggieghamry

Confirmed: this has happened in ESS with both Fleet and APM. Downgrading only Fleet or APM to 7.17.19 works around the issue until 7.17.21 is released.

@simitt

simitt commented Apr 23, 2024

@cmacknz can this be handled with the highest priority? It is breaking all setups with self-signed certificates and also negatively impacts apm-servers managed by EA/Fleet Server.

@cmacknz
Member

cmacknz commented Apr 23, 2024

There is a PR to fix this now: #3473

Following up

@cmacknz
Member

cmacknz commented Apr 24, 2024

This appears to now be fixed in 7.17.21-SNAPSHOT

Screenshot 2024-04-24 at 11 21 21 AM

@michel-laterman
Contributor

Reopening so we can confirm the release will work.

@jsoriano
Member Author

Are there 7.17 snapshot containers to test this change with elastic-package?

@cmacknz
Member

cmacknz commented Apr 25, 2024

There should be 7.17.21-SNAPSHOT containers, I was able to create a cloud deployment for this version #3435 (comment)

@cmacknz
Member

cmacknz commented Apr 25, 2024

Closing this as fixed now. There will be an accelerated 7.17.21 release to fix this soon.

@cmacknz cmacknz closed this as completed Apr 25, 2024
@jsoriano
Member Author

> There should be 7.17.21-SNAPSHOT containers, I was able to create a cloud deployment for this version #3435 (comment)

Ah ok, I tried, but docker.elastic.co/elastic-agent/elastic-agent-complete:7.17.21-SNAPSHOT is not available, and 7.17-SNAPSHOT hasn't been updated in weeks.

I could confirm though that elastic-package stack works with docker.elastic.co/beats/elastic-agent:7.17.21-SNAPSHOT.

ELASTIC_AGENT_IMAGE_REF_OVERRIDE=docker.elastic.co/beats/elastic-agent:7.17.21-SNAPSHOT elastic-package stack up -v -d --version 7.17.21-SNAPSHOT

> There will be an accelerated 7.17.21 release to fix this soon.

Great, thanks!

@cmacknz
Member

cmacknz commented Apr 25, 2024

> Ah ok, I tried but docker.elastic.co/elastic-agent/elastic-agent-complete:7.17.21-SNAPSHOT is not available, and 7.17-SNAPSHOT hasn't been updated since weeks ago.

This seems wrong, possibly something has been lost in the Buildkite migration. Let me follow up with eng prod.

@cmacknz
Member

cmacknz commented Apr 25, 2024

docker pull docker.elastic.co/elastic-agent/elastic-agent:7.17.21-SNAPSHOT succeeds but docker.elastic.co/elastic-agent/elastic-agent-complete:7.17.21-SNAPSHOT doesn't exist as reported.

docker pull docker.elastic.co/elastic-agent/elastic-agent-complete:8.13.3-SNAPSHOT does exist, however, so something is missing in 7.17.


7 participants