Builder fails constantly with panic in crypto/ssh #462
Comments
We are also seeing this in our long-running cluster, which happens to be: k8s version 1.4.7 on GKE |
Yes, the problem appears in 2.9. |
Thanks @Bregor, that is helpful. Two builder releases have shipped in the meantime (https://github.com/deis/builder/releases/tag/v2.6.0 and https://github.com/deis/builder/releases/tag/v2.6.1). None of the changes obviously explains this behavior, so I'm wondering whether the cause is something subtler, such as the Go toolchain bump that came with updating the docker-go-dev image in this project (3fb031a). |
No leads yet; going to build/deploy image locally with added debug logging to try to pinpoint... |
From what we're finding, the error is here. Perhaps this Go crypto/ssh issue is relevant, which would make sense given that the server's liveness check is just a basic TCP probe. EDIT: Then again, that fix was merged in June 2015, so it would always have been present, because Go 1.5 was released in August 2015. Still, there could be something here that leads to a bigger fish. |
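For readers unfamiliar with what a TCP-only liveness probe does to an SSH server, here is a minimal sketch. It is an illustration under assumptions, not the kubelet's or the builder's code: `tcpProbe`, `startListener`, and `probeTimeout` are hypothetical names, and the loopback listener merely stands in for the builder's SSH port. The point is that the probe connects and hangs up without ever sending an SSH version string, so the server sees a connection that never starts a handshake.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTimeout is an arbitrary value chosen for this sketch.
const probeTimeout = 2 * time.Second

// tcpProbe connects to addr and hangs up immediately, without sending
// an SSH version string. This mimics a TCP-only liveness check.
func tcpProbe(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, probeTimeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

// startListener opens a throwaway listener on a free loopback port,
// standing in for the builder's SSH port (an assumption of this sketch).
func startListener() (net.Listener, error) {
	return net.Listen("tcp", "127.0.0.1:0")
}

func main() {
	ln, err := startListener()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	if err := tcpProbe(ln.Addr().String()); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe succeeded without any SSH handshake")
}
```

Pointing `tcpProbe` at a locally running builder would be one way to check whether such half-open connections alone are enough to trigger the panic.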
I think I'm onto something here. In
I get the following panic:
So, wrong incantation, but at least we're in the right place to nail down a reproducible test case against the OP's logs. From the panic stack trace, it appears to happen when initializing a basic handshake without any valid config from the client. I think the DSAPrivateKey callout is a wild goose chase/default path. |
For what it's worth, I've been running a local vagrant cluster on k8s v1.5.1 and Workflow v2.10.0 for over 19 hours without any restarts:
Still trying to nail it down through a unit test though. |
@bacongobbler try to build something. |
Building works fine, failing or not. Tried that, but thanks for the suggestion! From @vdice's observations it seems to be something passive, because neither of us is actively killing the pod when it restarts. |
Correct, thank you @bacongobbler; I should have mentioned that before. My GKE cluster (k8s 1.4.7) has the latest Workflow chart installed and is completely stock. Even without any interaction with Deis/Workflow (I have not even registered), the builder pod restarts, usually multiple times an hour. Here's the number currently:
|
We are having the same issue on a cluster running on AWS (kube-aws): Deis Workflow 2.9.1 |
Maybe this could be useful: Events:
Pods:
Controller log:
Could these 404s, or the overall deployment timeout, be the reason the builder's liveness check fails? |
Some additional info:
Controller fails from time to time. |
Just after builder restart:
|
Controller logs are empty at the moment, so this particular behaviour is just builder's |
Controller issue is here: deis/controller#1204 |
Another interesting detail: Over the weekend, I checked out the v2.5.5 tag of builder, which is the version released with Workflow v2.8.0. These versions were believed to be the last ones not hitting this issue. With the builder deployment updated to use this image, the issue does still occur, though seemingly more rarely:
Error:
|
+1 on this issue. Also witnessing the same behavior on k8s 1.4.7 with builder image quay.io/deis/builder:v2.6.1 |
@gabrtv pointed me to this commit, made only 15 hours ago: golang/crypto@b822463
EDIT: golang.org/x/crypto is maintained separately from Go stdlib. myb |
If anyone's willing to try a fix, #464 looks promising. |
re-opening as #464 didn't resolve this. |
I believe we encounter the same issue: Environment:
Fail ratio: Log:
|
1) Always force a key exchange if we exchange 2^31 packets. In the past this might not happen if RekeyThreshold was set to a very large interval.
2) Follow recommendations from RFC 4344 for block ciphers. For AES, we can encrypt 2^(blocksize/4) blocks under the same keys.

On modern hardware, the previous default of 1Gb could force a key exchange within ~10 seconds. Since the key exchange takes 3 roundtrips (send kex init, send DH init, send NEW_KEYS), this is relatively expensive on high-latency links.

Change-Id: I1297124a307c541b7bf22d814d136ec0c6d8ed97
Reviewed-on: https://go-review.googlesource.com/35410
Run-TryBot: Han-Wen Nienhuys
TryBot-Result: Gobot Gobot
Reviewed-by: Adam Langley
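To put the RFC 4344 figure from the commit message above in concrete terms, here is a small worked calculation. It is only arithmetic on the formula quoted there (2^(blocksize/4) blocks, with the block size in bits), not code from golang.org/x/crypto; the function names are hypothetical.

```go
package main

import "fmt"

// maxBlocksPerKey returns 2^(L/4), the bound quoted above on the number
// of cipher blocks encrypted under one key, where L is the block size in
// bits. Only meaningful while L/4 < 64, which holds for AES (L = 128).
func maxBlocksPerKey(blockSizeBits uint) uint64 {
	return uint64(1) << (blockSizeBits / 4)
}

// maxBytesPerKey converts that block count into bytes.
func maxBytesPerKey(blockSizeBits uint) uint64 {
	return maxBlocksPerKey(blockSizeBits) * uint64(blockSizeBits/8)
}

func main() {
	const aesBlockBits = 128
	fmt.Println(maxBlocksPerKey(aesBlockBits))      // 4294967296 (2^32 blocks)
	fmt.Println(maxBytesPerKey(aesBlockBits) >> 30) // 64 (GiB)
}
```

For AES that works out to 2^32 blocks of 16 bytes, or 64 GiB per key, which is far above the previous 1Gb default mentioned in the commit message.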
I'm also seeing the same on Workflow-2.11.0 / K8s v1.5.2+coreos.1 / Azure.
|
Alas, it doesn't appear that any golang.org/x/crypto bumps have completely ameliorated this issue; bumping to next milestone. |
Can reproduce as well; a huge problem for us at the moment 😞 |
Closed by #493. Included in the Builder v2.9.0 release. You can update an existing install/helm release via
Note you may want to add Or just patch the builder deploy via
|
I think you also need to upgrade to dockerbuilder v2.7.1 or things will break. |
The |
Btw, I just did some debugging in golang.org/x/crypto/ssh, and it appears the above-mentioned bug occurs with 2048-bit DSA keys, whereas the common key size for DSA is 1024 bits. The error seems to be that the code hardcodes a subgroup size of 160 bits (20 bytes), but 2048-bit DSA uses either 224-bit or 256-bit subgroups (my test key, generated with openssl, had a subgroup size of 256 bits / 32 bytes). I'll open an upstream issue and reference it back here. Update: Looks like @vdice already reported the problem in golang/go#19424.
Update: It appears the root of the problem stems from the |
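The subgroup-size mismatch described above can be illustrated with a short sketch. This is not the golang.org/x/crypto/ssh source; `packFixed` and `valueWithBits` are hypothetical helpers that only show why values sized for a 256-bit subgroup cannot be packed into fields sized for a 160-bit one.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
)

// valueWithBits returns 2^(n-1), the smallest integer that needs exactly n bits.
func valueWithBits(n uint) *big.Int {
	return new(big.Int).Lsh(big.NewInt(1), n-1)
}

// packFixed left-pads r and s to width bytes each and concatenates them.
// It returns an error when a value is too wide, instead of slicing out of range.
func packFixed(r, s *big.Int, width int) ([]byte, error) {
	rb, sb := r.Bytes(), s.Bytes()
	if len(rb) > width || len(sb) > width {
		return nil, errors.New("value does not fit in fixed-width field")
	}
	out := make([]byte, 2*width)
	copy(out[width-len(rb):width], rb)
	copy(out[2*width-len(sb):], sb)
	return out, nil
}

func main() {
	const width = 20 // bytes per value for a 160-bit subgroup

	_, err := packFixed(valueWithBits(160), valueWithBits(160), width)
	fmt.Println("160-bit values:", err)

	_, err = packFixed(valueWithBits(256), valueWithBits(256), width)
	fmt.Println("256-bit values:", err)
}
```

Without the length guard, the expression `out[width-len(rb):width]` would have a negative lower bound for a 32-byte value, and Go panics at run time on such a slice. That is consistent with a panic appearing only with larger DSA keys, though the exact upstream code path is an assumption here.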
We completely removed DSA keys from the builder some time ago as they're being discontinued, so this effectively has been fixed. |
@bacongobbler sure, this should be closed. Thanks for reminding. |