Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Builder fails constantly with panic in crypto/ssh #462

Closed
Bregor opened this issue Jan 11, 2017 · 31 comments · Fixed by #464
Closed

Builder fails constantly with panic in crypto/ssh #462

Bregor opened this issue Jan 11, 2017 · 31 comments · Fixed by #464
Assignees
Milestone

Comments

@Bregor
Copy link
Contributor

Bregor commented Jan 11, 2017

Environment:

  • k8s-1.5.1
  • deis-workflow-2.10.0
  • helm-2.1.3
  • bare metal servers

Fail ratio:

$ kubectl get po -n deis
NAME                                     READY     STATUS    RESTARTS   AGE
deis-builder-3815239197-sgwpb            1/1       Running   19         11h

Log:

$ kubectl logs -n deis -p deis-builder-3815239197-sgwpb
2017/01/11 18:05:49 Starting health check server on port 8092
2017/01/11 18:05:49 Starting deleted app cleaner
2017/01/11 18:05:49 Starting SSH server on 0.0.0.0:2223
Listening on 0.0.0.0:2223
Accepting new connections.
Accepted connection.
---> ---> ---> panic: runtime error: slice bounds out of range

goroutine 884 [running]:
panic(0x1170b00, 0xc4200180c0)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420182148, 0x1bbf2e0, 0xc42007c5a0, 0xc4204da740, 0x14, 0x20, 0x0, 0x780, 0xc4204da740)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc8860, 0xc420182148, 0x1bbf2e0, 0xc42007c5a0, 0xc4204da740, 0x14, 0x20, 0x11030a0, 0x10caae3, 0x11a4180, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc420118b50, 0x7f6c89ca8440, 0xc42041ab40, 0x1bbf2e0, 0xc42007c5a0, 0xc42058a000, 0x1bc8860, 0xc420182148, 0x40def4, 0x12f8db4, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc4205be000, 0x1bc8820, 0xc420118b50, 0xc42031a000, 0xc42058a000, 0x1, 0x0, 0xe3)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc4205be000, 0xc4201080e0, 0xd9, 0xd9, 0x0, 0x0)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc4205be000, 0x0, 0x0, 0x4341b8, 0x137e0b8, 0x0)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc4205be000)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216
@vdice vdice added this to the v2.11 milestone Jan 11, 2017
@vdice
Copy link
Member

vdice commented Jan 11, 2017

We are also seeing this in our long-running cluster, which happens to be:

k8s version 1.4.7 on GKE
Workflow v2.9.1
Helm 2.1.3

@Bregor
Copy link
Contributor Author

Bregor commented Jan 11, 2017

Yes, the problem appears in 2.9.
2.8 was not affected.

@vdice
Copy link
Member

vdice commented Jan 11, 2017

Thanks @Bregor that is helpful... So there are 2 builder releases in the meantime: https://github.com/deis/builder/releases/tag/v2.6.0 and https://github.com/deis/builder/releases/tag/v2.6.1

There aren't any changes that look obvious towards this behavior so wondering if the issue is something a bit more subtle such as the bump of the golang toolchain when we updated the docker-go-dev image in this project (3fb031a)

@vdice
Copy link
Member

vdice commented Jan 12, 2017

No leads yet; going to build/deploy image locally with added debug logging to try to pinpoint...

@bacongobbler
Copy link
Member

bacongobbler commented Jan 12, 2017

From what we're finding, the error is here.

Perhaps this Go crypto/ssh issue is relevant, which would sound fair because of the basic liveness check performed by the server, which is just a basic TCP probe.

EDIT: Then again, that fix was merged in June 2015 so that fix would've always been there because Go 1.5 was released in Aug 2015. Still, there could be something here that leads to a bigger fish.

@bacongobbler
Copy link
Member

bacongobbler commented Jan 12, 2017

I think I'm onto something here. In server_test.go, if I add this unit test that runs the server with no SSH config to intentionally cause it to panic and run make test:

func TestReceiveWithNoConfig(t *testing.T) {
	const testingServerAddr = "127.0.0.1:2246"

	c := NewCircuit()
	pushLock := NewInMemoryRepositoryLock(0)
	// NOTE(bacongobbler): notice that we're not initializing the server with any SSH config.
	runServer(nil, c, pushLock, testingServerAddr, time.Duration(0), t)

	// Give server time to initialize.
	time.Sleep(200 * time.Millisecond)
	assert.Equal(t, c.State(), ClosedState, "circuit state")

	// Connect to the server and issue env var set. This should return true.
	_, err := ssh.Dial("tcp", testingServerAddr, clientConfig())
	if err != nil {
		t.Fatalf("Failed to connect client to local server: %s", err)
	}
}

I get the following panic:

---> Listening on 127.0.0.1:2246
---> Accepting new connections.
---> Accepted connection.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5e3747]

goroutine 1091 [running]:
panic(0x8a3860, 0xc42000c0a0)
        /usr/local/go/src/runtime/panic.go:500 +0x1ae
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.NewServerConn(0xaf3020, 0xc420484018, 0x0, 0x0, 0x0, 0x1, 0xc420280778, 0x4057da)

So wrong incantation, but at least we're in the right location to nailing down a reproducible test case against the OP's logs. From the panic stacktrace, it looks to be when initializing a basic handshake without having any valid config from the client. I think the DSAPrivateKey callout is a wild goose chase/default path.

@bacongobbler
Copy link
Member

For what it's worth, I've been running a local vagrant cluster on k8s v1.5.1 and Workflow v2.10.0 for over 19 hours without any restarts:

><> kd get po | grep builder
deis-builder-3170955059-k084s            1/1       Running   0          19h

Still trying to nail it down through a unit test though.

@Bregor
Copy link
Contributor Author

Bregor commented Jan 13, 2017

@bacongobbler try to build something.
Especially something failing to build/start.

@bacongobbler
Copy link
Member

Building works fine, failing or not. Tried that, but thanks for the suggestion!

From @vdice's observations it seems to be something passive, because neither of us are actively killing the pod when it reboots.

@vdice
Copy link
Member

vdice commented Jan 13, 2017

From @vdice's observations it seems to be something passive, because neither of us are actively killing the pod when it reboots.

Correct, thank you @bacongobbler ; I should have mentioned that before. My GKE cluster (k8s 1.4.7) has the latest Workflow chart installed and is completely stock. Without even interacting with Deis/Workflow, (have not even registered) the builder pod restarts occur, usually multiple times an hour. Here's the number currently:

NAME                                     READY     STATUS    RESTARTS   AGE
deis-builder-2492834593-bh17g            1/1       Running   6          4h

@joelso
Copy link

joelso commented Jan 14, 2017

We are having the same issue on cluster running on AWS (kube-aws):

Deis Workflow 2.9.1
Kube 1.4.6
Helm 2.1.0

@Bregor
Copy link
Contributor Author

Bregor commented Jan 16, 2017

Maybe this could be useful:

Events:

    deis      2017-01-16 12:42:50 +0300 MSK   2017-01-12 09:25:20 +0300 MSK   30        deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Warning   Unhealthy   {kubelet 10.83.8.200}   Liveness probe failed: Get http://192.168.0.12:8092/healthz: EOF
    deis      2017-01-16 12:42:51 +0300 MSK   2017-01-12 08:01:16 +0300 MSK   142       deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Normal    Pulled    {kubelet 10.83.8.200}   Container image "quay.io/deis/builder:v2.6.1" already present on machine
    deis      2017-01-16 12:42:52 +0300 MSK   2017-01-16 12:42:52 +0300 MSK   1         deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Normal    Created   {kubelet 10.83.8.200}   Created container with docker id 2a3c05004c81; Security:[seccomp=unconfined]
    deis      2017-01-16 12:42:53 +0300 MSK   2017-01-16 12:42:53 +0300 MSK   1         deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Normal    Started   {kubelet 10.83.8.200}   Started container with docker id 2a3c05004c81

Pods:

    deis      deis-builder-762421820-k5ht5   0/1       Error     140      4d        192.168.0.12   10.83.8.200
    deis      deis-builder-762421820-k5ht5   0/1       Running   141       4d        192.168.0.12   10.83.8.200
    deis      deis-builder-762421820-k5ht5   1/1       Running   141       4d        192.168.0.12   10.83.8.200

Controller log:

    INFO [preview]: build preview-ebaf46d created
    INFO [preview]: gazay deployed d94cc6b
    INFO [preview]: This deployments overall timeout is 120s - batch timout is 120s and there are 1 batches to deploy with a total of 2 pods
    192.168.0.12 "POST /v2/hooks/build/ HTTP/1.1" 200 25 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/04:9b:f5:de:de:27:10:5a:b8:8d:ab:79:cd:17:e8:57 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/27:c6:ad:f9:a6:4d:22:3f:18:b0:3b:df:81:1c:57:45 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/dd:3b:b8:2e:85:04:06:e9:ab:ff:a8:0a:c0:04:6e:d6 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/71:3a:b0:18:e2:6c:41:18:4e:56:1e:fd:d2:49:97:66 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/b0:90:e5:dd:bf:1c:91:c0:83:dc:09:3a:53:7c:5d:07 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/22:c8:a9:c3:01:a0:17:31:a5:43:f2:70:4a:1c:55:f6 HTTP/1.1" 404 33 "deis-builder"

May this 404s or deployment overall timeout be a purpose for failing builder's liveness-check?

@Bregor
Copy link
Contributor Author

Bregor commented Jan 17, 2017

Some addition info:

2017-01-17T11:11:58.616956833Z ERROR [ebay-mag-2]: (app::deploy): There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.617114079Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.617142643Z ERROR [ebay-mag-2]: (app::deploy): (app::deploy): There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.617157129Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.697508351Z ERROR:root:(app::deploy): (app::deploy): There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.697593251Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.697608908Z Traceback (most recent call last):
2017-01-17T11:11:58.697621238Z   File "/app/scheduler/__init__.py", line 270, in deploy
2017-01-17T11:11:58.697633205Z     namespace, name, image, entrypoint, command, **kwargs
2017-01-17T11:11:58.697644665Z   File "/app/scheduler/resources/deployment.py", line 138, in update
2017-01-17T11:11:58.697656298Z     self.wait_until_ready(namespace, name, **kwargs)
2017-01-17T11:11:58.697667225Z   File "/app/scheduler/resources/deployment.py", line 353, in wait_until_ready
2017-01-17T11:11:58.697684935Z     self.pod._handle_not_ready_pods(namespace, labels)
2017-01-17T11:11:58.697765951Z   File "/app/scheduler/resources/pod.py", line 720, in _handle_not_ready_pods
2017-01-17T11:11:58.697781435Z     raise KubeException(message)
2017-01-17T11:11:58.697795505Z scheduler.exceptions.KubeException: Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.697808171Z
2017-01-17T11:11:58.697818315Z The above exception was the direct cause of the following exception:
2017-01-17T11:11:58.697828868Z
2017-01-17T11:11:58.697841211Z Traceback (most recent call last):
2017-01-17T11:11:58.697860075Z   File "/app/api/models/app.py", line 493, in deploy
2017-01-17T11:11:58.697880775Z     async_run(tasks)
2017-01-17T11:11:58.697901071Z   File "/app/api/utils.py", line 169, in async_run
2017-01-17T11:11:58.697927378Z     raise error
2017-01-17T11:11:58.697939798Z   File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
2017-01-17T11:11:58.697951311Z     result = coro.throw(exc)
2017-01-17T11:11:58.697961688Z   File "/app/api/utils.py", line 182, in async_task
2017-01-17T11:11:58.697972558Z     yield from loop.run_in_executor(None, params)
2017-01-17T11:11:58.697983155Z   File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
2017-01-17T11:11:58.697993875Z     yield self  # This tells Task to wait for completion.
2017-01-17T11:11:58.698005745Z   File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
2017-01-17T11:11:58.698026335Z     future.result()
2017-01-17T11:11:58.698046131Z   File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
2017-01-17T11:11:58.698064605Z     raise self._exception
2017-01-17T11:11:58.698075908Z   File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
2017-01-17T11:11:58.698086948Z     result = self.fn(*self.args, **self.kwargs)
2017-01-17T11:11:58.698097535Z   File "/app/scheduler/__init__.py", line 276, in deploy
2017-01-17T11:11:58.698108305Z     ) from e
2017-01-17T11:11:58.698118831Z scheduler.exceptions.KubeException: There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.698129888Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.698141051Z
2017-01-17T11:11:58.698150915Z The above exception was the direct cause of the following exception:
2017-01-17T11:11:58.698161851Z
2017-01-17T11:11:58.698172055Z Traceback (most recent call last):
2017-01-17T11:11:58.698182398Z   File "/app/api/models/app.py", line 502, in deploy
2017-01-17T11:11:58.698195955Z     self.deploy(release.previous(), force_deploy=True, rollback_on_failure=False)
2017-01-17T11:11:58.698235185Z   File "/app/api/models/app.py", line 512, in deploy
2017-01-17T11:11:58.698248875Z     raise ServiceUnavailable(err) from e

Controller fails from time to time.
Builder's liveness probe is controller-related.

@Bregor
Copy link
Contributor Author

Bregor commented Jan 17, 2017

Just after builder restart:

$ git push deis master


2017-01-17T14:20:03.432674195Z Accepted connection.
2017-01-17T14:20:03.944641435Z Accepted connection.
2017-01-17T14:20:04.363952863Z ---> ---> ---> ---> panic: runtime error: slice bounds out of range
2017-01-17T14:20:04.364017975Z
2017-01-17T14:20:04.364028347Z goroutine 259 [running]:
2017-01-17T14:20:04.364035784Z panic(0x1170b00, 0xc4200180c0)
2017-01-17T14:20:04.364044873Z 	/usr/local/go/src/runtime/panic.go:500 +0x1a1
2017-01-17T14:20:04.364053685Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420358108, 0x1bbf2e0, 0xc4200155c0, 0xc42012da40, 0x14, 0x20, 0x0, 0x780, 0xc42012da40)
2017-01-17T14:20:04.364062346Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
2017-01-17T14:20:04.364070577Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc8860, 0xc420358108, 0x1bbf2e0, 0xc4200155c0, 0xc42012da40, 0x14, 0x20, 0x11030a0, 0x10caae3, 0x11a4180, ...)
2017-01-17T14:20:04.364077775Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
2017-01-17T14:20:04.364086132Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc420106c20, 0x7f0b30777cd8, 0xc420366ea0, 0x1bbf2e0, 0xc4200155c0, 0xc4205b8720, 0x1bc8860, 0xc420358108, 0x40def4, 0x12f8db4, ...)
2017-01-17T14:20:04.364093352Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
2017-01-17T14:20:04.364100217Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc420612c30, 0x1bc8820, 0xc420106c20, 0xc42016a600, 0xc4205b8720, 0x1, 0x0, 0xe3)
2017-01-17T14:20:04.364107165Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
2017-01-17T14:20:04.364115152Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc420612c30, 0xc4202d0000, 0xd9, 0xd9, 0x0, 0x0)
2017-01-17T14:20:04.364122009Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
2017-01-17T14:20:04.364128869Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc420612c30, 0x0, 0x0, 0x0, 0x0, 0x45c4c1)
2017-01-17T14:20:04.364135778Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
2017-01-17T14:20:04.364142772Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc420612c30)
2017-01-17T14:20:04.364149315Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
2017-01-17T14:20:04.364187652Z created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
2017-01-17T14:20:04.364196565Z 	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216

@Bregor
Copy link
Contributor Author

Bregor commented Jan 17, 2017

Controller logs are empty at the moment, so this particular behaviour is just builder's

@Bregor
Copy link
Contributor Author

Bregor commented Jan 17, 2017

Controller issue is here: deis/controller#1204

@vdice
Copy link
Member

vdice commented Jan 17, 2017

Another interesting detail: Over the weekend, I checked out the v2.5.5 tag of builder, which is the version released with Workflow v2.8.0. These versions are believed to be the last not hitting this issue. With the builder deployment updated to use this image, there indeed are instances of this issue, though seemingly more rare:

deis-builder-1590797075-a5w9f            1/1       Running   33         3d

Error:

$ kd logs -p deis-builder-1590797075-a5w9f
2017/01/17 12:20:21 Running in debug mode
2017/01/17 12:20:21 Starting health check server on port 8092
2017/01/17 12:20:21 Starting deleted app cleaner
2017/01/17 12:20:21 Starting SSH server on 0.0.0.0:2223
Parsed host key /var/run/secrets/deis/builder/ssh/ssh-host-rsa-key.
Parsed host key /var/run/secrets/deis/builder/ssh/ssh-host-dsa-key.
Parsed host key /var/run/secrets/deis/builder/ssh/ssh-host-ecdsa-key.
Listening on 0.0.0.0:2223
Accepting new connections.
Accepted connection.
[DEBUG] [DEBUG] [DEBUG] ---> ---> ---> [ERROR] Failed handshake: ssh: invalid packet length, packet too large
Accepted connection.
---> panic: runtime error: slice bounds out of range

goroutine 12604 [running]:
panic(0x1170100, 0xc4200160c0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420026a98, 0x1bbe4e0, 0xc4200137a0, 0xc420130b60, 0x14, 0x20, 0x0, 0x780, 0xc420130b60)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc7aa0, 0xc420026a98, 0x1bbe4e0, 0xc4200137a0, 0xc420130b60, 0x14, 0x20, 0x1102640, 0x10ca083, 0x11a3800, ...)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc4200ecc40, 0x7f4d821ee108, 0xc4204659e0, 0x1bbe4e0, 0xc4200137a0, 0xc420122420, 0x1bc7aa0, 0xc420026a98, 0x40def4, 0x12f84a8, ...)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc4201174a0, 0x1bc7a60, 0xc4200ecc40, 0xc420118700, 0xc420122420, 0x1, 0x0, 0xe3)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc4201174a0, 0xc4200dc1c0, 0xd9, 0xd9, 0x0, 0x0)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc4201174a0, 0x0, 0x0, 0xc42048c5a0, 0xc42001f7a8, 0x95b892)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc4201174a0)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216

@lachie83
Copy link
Member

lachie83 commented Jan 18, 2017

+1 on this issue. Also witnessing the same behavior on k8s 1.4.7 builder image -- quay.io/deis/builder:v2.6.1

@bacongobbler
Copy link
Member

bacongobbler commented Jan 18, 2017

@gabrtv pointed me to this commit, made only 15 hours ago: golang/crypto@b822463

I'll see if compiling builder with go-tip fixes this. If so, then we'll have to find a way to apply this patch to Go 1.7.4 in deis/docker-go-dev.

EDIT: golang.org/x/crypto is maintained separately from Go stdlib. myb

@bacongobbler
Copy link
Member

bacongobbler commented Jan 18, 2017

If anyone's willing to try a fix, #464 looks promising.

@bacongobbler
Copy link
Member

re-opening as #464 didn't resolve this.

@stuszynski
Copy link

stuszynski commented Feb 2, 2017

I believe we encounter the same issue:

Environment:

  • Kubernetes - v1.5.1
  • deis-workflow-2.10.0
  • helm-2.1.3
  • AWS EC2

Fail ratio:
8 times in last 24 hour, ~20 times a day avg from last 7days.

Log:

Accepted connection.
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> panic: runtime error: slice bounds out of range

goroutine 2651 [running]:
panic(0x1170b00, 0xc4200180c0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc42015c350, 0x1bbf2e0, 0xc420015740, 0xc420393300, 0x14, 0x20, 0x0, 0x780, 0xc420393300)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc8860, 0xc42015c350, 0x1bbf2e0, 0xc420015740, 0xc420393300, 0x14, 0x20, 0xc42044d8eb, 0x12e44b7, 0x8, ...)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc4200f8c20, 0x7f18f3edd968, 0xc4202146c0, 0x1bbf2e0, 0xc420015740, 0xc4203d9080, 0x1bc8860, 0xc42015c350, 0x40def4, 0x12f8db4, ...)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc42068a000, 0x1bc8820, 0xc4200f8c20, 0xc4201edb80, 0xc4203d9080, 0x1, 0x0, 0xe3)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc42068a000, 0xc4204a01c0, 0xd9, 0xd9, 0x0, 0x0)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc42068a000, 0x0, 0x0, 0xc4203d8240, 0x7096dd, 0x180001)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc42068a000)
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
	/go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216

mboersma referenced this issue in golang/crypto Feb 2, 2017
1) Always force a key exchange if we exchange 2^31 packets. In the past
this might not happen if RekeyThreshold was set to a very large
interval.

2) Follow recommendations from RFC 4344 for block ciphers. For AES, we
can encrypt 2^(blocksize/4) blocks under the same keys.

On modern hardware, the previous default of 1Gb could force a key
exchange within ~10 seconds. Since the key exchange takes 3 roundtrips
(send kex init, send DH init, send NEW_KEYS), this is relatively
expensive on high-latency links.

Change-Id: I1297124a307c541b7bf22d814d136ec0c6d8ed97
Reviewed-on: https://go-review.googlesource.com/35410
Run-TryBot: Han-Wen Nienhuys <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Adam Langley <[email protected]>
@monaka
Copy link
Contributor

monaka commented Feb 8, 2017

I also got same on Workflow-2.11.0 / K8s v1.5.2+coreos.1 / Azure.
Reproduced randomly.

Accepted connection.
Starting ssh authentication
Channel type: session

Received request of type [email protected]

Key='LC_ALL', Value=''

receiving git repo name: eclipse-che.git, operation: git-receive-pack, fingerprint: 54:0d:c3:3d:43:46:d8:1d:db:0e:ff:f5:02:1c:36:10, user: monaka
creating repo directory /home/git/eclipse-che.git
writing pre-receive hook under /home/git/eclipse-che.git
git-shell -c git-receive-pack 'eclipse-che.git'
Waiting for git-receive to run.
Waiting for deploy.
Accepted connection.
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> panic: runtime error: slice bounds out of range

goroutine 5062 [running]:
panic(0x11ac4a0, 0xc4200180c0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420316470, 0x1c22840, 0xc4200156e0, 0xc420708e80, 0x14, 0x20, 0x0, 0x7c0, 0xc420708e80)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:457 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1c2c000, 0xc420316470, 0x1c22840, 0xc4200156e0, 0xc420708e80, 0x14, 0x20, 0xc4205622eb, 0x132695d, 0x8, ...)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:159 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc4200d1720, 0x7f1064176b80, 0xc42037c200, 0x1c22840, 0xc4200156e0, 0xc4201704e0, 0x1c2c000, 0xc420316470, 0x40df34, 0x133b4ad, ...)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:188 +0x511
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc420540a00, 0x1c2bfc0, 0xc4200d1720, 0xc42009c380, 0xc4201704e0, 0x1, 0x0, 0x2)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:537 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc420540a00, 0xc4204c41c0, 0xd9, 0xd9, 0xc4203e3e01, 0xc4206e77a0)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:502 +0x37d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).kexLoop(0xc420540a00)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:271 +0x1e9
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:121 +0x227

@vdice
Copy link
Member

vdice commented Mar 1, 2017

Alas, it doesn't appear that any golang.org/x/crypto bumps have completely ameliorated this issue; bumping to next milestone.

@yaroslav
Copy link

yaroslav commented Mar 3, 2017

Can reproduce as well; a huge problem for us at the moment 😞

@vdice
Copy link
Member

vdice commented Mar 13, 2017

Closed by #493

Included in Builder v2.9.0 release. You can update an existing install/helm release via

helm upgrade <release> workflow/workflow --set builder.docker_tag=v2.9.0 (-f values.yaml)

Note you may want to add --version <particular Workflow version> if necessary.

Or just patch the builder deploy via

kubectl -n deis set image deployment deis-builder deis-builder=quay.io/deis/builder:v2.9.0

@felixbuenemann
Copy link
Contributor

I think you also need to upgrade to dockerbuilder v2.7.1 or things will break.

@bacongobbler
Copy link
Member

The helm upgrade assumes v2.12.0 which has the latest version of dockerbuilder so yes.

@felixbuenemann
Copy link
Contributor

felixbuenemann commented Jun 17, 2017

Btw. I just did some debugging in golang.org/x/crypto/ssh and it appears the above mentioned bug occurs with 2048 Bit DSA keys, while the common key size for DSA is 1024 Bit. The error seems to be that the code hardcodes a subgroup size of 160 Bit (20 Bytes), but DSA 2048 Bit uses either 224 Bit or 256 Bit subgroups (my test key generated with openssl had a subgroup size of 256 Bit / 32 Byte).

I'll open an upstream issue and reference it back here.

Update: Looks like @vdice already reported the problem in golang/go#19424.

Based on the above it's safe to assume that the crashes happened whenever someone with a DSA key of 2048 Bit or more connected to the builder (which could also be bots/scanners and not actual users).

Update: It appears the root of the problem stems from the genPrivateKey template function of sprig which is used in the deis builder helm templates to generate the host keys and uses dsa.L2048N256 to generate a 2048 Bit DSA key with a 256 Bit subgroup instead of dsa.L1024N160 for a 1024 Bit key with a 160 Bit subgroup, which ist the only key size allowed by ssh-keygen -t dsa and the openssh server.

@bacongobbler
Copy link
Member

We completely removed DSA keys from the builder some time ago as they're being discontinued, so this effectively has been fixed.

@Bregor
Copy link
Contributor Author

Bregor commented Jun 18, 2017

@bacongobbler sure, this should be closed. Thanks for reminding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants