Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.6.0 [master] - Nil job on allocation #2605

Closed
dadgar opened this issue May 2, 2017 · 5 comments
Closed

0.6.0 [master] - Nil job on allocation #2605

dadgar opened this issue May 2, 2017 · 5 comments

Comments

@dadgar
Copy link
Contributor

dadgar commented May 2, 2017

Somehow a nil job is being attached to a job which can cause various panics.

One such report:
I'm running from master, because I need the feature in #2535 which has not yet made it into a release.

Here's some logs that provide context of the allocation that got into this state. It was for the task group hdfs-namenode2. This container wouldn't start due to an issue with a docker volume driver. It appears that after it failed to start, nomad did not remove the container and tried to create it again, the subsequent create operations failing because the old container already existed.

May 02 11:30:52 dc0-cls02 nomad[15432]:     2017/05/02 11:30:52.104417 [ERR] driver.docker: Failed to create container: Post http://unix.sock/containers/create?name=hdfs-namenode2-399ea264-2d63-7b83-9484-be35b6dda4ad: EOF
May 02 11:30:52 dc0-cls02 nomad[15432]:     2017/05/02 11:30:52.107207 [WARN] client: failed to start task "hdfs-namenode2" for alloc "399ea264-2d63-7b83-9484-be35b6dda4ad": Failed to create container: Post http://unix.sock/containers/create?name=hdfs-namenode2-399ea264-2d63-7b83-9484-be35b6dda4ad: EOF
May 02 11:30:52 dc0-cls02 nomad[15432]:     2017/05/02 11:30:52.108120 [INFO] client: Restarting task "hdfs-namenode2" for alloc "399ea264-2d63-7b83-9484-be35b6dda4ad" in 15.146291303s
May 02 11:31:12 dc0-cls02 nomad[15432]:     2017/05/02 11:31:12.340198 [ERR] driver.docker: Failed to create container: container already exists
May 02 11:31:12 dc0-cls02 nomad[15432]:     2017/05/02 11:31:12.343062 [WARN] client: failed to start task "hdfs-namenode2" for alloc "399ea264-2d63-7b83-9484-be35b6dda4ad": Failed to create container: container already exists
May 02 11:31:12 dc0-cls02 nomad[15432]:     2017/05/02 11:31:12.343665 [INFO] client: Not restarting task: hdfs-namenode2 for alloc: 399ea264-2d63-7b83-9484-be35b6dda4ad
May 02 11:31:12 dc0-cls02 nomad[15432]:     2017/05/02 11:31:12.345574 [INFO] client: marking allocation 399ea264-2d63-7b83-9484-be35b6dda4ad for GC
May 02 11:31:12 dc0-cls02 nomad[15432]:     2017/05/02 11:31:12.348991 [INFO] client: marking allocation 399ea264-2d63-7b83-9484-be35b6dda4ad for GC
May 02 11:33:38 dc0-cls02 nomad[15432]:     2017/05/02 11:33:38 http: panic serving 10.1.6.184:34408: runtime error: invalid memory address or nil pointer dereference
May 02 11:33:38 dc0-cls02 nomad[15432]: goroutine 1037521 [running]:
May 02 11:33:38 dc0-cls02 nomad[15432]: net/http.(*conn).serve.func1(0xc4201d9900)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/http/server.go:1721 +0xd0
May 02 11:33:38 dc0-cls02 nomad[15432]: panic(0xfbb1e0, 0x18ac2b0)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/runtime/panic.go:489 +0x2cf
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/nomad/state.(*StateStore).AllocsByJob(0xc4222a8c80, 0x0, 0xc4207cd41c, 0x4, 0xc421909100, 0xc4222a8c60, 0xc4219091c0, 0xc4219091c0, 0x411cd8, 0x20)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/nomad/state/state_store.go:1164 +0x3ed
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/nomad.(*Job).Allocations.func1(0x0, 0xc4222a8c80, 0x0, 0x0)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/nomad/job_endpoint.go:546 +0x7c
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/nomad.(*Server).blockingRPC(0xc4200811e0, 0xc421909318, 0x0, 0x0)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/nomad/rpc.go:389 +0x161
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/nomad.(*Job).Allocations(0xc420460230, 0xc4239a42d0, 0xc4207cde30, 0x0, 0x0)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/nomad/job_endpoint.go:571 +0x279
May 02 11:33:38 dc0-cls02 nomad[15432]: reflect.Value.call(0xc42016c6c0, 0xc420460380, 0x13, 0x113ffbc, 0x4, 0xc421909798, 0x3, 0x3, 0x199, 0xc4207cde30, ...)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/reflect/value.go:434 +0x91f
May 02 11:33:38 dc0-cls02 nomad[15432]: reflect.Value.Call(0xc42016c6c0, 0xc420460380, 0x13, 0xc421f1d798, 0x3, 0x3, 0xc421883920, 0xc4204bd401, 0x0)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/reflect/value.go:302 +0xa4
May 02 11:33:38 dc0-cls02 nomad[15432]: net/rpc.(*service).call(0xc42045c9c0, 0xc42045c8c0, 0xc4228d0fe0, 0xc42046d180, 0xc421883920, 0xfd9ac0, 0xc4239a42d0, 0x16, 0xf08b60, 0xc4207cde30, ...)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/rpc/server.go:387 +0x144
May 02 11:33:38 dc0-cls02 nomad[15432]: net/rpc.(*Server).ServeRequest(0xc42045c8c0, 0x18cc0c0, 0xc42233b740, 0xc4207cdcb0, 0xc4204bd400)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/rpc/server.go:502 +0x267
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc4200811e0, 0x114e810, 0xf, 0xfd9ac0, 0xc4239a4280, 0xf08b60, 0xc4207cdd70, 0xc4239a4201, 0xc4207cdd70)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/nomad/server.go:1021 +0xc4
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/command/agent.(*Agent).RPC(0xc42046ec30, 0x114e810, 0xf, 0xfd9ac0, 0xc4239a4280, 0xf08b60, 0xc4207cdd70, 0xc42051b8c0, 0x10000c4205051d0)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/command/agent/agent.go:628 +0x7a
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).jobAllocations(0xc420849140, 0x18ca800, 0xc4220e17a0, 0xc421299000, 0xc4207cd41c, 0x4, 0xc42233f1e0, 0xc420ac96e0, 0xc420026600, 0x18)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/command/agent/job_endpoint.go:192 +0x2ed
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).JobSpecificRequest(0xc420849140, 0x18ca800, 0xc4220e17a0, 0xc421299000, 0xe3c583, 0xc421f1db20, 0xc421f1db18, 0xc420ac96e0)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/command/agent/job_endpoint.go:50 +0x2a2
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).JobSpecificRequest-fm(0x18ca800, 0xc4220e17a0, 0xc421299000, 0xc4222a8bc0, 0x18, 0xed09a9dd2, 0x10f0461)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/command/agent/http.go:150 +0x48
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).wrap.func1(0x18ca800, 0xc4220e17a0, 0xc421299000)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/command/agent/http.go:226 +0x12e
May 02 11:33:38 dc0-cls02 nomad[15432]: net/http.HandlerFunc.ServeHTTP(0xc420232360, 0x18ca800, 0xc4220e17a0, 0xc421299000)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/http/server.go:1942 +0x44
May 02 11:33:38 dc0-cls02 nomad[15432]: net/http.(*ServeMux).ServeHTTP(0xc42051b890, 0x18ca800, 0xc4220e17a0, 0xc421299000)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/http/server.go:2238 +0x130
May 02 11:33:38 dc0-cls02 nomad[15432]: github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler.GzipHandler.func1(0x18ca800, 0xc4220e17a0, 0xc421299000)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /home/clint/go/src/github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler/gzip.go:73 +0x2c8
May 02 11:33:38 dc0-cls02 nomad[15432]: net/http.HandlerFunc.ServeHTTP(0xc4202329a0, 0x18ca800, 0xc4220e17a0, 0xc421299000)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/http/server.go:1942 +0x44
May 02 11:33:38 dc0-cls02 nomad[15432]: net/http.serverHandler.ServeHTTP(0xc420498000, 0x18ca800, 0xc4220e17a0, 0xc421299000)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/http/server.go:2568 +0x92
May 02 11:33:38 dc0-cls02 nomad[15432]: net/http.(*conn).serve(0xc4201d9900, 0x18cbd00, 0xc42233b540)
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/http/server.go:1825 +0x612
May 02 11:33:38 dc0-cls02 nomad[15432]: created by net/http.(*Server).Serve
May 02 11:33:38 dc0-cls02 nomad[15432]:         /usr/lib/go/src/net/http/server.go:2668 +0x2ce
@clinta
Copy link
Contributor

clinta commented May 2, 2017

I am not running in dev mode, but I do have every node configured as a client and server in a 5 node cluster.

@dadgar
Copy link
Contributor Author

dadgar commented May 2, 2017

@clinta can you share the commit you are running

@clinta
Copy link
Contributor

clinta commented May 2, 2017

When we first encountered this issue I was running a build of 976f58b.

Since encountering this I tried running d1bb92e, which is v0.5.6 with #2535 cherry-picked. I continued to get this nil-pointer exception. But the exception was still due to the allocation that was created under 0.6. So at this point I don't know if v0.5.6 is capable of creating the broken allocation or not.

Right now I'm running 2000f61 which is a workaround for the nil pointer issue so that we can keep this cluster running.

Also I've looked at the state.json for the broken allocation and it appears to have the correct job data. I can share that privately if it would help.

@dadgar
Copy link
Contributor Author

dadgar commented May 3, 2017

Okay thanks for sharing! I believe it is all server side so the state.json won't be of too much help but thank you!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants