Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worker tries to execute the workflow even if pulling image from registry fails #382

Closed
invidian opened this issue Dec 10, 2020 · 2 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@invidian
Copy link
Contributor

Expected Behaviour

Workflow execution should stop if image can be pulled.

Current Behaviour

Even if pulling image from registry fails, worker still tries to execute the workflow and then complains that the image cannot be found.

Possible Solution

Make sure pulling errors are handled properly.

Steps to Reproduce (for bugs)

  1. Create a worker machine with 4GB of RAM to make tmpfs in OSIE of size 2GB.
  2. Create a workflow with Docker image bigger than 1GB (e.g. ubuntu with qemu and libguestfs-tools packages installed).
  3. Try running a workflow.
@thebsdbox thebsdbox added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Dec 10, 2020
@Cbkhare Cbkhare self-assigned this Feb 11, 2021
@Cbkhare
Copy link
Contributor

Cbkhare commented Mar 3, 2021

hey @invidian , nice catch!

I tried reproducing the issue and below are my findings.

  • The error which is visible to us is actually not the root cause. Workflow execution has failed way before while pulling the image. It can be seen in the tink-worker logs with message no space left on the device.
  • What we are observing is that the exit should have been before creating the container, which didn't really happened.
  • Reason for the problem here is that io.copy never really received the syscall of failure and eventually didn't returned the same.
  • And lately, our workflow execution went ahead for creating the container and failed.

I tried to pinpoint the error by debugging it with delve. What seems to cause the failure is during the Copy. We can notice in the below logs after docker interrupt, Read returned EOF which for copy is graceful exit instead of an error. Usually, io.Copy returns the error when failed for no space left on the device but with docker pull I am expecting the interrupt is not received by the go compiler.

{"errorDetail":{"message":"failed to register layer: Error processing tar file(exit status 1): write /root/test123/cdb5015ff36390a47e559aafbbc8f0713264b8815717f21c5db8026a0656f818/layer.tar: no space left on device"},"error":"failed to register layer: Error processing tar file(exit status 1): write /root/test123/cdb5015ff36390a47e559aafbbc8f0713264b8815717f21c5db8026a0656f818/layer.tar: no space left on device"}
> os.(*File).Write() /usr/local/go/src/os/file.go:175 (PC: 0x51bb0b)
   170:	func (f *File) Write(b []byte) (n int, err error) {
   171:		if err := f.checkValid("write"); err != nil {
   172:			return 0, err
   173:		}
   174:		n, e := f.write(b)
=> 175:		if n < 0 {
   176:			n = 0
   177:		}
   178:		if n != len(b) {
   179:			err = io.ErrShortWrite
   180:		}
(dlv) print n
990
(dlv) print e
error nil
(dlv) print len(b) 
990
> os.(*File).Write() /usr/local/go/src/os/file.go:188 (PC: 0x51bbf4)
   183:	
   184:		if e != nil {
   185:			err = f.wrapErr("write", e)
   186:		}
   187:	
=> 188:		return n, err
   189:	}
   190:	
   191:	var errWriteAtInAppendMode = errors.New("os: invalid use of WriteAt on file opened with O_APPEND")
   192:	
   193:	// WriteAt writes len(b) bytes to the File starting at byte offset off.
(dlv) n
> io.copyBuffer() /usr/local/go/src/io/io.go:425 (PC: 0x504ec2)
Values returned:

   420:			buf = make([]byte, size)
   421:		}
   422:		for {
   423:			nr, er := src.Read(buf)
   424:			if nr > 0 {
=> 425:				nw, ew := dst.Write(buf[0:nr])
   426:				if nw < 0 || nr < nw {
   427:					nw = 0
   428:					if ew == nil {
   429:						ew = errInvalidWrite
   430:					}
(dlv) n
(dlv) print nw
Command failed: could not find symbol value for nw
(dlv) print nr
990
(dlv) print er
error(*errors.errorString) *{s: "EOF"}
(dlv) n
> io.copyBuffer() /usr/local/go/src/io/io.go:447 (PC: 0x505032)
   442:			}
   443:			if er != nil {
   444:				if er != EOF {
   445:					err = er
   446:				}
=> 447:				break
   448:			}
   449:		}
   450:		return written, err
   451:	}
   452:	
(dlv) print er
error(*errors.errorString) *{s: "EOF"}
(dlv) print err
error nil

Error logs of tink-worker, note the first few lines of below logs.

{"errorDetail":{"message":"write /var/lib/docker/tmp/GetImageBlob525488247: no space left on device"},"error":"write /var/lib/docker/tmp/GetImageBlob525488247: no space left on device"}^M
{"level":"info","ts":1613665320.148993,"caller":"internal/action.go:41","msg":"creating container","service":"github.com/tinkerbell/tink","command":[]}
{"level":"error","ts":1613665320.1849961,"caller":"internal/worker.go:313","msg":"DOCKER CREATE: DOCKER CREATE: Error response from daemon: No such image: 192.168.1.1/test:latest","service":"github.com/tinkerbell/tink","workerID":"0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94","workflowID":"25ceb32c-7205-11eb-bc69-0242ac140005","actionName":"hello_world","taskName":"hello world 1","error":"DOCKER CREATE: DOCKER CREATE: Error response from daemon: No such image: 192.168.1.1/test:latest","errorVerbose":"Error response from daemon: No such image: 192.168.1.1/test:latest\nDOCKER CREATE\ngithub.com/tinkerbell/tink/cmd/tink-worker/internal.(*Worker).createContainer\n\t/usr/myapp/cmd/tink-worker/internal/action.go:44\ngithub.com/tinkerbell/tink/cmd/tink-worker/internal.(*Worker).execute\n\t/usr/myapp/cmd/tink-worker/internal/worker.go:104\ngithub.com/tinkerbell/tink/cmd/tink-worker/internal.(*Worker).ProcessWorkflowActions\n\t/usr/myapp/cmd/tink-worker/internal/worker.go:295\ngithub.com/tinkerbell/tink/cmd/tink-worker/cmd.NewRootCommand.func2\n\t/usr/myapp/cmd/tink-worker/cmd/root.go:73\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:887\nmain.main\n\t/usr/myapp/cmd/tink-worker/main.go:29\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nDOCKER CREATE\ngithub.com/tinkerbell/tink/cmd/tink-worker/internal.(*Worker).execute\n\t/usr/myapp/cmd/tink-worker/internal/worker.go:106\ngithub.com/tinkerbell/tink/cmd/tink-worker/internal.(*Worker).ProcessWorkflowActions\n\t/usr/myapp/cmd/tink-worker/internal/worker.go:295\ngithub.com/tinkerbell/tink/cmd/tink-worker/cmd.NewRootCommand.func2\n\t/usr/myapp/cmd/tink-worker/cmd/root.go:73\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:887\nmain.main\n\t/usr/myapp/cmd/tink-worker/main.go:29\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
Error: worker Finished with error: DOCKER CREATE: DOCKER CREATE: Error response from daemon: No such image: 192.168.1.1/test:latest
Usage:

Cbkhare added a commit to Cbkhare/tink that referenced this issue Mar 8, 2021
Cbkhare added a commit to Cbkhare/tink that referenced this issue Mar 11, 2021
Cbkhare added a commit to Cbkhare/tink that referenced this issue Mar 17, 2021
Cbkhare added a commit to Cbkhare/tink that referenced this issue Mar 17, 2021
added check for Decode statement

added unit tests

added unit tests

added unit tests
Cbkhare added a commit to Cbkhare/tink that referenced this issue Mar 18, 2021
added check for Decode statement

added unit tests

added unit tests

added unit tests

added unit tests
Cbkhare added a commit that referenced this issue Mar 30, 2021
* docker graceful exits when image pull fails #382

added check for Decode statement

added unit tests

added unit tests

added unit tests

added unit tests

* added unit tests

Co-authored-by: cbkhare <[email protected]>
@Cbkhare
Copy link
Contributor

Cbkhare commented Mar 30, 2021

resolved after #450 merge.

@Cbkhare Cbkhare closed this as completed Mar 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Projects
None yet
Development

No branches or pull requests

3 participants