
Error starting one service causes others in task set to not start #231

Closed
troyanov opened this issue Jun 2, 2023 · 2 comments · Fixed by #437
Labels
Bug An undesired feature ;-)

Comments


troyanov commented Jun 2, 2023

Pebble gets locked up after starting a service that exits immediately.

Layer specification used:

services:
  a:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled
  b:
    override: replace
    command: sh -c "exit 123"
    startup: enabled
  c:
    override: replace
    command: sh -c "sleep 999"
    startup: enabled

Pebble will not try to start service c, as can be seen in the debug output:

❯ PEBBLE_DEBUG=1 PEBBLE="/tmp/pebble-root" go run ./cmd/pebble run
2023-06-02T20:13:31.257Z [pebble] DEBUG socket "/tmp/pebble-root/.pebble.socket" was not activated; listening
2023-06-02T20:13:31.257Z [pebble] DEBUG Reaper started, waiting for SIGCHLD.
2023-06-02T20:13:31.257Z [pebble] DEBUG socket "/tmp/pebble-root/.pebble.socket.untrusted" was not activated; listening
2023-06-02T20:13:31.257Z [pebble] Started daemon.
2023-06-02T20:13:31.257Z [pebble] DEBUG activation done in 10ms
2023-06-02T20:13:31.266Z [pebble] DEBUG Configuring check manager (stopping 0, starting 0)
2023-06-02T20:13:31.280Z [pebble] DEBUG Running task 251 on Do: Start service "a"
2023-06-02T20:13:31.280Z [pebble] DEBUG pid=1871156;uid=1000;socket=/tmp/pebble-root/.pebble.socket; POST /v1/services 22.762402ms 202
2023-06-02T20:13:31.281Z [pebble] POST /v1/services 22.762402ms 202
2023-06-02T20:13:31.281Z [pebble] Started default services with change 79.
2023-06-02T20:13:31.295Z [pebble] Service "a" starting: sh -c "sleep 999"
2023-06-02T20:13:31.295Z [pebble] DEBUG Service "a" started with PID 1871169
2023-06-02T20:13:31.295Z [pebble] DEBUG Service "a" transitioning to state "starting"
2023-06-02T20:13:32.296Z [pebble] DEBUG Service "a" transitioning to state "running"
2023-06-02T20:13:32.310Z [pebble] DEBUG Running task 252 on Do: Start service "b"
2023-06-02T20:13:32.316Z [pebble] Service "b" starting: sh -c "exit 123"
2023-06-02T20:13:32.317Z [pebble] DEBUG Service "b" started with PID 1871177
2023-06-02T20:13:32.317Z [pebble] DEBUG Service "b" transitioning to state "starting"
2023-06-02T20:13:32.317Z [pebble] DEBUG Reaper received SIGCHLD.
2023-06-02T20:13:32.317Z [pebble] DEBUG Reaped PID 1871177 which exited with code 123.
2023-06-02T20:13:32.317Z [pebble] DEBUG Service "b" exited with code 123.
2023-06-02T20:13:32.317Z [pebble] DEBUG Service "b" transitioning to state "exited"
2023-06-02T20:13:32.317Z [pebble] DEBUG 2023-06-02T23:13:32+03:00 INFO Most recent service output:

2023-06-02T20:13:32.324Z [pebble] DEBUG 2023-06-02T23:13:32+03:00 ERROR cannot start service: exited quickly with code 123

The services status looks wrong (both b and c are marked as inactive):

❯ PEBBLE_DEBUG=1 PEBBLE="/tmp/pebble-root" ./cmd/pebble/pebble services
Service  Startup  Current   Since
a        enabled  active    today at 23:04 MSK
b        enabled  inactive  -
c        enabled  inactive  -

The doStart function is executed only twice (for services a and b) and is never called for c:

func (m *ServiceManager) doStart(task *state.Task, tomb *tomb.Tomb) error {

Here is the place where it locks up: for service b we never receive anything on the channels used in this select.

// Wait for a small amount of time, and if the service hasn't exited,
// consider it a success.
select {
case err := <-service.started:
    if err != nil {
        addLastLogs(task, service.logs)
        m.removeService(config.Name)
        return fmt.Errorf("cannot start service: %w", err)
    }
    // Started successfully (ran for small amount of time without exiting).
    return nil
case <-tomb.Dying():
    // User tried to abort the start, sending SIGKILL to process is about
    // the best we can do.
    m.removeService(config.Name)
    m.servicesLock.Lock()
    defer m.servicesLock.Unlock()
    err := syscall.Kill(-service.cmd.Process.Pid, syscall.SIGKILL)
    if err != nil {
        return fmt.Errorf("start aborted, but cannot send SIGKILL to process: %v", err)
    }
    return fmt.Errorf("start aborted, sent SIGKILL to process")
}

Adding a default case resolves the issue, but that doesn't seem to be the right fix.

default:
  return nil

I suspect that okayWaitElapsed is missing some logic to handle such a situation?

// okayWaitElapsed is called when the okay-wait timer has elapsed (and the
// service is considered running successfully).
func (s *serviceData) okayWaitElapsed() error {
    s.manager.servicesLock.Lock()
    defer s.manager.servicesLock.Unlock()
    switch s.state {
    case stateStarting:
        s.started <- nil // still running fine after short duration, no error
        s.transition(stateRunning)
    default:
        // Ignore if timer elapsed in any other state.
        return nil
    }
    return nil
}


benhoyt commented Jun 5, 2023

Thanks for the report. Yeah, this is definitely non-ideal: you'd expect it to start a and c even if b fails (when there are no dependencies between them). What's happening is that we put all the "start service X" tasks in a task set and then run them all as one change. But Pebble's task runner runs the tasks sequentially, and puts the subsequent tasks in Hold state if a previous one has an error.

Here's the output of pebble changes and then pebble tasks <change-id> after starting (note the Hold status of the Start service "c" task):

$ pebble changes
ID   Status  Spawn                       Ready                       Summary
...
143  Error   today at 11:36 NZST         today at 11:36 NZST         Autostart service "a" and 2 more
$ pebble tasks 143
Status  Spawn                Ready                Summary
Done    today at 11:36 NZST  today at 11:36 NZST  Start service "a"
Error   today at 11:36 NZST  today at 11:36 NZST  Start service "b"
Hold    today at 11:36 NZST  today at 11:36 NZST  Start service "c"

......................................................................
Start service "b"

2023-06-06T11:36:46+12:00 INFO Most recent service output:
    
2023-06-06T11:36:46+12:00 ERROR cannot start service: exited quickly with code 123

So I don't think this has to do with locking or even the 1s startup wait time (it happens for any kind of service-start error), but rather the fact that we're starting them sequentially and stopping the whole thing if there's one error.

Relatedly, there's a TODO comment in servstate.Start saying "Allow non-dependent services to start in parallel". So this would probably be fixed if we do that.

I'll update the issue title to reflect this better.

@benhoyt benhoyt changed the title Possible lock if service exit immediately after start Error starting one service causes others in task set to not start Jun 5, 2023
@benhoyt benhoyt added 24.10 Bug An undesired feature ;-) labels Mar 13, 2024

benhoyt commented Apr 21, 2024

This can probably be done via the state engine's "lanes" concept. Currently they all likely get put into the default lane.
