Error on supervisorctl reread && reload while sending start command via RPC #975

Closed
NetWilliam opened this issue Aug 9, 2017 · 10 comments


NetWilliam commented Aug 9, 2017

OS: Ubuntu 14.04
Python: 2.7.6
Supervisor version: 3.0b2
Client: Go RPC client

What I've seen:
supervisord crashes and restarts after an "unknown problem killing" exception.

What I expected:
No crash or restart.

Log:

2017-08-09 10:33:15,042 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-6bf1a138b2c647ca063f49e691397b5e' with pid 2633
2017-08-09 10:33:15,125 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-6bf1a138b2c647ca063f49e691397b5e (exit status 255; not expected)
2017-08-09 10:33:16,128 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-20e47f8d08d0902fb9a5833f9b76db2a' with pid 2702
2017-08-09 10:33:16,130 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-83124eebca39f4463e7cb67f45be435d' with pid 2703
2017-08-09 10:33:16,131 INFO success: weedfs_filer entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-08-09 10:33:16,132 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-849ad9ee070a477fe8d66c868b8adcc7' with pid 2704
2017-08-09 10:33:16,137 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-da1403a502e7571f311cad0fe912d9e9' with pid 2705
2017-08-09 10:33:16,138 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-fe40d8f75d5be7f1f10c093b54cbff13' with pid 2706
2017-08-09 10:33:16,139 INFO stopped: matrix-c1d726c85cacbf57b03cd021b8781123-83124eebca39f4463e7cb67f45be435d (terminated by SIGTERM)
2017-08-09 10:33:16,352 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-3052046766254e8bcea59f35ae221fbd' with pid 2913
2017-08-09 10:33:16,394 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-20e47f8d08d0902fb9a5833f9b76db2a (exit status 255; not expected)
2017-08-09 10:33:16,421 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-da1403a502e7571f311cad0fe912d9e9 (exit status 255; not expected)
2017-08-09 10:33:16,431 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-849ad9ee070a477fe8d66c868b8adcc7 (exit status 255; not expected)
2017-08-09 10:33:16,461 INFO spawned: 'kafka_0' with pid 3003
2017-08-09 10:33:16,472 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-fe40d8f75d5be7f1f10c093b54cbff13 (exit status 255; not expected)
2017-08-09 10:33:16,520 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-3052046766254e8bcea59f35ae221fbd (exit status 255; not expected)
2017-08-09 10:33:16,991 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-508b67793beee64031f6e8731afc1551' with pid 3343
2017-08-09 10:33:16,993 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-3f7b9d2205bb95a7400f81feb0842be9' with pid 3344
2017-08-09 10:33:16,995 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-31e8ac1e97a12925ff14e35536cbdd71' with pid 3345
2017-08-09 10:33:17,170 INFO spawned: 'matrix-c1d726c85cacbf57b03cd021b8781123-6bf1a138b2c647ca063f49e691397b5e' with pid 3501
2017-08-09 10:33:17,173 CRIT unknown problem killing matrix-c1d726c85cacbf57b03cd021b8781123-6bf1a138b2c647ca063f49e691397b5e (3501):Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/supervisor/process.py", line 408, in kill
    options.kill(pid, sig)
  File "/usr/lib/python2.7/dist-packages/supervisor/options.py", line 1159, in kill
    os.kill(pid, signal)
OSError: [Errno 3] No such process

2017-08-09 10:33:17,205 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-508b67793beee64031f6e8731afc1551 (exit status 255; not expected)
2017-08-09 10:33:17,217 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-3f7b9d2205bb95a7400f81feb0842be9 (exit status 255; not expected)
2017-08-09 10:33:17,236 INFO exited: matrix-c1d726c85cacbf57b03cd021b8781123-31e8ac1e97a12925ff14e35536cbdd71 (exit status 255; not expected)
2017-08-09 10:34:16,199 CRIT Supervisor running as root (no user in config file)
2017-08-09 10:34:16,199 WARN Included extra file "/etc/supervisor/conf.d/etcd.conf" during parsing

Note that the CRIT line at the end of the log indicates that supervisord itself restarted.

@NetWilliam

I run supervisorctl reread/reload while sending startProcess/stopProcess requests to supervisord via a Go RPC package, and I get the error above. I have been trying to figure it out, but haven't succeeded yet.
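
For context, here is a minimal Python sketch of roughly what my Go client does over supervisord's XML-RPC interface (the server URL and the process name are placeholders):

    # Rough Python equivalent of my Go client's calls; the URL and the
    # process name are placeholders.
    import xmlrpclib

    server = xmlrpclib.ServerProxy('http://127.0.0.1:9002/RPC2')

    server.supervisor.startProcess('some_program')   # start via RPC
    server.supervisor.stopProcess('some_program')    # stop via RPC

    # Meanwhile, from another shell, I repeatedly run:
    #   sudo supervisorctl reread
    #   sudo supervisorctl update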

@NetWilliam

AssertionError: Assertion failed for matrix-f80b24e0053c81fe1db99f3ff3b5ea7e-8c0bbfbdc646ab35f7136abe6ebac432: UNKNOWN not in STARTING

> /usr/lib/python2.7/dist-packages/supervisor/process.py(179)_assertInState()
    178             raise AssertionError('Assertion failed for %s: %s not in %s' %  (
--> 179                 self.config.name, current_state, allowable_states))
    180 

ipdb> bt
  /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(360)main()
    358             profile('go(options)', globals(), locals(), sort_order, callers)
    359         else:
--> 360             go(options)
    361         if test or (options.mood < SupervisorStates.RESTARTING):
    362             break

  /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(372)go()
    370         d.main()
    371     except asyncore.ExitNow:
--> 372         pass
    373 
    374 if __name__ == "__main__":

  /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(83)main()
     81             self.options.clear_autochildlogdir()
     82 
---> 83         self.run()
     84 
     85     def run(self):

  /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(102)run()
    100             self.runforever()
    101         finally:
--> 102             self.options.cleanup()
    103 
    104     def diff_to_active(self, new=None):

  /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(254)runforever()
    252             [ group.transition() for group  in pgroups ]
    253 
--> 254             self.reap()
    255             self.handle_signal()
    256             self.tick()

  /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(288)reap()
    286                 self.options.logger.critical('reaped unknown pid %s)' % pid)
    287             else:
--> 288                 process.finish(pid, sts)
    289                 del self.options.pidhistory[pid]
    290             if not once:

  /usr/lib/python2.7/dist-packages/supervisor/process.py(455)finish()
    453             self.spawnerr = 'Exited too quickly (process log may have details)'
    454             msg = "exited: %s (%s)" % (processname, msg + "; not expected")
--> 455             self._assertInState(ProcessStates.STARTING)
    456             self.change_state(ProcessStates.BACKOFF)
    457 

> /usr/lib/python2.7/dist-packages/supervisor/process.py(179)_assertInState()
    177             allowable_states = ' '.join(map(getProcessStateDescription, states))
    178             raise AssertionError('Assertion failed for %s: %s not in %s' %  (
--> 179                 self.config.name, current_state, allowable_states))
    180 
    181     def record_spawnerr(self, msg):

ipdb> 

Above is the stack trace of the exception.

@NetWilliam

@mnaberez


NetWilliam commented Sep 7, 2017

And here is the corresponding code snippet (from process.py):

        elif too_quickly:
            # the program did not stay up long enough to make it to RUNNING
            # implies STARTING -> BACKOFF
            self.exitstatus = None
            self.spawnerr = 'Exited too quickly (process log may have details)'
            msg = "exited: %s (%s)" % (processname, msg + "; not expected")
            self._assertInState(ProcessStates.STARTING)
            self.change_state(ProcessStates.BACKOFF)

Why is there a process state assertion? What is it designed for? Can I just bypass this code?
Why doesn't runforever actually run forever? When would it end?


mnaberez commented Sep 7, 2017

why there is a process state assertion? what is it designed for?

The assertion is to find bugs in supervisord. It checks that the state of the process is valid based on the intended state machine transitions. If a process somehow subverts the intended transitions, the assertion is triggered. In the code you quoted, a process is only ever intended to be in the STARTING state at that point, but it somehow got into the UNKNOWN state.

Can I just bypass the code?

No, supervisord probably cannot continue. If a process is not in its correct place in the state transition diagram, the results will be unpredictable. The underlying bug that caused this condition needs to be fixed.

why the runforever doesn't run forever? when would it meet its end?

supervisord tries to run forever, but there are times when it can't, e.g. if there is a bug or a major problem occurs on the system.

From your log above, it looks like supervisord tried to send a signal to a process but the process exited right before. The error is caught and the process is put into the UNKNOWN state. The UNKNOWN state then causes the AssertionError since a process isn't supposed to be in the UNKNOWN state. The fix for this is probably to change the exception handler so it recognizes OSError: [Errno 3] No such process and doesn't set the process to the UNKNOWN state.
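
A rough sketch of that idea (simplified; this is not supervisor's actual kill() implementation, just the errno handling being proposed):

    # Simplified sketch only -- not the real Subprocess.kill() code.
    import errno
    import os

    def kill_if_still_running(pid, sig):
        """Send sig to pid, treating an already-exited process as a no-op."""
        try:
            os.kill(pid, sig)
        except OSError as why:
            if why.errno == errno.ESRCH:
                # The child exited just before the signal was sent.  The
                # normal SIGCHLD/reap path will record the exit, so there
                # is no need to push the process into the UNKNOWN state.
                return
            raise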


NetWilliam commented Sep 8, 2017

Thank you for your answer, it helps a lot. @mnaberez
I think I have found a way to sidestep this problem. However, I still want to know why it happens, because I can't truly solve it until I understand the cause and what I can do about this kind of problem.

I've checked the stack trace and found that in the outer frame, /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(360)main(), the variable first is True.
Does that mean supervisord is restarting for the first time?
Would runforever end when it receives supervisorctl reread, supervisorctl update, or supervisorctl reload?

My Go code makes startProcess/stopProcess RPC calls and runs sudo supervisorctl reread and sudo supervisorctl update alternately. However, some programs managed by supervisord are expected to fail because of configuration errors. Could this be what causes the UNKNOWN state and the crash of supervisord?

By the way, I am keeping the pdb session available the whole time, in case that would help.


NetWilliam commented Sep 14, 2017

I am trying to locate the stack of the UNKNOWN assertion failure. I set a breakpoint at

process.py:412

After hours of running, it hung: there is no more output in pdb, and when I try to connect to supervisord with supervisorctl, the connection blocks. Here is the network status:

deepglint@DG0101038:~/olympus/olympus/test$ sudo netstat -ntlpa | grep 9002
tcp        0      0 127.0.0.1:9002          0.0.0.0:*               LISTEN      21297/python2.7 
tcp      307      0 127.0.0.1:9002          127.0.0.1:54744         CLOSE_WAIT  -               
tcp      307      0 127.0.0.1:9002          127.0.0.1:38862         CLOSE_WAIT  -               
tcp      307      0 127.0.0.1:9002          127.0.0.1:60934         CLOSE_WAIT  -               
tcp      307      0 127.0.0.1:9002          127.0.0.1:36492         CLOSE_WAIT  -               
tcp      307      0 127.0.0.1:9002          127.0.0.1:33230         CLOSE_WAIT  -               
tcp        1      0 127.0.0.1:9002          127.0.0.1:37742         CLOSE_WAIT  21297/python2.7 
tcp      307      0 127.0.0.1:9002          127.0.0.1:51570         CLOSE_WAIT  -               

supervisord seems to be blocked somewhere; below are the pstack and strace outputs:

deepglint@DG0101038:~/olympus/olympus/test$ pstack
deepglint@DG0101038:~/olympus/olympus/test$ ps axu | grep pdb
root     11953  0.0  0.0  81804  4488 pts/36   S+    9月13   0:00 sudo pdb /usr/bin/supervisord -c /home/deepglint/supervisor-debug/supervisord.conf -n
root     11954  0.4  0.0  79292 25084 pts/36   S+    9月13   6:53 /usr/bin/python2.7 /usr/bin/pdb /usr/bin/supervisord -c /home/deepglint/supervisor-debug/supervisord.conf -n
deepgli+ 15169  0.0  0.0  15984  2168 pts/57   S+   14:46   0:00 grep pdb
root     21296  0.0  0.0  81800  4372 pts/51   S+    9月12   0:00 sudo pdb /usr/bin/supervisord -n -c /etc/supervisor/supervisord.conf
root     21297  1.1  0.0  72796 24876 pts/51   S+    9月12  33:07 /usr/bin/python2.7 /usr/bin/pdb /usr/bin/supervisord -n -c /etc/supervisor/supervisord.conf
deepglint@DG0101038:~/olympus/olympus/test$ pstack 11953
Could not attach to target 11953: Operation not permitted.
detach: No such process
deepglint@DG0101038:~/olympus/olympus/test$ sudo !!
sudo pstack 11953

11953: sudo pdb /usr/bin/supervisord -c /home/deepglint/supervisor-debug/supervisord.conf -n
(No symbols found)
0x7fcd21c51830: ???? (0, 5647f3cd1360, 0, 5647f3cd1370, 5647f3cd11c0, 20) + fffffffffffffd90
0x5647f3cd1120: ???? (0, 0, 0, 0, 5647f3cd1350, a00000006) + ffffa9b80c32eea0
deepglint@DG0101038:~/olympus/olympus/test$ sudo strace -p 11953                                                       
Process 11953 attached
restart_syscall(<... resuming interrupted call ...>

^CProcess 11953 detached
 <detached ...>
deepglint@DG0101038:~/olympus/olympus/test$ 
deepglint@DG0101038:~/olympus/olympus/test$ sudo strace -p 11953
Process 11953 attached
restart_syscall(<... resuming interrupted call ...>^CProcess 11953 detached
 <detached ...>
deepglint@DG0101038:~/olympus/olympus/test$ sudo strace -e trace=all -p 11953                                          
Process 11953 attached
restart_syscall(<... resuming interrupted call ...>
^CProcess 11953 detached
 <detached ...>
deepglint@DG0101038:~/olympus/olympus/test$ sudo strace -p 21296
Process 21296 attached
restart_syscall(<... resuming interrupted call ...>^CProcess 21296 detached
 <detached ...>
deepglint@DG0101038:~/olympus/olympus/test$ sudo strace -p 21297
Process 21297 attached
select(1, [0], NULL, NULL, NULL

It's weird.


NetWilliam commented Sep 19, 2017

I have traced the stack into rpcinterface.py. The code near rpcinterface.py line 382 is as follows:

        def killit():
            if not called:
                if process.get_state() not in RUNNING_STATES:
                    raise RPCError(Faults.NOT_RUNNING)
                # use a mutable for lexical scoping; see startProcess
                called.append(1)

            if not stopped:
                msg = process.stop()
                if msg is not None:
                    raise RPCError(Faults.FAILED, msg)
                stopped.append(1)

I guess what we are doing here is: when we want to stop a process, we check its state first; if the state is in RUNNING_STATES and we haven't already stopped the process in this request, we try to stop it. But my questions are:

Is this a race condition if something runs between the if process.get_state() not in RUNNING_STATES: check and msg = process.stop() and changes the process state?
It seems the same problem exists in the latest release too?
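
To make the window I am asking about concrete, here is a small self-contained toy (FakeProcess and its hard-coded state values are invented for illustration; they are not supervisor internals):

    # Toy illustration of the check-then-act window in killit().
    RUNNING_STATES = (10, 20, 30)        # STARTING, RUNNING, BACKOFF
    STOPPING = 40

    class FakeProcess(object):
        state = 20                       # RUNNING

        def get_state(self):
            return self.state

        def stop(self):
            if self.state not in RUNNING_STATES:
                return 'stop() called in state %s' % self.state

    p = FakeProcess()

    # deferred callback, first pass: the RUNNING_STATES check passes
    assert p.get_state() in RUNNING_STATES

    # the event loop runs in between and the process moves to STOPPING
    p.state = STOPPING

    # deferred callback, later pass: we act on the stale check
    print p.stop()                       # -> stop() called in state 40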

@mnaberez
Below is my tracing information, FYI.
Stack info: please notice that process.get_state() returns 40, which, as defined in states.py, means STOPPING.

379                         self.ac_out_buffer = self.ac_out_buffer + p
380                         return
381  
382  ->                 data = p.more()
383  
384                     if data is NOT_DONE_YET:
385                         self.delay = p.delay
386                         return
387  
(Pdb) up
> /usr/lib/python2.7/dist-packages/supervisor/medusa/asynchat_25.py(213)initiate_send()
-> self.refill_buffer()
(Pdb) up
> /usr/lib/python2.7/dist-packages/supervisor/medusa/asynchat_25.py(152)handle_write()
-> self.initiate_send ()
(Pdb) up
> /usr/lib/python2.7/dist-packages/supervisor/medusa/asyncore_25.py(397)handle_write_event()
-> self.handle_write()
(Pdb) up
> /usr/lib/python2.7/dist-packages/supervisor/supervisord.py(246)runforever()
-> dispatcher.handle_write_event()
(Pdb) list
241                         try:
242                             dispatcher = combined_map[fd]
243                             self.options.logger.blather(
244                                 'write event caused by %(dispatcher)s',
245                                 dispatcher=dispatcher)
246  ->                         dispatcher.handle_write_event()
247                         except asyncore.ExitNow:
248                             raise
249                         except:
250                             combined_map[fd].handle_error()
251  
(Pdb) down
> /usr/lib/python2.7/dist-packages/supervisor/medusa/asyncore_25.py(397)handle_write_event()
-> self.handle_write()
(Pdb) down
> /usr/lib/python2.7/dist-packages/supervisor/medusa/asynchat_25.py(152)handle_write()
-> self.initiate_send ()
(Pdb) down
> /usr/lib/python2.7/dist-packages/supervisor/medusa/asynchat_25.py(213)initiate_send()
-> self.refill_buffer()
(Pdb) down
> /usr/lib/python2.7/dist-packages/supervisor/http.py(382)refill_buffer()
-> data = p.more()
(Pdb) down
> /usr/lib/python2.7/dist-packages/supervisor/xmlrpc.py(65)more()
-> value = self.callback()
(Pdb) down
> /usr/lib/python2.7/dist-packages/supervisor/rpcinterface.py(803)allfunc()
-> value = callback()
(Pdb) down
> /usr/lib/python2.7/dist-packages/supervisor/rpcinterface.py(382)killit()
-> msg = process.stop()
(Pdb) print process.stopped
*** AttributeError: Subprocess instance has no attribute 'stopped'
(Pdb) print process.get_state()
40
(Pdb) 
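
For reference, the relevant constants as I read them in states.py (excerpt only):

    # Excerpt of the definitions in supervisor/states.py, as I read them.
    class ProcessStates:
        STOPPED  = 0
        STARTING = 10
        RUNNING  = 20
        BACKOFF  = 30
        STOPPING = 40
        EXITED   = 100
        FATAL    = 200
        UNKNOWN  = 1000

    RUNNING_STATES = (ProcessStates.RUNNING,
                      ProcessStates.BACKOFF,
                      ProcessStates.STARTING)

    # get_state() returned 40, i.e. STOPPING, which is *not* in
    # RUNNING_STATES -- so by the time process.stop() runs, the earlier
    # check in killit() no longer holds.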


NetWilliam commented Sep 20, 2017

Is there a possible race condition, or have I misunderstood? @mnaberez @alex @devrieda @anandology @joshk Please note that the same problem seems to exist in the latest version.

@NetWilliam

For those who come here later, here are some issues that may help:
#536
#445
