Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Driver restart fails to kill older sibling and never retries #1423

Closed
jimklimov opened this issue May 13, 2022 · 6 comments
Closed

Driver restart fails to kill older sibling and never retries #1423

jimklimov opened this issue May 13, 2022 · 6 comments
Labels
bug nut-driver-enumerator (NDE) nut-driver-enumerator (NDE) automates service management integration for NUT driver instances etc. service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug
Milestone

Comments

@jimklimov
Copy link
Member

jimklimov commented May 13, 2022

While troubleshooting a loss of connection on a system where crontab has scripts trying to restart NUT driver (and OS USB stack) upon hiccups, I found a sort of race:

  • driver instance 1 starts and takes time to walk the device and initialize:
[ May  9 14:01:12 Executing start method ("/sbin/sh -c 'NUTDEV="`/usr/libexec/nut-driver-enumerator.sh --get-device-for
-service innotech`" && [ -n "$NUTDEV" ] || { echo "FATAL: Could not find a NUT device section for service unit innotech
" >&2 ; exit 1 ; } ; /usr/sbin/upsdrvctl start "$NUTDEV"'"). ]
   0.000000     [D1] debug level is '3'
kill: No such process
   0.000718     [D1] Saving PID 28434 into /var/run/nut/nutdrv_qx-innotech.pid
   0.000827     [D1] upsdrv_initups...
   0.012366     [D2] Checking device 1 of 2 (0665/5161)
   0.013921     [D1] nut_libusb_open get iManufacturer failed, retrying...
   0.015943     [D1] nut_libusb_open get iManufacturer failed, retrying...
   0.030914     [D2] - VendorID: 0665
   0.031042     [D2] - ProductID: 5161
   0.031122     [D2] - Manufacturer: INNO TECH
   0.031201     [D2] - Product: USB to Serial
   0.031279     [D2] - Serial Number: 20100826
   0.031358     [D2] - Bus: 152
   0.031436     [D2] - Device: unknown
   0.031514     [D2] - Device release number: 0002
   0.031593     [D2] Trying to match device
   0.031673     [D3] match_function_regex: matching a device...
   0.031779     [D2] Device matches
   0.031858     [D2] Reading first configuration descriptor
   0.032507     [D3] libusb_kernel_driver_active() returned 0
   0.032594     [D2] Claimed interface 0 successfully
   0.032677     [D3] nut_usb_set_altinterface: skipped libusb_set_interface_alt_setting(udev, 0, 0)
   0.034904     [D3] send: QGS
   0.122172     [D3] read: QGS
...
  11.965163     [D2] qx_process_answer: short reply (ups.firmware)
  11.967496     [D3] send: Q1
  12.220586     [D3] read: (225.4 225.4 225.4 010 50.3 13.6 25.0 00001001
  21.052334     [D3] send: I
  • driver instance 2 starts (because no upsc response is seen, script does it):
Startup timer elapsed, continuing...
Network UPS Tools - UPS driver controller 2.8.0-20-g535395363
[ May  9 14:01:57 Method "start" exited with status 1. ]
[ May  9 14:01:57 Executing start method ("/sbin/sh -c 'NUTDEV="`/usr/libexec/nut-driver-enumerator.sh --get-device-for-service innotech`" && [ -n "$NUTDEV" ] || { echo "FATAL: Could not find a NUT device section for service unit innotech" >&2 ; exit 1 ; } ; /usr/sbin/upsdrvctl start "$NUTDEV"'"). ]
   0.000000     [D1] debug level is '3'
  • driver instance 2 sees the 1's PID file and replaces it, claiming to kill the other driver:
   0.000654     Duplicate driver instance detected (PID file /var/run/nut/nutdrv_qx-innotech.pid exists)! Terminating other driver!
   5.008521     Duplicate driver instance detected (PID file /var/run/nut/nutdrv_qx-innotech.pid exists)! Terminating other driver!
  10.016690     Duplicate driver instance detected (PID file /var/run/nut/nutdrv_qx-innotech.pid exists)! Terminating other driver!
  15.024774     [D1] Saving PID 28635 into /var/run/nut/nutdrv_qx-innotech.pid
  15.025207     [D1] upsdrv_initups...
  • ...but either the kill is not processed yet as the new sibling moves forward, or does not even happen (I see 28434 still running several days later), in any case the PID file now contains the new driver instance 2, and this instance fails to connect because device is busy with instance 1:
  15.055364     [D2] Checking device 1 of 2 (0665/5161)
  15.055921     [D1] Failed to open device (0665/5161), skipping: Other error
  15.056151     [D2] Checking device 2 of 2 (099A/610A)
  15.058812     [D1] nut_libusb_open get iManufacturer failed, retrying...
  15.059737     [D1] nut_libusb_open get iManufacturer failed, retrying...
  15.060741     [D1] nut_libusb_open get iManufacturer failed, retrying...
  15.061740     [D1] nut_libusb_open get iProduct failed, retrying...
  15.062799     [D1] nut_libusb_open get iProduct failed, retrying...
  15.063708     [D1] nut_libusb_open get iProduct failed, retrying...
  15.063840     [D2] - VendorID: 099a
  15.063921     [D2] - ProductID: 610a
  15.064001     [D2] - Manufacturer: unknown
  15.064079     [D2] - Product: unknown
  15.064158     [D2] - Serial Number: unknown
  15.064236     [D2] - Bus: 144
  15.064316     [D2] - Device: unknown
  15.064394     [D2] - Device release number: 0001
  15.064473     [D2] Trying to match device
  15.064553     [D3] match_function_regex: matching a device...
  15.064651     [D2] match_function_regex: failed match of VendorID:  99a
  15.064731     [D2] Device does not match - skipping
  15.064865     [D2] libusb1: No appropriate HID device found
Network UPS Tools - Generic Q* USB/Serial driver 0.32 (2.8.0-20-g535395363)
USB communication driver (libusb 1.0) 0.43
  15.064963     libusb1: Could not open any HID devices: insufficient permissions on everything
  15.065044     No supported devices found. Please check your device availability with 'lsusb' ...
  • later driver instance 3..N tries to start, does not even know about an older sibling (PID from instance 2 was no longer valid and removed/ignored), and device is still busy with instance 1, so not accessible. For days:
[ May  9 14:10:00 Executing start method ("/sbin/sh -c 'NUTDEV="`/usr/libexec/nut-driver-enumerator.sh --get-device-for-service innotech`" && [ -n "$NUTDEV" ] || { echo "FATAL: Could not find a NUT device section for service unit innotech" >&2 ; exit 1 ; } ; /usr/sbin/upsdrvctl start "$NUTDEV"'"). ]
   0.000000     [D1] debug level is '3'
   0.000617     [D1] Saving PID 29633 into /var/run/nut/nutdrv_qx-innotech.pid
   0.000734     [D1] upsdrv_initups...
   0.012102     [D2] Checking device 1 of 2 (0665/5161)
   0.012332     [D1] Failed to open device (0665/5161), skipping: Other error
...

TODO:

  • Check if kill is sent, and received, and handled? Does it send a TERM or also SIGKILL?
  • Check if the driver option to run as root has a role in this.
  • Check if driver instance 1 apparently being stuck to wait for reply has a role in not processing the kill? Although the process is up for days, it serves no data via upsc (and/or socket file got replaced along with PID file?) and does not log regular device walks.
@jimklimov jimklimov added bug service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug labels May 13, 2022
@jimklimov jimklimov added this to the 2.8.1 milestone May 13, 2022
@jimklimov
Copy link
Member Author

Some findings:

  • user does not seem to matter, at least as long as it does not change between runs (e.g. driver 1 started as root, driver 2 started as nut so can't signal a root process)
  • driver 1 was stuck so ignored SIGTERM; according to tracer, its process did pollsys() every minute regularly.
  • the ritual to "Terminate other driver" did not try to SIGKILL it, only SIGTERM.
  • adding a SIGKILL into the loop (PR SIGKILL an earlier driver instance if SIGTERM does not cut it [part of #1423 fix] #1424 pending) did resolve part of the issue, in that the new driver instance could grab the device now:
   0.000000     [D1] debug level is '6'
   0.000883     Duplicate driver instance detected (PID file /var/run/nut/nutdrv_qx-innotech.pid exists)! Terminating other driver!
   5.009110     Duplicate driver instance detected (PID file /var/run/nut/nutdrv_qx-innotech.pid exists)! Terminating other driver!
  10.017450     Duplicate driver instance detected (PID file /var/run/nut/nutdrv_qx-innotech.pid exists)! Terminating other driver!
  15.026749     Duplicate driver instance is still alive (PID file /var/run/nut/nutdrv_qx-innotech.pid exists) after several termination attempts! Killing other driver!
kill: No such process
  20.038688     Could not signal the other driver after kill, either its process is finally dead or owned by another user!
  20.039096     [D1] Saving PID 4909 into /var/run/nut/nutdrv_qx-innotech.pid
  20.039497     [D5] send_to_all: SETINFO device.type "ups"
  20.039800     [D1] upsdrv_initups...
  20.053941     [D2] Checking device 1 of 2 (0665/5161)
  20.056730     [D1] nut_libusb_open get iManufacturer failed, retrying...
  20.072566     [D2] - VendorID: 0665
  20.072908     [D2] - ProductID: 5161
  20.073345     [D2] - Manufacturer: INNO TECH
  20.073639     [D2] - Product: USB to Serial
  20.073904     [D2] - Serial Number: 20100826
  20.074180     [D2] - Bus: 152
  20.074444     [D2] - Device: unknown
  20.074731     [D2] - Device release number: 0002
  20.075000     [D2] Trying to match device
  20.075287     [D3] match_function_regex: matching a device...
  20.075590     [D2] Device matches
  20.076018     [D2] Reading first configuration descriptor
  20.077613     [D3] libusb_kernel_driver_active() returned 0
  20.077746     [D2] Claimed interface 0 successfully
  20.077870     [D3] nut_usb_set_altinterface: skipped libusb_set_interface_alt_setting(udev, 0, 0)
  20.077973     [D5] send_to_all: SETINFO ups.vendorid "0665"
  20.078083     [D5] send_to_all: SETINFO ups.productid "5161"
  20.080580     [D3] send: QGS
...

@jimklimov
Copy link
Member Author

jimklimov commented May 13, 2022

Remaining issues to investigate (another time):

@jimklimov
Copy link
Member Author

Now that the fix is deployed on a test system, I see the driver reporting loss of connection fairly often, so the point on locking up in reads (probably) feels urgent. However now the scripted recovery from this works, being able to kill the other driver.

@jimklimov
Copy link
Member Author

Now that #1422 is merged, I hope to see if that fixes the root problem of driver stalling in this case; built and left running...

@jimklimov
Copy link
Member Author

Empirically, it did not help, after ~30 min give or take, the driver stalls and my cronjob restarts it.

@jimklimov
Copy link
Member Author

Remaining ideas offloaded into other tickets, so this one can be closed.

@jimklimov jimklimov added the nut-driver-enumerator (NDE) nut-driver-enumerator (NDE) automates service management integration for NUT driver instances etc. label Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug nut-driver-enumerator (NDE) nut-driver-enumerator (NDE) automates service management integration for NUT driver instances etc. service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug
Projects
None yet
Development

No branches or pull requests

1 participant