upsmon fails to start if the old pid happens to exist for an unrelated process #2463

Closed
yoyoma2 opened this issue Jun 5, 2024 · 14 comments · Fixed by #2464
Labels
impacts-release-2.8.1 Issues reported against NUT release 2.8.1 (maybe vanilla or with minor packaging tweaks) portability We want NUT to build and run everywhere possible question service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug upsmon

Comments

@yoyoma2

yoyoma2 commented Jun 5, 2024

If a previous upsmon.pid file is present at startup and an unrelated process by chance happens to exist with that pid, upsmon will fail to start with the following message:

Fatal error: A previous upsmon instance is already running!
Either stop the previous instance first, or use the 'reload' command.

Workaround:
My cron job that checks running services using pidof will now delete upsmon.pid before attempting to restart upsmon.

Expected behavior:
If the name of the current process with the pid contained in upsmon.pid is unrelated to upsmon, then upsmon should start up normally instead of exiting with a failure.

Platform: router running linux
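
A minimal shell sketch of the expected behavior above, written as a wrapper-level check rather than anything upsmon itself does. It assumes Linux with procfs, where `/proc/<pid>/comm` holds the short command name. The function names and the `PROC_ROOT` override are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch: decide whether a PID file is stale before starting a daemon.
# Assumption: Linux procfs. PROC_ROOT is a hypothetical override for testing.

# Print the short command name for a PID, or nothing if it is not running.
proc_comm() {
    [ -r "${PROC_ROOT:-/proc}/$1/comm" ] && cat "${PROC_ROOT:-/proc}/$1/comm"
}

# Return 0 if the PID file is stale: unusable content, no such process,
# or a process whose name differs from the expected program name.
# Return 1 if there is no PID file or the PID belongs to the expected program.
pidfile_is_stale() {
    pidfile=$1
    expected=$2
    [ -f "$pidfile" ] || return 1            # no PID file: nothing to clean up
    pid=$(tr -cd '0-9' < "$pidfile")         # keep digits only
    [ -n "$pid" ] || return 0                # empty or garbage content
    name=$(proc_comm "$pid" || true)         # empty when the PID is not running
    [ "$name" != "$expected" ]
}

# Example use in a service-check script (not run here):
#   pidfile_is_stale /opt/var/run/upsmon.pid upsmon && rm -f /opt/var/run/upsmon.pid
```

Note that the kernel truncates `comm` to 15 characters, which is enough for `upsmon` but would need care for longer program names.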

@jimklimov
Member

Thanks, that's an interesting case, although the hard part can be looking up the process name for a PID across many platforms that all do it differently.

Just in case: is the problem happening at router start-up? Does your build of NUT there store the PID files in a persistent storage location, or in a tmpfs that should be empty after a reboot?

Also, which version of NUT is involved? I think with recent ones, a graceful exit of the daemon should have it delete the PID file.

@yoyoma2
Author

yoyoma2 commented Jun 6, 2024

These are the entware packages installed on the router so a pretty recent version.

nut - 2.8.1-1
nut-common - 2.8.1-1
nut-upsmon - 2.8.1-1
nut-upssched - 2.8.1-1

The pid file is in /opt/var/run which is persistent. The /opt/etc/init.d/S15upsmon startup script isn't specifying a variable with the directory for the pid file so not sure how to override this to a non-persistent directory.

This problem only happened once after a non-graceful reboot of the router. I kept getting emails about upsmon not running so all subsequent attempts to start upsmon were also failing.

Since the entware startup scripts already perform a check of an already running process (below), perhaps a upsmon command-line argument to skip the redundant test at upsmon startup could be a workaround.

if [ -n "`pidof $PROC`" ]; then
    echo -e "            $ansi_yellow already running. $ansi_std"
    return 0
fi

This is admittedly a fringe case that requires bad luck to occur.

@jimklimov
Member

jimklimov commented Jun 6, 2024

For temporary files (PID, Unix socket, etc.), NUT generally uses several locations which are built into the binaries by configure script settings, but can be customized by environment variables such as NUT_STATEPATH, NUT_PIDPATH and NUT_ALTPIDPATH. And NUT_CONFPATH for config files, to mention them all. Try exporting these first 3 from the init scripts (consistently for all NUT components you run).
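
As a rough illustration of exporting those three variables from an init script, the sketch below points them all at one runtime directory. The variable names come from the comment above. The function name, the default directory `/tmp/nut-run`, and the choice to use a single directory for all three are assumptions that would need adjusting to the actual build and to which user each daemon runs as.

```shell
#!/bin/sh
# Sketch: point NUT at a (presumably volatile) runtime directory.
# Assumption: /tmp/nut-run is a placeholder; the function name is hypothetical.

nut_use_runtime_dir() {
    dir=${1:-/tmp/nut-run}
    mkdir -p "$dir" || return 1
    NUT_STATEPATH=$dir
    NUT_PIDPATH=$dir
    NUT_ALTPIDPATH=$dir
    # Exported so that every NUT program started afterwards sees the same paths.
    export NUT_STATEPATH NUT_PIDPATH NUT_ALTPIDPATH
}

# Example use near the top of an init script (not run here):
#   nut_use_runtime_dir /tmp/nut-run && upsmon -p
```

The same call would have to precede every NUT component that is started, otherwise one program may look for PID files where another did not write them.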

@yoyoma2
Author

yoyoma2 commented Jun 6, 2024

Thanks, those variables are very useful. Tweaking the entware scripts makes installing future entware package updates a pain. If there was a portable fix I would log an issue with entware so all entware NUT users would avoid the risk. Even a non-persistent upsmon.pid isn't a perfect solution.

For now I'll just keep my checkservices cron job hack which deletes upsmon.pid the next time it runs after the problem happens and notifies me. I might get lucky and never hit this issue again.

This is apparently a bigger topic than this NUT issue.

@jimklimov
Member

jimklimov commented Jun 7, 2024

Well, in the case of NUT the PID files are also important for inter-process communications of sorts, such as sending signals to already-running daemons (e.g. to upsmon -c reload live). If a random process lives at that PID, for a majority of NUT programs that run as an unprivileged user like ups or nut it should end up as just an inability to send the signal due to permissions. In this, upsmon may be special as it typically starts as root (so there's a piece with permission to shut down the host) and privileges are dropped for the majority of the program.

@jimklimov jimklimov added service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug upsmon portability We want NUT to build and run everywhere possible impacts-release-2.8.1 Issues reported against NUT release 2.8.1 (maybe vanilla or with minor packaging tweaks) question labels Jun 7, 2024
@yoyoma2
Author

yoyoma2 commented Jun 7, 2024

In the entware distribution of upsmon the -p argument is the only one used so everything is "all root all the time" so sending signals to a random daemon will succeed. My router only has root so that's a rare case where -p makes sense.

Could an argument tell upsmon to run pidof from the OS to double check the PID rather than only consulting the PID file? The entware startup scripts use pidof before launching anything so they already assume all the platforms they support have pidof so entware might consider using such an option especially since using -p.

Just brainstorming crazy ideas on how to never confuse a random process for a NUT program. I'm no expert...

@jimklimov
Member

Adding a dependency on a random program is not a likely way forward. I'm tinkering to check how ps programs on different known OSes do it - so if we can get a name, check it...
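
To illustrate why the lookup differs per platform, here is a hedged shell sketch that is not NUT's implementation (the fix for this issue lives in NUT's C code). It tries Linux procfs first and falls back to the POSIX form `ps -p PID -o comm=`, which some minimal `ps` builds may not support. The function name and the `PROC_ROOT` override are hypothetical.

```shell
#!/bin/sh
# Sketch: best-effort lookup of a process name by PID.
# Assumptions: procfs where available, else a POSIX-style ps.
# PROC_ROOT is a hypothetical override used only for testing.

proc_name() {
    pid=$1
    name=
    if [ -r "${PROC_ROOT:-/proc}/$pid/comm" ]; then
        # Linux: short command name, truncated by the kernel to 15 characters.
        name=$(cat "${PROC_ROOT:-/proc}/$pid/comm")
    else
        # Fallback: may print a full path on some systems, or fail entirely.
        name=$(ps -p "$pid" -o comm= 2>/dev/null) || name=
    fi
    name=${name##*/}                      # strip any leading directory part
    [ -n "$name" ] && printf '%s\n' "$name"
}

# Example use (not run here):
#   [ "$(proc_name 2612)" = "upsmon" ] && echo "PID 2612 is upsmon"
```

An empty result means the name could not be determined, which a caller should treat as "unknown" rather than as proof that the PID is unrelated.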

jimklimov added a commit to jimklimov/nut that referenced this issue Jun 8, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 8, 2024
…ignal*() via old PID only to same progname [networkupstools#2463]

Internal API change for common.c/h

Signed-off-by: Jim Klimov <[email protected]>
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 8, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 8, 2024
@jimklimov
Member

I think I ended up making a universal ps built-in, but whatever. It even works :)

@yoyoma2
Author

yoyoma2 commented Jun 9, 2024

Looked at the pull request, very nice... Makes investigating the cause and logging an issue worthwhile.

@jimklimov
Member

jimklimov commented Jun 9, 2024

Not sure what you meant about using -p up there though, in context of running as root or not.

In the https://github.com/Entware/entware-packages/blob/master/net/nut/files/nut-monitor.init#L14-L16 script I see them setting up RUN_AS_USER nutmon (if no other runas name is specified in "UCI config").

And in https://github.com/Entware/entware-packages/blob/master/net/nut/files/nut-monitor.init#L207 they start it foregrounded (with debug verbosity 1), no -p here.

Do you have a similar version deployed? The one I see in their Git is 2 years old, related to OpenWRT 2022.04...

But then there's also https://github.com/Entware/entware-packages/blob/master/net/nut/files/S15upsmon with -p (5 years, OpenWRT 2019.02), so it is a bit confusing. Which one actually runs?

@yoyoma2
Author

yoyoma2 commented Jun 9, 2024

Yes I have entware deployed on a few routers. The main one where the pid incident occurred has a S15upsmon identical to the third link you posted. That's where I saw the -p as well as the ps command:

# ps | grep up[s]
 2612 root      6544 S    upsmon -p
#

I don't have an entware build environment and know little about their git. They tend to be slow with updating their packages to the latest versions. Is there something you want tested on my backup router?

jimklimov added a commit to jimklimov/nut that referenced this issue Jun 9, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 9, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 9, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 9, 2024
jimklimov added a commit that referenced this issue Jun 10, 2024
@jimklimov
Member

Is there something you want tested on my backup router?

Well, if you were able to build a newer package to install and run on your router - that would be great.

Other than that, I think I can only suggest asking in the Entware community about why they have these different init scripts, where the newer one looks more advanced but an older one is actually used (at least in your platform's builds), and how to rectify that...

@yoyoma2
Author

yoyoma2 commented Jun 11, 2024

I can successfully build and run the released upsmon, but what would I put in this file so the Entware build system downloads pre-release NUT tar.gz and tar.gz.sha256 files?

@jimklimov
Member

jimklimov commented Jun 11, 2024

Good question, as we do not really publish pre-release tarballs for interim branches (or even states of master) - that would need too much storage. The good news is that on a sufficiently prepared system (tools and third-party deps) you can make your own such tarball with make dist, and somehow publish it locally for your Entware build to see (local web server, maybe plain filesystem).

Also a tarball can be left over from make distcheck - and this one we customize for quicker runs in some tests (see the main Makefile.am and ci_build.sh for different distcheck-light etc. definitions): some of this can reduce your third-party tool footprint regarding documentation builds in particular (configure --with-docs=skip IIRC) because docs tools can pull in half of X11 to render PDF etc.
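
A small sketch of the "publish it locally" step, assuming `sha256sum` (coreutils or BusyBox) is available. The function name, the tarball name and the destination directory are placeholders, and the checksum file uses the plain `sha256sum` output format, which may or may not be what the Entware build system expects.

```shell
#!/bin/sh
# Sketch: copy a locally built tarball to a directory and write a checksum
# file next to it. All names and paths here are hypothetical.

publish_tarball() {
    tarball=$1
    dest=$2
    [ -f "$tarball" ] || return 1
    mkdir -p "$dest" || return 1
    cp "$tarball" "$dest/" || return 1
    base=${tarball##*/}
    # Run inside the destination so the checksum file records a bare file name.
    ( cd "$dest" && sha256sum "$base" > "$base.sha256" )
}

# Example use from a prepared NUT checkout (not run here; names are placeholders):
#   make dist && publish_tarball ./nut-X.Y.Z.tar.gz /srv/www/nut
```

The resulting directory can then be served by a local web server or referenced as a plain filesystem path, as suggested above.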

jimklimov added a commit to jimklimov/nut that referenced this issue Jun 13, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 14, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jun 14, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jul 22, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jul 22, 2024
…ed() from checkprocname() and compareprocname() [networkupstools#2463]

Signed-off-by: Jim Klimov <[email protected]>
jimklimov added a commit to jimklimov/nut that referenced this issue Jul 22, 2024
…red() and getprocname(pid) once for several tests against its value, and report it in the end [networkupstools#2463]

Signed-off-by: Jim Klimov <[email protected]>
jimklimov added a commit to jimklimov/nut that referenced this issue Jul 22, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Jul 26, 2024