[Nut-upsuser] upsmon+snmp-ups does not shut down system

Arnaud Quette aquette.dev at gmail.com
Thu Jan 12 10:11:21 UTC 2012


Hi

2012/1/11 William Seligman <seligman at nevis.columbia.edu>

> The problem is solved, but first things first:
>
> On 1/11/12 6:43 AM, Arnaud Quette wrote:
> > 2012/1/9 William Seligman <seligman at nevis.columbia.edu>
> >
> >> On 1/9/12 9:53 AM, Arnaud Quette wrote:
> >>
> >>> 2012/1/6 William Seligman <seligman at nevis.columbia.edu>
> >>>
> >>>> I've googled and RTFM'ed, but still can't solve this one. I hope you
> >>>> folks can.
> >>>>
> >>>> This affects my entire computer cluster, but let's start simple: I've
> >>>> got a computer running NUT; OS is Scientific Linux 5.5; kernel
> >>>> 2.6.18-274.12.1.el5xen. It connects to an APC SMART-UPS via an APC
> >>>> SmartCard using the snmp-ups driver. It generally works: upsmon will
> >>>> detect if the battery is low (I get an e-mail message); I can control
> >>>> the UPS, inspect its variables, set variables, issue commands, and so
> >>>> on.
> >>>
> >>> If "On battery" and "Low battery" are both detected, there should be no
> >>> issue.
> >>>
> >>>> There's just one thing that does not happen: when the UPS goes
> >>>> critical, the computer does not shut down. The upsmon daemon does
> >>>> not display any messages, does not write to the syslog, does not
> >>>> send e-mail, etc., even though I've configured it to do so in
> >>>> upsmon.conf.
> >>>>
> >>>> I've tried nut-2.2.2, nut-2.4.3, and nut-2.6.2, and the symptom is the
> >>>> same.
> >>>
> >>> Using the latest version, when possible, is always a good idea.
> >>
> >> Installing nut-2.6.2 on a Scientific Linux 5.5 system was a bit
> >> difficult, and played havoc with my regular yum updates. After I've
> >> finished debugging this problem, I'm going to completely reinstall the
> >> OS to make sure I've got a consistent set of RPMs.
> >
> > You may have preferred to rebuild an SRPM like this one:
> >
> > http://zid-luxinst.uibk.ac.at/linux/rpm2html/fedora/14/i386/updates/nut-2.6.2-1.fc14.i686.html
>
> That's what I did, at first. The rebuild process for that RPM involves
> "-devel" libraries that are not part of an RHEL5-style distribution. So I
> tried to download and compile the SRPMs for those libraries (neon-devel,
> portman-devel, net-snmp-devel, etc.). This led to a chain of installs and
> the usual RPM hell; I had not appreciated how different RHEL6+ was from
> RHEL5.
>
> Even with all the dependent libraries installed, the nut-2.6.2 SRPM would
> still not rebuild; even though the neon and neon-devel libraries were
> present, the configure script couldn't find them and so the rebuild
> failed.
>
> Finally, I did what I should have done from the start: I just used the
> nut-2.6.2.tar.gz file and built it manually. The configure script still
> couldn't find the neon libraries, but I didn't need that functionality for
> my tests, and this did not block the compilation. The only problem was
> getting the various directory options set so files/binaries would be
> installed in the same directories as in a Redhat distribution. Even then,
> I had to move binaries around post-install.
>
> And after all that work, it still didn't solve the problem. Read on...
>
> >>>> I tried issuing a "graceful reboot" command via the APC SmartCard's
> >>>> web and telnet interfaces. It made no difference; the system still
> >>>> did not shut down.
> >>>>
> >>>> Now let's extend the problem to my cluster: I have a variety of
> >>>> different computers, all running Scientific Linux 5.5, connecting
> >>>> through different switches, connecting to different flavors of APC
> >>>> SMART-UPSes, via SmartCards, each ranging in age from six months to
> >>>> five years. They all exhibit this same symptom, as I painfully
> >>>> discovered during a recent power outage: they all sent me e-mail when
> >>>> the UPSes went to low battery, but none turned off when the UPS went
> >>>> critical. Given the range of hardware involved, this must be a common
> >>>> software problem.
> >>>>
> >>>> The systems will shut down properly if I do "upsmon -c fsd", so it
> >>>> doesn't appear to be a permissions problem.
> >>>>
> >>>> I don't think this is the upsdrv_shutdown() issue described in the
> >>>> snmp-ups man page; I do not care if the UPS shuts down when the
> >>>> computer does, nor do I want it to. I just want upsmon to shut down
> >>>> the system when the UPS goes critical.
> >>>>
> >>>> Here are my config files; the system is tanya, its UPS is tanya-ups.
> >>>> Any advice?
> >>>>
> >>>> ups.conf:
> >>>>
> >>>> [tanya-ups]
> >>>>        driver = snmp-ups
> >>>>        port = tanya-ups
> >>>>        community = private
> >>>>        mibs = apcc
> >>>>
> >>>> upsd.conf:
> >>>>
> >>>> # LISTEN 0.0.0.0 3493
> >>>>
> >>>> upsd.users:
> >>>>
> >>>> [admin]
> >>>>        password = nowayjose
> >>>>        actions = SET
> >>>>        instcmds = all
> >>>>        upsmon master
> >>>>
> >>>
> >>> it's also a good idea to separate monitoring and administrative users.
> >>> Ie:
> >>> [admin]
> >>>        password = XXX
> >>>        actions = SET
> >>>        instcmds = all
> >>>
> >>> [monuser]
> >>>        password = XXX
> >>>        upsmon master
> >>>
> >>>> upsmon.conf:
> >>>>
> >>>> MONITOR tanya-ups@localhost 1 admin nowayjose master
> >>>> MINSUPPLIES 1
> >>>> SHUTDOWNCMD "/sbin/shutdown -h +0"
> >>>> NOTIFYCMD /home/bin/notify.sh # sends me e-mail
> >>>> POLLFREQ 5
> >>>> POLLFREQALERT 5
> >>>> HOSTSYNC 15
> >>>> DEADTIME 15
> >>>> POWERDOWNFLAG /etc/killpower
> >>>> NOTIFYFLAG ONLINE       SYSLOG
> >>>> NOTIFYFLAG ONBATT       SYSLOG+WALL
> >>>> NOTIFYFLAG LOWBATT      SYSLOG+WALL
> >>>> NOTIFYFLAG FSD          SYSLOG+WALL+EXEC
> >>>> NOTIFYFLAG COMMOK       SYSLOG
> >>>> NOTIFYFLAG COMMBAD      SYSLOG
> >>>> NOTIFYFLAG SHUTDOWN     SYSLOG+WALL+EXEC
> >>>> NOTIFYFLAG REPLBATT     SYSLOG+WALL+EXEC
> >>>> NOTIFYFLAG NOCOMM       SYSLOG
> >>>> NOTIFYFLAG NOPARENT     SYSLOG+WALL
> >>>> RBWARNTIME 43200
> >>>> NOCOMMWARNTIME 300
> >>>> FINALDELAY 5
> >>>
> >>> Your config seems fine.
> >>> An interesting test would be to stop upsmon but keep snmp-ups and upsd
> >>> running, then discharge your UPS and check that you do get
> >>> ups.status == "OB LB", which triggers the call to
> >>> upsmon.conf->SHUTDOWNCMD. Note that you need both "OB" and "LB", since
> >>> you may have "low battery" and be "online" at the same time!
> >>
> >> This is a good idea, and I ran the test. I disconnected the UPS, and
> >> periodically checked the output of:
> >>
> >> upsc tanya-ups@localhost ups.status
> >>
> >> Eventually this command returned "OB LB" as you said. But upsmon did
> >> nothing. I waited and eventually the UPS shut power to the system in a
> >> hard crash.
> >
> > ooch, mea culpa!
> > I was too brief in my answer, and forgot to tell you the obvious: unplug
> > your computer from the UPS first, in order to avoid such a crash.
> >
> >> So the UPS is sending the correct signals, and snmp-ups is reporting the
> >> correct status. Is there anything else I can check to trace the cause of
> >> the problem?
> >
> > indeed, though there is an issue, as you've reported initially.
> >
> > Could you do this test again, but this time:
> > - remove your server from the UPS,
> > - start upsmon in debug mode. If it's already started, just call
> >   "upsmon -c stop ; upsmon -DDDDD"
> > and send us back the output, at least when it should see the "OB LB"
> > condition, to see what's going on.
>
> I solved the problem by looking at the code in upsmon.c. I did two stupid
> things:
>
> - I didn't RTFM as much as I thought I had.
>
> - In my rush to trim down the config files for my first message to
>   nut-upsuser, I left out the crucial bits that would have enabled anyone
>   else to help me.
>
> Here's the key: In my upsmon.conf, I actually have two MONITOR lines:
>
> MONITOR tanya-ups@localhost 1 monuser acdc master
> MONITOR network-ups@localhost 1 monuser acdc master
>
> (Note the change to "monuser", indicating that I followed Arnaud's advice.)
>
> I'm using snmp-ups to communicate with my UPS. If the UPS that supplies
> power to the network switch goes critical, I want tanya to power down as
> well; after all, if tanya can't talk to its UPS anymore, it won't know
> when tanya-ups goes critical.
>
> So the intent of the two MONITOR lines is: If either tanya-ups OR
> network-ups goes critical, shut down the system.
>
> But I also had this line in upsmon.conf:
>
> MINSUPPLIES 1
>
> That means the effect of the two MONITOR lines is: If tanya-ups AND
> network-ups go critical, shut down the system.
>
> Since all my tests involved just cutting the power via tanya-ups, upsmon
> wasn't shutting down tanya. It was doing what the configuration file told
> it to do.
>
> The solution is to change the MINSUPPLIES line:
>
> MINSUPPLIES 2
>
> Then upsmon does what I want it to do. I've already confirmed this with
> direct tests. (I also discovered that I had to increase the "low-battery
> duration" parameter on tanya-ups, but that's another story.)
>
> In general, at least for my cluster configuration, the argument to
> MINSUPPLIES should be equal to the number of MONITOR lines I have in
> upsmon.conf.
>
> My confusion was due to my misinterpretation of the language of the
> documentation. The upsmon.conf man page and big-servers.txt both speak
> about power supplies directly connected to the system; I skipped over
> those parts because I thought of only one UPS supplying power to my
> system. In my configuration I have to think of the network switch as part
> of "the system." I should have paid more attention.
>
> Thanks for trying to help me out, Arnaud. It wasn't your fault that I
> didn't give you enough information.
>

Glad to hear that your issue is fixed.
I'll check whether the wording of the documentation can be improved to
avoid this confusion.
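
For anyone else reading this thread, the MINSUPPLIES rule discussed above
can be sketched in a few lines of Python. This is only a simplified model of
the behaviour described in this thread, not the actual upsmon.c code: the
MonitoredUps class and the function names are made up for illustration, and
details such as HOSTSYNC timing and dead-UPS handling are ignored. It
assumes a UPS counts as critical when its status contains both "OB" and
"LB" (or "FSD"), and that upsmon shuts down once the power values of the
non-critical UPSes add up to less than MINSUPPLIES.

```python
from dataclasses import dataclass


@dataclass
class MonitoredUps:
    """One MONITOR line plus the last status seen (hypothetical model)."""
    name: str         # e.g. "tanya-ups@localhost"
    power_value: int  # the number right after the UPS name on the MONITOR line
    status: str       # value of ups.status, e.g. "OL" or "OB LB"


def is_critical(ups):
    """Critical means forced shutdown, or on battery AND low battery."""
    flags = ups.status.split()
    return "FSD" in flags or ("OB" in flags and "LB" in flags)


def should_shut_down(upses, minsupplies):
    """Shut down when the non-critical power values fall below MINSUPPLIES."""
    supplied = sum(u.power_value for u in upses if not is_critical(u))
    return supplied < minsupplies


# The situation from this thread: only tanya-ups has gone critical.
upses = [
    MonitoredUps("tanya-ups@localhost", 1, "OB LB"),
    MonitoredUps("network-ups@localhost", 1, "OL"),
]

print(should_shut_down(upses, minsupplies=1))  # False: one supply still counts
print(should_shut_down(upses, minsupplies=2))  # True: both supplies are required
```

With MINSUPPLIES 1 the remaining network-ups is enough to keep the system
"powered" in upsmon's view, which is why nothing happened in the tests.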

cheers,
Arnaud
-- 
Linux / Unix Expert R&D - Eaton - http://powerquality.eaton.com
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://www.debian.org
Free Software Developer - http://arnaud.quette.free.fr/