[Nut-upsuser] NUT behaviour when master system fails

Jon Clark jon.clark at sheffield.ac.uk
Thu Mar 20 17:06:24 UTC 2008


Hi all,

We have recently bought an APC UPS and are in the process of setting up 
the NUT software to make use of it. We are experiencing a problem with 
the behaviour of the slave systems when the master system goes offline. 
Although failure of our master system should (hopefully) be rare, and we 
hope not to experience many power outages, it is possible (if unlikely) 
that both will occur at the same time. I have searched the list but have 
not found anyone else with this problem. We would appreciate some help 
and advice if possible.

I will first give a very brief overview of our setup, then detail the 
problem, and finally provide detailed information on the setup and its 
configuration.


++ Brief overview of set up.

Our APC UPS is attached to a PC by a serial cable. This PC acts as the 
NUT master system (with NUT server and client software installed) and is 
connected to the network. Two other systems act as NUT slave systems 
(with NUT client software installed). These are also attached to the 
network and monitor the master system over this network connection.

This is a test rig. It has shown the NUT software and UPS to operate 
very successfully in many different circumstances. As stated above, the 
circumstances that lead to our problem should be rare.


++ Details of the problem.

Problem
_______

We have conducted two tests in which the master PC was unexpectedly shut 
down: once while the UPS was On Line (OL) and once while it was On 
Battery (OB). Both tests showed that the slave systems did not register 
the loss of the master system for 15 minutes. This is too long, because 
even a fully charged UPS battery will probably not last for 15 minutes, 
and there is no guarantee that such a failure will occur with a fully 
charged battery.


Our Understanding of the Expected NUT Behaviour
_______________________________________________

It is our understanding that the NUT software process "upsmon" is 
responsible for monitoring the "upsd" process on the master system that 
provides information about the state of the UPS. Each slave system can 
set parameters for the upsmon process (using the NUT configuration file 
"upsmon.conf"). One of these parameters is called "DEADTIME".

The man page for upsmon (upsmon.8) states:

DEAD UPSES
In the event that upsmon can’t reach upsd(8), it declares that UPS dead 
after some interval controlled by DEADTIME in the upsmon.conf(5). If 
this happens while that UPS was last known to be on battery, it is 
assumed to have gone critical and no longer contributes to the overall 
power value.

The parameter DEADTIME has units of seconds. This parameter is set to 
"15" by default, indicating that after 15 seconds of being unable to 
contact the master's upsd process, the slave upsmon process should make 
a decision on whether to shut the system down. (The decision is based on 
the last known state of the UPS [OL or OB] and whether the system has an 
alternative power source.) We have tried changing this parameter on the 
slave systems, but the changes have not affected the 15 minute delay 
between the master shutting down and the slaves registering the absence 
of the master's upsd process.

We expect that if the UPS is OB and the master system is shut down, the 
slaves will begin to shut down after a delay of DEADTIME seconds. It is 
clear that something other than the upsmon DEADTIME parameter is 
affecting the behaviour of the slaves, but we don't know how to alter 
this.
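
To make the expectation concrete, here is a minimal Python sketch of the 
decision we understand upsmon to make from the man page text quoted 
above. It is not taken from the NUT source; the function names are 
hypothetical, and it assumes a single monitored UPS with MINSUPPLIES 1.

```python
def ups_is_dead(last_contact: float, now: float, deadtime: float = 15.0) -> bool:
    """True once upsd has been unreachable for longer than DEADTIME seconds."""
    return (now - last_contact) > deadtime


def slave_should_shut_down(last_status: str, last_contact: float, now: float,
                           deadtime: float = 15.0) -> bool:
    """Sketch of the expected slave decision for one UPS and MINSUPPLIES 1."""
    # A dead UPS that was last known to be on battery (OB) is assumed to
    # have gone critical, so the slave should start shutting down.
    on_battery = "OB" in last_status.split()
    return ups_is_dead(last_contact, now, deadtime) and on_battery
```

With the default DEADTIME of 15, a slave that last saw "OB" and has had 
no contact for 16 seconds would shut down under this model, whereas one 
that last saw "OL" would not.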


A Guess at the Root of this Problem
___________________________________

We have done a little bit of further investigation to try to understand 
what is going on and what we are doing wrong.

By running a slave upsmon process with a debugging flag set it can be 
seen that the 15 minute delay occurs as a result of the upsmon's poll of 
the master's upsd process. Once the master has gone off line, the slave 
upsmon reports:

polling ups: apcups@nutMaster.domain.uk
get_var: apcups@nutMaster.domain.uk / status

and then 'hangs'. A 15 minute delay follows before the polling process 
returns that the master's upsd process is not reachable.

A brief examination of the NUT source code indicates that a system 
"write" call is being used to communicate across the network with the 
upsd process of the master. We think that this call blocks by default, 
and that the default blocking behaviour may be in use here. We don't 
know for certain, and this may be very wide of the mark, but it is the 
best we have come up with!
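
For illustration only, this is the kind of bounded poll we had in mind, 
sketched in Python rather than in NUT's own C code. The function name 
poll_status and the timeout value are our own assumptions; the request 
line and port 3493 are as we understand the upsd network protocol. Every 
socket operation is given a timeout, so a vanished master is reported in 
seconds instead of hanging.

```python
import socket


def poll_status(host: str, port: int, upsname: str, timeout: float = 5.0):
    """Return upsd's reply line, or None if it cannot be obtained in time."""
    try:
        # The timeout applies to the connect and to each send/recv call.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(f"GET VAR {upsname} ups.status\n".encode("ascii"))
            data = b""
            while not data.endswith(b"\n"):
                chunk = sock.recv(1024)
                if not chunk:  # peer closed the connection
                    break
                data += chunk
            return data.decode("ascii", errors="replace").strip() or None
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return None
```

A caller might use poll_status("nutMaster.domain.uk", 3493, "apcups") and 
treat a None result as a failed poll that counts towards DEADTIME.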



We suspect that this problem is caused by our own setup and 
configuration of the NUT software. Has anyone seen similar behaviour? 
Does anyone have any suggestions on how to fix this problem?

Any sharing of knowledge or suggestions will be appreciated.

Best wishes,
Jon Clark



++ Details about the set up

In almost all cases, the default configuration settings are in use where 
possible.


Master Configuration Files
__________________________

ups.conf
--------
$ grep -v "#" ups.conf

[apcups]
driver = apcsmart
port = /dev/ttyS0

upsd.conf
---------
$ grep -v "#" upsd.conf

ACL all 0.0.0.0/0
ACL localhost 127.0.0.1/32
ACL nutMaster xx.xx.xx.xx1/32
ACL nutSlave1 xx.xx.xx.xx7/32
ACL nutSlave2 xx.xx.xx.xx3/32

ACCEPT localhost nutMaster nutSlave1 nutSlave2
REJECT all

upsd.users
----------
$ grep -v "#" upsd.users

[upsadmin]
password = ****
allowfrom = nutMaster
actions = SET
instcmds = ALL

[monmaster]
password = ****
allowfrom = nutMaster
upsmon master

[monslave-nutSlave1]
password = ****
allowfrom = nutSlave1
upsmon slave

[monslave-nutSlave2]
password = ****
allowfrom = nutSlave2
upsmon slave

upsmon.conf
-----------
$ grep -v "#" upsmon.conf

MONITOR apcups@nutMaster.domain.uk 1 monmaster **** master
MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
RBWARNTIME 43200
NOCOMMWARNTIME 300
FINALDELAY 5


Slave Configuration Files
_________________________

(Both slaves have similar settings and exhibit similar behaviour.)

upsmon.conf
-----------
$ grep -v "#" upsmon.conf

MONITOR apcups@nutMaster.domain.uk 1 monslave-nutSlave1 **** slave
MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOCOMMWARNTIME 300
FINALDELAY 0


Computer Operating Systems
__________________________

nutMaster: Scientific Linux 4.4
nutSlave1: Scientific Linux 4.1

(Scientific Linux is a recompile of Red Hat Enterprise Linux.)


NUT Software Versions
_____________________

nutMaster:
- nut-2.2.0-3.3.el4.i386.rpm
- nut-client-2.2.0-3.3.el4.i386.rpm

nutSlave1:
- nut-client-2.2.0-3.3.el4.i386.rpm


UPS Details
___________

Brand: APC
Model: Smart-UPS RT 8000VA RM 230V (XLI)


-- 
----------------------------
Jon Clark
Scientific Officer
Dept. of Applied Mathematics
University of Sheffield
Sheffield, S3 7RH, UK
----------------------------