[Nut-upsdev] some fixes, improvements, and new features (EPO and DYING) for NUT

Fri Mar 9 04:01:39 UTC 2012

On Mar 8, 2012, at 6:21 PM, Greg A. Woods wrote:

> Here are a series of my recent changes to NUT.
> 
> The first few in the set are primarily little fixes and improvements.
> 
> In among those are a few for .gitignore files which of course you can
> ignore for SVN, and there's one for a commit to a generated file which
> of course should not be tracked in any VCS.

We are actually in the process of trying to move the NUT source code over to Git, but neither the git-svn conversion nor the one from Eric S. Raymond's reposurgeon is quite there yet. (We are leaning towards reposurgeon, which involves a little more tweaking of commits, but produces better results for a one-way SVN-to-Git conversion, including .gitignore files generated from svn:ignore properties.)

That said, while we could easily apply these first few patches, I would like to preserve what is left of my sanity (we are still working through a horrible Git/SVN hybrid merge of the NSS SSL code), and defer applying them until we have a native Git tree. That should also keep any of them from falling through the cracks.

> Then there are a couple or three to do with generating the header files
> used by nut-scanner.  These probably could have been collapsed into one,
> but I left them separate to show more clearly what some of the problems
> are with the crazy attempts to use scripts to parse C code instead of
> using the compiler.  The final one in that group is a half-assed attempt
> to generate one of the headers using a helper function directly from the
> compiled data structures it is derived from, and thus totally
> eliminating the need for the broken python script in the first place.
> Even this though is wrong -- the code needing the data structures from
> the driver should be linking directly with shared .o files to access it
> instead of re-inventing new data structures and trying to populate them
> from the existing data structures.  The same thing should be done to
> eliminate the horrid perl script in there too.

I have vented about other issues related to nut-scanner in the past, but with the CI and source control stuff, I haven't had time to fix it personally. My vote would be for applying these, but I'll give the Eaton folks a chance to look at it first.

> I then made some improvements to the SNMP driver to make it actually
> work properly with my AP9605 SNMP card, and which should make it work
> properly now with any SNMP agent implementing APC's POWERNET MIB.

SNMP isn't my area, but sounds good.

> I also discovered the blazer driver does work pretty well with my GE
> Digital Energy GT Series UPS, at least with the 1000-3000 VA models.

Trivial to apply.

> I added some more info about APC cables that I'd been keeping track of
> independently.

Very useful, thanks.

> I had independently made a similar change to the apcsmart driver to keep
> it from failing when tcgetattr() reported some irrelevant differences in
> the port settings.  What's actually in the patch now is my merge of the
> change from upstream which is basically just an "improved" log message.

Agreed.

> I've also added some suggested coding improvements which I think will
> make things easier to maintain down the line, notably using clear syntax
> that's easy to modify safely for defining bit flag value macros, as well
> as a strong suggestion to NEVER EVER use comment syntax to comment out
> code blocks -- always use the pre-processor -- it's much safer!

Agreed in principle, although I haven't looked to see if collapsing any of the unused bits will lead to binary incompatibility. Given how distributions tend to lag behind the latest code, we often suggest that people just drop in a replacement driver to test certain changes without disrupting the rest of the install. These could be completely unwarranted fears on my part, though.
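
For readers who have not opened the patch, here is a minimal sketch of the two conventions under discussion: shift-style bit flag macros and disabling code with the pre-processor. The DEMO_* names and the helper function are hypothetical; they are not taken from the NUT tree or from the patch.

```c
#include <assert.h>

/* Hypothetical flag names, for illustration only. */
#define DEMO_FLAG_RW      (1U << 0)
#define DEMO_FLAG_STRING  (1U << 1)
#define DEMO_FLAG_STATIC  (1U << 2)
#define DEMO_FLAG_ALL     (DEMO_FLAG_RW | DEMO_FLAG_STRING | DEMO_FLAG_STATIC)

#if 0	/* disabled with the pre-processor; the comment below cannot end it early */
#define DEMO_FLAG_OLD     (1U << 3)	/* retired bit */
#endif

/* Return nonzero if every bit in 'wanted' is set in 'flags'. */
static int demo_has_flags(unsigned int flags, unsigned int wanted)
{
	assert((wanted & ~DEMO_FLAG_ALL) == 0);	/* only known bits may be tested */
	return (flags & wanted) == wanted;
}
```

With explicit shift positions, retiring a flag leaves a visible gap instead of renumbering its neighbours, so the remaining values stay the same for anything compiled against the older header.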

> Finally I introduce the first of my new features:  The "EPO" command.
> This is very similar to "FSD", but fundamentally different in that it
> goes a bit deeper into the infrastructure and it has a different purpose
> and ultimate effect on the systems being managed.  The basic idea is to
> provide the moral equivalent, though not in quite such a draconian and
> dangerous hard-core way, of an Emergency Power Off (big red) switch.
> The critical difference with FSD is that EPO is intended to require
> manual human intervention to recover from, and that it is also intended
> to completely and entirely remove power from everything if at all
> possible, even if mains power is still fully and smoothly functioning.
> 
> I'm really not sure if "FSD" has a true purpose other than as a test
> command to see if everything will restart after mains power returns,
> since of course FSD tries to simulate the effect of mains power
> returning after a full shutdown has been committed to and is in
> progress.

Recently, we discussed adding the option for drivers to set FSD if an external shutdown signal has been applied (e.g. if NUT is not the master):

http://article.gmane.org/gmane.comp.monitoring.nut.devel/5925

> EPO on the other hand is a key requirement of my next feature:  The
> ability of a UPS driver to declare that the UPS is dying of some
> critical condition and that it must be shut down in such a way that
> manual human intervention is required to restart it.  EPO is also
> intended to be triggered automatically, whereas FSD (I think) is always
> intended to be manually introduced by a human systems manager.
> 
> I.e. in an ideal configuration everything should restart and reboot and
> return to operational status after "upsmon -c fsd" once mains power
> returns or if power was never actually off; whereas with "upsmon -c epo"
> then everything should power down and stay off even if mains power
> remains on and steady.

This is an interesting distinction (one that a few drivers make in their different shutdown commands, but that is not currently tied to FSD).

The reason why I advocated usurping the "FSD" status was because it is the only other status besides "OB LB" that is currently guaranteed to trigger a shutdown. I wonder if we could just use FSD with some other status option to indicate whether the driver should request a restart when the power returns.
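
To make the "FSD plus a modifier" idea concrete, here is a rough sketch that scans a space-separated status string and picks an action. It assumes status is a list of tokens such as "OL", "OB", "LB" and "FSD". The "NORESTART" token, the enum and the function names are made up for illustration and do not exist in NUT.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if 'token' appears as a whole space-separated word in 'status'. */
static int status_has(const char *status, const char *token)
{
	size_t len = strlen(token);
	const char *p = status;

	assert(len > 0);
	while ((p = strstr(p, token)) != NULL) {
		int starts = (p == status) || (p[-1] == ' ');
		int ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return 1;
		p += len;	/* partial match inside another word; keep looking */
	}
	return 0;
}

enum demo_action { DEMO_NONE, DEMO_SHUTDOWN_RESTART, DEMO_SHUTDOWN_STAY_OFF };

/* Shut down on "FSD" or on "OB" together with "LB"; otherwise do nothing. */
static enum demo_action demo_decide(const char *status)
{
	int must_stop = status_has(status, "FSD") ||
	    (status_has(status, "OB") && status_has(status, "LB"));

	if (!must_stop)
		return DEMO_NONE;
	/* "NORESTART" is a made-up modifier, not a real NUT status token. */
	if (status_has(status, "NORESTART"))
		return DEMO_SHUTDOWN_STAY_OFF;
	return DEMO_SHUTDOWN_RESTART;
}
```

In this sketch the modifier has no effect by itself: it only changes what happens once a shutdown has already been triggered, which keeps the set of shutdown triggers unchanged.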

I've CC'd Bill Elliot to get his thoughts on the use cases that led to suggesting the external shutdown trigger - it might dovetail with this.

> For example I would have used "FSD" to shut down in power blackouts
> where I knew the power could not return before the batteries ran low,
> and thus I would have conserved battery charge for the inevitable short
> hiccups that occur after a long blackout, but still been able to enjoy
> automatic restart after the blackout in case power returns while I'm
> sleeping, etc.
> 
> Finally I add some features to the three drivers I was able to test
> which make use of this new "DYING" state to power things down safely but
> quickly when they detect operating temperatures above a configurable
> maximum value.  The one driver that already supported use of the ALARM
> state also sets an alarm when the temperature rises above a configurable
> warning value.  The idea here is that if the HVAC fails in your computer
> room then you can have everything automatically shut down _AND_ stop
> pouring BTUs out into the room, and of course hopefully first raise an
> alarm so that a human can try to intervene before an emergency power off
> is actually necessary to prevent equipment damage.
> 
> Indeed the motivation behind these new features is because HVAC fails
> far more frequently in my client's server room, and with far more dire
> consequences, than the power fails.  Indeed they have only one tiny UPS
> that can run only the most critical core equipment, but everything has
> come near to suffering serious physical damage when ambient temperatures
> have shot up above 45C in an extremely short time after HVAC failure,
> which of course is usually on a Saturday night.
> 
> These changes are a work in progress to some extent -- I still have not
> fully tested the EPO of a running network, but I hope to do that very
> soon.  The drivers do report alarms (where implemented) and they report
> the "DYING" status when their temperature sensors report above-maximum
> values.

It's definitely a feature I would like to see merged at some point. Now that you mention this, I think there are several UPS protocols which support a bitmask for alarm conditions which will trigger a shutdown (including overtemp). We will want to make sure that the procedure for setting that event mask is not terribly different depending on whether the shutdown is triggered by the UPS hardware, or by NUT monitoring other UPS status (as I believe you are proposing with the DYING status).
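
As a way of picturing the two pieces side by side, here is a small sketch of a two-threshold temperature check (warning raises an alarm, maximum means "dying") next to a shutdown event mask test. All names, the event bits and the idea of sharing one mask between hardware-triggered and NUT-triggered shutdowns are assumptions for illustration; none of this is from the patches or from any particular UPS protocol.

```c
#include <assert.h>

enum demo_temp_state { DEMO_TEMP_OK, DEMO_TEMP_ALARM, DEMO_TEMP_DYING };

/* Hypothetical alarm-condition bits that may trigger a shutdown. */
#define DEMO_EV_OVERTEMP  (1U << 0)
#define DEMO_EV_OVERLOAD  (1U << 1)

/* Classify a reading against configurable warning and maximum values. */
static enum demo_temp_state demo_classify_temp(double temp_c, double warn_c,
    double max_c)
{
	assert(warn_c <= max_c);	/* warning must not exceed the maximum */
	if (temp_c > max_c)
		return DEMO_TEMP_DYING;
	if (temp_c > warn_c)
		return DEMO_TEMP_ALARM;
	return DEMO_TEMP_OK;
}

/* Return nonzero if any active condition is enabled in the shutdown mask. */
static int demo_should_shutdown(unsigned int active, unsigned int mask)
{
	return (active & mask) != 0;
}
```

A driver built along these lines could set DEMO_EV_OVERTEMP in the active set when the classifier returns DEMO_TEMP_DYING, so that the same mask test applies whether the condition came from the UPS hardware or from a monitored reading.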

I admit I haven't had time to read all of the patches that implement this, though, so please correct me if I am making any incorrect assumptions.

-- 
Charles Lepple
clepple at gmail