
Changes between Version 57 and Version 58 of Node Health Check


Timestamp:
11/12/2015 09:30:22 AM (2 years ago)
Author:
mej
Comment:

New site for NHC

  • Node Health Check

    v57 v58  
    11= Warewulf Node Health Check (NHC) = #top 
    22 
    3 [[PageOutline(2-4, Table of Contents, pullout)]] 
     3= PROJECT MOVED = 
    44 
    5    ||= '''NOTE:'''=||'''Although NHC is a subproject of Warewulf, the Warewulf core packages are ''NOT'' required for any part of NHC to function.''' || 
    6  
    7 TORQUE, SLURM, and other schedulers/resource managers provide for a periodic "node health check" to be performed on each compute node to verify that the node is working properly.  Nodes which are determined to be "unhealthy" can be marked as down or offline so as to prevent jobs from being scheduled or run on them.  This helps increase the reliability and throughput of a cluster by reducing preventable job failures due to misconfiguration, hardware failure, etc. 
    8  
    9 Though many sites have created their own scripts to serve this function, the vast majority are one-off efforts with little attention paid to extensibility, flexibility, reliability, speed, or reuse.  The Warewulf developers hope to change that with their Node Health Check project.  Warewulf NHC has several design features that set it apart from most home-grown solutions: 
    10  * Reliable - To prevent single-threaded script execution from causing hangs, execution of subcommands is kept to an absolute minimum, and a watchdog timer is used to terminate the check if it runs for too long. 
    11  * Fast - Implemented almost entirely in native `bash` (2.x or greater).  Reducing pipes and subcommands also cuts down on execution delays and related overhead. 
    12  * Flexible - Anything which can be described in a shell function can be a check.  Modules can also populate cache data and reuse it for multiple checks. 
    13  * Extensible - Its modular functional interface makes writing new checks easy.  Just drop modules into the scripts directory, then add your checks to the config file! 
    14  * Reusable - Written to be ultra-portable and can be used directly from a resource manager or scheduler, run via cron, or even spawned centrally (e.g., via `pdsh`).  The configuration file syntax allows for all compute nodes to share a single configuration. 
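The watchdog idea from the "Reliable" item can be sketched in plain `bash`.  This is only an illustration of the technique, not NHC's actual implementation; the function name `run_with_watchdog` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of NHC): run a command, but send it SIGTERM
# if it is still running after the given number of seconds.
run_with_watchdog() {
    local timeout=$1 rc=0
    shift
    "$@" &                                  # start the command in the background
    local cmd_pid=$!
    # Watchdog: sleep, then terminate the command if it is still alive.
    ( sleep "$timeout"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    local dog_pid=$!
    wait "$cmd_pid" || rc=$?                # 143 (128+15) if killed by SIGTERM
    kill "$dog_pid" 2>/dev/null || true     # command finished first: cancel the watchdog
    wait "$dog_pid" 2>/dev/null || true
    return "$rc"
}
```

For example, `run_with_watchdog 30 some_check` would give the hypothetical `some_check` command at most 30 seconds to finish.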
    15  
    16 In a typical scenario, the Node Health Check script is run periodically on each compute node by the resource manager client daemon (e.g., `pbs_mom`).  NHC loads its configuration file to determine which checks are to be run on the current node (based on its hostname).  Each matching check is run, and if a failure is encountered, NHC will exit with an error message describing the problem.  It can also be configured to mark nodes offline so that the scheduler will not assign jobs to bad nodes, reducing the risk of system-induced job failures.  NHC can also log errors to the syslog (which is often forwarded to the master node).  Some resource managers are even able to use NHC as a pre-job test, keeping scheduled jobs from running on a newly-failed node, and/or a post-job test to remove nodes from the scheduler which may have been adversely affected by a just-completed job. 
    17  
    18 == Getting Started == 
    19  
    20 The following instructions will walk you through downloading and installing Warewulf NHC, configuring it for your system, testing the configuration, and implementing it for use with the TORQUE resource manager. 
    21  
    22 === Installation === 
    23  
    24 Pre-built RPM packages for Red Hat Enterprise Linux versions [http://warewulf.lbl.gov/downloads/repo/rhel5/ 5] and [http://warewulf.lbl.gov/downloads/repo/rhel6/ 6] are available at the [http://warewulf.lbl.gov/downloads/ Download Site].  Simply download the appropriate RPM for your compute nodes (e.g., [http://warewulf.lbl.gov/downloads/repo/rhel6/warewulf-nhc-1.4.1-1.el6.noarch.rpm warewulf-nhc-1.4.1-1.el6.noarch.rpm]) and install it into your compute node VNFS.  (The other Warewulf RPMs are '''NOT''' required for NHC to function.) 
    25  
    26 You can also add our RPM repository to your compute nodes' Yum configuration by following the instructions [wiki:Recipes/Installation#InstallingWarewulfwithYUMRPM in our Wiki]. 
    27  
    28 The [http://warewulf.lbl.gov/downloads/releases/warewulf-nhc/warewulf-nhc-1.4.1.tar.gz source tarball for the latest release] is also available at the [http://warewulf.lbl.gov/downloads/releases/warewulf-nhc/ Download Site].  If you prefer to install from source, or aren't using one of the distributions shown above, use the commands shown here: 
    29  
    30 {{{ 
    31 # ./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec 
    32 # make test 
    33 # make install 
    34 }}} 
    35  
    36 NOTE:  The `make test` step is optional but recommended.  This will run NHC's built-in unit test suite to make sure everything is functioning properly! 
    37  
    38 Whether you use RPMs or install from source, the script will be installed as `/usr/sbin/nhc`, the configuration file and check scripts in `/etc/nhc`, and the helper scripts in `/usr/libexec/nhc`.  Once you've completed one of the 3 installation methods above on your compute nodes' root filesystem image, you can proceed with the configuration. 
    39  
    40 === Sample Configuration === 
    41  
    42 The default configuration supplied with Warewulf NHC is intended to be more of an overview of available checks than a working configuration.  It's essentially impossible to create a default configuration that will work out-of-the-box for any host and still do something useful.  But there are some basic checks which are likely to apply, with some modifications of boundary values, to most systems.  Here's an example `nhc.conf` which shouldn't require too many tweaks to be a solid starting point: 
    43  
    44 {{{ 
    45 # Check that / is mounted read-write. 
    46 * || check_fs_mount_rw / 
    47  
    48 # Check that sshd is running and is owned by root. 
    49 * || check_ps_daemon sshd root 
    50  
    51 # Check that there are 2 physical CPUs, 8 actual cores, and 8 virtual cores (i.e., threads) 
    52 * || check_hw_cpuinfo 2 8 8 
    53  
    54 # Check that we have between 1kB and 1TB of physical RAM 
    55 * || check_hw_physmem 1k 1TB 
    56  
    57 # Check that we have between 1B and 1TB of swap 
    58 * || check_hw_swap 1b 1TB 
    59  
    60 # Check that we have at least some swap free 
    61 * || check_hw_swap_free 1 
    62  
    63 # Check that eth0 is available 
    64 * || check_hw_eth eth0 
    65 }}} 
    66  
    67 Obviously you'll need to adjust the CPU and memory numbers, but this should get you started. 
    68  
    69 ==== Config File Auto-Generation ==== 
    70  
    71 As of version 1.4.1, NHC also ships the `nhc-genconf` utility as an alternative to starting with a basic sample configuration and building on it.  It uses the same shell code as NHC itself to query various attributes of your system (CPU socket/core/thread counts, RAM size, swap size, etc.) and automatically generates an initial configuration file based on its scan.  Simply invoke `nhc-genconf` on each system where NHC will be running.  By default, this will create the file `/etc/nhc/nhc.conf.auto`, which can then be renamed (or used directly via NHC's `-c` option), tweaked, and deployed on your system! 
    72  
    73 Normally the config file which `nhc-genconf` creates will use the hostname of the node on which it was run at the beginning of each line.  This is to allow multiple files to be merged and sorted into a single config that will work across your system.  However, you may wish to provide a custom match expression to prefix each line; this may be done via the `-H` option (e.g., `-H host1` or `-H '*'`). 
    74  
    75 The scan also includes BIOS information obtained via the `dmidecode` command.  The default behavior only includes lines from the output which match the regular expression `/([Ss]peed|[Vv]ersion)/`, but this behavior may be altered by supplying an alternative match string via the `-b` option (e.g., `-b '*release*'`). 
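To see what the default expression keeps, the filter can be tried on a few sample lines.  This is a sketch only: the sample text below is made up to resemble `dmidecode` output, and `grep -E` stands in for whatever matching `nhc-genconf` performs internally.

```shell
# Illustrative sample lines (not real dmidecode output from any system).
sample_bios_output='BIOS Information
    Vendor: Example Vendor
    Version: 2.1a
    Release Date: 01/01/2015
    Max Speed: 3600 MHz
    Current Speed: 2400 MHz'

# Keep only the lines matching the default expression quoted above.
filtered=$(printf '%s\n' "$sample_bios_output" | grep -E '([Ss]peed|[Vv]ersion)')
printf '%s\n' "$filtered"
```

Only the `Version`, `Max Speed`, and `Current Speed` lines survive the filter.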
    76  
    77 It can be incredibly tedious, especially for large, well-established heterogeneous or multi-generational clusters, to gather up all the different types of hardware that exist in your system and write the appropriate NHC config file rules, match expressions, etc.  The following commands might come in handy for aggregating the results of `nhc-genconf` across a large group of nodes: 
    78  
    79 {{{ 
    80 # wwsh ssh 'n*' /usr/sbin/nhc-genconf -H '*' -c - | dshbak -c 
    81  OR 
    82 # pdsh -a /usr/sbin/nhc-genconf -H '*' -c - | dshbak -c 
    83 }}} 
    84  
    85 === Testing === 
    86  
    87 As of version 1.2, NHC comes with a built-in set of fairly extensive unit tests.  Each of the check functions is tested for proper functionality; even the driver script (`/usr/sbin/nhc` itself) is tested!  To run the unit tests, use the `make test` command at the top of the source tree.  You should see something like this: 
    88  
    89 {{{ 
    90 # make test 
    91 make -C test test 
    92 make[1]: Entering directory `/home/mej/svn/warewulf/nhc/test' 
    93 Running unit tests for NHC: 
    94 nhcmain_init_env...ok 6/6 
    95 nhcmain_finalize_env...ok 14/14 
    96 nhcmain_check_conffile...ok 1/1 
    97 nhcmain_load_scripts...ok 6/6 
    98 nhcmain_set_watchdog...ok 1/1 
    99 nhcmain_run_checks...ok 2/2 
    100 common.nhc...ok 18/18 
    101 ww_fs.nhc...ok 61/61 
    102 ww_hw.nhc...ok 65/65 
    103 ww_job.nhc...ok 2/2 
    104 ww_nv.nhc...ok 4/4 
    105 ww_ps.nhc...ok 32/32 
    106 All 212 tests passed. 
    107 make[1]: Leaving directory `/home/mej/svn/warewulf/nhc/test' 
    108 # 
    109 }}} 
    110  
    111 If everything works properly, all the unit tests should pass.  Any failures represent a problem that should be reported to the [mailto:warewulf@lbl.gov developers]! 
    112  
    113 Before adding the node health check to your resource manager (RM) configuration, it's usually prudent to do a test run to make sure it's installed/configured/running properly first.  To do this, simply run `/usr/sbin/nhc` with no parameters.  Successful execution will result in no output and an exit code of 0.  If this is what you get, you're done testing!  Skip to the next section. 
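The "no output and an exit code of 0" criterion can be checked mechanically.  The helper below is a hypothetical convenience wrapper, not something shipped with NHC; it accepts any command so that it can be tried without NHC installed.

```shell
# Hypothetical helper: succeed only if the given command exits 0 and prints nothing.
check_quiet_success() {
    local output rc=0
    output=$("$@" 2>&1) || rc=$?
    if [[ $rc -eq 0 && -z $output ]]; then
        echo "OK"
    else
        echo "FAILED (exit code $rc): $output"
        return 1
    fi
}
```

On a compute node this would be invoked as `check_quiet_success /usr/sbin/nhc`.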
    114  
    115 If you receive an error, it will look similar to the following: 
    116  
    117 {{{ 
    118 ERROR Health check failed:  Actual CPU core count (2) does not match expected (8). 
    119 }}} 
    120  
    121 Depending on which check failed, the message will vary.  Hopefully it will be clear what the discrepancy is based on the content of the message.  Adjust your configuration file to match your system and try again.  If you need help, feel free to post to the [mailto:warewulf@lbl.gov Warewulf Mailing List]. 
    122  
    123 Additional information may be found in `/var/log/nhc.log`, the runtime logfile for NHC.  A successful run based on the configuration above will look something like this: 
    124  
    125 {{{ 
    126 Node Health Check starting. 
    127 Running check:  "check_fs_mount_rw /" 
    128 Running check:  "check_ps_daemon sshd root" 
    129 Running check:  "check_hw_cpuinfo 2 8 8" 
    130 Running check:  "check_hw_physmem 1024 1073741824" 
    131 Running check:  "check_hw_swap 1 1073741824" 
    132 Running check:  "check_hw_swap_free 1" 
    133 Running check:  "check_hw_eth eth0" 
    134 Node Health Check completed successfully (1s). 
    135 }}} 
    136  
    137 A failure will look like this: 
    138  
    139 {{{ 
    140 Node Health Check starting. 
    141 Running check:  "check_fs_mount_rw /" 
    142 Running check:  "check_ps_daemon sshd root" 
    143 Running check:  "check_hw_cpuinfo 2 8 8" 
    144 Health check failed:  Actual CPU core count (2) does not match expected (8). 
    145 }}} 
    146  
    147 We can see from the excerpt here that the `check_hw_cpuinfo` check failed and that the machine we ran on appears to be a dual-socket single-core system (2 cores total).  Since our configuration expected a dual-socket quad-core system (8 cores total), this was flagged as a failure.  Since we're testing our configuration, this is most likely a mismatch between what we told NHC to expect and what the system actually has, so we need to fix the configuration file.  Once we have a working configuration and have gone into production, a failure like this would likely represent a hardware issue. 
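When the log grows long, it can help to pull out just the failures along with the check that was running at the time.  The following is a small sketch; `show_failed_checks` is a hypothetical helper, and it assumes the log lines look like the excerpts shown above.

```shell
# Hypothetical helper: print each failure in an NHC-style log together with
# the most recent "Running check:" line that preceded it.
show_failed_checks() {
    awk '/Running check:/       { last = $0 }
         /Health check failed:/ { print last; print $0 }' "$1"
}

# Demonstration using the failure excerpt shown above.
log=$(mktemp)
cat > "$log" <<'EOF'
Node Health Check starting.
Running check:  "check_fs_mount_rw /"
Running check:  "check_ps_daemon sshd root"
Running check:  "check_hw_cpuinfo 2 8 8"
Health check failed:  Actual CPU core count (2) does not match expected (8).
EOF
result=$(show_failed_checks "$log")
printf '%s\n' "$result"
rm -f "$log"
```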
    148  
    149 Once the configuration has been modified, try running `/usr/sbin/nhc` again.  Continue fixing the discrepancies and re-running the script until it succeeds; then, proceed with the next section. 
    150  
    151 === Implementation === 
    152  
    153 Instructions for putting NHC into production depend entirely on your use case.  We can't possibly hope to delineate them all, but we'll cover some of the most common. 
    154  
    155 ==== TORQUE Integration ==== 
    156  
    157 NHC can be executed by the `pbs_mom` process at regular intervals, job start, and/or job end.  More detailed information on how to configure the `pbs_mom` health check can be found in the [http://docs.adaptivecomputing.com/torque/help.htm#topics/11-troubleshooting/computeNodeHealthCheck.htm TORQUE Documentation].  The configuration used here at LBNL is as follows: 
    158  
    159 {{{ 
    160 $node_check_script /usr/sbin/nhc 
    161 $node_check_interval 5,jobstart,jobend 
    162 $down_on_error 1 
    163 }}} 
    164  
    165 This causes `pbs_mom` to launch `/usr/sbin/nhc` every 5 "MOM intervals" (each interval is 45 seconds by default, so roughly every 225 seconds), when starting a job, and when a job completes (or is terminated).  Failures will cause the node to be marked as "down." 
    166  
    167 ||= '''NOTE:''' =|| Some concern has been expressed over the possibility for "OS jitter" caused by the NHC.  No significant jitter has been experienced so far (and similar checks at similar intervals are used on ''extremely'' jitter-sensitive systems); however, increase the interval to `80` instead of `5` for once-hourly checks if you suspect NHC-generated jitter to be an issue for your system. || 
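The arithmetic behind the `5` and `80` settings is easy to verify, assuming the default 45-second MOM interval mentioned above:

```shell
mom_interval=45   # seconds per MOM interval (the default stated above)
for n in 5 80; do
    echo "$n MOM intervals = $(( n * mom_interval )) seconds"
done
```

This prints 225 seconds for `5` and 3600 seconds (one hour) for `80`.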
    168  
    169 In addition, NHC will by default mark the node "offline" (i.e., `pbsnodes -o`) and add a note (viewable with `pbsnodes -ln`) specifying the failure.  Once the failure has been corrected and NHC completes successfully, it will remove the note it set and clear the "offline" status from the node.  In order for this to work, however, each node must have "operator" access to the TORQUE daemon.  Unfortunately, the support for wildcards in `pbs_server` attributes is limited to replacing the host, subdomain, and/or domain portions with asterisks, so for most setups this will likely require omitting the entire hostname section.  The following has been tested and is known to work: 
    170  
    171 {{{ 
    172 qmgr -c "set server operators += root@*" 
    173 }}} 
    174  
    175 This functionality is not strictly required, but it makes determining the reason nodes are marked down significantly easier! 
    176  
    177 Another possible caveat to this functionality is that it only works if the canonical hostname (as returned by the `hostname` command or the file `/proc/sys/kernel/hostname`) of each node matches its identity within TORQUE.  If your site uses FQDNs on compute nodes but has them listed in TORQUE using the short versions, you will need to add something like this to the top of your NHC configuration file: 
    178  
    179 {{{ 
    180 * || HOSTNAME="$HOSTNAME_S" 
    181 }}} 
    182  
    183 This will cause the offline/online helpers to use the shorter hostname when invoking `pbsnodes`.  This will NOT, however, change how the hostnames are matched in the NHC configuration, so you'll still need to use FQDN matching there. 
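For reference, the short name is simply the canonical hostname truncated at the first `.`.  The same truncation can be reproduced with a `bash` parameter expansion; the variable names and the FQDN below are made up for illustration.

```shell
HOSTNAME_FULL="n0001.cluster.example.com"   # hypothetical FQDN
HOSTNAME_SHORT=${HOSTNAME_FULL%%.*}         # drop everything from the first "." onward
echo "$HOSTNAME_SHORT"                      # prints: n0001
```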
    184  
    185 It's also important to note here that NHC will only set a note on nodes that don't already have one (and aren't yet offline) or have one set by NHC itself; also, it will only online nodes and clear notes if it sees a note that was set by NHC.  It looks for the string "NHC:" in the note to distinguish between notes set by NHC and notes set by operators.  If you use this feature, and you need to mark nodes offline manually (e.g., for testing), setting a note when doing so is strongly encouraged.  (You can do this via the `-N` option, like this:  `pbsnodes -o -N 'Testing stuff' n0000 n0001 n0002`)  There was a bug in versions prior to 1.2.1 which would cause it to treat nodes with no notes the same way it treats nodes with NHC-assigned notes.  This ''should'' be fixed in 1.2.1 and higher, but you never know.... 
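The rules in the preceding paragraph can be summarized as a pair of predicates.  This is a sketch of the described behavior only, not NHC's actual helper code, and the function names are hypothetical.

```shell
# Hypothetical helpers mirroring the rules described above.
note_is_from_nhc() { [[ $1 == *"NHC:"* ]]; }

# May NHC set a note?  $1 = current note, $2 = 1 if the node is already offline.
may_set_note() {
    local note=$1 offline=$2
    if [[ -z $note ]]; then
        [[ $offline -eq 0 ]]        # no note: only if the node is not yet offline
    else
        note_is_from_nhc "$note"    # existing note: only if NHC set it
    fi
}

# May NHC clear the note and bring the node back online?
may_online() { note_is_from_nhc "$1"; }
```

With these rules, a node taken offline by hand with the note `Testing stuff` is left alone, while a node carrying a note that contains `NHC:` can be updated or brought back online automatically.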
    186  
    187 ==== SLURM Integration ==== 
    188  
    189 Add the following to `/etc/slurm.conf` (or `/etc/slurm/slurm.conf` depending on version) on your master node '''AND''' your compute nodes: 
    190  
    191 {{{ 
    192 HealthCheckProgram=/usr/sbin/nhc 
    193 HealthCheckInterval=300 
    194 }}} 
    195  
    196 This will execute NHC every 5 minutes. 
    197  
    198 For optimal support of SLURM, NHC version 1.3 or higher is recommended.  Prior versions will require manual intervention. 
    199  
    200 ==== Periodic Execution ==== 
    201  
    202 The legacy method for doing this was to either employ a simple `crontab` entry, like this one: 
    203  
    204 {{{ 
    205 MAILTO=operators@your.com 
    206 */5 * * * * /usr/sbin/nhc 
    207 }}} 
    208  
    209 which will result in an e-mail being sent if the health check fails, or to use the contributed `nhc.cron` script.  However, the former technique resulted in a flood of e-mail when a problem arose, and the latter had no clean way of dealing with multiple contexts and could not be set up to do periodic reminders of issues.  It also would fail to notify if a new problem was detected before or at the same time as the old problem was resolved. 
    210  
    211 Version 1.4.1 introduces a vastly superior option:  `nhc-wrapper`.  This tool will execute `nhc`^([#note1 1])^ and record the results.  It then compares the results to the output of the previous run, if present, and will ignore results that are identical to those previously obtained.  Old results can be set to expire after a given length of time (and thus re-reported).  Results may be echoed to stdout or sent via e-mail.  Once an unrecognized command line option or non-option argument is encountered, it and the rest of the command line arguments are passed to the wrapped program intact. 
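The compare-with-previous-run idea can be illustrated with a few lines of `bash`.  This is a simplified sketch, not the actual `nhc-wrapper` logic: the function name and state file are hypothetical, and result expiration is not modeled.

```shell
# Hypothetical sketch: report a result only when it differs from the one
# saved by the previous run, then remember it for next time.
report_if_changed() {
    local state_file=$1 current=$2 previous=""
    [[ -f $state_file ]] && previous=$(<"$state_file")
    printf '%s' "$current" > "$state_file"
    if [[ $current != "$previous" ]]; then
        echo "CHANGED: ${current:-<no errors>}"
    fi
}
```

Calling it twice with the same error text reports the error only once; a later call with an empty result reports that the error has cleared.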
    212  
    213 This tool will typically be run via `cron(8)`.  It can be used to wrap distinct contexts of NHC in a manner identical to NHC itself (i.e., specified via executable name or command line arg).  Unlike the older `nhc.cron` script, it also compares the results themselves rather than only distinguishing between the presence and absence of output, and those results can have a finite lifespan. 
    214  
    215 `nhc-wrapper` also offers another option for periodic execution:  looping (`-L`).  When launched from a terminal or `inittab`/`init.d` entry in looping mode, `nhc-wrapper` will execute a loop which runs the wrapped program (e.g., `nhc`) at a time interval you supply.  It attempts to be smart about interpreting your intent as well, calculating sleep times after subprogram execution (i.e., the interval is from start time to start time, not end time to start time) and using nice, round execution times when applicable (i.e., based on 00:00 local time instead of whatever random time the wrapper loop happened to be entered).  For example, if you ask it to run every 5 minutes, it'll run at :00, :05, :10, :15, etc.  If you ask for every 4 hours, it'll run at 00:00, 04:00, 08:00, 12:00, 16:00, and 20:00 exactly. 
    216  
    217 This allows the user to run `nhc-wrapper` in a terminal to keep tabs on it while still running checks at predictable times (just like `crond` would).  It also has some flags to provide timestamps (`-L t`) and/or ASCII horizontal rulers (`-L r`) between executions.  Clearing the screen (`-L c`) before each execution (`watch`-style) is also available. 
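The "nice, round execution times" behavior amounts to aligning each start time to a multiple of the interval, counted from 00:00 local time.  The sketch below shows one way to compute that; it is an assumption about the arithmetic involved, not the wrapper's actual code, and `next_aligned_start` is a hypothetical name.

```shell
# Hypothetical helper: given the current time as seconds since local midnight
# and an interval in seconds, print the next aligned start time (also in
# seconds since midnight).
next_aligned_start() {
    local now=$1 interval=$2
    echo $(( (now / interval + 1) * interval ))
}

next_aligned_start 450 300       # 00:07:30, every 5 minutes -> 600   (00:10:00)
next_aligned_start 18000 14400   # 05:00:00, every 4 hours   -> 28800 (08:00:00)
```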
    218  
    219  
    220 '''''Examples:''''' 
    221  
    222 To run `nhc` and notify `root` when errors appear, are cleared, or every 12 hours while they persist: 
    223 {{{ 
    224 # /usr/sbin/nhc-wrapper -M root -X 12h 
    225 }}} 
    226  
    227 Same as above, but run the "nhc-cron" context instead (`nhc -n nhc-cron`): 
    228 {{{ 
    229 # /usr/sbin/nhc-wrapper -M root -X 12h -n nhc-cron 
    230   OR 
    231 # /usr/sbin/nhc-wrapper -M root -X 12h -A '-n nhc-cron' 
    232 }}} 
    233  
    234 Same as above, but run `nhc-cron` (symlink to `nhc`) instead: 
    235 {{{ 
    236 # /usr/sbin/nhc-wrapper -M root -X 12h -P nhc-cron 
    237   OR 
    238 # ln -s nhc-wrapper /usr/sbin/nhc-cron-wrapper 
    239 # /usr/sbin/nhc-cron-wrapper -M root -X 12h 
    240 }}} 
    241  
    242 Expire results after 1 week, 1 day, 1 hour, 1 minute, and 1 second: 
    243 {{{ 
    244 # /usr/sbin/nhc-wrapper -M root -X 1w1d1h1m1s 
    245 }}} 
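To make the compound duration syntax concrete, here is a sketch of how a string such as `1w1d1h1m1s` maps to seconds.  It is not `nhc-wrapper`'s actual parser; the function name is hypothetical, and only the `w`, `d`, `h`, `m`, and `s` suffixes shown in the examples are handled.

```shell
# Hypothetical sketch: convert a duration such as "1w1d1h1m1s" to seconds.
duration_to_seconds() {
    local spec=$1 total=0 num unit
    local re='^([0-9]+)([wdhms])(.*)$'
    while [[ $spec =~ $re ]]; do
        num=${BASH_REMATCH[1]}
        unit=${BASH_REMATCH[2]}
        spec=${BASH_REMATCH[3]}       # remainder still to be parsed
        case $unit in
            w) total=$(( total + 10#$num * 604800 )) ;;
            d) total=$(( total + 10#$num * 86400 )) ;;
            h) total=$(( total + 10#$num * 3600 )) ;;
            m) total=$(( total + 10#$num * 60 )) ;;
            s) total=$(( total + 10#$num )) ;;
        esac
    done
    [[ -z $spec ]] || return 1        # leftover text means the spec was malformed
    echo "$total"
}

duration_to_seconds 12h           # prints 43200
duration_to_seconds 1w1d1h1m1s    # prints 694861
```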
    246  
    247 Run verbosely, looping every minute with ruler and timestamp: 
    248 {{{ 
    249 # /usr/sbin/nhc-wrapper -L tr1m -V 
    250 }}} 
    251  
    252 Or for something quieter and more `cron`-like: 
    253 {{{ 
    254 # /usr/sbin/nhc-wrapper -L 1h -M root -X 12h 
    255 }}} 
    256  
    257  
    258 == Configuration == 
    259  
    260 Now that you have a basic working configuration, we'll go more in-depth into how NHC is configured, including command-line invocation, configuration file syntax, modes of operation, how individual checks are matched against a node's hostname, and what checks are already available in the NHC distribution for your immediate use. 
    261  
    262 Configuration of NHC is generally done in one of 3 ways:  passing option flags and/or configuration (i.e., environment) variables on the command line, setting variables and specifying checks in the configuration file (`/etc/nhc/nhc.conf` by default), and/or setting variables in the sysconfig initialization file (`/etc/sysconfig/nhc` by default).  The last of these works essentially the same as any other sysconfig file (it is directly sourced into NHC's `bash` session using the `.` operator), so this document does not go into great detail about using it.  The following sections discuss the first two mechanisms. 
    263  
    264 === Command-Line Invocation === 
    265  
    266 From version 1.3 onward, NHC supports a subset of command-line options and arguments in addition to the configuration and sysconfig files.  A few specific settings have CLI options associated with them as shown in the table below; additionally, any configuration variable which is valid in the configuration or sysconfig file may also be passed on the command line instead. 
    267  
    268 ==== Options ==== 
    269  
    270 ||= '''Command-Line Option''' =||= '''Equivalent Configuration Variable''' =||= '''Purpose''' =|| 
    271 || `-D` ''`confdir`'' || `CONFDIR=`''`confdir`'' || Use config directory ''`confdir`'' (default: `/etc/`''`name`'') || 
    272 || `-a` || `NHC_CHECK_ALL=1` || Run ALL checks; don't exit on first failure (useful for `cron`-based monitoring) || 
    273 || `-c` ''`conffile`'' || `CONFFILE=`''`conffile`'' || Load config from ''`conffile`'' (default: ''`confdir`''`/`''`name`''`.conf`) || 
    274 || `-d` || `DEBUG=1` || Activate debugging output || 
    275 || `-f` || `NHC_CHECK_FORKED=1` || Run each check in a separate background process (''EXPERIMENTAL'') || 
    276 || `-h` || N/A || Show command line help || 
    277 || `-l` ''`logspec`'' || `LOGFILE=`''`logspec`'' || File name/path or BASH-syntax directive for logging output (`-` for `STDOUT`) || 
    278 || `-n` ''`name`'' || `NAME=`''`name`'' || Set program name to ''`name`'' (default: `nhc`); see -D & -c || 
    279 || `-q` || `SILENT=1` || Run quietly || 
    280 || `-t` ''`timeout`'' || `TIMEOUT=`''`timeout`'' || Use timeout of ''`timeout`'' seconds (default: 30) || 
    281 || `-v` || `VERBOSE=1` || Run verbosely (i.e., show check progress) || 
    282  
    283 '''NOTE:''' Due to the use of the `getopts` built-in of `bash`, and the limitations thereof, POSIX-style bundling of options (e.g., `-da`) is NOT supported, and all command-line options MUST PRECEDE any additional variable/value-type arguments! 
    284  
    285 ==== Variable/Value Arguments ==== 
    286  
    287 Instead of, or possibly in addition to, the use of command-line options, NHC accepts configuration via variables specified on the command line.  Simply pass any number of ''`VARIABLE`''`=`''`value`'' arguments on the command line, and each variable will be set to its respective value immediately upon NHC startup.  This happens before the sysconfig file is loaded, so it can be used to alter such values as `$SYSCONFIGDIR` (`/etc/sysconfig` by default) which would normally be unmodifiable. 
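As an illustration of how ''`VARIABLE`''`=`''`value`'' arguments can be turned into shell variables, consider the sketch below.  It is not NHC's actual startup code; `apply_var_args` is a hypothetical name, and the two variables in the demonstration are the ones used in the example invocations later in this section.

```shell
# Hypothetical sketch: set a shell variable for each VARIABLE=value argument.
apply_var_args() {
    local arg name value
    local re='^([A-Za-z_][A-Za-z0-9_]*)=(.*)$'
    for arg in "$@"; do
        if [[ $arg =~ $re ]]; then
            name=${BASH_REMATCH[1]}
            value=${BASH_REMATCH[2]}
            printf -v "$name" '%s' "$value"   # assign to the named variable
        else
            echo "Ignoring non-assignment argument: $arg" >&2
        fi
    done
}

apply_var_args NHC_RM=slurm SYSCONFIGDIR=/opt/etc/sysconfig
echo "$NHC_RM $SYSCONFIGDIR"   # prints: slurm /opt/etc/sysconfig
```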
    288  
    289 It's important to note that while command-line configuration directives will override NHC's built-in defaults for various variables, variables set in the configuration file (see below) will NOT be overridden.  The config file takes precedence over the command line, in contrast to most other CLI tools out there (and possibly contrary to user expectation) due to the way `bash` deals with variables and initialization.  If you want the command line to take precedence, you'll need to test the value of the variable in the config file and only alter it if the current value matches NHC's built-in default. 
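The "only alter it if the current value matches the built-in default" test might look like the following sketch.  The helper name and the site value of `120` are hypothetical; the default of `30` comes from the options table above.

```shell
BUILTIN_TIMEOUT=30   # NHC's built-in default timeout, per the options table

# Hypothetical helper: choose the effective timeout given the value currently
# in effect (the built-in default or a command-line override).
effective_timeout() {
    local current=$1 site_value=120    # 120 is an arbitrary example value
    if [[ $current == "$BUILTIN_TIMEOUT" ]]; then
        echo "$site_value"             # still at the default: apply the site setting
    else
        echo "$current"                # changed on the command line: leave it alone
    fi
}

effective_timeout 30    # prints 120
effective_timeout 60    # prints 60
```

One limitation of this approach is that a command-line value that happens to equal the built-in default cannot be distinguished from no override at all.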
    290  
    291 ==== Example Invocations ==== 
    292  
    293 Most sites just run `nhc` by itself with no options when launching from a resource manager daemon.  However, when running from cron or manually at the command line, NHC can be invoked in many other ways.  Here are some real-world examples. 
    294  
    295 To run in debug mode, either of the following two command lines may be used: 
    296 {{{ 
    297 # nhc -d 
    298 # nhc DEBUG=1 
    299 }}} 
    300  
    301 To run for testing purposes in debug mode with no timeout and with node online/offline disabled: 
    302 {{{ 
    303 # nhc -d -t 0 MARK_OFFLINE=0 
    304 }}} 
    305  
    306 To force use of SLURM as the resource manager and use a sysconfig path in `/opt`: 
    307 {{{ 
    308 # nhc NHC_RM=slurm SYSCONFIGDIR=/opt/etc/sysconfig 
    309 }}} 
    310  
    311 To run NHC out-of-band (e.g., from cron) with the name `nhc-oob` (which will load its config from `/etc/sysconfig/nhc-oob` and `/etc/nhc/nhc-oob.conf`): 
    312 {{{ 
    313 # nhc -n nhc-oob 
    314 }}} 
    315 '''NOTE''':  As an alternative, you may symlink `/usr/sbin/nhc-oob` to `nhc` and run `nhc-oob` instead.  This will accomplish the same thing. 
    316  
    317 === Configuration File Syntax === 
    318  
    319 The configuration file is fairly straightforward.  Stored by default in `/etc/nhc/nhc.conf`, the file is plain text and recognizes the traditional `#` introducer for comments.  Any line that starts with a `#` (with or without leading whitespace) is ignored.  Blank lines are also ignored. 
    320  
    321 Examples: 
    322 {{{ 
    323 # This is a comment. 
    324        # This is also a comment. 
    325 # This line and the next one will both be ignored. 
    326  
    327 }}} 
    328  
    329 Configuration lines contain a '''target''' specifier, the separator string `||`, and the '''check''' command.  The target specifies which hosts should execute the check; only nodes whose hostname matches the given target will execute the check on that line.  All other nodes will ignore it and proceed to the next check. 
    330  
    331 A check is simply a shell command.  All NHC checks are bash functions defined in the various included files in `/etc/nhc/scripts/*.nhc`, but in actuality any valid shell command that properly returns success or failure will work.  This documentation and all examples will only reference bash function checks.  Each check can take zero or more arguments and is executed exactly as seen in the configuration. 
    332  
    333 As of version 1.2, configuration variables may also be set in the config file with the same syntax.  This makes it easy to alter specific settings, commands, etc. globally or for individual hosts/hostgroups! 
    334  
    335 Example: 
    336 {{{ 
    337     * || SOMEVAR="value" 
    338     * || check_something 
    339 *.foo || another_check 1 2 3 
    340 }}} 
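To make the line format concrete, the sketch below splits a configuration line into its target and check parts, skipping comments and blank lines.  It is an illustration of the syntax described above rather than NHC's actual parser; `trim`, `parse_config_line`, `TARGET`, and `CHECK` are hypothetical names.

```shell
# Hypothetical helper: strip leading and trailing whitespace.
trim() {
    local s=$1
    s=${s#"${s%%[![:space:]]*}"}   # remove leading whitespace
    s=${s%"${s##*[![:space:]]}"}   # remove trailing whitespace
    printf '%s' "$s"
}

# Hypothetical sketch: on success, sets TARGET and CHECK from a config line.
parse_config_line() {
    local line
    line=$(trim "$1")
    [[ -z $line || $line == \#* ]] && return 1   # blank line or comment
    [[ $line == *"||"* ]] || return 1            # the "||" separator is required
    TARGET=$(trim "${line%%||*}")                # text before the first "||"
    CHECK=$(trim "${line#*||}")                  # text after the first "||"
}

parse_config_line '*.foo || another_check 1 2 3' && echo "target=$TARGET check=$CHECK"
```

Because only the first `||` is treated as the separator, a check that itself contains `||` is preserved intact in `CHECK`.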
    341  
    342 === Matching Hosts === 
    343  
    344 Three separate methods for specifying host matches are supported as of version 1.2.2.  The default style is a '''glob''', also known as a wildcard.  bash will determine if the hostname of the node (specifically, the contents of `/proc/sys/kernel/hostname`) matches the supplied glob expression (e.g., `n*.viz`) and execute only those checks which have matching target expressions.  If the hostname does not match the glob, the corresponding check is ignored. 
    345  
    346 The second method for specifying host matches is via regular expression.  Regexp targets must be surrounded by slashes to identify them as regular expressions.  The internal regexp matching engine of bash is used to compare the hostname to the given regular expression.  For example, given a target of `/^n00[0-5][0-9]\.cc2$/`, the corresponding check would execute on `n0017.cc2` but not on `n0017.cc1` or `n0083.cc2`. 
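Both of these styles map directly onto `bash`'s `[[ ... ]]` construct.  The following sketch shows the idea using the hostnames from the examples above; `match_target` is a hypothetical function, not NHC's own matching code.

```shell
# Hypothetical sketch of glob and regexp target matching.
match_target() {
    local host=$1 target=$2 re
    if [[ $target == /*/ ]]; then
        re=${target#/}            # strip the surrounding slashes
        re=${re%/}
        [[ $host =~ $re ]]        # regular expression match
    else
        [[ $host == $target ]]    # unquoted right-hand side: glob match
    fi
}

match_target n0017.cc2 '/^n00[0-5][0-9]\.cc2$/' && echo "n0017.cc2 matches"
match_target n0083.cc2 '/^n00[0-5][0-9]\.cc2$/' || echo "n0083.cc2 does not match"
```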
    347  
    348 The third method for matching hosts is via node range expressions similar to those used by `pdsh`, Warewulf, and other open source HPC tools.  (''Please note that not all expressions supported by `pdsh` and Warewulf will work in NHC.'')  The match expression is placed in curly braces and specifies one or more comma-separated node name ranges, and the corresponding check will only execute on nodes which fall into at least one of the specified ranges.  Note that only one bracketed range is supported per node name range, and commas within brackets are not supported.  So, for example, the target `{n00[00-99].phys,n000[0-4].bio}` would cause its check to execute on `n0030.phys`, `n0099.phys`, and `n0001.bio`, but not on `n0100.phys` nor `n0005.bio`.  Expressions such as `{n[0-3]0[00-49].r[00-29]}` and `{n00[00-29,54,87].sci}` are not supported (though the latter may be written instead as `{n00[00-29].sci,n0054.sci,n0087.sci}`). 
    349  
    350 Examples: 
    351 {{{ 
    352                 *  || valid_check1 
    353        /n000[0-9]/ || valid_check2 
    354       {n00[20-39]} || valid_check3 
    355  {n03,n05,n0[7-9]} || valid_check4 
    356    {n00[10-21,23]} || this_target_is_invalid 
    357 }}} 
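A node range expression of the supported form (at most one bracketed range per comma-separated entry) can be evaluated with a short `bash` function.  The sketch below is not NHC's implementation; `match_range` is a hypothetical name, and it assumes that the numeric part of the hostname is zero-padded to the same width as the range bounds.

```shell
# Hypothetical sketch of node range matching.
match_range() {
    local host=$1 expr=$2 piece mid prefix lo hi suffix
    local re='^([^][]*)\[([0-9]+)-([0-9]+)\](.*)$'
    local -a pieces
    expr=${expr#\{}                       # strip the surrounding braces
    expr=${expr%\}}
    IFS=, read -ra pieces <<< "$expr"     # split on commas without globbing
    for piece in "${pieces[@]}"; do
        if [[ $piece =~ $re ]]; then
            prefix=${BASH_REMATCH[1]} lo=${BASH_REMATCH[2]}
            hi=${BASH_REMATCH[3]} suffix=${BASH_REMATCH[4]}
            [[ $host == "$prefix"*"$suffix" ]] || continue
            mid=${host#"$prefix"}         # isolate the variable part of the name
            mid=${mid%"$suffix"}
            # It must be numeric, of the same width as the bounds, and in range.
            [[ $mid =~ ^[0-9]+$ && ${#mid} -eq ${#lo} ]] || continue
            (( 10#$mid >= 10#$lo && 10#$mid <= 10#$hi )) && return 0
        elif [[ $host == "$piece" ]]; then
            return 0                      # plain node name: exact match
        fi
    done
    return 1
}

match_range n0030.phys '{n00[00-99].phys,n000[0-4].bio}' && echo "n0030.phys matches"
match_range n0005.bio  '{n00[00-99].phys,n000[0-4].bio}' || echo "n0005.bio does not match"
```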
    358  
    359 Throughout the rest of the documentation, we will refer to this concept as a '''match string'''.  Anywhere a match string is expected, a glob, a regular expression surrounded by slashes, or a node range expression in braces may be specified. 
    360  
    361 === Supported Variables === 
    362  
    363 As mentioned above, version 1.2 and higher support setting/changing shell variables within the configuration file.  Many aspects of NHC's behavior can be modified through the use of shell variables, including a number of the commands in the various checks and helper scripts NHC employs. 
    364  
    365 There are, however, some variables which can only be specified in `/etc/sysconfig/nhc`, the global initial settings file for NHC.  This is typically for obvious reasons (e.g., you can't change the path to the config file from within the config file!). 
    366  
    367 The table below provides a list of the configuration variables which may be used to modify NHC's behavior; those which won't work in a config file (only sysconfig or command line) are marked with an asterisk ("*"): 
    368  
    369 ||= '''Variable Name''' =||= '''Default Value''' =||= '''Purpose''' =|| 
    370 || *CONFDIR || `/etc/nhc` || Directory for NHC configuration data || 
    371 || *CONFFILE || `$CONFDIR/$NAME.conf` || Path to NHC config file  || 
    372 || DEBUG || `0` || Set to `1` to activate debugging output || 
    373 || *DETACHED_MODE || `0` || Set to `1` to activate [#DetachedMode Detached Mode] || 
    374 || *DETACHED_MODE_FAIL_NODATA || `0` || Set to `1` to cause [#DetachedMode Detached Mode] to fail if no prior check result exists || 
    375 || DF_CMD || `df` || Command used by `check_fs_free`, `check_fs_size`, and `check_fs_used` || 
    376 || DF_FLAGS || `-Tka` || Flags to pass to `$DF_CMD` for space checks. '''''NOTE:''  Adding the `-l` flag is ''strongly'' recommended if only checking local filesystems.''' || 
    377 || DFI_CMD || `df` || Command used by `check_fs_inodes`, `check_fs_ifree`, and `check_fs_iused` || 
    378 || DFI_FLAGS || `-Tia` || Flags to pass to `$DFI_CMD`. '''''NOTE:''  Adding the `-l` flag is ''strongly'' recommended if only checking local filesystems.''' || 
    379 || *FORCE_SETSID || `1` ||  Re-execute NHC as a session leader if it isn't already one at startup || 
    380 || *HELPERDIR || `/usr/libexec/nhc` || Directory for NHC helper scripts || 
    381 || *HOSTNAME || Set from `/proc/sys/kernel/hostname` || Canonical name of current node || 
    382 || *HOSTNAME_S || `$HOSTNAME` truncated at first `.` || Short name (no domain or subdomain) of current node || 
    383 || IGNORE_EMPTY_NOTE || `0` || Set to `1` to treat empty notes like NHC-assigned notes (<1.2.1 behavior) || 
    384 || *INCDIR || `$CONFDIR/scripts` || Directory for NHC check scripts || 
    385 || JOBFILE_PATH || TORQUE/PBS:  `$PBS_SERVER_HOME/mom_priv/jobs`[[br]]SLURM:  `$SLURM_SERVER_HOME` || Directory on compute nodes where job records are kept || 
    386 || *LOGFILE || `>>/var/log/nhc.log` || File name/path or BASH-syntax directive for logging output (`-` for `STDOUT`) || 
    387 || LSF_BADMIN || `badmin` || Command to use for LSF's `badmin` (may include path) || 
    388 || LSF_BHOSTS || `bhosts` || Command to use for LSF's `bhosts` (may include path) || 
    389 || LSF_OFFLINE_ARGS || `hclose -C` || Arguments to LSF's `badmin` to offline node || 
    390 || LSF_ONLINE_ARGS || `hopen` || Arguments to LSF's `badmin` to online node || 
    391 || MARK_OFFLINE || `1` || Set to `0` to disable marking nodes offline on check failure || 
    392 || MAX_SYS_UID || `99` || UIDs <= this number are exempt from rogue process checks || 
    393 || MCELOG || `mcelog` || Command to use to check for MCE log errors || 
    394 || MCELOG_ARGS || `--client` || Parameters passed to `$MCELOG` command || 
    395 || MCELOG_MAX_CORRECTED_RATE || `9` || Maximum number of '''corrected''' MCEs allowed before `check_hw_mcelog()` returns failure || 
    396 || MCELOG_MAX_UNCORRECTED_RATE || `0` || Maximum number of '''uncorrected''' MCEs allowed before `check_hw_mcelog()` returns failure || 
    397 || MDIAG_CMD || `mdiag` || Command to use to invoke Moab's `mdiag` command (may include path) || 
    398 || *NAME || `nhc` || Used to populate default paths/filenames for configuration || 
    399 || NHC_AUTH_USERS || `root nobody` || Users authorized to have jobs running on compute nodes || 
    400 || NHC_CHECK_ALL || `0` || Forces all checks to be non-fatal.  Displays each failure message, reports total number of failed checks, and returns that number. || 
    401 || NHC_CHECK_FORKED || `0` || Forces each check to be executed in a separate forked subprocess.  NHC attempts to detect directives which set environment variables to avoid forking those.  Enhances resiliency if checks hang. || 
    402 || NHC_RM || Auto-detected || Resource manager with which to interact (`pbs`, `slurm`, `sge`, or `lsf`) || 
    403 || NVIDIA_HEALTHMON || `nvidia-healthmon` || Command used by `check_nv_healthmon` to check nVidia GPU status || 
    404 || NVIDIA_HEALTHMON_ARGS || `-e -v` || Arguments to `$NVIDIA_HEALTHMON` command || 
    405 || OFFLINE_NODE || `$HELPERDIR/node-mark-offline` || Helper script used to mark nodes offline || 
    406 || ONLINE_NODE || `$HELPERDIR/node-mark-online` || Helper script used to mark nodes online || 
    407 || PASSWD_DATA_SRC || `/etc/passwd` || Colon-delimited file in standard passwd format from which to load user account data || 
    408 || PATH || `/sbin:/usr/sbin:/bin:/usr/bin` || If a path is not specified for a particular command, this variable defines the search directory order. || 
    409 || PBSNODES || `pbsnodes` || Command used by above helper scripts to mark nodes online/offline || 
    410 || PBSNODES_LIST_ARGS || `-n -l all` || Arguments to `$PBSNODES` to list nodes and their status notes || 
    411 || PBSNODES_OFFLINE_ARGS || `-o -N` || Arguments to `$PBSNODES` to mark node offline with note || 
    412 || PBSNODES_ONLINE_ARGS || `-c -N` || Arguments to `$PBSNODES` to mark node online with note || 
    413 || PBS_SERVER_HOME || `/var/spool/torque` || Directory for TORQUE files || 
    414 || RESULTFILE || `/var/run/nhc/$NAME.status` || Used in [#DetachedMode Detached Mode] to store result of checks for subsequent handling || 
    415 || RM_DAEMON_MATCH || TORQUE/PBS:  `/\bpbs_mom\b/`[[br]]SLURM:  `/\bslurmd\b/`[[br]]SGE/UGE:  `/\bsge_execd\b/` || Used by `check_ps_userproc_lineage` to make sure all user processes were spawned by the RM daemon || 
    416 || SILENT || `0` || Set to `1` to disable logging via `$LOGFILE` || 
    417 || SLURM_SCONTROL || `scontrol` || Command to use for SLURM's `scontrol` (may include path) || 
    418 || SLURM_SC_OFFLINE_ARGS || `update State=DRAIN` || Arguments to pass to SLURM's `scontrol` to offline a node || 
    419 || SLURM_SC_ONLINE_ARGS || `update State=IDLE` || Arguments to pass to SLURM's `scontrol` to online a node || 
    420 || SLURM_SERVER_HOME || `/var/spool/slurmd` || Location of SLURM data files (see also:  `$JOBFILE_PATH`) || 
    421 || SLURM_SINFO || `sinfo` || Command to use for SLURM's `sinfo` (may include path) || 
    422 || STAT_CMD || `/usr/bin/stat` || Command to use to `stat()` files || 
    423 || STAT_FMT_ARGS || `-c` || Parameter to introduce format string to `stat` command || 
    424 || *TIMEOUT || `30` || Watchdog timer (in seconds) || 
    425 || VERBOSE || `0` || Set to `1` to display each check line before it's executed || 
    426  
    427 '''Example usage:''' 
    428 {{{ 
    429        * || export PATH="$PATH:/opt/torque/bin:/opt/torque/sbin" 
    430   n*.rh6 || MAX_SYS_UID=499 
    431   n*.deb || MAX_SYS_UID=999 
    432   *.test || DEBUG=1 
    433        * || export MARK_OFFLINE=0 
    434        * || NVIDIA_HEALTHMON="/global/software/rhel-6.x86_64/modules/nvidia/tdk/3.304.3/nvidia-healthmon/nvidia-healthmon" 
    435 }}} 
    436  
    437 === Detached Mode === 
    438  
    439 Version 1.2 and higher support a feature called "detached mode."  When this feature is activated in `/etc/sysconfig/nhc` (by setting `DETACHED_MODE=1`), the `nhc` process will immediately fork itself.  On the first execution, when no previous result exists, the foreground (parent) process will immediately return success (unless `DETACHED_MODE_FAIL_NODATA=1` is set).  The child process will run all the checks and record the results in `$RESULTFILE` (default:  `/var/run/nhc/$NAME.status`).  The next time `nhc` is executed, just before forking off the child process (which will again run the checks in the background), it will load the results of the last execution from `$RESULTFILE`.  Once the child process has been spawned, the parent will then return those previous results to its caller. 
    440  
    441 The advantage of detached mode is that any hangs or long-running commands which occur in the checks will not cause the resource manager daemon (e.g., `pbs_mom`) to block.  Sites that use home-grown health check scripts often use a similar technique for this very reason -- it's non-blocking. 
    442  
    443 However, a word of caution:  if a detached-mode `nhc` encounters a failure, it won't get acted upon until the '''next execution'''.  So let's say you have NHC configured to run only on job start and job end.  Let's further suppose that the `/tmp` filesystem encounters an error and gets remounted read-only at some point after the completion of the last job and that you have `check_fs_mount_rw /tmp` in your `nhc.conf`.  In normal mode, when a new job tries to start, `nhc` will detect the read-only mount on job start and will take the node out of service before the job is allowed to begin executing on the node.  In detached mode, however, since `nhc` has not been run in the meantime, and the previous run was successful, `nhc` will return success and allow the job to start ''before'' the error condition is noticed! 
    444  
    445 For this reason, when using detached mode, periodic checks are HIGHLY recommended.  This will not completely prevent the above scenario, but it will drastically reduce the odds of it occurring.  Users of detached mode, as with any similar method of delayed reporting, must be aware of and accept this caveat in exchange for the benefits of the more-fully-non-blocking behavior. 
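
The one-run delay described above is easier to see in a simplified `bash` sketch.  This is an illustration only, not NHC's implementation:  `run_checks`, `detached_run`, the demo path in `/tmp`, and the one-line "status message" layout of the result file are all assumptions made for the example.

{{{
#!sh
# Sketch only: run_checks and the "<status> <message>" file layout are
# assumptions made for this example, not NHC's real internals.
RESULTFILE="${RESULTFILE:-/tmp/nhc-sketch.status}"

run_checks() {
    # Stand-in for the real checks: print a message, return a status.
    echo "check_fs_mount_rw:  /tmp not mounted read-write"
    return 1
}

detached_run() {
    local prev_rc=0 prev_msg=""

    # 1. Load whatever the previous background run recorded, if anything.
    if [[ -r "$RESULTFILE" ]]; then
        read -r prev_rc prev_msg < "$RESULTFILE"
    fi

    # 2. Start this run's checks in the background; the outcome is stored
    #    for the NEXT invocation, not reported by this one.
    (
        msg="$(run_checks)"
        rc=$?
        printf '%s %s\n' "$rc" "$msg" > "$RESULTFILE"
    ) &

    # 3. Return the previous result to the caller without waiting.
    if [[ -n "$prev_msg" ]]; then
        echo "$prev_msg"
    fi
    return "${prev_rc:-0}"
}
}}}

In this sketch the first call to `detached_run` returns success even though `run_checks` fails; the failure only surfaces on the second call, which is exactly why periodic execution matters in detached mode.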
    446  
    447 === Built-in Checks === 
    448  
    449 ''In the documentation below, parameters surrounded by square brackets ([like this]) are ''optional''.  All others are required.'' 
    450  
    451 The Warewulf Node Health Check distribution supplies the following checks: 
    452  
    453 ===== check_cmd_output ===== 
    454 {{{ 
    455 check_cmd_output [-t timeout] [-r retval] [-m [!]match [...]] { -e 'command [arg1 [...]]' | command [arg1 [...]] } 
    456 }}} 
    457  
    458   `check_cmd_output` executes a ''`command`'' and compares each line of its output against any ''`match`'' strings passed in.  If any positive match __is not__ found in the command output, or if any negative match __is__ found, the check fails.  The check also fails if the exit status of ''`command`'' does not match ''`retval`'' (if supplied) or if the ''`command`'' fails to complete within ''`timeout`'' seconds (default 5).  Options to this check are as follows: 
    459  
    460   ||= '''Check Option''' =||= '''Purpose''' =|| 
    461   || `-e` ''`command`'' || Execute ''`command`'' and gather its output.  The ''`command`'' is split on word boundaries, much like `/bin/sh -c '`''`command`''`'` does. || 
    462   || `-m` `[!]`''`match`'' || If negated, no line of the output may match the specified ''`match`'' expression.  Otherwise, at least one line must match. || 
    463   || `-r` ''`retval`'' || Exit status (a.k.a. return code or return value) of ''`command`'' must equal ''`retval`'' or the check will fail. || 
    464   || `-t` ''`secs`'' || Command will timeout if not completed within ''`secs`'' seconds (default is 5). || 
    465  
    466   '''NOTE''':  If the ''`command`'' is passed using `-e`, the ''`command`'' string is split on word boundaries to create the `argv[]` array for the command.  If passed on the end of the check line, DO NOT quote the command.  Each parameter must be distinct.  Only use quotes to group multiple words into a single argument.  For example, passing ''`command`'' as `"service bind restart"` will work if used with `-e` but will fail if passed at the end of the check line (use without quotes instead)! 
    467  
    468   '''''Example''' (Verify that the `rpcbind` service is alive)'':  `check_cmd_output -t 1 -r 0 -m '/is running/' /sbin/service rpcbind status` 
    469  
    470 ===== check_cmd_status ===== 
    471 {{{ 
    472 check_cmd_status [-t timeout] -r retval command [arg1 [...]] 
    473 }}} 
    474  
    475   `check_cmd_status` executes a ''`command`'' and redirects its output to `/dev/null`.  The check fails if the exit status of ''`command`'' does not match ''`retval`'' or if the ''`command`'' fails to complete within ''`timeout`'' seconds (default 5).  Options to this check are as follows: 
    476  
    477   ||= '''Check Option''' =||= '''Purpose''' =|| 
    478   || `-r` ''`retval`'' || Exit status (a.k.a. return code or return value) of ''`command`'' must equal ''`retval`'' or the check will fail. || 
    479   || `-t` ''`secs`'' || Command will timeout if not completed within ''`secs`'' seconds (default is 5). || 
    480  
    481   '''''Example''' (Make sure SELinux is disabled)'':  `check_cmd_status -t 1 -r 1 selinuxenabled` 
    482  
    483 ===== check_dmi_data_match ===== 
    484 {{{ 
    485 check_dmi_data_match [-h handle] [-t type] [-n | '!'] string 
    486 }}} 
    487  
    488   `check_dmi_data_match` uses parsed, structured data taken from the output of the `dmidecode` command to allow the administrator to make very specific assertions regarding the contents of the DMI (a.k.a. SMBIOS) data.  Matches can be made against any output or against specific types (classifications of data) or even handles (identifiers of data blocks, typically sequential).  Output is restructured such that sections which are indented underneath a section header have the text of the section header prepended to the output line along with a colon and intervening space.  So, for example, the string "<tab><tab>ISA is supported" which appears underneath the "Characteristics:" header, which in turn is underneath the "BIOS Information" header/type, would be parsed by `check_dmi_data_match` as "BIOS Information: Characteristics: ISA is supported". 
    489  
    490   See the `dmidecode` man page for more details. 
    491  
    492   '''''Example''' (check for BIOS version)'':  `check_dmi_data_match "BIOS Information: Version: 1.0"` 
    493  
    494 ===== check_dmi_raw_data_match ===== 
    495 {{{ 
    496 check_dmi_raw_data_match ['!'] string 
    497 }}} 
    498  
    499   `check_dmi_raw_data_match` is basically like a `grep` on the raw output of the `dmidecode` command.  If you don't need to match specific strings in specific sections but just want to match a particular string anywhere in the raw output, you can use this check instead of `check_dmi_data_match` (above) to avoid the additional overhead of parsing the output into handles, types, and expanded strings. 
    500  
    501   '''''Example''' (check for BIOS version in raw output; could really match any version)'':  `check_dmi_raw_data_match "Version: 1.0"` 
    502  
    503 ===== check_file_contents ===== 
    504 {{{ 
    505 check_file_contents file matchexpression [...] 
    506 }}} 
    507  
    508   `check_file_contents` looks at the specified file and allows one or more glob or regular expression matches to be applied to the contents of the file.  The check fails unless ALL specified expressions successfully match the file content, but the order in which they appear in the file need not match the order specified on the check line.  No post-processing is done on the file, but take care to quote any shell metacharacters in your match expressions properly.  Also remember that matching against the contents of large files will slow down NHC and potentially cause a timeout.  Reading of the file stops when all match expressions have been successfully found in the file. 
    509  
    510   The file is only read once per invocation of `check_file_contents`, so if you need to match several expressions in the same file, passing them all to the same check is advisable. 
    511  
    512   '''''Example''' (verify setting of $pbsserver in pbs_mom config)'':  `check_file_contents /var/spool/torque/mom_priv/config '/^\$pbsserver master$/'` 
    513  
    514 ===== check_file_stat ===== 
    515 {{{ 
    516 check_file_stat [-D num] [-G name] [-M mode] [-N secs] [-O secs] [-T num] [-U name] [-d num] [-g gid] [-m mode] [-n secs] [-o secs] [-t num] [-u uid] filename(s) 
    517 }}} 
    518  
    519   `check_file_stat` allows the user to assert specific properties on one or more files, directories, and/or other filesystem objects based on metadata returned by the Linux/Unix `stat` command.  Each option specifies a test which is applied to each of the ''filename(s)'' in order.  The check fails if any of the comparisons does not match.  Options to this check are as follows: 
    520  
    521   ||= '''Check Option''' =||= '''Purpose''' =|| 
    522   || `-D` ''`num`'' || Specifies that the device ID for ''filename(s)'' should be ''`num`'' (decimal or hex) || 
    523   || `-G` ''`name`'' || Specifies that ''filename(s)'' should be owned by group ''`name`'' || 
    524   || `-M` ''`mode`'' || Specifies that the permissions for ''filename(s)'' should include at LEAST the bits set in ''`mode`'' || 
    525   || `-N` ''`secs`'' || Specifies that the `ctime` (i.e., inode change time) of ''filename(s)'' should be newer than ''`secs`'' seconds ago || 
    526   || `-O` ''`secs`'' || Specifies that the `ctime` (i.e., inode change time) of ''filename(s)'' should be older than ''`secs`'' seconds ago || 
    527   || `-T` ''`num`'' || Specifies that the minor device number for ''filename(s)'' be ''`num`'' || 
    528   || `-U` ''`name`'' || Specifies that ''filename(s)'' should be owned by user ''`name`'' || 
    529   || `-d` ''`num`'' || Specifies that the device ID for ''filename(s)'' should be ''`num`'' (decimal or hex) || 
    530   || `-g` ''`gid`'' || Specifies that ''filename(s)'' should be owned by group id ''`gid`'' || 
    531   || `-m` ''`mode`'' || Specifies that the permissions for ''filename(s)'' should EXACTLY match ''`mode`'' || 
    532   || `-n` ''`secs`'' || Specifies that the `mtime` (i.e., modification time) of ''filename(s)'' should be newer than ''`secs`'' seconds ago || 
    533   || `-o` ''`secs`'' || Specifies that the `mtime` (i.e., modification time) of ''filename(s)'' should be older than ''`secs`'' seconds ago || 
    534   || `-t` ''`num`'' || Specifies that the major device number for ''filename(s)'' be ''`num`'' || 
    535   || `-u` ''`uid`'' || Specifies that ''filename(s)'' should be owned by uid ''`uid`'' || 
    536  
    537   '''''Example''' (Assert correct uid, gid, owner, group, & major/minor device numbers for `/dev/null`)'':  `check_file_stat -u 0 -g 0 -U root -G root -t 1 -T 3 /dev/null` 
    538  
    539 ===== check_file_test ===== 
    540 {{{ 
    541 check_file_test [-a] [-b] [-c] [-d] [-e] [-f] [-g] [-h] [-k] [-p] [-r] [-s] [-t] [-u] [-w] [-x] [-O] [-G] [-L] [-S] [-N] filename(s) 
    542 }}} 
    543  
    544   `check_file_test` allows the user to assert very simple attributes on one or more files, directories, and/or other filesystem objects based on tests which can be performed via the shell's built-in `test` command.  Each option specifies a test which is applied to each of the ''filename(s)'' in order.  NHC internally evaluates the shell expression `test `''`option filename`'' for each option given for each ''filename'' specified.  (In other words, passing 2 options and 3 filenames will evaluate 6 `test` expressions in total.)  The check fails if any of the `test` command evaluations returns false.  For efficiency, this check should be used in preference to `check_file_stat` whenever possible as it does not require calling out to the `stat` command.  Options to this check are as follows: 
    545  
    546   ||= '''Check Option''' =||= '''Purpose''' =|| 
    547   || `-a` || Evaluates to true if the ''`filename`'' being tested exists (same as `-e`). || 
    548   || `-b` || Evaluates to true if the ''`filename`'' being tested exists and is block special. || 
    549   || `-c` || Evaluates to true if the ''`filename`'' being tested exists and is character special. || 
    550   || `-d` || Evaluates to true if the ''`filename`'' being tested exists and is a directory. || 
    551   || `-e` || Evaluates to true if the ''`filename`'' being tested exists. || 
    552   || `-f` || Evaluates to true if the ''`filename`'' being tested exists and is a regular file. || 
    553   || `-g` || Evaluates to true if the ''`filename`'' being tested exists and is setgid. || 
    554   || `-h` || Evaluates to true if the ''`filename`'' being tested exists and is a symbolic link. || 
    555   || `-k` || Evaluates to true if the ''`filename`'' being tested exists and has its sticky bit set. || 
    556   || `-p` || Evaluates to true if the ''`filename`'' being tested exists and is a named pipe. || 
    557   || `-r` || Evaluates to true if the ''`filename`'' being tested exists and is readable. || 
    558   || `-s` || Evaluates to true if the ''`filename`'' being tested exists and is not empty. || 
    559   || `-t` || Evaluates to true if the ''`filename`'' being tested is a numeric file descriptor which references a valid tty. || 
    560   || `-u` || Evaluates to true if the ''`filename`'' being tested exists and is setuid. || 
    561   || `-w` || Evaluates to true if the ''`filename`'' being tested exists and is writable. || 
    562   || `-x` || Evaluates to true if the ''`filename`'' being tested exists and is executable. || 
    563   || `-O` || Evaluates to true if the ''`filename`'' being tested exists and is owned by NHC's EUID. || 
    564   || `-G` || Evaluates to true if the ''`filename`'' being tested exists and is owned by NHC's EGID. || 
    565   || `-L` || Evaluates to true if the ''`filename`'' being tested exists and is a symbolic link (same as `-h`). || 
    566   || `-S` || Evaluates to true if the ''`filename`'' being tested exists and is a socket. || 
    567   || `-N` || Evaluates to true if the ''`filename`'' being tested exists and has been modified since it was last read. || 
    568  
    569   '''''Example''' (Assert correct ownerships and permissions on `/dev/null` similar to above, assuming NHC runs as root)'':  `check_file_test -O -G -c -r -w /dev/null` 
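
  The "options times filenames" evaluation described above can be sketched in `bash` as follows.  This is an illustration rather than NHC's actual code; `file_test_sketch` is a hypothetical name, and the sketch simply treats every two-character argument beginning with `-` as an option.

{{{
#!sh
# Hypothetical illustration (not NHC's code): apply every option to every
# filename via the shell's built-in "test" and stop at the first failure.
file_test_sketch() {
    local -a opts=() files=()
    local arg opt file count=0

    # Assumption: two-character arguments starting with "-" are options.
    for arg in "$@"; do
        if [[ "$arg" == -? ]]; then
            opts+=("$arg")
        else
            files+=("$arg")
        fi
    done

    for file in "${files[@]}"; do
        for opt in "${opts[@]}"; do
            count=$(( count + 1 ))
            if ! test "$opt" "$file"; then
                echo "FAILED:  test $opt $file"
                return 1
            fi
        done
    done
    echo "$count test expressions evaluated"
    return 0
}
}}}

  Calling the sketch with 2 options and 3 filenames reports 6 evaluated expressions when every test passes, and no external command such as `stat` is executed along the way.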
    570  
    571 ===== check_fs_inodes ===== 
    572 {{{ 
    573 check_fs_inodes mountpoint [min] [max] 
    574 }}} 
    575  
    576   Ensures that the specified ''mountpoint'' has at least ''min'' but no more than ''max'' total inodes.  Either may be blank. 
    577  
    578   '''WARNING:'''  Use of this check requires execution of the `/usr/bin/df` command which may HANG in cases of NFS failure!  If you use this check, consider also using [#DetachedMode Detached Mode]! 
    579  
    580   '''''Example''' (make sure /tmp has at least 1000 inodes)'':  `check_fs_inodes /tmp 1k` 
    581  
    582 ===== check_fs_ifree ===== 
    583 {{{ 
    584 check_fs_ifree mountpoint min 
    585 }}} 
    586  
    587   Ensures that the specified ''mountpoint'' has at least ''min'' free inodes. 
    588  
    589   '''WARNING:'''  Use of this check requires execution of the `/usr/bin/df` command which may HANG in cases of NFS failure!  If you use this check, consider also using [#DetachedMode Detached Mode]! 
    590  
    591   '''''Example''' (make sure /local has at least 100 inodes free)'':  `check_fs_ifree /local 100` 
    592  
    593 ===== check_fs_iused ===== 
    594 {{{ 
    595 check_fs_iused mountpoint max 
    596 }}} 
    597  
    598   Ensures that the specified ''mountpoint'' has no more than ''max'' used inodes. 
    599  
    600   '''WARNING:'''  Use of this check requires execution of the `/usr/bin/df` command which may HANG in cases of NFS failure!  If you use this check, consider also using [#DetachedMode Detached Mode]! 
    601  
    602   '''''Example''' (make sure /tmp has no more than 1 million used inodes)'':  `check_fs_iused /tmp 1M` 
    603  
    604 ===== check_fs_mount ===== 
    605 {{{ 
    606 check_fs_mount [-0] [-r] [-t fstype] [-s source] [-o options] [-O remount_options] [-e missing_action] [-E found_action] {-f|-F} mountpoint [...] 
    607 }}} 
    608   '''-OR-''' (''deprecated'') 
    609  
    610 {{{ 
    611 check_fs_mount mountpoint [source] [fstype] [options] 
    612 }}} 
    613  
    614   `check_fs_mount` examines the list of mounted filesystems on the local machine to verify that the specified entry is present.  ''mountpoint'' specifies the directory on the node where the filesystem should be mounted.  ''source'' is a '''match string''' (see previous section) which is compared against the device, whatever that may be (e.g., ''server'':/''path'' for NFS or `/dev/sda1` for local).  ''fstype'' is a match string for the filesystem type (e.g., `nfs`, `ext4`, `tmpfs`).  ''options'' is a match string for the mount options.  Any number (zero or more) of these 3 items (i.e., sources, types, and/or options) may be specified; additionally, one or more mountpoints may be specified.  Use `-f` for normal filesystems and `-F` for auto-mounted filesystems (to trigger them to be mounted prior to performing the check). 
    615  
    616   Unless the `-0` (non-fatal) option is given, this check will fail if any of the specified filesystems are not found or do not match the type(s)/source(s)/option(s) specified.  The `-r` (remount) option will cause NHC to attempt to re-mount the filesystem by issuing the system command "`mount -o `''`remount_options`''` `''`filesystem`''" in the background as root.  This is "best effort," so success or failure of the mount attempt is not taken into account.  If specified, ''missing_action'' is executed if a filesystem is not found.  Also, if specified, ''found_action'' is executed for each filesystem which __is__ found and correctly mounted. 
    617  
    618   '''''Example''' (check for NFS hard-mounted /home from bluearc1:/global/home and mount if missing)'':  `check_fs_mount -r -s bluearc1:/global/home -t nfs -o *hard* -f /home` 
    619  
    620 ===== check_fs_mount_ro ===== 
    621 {{{ 
    622 check_fs_mount_ro [-0] [-r] [-t fstype] [-s source] [-o options] [-O remount_options] [-e missing_action] [-E found_action] -f mountpoint 
    623 }}} 
    624   '''-OR-''' (''deprecated'') 
    625  
    626 {{{ 
    627 check_fs_mount_ro mountpoint [source] [fstype] 
    628 }}} 
    629  
    630   Checks that a particular filesystem is mounted read-only.  Shortcut for `check_fs_mount -o '/(^|,)ro($|,)/' `''`...`'' 
    631  
    632 ===== check_fs_mount_rw ===== 
    633 {{{ 
    634 check_fs_mount_rw [-0] [-r] [-t fstype] [-s source] [-o options] [-O remount_options] [-e missing_action] [-E found_action] -f mountpoint 
    635 }}} 
    636   '''-OR-''' (''deprecated'') 
    637  
    638 {{{ 
    639 check_fs_mount_rw mountpoint [source] [fstype] 
    640 }}} 
    641  
    642   Checks that a particular filesystem is mounted read-write.  Shortcut for `check_fs_mount -o '/(^|,)rw($|,)/' `''`...`'' 
    643  
    644 ===== check_fs_free ===== 
    645 {{{ 
    646 check_fs_free mountpoint minfree 
    647 }}} 
    648  
    649   (Version 1.2+) Checks that a particular filesystem has at least ''minfree'' space available.  The value for ''minfree'' may be specified either as a percentage or a numerical value with an optional suffix (`k` or `kB` for kilobytes, the default; `M` or `MB` for megabytes; `G` or `GB` for gigabytes; etc., all case insensitive). 
    650  
    651   '''WARNING:'''  Use of this check requires execution of the `/usr/bin/df` command which may HANG in cases of NFS failure!  If you use this check, consider also using [#DetachedMode Detached Mode]! 
    652  
    653   '''''Example''''':  `check_fs_free /tmp 128MB` 
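
  To make the suffix handling concrete, here is a small `bash` sketch that converts a size such as `128MB` into kilobytes.  It is not taken from NHC; `to_kb` is a hypothetical helper, it assumes binary multiples (1 MB = 1024 kB), and it does not handle percentages.

{{{
#!sh
# Hypothetical helper (not part of NHC): convert "<number>[suffix]" to kB.
# Assumption: binary multiples, i.e. 1 MB = 1024 kB and 1 GB = 1024 MB.
to_kb() {
    local value="$1" num suffix
    local re='^([0-9]+)([A-Za-z]*)$'

    [[ "$value" =~ $re ]] || return 1
    num="${BASH_REMATCH[1]}"
    suffix="${BASH_REMATCH[2]}"

    # 10# forces base 10 so leading zeros are not read as octal.
    case "$suffix" in
        ''|[kK]|[kK][bB]) echo $(( 10#$num )) ;;                # kB is the default
        [mM]|[mM][bB])    echo $(( 10#$num * 1024 )) ;;
        [gG]|[gG][bB])    echo $(( 10#$num * 1024 * 1024 )) ;;
        *)                return 1 ;;                           # unknown suffix
    esac
}
}}}

  Under these assumptions, `to_kb 128MB` prints `131072`, and a bare number such as `to_kb 100` is passed through as kilobytes.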
    654  
    655 ===== check_fs_size ===== 
    656 {{{ 
    657 check_fs_size mountpoint [minsize] [maxsize] 
    658 }}} 
    659  
    660   (Version 1.2+) Checks that a particular filesystem is between ''minsize'' and ''maxsize'' (inclusive).  Either may be blank; to check for a specific size, pass the same value for both parameters.  The value(s) for ''minsize'' and/or ''maxsize'' are specified as a numerical value with an optional suffix (`k` or `kB` for kilobytes, the default; `M` or `MB` for megabytes; `G` or `GB` for gigabytes; etc., all case insensitive). 
    661  
    662   '''WARNING:'''  Use of this check requires execution of the `/usr/bin/df` command which may HANG in cases of NFS failure!  If you use this check, consider also using [#DetachedMode Detached Mode]! 
    663  
    664   '''''Example''''':  `check_fs_size /tmp 512m 4g` 
    665  
    666 ===== check_fs_used ===== 
    667 {{{ 
    668 check_fs_used mountpoint maxused 
    669 }}} 
    670  
    671   (Version 1.2+) Checks that a particular filesystem has less than ''maxused'' space consumed.  The value for ''maxused'' may be specified either as a percentage or a numerical value with an optional suffix (`k` or `kB` for kilobytes, the default; `M` or `MB` for megabytes; `G` or `GB` for gigabytes; etc., all case insensitive). 
    672  
    673   '''WARNING:'''  Use of this check requires execution of the `/usr/bin/df` command which may HANG in cases of NFS failure!  If you use this check, consider also using [#DetachedMode Detached Mode]! 
    674  
    675   '''''Example''''':  `check_fs_used / 98%` 
    676  
    677 ===== check_hw_cpuinfo ===== 
    678 {{{ 
    679 check_hw_cpuinfo [sockets] [cores] [threads] 
    680 }}} 
    681  
    682   `check_hw_cpuinfo` compares the properties of the OS-detected CPU(s) to the specified values to ensure that the correct number of physical sockets, execution cores, and "threads" (or "virtual cores") are present.  For a single-core, non-hyperthreading-enabled processor, all 3 parameters would be identical.  Multicore CPUs will have more ''cores'' than ''sockets'', and CPUs with [https://en.wikipedia.org/wiki/Hyper-threading Intel HyperThreading Technology (HT)] turned on will have more ''threads'' than ''cores''.  Since HPC workloads often suffer when HT is active, this check is a handy way to make sure that doesn't happen. 
    683  
    684   '''''Example''' (dual-socket 4-core Intel Nehalem with HT turned off)'':  `check_hw_cpuinfo 2 8 8` 
    685  
    686 ===== check_hw_eth ===== 
    687 {{{ 
    688 check_hw_eth device 
    689 }}} 
    690  
    691   `check_hw_eth` verifies that a particular Ethernet device is available.  Note that it cannot check for IP configuration at this time. 
    692  
    693   '''''Example''''':  `check_hw_eth eth0` 
    694  
    695 ===== check_hw_gm ===== 
    696 {{{ 
    697 check_hw_gm device 
    698 }}} 
    699  
    700   `check_hw_gm` verifies that the specified Myrinet device is available.  This check will fail if the Myrinet kernel drivers are not loaded but does not distinguish between missing drivers and a missing interface. 
    701  
    702   '''''Example''''':  `check_hw_gm myri0` 
    703  
    704 ===== check_hw_ib ===== 
    705 {{{ 
    706 check_hw_ib rate 
    707 }}} 
    708  
    709   `check_hw_ib` determines whether or not an active IB link is present with the specified data rate (in Gb/sec). 
    710  
    711   '''''Example''' (QDR InfiniBand)'':  `check_hw_ib 40` 
    712  
    713 ===== check_hw_mcelog ===== 
    714 {{{ 
    715 check_hw_mcelog 
    716 }}} 
    717  
    718   `check_hw_mcelog` queries the running `mcelog` daemon, if present.  If the daemon is not running or has detected no errors, the check passes.  If errors are present, the check fails and sends the output to the log file and syslog. 
    719   The default behavior is to run `mcelog --client` but is configurable via the `$MCELOG` and `$MCELOG_ARGS` variables. 
    720    
    721   (Version 1.4.1 and higher) `check_hw_mcelog` will now also check the correctable and uncorrectable error counts in the past 24 hours and compare them to the settings `$MCELOG_MAX_CORRECTED_RATE` and `$MCELOG_MAX_UNCORRECTED_RATE`, respectively; if either actual count exceeds the value specified in the threshold, the check will fail.  Set either or both variables to the empty string to obtain the old behavior. 
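
  As a hedged illustration, a configuration fragment along these lines would loosen the corrected-error threshold before the check runs.  The value `20` is purely illustrative and not a recommendation:

{{{
 * || MCELOG_MAX_CORRECTED_RATE=20
 * || check_hw_mcelog
}}}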
    722  
    723 ===== check_hw_mem ===== 
    724 {{{ 
    725 check_hw_mem min_kb max_kb 
    726 }}} 
    727  
    728   `check_hw_mem` compares the total system memory (RAM + swap) with the minimum and maximum values provided (in kB).  If the total memory is less than ''min_kb'' or more than ''max_kb'' kilobytes, the check fails.  To require an exact amount of memory, use the same value for both parameters. 
    729  
    730   '''''Example''' (exactly 26 GB system memory required)'':  `check_hw_mem 27262976 27262976` 
    731  
    732 ===== check_hw_mem_free ===== 
    733 {{{ 
    734 check_hw_mem_free min_kb 
    735 }}} 
    736  
    737   `check_hw_mem_free` adds the free physical RAM to the free swap (see below for details) and compares that to the minimum provided (in kB).  If the total free memory is less than ''min_kb'' kilobytes, the check fails. 
    738  
    739   '''''Example''' (require at least 640 kB free)'':  `check_hw_mem_free 640` 
    740  
    741 ===== check_hw_physmem ===== 
    742 {{{ 
    743 check_hw_physmem min_kb max_kb 
    744 }}} 
    745  
    746   `check_hw_physmem` compares the amount of physical memory (RAM) present in the system with the minimum and maximum values provided (in kB).  If the physical memory is less than ''min_kb'' or more than ''max_kb'' kilobytes, the check fails.  To require an exact amount of RAM, use the same value for both parameters. 
    747  
    748   '''''Example''' (at least 12 GB RAM/node, no more than 48 GB)'':  `check_hw_physmem 12582912 50331648` 
    749  
    750 ===== check_hw_physmem_free ===== 
    751 {{{ 
    752 check_hw_physmem_free min_kb 
    753 }}} 
    754  
    755   `check_hw_physmem_free` compares the free physical RAM to the minimum provided (in kB).  If less than ''min_kb'' kilobytes of physical RAM are free, the check fails.  For purposes of this calculation, kernel buffers and cache are considered to be free memory. 
    756  
    757   '''''Example''' (require at least 1 kB free)'':  `check_hw_physmem_free 1` 
    758  
    759 ===== check_hw_swap ===== 
    760 {{{ 
    761 check_hw_swap min_kb max_kb 
    762 }}} 
    763  
    764   `check_hw_swap` compares the total system virtual memory (swap) size with the minimum and maximum values provided (in kB).  If the total swap size is less than ''min_kb'' or more than ''max_kb'' kilobytes, the check fails.  To require an exact amount of memory, use the same value for both parameters. 
    765  
    766   '''''Example''' (at most 2 GB swap)'':  `check_hw_swap 0 2097152` 
    767  
    768 ===== check_hw_swap_free ===== 
    769 {{{ 
    770 check_hw_swap_free min_kb 
    771 }}} 
    772  
    773   `check_hw_swap_free` compares the amount of free virtual memory to the minimum provided (in kB).  If the total free swap is less than ''min_kb'' kilobytes, the check fails. 
    774  
    775   '''''Example''' (require at least 1 GB free)'':  `check_hw_swap_free 1048576` 
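
The kB values in the memory and swap examples above are plain binary multiples (1 GB = 1024 × 1024 kB), so they can be derived with shell arithmetic instead of being typed by hand.  A minimal sketch; `gb_to_kb` is a hypothetical helper for illustration and is not part of NHC:

```shell
# Hypothetical helper (not part of NHC): convert binary gigabytes to kB.
gb_to_kb() {
    echo $(( $1 * 1024 * 1024 ))
}

gb_to_kb 26    # the amount used in the check_hw_mem example
gb_to_kb 48    # the upper bound used in the check_hw_physmem example
```

This is intended for working out values while writing a configuration file, not for use inside a check, where the command substitution would add a subshell on every run.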
    776  
    777 ===== check_moab_sched ===== 
    778 {{{ 
    779 check_moab_sched [-t timeout] [-a alert_match] [-m [!]match] [-v version_match] 
    780 }}} 
    781  
    782   `check_moab_sched` executes `mdiag -S -v` and examines its output, similarly to `check_cmd_output`.  In addition to the arbitrary positive/negative ''`match`'' strings, it also accepts an ''`alert_match`'' for flagging specific Moab alerts and a ''`version_match`'' for making sure the expected version is running.  The check will fail based on any of these matches, or if `mdiag` does not return within the specified timeout. 
    783  
    784   '''''Example''' (ensure we're running Moab 7.2.3 and it's not paused)'':  `check_moab_sched -t 45 -m '!/PAUSED/' -v 7.2.3` 
    785  
    786 ===== check_moab_rm ===== 
    787 {{{ 
    788 check_moab_rm [-t timeout] [-m [!]match] 
    789 }}} 
    790  
    791   `check_moab_rm` executes `mdiag -R -v` and examines its output, similarly to `check_cmd_output`.  In addition to the arbitrary positive/negative ''`match`'' strings, it also checks for any RMs which are not in the `Active` state (and fails if any are inactive).  The check will also fail if `mdiag` does not return within the specified timeout. 
    792  
    793   '''''Example''' (basic Moab RM sanity check)'':  `check_moab_rm -t 45` 
    794  
    795 ===== check_moab_torque ===== 
    796 {{{ 
    797 check_moab_torque [-t timeout] [-m [!]match] 
    798 }}} 
    799  
    800   `check_moab_torque` executes `qmgr -c 'print server'` and examines its output, similarly to `check_cmd_output`.  In addition to the arbitrary positive/negative ''`match`'' strings, it also checks to make sure that the `scheduling` parameter is set to `True` (and fails if it isn't).  The check will also fail if `qmgr` does not return within the specified timeout. 
    801  
    802   '''''Example''' (basic TORQUE configuration/responsiveness sanity check)'':  `check_moab_torque -t 45` 
    803  
    804 ===== check_net_socket ===== 
    805 {{{ 
    806 check_net_socket [-0] [-a] [-!] [-n <name>] [-p <proto>] [-s <state>] [-l <locaddr>[:<locport>]] [-r <rmtaddr>[:<rmtport>]] [-t <type>] [-u <user>] [-d <daemon>] [-e <action>] [-E <found_action>] 
    807 }}} 
    808  
    809  (Version 1.4.1+) `check_net_socket` executes either the command `$NETSTAT_CMD $NETSTAT_ARGS` (default:  `netstat -Tanpee -A inet,inet6,unix`) or (if `$NETSTAT_CMD` is not in `$PATH`) the command `$SS_CMD $SS_ARGS` (default:  `ss -anpee -A inet,unix`).  The output of the command is parsed for socket information.  Then each socket is compared with the match criteria passed in to the check:  protocol ''`proto`'', state ''`state`'', local and/or remote address(es) ''`locaddr`''/''`rmtaddr`'' with optional ports ''`locport`''/''`rmtport`'', type ''`type`'', owner ''`user`'', and/or process name ''`daemon`''.  If a matching socket is found, ''`found_action`'' is executed, and the check returns successfully.  If no match is found, ''`action`'' is executed, and the check fails.  Reverse the success/failure logic by specifying `-!` (meaning that one or more matching sockets fail the check). 
    810  
    811  The ''`name`'' parameter may be used to label the type of socket being sought (e.g., `-n 'SSH daemon TCP listening socket'`).  If `-0` is specified, the check is non-fatal (i.e., missing matches will be noted but will not terminate NHC).  Use `-a` to locate all matching sockets (mainly for debugging). 
    812  
    813   '''''Example''' (search for HTTP daemon IPv4 listening socket and restart if missing)'':  `check_net_socket -n "HTTP daemon" -p tcp -s LISTEN -l '0.0.0.0:80' -d httpd -e 'service httpd start'` 
    814    
    815 ===== check_nv_healthmon ===== 
    816 {{{ 
    817 check_nv_healthmon 
    818 }}} 
    819  
    820  (Version 1.2+) `check_nv_healthmon` runs the command `$NVIDIA_HEALTHMON` (default:  `nvidia-healthmon`) with the arguments specified in `$NVIDIA_HEALTHMON_ARGS` (default:  `-e -v`) to check for problems with any nVidia Tesla GPU devices on the system.  If any errors are found, the entire (human-readable) output of the command is logged, and the check fails.  '''NOTE:'''  Version 3.304 or higher of the nVidia Tesla Deployment Kit (TDK) is required!  See [http://developer.nvidia.com/cuda/tesla-deployment-kit] for details and downloads. 
    821  
    822   '''''Example''''':  `check_nv_healthmon` 
    823  
    824 ===== check_ps_blacklist ===== 
    825 {{{ 
    826 check_ps_blacklist command [[!]owner] [args] 
    827 }}} 
    828  
    829   (Version 1.2+) `check_ps_blacklist` looks for a running process matching ''command'' (or, if ''args'' is specified, ''command+args'').  If ''owner'' is specified, the process must be owned by ''owner''; if the optional `!` is also specified, the process must NOT be owned by ''owner''.  If any matching process is found, the check fails.  (This is the opposite of `check_ps_daemon`.) 
    830  
    831   '''''Example''' (prohibit sshd NOT owned by root)'':  `check_ps_blacklist sshd !root` 
    832  
    833 ===== check_ps_cpu ===== 
    834 {{{ 
    835 check_ps_cpu [-0] [-a] [-f] [-K] [-k] [-l] [-r value] [-s] [-u [!]user] [-m [!]match] [-e action] threshold 
    836 }}} 
    837  
    838   (Version 1.4+) `check_ps_cpu` is a resource consumption check.  It flags any/all matching processes whose current percentage of CPU utilization meets or exceeds the specified ''threshold''.  The `%` suffix on the ''threshold'' is optional but fully supported.  Options to this check are as follows: 
    839  
    840   ||= '''Check Option''' =||= '''Purpose''' =|| 
    841   || `-0` || Non-fatal.  Failure of this check will be ignored. || 
    842   || `-a` || Find, report, and act on all matching processes.  Default behavior is to fail check after first matching process. || 
    843   || `-e` ''`action`'' || Execute `/bin/bash -c '`''action''`'` if matching process is found. || 
    844   || `-f` || Full match.  Match against entire command line, not just first word. || 
    845   || `-K` || Kill '''parent''' of matching process (or processes, if used with `-a`) with SIGKILL.  (NOTE:  Does NOT imply `-k`) || 
    846   || `-k` || Kill matching process (or processes, if used with `-a`) with SIGKILL. || 
    847   || `-l` || Log matching process (or processes, if used with `-a`) to NHC log (`$LOGFILE`). || 
    848   || `-m` ''`match`'' || Look only at processes matching ''match'' (NHC match string, possibly negated).  Default is to check all processes. || 
    849   || `-r` ''`value`'' || Renice matching process (or processes, if used with `-a`) by the specified ''`value`'' (may be positive or negative). || 
    850   || `-s` || Log matching process (or processes, if used with `-a`) to the syslog. || 
    851   || `-u` [!]''`user`'' || User match.  Matches only processes owned by ''user'' (or, if negated, NOT owned by ''user''). || 
    852  
    853   '''''Example''' (look for non-root-owned process consuming 99% CPU or more; renice it to the max)'':  `check_ps_cpu -u !root -r 20 99%` 
    854  
    855 ===== check_ps_daemon ===== 
    856 {{{ 
    857 check_ps_daemon command [owner] [args] 
    858 }}} 
    859  
    860   `check_ps_daemon` looks for a running process matching ''command'' (or, if ''args'' is specified, ''command+args'').  If ''owner'' is specified, the process must be owned by ''owner''.  If no matching process is found, the check fails. 
    861  
    862   '''''Example''' (look for a root-owned sshd)'':  `check_ps_daemon sshd root` 
    863  
    864 ===== check_ps_kswapd ===== 
    865 {{{ 
    866 check_ps_kswapd cpu_time discrepancy [action [actions...]] 
    867 }}} 
    868  
    869   `check_ps_kswapd` compares the accumulated CPU time (in seconds) between `kswapd` kernel threads to make sure there's no imbalance among different NUMA nodes (which could be an early symptom of failure).  Threads may not exceed ''cpu_time'' seconds nor differ by more than a factor of ''discrepancy''.  Unlike most checks, `check_ps_kswapd` need not be fatal.  Zero or more ''actions'' may be specified from the following allowed actions:  `ignore` (do nothing), `log` (write error to log file and continue), `syslog` (write error to syslog and continue), or `die` (fail the check as normal).  The default is "`die`" if no ''action'' is specified. 
    870  
    871   '''''Example''' (max 500 CPU hours, 100x discrepancy limit, only log and syslog on error)'':  `check_ps_kswapd 1800000 100 log syslog` 
    872  
    873 ===== check_ps_loadavg ===== 
    874 {{{ 
    875 check_ps_loadavg limit_1m limit_5m limit_15m 
    876 }}} 
    877  
    878   `check_ps_loadavg` looks at the 1-minute, 5-minute, and 15-minute load averages reported by the kernel and compares them to the parameters ''`limit_1m`'', ''`limit_5m`'', and ''`limit_15m`'', respectively.  If any limit has been exceeded, the check fails.  Limits which are empty (i.e., `''`) or not supplied are ignored (i.e., assumed to be infinite) and will never fail. 
    879  
    880   '''''Example''' (ensure the 5-minute load average stays below 30)'':  `check_ps_loadavg '' 30` 
    881  
    882 ===== check_ps_mem ===== 
    883 {{{ 
    884 check_ps_mem [-0] [-a] [-f] [-K] [-k] [-l] [-r value] [-s] [-u [!]user] [-m [!]match] [-e action] threshold 
    885 }}} 
    886  
    887   (Version 1.4+) `check_ps_mem` is a resource consumption check.  It flags any/all matching processes whose total memory consumption (including both physical and virtual memory) meets or exceeds the specified ''threshold''.  The ''threshold'' is interpreted as kilobytes (1024 bytes) or can use NHC's standard byte-suffix syntax (e.g., `32GB`).  Percentages are not supported for this check at this time.  Options to this check are as follows: 
    888  
    889   ||= '''Check Option''' =||= '''Purpose''' =|| 
    890   || `-0` || Non-fatal.  Failure of this check will be ignored. || 
    891   || `-a` || Find, report, and act on all matching processes.  Default behavior is to fail check after first matching process. || 
    892   || `-e` ''`action`'' || Execute `/bin/bash -c '`''action''`'` if matching process is found. || 
    893   || `-f` || Full match.  Match against entire command line, not just first word. || 
    894   || `-K` || Kill '''parent''' of matching process (or processes, if used with `-a`) with SIGKILL.  (NOTE:  Does NOT imply `-k`) || 
    895   || `-k` || Kill matching process (or processes, if used with `-a`) with SIGKILL. || 
    896   || `-l` || Log matching process (or processes, if used with `-a`) to NHC log (`$LOGFILE`). || 
    897   || `-m` ''`match`'' || Look only at processes matching ''match'' (NHC match string, possibly negated).  Default is to check all processes. || 
    898   || `-r` ''`value`'' || Renice matching process (or processes, if used with `-a`) by the specified ''`value`'' (may be positive or negative). || 
    899   || `-s` || Log matching process (or processes, if used with `-a`) to the syslog. || 
    900   || `-u` [!]''`user`'' || User match.  Matches only processes owned by ''user'' (or, if negated, NOT owned by ''user''). || 
    901  
    902   '''''Example''' (look for process owned by `baduser` consuming 32GB or more of memory; log, syslog, and kill it)'':  `check_ps_mem -u baduser -l -s -k 32G` 
    903  
    904 ===== check_ps_physmem ===== 
    905 {{{ 
    906 check_ps_physmem [-0] [-a] [-f] [-K] [-k] [-l] [-r value] [-s] [-u [!]user] [-m [!]match] [-e action] threshold 
    907 }}} 
    908  
    909   (Version 1.4+) `check_ps_physmem` is a resource consumption check.  It flags any/all matching processes whose physical memory consumption (i.e., resident RAM only) meets or exceeds the specified ''threshold''.  The ''threshold'' is interpreted as a percentage if followed by a `%`, or as a number of kilobytes (1024 bytes) if numeric only, or can use NHC's standard byte-suffix syntax (e.g., `32GB`).  Options to this check are as follows: 
    910  
    911   ||= '''Check Option''' =||= '''Purpose''' =|| 
    912   || `-0` || Non-fatal.  Failure of this check will be ignored. || 
    913   || `-a` || Find, report, and act on all matching processes.  Default behavior is to fail check after first matching process. || 
    914   || `-e` ''`action`'' || Execute `/bin/bash -c '`''action''`'` if matching process is found. || 
    915   || `-f` || Full match.  Match against entire command line, not just first word. || 
    916   || `-K` || Kill '''parent''' of matching process (or processes, if used with `-a`) with SIGKILL.  (NOTE:  Does NOT imply `-k`) || 
    917   || `-k` || Kill matching process (or processes, if used with `-a`) with SIGKILL. || 
    918   || `-l` || Log matching process (or processes, if used with `-a`) to NHC log (`$LOGFILE`). || 
    919   || `-m` ''`match`'' || Look only at processes matching ''match'' (NHC match string, possibly negated).  Default is to check all processes. || 
    920   || `-r` ''`value`'' || Renice matching process (or processes, if used with `-a`) by the specified ''`value`'' (may be positive or negative). || 
    921   || `-s` || Log matching process (or processes, if used with `-a`) to the syslog. || 
    922   || `-u` [!]''`user`'' || User match.  Matches only processes owned by ''user'' (or, if negated, NOT owned by ''user''). || 
    923  
    924   '''''Example''' (look for all non-root-owned processes consuming 20% or more of system RAM; syslog and kill them all, but continue running)'':  `check_ps_physmem -0 -a -u !root -s -k 20%` 
    925  
    926 ===== check_ps_service ===== 
    927 {{{ 
    928 check_ps_service [-0] [-f] [-S|-r|-c|-s|-k] [-u user] [-d daemon | -m match] [ -e action | -E action ] service 
    929 }}} 
    930  
    931   (Version 1.4+) `check_ps_service` is similar to `check_ps_daemon` except it has the ability to start, restart, or cycle services which aren't running but should be, and to stop or kill services which shouldn't be running but are.  Options to this check are as follows: 
    932  
    933   ||= '''Check Option''' =||= '''Purpose''' =|| 
    934   || `-0` || Non-fatal.  Failure of this check will be ignored. || 
    935   || `-S` || Start service.  Service ''service'' will be started if not found running.  Equivalent to `-e '/sbin/service `''`service`''` start'` || 
    936   || `-c` || Cycle service.  Service ''service'' will be cycled if not found running.  Equivalent to `-e '/sbin/service `''`service`''` stop ; sleep 2 ; /sbin/service `''`service`''` start'` || 
    937   || `-d` ''`daemon`'' || Match running process by ''daemon'' instead of ''service''.  Equivalent to `-m '*`''`daemon`''`'` || 
    938   || `-e` ''`action`'' || Execute `/bin/bash -c '`''action''`'` if process IS NOT found running. || 
    939   || `-E` ''`action`'' || Execute `/bin/bash -c '`''action''`'` if process IS found running. || 
    940   || `-f` || Full match.  Match against entire command line, not just first word. || 
    941   || `-k` || Kill service.  Service ''service'' will be killed (and check will fail) if found running.    Similar to `pkill -9 `''`service`'' || 
    942   || `-m` ''`match`'' || Use ''match'' to search the process list for the service.  Default is `*`''`service`'' || 
    943   || `-r` || Restart service.  Service ''service'' will be restarted if not found running.  Equivalent to `-e '/sbin/service `''`service`''` restart'` || 
    944   || `-s` || Stop service.  Service ''service'' will be stopped (and check will fail) if found running.  Performs `/sbin/service `''`service`''` stop` || 
    945   || `-u` [!]''`user`'' || User match.  Matches only processes owned by ''user'' (or, if negated, NOT owned by ''user''). || 
    946  
    947   '''''Example''' (look for a root-owned sshd and start if missing)'':  `check_ps_service -u root -S sshd` 
    948  
    949 ===== check_ps_time ===== 
    950 {{{ 
    951 check_ps_time [-0] [-a] [-f] [-K] [-k] [-l] [-r value] [-s] [-u [!]user] [-m [!]match] [-e action] threshold 
    952 }}} 
    953  
    954   (Version 1.4+) `check_ps_time` is a resource consumption check.  It flags any/all matching processes whose total utilization of CPU time meets or exceeds the specified ''threshold''.  The ''threshold'' is a quantity of minutes suffixed by an `M` and/or a quantity of seconds suffixed by an `S`.  A number with no suffix is interpreted as seconds.  Options to this check are as follows: 
    955  
    956   ||= '''Check Option''' =||= '''Purpose''' =|| 
    957   || `-0` || Non-fatal.  Failure of this check will be ignored. || 
    958   || `-a` || Find, report, and act on all matching processes.  Default behavior is to fail check after first matching process. || 
    959   || `-e` ''`action`'' || Execute `/bin/bash -c '`''action''`'` if matching process is found. || 
    960   || `-f` || Full match.  Match against entire command line, not just first word. || 
    961   || `-K` || Kill '''parent''' of matching process (or processes, if used with `-a`) with SIGKILL.  (NOTE:  Does NOT imply `-k`) || 
    962   || `-k` || Kill matching process (or processes, if used with `-a`) with SIGKILL. || 
    963   || `-l` || Log matching process (or processes, if used with `-a`) to NHC log (`$LOGFILE`). || 
    964   || `-m` ''`match`'' || Look only at processes matching ''match'' (NHC match string, possibly negated).  Default is to check all processes. || 
    965   || `-r` ''`value`'' || Renice matching process (or processes, if used with `-a`) by the specified ''`value`'' (may be positive or negative). || 
    966   || `-s` || Log matching process (or processes, if used with `-a`) to the syslog. || 
    967   || `-u` [!]''`user`'' || User match.  Matches only processes owned by ''user'' (or, if negated, NOT owned by ''user''). || 
    968  
    969   '''''Example''' (look for `runawayd` daemon process consuming 3600 minutes, i.e., 60 hours, or more of CPU time; restart service and continue running)'':  `check_ps_time -0 -m '/runawayd/' -e '/sbin/service runawayd restart' 3600m` 
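
To sanity-check a ''threshold'' before committing it to a configuration file, the arithmetic can be reproduced in native bash.  The sketch below is not NHC's own parser:  `threshold_to_seconds` is a hypothetical helper that follows the syntax described above (minutes suffixed by `M`, seconds suffixed by `S`, a bare number taken as seconds) and additionally assumes that lowercase suffixes are accepted, as the example's `3600m` suggests.  It relies on `[[ ... =~ ... ]]`, so it assumes bash 3.0 or newer.

```shell
# Hypothetical helper (not NHC's parser): convert "<n>M<n>S" to seconds.
threshold_to_seconds() {
    # Optional minutes part, then optional seconds part (suffix optional).
    local re='^(([0-9]+)[Mm])?(([0-9]+)[Ss]?)?$'
    [[ -n "$1" && "$1" =~ $re ]] || return 1

    # Unmatched groups are empty, so default them to 0.
    local min="${BASH_REMATCH[2]:-0}" sec="${BASH_REMATCH[4]:-0}"

    # 10# forces base 10 so a leading zero is not read as octal.
    echo $(( 10#$min * 60 + 10#$sec ))
}

threshold_to_seconds 3600m    # prints 216000 (60 hours)
threshold_to_seconds 1440M    # prints 86400 (one day)
threshold_to_seconds 2M30S    # prints 150
```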
    970  
    971 ===== check_ps_unauth_users ===== 
    972 {{{ 
    973 check_ps_unauth_users [action [actions...]] 
    974 }}} 
    975  
    976   `check_ps_unauth_users` examines all processes running on the system to determine if the owner of each process is authorized to be on the system.  Authorized users are anyone with a UID below, by default, 100 (including root) and any users currently running jobs on the node.  All other processes are unauthorized.  If an unauthorized user process is found, the specified action(s) are taken.  The following actions are valid:  `kill` (terminate the process), `ignore` (do nothing), `log` (write error to log file and continue), `syslog` (write error to syslog and continue), or `die` (fail the check as normal).  The default is "`die`" if no ''action'' is specified. 
    977  
    978   '''''Example''' (log, syslog, and kill rogue user processes)'':  `check_ps_unauth_users log syslog kill` 
    979  
    980 ===== check_ps_userproc_lineage ===== 
    981 {{{ 
    982 check_ps_userproc_lineage [action [actions...]] 
    983 }}} 
    984  
    985   `check_ps_userproc_lineage` examines all processes running on the system to check for any processes not owned by an "authorized user" (see previous check) which are not children (directly or indirectly) of the Resource Manager daemon.  Refer to the `$RM_DAEMON_MATCH` configuration variable for how NHC determines the RM daemon process.  If such a rogue process is found, the specified action(s) are taken.  The following actions are valid:  `kill` (terminate the process), `ignore` (do nothing), `log` (write error to log file and continue), `syslog` (write error to syslog and continue), or `die` (fail the check as normal).  The default is "`die`" if no ''action'' is specified. 
    986  
    987   '''''Example''' (mark the node bad on rogue user processes)'':  `check_ps_userproc_lineage die` 
    988  
    989 == Customization == 
    990  
    991 Once you've fully configured NHC to run the built-in checks you need for your nodes, you're probably at the point where you've thought of something else you wish it could do but currently can't.  NHC's design makes it very easy to create additional checks for your site and have NHC load and use them at runtime.  This section will detail how to create new checks, where to place them, and what NHC will do with them. 
    992  
    993 While technically a "check" can be anything the `nhc` driver script can execute, for consistency and extensibility purposes (as well as usefulness to others), we prefer and recommend that checks be shell functions defined in a distinct, namespaced `.nhc` file.  The instructions contained in this section will assume that this is the model you wish to use. 
    994  
    995 '''NOTE:'''  If you do choose to write your own checks, and you feel they might be useful to the NHC community, we encourage you to share them.  E-mail them to either the [mailto:warewulf-devel@lbl.gov Warewulf Developers' Mailing List] or the [mailto:torquedev@supercluster.org TORQUE Developers' Mailing List].  We prefer either individual file attachments or a unified diff (i.e., `diff -Nurp`) against the NHC tarball/SVN tree, but any usable format will likely be accepted. 
    996  
    997 === Writing Checks === 
    998  
    999 The first decision to be made is what to name your check file.  As mentioned above, check files live (by default) in `/etc/nhc/scripts/` and are named ''`something`''`.nhc`^([#note2 2])^.  The same directory holds `common.nhc`, a file of utility and general-purpose functions.  All other files placed there by the upstream package follow the naming convention ''`site_id`''`_`''`class`''`.nhc` (e.g., the Warewulf project's file containing hardware checks is named `ww_hw.nhc`).  Your ''`site_id`'' can be anything you'd like but should be recognizable.  The ''`class`'' should refer to the subsystem or conceptual group of things you'll be monitoring. 
    1000  
    1001 For purposes of this example, we'll pretend we're from John Sheridan University, using site abbreviation "`jsu`," and we want to write checks for the "rien" system.  ("Rien" is French for "nothing.") 
    1002  
    1003 Your `/etc/nhc/scripts/jsu_rien.nhc` file should start with a header which provides a summary of what will be checked, the name and e-mail of the author, possibly the date or other useful information, and any copyright or license restrictions you are placing on the file^([#note3 3])^.  It should look something like this: 
    1004  
    1005 {{{ 
    1006 # NHC -- John Sheridan University's Rien Checks 
    1007 # 
    1008 # Your Name <you@your.com> 
    1009 # Date 
    1010 # 
    1011 # Copyright and/or license information if different from upstream 
    1012 # 
    1013 }}} 
    1014  
    1015 Next, initialize any variables you will use to sane defaults.  This does two things:  it provides anyone reading your code a single place to look for "global" variables, and it makes sure you have something to test for later if you need to check the existence of cache data.  Make sure your variables are properly namespaced; that is, they should start with a prefix corresponding to your site, the system you're checking, etc. 
    1016  
    1017 {{{ 
    1018 # Initialize variables 
    1019 RIEN_NORMAL_VARIABLE="" 
    1020 RIEN_ARRAY_VARIABLE=( ) 
    1021 }}} 
    1022  
    1023 If your check may run more than once and does anything resource-intensive (running subprocesses, file I/O, etc.), you should, unless doing so would cause a malfunction, perform the intensive tasks only once and store the results in one or more shell variables for later use.  These should be the variables you just initialized in the section above.  They can be arrays or scalars. 
    1024  
    1025 {{{ 
    1026 # Function to populate data structures with data 
    1027 function nhc_rien_gather_data() { 
    1028     # Gather and cache data for later use. 
    1029     RIEN_NORMAL_VARIABLE="rien" 
    1030     RIEN_ARRAY_VARIABLE=( "nada" ) 
    1031 } 
    1032 }}} 
    1033  
    1034 Next, you need to write your check function(s).  These should be named `check_`''`class`''`_`''`purpose`'' where ''`class`'' is the same as used previously ("rien" for this example), and ''`purpose`'' gives a descriptive name to the check to convey what it checks.  Our example will use the obvious-but-potentially-vague "works" as its purpose, but the name you choose will undoubtedly be more clever. 
    1035  
    1036 If you have created a data-gathering function as shown above and populated one or more cache variables, the first thing your check should do is see if the cache has been populated already.  If not, run your data-gathering function before proceeding with the check. 
    1037  
    1038 As for how you write the check...well, that's entirely up to you.  It will depend on what you need to check and the available options for doing so.  (However, consult the next section for some tips and bashisms to make your checks more efficient.)  The example here is clearly a useless and contrived one but should nevertheless be illustrative of the general concept: 
    1039  
    1040 {{{ 
    1041 # Check to make sure rien is functioning properly 
    1042 function check_rien_works() { 
    1043     # Load cache if empty 
    1044     if [[ ${#RIEN_ARRAY_VARIABLE[*]} -eq 0 ]]; then 
    1045         nhc_rien_gather_data 
    1046     fi 
    1047  
    1048     # Use cached data 
    1049     if [[ "${RIEN_ARRAY_VARIABLE[0]}" = "" ]]; then 
    1050         die "Rien is not working" 
    1051     fi 
    1052     return 0 
    1053 } 
    1054 }}} 
    1055  
    1056 If other check functions are needed for a particular subsystem, write those similarly.  If you're using a cache, each check should first test the cache variables, and call the data-gathering function if they are empty, before doing the actual checking as shown above. 
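
A second check sharing the same cache might look like the following sketch.  `check_rien_count` and its argument are hypothetical, and the `die` stub merely stands in for the function NHC itself supplies so that the fragment can be run outside the `nhc` driver; a real `.nhc` file would not define it.

```shell
# Stand-in for NHC's die(), only so this fragment runs on its own (assumption).
function die() {
    echo "ERROR: $*" >&2
    return 1
}

# Cache variables and gather function, repeated from the example above.
RIEN_NORMAL_VARIABLE=""
RIEN_ARRAY_VARIABLE=( )

function nhc_rien_gather_data() {
    RIEN_NORMAL_VARIABLE="rien"
    RIEN_ARRAY_VARIABLE=( "nada" )
}

# Hypothetical second check: require at least $1 cached entries (default 1).
function check_rien_count() {
    local min="${1:-1}"

    # Load cache if empty, exactly as check_rien_works does
    if [[ ${#RIEN_ARRAY_VARIABLE[*]} -eq 0 ]]; then
        nhc_rien_gather_data
    fi

    # Use cached data
    if [[ ${#RIEN_ARRAY_VARIABLE[*]} -lt $min ]]; then
        die "Rien has fewer than $min entries"
        return 1
    fi
    return 0
}
```

Whichever check runs first populates the cache; every later check finds the array non-empty and skips the gathering step.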
    1057  
    1058 Once you have all the checks you need, you can add them to the configuration file on your node(s), like so: 
    1059  
    1060 {{{ 
    1061  *  || check_rien_works 
    1062 }}} 
    1063  
    1064 Next time NHC runs, it will automatically pick up your new check(s)! 
    1065  
    1066 === Tips and Best Practices for Checks === 
    1067  
    1068 Several of the philosophies and underlying principles which governed the design and implementation of the Warewulf Node Health Check project were mentioned above in the [#top introduction].  Certain code constructs were used to fulfill these principles which are not typical for the average run-of-the-mill shell script, largely because things which must be highly performant tend not to be written as shell scripts.  Why?  Two reasons:  (1) Bash lacks many of the fancier, more complex features of the dedicated (i.e., non-shell) scripting languages; and (2) many script authors are unaware of the features bash ''does'' offer because they're used so infrequently.  It can become something of a self-fulfilling prophecy when nobody bothers to learn something because no one else is using it. 
    1069  
    1070 So why was bash chosen for this project?  Simple:  it's everywhere.  If you're running Linux, it's almost guaranteed to be there^([#note4 4])^.  The same cannot be said of any other scripting or non-compiled language (not even Perl or Python).  And forcing everyone to write their checks in C or another compiled language would raise the barrier to entry and reduce the number of sites for which it could be useful.  Since half the point is getting more places using a common tool (or at least a common framework), that would defeat the purpose.  Thus, bash made the most sense. 
    1071  
    1072 The important question, then, becomes how to make bash scripts more efficient.  And the solution is clear:  do as much as possible with native bash constructs instead of shelling out to subcommands like `sed`, `awk`, `grep`, and the other common UNIX swashbucklers.  The more one investigates the features bash provides, the more one finds how many of its long-held features tend to go unused and just how much one truly ''is'' able to do without the need to fork-and-exec.  In this section, several aspects of common shell script constructs will be reviewed (along with one or two uncommon ones) along with ways to improve efficiency and avoid subcommands whenever possible. 
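
As a small taste of the approach, the sketch below pulls fields out of a string using only parameter expansion, the `read` built-in, and `[[ ... ]]` pattern matching, where a typical script would reach for `cut`, `awk`, or `grep`.  The sample line and the variable names are made up for illustration.

```shell
LINE="MemTotal:       27262976 kB"

# Instead of:  KEY=$(echo "$LINE" | cut -d: -f1)
KEY="${LINE%%:*}"            # strip the longest suffix starting at a ':'

# Instead of:  VALUE=$(echo "$LINE" | awk '{print $2}')
REST="${LINE#*:}"            # drop everything through the first ':'
read -r VALUE UNIT <<< "$REST"

# Instead of:  echo "$LINE" | grep -q '^Mem'
if [[ "$LINE" == Mem* ]]; then
    echo "$KEY = $VALUE $UNIT"
fi
```

None of these lines forks a subprocess; each commented-out alternative would spawn at least one.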
    1073  
    1074 ==== Arrays ==== 
    1075  
    1076 Arrays are an important tool in any sufficiently-capable scripting language.  Bash has had support for arrays for quite some time; recent versions even add associative array support (i.e., string-based indexing, akin to hashes in Perl).  To maintain compatibility, associative arrays are not currently used in NHC, but traditional arrays are used quite heavily.  Though a complete tutorial on arrays in bash is beyond the scope of this document, a brief "cheat sheet" is probably a good idea.  So here you go: 
    1077  
    1078 ||= Syntax =||= Purpose =|| 
    1079 || `declare -a AVAR` || Declare the shell variable `$AVAR` to be an array (not strictly required, but good form). || 
    1080 || `AVAR=( ... )` || Assign elements of array `$AVAR` based on the word expansion of the contents of the parentheses.  `...` is one or more words of the form `[`''`subscript`''`]=`''`value`'' or an expression which expands to such.  Only the ''`value`''(s) are required. || 
    1081 || `${AVAR[`''`subscript`''`]}` || Evaluates to the ''`subscript`''^th^ element of the array `$AVAR`.  ''`subscript`'' must evaluate to an integer >= 0. || 
    1082 || `${#AVAR[*]}` || Evaluates to the number of elements in the array `$AVAR`. || 
    1083 || `${AVAR[*]}` || Evaluates to all the values in the `$AVAR` array as a single word (like `$*`).  Use only where keeping values separate doesn't matter. || 
    1084 || `"${AVAR[@]}"` || Evaluates to all values in the `$AVAR` array, each as a separate word.  This keeps values distinct (just like `$@` vs. `$*`). || 
    1085 || `"${AVAR[@]:`''`offset`''`:`''`length`''`}"` || Evaluates to the values of `$AVAR` as above, starting at element `${AVAR[`''`offset`''`]}` and including at most ''`length`'' elements.  ''`length`'' may be omitted, and ''`offset`'' may be negative (put a space after the colon, as in `"${AVAR[@]: -1}"`, so it is not parsed as the `:-` default-value operator). || 
    1086  
    1087 A more detailed examination of bash arrays can be found [http://tldp.org/LDP/abs/html/arrays.html here]. 
    1088  
    1089 Several examples of array-based techniques will appear in the following sections, so ensure your grasp on the basic usage of array syntax before continuing. 
    1090  
    1091 ==== File I/O ==== 
    1092  
    1093 When using the command prompt, most of us reach for things like `cat` or `less` when we need to view the contents of a file; thus, our inclination tends to be to reach for the same tools when writing shell scripts.  `cat`, however, is not a bash built-in, so a fork-and-exec is required to spawn `/bin/cat` just so it can read a file and return the contents.  This overhead is negligible for interactive shell usage, and may be a non-issue for many shell-scripting scenarios, but for efficiency-critical scenarios like NHC, we can and should do better! 
    1094  
    1095 File input and output (either truncate or append) are both natively supported by bash using the (mostly) well-known [https://www.gnu.org/software/bash/manual/html_node/Redirections.html Redirection Operators].  Rather than reading data from files into variables (arrays or scalars) using command substitution (i.e., the {{{`}}} and `$()` operators), use redirection operators to pull the contents of the file into the variable.  One technique for doing this is to redirect to the `read` built-in.  So instead of this: 
    1096  
    1097 {{{ 
    1098 MOTD=`cat /etc/motd` 
    1099 }}} 
    1100  
    1101 use: 
    1102  
    1103 {{{ 
    1104 read MOTD < /etc/motd 
    1105 }}} 
    1106  
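    1107 Note that `read` stops at the first newline, so the form above captures only the first line of the file.  bash also offers an even simpler form, which captures the entire file: 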
    1108  
    1109 {{{ 
    1110 MOTD=$(< /etc/motd) 
    1111 }}} 
    1112  
    1113 It looks similar to command substitution but uses I/O redirection in place of an actual command. It '''does''', however, still do a `fork()` and `pipe()` to do the file I/O.  On Linux, this is done via `clone()` which is fairly lightweight but still not quite as efficient as the `read` command shown above (which is a `bash` built-in). 
    1114  
    1115 The same syntax can be used to populate array variables with multiple fields' worth of data: 
    1116  
    1117 {{{ 
    1118 UPTIME=( $(< /proc/uptime) ) 
    1119 }}} 
    1120  
    1121 This will store the system uptime (in seconds) in the variable `${UPTIME[0]}` and the idle time in `${UPTIME[1]}`.  Declare `$UPTIME` as an array in advance using `declare -a` or `local -a` to make this clearer, and (as always!) make sure to add comments!  To avoid the `fork()` (see above), use `read` instead: 
    1122  
    1123 {{{ 
    1124 read -a UPTIME < /proc/uptime 
    1125 }}} 
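When the fields have known meanings, `read` can also split a line directly into named variables instead of an array.  The following is a minimal sketch, assuming a file in the format of the Linux `/proc/loadavg`; the function name `get_loadavg` and the variable names are hypothetical and not part of NHC.

```shell
#!/bin/bash
# Hypothetical helper: read the load averages from a loadavg-format file.
# Sets the global variables LOAD1, LOAD5, LOAD15, and REST.
get_loadavg() {
    local FILE="${1:-/proc/loadavg}"
    # One built-in call, no fork: the first three fields land in named
    # variables, and everything left over goes into the last one (REST).
    read -r LOAD1 LOAD5 LOAD15 REST < "$FILE"
}

# Example usage on a Linux system:
if [[ -r /proc/loadavg ]]; then
    get_loadavg
    echo "1-minute load average: $LOAD1"
fi
```

Named variables tend to make the later code easier to read than numeric array subscripts, at the cost of having to know the field layout in advance.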
    1126  
    1127 Though not as easy to spot, other subcommands may also be able to be eliminated using this technique.  For example, the Linux kernel makes the full hostname of the system available in a file in the `/proc` filesystem.  Knowing this, the {{{`hostname`}}} command substitution may be eliminated by utilizing the contents of this file: 
    1128  
    1129 {{{ 
    1130 read HOSTNAME < /proc/sys/kernel/hostname 
    1131 }}} 
    1132  
    1133 As an aside...  Knowing these tricks may also be helpful in other situations.  If you're trying to repair a system in which the root filesystem has become partially corrupted, and the `cat` command no longer works, this can provide you a way to view the contents of system files directly in your shell! 
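For that rescue scenario, a rough stand-in for `cat` can be put together from built-ins alone.  This is only a sketch; the function name `bash_cat` is invented here, and it handles a single file argument.

```shell
#!/bin/bash
# Hypothetical stand-in for cat(1) using only bash built-ins.
bash_cat() {
    local LINE
    # IFS= preserves leading/trailing whitespace; -r keeps backslashes literal.
    # The "|| [[ -n ... ]]" clause also prints a final line that lacks a
    # trailing newline (a newline is added to it on output).
    while IFS= read -r LINE || [[ -n "$LINE" ]]; do
        printf '%s\n' "$LINE"
    done < "$1"
}
```

Typing the loop directly at the prompt works just as well when even defining a function feels like too much ceremony.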
    1134  
    1135 ==== Line Parsing and Loops ==== 
    1136  
    1137 While certainly not as capable as [http://www.perl.org/ PERL] at text processing, the shell does offer some seldom-used features to facilitate the processing of line-oriented input.  By default, the shell splits things up based on whitespace (i.e., space characters, tabs, and newlines) to distinguish each "word" from the next.  This is why quoting must be used to join arguments which contain spaces to allow them to be treated as single parameters.  As with many aspects of the shell, however, this behavior can be customized, allowing for different delimiter characters to be applied to input (typically file I/O).  Since character-delimited files are commonplace in UNIX, this idiom is quite frequently useful when shell scripting. 
    1138  
    1139 One easily-recognized example would be `/etc/passwd`.  It is both line-oriented and colon-delimited.  Parsing its contents is often useful for shell scripts, but most which need this data use `awk` or `cut` to pull the appropriate fields.  Direct splitting and parsing of this file can be done in native bash without the use of subcommands: 
    1140  
    1141 {{{ 
    1142 IFS=':' 
    1143 while read -a LINE ; do 
    1144     THIS_UID=${LINE[2]} 
    1145     UIDS[${#UIDS[*]}]=$THIS_UID 
    1146     PWDATA_USER[$THIS_UID]="${LINE[0]}" 
    1147     PWDATA_GID[$THIS_UID]=${LINE[3]} 
    1148     PWDATA_GECOS[$THIS_UID]="${LINE[4]}" 
    1149     PWDATA_HOME[$THIS_UID]="${LINE[5]}" 
    1150     PWDATA_SHELL[$THIS_UID]="${LINE[6]}" 
    1151 done < /etc/passwd 
    1152 IFS=$' \t\n' 
    1153 }}} 
    1154  
    1155 The above code reads a line at a time from `/etc/passwd` into the `$LINE` array.  Because the bash Internal Field Separator variable, `$IFS`, has been set to a colon ('`:`') instead of whitespace, each field of the `passwd` file will go into a separate element of the `$LINE` array.  The values in `$LINE` are then used to populate 5 parallel arrays with the username, GID, GECOS field, home directory, and shell for each user (indexed by UID).  It also keeps an array of all the UIDs it has seen.  Everything here is done in the same bash process which is executing the script, so it is quite efficient.  The `$IFS` variable is reset to its proper value after the loop completes. 
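A variation on the same idea is to scope the `$IFS` change to the `read` command itself, which removes the need to restore it afterward.  The sketch below is an alternative, not the approach shown above; the function name `shell_for_user` and its optional file argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the login shell for a user name by scanning
# a passwd-format file (defaults to /etc/passwd).
shell_for_user() {
    local WANTED="$1" FILE="${2:-/etc/passwd}"
    local NAME PW PW_UID PW_GID GECOS HOMEDIR USHELL
    # The IFS assignment prefixes the read command, so it applies to that
    # one command only; the global $IFS is never modified.
    while IFS=':' read -r NAME PW PW_UID PW_GID GECOS HOMEDIR USHELL; do
        if [[ "$NAME" == "$WANTED" ]]; then
            echo "$USHELL"
            return 0
        fi
    done < "$FILE"
    return 1
}

# Example usage:
#   shell_for_user root
```

Because the global `$IFS` never changes, an early `return` from inside the loop cannot leave the rest of the script with an unexpected separator.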
    1156  
    1157 Sometimes, however, eliminating a subprocess is impractical or impossible.  A similar approach may still be used to keep the parsing of the command's output as efficient as possible.  For example, a bash-native implementation of the `netstat -nap` command (or even a close approximation of it) would be impractical at best, so we could use the following method to populate our cache data from its output: 
    1158  
    1159 {{{ 
    1160 IFS=$'\n' 
    1161 LINES=( $(netstat -nap) ) 
    1162  
    1163 IDX=0 
    1164 for ((i=0; i<${#LINES[*]}; i++)); do 
    1165     IFS=$' \t\n' 
    1166     LINE=( ${LINES[$i]} ) 
    1167     if [[ "${LINE[0]}" != "tcp" && "${LINE[0]}" != "udp" ]]; then 
    1168         continue 
    1169     fi 
    1170     NET_PROTO[$IDX]=${LINE[0]} 
    1171     NET_RECVQ[$IDX]=${LINE[1]} 
    1172     NET_SENDQ[$IDX]=${LINE[2]} 
    1173     NET_LOCADDR[$IDX]=${LINE[3]} 
    1174     NET_REMADDR[$IDX]=${LINE[4]} 
    1175     if [[ "${NET_PROTO[$IDX]}" == "tcp" ]]; then 
    1176         NET_STATE[$IDX]=${LINE[5]} 
    1177         NET_PROC[$IDX]=${LINE[6]} 
    1178     else 
    1179         NET_STATE[$IDX]="" 
    1180         NET_PROC[$IDX]=${LINE[5]} 
    1181     fi 
    1182     if [[ "${NET_PROC[$IDX]}" == */* ]]; then 
    1183         IFS='/' 
    1184         LINE=( ${NET_PROC[$IDX]} ) 
    1185         NET_PROCPID[$IDX]=${LINE[0]} 
    1186         NET_PROCNAME[$IDX]=${LINE[1]} 
    1187     else 
    1188         NET_PROCPID[$IDX]='???' 
    1189         NET_PROCNAME[$IDX]="unknown" 
    1190     fi 
    1191     ((IDX++)) 
    1192 done 
    1193 IFS=$' \t\n' 
    1194 }}} 
    1195  
    1196 By resetting `$IFS` to contain only a newline character, we can easily split the command results into individual lines.  We place these results into the `$LINES` array.  Each line is then split on the traditional whitespace characters and placed into the `$LINE` (with no '`S`' on the end) array.  We're tracking only TCP and UDP sockets here, so everything else (including column headers) gets thrown away.  We store each field in our cache arrays, and we even further split one of the fields which uses '`/`' as a separator.  After our loop is complete, we reset `$IFS`, and we now have a fully-populated set of cache variables containing all our TCP- and UDP-based sockets, all with only 1 fork-and-exec required! 
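Another way to consume a command's output line by line, without first collecting it into an intermediate array, is process substitution.  The following sketch assumes netstat-style output whose first field is the protocol; the function name `count_sockets` and the counter variables are made up for illustration and are not NHC code.

```shell
#!/bin/bash
# Hypothetical helper: count TCP and UDP lines in the output of whatever
# command is passed as arguments.  Sets TCP_COUNT and UDP_COUNT.
count_sockets() {
    local PROTO REST
    TCP_COUNT=0
    UDP_COUNT=0
    # "< <(...)" feeds the command's output to the loop while the loop
    # itself runs in the current shell, so the counters survive it.
    while read -r PROTO REST; do
        case "$PROTO" in
            tcp) TCP_COUNT=$((TCP_COUNT + 1)) ;;
            udp) UDP_COUNT=$((UDP_COUNT + 1)) ;;
        esac
    done < <("$@")
}

# Example usage:
#   count_sockets netstat -nap
#   echo "tcp=$TCP_COUNT udp=$UDP_COUNT"
```

Piping the command into the loop (`cmd | while read ...`) would look similar, but the loop would then run in a subshell and the counters would be lost when it exits.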
    1197  
    1198 ==== Text Transformations ==== 
    1199  
    1200 Bash got a regular expression matching operator in version 3, but it still lacks regex-based transforms.  However, with a minimum of extra effort, glob-based transforms can often provide the necessary functionality. 
    1201  
    1202 The following basic variable transformations are available: 
    1203  
    1204 ||= Syntax =||= Purpose =|| 
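    1205 || `${VAR:`''`offset`''`}` || Evaluates to the substring of `$VAR` starting at ''`offset`'' and continuing until the end of the string.  If ''`offset`'' is negative, it is interpreted relative to the end of `$VAR`; a negative ''`offset`'' must be separated from the colon by a space (e.g., `${VAR: -3}`) so it is not parsed as the `:-` default-value operator. || 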
    1206 || `${VAR:`''`offset`''`:`''`length`''`}` || Same as above, but the result will contain at most ''`length`'' characters from `$VAR`. || 
    1207 || `${#VAR}` || Gives the length, in characters, of the value assigned to `$VAR`. || 
    1208 || `${VAR#`''`pattern`''`}` || Removes the shortest string matching ''`pattern`'' from the beginning of `$VAR`. || 
    1209 || `${VAR##`''`pattern`''`}` || Same as above, but the longest string matching ''`pattern`'' is removed. || 
    1210 || `${VAR%`''`pattern`''`}` || Removes the shortest string matching ''`pattern`'' from the end of `$VAR`. || 
    1211 || `${VAR%%`''`pattern`''`}` || Same as above, but the longest string matching ''`pattern`'' is removed. || 
    1212 || `${VAR/`''`pattern`''`/`''`replacement`''`}` || The first string matching ''`pattern`'' in `$VAR` is replaced with ''`replacement`''.  ''`replacement`'' and the last `/` may be omitted to simply remove the matching string.  Patterns starting with `#` or `%` must match beginning or end (respectively) of `$VAR`. || 
    1213 || `${VAR//`''`pattern`''`/`''`replacement`''`}` || Same as above, but ALL strings matching ''`pattern`'' are replaced/removed. || 
    1214  
    1215 So here are some ways the above constructs can be used to do common operations on strings/files: 
    1216  
    1217 ||= Traditional Method =||= Native `bash` method =|| 
    1218 || `sed 's/^ *//'` || `while [[ "$LINE" != "${LINE## }" ]]; do LINE="${LINE## }" ; done` || 
    1219 || `sed 's/ *$//'` || `while [[ "$LINE" != "${LINE%% }" ]]; do LINE="${LINE%% }" ; done` || 
    1220 || `echo ${LIST[*]} | fgrep string` || `[[ "${LIST[*]//string}" != "${LIST[*]}" ]]` || 
    1221 || `tail -1` || `"${LINES[@]: -1}"` || 
    1222 || `cat file | tr -d '\r'` || `LINES=( "${LINES[@]//$'\r'}" )` || 
    1223  
    1224 There are infinitely more, of course, but these should get you thinking along the right lines! 
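A few more everyday cases, sketched with the same expansions.  The path used here is purely illustrative.

```shell
#!/bin/bash
# Illustrative path; any string works.
FILEPATH="/var/log/nhc/nhc.log.1"

BASE="${FILEPATH##*/}"      # like basename: "nhc.log.1"
DIR="${FILEPATH%/*}"        # like dirname:  "/var/log/nhc"
EXT="${FILEPATH##*.}"       # text after the last dot: "1"
STEM="${BASE%%.*}"          # text before the first dot: "nhc"
DASHED="${FILEPATH//\//-}"  # every "/" replaced: "-var-log-nhc-nhc.log.1"
LAST3="${FILEPATH: -3}"     # last three characters: "g.1"

echo "$DIR / $BASE / $EXT / $STEM / $DASHED / $LAST3"
```

Unlike `basename` and `dirname`, these expansions do no special handling of edge cases such as trailing slashes or paths with no slash at all, so they are best suited to input whose shape is already known.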
    1225  
    1226 ==== Matching ==== 
    1227  
    1228 Matching input data against potential or expected patterns is common to all programming, and NHC is no exception.  As previously mentioned, however, bash 2.x did not have regular expression matching capability.  To abstract this out, NHC's `common.nhc` file (loaded automatically by `nhc` when it runs) provides the `mcheck_regexp()`, `mcheck_range()`, and `mcheck_glob()` functions which return 0 (i.e., bash "true") if the first argument matches the pattern provided as the second argument.  To allow for a single matching interface to support all styles of matching, the `mcheck()` function is also provided.  If the pattern is surrounded by slashes (e.g., `/pattern/`), `mcheck()` will attempt a regular expression match; if the pattern is surrounded by braces (e.g., `{pattern}`), a range match is attempted; otherwise, it attempts a glob match.  (For older bash versions which lack the regex matching operator, `egrep` is used instead...which unfortunately will mean additional subshells.)  The `mcheck()` function is used to implement the pattern matching of the first field in `nhc.conf`. 
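To make the dispatch idea concrete, here is a deliberately simplified sketch of a slash-versus-glob dispatcher.  It is '''not''' NHC's implementation:  the function name `demo_mcheck` is invented, range patterns are left out entirely, it assumes a bash with the `=~` operator (version 3 or later), and it follows the usual shell convention of exit status 0 for a match.

```shell
#!/bin/bash
# Simplified, hypothetical dispatcher; NOT NHC's implementation.
# Range patterns ({...}) are deliberately left out.
demo_mcheck() {
    local STRING="$1" PATTERN="$2" REGEX
    if [[ "$PATTERN" == /*/ ]]; then
        # Strip the surrounding slashes, then do a regex match.
        REGEX="${PATTERN#/}"
        REGEX="${REGEX%/}"
        # $REGEX stays unquoted: quoting it would force a literal
        # string match in bash 3.2 and later.
        [[ "$STRING" =~ $REGEX ]]
    else
        # An unquoted right-hand side of == is treated as a glob.
        [[ "$STRING" == $PATTERN ]]
    fi
}

# Example usage:
if demo_mcheck "node042" "/^node[0-9]+$/"; then
    echo "regex match"
fi
if demo_mcheck "node042" "node*"; then
    echo "glob match"
fi
```

The function's exit status is simply that of the final `[[ ]]` test, so it can be used directly in `if` statements or with `&&` and `||`.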
    1229  
    1230 ---- 
    1231  
    1232 [=#note1 (1)]  Actually, nhc-wrapper will strip "-wrapper" off the end of its name and execute whatever remains, or you can specify a subprogram directly using the -P option on the nhc-wrapper command line.  It was intentionally written to be somewhat generic in its operation so as to be potentially useful in wrapping other utilities. 
    1233  
    1234 [=#note2 (2)]  Technically, ''any'' file in that directory gets loaded __regardless__ of extension.  This may change in the future, so use of the `.nhc` extension is highly recommended. 
    1235  
    1236 [=#note3 (3)]  If you don't specify otherwise, all checks made available publicly or directly to the Warewulf development team at [http://www.lbl.gov/ LBNL] are copyrighted by the author and licensed as specified in the [source:/trunk/common/LICENSE LBNL-BSD license] used by Warewulf. 
    1237  
    1238 [=#note4 (4)]  Well, okay...  If you're running enough of Linux that it can function as a compute node.  Bootstrap images and other embedded/super-minimal cases aren't really applicable to NHC anyway. 
     5 Please find the new LBNL Node Health Check (NHC) page and documentation at https://github.com/mej/nhc!