= System Monitoring =

Our system monitoring is implemented by configuring the monitor tool. The current draft of this configuration is below.

The initial configuration is based on just a few states:

 ''START'':: the standard initial state, used to perform priming reads of status and to set up operating defaults
 ''UNKNOWN'':: the standard state used when no other state fits the current conditions
 ''startingNetwork'':: the ethernet interface does not yet have an IP address in the range we expect
 ''running'':: all is ok: our process pid file exists and contains a process id, and the ethernet interface has an IP address in the range we expect
 ''FXOhang'':: we have detected a hung state from our FXO modules. In this state, we attempt to reset the devices.
 ''dead'':: used when we are in the UNKNOWN state for too long

There are two scripts which we use to monitor and reset the FXO status:

 ''check_installed_FXO_status'':: display the current status
 ''reset_FXO'':: run commands to unload and reload the FXO module

{{{
# Private and confidential.
#
# Copyright Jazmin Communications Pty Ltd, 2009
# All rights reserved.
#
# Not for external release.
#
# A monitor configuration for the ip04 system
#
#

ENTER START {
  LOG "starting monitor"
  SET CYCLE = 2 # monitor aggressively while booting
  SET enetStatus = RUN "/sbin/ifconfig en1"
}

STATE startingNetwork {
  enetStatus NOT ~ /inet[ ]+192.168/
}

ENTER startingNetwork {
  SET CYCLE = 2 # monitor aggressively while waiting for the network to come back
}

POLL startingNetwork {
  SET enetStatus = RUN "/sbin/ifconfig en1"
}

# we define operational states based on various conditional tests
# if all conditions pass, the monitor enters the given state and runs
# our 'ENTER' method.

# the following reads a .pid file and verifies that it contains a number.
# note that we demonstrate the 'COLLECT' verb here. COLLECT can be used
# to collect data into a variable for later tests. This is largely for
# optimisation.
# The other way to specify the condition for this state is simply:
#
#   FILE 'myproc.pid' ~ /[0-9]+/
#

STATE running {
  enetStatus ~ /inet[ ]+192.168/;
  COLLECT myprocpid FROM FILE '"/tmp/myproc.pid"';
  myprocpid ~ /[0-9]+/
  # note: current bug, the file name cannot contain a path
}

# if a pid file was found, we log the fact. Every 'CYCLE' seconds, we will test that the
# system is still running.

ENTER running {
  LOG "running ok"
  SET CYCLE = 5 # less frequent monitoring while things are running nicely
}

POLL running {
  SET fxostatus = RUN "'/Users/martin/Desktop/current/Jazmin Communications/check_installed_FXO_status'"
  SET enetStatus = RUN "/sbin/ifconfig en1"
}

# if no states match, the monitor automatically enters the state 'UNKNOWN'
# we can catch this by setting up an enter method:

ENTER UNKNOWN {
  LOG "unknown system state"
}

STATE FXOhang {
  fxostatus ~ /0xff/
}

ENTER FXOhang {
  LOG "detected hang in FXO module"
  SET fxostatus = RUN "'/Users/martin/Desktop/current/Jazmin Communications/reset_FXO'"
}

# if we do not enter the 'running' state, our monitor will enter the UNKNOWN
# state because we have not set up any conditions for any other states.
# If we have been in the UNKNOWN state for 10 seconds or more, we give up and
# decide the system is dead.

STATE dead {
  CURRENT ~ /UNKNOWN/;
  TIMER >= 4
}

# in this sample, if the system is dead, we simply log the fact and exit.

ENTER dead {
  LOG "program is not running, restarting monitor";
  SPAWN "/bin/date >>/tmp/dates"
}
}}}
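
To make the state conditions above easier to reason about, here is a minimal Python sketch that mimics how the configured states might be selected. It is not the monitor tool itself. The function name `classify`, its parameters, and the order in which states are checked are assumptions made for illustration; the configuration does not say which state wins when more than one condition matches. The threshold of 4 mirrors the `TIMER >= 4` condition as written.

```python
import re

# Patterns mirroring the conditions in the configuration above.
# The dot is escaped here; the configuration uses an unescaped '.'.
INET_RE = re.compile(r"inet[ ]+192\.168")
PID_RE = re.compile(r"[0-9]+")
FXO_HANG_RE = re.compile(r"0xff")


def classify(enet_status, pid_text, fxo_status, current, timer):
    """Return the state name that the sample conditions would select.

    ASSUMPTION: 'dead' and 'FXOhang' are checked first, then
    'startingNetwork', then 'running'. The real tool may order
    these differently.
    """
    # STATE dead: we are in UNKNOWN and the timer has reached the limit.
    if current == "UNKNOWN" and timer >= 4:
        return "dead"
    # STATE FXOhang: the FXO status output contains 0xff.
    if FXO_HANG_RE.search(fxo_status or ""):
        return "FXOhang"
    # STATE startingNetwork: no address in the expected range.
    if not INET_RE.search(enet_status or ""):
        return "startingNetwork"
    # STATE running: network is up and the pid file contains a number.
    if pid_text is not None and PID_RE.search(pid_text):
        return "running"
    # No state matched.
    return "UNKNOWN"
```

For example, `classify("inet 192.168.1.10", None, "", "UNKNOWN", 2)` stays in `UNKNOWN` because the network is up but no pid file was read, while the same call with a timer of 4 yields `dead`.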