System Monitoring
We monitor the system by configuring the monitor tool. The current draft of that configuration is below.
The initial configuration is based on just a few states:
- START
  - the standard initial state, used to perform priming reads of status and to set up operating defaults
- UNKNOWN
  - the standard state used when no other state fits the current conditions
- startingNetwork
  - the ethernet interface does not have an IP address in the range we expect
- running
  - all is OK: our process's pid file exists and contains a process id, and the ethernet interface has an IP address in the range we expect
- FXOhang
  - we have detected a hung state from our FXO modules. In this state, we attempt to reset the devices.
- dead
  - used when we have been in the UNKNOWN state for too long
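How these states relate to each other can be sketched in Python. This is only an illustration under stated assumptions, not the monitor tool's implementation: the function `next_state`, the predicate names, the order in which states are tried, and the `dead_after` threshold (with its units) are all hypothetical.

```python
import re

# Hypothetical predicates over the collected variables. The variable names
# mirror those used in the configuration; the helpers themselves are invented.
def net_up(v):
    return re.search(r"inet[ ]+192.168", v.get("enetStatus", "")) is not None

def pid_ok(v):
    return re.search(r"[0-9]+", v.get("myprocpid", "")) is not None

def fxo_hung(v):
    return "0xff" in v.get("fxostatus", "")

# Assumption: states are tried in this order and the first full match wins.
# The real tool may resolve overlapping states differently.
STATES = [
    ("FXOhang", [fxo_hung]),
    ("startingNetwork", [lambda v: not net_up(v)]),
    ("running", [net_up, pid_ok]),
]

def next_state(variables, current, timer_in_current, dead_after=4):
    """Pick the next state; a state matches only if all its conditions pass."""
    for name, conditions in STATES:
        if all(cond(variables) for cond in conditions):
            return name
    # No state fits: give up if we have been UNKNOWN for too long.
    if current == "UNKNOWN" and timer_in_current >= dead_after:
        return "dead"
    return "UNKNOWN"
```

For example, a network address in range with no pid collected matches none of the three conditional states, so the sketch falls through to UNKNOWN and, once the timer reaches the threshold, to dead.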
There are two scripts which we use to monitor and reset the FXO status:
- check_installed_FXO_status
  - display the current status
- reset_FXO
  - run commands to unload and reload the FXO module
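As a rough sketch of how the two scripts work together, the Python below runs the status script and, if its output contains `0xff` (the marker the configuration tests for in the FXOhang state), runs the reset script. The script paths, the `run` helper, and the `check_and_reset` function are assumptions made for illustration; they are not part of the monitor tool.

```python
import subprocess

# Hypothetical locations; the real scripts live wherever they are installed.
CHECK_CMD = ["./check_installed_FXO_status"]
RESET_CMD = ["./reset_FXO"]

def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def check_and_reset(run_command=run):
    """Return True if a hang was detected and a reset was attempted."""
    status = run_command(CHECK_CMD)
    if "0xff" in status:
        # Hung FXO module: unload and reload it via the reset script.
        run_command(RESET_CMD)
        return True
    return False
```

Passing the runner in as a parameter keeps the logic testable without the real hardware or scripts present.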
# Private and confidential.
#
# Copyright Jazmin Communications Pty Ltd, 2009
# All rights reserved.
#
# Not for external release.
#
# A monitor configuration for the ip04 system
#
#
ENTER START {
    LOG "starting monitor"
    SET CYCLE = 2 # monitor aggressively while booting
    SET enetStatus = RUN "/sbin/ifconfig en1"
}
STATE startingNetwork {
    enetStatus NOT ~ /inet[ ]+192.168/
}
ENTER startingNetwork {
    SET CYCLE = 2 # monitor aggressively while waiting for the network to come back
}
POLL startingNetwork {
    SET enetStatus = RUN "/sbin/ifconfig en1"
}
# we define operational states based on various conditional tests
# if all conditions pass, the monitor enters the given state and runs
# our 'ENTER' method.
# the following reads a .pid file and verifies that it contains a number.
# note that we demonstrate the 'COLLECT' verb here. COLLECT can be used
# to collect data into a variable for later tests. This is largely for
# optimisation.
# The other way to specify the condition for this state is simply:
#
# FILE 'myproc.pid' ~ /[0-9]+/
#
STATE running {
    enetStatus ~ /inet[ ]+192.168/;
    COLLECT myprocpid FROM FILE '"/tmp/myproc.pid"';
    myprocpid ~ /[0-9]+/ # note: current bug, the file name cannot contain a path
}
# if a pid file was found, we log the fact. Every 'CYCLE' seconds, we will test that the
# system is still running.
ENTER running {
    LOG "running ok"
    SET CYCLE = 5 # less frequent monitoring while things are running nicely
}
POLL running {
    SET fxostatus = RUN "'/Users/martin/Desktop/current/Jazmin Communications/check_installed_FXO_status'"
    SET enetStatus = RUN "/sbin/ifconfig en1"
}
# if no states match, the monitor automatically enters the state 'UNKNOWN'
# we can catch this by setting up an enter method:
ENTER UNKNOWN {
    LOG "unknown system state"
}
STATE FXOhang {
    fxostatus ~ /0xff/
}
ENTER FXOhang {
    LOG "detected hang in FXO module"
    SET fxostatus = RUN "'/Users/martin/Desktop/current/Jazmin Communications/reset_FXO'"
}
# if the conditions for no other state match, our monitor enters the
# UNKNOWN state.
# If we have been in the UNKNOWN state for too long (the TIMER test in the
# condition below), we give up and decide the system is dead.
STATE dead {
    CURRENT ~ /UNKNOWN/;
    TIMER >= 4
}
# in this sample, if the system is dead, we simply log the fact and spawn a
# command that appends the current date to /tmp/dates.
ENTER dead {
    LOG "program is not running, restarting monitor";
    SPAWN "/bin/date >>/tmp/dates"
}
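
The pattern tests in the configuration above can be tried out in Python to see what they accept. This is a sketch under assumptions: it treats the `~` operator as an unanchored regular expression search, which the configuration does not confirm, and the stricter variants (`NETWORK_UP_STRICT`, `PID_ONLY`) are hypothetical alternatives rather than anything the monitor tool defines.

```python
import re

# Patterns copied from the configuration above.
NETWORK_UP = re.compile(r"inet[ ]+192.168")
PID_PRESENT = re.compile(r"[0-9]+")

# Hypothetical stricter variants, not part of the configuration.
NETWORK_UP_STRICT = re.compile(r"inet[ ]+192\.168\.")
PID_ONLY = re.compile(r"^[0-9]+\s*$")

def matches(pattern, text):
    """Unanchored search, assumed to be what the '~' operator does."""
    return pattern.search(text) is not None

# Illustrative ifconfig-style line, not captured from a real system.
ifconfig_line = "inet 192.168.1.20 netmask 0xffffff00 broadcast 192.168.1.255"

print(matches(NETWORK_UP, ifconfig_line))          # address in the expected range
print(matches(NETWORK_UP, "inet 192x168"))         # the unescaped '.' accepts any character
print(matches(NETWORK_UP_STRICT, "inet 192x168"))  # the escaped variant does not
print(matches(PID_PRESENT, "no pid 7 yet"))        # any digit anywhere is enough
print(matches(PID_ONLY, "no pid 7 yet"))           # the anchored variant wants digits only
```

If the tool's regular expressions behave like Python's here, escaping the dots and anchoring the pid test would make the `running` state harder to enter by accident.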
