Version 2 (modified by martin, 16 years ago)
initial draft of a startup monitor. Lots to do but demonstrates some of the possible ways to perform various operations

System Monitoring

Our system monitoring is driven by a configuration file for the monitor tool. The current draft of this configuration is shown below.

The initial configuration is based on just a few states:

START: the standard initial state, used to perform priming reads of status and to set up operating defaults
UNKNOWN: the standard state used when no other state fits with the current conditions
startingNetwork: the ethernet interface does not yet have an IP address in the range we expect
running: all is ok; our process pid file exists and contains a process id, and the ethernet interface has an IP address in the range we expect
FXOhang: we have detected a hung state from our FXO modules. In this state, we attempt to reset the devices.
dead: used when we have been in the UNKNOWN state for too long
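
As a rough illustration of how these states relate to their conditions, the following Python sketch models each state as a list of tests that must all pass, with UNKNOWN as the fallback. It is not the monitor tool's implementation: the `STATES` table and `matching_states` function are hypothetical names, the regular expressions are adapted from the draft configuration, and because the draft does not say how the tool chooses between several matching states, the sketch simply returns every match.

```python
import re

# Hypothetical model of the draft configuration: each state maps to a list
# of predicates over a dictionary of collected variables. A state matches
# only if every predicate passes.
STATES = {
    "startingNetwork": [
        lambda v: re.search(r"inet[ ]+192\.168", v.get("enetStatus", "")) is None,
    ],
    "running": [
        lambda v: re.search(r"inet[ ]+192\.168", v.get("enetStatus", "")) is not None,
        lambda v: re.search(r"[0-9]+", v.get("myprocpid", "")) is not None,
    ],
    "FXOhang": [
        lambda v: re.search(r"0xff", v.get("fxostatus", "")) is not None,
    ],
}


def matching_states(variables):
    """Return every state whose conditions all pass, or ['UNKNOWN'] if none do."""
    matches = [
        name
        for name, conditions in STATES.items()
        if all(condition(variables) for condition in conditions)
    ]
    return matches or ["UNKNOWN"]


if __name__ == "__main__":
    print(matching_states({"enetStatus": "inet 192.168.1.5", "myprocpid": "1234"}))
```

The dead state is left out here because it depends on the current state and a timer rather than on collected variables.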

We use two scripts to monitor and reset the FXO status:

check_installed_FXO_status: displays the current FXO status; the FXOhang state looks for 0xff in this output
reset_FXO: runs commands to unload and reload the FXO module
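
The way these two scripts are meant to work together can be sketched in Python as follows. This is only an approximation under stated assumptions: `run_and_capture`, `fxo_looks_hung` and `poll_fxo` are hypothetical helper names, the commands are passed in by the caller rather than hard-coded, and the only detail taken from the configuration is that a status containing 0xff is treated as a hang.

```python
import re
import subprocess


def run_and_capture(command):
    """Run a command given as a list of arguments and return its stdout as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


def fxo_looks_hung(status_text):
    """Apply the same test as the FXOhang state: does the status contain 0xff?"""
    return re.search(r"0xff", status_text) is not None


def poll_fxo(status_command, reset_command):
    """Read the FXO status once and run the reset command if it looks hung.

    Returns True if a reset was attempted, False otherwise.
    """
    status = run_and_capture(status_command)
    if fxo_looks_hung(status):
        run_and_capture(reset_command)
        return True
    return False
```

A caller would pass the paths of the two scripts, for example `poll_fxo(["./check_installed_FXO_status"], ["./reset_FXO"])`, where the relative paths are placeholders.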

#   Private and confidential.
#
#   Copyright Jazmin Communications Pty Ltd, 2009
#   All rights reserved.
#
#   Not for external release.
#
# A monitor configuration for the ip04 system
#
# 

ENTER START {
	LOG "starting monitor"
	SET CYCLE = 2  # monitor aggressively while booting
	SET enetStatus = RUN "/sbin/ifconfig en1"
}

STATE startingNetwork {
	enetStatus NOT ~ /inet[ ]+192.168/
}

ENTER startingNetwork {
	SET CYCLE = 2  # monitor aggressively while waiting for the network to come back
}

POLL startingNetwork {
	SET enetStatus = RUN "/sbin/ifconfig en1"
}

# we define operational states based on various conditional tests
# if all conditions pass, the monitor enters the given state and runs 
# our 'ENTER' method.

# the following reads a .pid file and verifies that it contains a number. 
# note that we demonstrate the 'COLLECT' verb here. COLLECT can be used
# to collect data into a variable for later tests. This is largely for
# optimisation.
# 	The other way to specify the condition for this state is simply:
#
#   FILE 'myproc.pid' ~ /[0-9]+/
#
STATE running {
	enetStatus ~ /inet[ ]+192.168/;
	COLLECT myprocpid FROM FILE '"/tmp/myproc.pid"';
	myprocpid ~ /[0-9]+/ # note: current bug, the file name cannot contain a path
}

# if a pid file was found, we log the fact.  Every 'CYCLE' seconds, we will test that the
# system is still running.
ENTER running {
	LOG "running ok"
	SET CYCLE = 5  # less frequent monitoring while things are running nicely
}

POLL running {
	SET fxostatus = RUN "'/Users/martin/Desktop/current/Jazmin Communications/check_installed_FXO_status'"
	SET enetStatus = RUN "/sbin/ifconfig en1"
}

# if no states match, the monitor automatically enters the state 'UNKNOWN' 
# we can catch this by setting up an enter method:

ENTER UNKNOWN {
	LOG "unknown system state"
}

STATE FXOhang {
	fxostatus ~ /0xff/
}

ENTER FXOhang {
	LOG "detected hang in FXO module"
	SET fxostatus = RUN "'/Users/martin/Desktop/current/Jazmin Communications/reset_FXO'"
}

# if none of the states defined above match, our monitor will enter the
# UNKNOWN state.
# If we have been in the UNKNOWN state for too long (TIMER >= 4 below), we
# give up and decide the system is dead.

STATE dead {
	CURRENT ~ /UNKNOWN/;
	TIMER >= 4
}

# in this sample, if the system is dead, we log the fact and spawn a command
# that appends the current date to /tmp/dates.

ENTER dead {
	LOG "program is not running, restarting monitor";
	SPAWN "/bin/date >>/tmp/dates"
}
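
The dead state combines a test on the current state with a timer. The Python sketch below shows one plausible reading of that rule. It assumes that TIMER counts the time spent in the current state and restarts whenever the state changes; the draft does not confirm this, and `StateTimer` and `is_dead` are hypothetical names used only for illustration.

```python
class StateTimer:
    """Track the current state and the time at which it was entered."""

    def __init__(self, initial_state="START", now=0.0):
        self.state = initial_state
        self.entered_at = now

    def update(self, new_state, now):
        # Assumption: the timer restarts only when the state actually changes.
        if new_state != self.state:
            self.state = new_state
            self.entered_at = now

    def elapsed(self, now):
        """Time spent in the current state so far."""
        return now - self.entered_at


def is_dead(timer, now, limit=4):
    """Mirror the 'dead' conditions: CURRENT is UNKNOWN and TIMER >= limit."""
    return timer.state == "UNKNOWN" and timer.elapsed(now) >= limit
```

Passing the clock value in as `now` instead of reading it inside the class keeps the sketch easy to check by hand with fixed timestamps.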
