Being on pager duty is a downer. And it’s a big mistake that ruins the on-call or first-responder position when you allocate no-no time to actually invest in a useful reporting platform. Like many before me, I used to create exactly that evil: write a bash script, and throw it in cron.
Example of the Wrong Method
MAILTO=pager.list@example.com */5 * * * * curl http://www.example.com/ || echo "website down"
There’s a pretty good technique to reduce the occurrence of duplicate pages, and that’s to use diff. Let’s work through an example of checking a web page for errors.
PAGE_1="http://localhost/example.php" PREV_DIR=/var/spool/pager_prev RECENT_DIR=/var/spool/pager_recent REF_DIR=/var/spool/pager_refs REF_PAGE="$REF_DIR/example.txt" THRESHOLD_DELTA=3 # adjust this to DO_ALERT=0 ALERT_FILE=/tmp/example.mail MAILTO=server.alerts@example.com
That sets up a series of directories: things presently getting tested get downloaded into the recent directory. Results from the previous test get moved to the previous directory.
[ -f $RECENT_DIR/example.txt ] && mv $RECENT_DIR/example.txt $PREV_DIR [ ! -f $RECENT_DIR/example.err ] && rm -f $PREV_DIR/example.err [ -f $RECENT_DIR/example.err ] && mv $RECENT_DIR/example.err $PREV_DIR
That did the basic move for every previous page request and error determination.
curl -o $RECENT_DIR/example.txt "$PAGE_1" &> $RECENT_DIR/example.err [ $? -ne 0 -a ! -f $PREV_DIR/example.err] && DO_ALERT=1
That will tell us if this was the first time we haven’t been able to finish a page request.
diff $PREV_DIR/example.txt $RECENT_DIR/example.txt > $RECENT_DIR/example.diff if [ $? -ne 0 ] ; then $lines = $( wc -l < /tmp/example.diff ) [ $lines -gt $THRESHOLD_DELTA ] && DO_ALERT=1 fi
That recorded the change in the page if there was one. The end of the page might still say “java.IOException” but we don’t need to get paged unless that error actually changes.
the_diff=`head $RECENT_DIR/example.diff` the_error='' [ -f $RECENT_DIR/example.err -a `wc -l < $RECENT_DIR/example.err` -gt 0 ] && the_error=`cat $RECENT_DIR/example.err` if [ $DO_ALERT -ne 0 ] ; then mail -s "[monitor] example.diff changed" $MAILTO <<
And that emails us the results of a change in page structure.
Or you cold install nagios or Zabbix an get it all.