Mail Delivery Time Monitoring

After launching our new MagicMail environment last year, we've steadily added more monitoring for that system so we can stay on top of any issues that arise. One check that required a bit of scripting measures how long end-to-end delivery takes on our system.

First of all, to view the code for this system, go here: https://gist.github.com/FRII/9748818

I won't explain every line of those files, but I'll be going over the general concepts of each script.

check_delivery_time.py is an actual Nagios plugin (run via NRPE), as Nagios is our current monitoring system. (This may change in the coming months; look for more posts on that subject later.) In our configuration NRPE executes it every 15 minutes, and it checks the configured Maildir on the filesystem for messages. Anything that 'matches' (the subject line is the full hostname of the current server, as determined by the socket module and the "domain" config parameter) is pulled out and examined. The body of the message contains a floating point Unix timestamp, which is compared against the filesystem creation time of the file in the Maildir. This lets us clock the time between the message being sent and the SMTP daemons writing it to the filesystem. Note that IMAP and POP are NOT involved in this process, as that would measure a totally separate metric; clocking the IMAP/POP server is unrelated to clocking the SMTP server.
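To make the matching and timing logic concrete, here is a minimal sketch of the idea. It is not the code from the gist. The function names are made up for illustration, the way the hostname and "domain" are joined is a guess, and the file's modification time stands in for the "creation time" because that is what os.stat exposes portably.

```python
import email
import os
import socket


def expected_subject(domain):
    """Build the subject line the check looks for on this server."""
    # Assumption: the short hostname is joined to the configured "domain".
    short_name = socket.gethostname().split(".")[0]
    return "{}.{}".format(short_name, domain)


def latest_delivery_delay(maildir, subject):
    """Return the delay in seconds of the newest matching message, or None.

    Every matching message is removed; only the newest one is clocked.
    """
    newest = None  # (written_at, delay) of the newest match seen so far
    for sub in ("new", "cur"):
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                message = email.message_from_binary_file(handle)
            if str(message.get("Subject", "")).strip() != subject:
                continue  # not one of our test messages, leave it alone
            try:
                sent_at = float(message.get_payload().strip())
            except (AttributeError, ValueError):
                continue  # multipart or non-numeric body, not a test message
            # Assumption: mtime approximates when the SMTP daemon wrote the file.
            written_at = os.stat(path).st_mtime
            os.remove(path)
            if newest is None or written_at > newest[0]:
                newest = (written_at, written_at - sent_at)
    return None if newest is None else newest[1]
```

A return value of None means no test message was waiting, which is the case the plugin reports as 'unknown'.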

deliverytimemailsender.py runs on a separate machine and simply sends the desired messages (with hostname subjects and timestamp bodies) over SMTP to each configured server. Fairly straightforward. Be sure to send these MORE frequently than you run the Nagios check; we have ours set to every 5 minutes. check_delivery_time.py removes any matching messages it finds and only clocks the latest one (if anyone wants to update it to average them, be my guest!). If the check runs without finding any matching messages, it reports a Nagios 'unknown' status code, which is why we send multiple messages per check interval: it ensures the check always has some to look at.
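The sending side boils down to something like the sketch below. Again, this is an illustration rather than the script from the gist: build_probe and send_probe are hypothetical names, and the addresses in the usage note are placeholders.

```python
import smtplib
import time
from email.message import EmailMessage


def build_probe(target_host, sender, recipient, now=None):
    """Build a test message: hostname as subject, Unix timestamp as body."""
    if now is None:
        now = time.time()
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = target_host
    # repr() keeps the full floating point precision of the timestamp.
    message.set_content(repr(float(now)))
    return message


def send_probe(message, smtp_host, port=25):
    """Hand the test message to the server under test over plain SMTP."""
    with smtplib.SMTP(smtp_host, port, timeout=30) as connection:
        connection.send_message(message)
```

Calling send_probe(build_probe("mail1.example.com", "probe@example.com", "check@example.com"), "mail1.example.com") from cron every 5 minutes would mirror the schedule described above.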

The last file is just the JSON-formatted config file. Most of those options should usually be the same; the various headers and SMTP parameters are provided separately in case you need them to bypass spam protection or other such measures. In our environment they're all the same address. Keep in mind that the script only attempts SMTP authentication if credentials are supplied in the JSON, so if you're on an IP that can relay, you can simply omit those parameters from the config file.
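The "only authenticate if credentials are supplied" behavior can be sketched roughly like this. Apart from "domain", the key names ("servers", "smtp_user", "smtp_password") are invented for this example and are not necessarily what the real config file uses, and the STARTTLS step before login is my own assumption.

```python
import json
import smtplib

# Hypothetical config; only "domain" is a key name taken from the real setup.
EXAMPLE_CONFIG = """
{
    "domain": "example.com",
    "servers": ["mail1", "mail2"],
    "smtp_user": "probe@example.com",
    "smtp_password": "not-a-real-password"
}
"""


def smtp_credentials(config):
    """Return (user, password) if both are configured, otherwise None."""
    user = config.get("smtp_user")
    password = config.get("smtp_password")
    if user and password:
        return user, password
    return None


def open_smtp(host, config, port=25):
    """Connect to the server, logging in only when credentials are present."""
    connection = smtplib.SMTP(host, port, timeout=30)
    credentials = smtp_credentials(config)
    if credentials is not None:
        # Assumption: the server offers STARTTLS before it accepts AUTH.
        connection.starttls()
        connection.login(*credentials)
    return connection
```

With the two credential keys left out of the JSON, smtp_credentials returns None and the connection is used unauthenticated, which is all a host that is allowed to relay needs.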

To briefly touch on delivery times, and talk up our new mail environment: the average delivery time of our previous MailArmory system was roughly 1 to 5 minutes, depending on server load, message size, and various other factors. MagicMail's delivery time for these test messages is usually about 0.1 to 0.2 seconds. I have our configuration set to warn on a 2-second delivery time and go critical at 5 seconds, and so far it has never even issued a warning. I'd call that a pretty respectable delivery time.
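For anyone curious how those thresholds turn into a Nagios result, here is a rough sketch using the standard plugin exit codes (0 OK, 1 warning, 2 critical, 3 unknown). The function name and message wording are mine, and treating a delay exactly equal to a threshold as a breach is an assumption.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_status(delay, warn=2.0, crit=5.0):
    """Map a delivery delay in seconds to (exit code, status line)."""
    if delay is None:
        # No test message was waiting in the Maildir.
        return UNKNOWN, "UNKNOWN: no matching test messages found"
    # Performance data in the usual label=value;warn;crit form.
    perfdata = "delivery_time={:.3f}s;{};{}".format(delay, warn, crit)
    if delay >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif delay >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, "{}: delivered in {:.3f}s | {}".format(label, delay, perfdata)
```

A plugin would print the status line and exit with the code, so a typical 0.15 second delivery comes back as OK while anything at or past 5 seconds pages someone.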