Replication monitoring with monit

During my MBA, I always heard “what you measure is what you get.” That seems to be true in IT as well. If you want your services to stay up and running, you need to make sure they are monitored.

I’m terrified of backups. That’s not quite true: I’m terrified that I don’t have good backups. They’re so easy to ignore. I mean, you don’t need them 99.9% of the time. On top of that, they’re incredibly hard to get right.

Most people can do filesystem backups; they aren’t that hard. But how do you back up a 15GB MySQL database? Most people probably just back up the data files and hope. I don’t like that solution. I really like to have nice, consistent backups. To make sure I can get a good backup without impacting the performance of our production environment, I use replication.

Our main database server replicates to a standby server. Along with being ready to take over in case of a failure, the standby is what we use for backups. Here’s our backup script:

#! /bin/bash
DATE=$(date +"%Y%m%d")
#Dump the facebook DB
mysql -u root -e "stop slave sql_thread;"
mysqldump -u root --all-databases -q -e | bzip2 - >/data/backup/backups/facebook-db-backup-${DATE}.dmp.bz2
mysql -u root -e "start slave;"

It’s really simple. First, it stops the sql_thread portion of replication. That means we keep copying changes from production; we just don’t apply them to this copy of the database. Once that is done, we use mysqldump to do a full backup. Once the backup is done, we restart the replication slave. Simple, right? So why am I so scared?

I’m nervous that replication won’t get restarted. If that happened, we would no longer have a good backup. I’m terrified that I’ll go to restore the database and find out that my data is three months old. That type of thing keeps me up at night.
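
One way to shrink that risk inside the script itself is to make the restart step unconditional. Here’s a minimal Ruby sketch of the idea, not our production script: with_sql_thread_paused and its runner argument are names invented for this example, and the runner is just anything that executes a command string. Ruby’s ensure clause runs the restart even when the dump raises.

```ruby
# Hypothetical helper (not part of the original backup script):
# pause the SQL thread, run the given block, and always resume replication.
def with_sql_thread_paused(runner)
  runner.call('mysql -u root -e "stop slave sql_thread;"')
  yield
ensure
  # Runs whether the block finished normally or raised
  runner.call('mysql -u root -e "start slave;"')
end

# In real use the runner would shell out, e.g. ->(cmd) { system(cmd) }.
# Here it only records the commands so the sketch runs without MySQL.
commands = []
recorder = ->(cmd) { commands << cmd }

begin
  with_sql_thread_paused(recorder) { raise "dump failed" }
rescue RuntimeError
  # The failure still propagates, but only after replication was resumed
end

puts commands
```

That only covers failures inside the backup itself. Replication can still stop for other reasons, which is where monitoring comes in.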

Luckily, Jeremy installed monit on all of our servers a few months ago. In just a few hours, I cooked up the following scripts to monitor replication. First, here’s a Ruby script that touches a file on the filesystem if replication is running okay. I run this every minute from cron.

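#! /usr/bin/env ruby
require 'mysql'
# Connect to MySQL as root
conn = Mysql.new('', 'root')
h = conn.query("show slave status").fetch_hash
# Touch the watchdog file only when both replication threads are running
unless h.nil?
  if h["Slave_IO_Running"] == "Yes" and h["Slave_SQL_Running"] == "Yes"
    system("touch /var/run/monit/watchdog")
  end
end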
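
The check itself is small enough to pull into a function, which makes it easy to try out without a live database. This is just a sketch: replication_healthy? is a name I made up for illustration, and it assumes the hash has the same shape as the one fetch_hash returns above.

```ruby
# Hypothetical helper: decide health from one row of "show slave status".
# status is a hash of column name => value, or nil when there is no row.
def replication_healthy?(status)
  return false if status.nil?
  status["Slave_IO_Running"] == "Yes" && status["Slave_SQL_Running"] == "Yes"
end

# Hand-written sample rows standing in for real query results
both_running = { "Slave_IO_Running" => "Yes", "Slave_SQL_Running" => "Yes" }
sql_stopped  = { "Slave_IO_Running" => "Yes", "Slave_SQL_Running" => "No" }

puts replication_healthy?(both_running)
puts replication_healthy?(sql_stopped)
puts replication_healthy?(nil)
```

Only the first sample counts as healthy, so only that case would touch the watchdog file.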

With that code in place and running from cron, we can ask monit to watch our slave.

check file DbSlaveReplication with path /var/run/monit/watchdog
  if timestamp > 2 minutes then alert

check process mysql with pidfile /var/run/mysqld/
  group database
  start program = "/etc/init.d/mysqld start"
  stop program = "/etc/init.d/mysqld stop"
  if failed host port 3306 then restart
  if 5 restarts within 5 cycles then timeout
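
To make the timestamp rule concrete, here’s roughly the question monit asks about the watchdog file on each cycle, sketched in Ruby. This is an illustration of the idea, not monit’s implementation: watchdog_stale? is a hypothetical helper, the 120-second default mirrors the 2 minutes in the config above, and treating a missing file as stale is my own assumption.

```ruby
require 'tempfile'

# Hypothetical helper: true when the file at path was last modified
# more than max_age_seconds ago. A missing file counts as stale (assumption).
def watchdog_stale?(path, max_age_seconds = 120, now = Time.now)
  return true unless File.exist?(path)
  now - File.mtime(path) > max_age_seconds
end

# Try it on a temporary file instead of the real watchdog
file = Tempfile.new('watchdog')
puts watchdog_stale?(file.path)   # checked right after creation

# Backdate the file's timestamps by five minutes and check again
five_minutes_ago = Time.now - 300
File.utime(five_minutes_ago, five_minutes_ago, file.path)
puts watchdog_stale?(file.path)

file.close!
```

If the cron job stops touching the file, whether because replication broke or the script itself died, the file ages past the threshold and the alert fires.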

That’s all it takes to make sure replication is running in our environment. It’s just a little bit of code, but it helps me sleep better at night.