Replication Monitoring with Dead Man's Snitch

Database replication is an amazingly powerful tool. It simplifies backups, helps you maintain a disaster recovery environment and can provide an ad-hoc query environment for database exploration. Like most technologies, however, it isn’t perfect. If replication breaks, you might end up with worthless backups, an out-of-date DR environment or decisions made with old data. To make sure this doesn’t happen, it’s important to monitor replication.

I’ve written about database replication monitoring in the past. It’s been a few years since that post and we’re using different tools in our environment. It seemed like time to revisit how we monitor database replication.

This time around, my goals were:

  • Email notification within a reasonable period when replication fails (a couple of hours is fine for me since we’re mostly using replication as an ad-hoc query environment)
  • Assurance that lack of email meant replication was working
  • Something simple that was either inexpensive or used a tool we were already paying for

After looking at our tools, I decided to test out monitoring using Dead Man’s Snitch. If you haven’t used Dead Man’s Snitch, it’s a tool for monitoring periodic jobs, like cron jobs. You add a bit of code to tell Dead Man’s Snitch when a job completes. If your job doesn’t check in, Dead Man’s Snitch lets you know. It’s a super simple but incredibly powerful tool.

As an example, here is the Chef config that shows how we monitor our backups:

  cron "backup rails app on production" do
    user "deploy"
    minute "0"
    hour "0"
    command "/home/deploy/backup.sh && curl https://nosnch.in/arandomcode"
  end

In this example, if backup.sh fails (exits with a non-zero status), the && short-circuits and curl never runs. The check-in is missed, and Dead Man’s Snitch sends us an alert.
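The short-circuit behavior is easy to try locally. Here is a minimal sketch: check_in and run_then_check_in are made-up names for illustration, and check_in only prints a message where the real setup would call curl.

```shell
#!/bin/bash
# check_in is a stand-in for the real curl call (assumption for this
# sketch); it only records that a check-in would have happened.
check_in() {
    echo "checked in"
}

# Run the given command and check in only if it exits with status 0.
run_then_check_in() {
    "$@" && check_in
}

# A job that succeeds triggers the check-in.
run_then_check_in true

# A job that fails skips the check-in, so the snitch would go unreported.
run_then_check_in false || echo "no check-in"
```

Swapping true and false for a real script gives the same pattern as the cron command above.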

It turns out this same pattern can be used to make sure replication is working. To do this, we need a script that returns a failure code if our MySQL replication has failed. After a bit of trial and error, this is what I came up with:

  #! /bin/bash
  function value_of_replication_field() {
      mysql -u root --password=<%= node["mysql"]["server_root_password"] %> -e "show slave status\G" | grep "${1}" | awk '{print $2}'
  }

  last_io_errno=$(value_of_replication_field Last_IO_Errno)
  last_errno=$(value_of_replication_field Last_Errno)
  seconds_behind=$(value_of_replication_field Seconds_Behind_Master)

  if [ "${last_io_errno}" != "0" ] ; then
    echo "Replication Failed with IO Error: ${last_io_errno}"
    exit 1
  fi

  if [ "${last_errno}" != "0" ] ; then
    echo "Replication Failed with Error: ${last_errno}"
    exit 1
  fi

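  # Seconds_Behind_Master is NULL when the replication threads are stopped,
  # so treat any non-numeric value as a failure.
  if ! [[ "${seconds_behind}" =~ ^[0-9]+$ ]] ; then
    echo "Replication is not running: Seconds_Behind_Master is ${seconds_behind}"
    exit 1
  fi

  if [ "${seconds_behind}" -gt 600 ] ; then
    echo "Replication is behind by more than 10 minutes: ${seconds_behind}"
    exit 1
  fi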

  exit 0

This code uses the MySQL shell to get replication status and then checks for failure conditions, including an I/O error, a replication error, and a replica that is more than 10 minutes behind. If none hold, it returns success. Now, we just need to set up an hourly job in Dead Man’s Snitch and then configure cron to run this script hourly:

  cron "check replication" do
    user "deploy"
    minute "0"
    hour "*"  # every hour; "0" would run the check only at midnight
    command "/home/deploy/check_replication.sh && curl https://nosnch.in/arandomcode"
  end

Now if replication fails, we’ll get an email within an hour using only tools we were already paying for.
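If you want to try the field parsing without a live replica, one option is to feed it canned output. This is only a sketch: status_field is a hypothetical helper that reads from stdin instead of calling mysql, and sample_status is made-up text shaped like "show slave status\G" output for a stopped replica, not output captured from a real server.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one field from
# "show slave status\G" style output supplied on stdin.
# Matching the whole field name keeps Last_Errno from matching
# similarly named fields.
status_field() {
    awk -v key="$1" '$1 == (key ":") { print $2 }'
}

# Made-up sample (not from a real server) with replication stopped.
sample_status='
             Slave_IO_Running: No
            Slave_SQL_Running: No
                   Last_Errno: 0
                Last_IO_Errno: 0
        Seconds_Behind_Master: NULL
'

seconds_behind=$(printf '%s\n' "$sample_status" | status_field Seconds_Behind_Master)

# A stopped replica reports NULL rather than a number, so anything
# non-numeric is reported as unknown instead of being compared.
case "$seconds_behind" in
    ''|*[!0-9]*) echo "Replication lag unknown: ${seconds_behind}" ;;
    *)           echo "Replication lag: ${seconds_behind}s" ;;
esac
```

Both error fields are 0 in this sample, so a check that looked only at the error numbers would report success even though replication is stopped.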