=^.^=

Quick and Dirty (and Free!) Host Monitoring for DNS Failover and Round-Robin

karma

Round-Robin DNS gets trash-talked a lot because, although it is a cheap and easy way to distribute load, it is counter-redundant: the more A records (servers) there are behind a domain, the more points of failure there are and the lower your mean time to failure is going to be. The good news is that if one of five web servers/reverse proxies is down, only about one fifth of your audience is unable to connect at any given time.
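
To put numbers on that trade-off (assuming, purely for illustration, five independent servers that are each up 99% of the time):

```shell
# chance that at least one of five 99%-available servers is down at any
# given moment -- the availability figure is a made-up example
awk 'BEGIN { p = 0.99; n = 5; printf "%.3f\n", 1 - p^n }'
# prints 0.049
```

So the pool as a whole has a problem about five times as often as a single host would, but each incident only costs you roughly a fifth of your traffic.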

The answer to this problem is host monitoring. If we can update our DNS records to remove the IPs of downed servers, then add them back when the hosts recover, no direct intervention on our part is required. Unfortunately, DNS is a heavily cached system, so we will have to work with reasonably short timeouts. DNS Made Easy recommends a TTL of no less than 180 seconds, as some ISPs are configured to ignore the TTLs of records which they deem too short and default to a much higher value. The drawback to short TTLs is that you will end up receiving more DNS queries, which is a problem if you use a commercial billed-per-million-queries DNS provider like Amazon's Route 53 or EasyDNS's enterprise service.
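
To ballpark that billing impact, note that each caching resolver re-queries at most once per TTL, so the daily query ceiling scales inversely with the TTL. With a hypothetical 10,000 resolvers hitting your records and a 300 second TTL:

```shell
# rough upper bound on daily queries from caching resolvers: one query per
# resolver per TTL expiry -- the resolver count is a made-up example
awk 'BEGIN { resolvers = 10000; ttl = 300; printf "%d\n", resolvers * 86400 / ttl }'
# prints 2880000
```

Cutting the TTL from 300 to 180 multiplies that ceiling accordingly, which is exactly what the per-query billers are counting on.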

If your objective is web server failover that happens instantly, this is simply not the solution for you - you need a load balancer and/or anycast address space. Amazon's Route 53 and DNS Made Easy can be configured to check as often as every minute, and it doesn't make a lot of sense to run a ping/TCP test more often than that. At worst this means the failover system doesn't even know there is a problem for up to 60 seconds. Once the failover system updates the records there may be a short delay while the slave name servers synchronize. Then we have to wait for the record to expire at any given user's ISP's recursive name servers, which could take up to the TTL of your record - or longer if their ISP is manipulative. Then you may have to wait for the record to expire in the caching DNS daemon on their home or office router, and again in their OS's or browser's DNS cache. All told this could take up to 15 minutes even if you use a very low TTL like 180.

So the question is: you already have DNS infrastructure. Why pay these large DNS outfits for host monitoring and DNS failover when it's not all that great to begin with, and you can do it just as well as they can?

Just because BIND doesn't have built-in support? Pshaw!

You could just as easily do the host monitoring with nagios/icinga, or use the mysql-bind backend or some other database-backed name daemon, but in this article I'll show you how to drop in a simple shell script that works with your existing BIND installation, because it demonstrates how mind-numbingly simple this is and why it shouldn't be charged for as a premium service.

Observe a typical zone file with round-robin:

$TTL 6400       ; default TTL
@       IN      SOA     ns1.somedomain.com. admin.somedomain.com. (
                                201305140       ; Serial
                                28800           ; Refresh
                                7200            ; Retry
                                604800          ; Expire
                                600 )           ; TTL Minimum
@               IN      A       10.0.0.10
@               IN      A       10.0.0.11
@               IN      A       10.0.0.12
@               IN      A       10.0.0.13
@               IN      A       10.0.0.14
*               IN      A       10.0.0.10
*               IN      A       10.0.0.11
*               IN      A       10.0.0.12
*               IN      A       10.0.0.13
*               IN      A       10.0.0.14
ns1             IN      A       10.0.1.10
ns2             IN      A       10.0.1.11
@               IN      NS      ns1.somedomain.com.
@               IN      NS      ns2.somedomain.com.
www             IN      CNAME   somedomain.com.

Our SOA contains the serial, which the script will have to update if our changes are to propagate properly. In the zone file on the master server(s), replace the SOA and the block of round-robin A records with $INCLUDE statements like this:

$INCLUDE "/var/bind/soa.include"
$INCLUDE "/var/bind/ips.include"
ns1             IN      A       10.0.1.10
ns2             IN      A       10.0.1.11
@               IN      NS      ns1.somedomain.com.
@               IN      NS      ns2.somedomain.com.
www             IN      CNAME   somedomain.com.

Do this for every zone file which is to use this pool of A records. Now we have a centralized place to put the IPs and serial number that come from the shell script.
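
A zone that silently misses one of these $INCLUDE lines will serve an empty pool, so a quick sanity check before trusting a reload is cheap insurance. This is only a sketch - the demo builds a throwaway directory with hypothetical zone names instead of pointing at a real chroot:

```shell
#!/bin/sh
# Sketch: verify every zone file in a directory pulls in the shared
# ips.include.  Paths and zone names are illustrative.
zonedir=$(mktemp -d)
printf '$INCLUDE "/var/bind/soa.include"\n$INCLUDE "/var/bind/ips.include"\n' \
    > "$zonedir/somedomain.com.zone"
printf '@ IN NS ns1.otherdomain.com.\n' > "$zonedir/otherdomain.com.zone"

missing=""
for zone in "$zonedir"/*.zone; do
    if ! grep -q 'ips.include' "$zone"; then
        missing="$missing $(basename "$zone" .zone)"
        echo "missing include: $(basename "$zone")"
    fi
done
rm -rf "$zonedir"
```

Run from cron or by hand before a mass reload, a report like this catches a forgotten zone before your users do.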

Create the script on the master name server and chmod +x it; don't forget to update the paths to reflect your DNS setup. Also note that I'm adding a wildcard subdomain to the pool:

#!/bin/bash
HOSTS="10.0.0.10 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14"
COUNT=4
echo "; Generated by monitor.sh $(date)" > /chroot/dns/var/bind/ips.include
for myHost in $HOSTS
do
  count=$(ping -c "$COUNT" "$myHost" | grep 'received' | awk -F',' '{ print $2 }' | awk '{ print $1 }')
  if [ "${count:-0}" -eq 0 ]; then
    # 100% failed, or ping printed no summary line at all
    echo "$(date) $myHost is down" >> /var/log/monitor.log
  else
    echo "@               IN      A       $myHost" >> /chroot/dns/var/bind/ips.include
    echo "*               IN      A       $myHost" >> /chroot/dns/var/bind/ips.include
  fi
done

echo "; Generated by monitor.sh $(date)
\$TTL 300       ; default TTL
@       IN      SOA     ns1.somedomain.com. admin.somedomain.com. (
                                $(date +%s)      ; Serial
                                300             ; Refresh
                                60              ; Retry
                                86400           ; Expire
                                300 )           ; TTL Minimum" > /chroot/dns/var/bind/soa.include

rndc reload

This script pings each host in the HOSTS list four times. If at least one reply is received, the host is written to a new version of ips.include (note that the line inserting the date uses a single > to truncate the file, while the later lines append with >>). If all four pings fail, or ping produces no summary line at all, a message is recorded in /var/log/monitor.log. You may want to adjust the number of pings and the failure tolerance, or replace the logging line with an e-mail notification. Once the ping tests are done, a new soa.include is written with an epoch serial number and the zones are reloaded.
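
The parsing pipeline can be checked against canned output without touching the network. The summary line below is what iputils ping prints on typical Linux systems; other ping implementations format it differently, which is worth verifying on your platform:

```shell
# extract the received-packet count from a typical iputils ping summary line
summary='4 packets transmitted, 3 received, 25% packet loss, time 3004ms'
count=$(echo "$summary" | grep 'received' | awk -F',' '{ print $2 }' | awk '{ print $1 }')
echo "$count"
# prints 3
```

If a host is completely unreachable, ping may emit no summary line at all and $count comes back empty, which is why the script guards the comparison with ${count:-0}.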

At the end of execution you should see something like this in ips.include:

; Generated by monitor.sh Tue May 14 16:15:26 EDT 2013
@               IN      A       10.0.0.10
*               IN      A       10.0.0.10
@               IN      A       10.0.0.11
*               IN      A       10.0.0.11
@               IN      A       10.0.0.12
*               IN      A       10.0.0.12
@               IN      A       10.0.0.13
*               IN      A       10.0.0.13
@               IN      A       10.0.0.14
*               IN      A       10.0.0.14

And in soa.include:

; Generated by monitor.sh Tue May 14 16:15:26 EDT 2013
$TTL 300       ; default TTL
@       IN      SOA     ns1.somedomain.com. admin.somedomain.com. (
                                1368562526      ; Serial
                                300             ; Refresh
                                60              ; Retry
                                86400           ; Expire
                                300 )           ; TTL Minimum

Note that, depending on your environment, you may need to chown the .include files to the named user after they are created the first time.
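
The script is meant to run from cron; an entry along these lines (the script path is an assumption, substitute your own) checks the pool every five minutes:

```
# /etc/cron.d/dns-monitor -- script location is hypothetical
*/5 * * * * root /usr/local/sbin/monitor.sh >/dev/null 2>&1
```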

I switched from using the widely popular YYYYMMDDID format to epoch time, since the 5 minute interval requires hours, minutes and seconds to be distinct and the 14-digit YYYYMMDDHHMMSS is too large a value for BIND. This resulted in a lower serial value than before - you may have to go around to your slaves and manually delete and then reload their zone files.
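
To see why the timestamp format doesn't fit: zone serials are unsigned 32-bit integers, capped at 4294967295, so a 14-digit YYYYMMDDHHMMSS value overflows while epoch seconds stay under the limit until 2106:

```shell
# zone serials are unsigned 32-bit: maximum 4294967295
max=4294967295
test 20130514161526 -gt "$max" && echo "YYYYMMDDHHMMSS overflows"
test "$(date +%s)" -le "$max" && echo "epoch fits"
```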

This approach ends up generating a lot of NOTIFY traffic, since every 5 minutes (or whatever interval you cron the shell script at) a new serial is loaded and all of your slaves have to be contacted. A more graceful improvement would be to save each host's state in a temporary file and only update the serial when there has actually been a change in the status of your pool.
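
A minimal sketch of that improvement, with illustrative paths: keep a copy of the previous run's ips.include and only bump the serial when the fresh one differs.

```shell
#!/bin/sh
# Only bump the serial when pool membership actually changes: compare the
# freshly generated ips.include against the copy kept from the last run.
# Paths are illustrative stand-ins for the real chroot locations.
new=/tmp/ips.include.new
state=/tmp/ips.include.last

# stand-in for the output of the ping loop
printf '@ IN A 10.0.0.10\n@ IN A 10.0.0.11\n' > "$new"

if [ -f "$state" ] && cmp -s "$new" "$state"; then
    echo "pool unchanged, skipping serial bump and reload"
else
    echo "pool changed, bumping serial and reloading"
    cp "$new" "$state"
    # ...regenerate soa.include and run rndc reload here...
fi
```

With this in place the slaves only hear a NOTIFY when a host actually joins or leaves the pool.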

Another neat thing I thought of trying is using something like heartbeat for real-time monitoring and nsupdate to dynamically update the zone files. This should narrow the propagation latency on your side of the equation down to the bare minimum.
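
For the curious, the dynamic-update batch that yanks a dead host would look something like this. The server, zone and IP are the example values from this article, and the key path in the comment is a placeholder you'd substitute with your own TSIG key:

```shell
# write an nsupdate batch that removes a downed host's A records from the
# zone; server/zone/IP are the article's example values
cat > /tmp/failover.nsupdate <<'EOF'
server 10.0.1.10
zone somedomain.com
update delete somedomain.com. A 10.0.0.12
update delete *.somedomain.com. A 10.0.0.12
send
EOF
cat /tmp/failover.nsupdate
# feeding it to the master would then be:  nsupdate -k /path/to/key /tmp/failover.nsupdate
```

The upside over rewriting zone files is that the master applies the change immediately and notifies the slaves itself; the downside is that the zone must allow dynamic updates, which changes how you manage the files on disk.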
