(reprint of an article we once wrote for CircleID)
Earlier this week we announced our “Proactive Nameservers”, which is just marketing speak for what it really is: hot-swappable nameservers, or nameserver failover.
What is it?
- you define some warm spare nameservers that are not normally in your delegation
- you load those servers up with your zone data (most likely by having it slave from your hidden master)
- you then monitor everything to a) make sure the current delegation is working and b) make sure your backups are “ready” to step in should you need them (the last thing anybody wants to do is swap in nameservers with out-of-date zone data in the event of a failure)
- in the event of a failure or degraded performance with your current delegation, the system can either a) remove problematic servers from your delegation or b) change the delegation entirely to your backup pool.
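The steps above boil down to a simple decision rule. Here is a minimal sketch of that logic in Python; the pool names and the `is_healthy`/`is_ready` probe callbacks are hypothetical placeholders standing in for real DNS health checks, not part of any actual API:

```python
# Sketch of the failover decision described above. Probe callbacks
# return booleans; in a real system they would be live DNS checks.

def choose_delegation(primary_pool, backup_pool, is_healthy, is_ready):
    """Return the nameserver set that should be in the delegation.

    primary_pool / backup_pool: lists of nameserver hostnames
    is_healthy(ns):  True if the server is answering authoritatively
    is_ready(ns):    True if a backup's zone data is current
    """
    live_primaries = [ns for ns in primary_pool if is_healthy(ns)]
    if live_primaries == primary_pool:
        return primary_pool            # everything is fine: leave the delegation alone
    if live_primaries:
        # option a): prune only the problematic servers
        return live_primaries
    # option b): the whole primary pool is down, swap in the ready backups
    return [ns for ns in backup_pool if is_ready(ns)]

# Example with canned probe results (all names are made up):
primary = ["ns1.example.net", "ns2.example.net"]
backup = ["ns3.example.org", "ns4.example.org"]
health = {"ns1.example.net": True, "ns2.example.net": False}
print(choose_delegation(primary, backup, health.get, lambda ns: True))
# -> ['ns1.example.net']
```

The key design point is that the backup pool is only eligible if its zone data checks out as current, which is why the readiness probe is separate from the health probe.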
That’s it. It’s basically what every webmaster, IT department and CTO wishes they had set up before their web hosting provider, registrar or managed DNS provider got blown away in a DDoS or had their “router tables corrupted”, and what they then wish they could do (switch their delegation) but can’t do while that same provider (and all its hosted domains) is down hard.
What you see now are frantic workarounds: people stick the IPs for their nameserver provider into /etc/hosts so they can log into the otherwise unreachable management panel, figure out how to dump their zone records (provided their DNS host even allows that), set them up someplace else, and finally switch the nameserver delegation. By the time they get to this point, it’s usually a) a few hours into the outage before it occurs to anybody that this would work, and b) a few minutes before the outage ends anyway.
So what we’re doing here is setting all that up in advance, monitoring for conditions that require it to happen, and then automating its execution when the circumstances arise.
You may never need this, but it is impossible to know if you will or not.
Why is this better than using multiple DNS providers in your delegation from the outset?
Using multiple DNS providers all the time is, in our minds, a best practice. It may be more work to keep the various solutions talking to each other and in sync, but it’s worth it. We have had numerous customers in the past using both us and one or two other providers concurrently who were not impacted when we or the other providers were DDoS-ed. Earlier this year, the number of our users on our easyRoute53 integration with Amazon’s Route 53 skyrocketed 400% in one day (the day after we got DDoS-ed).
But there are some limitations and some caveats to loading up your nameserver delegation with multiple provider nameservers at the same time:
1) I’ve seen people pile in 10 or 13 nameservers in an effort to achieve super-redundancy, which works up to a point. Unfortunately, this can also bloat the DNS response packet past 512 bytes, triggering truncation and TCP retries on all their queries. That slows things down, and we’ve seen issues lately where mobile clients on some networks don’t handle it gracefully, actually resulting in failed lookups.
2) If you have a lot of nameservers in your delegation and some of them are non-performing (say one of your providers is down), then you again slow things down, because initial queries and cache refreshes will hit your unresponsive nameservers and wait for the ensuing time-outs. With our approach, resolvers only ever query the live, responsive servers.
3) You don’t show all your cards. We didn’t envision this as a DDoS mitigation tactic for a domain that is the direct target of a DDoS (the system works better if your provider goes down because of a DDoS against somebody else, which is probably more likely for most businesses). Having said that, if you are the direct target of a DDoS and you have this in place, you buy yourself some time before the botnet recalibrates and hits your backup pools. You can use that time to arrange or activate other DDoS mitigation, such as GRE tunnels or proxies, which will be a lot easier to set up if you’re actually still able to operate.
4) Finally, there is the syncing issue, which you have with either approach, but at least now there’s a process actively monitoring whether you have a sync problem or not.
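That sync monitoring amounts to comparing each server’s SOA serial against the master’s. A minimal sketch, with canned serial values standing in for what would really come from live SOA queries (the hostnames and serials are invented for illustration):

```python
# Sketch of sync monitoring: flag any nameserver whose zone copy
# lags behind the master. Serials here are canned example values;
# in practice each would come from an SOA query to that server.

def out_of_sync(master_serial, server_serials):
    """Return the servers whose SOA serial differs from the master's."""
    return sorted(ns for ns, serial in server_serials.items()
                  if serial != master_serial)

serials = {
    "ns1.example.net": 2012071001,
    "ns2.example.net": 2012071001,
    "ns3.example.org": 2012070901,   # stale: missed the last zone transfer
}
print(out_of_sync(2012071001, serials))
# -> ['ns3.example.org']
```

A backup server flagged here would be marked not “ready”, so the failover logic would never swap it into the delegation with stale data.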
Where can this go?
Once you break out of the box where nameserver delegations are assumed to be inert, static obelisks that most people (even IT and web professionals) don’t think about much, and make the leap that your delegations can be responsive and actively optimized, the horizons open up:
- you can optimize for response times.
- you can optimize by cost. If you’re familiar with Ruv Cohen’s work with SpotCloud and his spot market for CPU cycles, it becomes possible to envision the eventual emergence of a spot market for DNS responses, especially with global load balancing and geographic targeting becoming more popular.
We think of our idea as a form of “uptime insurance”, one where the remedy isn’t compensation in money but rather continuity of services.
Our implementation is probably the Kitty Hawk version, but dynamic nameserver delegation is an idea whose time has come.