CMU DNS: Cleaning out the Augean Stables

G. Somlo, May 2010

1. Intro, Motivation, and Overview

DNS (Domain Name System) is one of the most highly visible services supported by the Network Group. Any outage is almost guaranteed to be immediately noticed by end users, and, as such, it is of utmost importance to ensure the high availability of the service through careful design for redundancy, failover, and load sharing.

The purpose of this writeup is primarily to document the new, recently deployed system. However, to better illustrate our progress, and justify some of the technical decisions that were made, I will compare and contrast key differences between the current state of affairs and the original point from where I started.

In a nutshell, we started out with a system consisting of 18 servers interacting in ways both complex and unpredictable, with configurations automatically generated by a poorly understood NetReg subsystem (dns-config.pl) which both lacked documentation and was perfectly willing to generate broken, unloadable DNS configurations without so much as a warning. We now have a three-tiered architecture, where all servers within a tier or class are identically configured, allowing for virtually free horizontal scaling (cost of hardware notwithstanding). With geographical redundancy built in, this system could be deployed on as few as five total servers (we currently use 10, for symmetry -- see 3.1). The dns-config.pl configuration file generator has been completely rewritten, and will now issue useful warnings while refusing to generate broken configurations. The NetReg mechanism by which DNS servers get populated with zone data and configuration is now well understood and documented (see Section 4).

The remainder of this document is organized as follows: Section 2 delves into the various underlying mechanisms for implementing IP anycast, which is the essential building block for providing both load sharing, and seamless failover in the event of hardware failure or maintenance. Section 3 showcases the overall architecture of the new system, explains the role played by each of the three server tiers, and describes how the tiers interoperate with each other. Section 4 documents NetReg's DNS Server Group service (read: object) type, the primary interface by which a NetReg admin is currently expected to make DNS changes. Our current methodology for supporting DNSSec is presented in Section 5. Section 6 concludes with ideas for future work, such as offering service on IPv6, and improving NetReg to better support the management of restricted DNS views for special-purpose clients.

2. Failover and Load-Sharing with IP Anycast

2.1. Built-In DNS Failover

DNS has a very simple built-in failover mechanism: all speakers of the DNS protocol (clients as well as servers) typically encourage (and often require) that at least two server IP addresses be configured for every DNS "reference". To illustrate this point, let's consider DNS end clients (stub resolvers, in DNS parlance) found on most users' desktop, laptop, and handheld devices. In addition to a primary DNS server address to which all queries are sent by default, the device asks (and sometimes insists) that at least one more, secondary DNS server address be configured, to which queries are sent in the event the primary server becomes unavailable. On our campus network, the two (or more) DNS server IP addresses are automatically set via DHCP, but can still be viewed by running ipconfig /all from a Windows shell, or by viewing the contents of /etc/resolv.conf on Mac and Linux machines. Similarly, when DNS servers refer to one another for purposes such as forwarding or delegation, best practices encourage using multiple server IPs for the primary and backup target of the forward or delegation, respectively.
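
For illustration, here is what a stub resolver configuration with a primary and secondary DNS server might look like in /etc/resolv.conf on a Mac or Linux client (a sketch only; the two addresses shown are the anycast service IPs used as examples elsewhere in this document):
	search cmu.edu
	nameserver 128.2.1.10
	nameserver 128.2.1.11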

This failover method is not always sufficient in real-world, real-time scenarios. DNS record lookups (e.g., using 'host', or from within a Web browser) will merely take a few seconds longer (while the secondary server is being queried), and go mostly unnoticed. There are, however, situations when a few extra seconds of DNS response delay leads to highly visible outages. For example, the sshd remote login service will attempt a reverse lookup on the connecting clients' IP addresses (for reasons as trivial as attempting to generate a more informative log entry). When the lookup doesn't complete within sshd's set time limit, the login attempt is timed out, resulting in denial of service to ssh clients. It is, of course, possible to configure sshd to forgo reverse DNS lookup on connecting clients. However, rather than externalize this cost across all system administrators, along with addressing all other cases where real-time DNS responses are required, it is much more economical to just make DNS more reliable, as demonstrated in the remainder of this section.
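
For reference, the sshd-side workaround mentioned above amounts to a single sshd_config line; it is shown here only to illustrate the cost being externalized onto every host, not as a recommendation:
	# /etc/ssh/sshd_config -- skip reverse DNS lookups on connecting clients
	UseDNS no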

Another highly visible outage scenario occurs when DNS forwarding is used. When a DNS server forwards a client query to another server, the client may give up before a second forwarder has a chance to respond. In effect, having secondary (and tertiary, etc.) forwarders may be a complete waste of resources. This specific failure scenario was observed in the field only a few months ago.

2.2. Basics of IP Anycast

IP Anycast is a technique by which one or more IP addresses are advertised using the underlying routing protocol from more than one physical location. As applicable to DNS, we configure a well-known DNS IP address (such as, e.g., 128.2.1.10) on multiple servers, and then announce host routes to this address pointed at each server.
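
On a Linux-based server, for instance, configuring the well-known service address locally can be as simple as adding it to the loopback interface (a sketch; the exact tooling varies by platform):
	# add the anycast DNS service IP as an additional loopback address
	ip addr add 128.2.1.10/32 dev lo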

The primary application for anycast is load sharing: each server's advertised anycast IP route "attracts" the "closest" clients' queries, where "distance" is determined by the routing protocol. As will become apparent in Section 3, this also means that server placement within the network is very important to ensure an even load distribution.

Failover can be accomplished if the individual routes comprising an anycast virtual server IP are announced conditionally, based on each real server's availability, and can be quickly withdrawn in the event of a failure. Once a real server's anycast route to the virtual IP is pulled, the underlying routing protocol will redistribute client queries across the remaining real servers.

2.3. Anycast with Quagga

This method was in use on the recently phased-out campus DNS infrastructure, before the upgrade. Each DNS server was expected to participate in the campus backbone routing protocol (OSPF), and announce the virtual DNS service IP(s) it had configured as extra loopbacks using Quagga (an open source routing software package).

This method suffered from severely limited failover capabilities. In the event of the entire machine losing power, network connectivity, or crashing, its announced anycast route went away along with its availability for DNS service, which in turn allowed clients to be re-routed to other available servers. However, in the event of a failure limited to the DNS software (bind/named), the Quagga routing daemon would continue fulfilling its intended purpose (to announce routes for locally configured loopback IPs, in this case), causing a denial of service to those clients which had the misfortune of having their DNS queries routed to this server. A partial outage of this type would last until, through manual intervention, either the DNS software was restored to operational status on the server, or Quagga was shut down to stop announcing the DNS anycast IP and thus stop attracting clients whose queries would then be dropped.

One potential way to improve this situation would have been to add support for probes and conditional route announcements to Quagga (similar to SLA, see 2.5) and submit the patch upstream. Another, less robust, but still workable solution would have been to write a small script to shut off Quagga in the event that bind/named was unresponsive (e.g. a simple cron job), as sketched below.
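
A minimal sketch of such a watchdog follows (hypothetical, never deployed; it assumes a local dig binary and a distribution-provided init script for Quagga):
	#!/bin/sh
	# If the local named stops answering, stop Quagga so that the
	# anycast route is withdrawn from OSPF and clients go elsewhere.
	if ! dig +time=2 +tries=1 @127.0.0.1 cmu.edu A > /dev/null 2>&1; then
	    /etc/init.d/quagga stop
	fi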

Instead, the adopted solution was one entirely worthy of Rube Goldberg. An extra machine-specific loopback IP was set up on each server, and added to the list of IPs monitored for DNS service availability, the idea being that a bind/named failure on a specific box could at least be detected and flagged for manual intervention, even while the shared anycast IPs continued to be answered by other servers. Additionally, if clients were configured with two anycast service IPs (primary and secondary, see 2.1), these two IPs had to be supported on entirely separate and non-overlapping pools of hardware. In the event that bind/named failed on a server announcing both primary and secondary anycast IPs via a still-operational Quagga daemon, clients failing over from primary to secondary would have their queries routed back to the same failed server, guaranteeing a full DNS blackout instead of simply degraded service.

In conclusion, once the appropriate number of servers to support the primary DNS anycast IP was determined, it had to be doubled in order to support a secondary IP for the same client population, requiring a ridiculously high number of DNS servers which spent the majority of their time idling.

2.4. Anycast with SLB

SLB (Server Load Balancer) is a feature included with the Cisco 6500 hardware platform on which our campus routing infrastructure is built. While SLB is mainly intended to offer rich support for various load balancing policies, we use it for its ability to probe service availability on real servers and to conditionally announce (anycast) routes for virtual service IPs based on the probes' results. Our new, upgraded DNS infrastructure currently uses SLB to implement IP anycast, offering both load sharing (one real server for each tier -- see Section 3 -- advertised from each distribution router) and fast failover (announcements are withdrawn within 10 seconds if probing detects a failed server).

With SLB, individual servers no longer have to run anything beyond bind/named, greatly simplifying administration and monitoring. However, the most important gain obtained by moving anycast support off the servers and back into the network infrastructure was that the same hardware pool of identically configured machines can now support both primary and secondary anycast IPs. Since SLB immediately removes all anycast routes pointing at a failed server, we have eliminated all scenarios in which client traffic would have been sent to failed servers.

For an example of how SLB is configured to manage and monitor DNS servers, consider the following fragment of IOS configuration deployed on each distribution router (see Figure 1 for the relevant network topology elements of the campus backbone, post-DNS-overhaul):
	ip slb probe CMUDNS dns
	 lookup cmu.edu
	!
	ip slb serverfarm CMUDNS
	 predictor leastconns
	 probe CMUDNS
	 !
	 real IP.OF.REAL.DNS.SERVER
	  no faildetect inband
	  inservice
	!
	ip slb serverfarm CMUDNS-BAK
	 predictor leastconns
	 !
	 real IP.OF.CORE.ROUTER
	  no faildetect inband
	  inservice
	!
	ip slb vserver CMUDNS-UDP-10
	 virtual 128.2.1.10 udp dns
	 serverfarm CMUDNS backup CMUDNS-BAK
	 advertise active
	 idle 4
	 inservice
SLB "vservers" must be configured per-IP, per-protocol, but vserver snippets for TCP and a secondary service IP, 128.2.1.11 were omitted for brevity.

Each distribution router will forward DNS queries to the local real server, if it is available. In the event of a failure, queries are routed to the core, which then evenly spreads them out across the remaining distribution routers with operational DNS servers.

2.5. Anycast with SLA

SLA (Service Level Agreements) is another technology available on the Cisco 6500 platform, beginning with IOS version 12.2(33)SXI. It directly allows the insertion of conditional route statements into the running configuration, and is therefore much simpler and more lightweight than SLB. Also, SLA has a much better chance of becoming supported under IPv6 in future Cisco IOS releases. We plan to switch our DNS IP anycast implementation to SLA as soon as all distribution routers are upgraded to SXI.

The equivalent configuration snippet for implementing anycast using SLA is:
	ip sla 100
	 dns cmu.edu name-server IP.OF.REAL.DNS.SERVER
	 frequency 10
	 exit
	!
	ip sla schedule 100 life forever start-time now
	!
	track 100 ip sla 100
	 exit
	!
	ip route 128.2.1.10 255.255.255.255 IP.OF.REAL.DNS.SERVER track 100
	ip route 128.2.1.11 255.255.255.255 IP.OF.REAL.DNS.SERVER track 100
Packets destined for one of the DNS service IPs will simply be routed to the real server, but only as long as the monitor confirms the service is indeed available. Otherwise, the local route(s) are withdrawn, and client traffic is free to follow OSPF to whichever other distribution router still announces routes to the service. NOTE: In the above example, the real server is sent a query for cmu.edu, and an A record is specifically expected in return. There is currently no way to specify a different record type, so we must be sure to query for a known A record to determine DNS service availability on the probed host.

3. Overall Architecture

3.1. Server Tiers and Network Placement

Figure 1. Network Topology vs. DNS Server Placement

Prior to the upgrade, we had a relatively large number of DNS server classes, each responsible for a separate subset of our (sub)domains, which made the overall system unnecessarily difficult to comprehend. To illustrate, we had:
Within each of the first three main categories (except caching, of course), one server was designated as master, while the rest acted as slaves. Sometimes the master itself was used to serve queries, and sometimes it was used in "shadow" mode, where it was only allowed to serve zone transfer requests from its slaves, which in turn supported the publicly visible service.

After the DNS infrastructure overhaul, only two server classes remain: caching and authoritative. Each class is accessed via two anycast service IPs (primary and secondary), which are both supported on every class member. For reasons of symmetry (so we wouldn't have to designate two distribution routers as "special" by having them host DNS servers) we have one cache and one authoritative server connected to each distribution router. This way, not only are all DNS servers equal (within their designated class), but all distribution routers are also equal (by supporting the same anycast statements to advertise the anycast DNS service IPs). Last, but not least, each distribution router's clients are exactly one Layer-3 routing hop away from their default DNS server. A simplified network diagram illustrating DNS server placement is shown in Figure 1.

3.2. Server Interactions

Figure 2. Overview of the CMU campus DNS architecture

The only special case server is the "shadow" authoritative master, which is the single point of update for all external systems (such as NetReg, dhcp, lbnamed, etc.). All anycast-enabled authoritative servers are configured as slaves to this machine. Redundancy and failover for the shadow master is accomplished through VMotion, as the shadow master is deployed as a guest on our VM infrastructure. Since queries for authoritative DNS data are only ever sent to the authoritative anycast service IPs, a temporary software failure on the shadow master will only impede the ability to enact further DNS updates, but never prevent querying existing data. An illustration of how the various servers interact with each other is given in Figure 2.
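
In named.conf terms, each anycast-enabled authoritative server is configured with slave zone statements along the lines of the following sketch (zone name taken from earlier examples; the master address and file path are placeholders):
	zone "cmu.edu" {
		type slave;
		masters { IP.OF.SHADOW.MASTER; };
		file "slave/cmu.edu.zone";
	};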

We can now count the minimum number of servers required to support this architecture with physical redundancy: two caching servers, two authoritative slaves, and the shadow master, for a total of five machines. As explained earlier, we run five caches and four authoritative slaves (without much extra effort, since all caches and all authorities are configured identically to each other), for a total of 10 servers.

3.3. Bypassing the DNS Root

Another issue related to how our servers interact is configuring caches to bypass the DNS root. This is a requirement when resolving *.local names or RFC1918 reverse lookups such as *.172.in-addr.arpa. Since these domains are expected to be private to each site that supports them, the DNS root servers must necessarily fail to resolve any related queries. Therefore, caches that begin resolving such queries at the DNS root will be guaranteed to fail, put undue load on the root servers (poor netiquette), and advertise the lack of skill of their administrators to the Internet community.

Caches must instead be configured to begin recursion for any *.local or reverse RFC1918 queries directly at our own authoritative servers. It's a handy extra improvement in response time to also bypass the DNS root for *.cmu.edu, *.2.128.in-addr.arpa, and any other top-layer domain known to be hosted on our authoritative servers.

Two methods exist to configure a cache to bypass the DNS root: zones can be configured as either "forward" or "stub". The former was used extensively with our previous DNS infrastructure, but has now been abandoned in favor of the latter, which brings about vast improvements in simplicity and flexibility.

To bypass the DNS root for all *.cmu.local lookups using "forward", we'd have to start by adding the following to the cache's named.conf file:
	zone "cmu.local" {
		type forward;
		forwarders {IP.OF.AUTH.1; IP.OF.AUTH.2; ...};
	};
If a query were received for something.cmu.local, it would be forwarded to the first IP address on the forwarders list. The caching server would expect a full answer, which it would then return to its client, and also cache for future reference. The big drawback to this method is that any server listed as a forwarder is expected to either be able to answer the query authoritatively, or be willing to recurse on the forwarding cache's behalf. Answering with a delegation is considered a forwarding failure, and the cache would proceed to query the next IP on its list of forwarders. Assuming that, on the forwarders, a subdomain of cmu.local (e.g. cs.cmu.local) was delegated to another set of servers, it would be necessary to either configure the forwarders to allow recursion from our cache, or to add another entry for zone "cs.cmu.local" to the caching server's named.conf, forwarding to the new set of authorities for that domain.
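
In other words, every sub-delegation had to be mirrored on the caches with yet another forward block along these lines (a sketch; the server names are placeholders):
	zone "cs.cmu.local" {
		type forward;
		forwarders {IP.OF.CS.AUTH.1; IP.OF.CS.AUTH.2; ...};
	};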

Needless to say, using this method made our old server configuration brittle and complex, by requiring that data (delegations are accomplished by inserting NS records into the parent zone) be replicated in configuration (the same servers mentioned in the NS delegation records now had to be listed as forwarders in named.conf on the caches). It also required that our authoritative servers act as caches (but only to our caching servers), adding to the overall Rube Goldberg-esque nature of the entire system.

The proper solution to the problem of bypassing the DNS root is to use "stub" zones instead:
	zone "cmu.local" {
		type stub;
		masters {IP.OF.AUTH.1; IP.OF.AUTH.2; ...};
	};
This statement simply forces the cache to always keep a copy of the zone's NS records up to date from the listed masters. The other piece of the puzzle is the fact that the DNS recursion algorithm (by which a cache starts at the DNS root and keeps following delegations until it finds a name server which can answer authoritatively) is, in fact, opportunistic. The recursion only starts at the DNS root if there are no better (i.e. more specific) servers already cached. In conclusion, using "stub" will not only bypass the root, but also allow the cache to follow any sub-delegations (e.g. to cs.cmu.local) without requiring them to be spelled out in the actual config file.

Using "stub" allowed a much cleaner and simpler configuration of our servers: authorities are no longer expected to perform recursion for anyone, and caches only need explicit configuration entries for the top layer (local, cmu.edu, etc.) of domains we serve, while subdomains are resolved by following any applicable delegations returned by the authorities.

4. NetReg and DNS

Campus DNS servers continue to have their configuration and zone data generated and pushed from NetReg (our homebrew IPAM system). At a high level, DNS management in NetReg occurs via two types of objects (or "service groups"): DNS Server Group and DNS View Definition. Two scripts, dns-config.pl and dns.pl, generate configuration and zone files, respectively. At a lower level, the configuration elements of the NetReg service groups were lacking documentation, and the scripts were poorly written (and therefore mostly unreadable). Also, at least in the case of dns-config.pl, they allowed the creation of broken, unloadable configuration files with absolutely no warning or feedback of any kind.

The purpose of this section is to document NetReg's DNS configuration elements, explain how they affect the resulting configuration files, and serve as a guide to administrators wishing to make DNS changes via NetReg. This is knowledge I acquired during a rewrite of dns-config.pl.

4.1. Service Groups Demystified

NetReg service groups are simply sets of other NetReg objects of various types (machine, dns_zone, service, etc.), with various attributes that may be attached either to individual members, or to the set as a whole. Other, separate logic (such as the dns-config.pl config file generator) is tasked with assigning meaning to these object sets and taking action based on their contents.

Let's begin by examining the layout of our typical named.conf file:
	options {
		GLOBAL OPTIONS;
	};

	keys and ACLs (master server only);

	view "some_view" {
		VIEW OPTIONS;
		zone blocks ...
	};
	...
	view "global" {
		zone blocks ...
	};
Note how the "global" view does not have its own set of options. Instead, the 'GLOBAL OPTIONS' from the top of the file apply. Other views may come with their own options which override or add to the inherited global ones. Master servers may contain keys and ACLs which control the DDNS update process.

4.1.1. The DNS View Definition Type
DNS options (whether global or view-specific) are added using this (improperly named, should be something like 'DNS Option Set' instead) service group type. This type of object contains no members, and is used only as a collection of attributes, which may be of the following types:
4.1.2. The DNS Server Group Type
This is the main DNS configuration mechanism offered by NetReg. Its purpose is to tie together DNS servers (machine type members) with zones (dns_zone type), and views (supplied by adding DNS View Definition groups as service type members). For a DNS Server Group to have any effect, it must contain machines, and at least one of either a set of zones, or a set of views. Any group which contains all three -- machines, zones, and views -- can be separated out into two equivalent groups, one assigning the zones to the machines, and another one adding views to the same machines. In fact, the latter method of keeping views and zones in separate DNS Server Group objects is strongly recommended, as a way of preserving the legibility of NetReg configuration.

Machine members of a DNS Server Group may have the following attributes set:

A server's membership in any DNS Server Group will only lead to entries being added to its named.conf if both a valid version and a valid type (other than 'none') are set. A machine with a valid type and no version may end up being referred to in the named.conf of other group members, but will not receive any entries in its own config file based on its membership in this service group.

Service (i.e., view) members of a DNS Server Group may have the following attributes:

Zone members of a DNS Server Group may have the following attributes:

The DNS Server Group object type may also contain the following service-wide attribute:

4.2. Building Configuration Files

The dns-config.pl script will first construct the set of servers which need configuration files, by iterating across the full list of available DNS Server Group objects and retaining all machine members with a valid version and type attribute.

For each host, dns-config.pl will again iterate across all DNS Server Group objects and collect all views applicable to the host.

Finally, for each view on the current host, dns-config.pl will iterate across all DNS Server Group objects once more, collecting all zones applicable to the current view. The whole process is shown in the following snippet of Perl-ish pseudocode:
	foreach $DnsSrvGrp {
	  foreach machine member $HN of $DnsSrvGrp {
	    push(@{$HGroups{$HN}}, $DnsSrvGrp);   # remember which groups each host belongs to
	  }
	}
	foreach $HN (keys %HGroups) {
	  foreach $DnsSrvGrp (@{$HGroups{$HN}}) {
	    verify version consistency for $HN;
	    collect views into %ViewOrder (keyed by $VName);
	  }
	  foreach $VName (sort keys %ViewOrder) {
	    foreach $DnsSrvGrp (@{$HGroups{$HN}}) {
	      collect zones from $DnsSrvGrp and process them for inclusion under $HN->$VName;
	    }
	  }
	}
When adding a zone to a 'master' server's config file, dns-config.pl will check whether the zone is configured for DDNS updates, and, if so, also add the necessary authorization bits to the configuration. This information originates primarily from the DDNS Authorization string attached to each zone, which contains DDNS key information. In addition, host-based authorization for DDNS updates can be configured via service groups of type DDNS_Zone_Auth. This service group type simply bundles zone and machine members, without any attributes. The semantics of a DDNS_Zone_Auth group are that any machine member will be allowed to DDNS-update any zone member, and dns-config.pl ensures this information is added to the relevant named.conf files in the form of ACL statements.
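
To make the above concrete, the authorization bits emitted for a DDNS-enabled zone on the master look roughly like the following sketch (the key name, algorithm, secret, and addresses are placeholders, not actual generated output):
	key "some.zone.cmu.edu.ddns-key" {
		algorithm hmac-md5;
		secret "BASE64.KEY.MATERIAL.HERE";
	};
	acl ddns-some-zone { IP.OF.AUTHORIZED.UPDATER; };
	zone "some.zone.cmu.edu" {
		type master;
		file "some.zone.cmu.edu.zone";
		allow-update { key "some.zone.cmu.edu.ddns-key"; ddns-some-zone; };
	};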

The following is a list of issues to be aware of, in order to prevent dns-config.pl from generating broken named.conf files:

4.3. Current NetReg Service Group Setup

We are currently managing and monitoring the Pittsburgh campus DNS infrastructure, as well as the two off-site servers located on our New York satellite campus. Several other servers under the CMU umbrella are independently managed (Qatar, West, etc.) and, while we slave a few zones from them, we are not concerned with their configuration.

This section explains how the Pittsburgh campus servers are configured, both with global options and in terms of which zones they end up supporting.

4.3.1. Setting Global Options
Several DNS View Definition objects exist in NetReg with options for various (sub)types of DNS servers:

For example, the location of the statistics file must be configured on all campus servers, and therefore must go under view.globalopts.common. Only caching servers allow recursion, so allow-recursion goes under view.globalopts.cache. Only the master needs to send notification messages (to its slaves) when zones get updated, so also-notify goes under view.globalopts.authmaster. All authoritative servers maintain journal files (for either dynamic updates or incremental zone transfers), so the 'max-journal-size 1024k' setting (which limits the unbounded growth of a journal and thus prevents performance issues when many full zone transfers need to occur simultaneously) goes under view.globalopts.auth.
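
Assembled into named.conf form, the options mentioned above would end up looking roughly like this (a sketch only; paths and addresses are placeholders, and in reality each server type receives only the subset of options applicable to it):
	options {
		// view.globalopts.common -- all campus servers
		statistics-file "/var/named/data/named_stats.txt";
		// view.globalopts.cache -- caching servers only
		allow-recursion { CAMPUS.CLIENT.NETS; };
		// view.globalopts.auth -- all authoritative servers
		max-journal-size 1024k;
		// view.globalopts.authmaster -- shadow master only
		also-notify { IP.OF.SLAVE.1; IP.OF.SLAVE.2; };
	};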

New options should be added to exactly one of the above groups. In order to keep NetReg configuration relatively easy to follow, I strongly recommend that DNS Server Group objects used to connect machines with their global options be separate from those used to connect machines with the zones they need to serve. The following DNS Server Group objects are currently in use for setting global options:

The two extra views (quickreg and authbridge) configured on the caching servers are specific to a limited number of sources (unregistered wired IP space and the AuthBridge/RAWR proxies, respectively), and serve a set of special, hand-configured zones to these clients to redirect certain types of traffic to our "portal"-style page, a.k.a. FirstConnect. The hand-configured zone files must currently be independently loaded onto each caching server (see Section 6 for a potential future enhancement to NetReg which would allow these zones to also be pushed to the servers automatically).

4.3.2. Assigning Zones to Servers
There are three types of zones for which we are authoritative:

We map these zones to our servers using four DNS Server Group objects:

When a new zone is created in NetReg, it should be assigned to precisely one of the above service groups. If the zone is under our control, that will most likely be subdomains.dns (unless, of course, we're adding a new vanity top layer domain like awesome-cmu-project.org, in which case the service group should be toplayer.dns).

The New York machines (ny-server-[03,04]) are multi-purpose servers, and the DNS service they offer is also serving multiple purposes. Both machines act as caching servers for campus clients. ny-server-03 is master for the IP subnet we use at our NY satellite campus (so that their on-site dhcp server can perform DDNS updates locally, and have the result available in the event of a WAN outage). ny-server-03 also acts as an off-site backup authority for our top layer zones (and is therefore listed as an additional slave in the toplayer*.dns groups). Separate options are configured for these servers via the ny-*.globalopts.dns groups.

4.4. Frequently Encountered NetReg DNS Operations

Several tasks are encountered on a regular basis when administering DNS through NetReg. The list includes zone creation, adding CNAMEs to external entities, etc. More tasks will be described here as they are identified along the way.
4.4.1. Creating New Domains (Zones)
DNS (sub)domains are improperly referred to as "zones" in NetReg parlance. When a new one must be created, we start at NetReg's "Add a DNS Zone" form. First, we fill in the zone name. Forward zones will be named something like foo.cmu.edu, or myvanitydomain.org. Reverse zones look like 123.237.128.in-addr.arpa. The Type pull-down allows three different subtypes for both forward (fw-) and reverse (rv-) zone types:

The remaining fields (under SOA Parameters) are only applicable to *-toplevel (a.k.a. standalone) domains (or zones):

Once created, standalone and delegated domains also need NS records, which must be added under the DNS Resources section. Domains under our "jurisdiction" typically use nsauth1.net.cmu.edu and nsauth2.net.cmu.edu as their NS records.
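
For a hypothetical new standalone domain foo.cmu.edu (the example name used above), the resulting records would simply be:
	foo.cmu.edu.	IN	NS	nsauth1.net.cmu.edu.
	foo.cmu.edu.	IN	NS	nsauth2.net.cmu.edu.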

Standalone domains should also be added to the appropriate DNS Server Group, in order to have their zone file pushed to the appropriate master server, and replicated on the appropriate slave servers. Domains under our control should be members of either toplayer.dns or (much more likely) subdomains.dns.

Finally, standalone and parent-hosted (i.e., toplevel and permissible) domains may be associated with one or more subnets, for the purpose of allowing machines to be registered with IPs within those subnets and names within those domains.
4.4.2. CNAMEs to External Entities
We are frequently asked to publish CNAME records pointing at an external entity, such as:
	guides.library.cmu.edu	IN	CNAME	cmu.libguides.com
which would allow clients attempting to connect to an advertised destination of guides.library.cmu.edu to be redirected to an externally hosted service at cmu.libguides.com.

NetReg will only generate CNAME records when a CNAME resource is added to a registered host. For example, adding a somecname.domain.cmu.edu CNAME resource to somehost.net.cmu.edu will result in the following record being added under the domain.cmu.edu zone:
	somecname	IN	CNAME	somehost.net.cmu.edu
NetReg will throw an error if the CNAME's domain (domain.cmu.edu) does not exist as a NetReg-managed "zone", since there would not be a place to publish the CNAME record into DNS. Using the guides.library.cmu.edu request as an example, we'll frequently see users mistakenly register something like guides.library.cmu.edu in NetReg and then attempt to add a CNAME resource named cmu.libguides.com to it. Not only will this fail due to libguides.com not being a NetReg-managed zone, but even if it could succeed, it would publish a CNAME in the wrong, opposite direction of what's actually desired, in the wrong zone (libguides.com instead of library.cmu.edu):
	cmu.libguides.com	IN	CNAME	guides.library.cmu.edu
To correctly complete this task, we need to perform the following steps:
  1. Create libguides.com as a fw-permissible zone. We don't really need NetReg to ever generate and publish a libguides.com zone file (and, for the record, com is also registered as fw-permissible), but need to be capable of registering fake hosts named *.libguides.com, which requires the domain to exist in NetReg.
  2. Associate the newly created libguides.com zone with the Reserved Devices subnet.
  3. Register a machine named cmu.libguides.com on the Reserved Devices subnet, using Mode = reserved.
  4. On the newly registered machine, under DNS Resources, add a CNAME named guides.library.cmu.edu. This should work just fine, since library.cmu.edu is a valid zone controlled by NetReg.
The next time NetReg generates DNS data, a guides CNAME record will be added under the library.cmu.edu zone, pointing toward cmu.libguides.com, which is the desired result.

5. Supporting DNSSec

DNSSec is only supported on DDNS-enabled zones. In order to keep a loose coupling between NetReg and a DNSSec-enabled master server (read: in order to avoid replicating half the DNSSec code in NetReg), most DNSSec-specific operations (e.g. zone signature and key maintenance) happen on the master server directly, without NetReg's knowledge or involvement. Since zone SOA serials must be bumped during DNSSec maintenance, and NetReg already has a model of operation under which it does not keep track of these serials (DDNS), it makes sense to make DDNS a precondition for a zone to be eligible for DNSSec. Additionally, starting with bind version 9.7, DDNS-enabled zones benefit from automated DNSSec maintenance via the 'auto-dnssec maintain' zone-specific option in named.conf. Slave servers also don't need to know anything about DNSSec management (besides being enabled to respond to DNSSec queries by returning RRSIG records, if available), since they'll simply be slaving whatever the master server will transfer to them.

5.1. Enabling DNSSec on a Dynamic Zone

NetReg's global master-specific option list has been updated to include key-directory "/var/named/CMU/keys", which tells the master server where to find zone-specific DNSSec keys if/when it needs to access them. If, in addition to the expected content of a zone's 'DDNS Authorization' (see 4.4.1), we also include the string "dnssec:ena", the dns-config.pl script will add 'auto-dnssec maintain' to the zone's list of options, in effect telling the master server to automatically manage DNSSec signatures on that zone.
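
The net effect on the master's named.conf is a zone block along these lines (a sketch; the zone name, file name, and key reference are placeholders, and key-directory is set once in the global options rather than per zone):
	zone "example.cmu.edu" {
		type master;
		file "example.cmu.edu.zone";
		allow-update { key "example.cmu.edu.ddns-key"; };
		auto-dnssec maintain;
	};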

5.2. DNSSec Key Management

The operations described here are only intended to be performed on the master DNS server. Slaves will automatically receive transferred copies of the zone(s) managed by the master, and need not get involved in DNSSec key management.

5.2.1 Initial Zone Signing
Before a zone is first signed, we need to generate a Key Signing Key (KSK) and a Zone Signing Key (ZSK). Current DNSSec best practices recommend that a larger KSK be used as a "secure entry point" into the zone. This key will sign the smaller ZSK, which in turn will sign all other records within the zone (thus reducing the size of DNSSec-enabled replies). ZSKs can be rolled over more frequently, since, unlike KSKs, they don't require updates to the zone's parent. We have standardized on 2048-bit KSKs and 1024-bit ZSKs. To generate these keys, issue the following commands on the master server:
    dnssec-keygen -r /dev/urandom -K /var/named/CMU/keys -n ZONE -3 \
                  -b 2048 -f KSK $zone_name

    dnssec-keygen -r /dev/urandom -K /var/named/CMU/keys -n ZONE -3 \
                  -b 1024 $zone_name
where $zone_name is the name of the zone in question (e.g. cmu.edu). On RHEL, make sure that all such keys are owned by named:named, so the bind daemon can access them.

A DS record corresponding to the KSK must be uploaded to the zone's parent (edu in our case). To generate this record, use the following command:
    dnssec-dsfromkey -f $zone_file $zone_name > dsset-$zone_name.
where $zone_file is the file name under which the zone is stored on the master (e.g. CMU.EDU.zone). Upload the DS record to the parent via whatever interface is available (e.g.,
http://net.educause.edu/edudomain/ for .edu domains).

To force bind to (re-)sign the zone immediately, use:
    rndc sign $zone_name
5.2.2 Key Rollover
To roll over a key, first generate a replacement KSK and/or ZSK (using dnssec-keygen as illustrated above in 5.2.1). Next, calculate the largest advertised TTL within the zone (MaxTTL). A quick and easy way to accomplish this would be:
    named-checkzone -D -o - $zone_name $zone_file | awk '{print $2}' | sort -n | tail -1
If this value is very large, the expiration time on the zone's DNSKEY RRSIG may arrive sooner than MaxTTL, in which case we may use that shorter interval as the wait time instead.

We now need to re-sign the zone with
    rndc sign $zone_name
which will include both the old and new versions of the key being rolled over.

If this is the KSK being rolled over, we must now also generate a DS record for the new version of the key:
    dnssec-dsfromkey -f $zone_file $zone_name > dsset-$zone_name.
upload it to the parent, and wait for its publication (use whois to determine when publication has occurred).

At this point, we mark the old key for expiration/deletion in $MaxTTL as calculated above:
    dnssec-settime -K /var/named/CMU/keys \
                   -R +$MaxTTL -I +$MaxTTL -D +$MaxTTL $old_key_file
Look for $old_key_file under /var/named/CMU/keys. It should look something like K$zone_name.+007+$old_key_serial_no.key. The old key should automatically stop being used shortly after now + $MaxTTL, but 'rndc sign' may be used anytime after that deadline to force bind to re-sign the zone without the old key.

6. Next Steps

This section serves as a running wish- (and ToDo) list for remaining DNS-related work.