Harnessing Formal Methods Beyond Routing Layer

Network verification is a mature field with an industrial presence developed to ensure networks and services are safe from configuration-related outages. However, researchers in this field primarily focused on verifying configurations at the routing and forwarding layer, which mainly involved BGP and OSPF. As it takes a village to raise a child, many more protocols and applications across the networking stack work together to deliver data and services through the Internet to an end user. Disruption in any layer can cause delays and precipitate cloud failures, as was evident from the Microsoft DNS global outage in 2019. Besides the routing layer, other layers, like DNS, received little attention from the verification community.

In this article, I will describe our work on using network verification ideas and formal methods beyond the routing layer to have a robust DNS, one of the foundations of the modern-day Internet. I will describe the lessons I learned in this process, and I hope others will join in harnessing formal methods to have guarantees across the network stack.

Formal Methods for a Robust DNS

The Domain Name System (DNS) primarily translates domain names into IP addresses. DNS name resolution seems simple at a high level but has evolved into a complex and intricate protocol over time. The DNS is on the critical path of every application we use today but is a fragile protocol, as it is easy to make “two classes of correctness errors” that cause it to fail often (recent outages include - Salesforce, Slack, Azure). The Haiku shown below is widely popular among network engineers and sysadmins. It shows how common DNS-related outages are and how unforgiving they can be.

DNS Haiku

The first class of errors is “errors in DNS configurations.” Reasoning about how a single query will get resolved in the DNS (which is inherently nondeterministic due to multiple nameservers serving the same zone) is a daunting task; worse, operators must ensure that all possible queries behave as intended. The second class of errors is “DNS protocol implementation errors”. There are multiple DNS implementations of the protocol in the wild, both via open-source (Bind, PowerDNS, Knot, NSD, CoreDNS)and in public or private clouds (Azure, Amazon Route 53, Akamai, Cloudflare). It is difficult to write an efficient implementation that is also bug-free and compliant with DNS RFC specifications, which are currently spread across more than 30 RFCs.

Our idea was to combine formal methods, which have proved successful in other areas, with DNS-specific insights to develop techniques that can provide provable correctness guarantees that help achieve a robust DNS. While there is a large body of work on verifying the configurations at the network routing layer, there was no such work for DNS.

While the semantics of routing and forwarding are well understood (e.g., longest prefix matching), DNS’s semantics are relatively poorly understood by comparison. More generally, while there are superficial similarities between routing and DNS, the specific details are vastly different. For example, DNS is nondeterministic, supports query rewriting and delegation, and is managed in a distributed fashion. We, therefore, needed new techniques that address these challenges. These new techniques can also be generalized in the future to apply to other network ecosystem protocols with similar challenges. Moreover, much of the work in the network verification area has focused on router configurations, leaving to the individual router vendors the task of checking whether network devices implement their protocols correctly.

Formal Model of DNS

Although existing RFC standards specify the behavior of the DNS, these standards are informal and described in English, and to develop techniques using formal methods, we need to reason precisely about the behavior of the DNS. We, therefore, had to develop a mathematical formalization of the DNS that allows for the development of automated techniques with strong guarantees. The formal model of DNS that we developed from multiple RFCs is the first one for DNS and served as the foundation for our works.

DNS Configuration Verifier - GRᴏᴏᴛ

We built a verification tool, GRᴏᴏᴛ, that performs exhaustive and proactive static analysis of DNS configuration files (zone files) to guarantee key correctness propertiess; by contrast, existing solutions are reactive and incomplete. GRᴏᴏᴛ avoids verifying the huge space of DNS queries by first partitioning all possible queries into “equivalence classes (ECs)”, each of which captures a distinct behavior. This is in line with the recurring theme of equivalence classes used throughout network verification. GRᴏᴏᴛ then symbolically executes the set of queries in each equivalence class to efficiently find (or prove the absence of) any bugs such as rewrite loops. Our formal model of the DNS resolution was crucial for the efficient symbolic execution of ECs. Using GRᴏᴏᴛ, DNS engineers can proactively identify many DNS zone file errors, apart from common ones like delegation inconsistencies, for example:

Rewrite loop: “Is there a query that is rewritten in a loop $q_1 \rightarrow q_2 \rightarrow q_3 \rightarrow \ldots \rightarrow q_1$ under our domain”?
External nameserver: “Is there a query under our domain that is sent for resolution to a name server, not under our domain (like not under $\texttt{windows.com}$)?”
Number of rewrites: “Is there a query under our domain that is rewritten more than twice (or $n$ times)?”
Rewrite blackholing: “Does a query exist under our domain that is eventually rewritten to some other domain name that does not exist, and DNS returns $\texttt{NXDOMAIN}$?”

GRᴏᴏᴛ successfully revealed bugs in zone files from a campus network, Microsoft, and a large infrastructure service provider. When applied to internal zone files consisting of over 3.5 million records from a large infrastructure service provider, GRᴏᴏᴛ revealed around 160k issues of blackholing, which initiated a cleanup of the zone files.

RFC Compliance Testing of DNS Implementations - 𝗦𝗖𝗔𝗟𝗘 and Fᴇʀʀᴇᴛ

Using the formal model, we developed the first automatic test generator for finding RFC compliance errors in DNS nameserver implementations to handle protocol implementation errors. The major challenge in generating tests was that the DNS test consists of both a query and a zone file. To reach the core query resolution logic, they must be jointly generated with the query related to the zone file. We solved this challenge using the “𝖲𝖢𝖠𝖫𝖤 (Small-scope Constraint-driven Automated Logical Execution)” approach. The key insight behind 𝖲𝖢𝖠𝖫𝖤 is that we create an executable model of the DNS formal resolution semantics and then symbolically execute it for all paths through the model up to a bound, which still generates tests that cover many distinct nameserver behaviors.

We employed the 𝖲𝖢𝖠𝖫𝖤 approach to build Fᴇʀʀᴇᴛ, which generates tests using a logical model of DNS resolution implemented in a modeling language called Zen that has built-in support for symbolic execution. Zen is a general purpose intermediate representation with the advantage that one can easily extend the code to new DNS RFCs that would be added in the future or to any organization-specific deviations. This aligns with the proposal to move away from the monolithic approach and decouple modeling from the analysis.

Using Ferret, we identified 30 unique bugs across all DNS implementations; one of the bugs is a critical vulnerability in Bind (high-severity rated CVE-2021-25215) that attackers could easily exploit to crash Bind DNS resolvers and nameservers remotely. Through an internship, I integrated Ferret tests into “Amazon Route 53 DNS” implementation’s continuous development pipeline.

Key Lessons Learned

The following are the key lessons I learned after working on DNS for my Ph.D.:

Domain-Specific insights: It is generally tempting to use standard techniques like model checkers for DNS or symbolic/concolic testing for DNS implementations, but networks are so much simpler, and their inherent structure is lost if we use standard techniques in a black-box way. The domain-specific methods work much better (like DNS tree-based equivalence classes and 𝖲𝖢𝖠𝖫𝖤 for implementations), but we need to find the balance between using them and standard techniques (graph-based algorithms in GRᴏᴏᴛ and general purpose intermediate language in Fᴇʀʀᴇᴛ) to ease the barrier of entry.
Theory can help in its place: Complexity theory helped me understand the power of DNS as a term writing system, which helped in providing a lower time bound for DNS configuration verification, and formal theory helped me define a model for DNS. Heuristics are important, but the theory helps go beyond that to have strong guarantees.
Working with implementors and operators in the field gives extra insight: Working with DNS teams at Microsoft, Akamai, and Amazon and with DNS implementors (DNS-OARC) helped me understand the real problems, and to update the tools from what I did in research with certain features to have a practical impact. (Also, this Netverify post by Aditya and Aaron describes the importance of talking to a network operator in detail.)

Going beyond DNS and BGP

Global end-to-end properties, for example, checking whether a user can access a service, can be checked by modularly checking a sub-property at each layer in the network stack and then stitching them together for better scalability, among many reasons. Configurations are now extensively used in the application layer beyond DNS and BGP, for example, in container orchestration frameworks like Kubernetes. We need to develop techniques for them and eventually stitch all the configuration checking efforts from each layer to have end-to-end configuration guarantees. On the protocol side, there have been recent efforts to use formal methods for congestion control (here and here) and QUIC, and we need more efforts to check various protocol implementations, including BGP, used in the network stack today. We believe our 𝖲𝖢𝖠𝖫𝖤 approach to RFC compliance testing and ‘ferreting’ out bugs could be useful more broadly beyond the DNS and is an ongoing work.