relational verification,

Network change verification (even for networks without specifications)

Zachary Kincaid Zachary Kincaid View Oct 04, 2024 · 5 mins read
Arvind Krishnamurthy Arvind Krishnamurthy View Oct 04, 2024 · 5 mins read
Ratul Mahajan Ratul Mahajan View Oct 04, 2024 · 5 mins read
David Walker David Walker View Oct 04, 2024 · 5 mins read
Xieyang Xu Xieyang Xu View Oct 04, 2024 · 5 mins read
Yifei Yuan Yifei Yuan View Oct 04, 2024 · 5 mins read
Ennan Zhai Ennan Zhai View Oct 04, 2024 · 5 mins read
Network change verification (even for networks without specifications)

Networks routinely need changes to patch security holes, expand capacity, perform routine maintenance, and a myriad of other reasons. But every change is dangerous, and brings with it the risk of an outage. In theory, network verification can prevent a range of outages, but in practice, it often falls short, and outages have occurred in networks that are using verification.

The main reason is that the network verification effort is only as good as the specification being verified, and most networks do not have good specifications. Indeed, because most networks have been built up, extended, and fixed over time, by multiple teams of engineers, no one really knows how or why they work in detail. To the extent the how and why is known, it is typically not written down anywhere precisely, for easy use as a specification.

Consider a common change: Move all traffic on link a1 to a parallel link a2 as a precursor to shutting a1 for maintenance. Here’s a first picture of the setting. Traffic on the red link before the change should flow along the black link after the change.

Changing network paths from a1 to a2

To check that our change is correct, one should verify that a1 no longer carries any traffic and that all traffic it was carrying moves to a2. Doesn’t sound too hard! But that’s just the beginning—the easy part really.

We also do not want to change how traffic flows upstream of a1 and a2 and we do not want to change how traffic flows downstream of a1 and a2. The following picture gives us a better perspective on what we need to ensure—traffic on the orange subpaths must remain unchanged.

Changing network paths from a1 to a2

However, that is not all. We should also ensure that traffic flowing across other links in the network (such as the traffic traversing green links below) is unaffected.

Changing network paths from a1 to a2

In other words, to craft a specification that captures the full intent of the change, it seems as though we will need to specify how all traffic moves through the network, exactly. In large networks, which may have millions of traffic classes, such specifications are nearly impossible to come by. Even if we had such a spec, it would be complex, difficult to understand, and a significant effort to maintain. Partial (coarse) specifications do sometimes exist, but they only provide partial protection.

Can we verify network changes without such complex behavior specifications? This may seem impossible at first sight, but our recent work points to a path forward. The key idea is to use relational verification—instead of reasoning about a single network snapshot in isolation (say, the post-change network snapshot, as is typically done today), reason jointly about two snapshots, the pre- and post-change snapshots.

A relational specification describes similarities and differences between snapshots. For example, “Path1 is the same as Path2” is a relational specification as is “Path1 changes to Path2.” If the impact of changes is small, and most of the network is supposed to stay the same, then relational specifications can be concise, even if the network itself is enormous. Fortunately, our experience is that real-world changes are usually quite small and relational specs are quite manageable.

To experiment with these ideas, we have developed a language called Rela for writing relational specifications, and an automated verification procedure to check that networks adhere to them. Consider again our example from above: We want traffic using link a1 to now use link a2. The core of such a change is a one-liner in Rela:

spec change_link = { a1 : replace(a1, a2); }

The first occurrence of a1 is called the “zone” (or area of interest) of a change. Zones are sets of paths through the network. They are specified using regular expressions. The command after the colon is a traffic modifier. In this case, the modifier states that traffic running through a1 should instead run through a2.

If we would like to say that some part of the network does not change, that is easy too. Here is a specification that says all paths (regular expression .* ) stay the same (the preserve modifier). Sometimes engineers will re-factor a configuration; a specification like this checks that they did not accidentally change network forwarding paths.

spec nochange = { .* : preserve }

To extend the specification from our example to express that upstream and downstream subpaths do not change, we extend the zone of the specification and require that those parts do not change (the semicolon concatenates specifications along a set of paths):

spec change = {
    nochange;
    change_link;
    nochange;
}

Finally, to specify that traffic that has nothing to do with link a1 also remains unchanged, we use the prioritized choice operator » denoting that the second term applies only to paths that are not covered by the first term. The full specification of our change is:

spec fullchange = change >> nochange

These two terms, none of which require the engineers to know much about the current network behavior, represent the complete specification of the change. Based on this specification, any unintended behavioral change (e.g., due to a buggy configuration change) will be flagged by Rela.

See the paper for other types of specifications including more path modifiers and mechanisms for specifying different changes for different kinds of packets separately. The paper also explains how we verify such specifications using finite state automata and transducers. Our experiments show that Rela is expressive enough to describe over 9 in 10 complex changes to a large cloud provider’s backbone network in fewer than 10 lines and verification takes under 20 minutes even though the network has over a thousand of devices.

We believe that relational reasoning about networks has immense potential, and change verification is just one application of many. Relational reasoning could also verify differences between parts of the network that are expected to be similar but not identical (as is common for large networks), and it could also help synthesize configuration changes from specifications. We are excited about exploring such applications in the future.


Images courtersy of Network Map Vectors by Vecteezy

Zachary Kincaid
Author Zachary Kincaid
Zachary Kincaid is an Associate Professor at Princeton University. His main research interests are in program analysis, logic, and programming languages. He has active research projects that aim to make program analysis compositional and robust.
Arvind Krishnamurthy
Author Arvind Krishnamurthy Follow
Arvind Krishnamurthy is the Short-Dooley Professor in the Paul G. Allen School of Computer Science & Engineering. His research interests are in building effective and robust computer systems in the context of both data centers and Internet-scale systems. More recently, his research has focussed on programmable networks and systems for machine learning. He is an ACM fellow, a past program chair of ACM SIGCOMM and Usenix NSDI, is the Vice President of Usenix, and serves on the ICSI and CRA boards.
Ratul Mahajan
Author Ratul Mahajan Follow
Ratul Mahajan is an Associate Professor at the University of Washington (Paul G. Allen School of Computer Science) and a Co-founder and CEO of Intentionet, a company that is enabling safe, rapid evolution of computer networks using formal analysis and high-level design approaches. Prior to that, he was a Principal Researcher at Microsoft Research. He got his PhD at the University of Washington and B.Tech at Indian Institute of Technology, Delhi, both in Computer Science and Engineering.
David Walker
Author David Walker
David Walker is a Professor of Computer Science at Princeton University. He received his doctoral and master’s degrees in computer science from Cornell, and his bachelor’s from Queen’s University in Kingston, Ontario. During sabbaticals from Princeton, he has served as a visiting researcher at Microsoft Research in Redmond (2008) and in Cambridge (2009), and as Associate Visiting Faculty at the University of Pennsylvania (2015-2016).
Xieyang Xu
Author Xieyang Xu
Xieyang Xu is a Research Scientist at Meta. His research interest lies in computer networks and systems.
Yifei Yuan
Author Yifei Yuan
Yifei Yuan is a senior engineer in the Network Research Team at Alibaba Cloud. His research interests lie in designing formal techniques and building large-scale verification systems for validating production-scale networks.
Ennan Zhai
Author Ennan Zhai
Ennan Zhai is a Director of Network Research at Alibaba Cloud now. My research focuses on building high-performance and reliable networking systems, by utilizing techniques in areas including programming languages, verification, and programmable hardware.