SDN: software defined networking,
. . . or small distributed namespaces?
Software Defined Networking (SDN) sometimes seems a little bit like Las Vegas. The shiny virtual facade belies the shaky foundations it's built on, and there seems to be a song and dance (and a certain amount of gambling) involved to come out of it richer on the other end. (Opinions differ on what SDN stands for. Carrying the torch from ISDN: "Still Does Nothing", or "Stuck (in a) Dystopian Nightmare", or more to the point "Stability Not Delivered", etc.)
For the next few months, I am spending a sojourn at the network startup Socketplane, looking at how to apply the principles of Promise Theory to networking for a DevOps-aware world. Over the past two years, I have been lucky to be around several of the smart thinkers in this space. This essay sketches out ideas about networking that I think are increasingly obvious to users in the datacentre, but which also need to be applied to a larger Internet of Things. The key is all about tenancy. Within a truly shared infrastructure, the straw that breaks the camel's back will be the container.
Network as a user experience
SDN is a business-driven, time-to-market initiative for the E-commerce age, which addresses the need to make networking changes quickly as new products and services are introduced for a market on a global scale---a manifesto not unlike the umbrella movement of DevOps; indeed, some would place it there, along with Software Defined Everything.
As a discipline, networking has lived a strangely separate existence from computing over its history, and has adopted very different methodologies. It has embraced perhaps more of the methods of distributed systems than its computational counterpart (as it was forced to confront issues of scalability earlier), but it has fallen behind in the era of free and open source software innovation, and it is overdue for a makeover. Software defined networking could be that makeover. Maybe.
When I first heard about the idea of OpenFlow and centralized SDN controllers, my initial reaction was one of dismay: it sounded wrong, like something defined more by software frustration than network experience. The IT industry regularly falls back on centralized, `brute-force' solutions each time a new problem is perceived, even when this flies in the face of past learning. But what about scalability?
The answer proposed for solving the network crisis was a programming framework with low level APIs, and a GUI built into OpenStack. It looked like a recipe for instability and insecurity. Fortunately, the industry did not take this all too literally (a variety of `takes' on SDN have come to the fore, including OpenVSwitch, Cisco's ACI with a complete model rethink, Arista's APIs, Cumulus's open platform bringing essential open source freedoms, etc). Centralization can mean many different things, and there is room to discuss control models. SDN is still an open field of possibility.
At the crossroads
Does it matter how we solve technical problems? I think it does. Networking has reached its critical juncture, its existential crisis, but not quite its moment of enlightenment. From what I can tell, the dream of SDN is racing through an experimental minefield of either pragmatic or cosmetic workarounds, staunchly keeping its poker face. And somehow it looks about as convincing as the Vegas Eiffel tower: impressive, but not quite the real thing. All of this is understandable enough, given the industry pressure to deliver solutions, but my question is: do we have a coherent long-term plan for overcoming the obvious kludges and defects?
The initial hubris (that real software engineers would make it all right) has faded. It is not enough to bring a software culture to networking. Software engineering has plenty of its own flaws. But I think there is room for a cultural middle-ground, if software and networking would embrace one another's expertise. Thank goodness for the mantras of DevOps.
My interest in the teams at companies like Socketplane and Cumulus Networks has been their willingness to go beyond the superficial questions of making something (anything) work, and to recreate the diversity and cultural freedoms that opened up innovation in computing. I am humbled by their openness to working with me on this crucial user experience.
Engineering the right illusion
In the age of infrastructure as a service, one of the most pressing needs is for network virtualization, i.e. the leasing out of shared resources amongst a number of unrelated tenants. SDN could potentially mean several things, but let me focus on this one. Networking needs to realize the increasingly expansive and changeable communications requirements of multiple independent tenants, whether they be steeped in physical, virtual, open or restricted machinery. Virtualizing is a key approach to resource management. It makes infrastructure into a service, without necessarily exposing all the wires and pulleys.
In the past, this multi-tenancy for the network could be achieved by manually cobbling devices and protocols together. The whole thing would be wired somewhat like a Christmas tree, using unrelated technologies like DiffServ, virtual circuits (MPLS, FR), VLAN, federated border control with BGP, etc. All these pieces were strewn across a plethora of devices, without a unified or coordinated system of management.
Indeed, they still are. In the current model of SDN, not much has really been tidied up or integrated. We still rely on all those things to some extent. After all, doing away with, say, VLAN would be like doing away with global brands like ketchup and Coke. Rather, network virtualization has been approached by emulating the old software stack, designed principally for the LAN, based on tunnels that encapsulate the existing view of the world. The goal is to present and automate a familiar experience, rather than to simplify or create a new experience (it's Vegas, baby).
SDN models virtual wires, virtual subnets, and hides the WAN, without actually asking whether any of that really makes any sense in a virtualized framework. The resulting use of overlays to conjure this illusion adds significant complexity and fragility at the outset.
The scope of networking itself has expanded and diversified, since the design of IPv4/Ethernet, but we don't ask those questions. Today we have datacentres, cloud, mobile devices, and a burgeoning Internet of Things. More importantly, the pattern of workloads has changed considerably. We have dense and sparse networks, North-South and East-West anisotropy in fabrics, and both crystalline and gaseous ensembles of devices.
Backwards compatibility aside for a moment, is an emulation of the past what we want from SDN? Will this serve us best going forwards? To approach this question, I think it's helpful to provide some context about where we are today, and how we got here.
The IPv4 generation: Two architectures welded together
The Internet (aka L3) was designed as two different classes of network: a collection of `local' networks (LAN), and an interconnection network between them (WAN). These concepts are not really well defined, but let me define LAN as being subnets, and WAN as everything else. The network was originally conceived as a hierarchy of sparse distributed endpoints.
Local Area networks (LAN) were designed with a `bus' architecture in mind. Tenants shout out messages to their trusted neighbours on the local bus and wait for a reply, hoping that not everyone chooses to talk at the same time. Everyone gets a number when they sign up, and (since they don't have fixed locations) to find someone, you have to shout and wait for a response (like renting a go-cart `come in number 23, your time is up!').
Wide Area networks were the transport architecture, for putting messages on trains to somewhere farther afield. It's not so much about distance, as whether or not there is a direct line. They had hierarchically routable addresses, and there was an adaptable infrastructure designed to do the route planning and forwarding.
These two piecemeal architectures were `unified' by combining the routable part of the address with a local identifier, into a single space-saving 32-bit word (instead of say keeping them as separate numbers in a tuple). The IP prefix was for WAN (ignored by LAN), and the postfix was for LAN (ignored by WAN).
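The arithmetic of that packing is simple masking. Here is a minimal sketch in Python, using the standard `ipaddress` module; the particular address and /24 mask are invented for illustration:

```python
import ipaddress

# One 32-bit IPv4 word packs two numbers: a routable WAN prefix
# and a local LAN postfix, separated by the netmask.
addr = ipaddress.IPv4Address("192.168.10.23")
net = ipaddress.IPv4Network("192.168.10.0/24")

as_int = int(addr)
mask = int(net.netmask)                     # 0xFFFFFF00 for a /24

wan_prefix = as_int & mask                  # the part WAN routing reads
lan_postfix = as_int & ~mask & 0xFFFFFFFF   # the part the LAN reads

print(ipaddress.IPv4Address(wan_prefix))    # 192.168.10.0
print(lan_postfix)                          # 23
```

The WAN ignores the postfix and the LAN ignores the prefix, yet both ride in the same space-saving word.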
Now the problem was that L3's address mechanism for LAN duplicated a system of addresses already existing on the local bus: L2 had its own (MAC) addresses that only local users could call. Just like the telephone system: you either call someone on a local extension with an easy to remember number, or you have to dial a complicated prefix to get out and connect with someone else's phone.
So now, which number should we use to talk to someone on the network? It made sense to use the IP quasi-doublet, so that everyone could use a single system (and the numbers are a lot easier to remember). Also, of course, unlike the unpredictable MAC addresses, the soft L3 addresses were designed to be assigned as a group that could be aggregated and summarized, by the WAN prefix, for routing. But the problem was that L2 wasn't listening to L3 addresses, and couldn't because of the encapsulation design of the wire protocols. Hack number 1 was to introduce a mapping service from L3 addresses to L2 addresses, called ARP.
(Encapsulations often rewrite MAC addresses on top of this already fudged use of associative indirection.)
Like phones again, for local traffic, no one needed to use the whole phone number (IP address), but it turned out that it was easier to deal with just one set of addresses. So whenever local users wanted to talk amongst themselves, they would call on ARP to make the translation. ARP involved shouting out on the local bus to ask who was who (`come in number 23, your time is up!'). So it doubled as a neighbour discovery mechanism too.
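ARP's dual role, translation plus discovery-by-shouting, can be caricatured in a few lines of Python. This is a toy sketch: the hosts, the addresses, and the dictionary standing in for the shared bus are all invented for illustration:

```python
# A toy sketch of ARP's role: resolving an L3 (IP) address to an L2 (MAC)
# address by broadcasting on the local bus.

local_bus = {                      # every host hears every broadcast
    "aa:bb:cc:00:00:01": "10.0.0.1",
    "aa:bb:cc:00:00:02": "10.0.0.2",
}

arp_cache = {}                     # learned (IP -> MAC) mappings

def arp_resolve(ip):
    """Shout on the bus: 'who has this IP?' Everyone hears; one replies."""
    if ip in arp_cache:
        return arp_cache[ip]       # no need to shout again
    for mac, host_ip in local_bus.items():   # the broadcast
        if host_ip == ip:
            arp_cache[ip] = mac    # cache the reply
            return mac
    return None                    # nobody answered

print(arp_resolve("10.0.0.2"))     # aa:bb:cc:00:00:02
```

Note that every unanswered lookup still interrupts every host on the bus; only the cache hides the cost.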
The LAN bus architecture was a child of its time. The original network cabling actually was a common bus, so it made perfect sense. This is why we built all kinds of tools and fixes that ought to seem like nonsense today (Ethernet spanning trees, bridges, etc). But the thin Ethernet buses broke frequently. It became clear, quite quickly, that a bus architecture had too many problems, and (cabling complexity aside) a hub and spoke model of local networking was preferred. At that moment, much of the technology of Ethernet was redundant, but we kept it anyway.
There were technologies for tenants to share a common bus, and technologies for forwarding data. Those are the two things you need to get data from source to destination. However, instead of using and optimizing existing IP forwarding methods, manufacturers built a second forwarding scheme based on L2 labels instead. Lo, descendeth VLAN to inflict its torrid complexity on networking (if not infinite complexity, then 4096 complexity). And, what's more, its magic only worked inside a single box.
This was the point at which any thought of forwarding L2 frames should have been ejected. Why would we need two forwarding mechanisms? We ought to have moved to a single scalable forwarding model for LAN (i.e. subnets) and WAN at L3. In fact, L2's purpose is to do point-to-point flow control on physical connections. If there are no wires (like inside a hypervisor), why would you ever do anything so useless as L2 encapsulation, only to strip it off again immediately? The correct thing to do would be to design a purely L3 interface. However, an entire industry embraced a LAN-centric view, and L3 switch hybrids. Today we even talk about forwarding by layer 2.5 of the OSI model (somewhere in these layers lies networking's KT extinction event). Although some will argue that L2 is faster, more efficient, etc, that is only true because this is where all the investment in optimization has been spent. L2's purpose lies in signal flow control, not in forwarding, or even flooding. Virtual circuits, like MPLS, have worked well as policy routing through tenant infrastructure.
Note that it is the bridge forwarding workaround mechanism that SDN builds on and encourages (not the intentionally designed IP forwarding with all of its ready-to-use scalability).
Switches solved the broadcast/collision problem, but they could not fix the real design flaw in the IPv4 LAN/WAN design: namely the reliance on flooding, or all the public shouting that came as the legacy of having a bus architecture welded onto a forwarding architecture, via a pseudo neighbour-discovery protocol (ARP). This broadcasting makes multi-tenancy awkward, because broadcasts want to cross boundaries; also, it means wasting bandwidth for shared services like ARP, DHCP, BOOTP, etc.
Never mind what you can do, what do you need?
Instead of solving this broadcasting problem, a new workaround was introduced in the form of VLAN channels. VLANs partition off different tenants in the network from one another, so that tenants don't have to put up with the noise from their neighbours, and the network can tell which tenant traffic belongs to. VLANs had one nice side-effect: they potentially allowed network spaces to co-exist without too much conflict. So multiple tenants could ignore one another even on the same subnet/LAN. This was a crude kind of network virtualization, but all the burden fell upon the network hardware administrator.
VLANs are a crude implementation of namespaces, at the L2 level. Nice idea, inadequate implementation.
Perhaps an even more serious issue with overloading L2 networking is that it ignores the intentional aspect of networking (the workload semantics): what is networking for? What promises is it supposed to keep? Designing a service without thinking about the consumer is never going to lead to a sense that promises are kept. So we need to look at which agencies in a system need services from the network, and model how these agencies cooperate through their life-cycles.
Ironically, the thing that might finally sort this out, is pressure brought to bear by cloud sharing, and the renewed interest in the end-points of the network: software containers.
A LAN is not a logical group, it is a physical aggregation
LAN (subnetting) is widely abused as a form of logical host-grouping (summarization), probably because the tools to group arbitrary end-points into semantic classes are poorly developed in networking. This helps to preserve the dominance of LAN thinking.
Subnets allow network prefixes to be stored in routing tables more cheaply than individual addresses. This is a scaling mechanism, which is broken by tunnels.
Addresses are for physical location, tags and labels are for logical grouping.
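The scaling value of prefixes is easy to demonstrate: one aggregated route entry stands in for thousands of hosts, and a longest-prefix match picks the most specific cover. A sketch in Python, with invented routes and port names:

```python
import ipaddress

# Prefix aggregation: one routing-table entry covers many hosts.
routes = {
    ipaddress.ip_network("10.1.0.0/16"): "port-A",   # ~65k hosts, one entry
    ipaddress.ip_network("10.1.2.0/24"): "port-B",   # a more specific carve-out
}

def next_hop(dst):
    """Longest-prefix match: prefer the most specific covering route."""
    matches = [n for n in routes if ipaddress.ip_address(dst) in n]
    if not matches:
        return None
    best = max(matches, key=lambda n: n.prefixlen)
    return routes[best]

print(next_hop("10.1.2.7"))    # port-B (more specific wins)
print(next_hop("10.1.9.9"))    # port-A
```

Tunnelling breaks exactly this: addresses hidden inside an encapsulated payload cannot be aggregated by the tables that carry them.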
An index, by any other name
One thing that Las Vegas does better than SDN is handling multi-tenancy within shared spaces.
When your party checks into a Vegas hotel, needing a few rooms, the hotel broker assigns you some containers (king-size non-smoking). What it doesn't try to do, unless you are already very rich, is to move all the spare rooms to your own private floor; nor does the hotel set up partitions and renumber the rooms for you so you only have to count to 10.10.10.10. Instead, room numbers are fixed and reflect the coordinates of where they are located. This makes navigation easy.
For convenience, as a consumer, you can hang your name on the door, or register (name, room number) pairs in a directory. A quick lookup in the front-desk registry confirms that tenant X will be found in room Y, whose location is completely defined according to the coordinate system of the physical topology:
Guest : (building, floor, room).
This astounding invention is called an index, and is a well known technology from phone-books, textbooks, and even databases. The idea has been used for making shortcuts to spatial resources almost since society was a thing (name/address books, function/phone Yellow Pages, the Network Information Service NIS (like Yellow Pages), disk blocks to inodes, memory addresses to process heaps, etc). One does not rewire the platter of a hard disk every time someone wants to store a new file. An index is something that is quicker to search than the whole resource space, and provides a pointer to a directly findable location.
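The front-desk register is nothing more than a key-value map from logical names to fixed physical coordinates. A minimal sketch, with invented guests and coordinates:

```python
# An index: a cheap-to-search structure mapping logical names to fixed
# physical coordinates, so nothing physical gets rewired when tenants change.

register = {}   # the front-desk directory: name -> (building, floor, room)

def check_in(guest, coordinate):
    register[guest] = coordinate       # hang the name on the door

def locate(guest):
    return register.get(guest)         # O(1) lookup, no shouting required

check_in("tenant-X", ("east-wing", 4, 12))
print(locate("tenant-X"))              # ('east-wing', 4, 12)
```

Tenancy changes touch only the register; the rooms, and their numbers, stay put.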
Hotels also do a better job of virtualizing resource segmentation. When your room assignments are split across several floors of the hotel, they don't build you a private elevator and set up walls so you can't see the other tenants; you share the common elevator and the common space, and you avoid the other guests by getting a key to your room. If guests don't want to be observed, they can encrypt themselves in a niqab or dark glasses. The keys are not shared. The front desk might issue several keys, according to an access control list which they maintain.
If you have company offices in London and New York, you don't normally address all the traffic to London, then hire a courier to forward the post privately by secret addresses. You would use the existing postal address infrastructure, and publish public addresses. This all seems quite normal and obvious. But this is not what we did with SDN.
Broadcasting to find out something you would prefer didn't exist in the first place is annoying at best, but this muddle works well enough as long as you don't change things around too much, and your voice can be heard around the whole building.
But now enter the spectre of scale.
Tunnels are a stop-gap measure only.
The problem with all of this is that it is not designed to scale horizontally or at speed. You can add faster processing, but you can't easily share load when flooding.
What happens when the single hotel (or gated LAN segment) isn't big enough for the tenants within, and the nearest available space is on a different LAN subnet (a different network building)? Now you need a router to tell you how to reach it, because you have no direct link-level connection to it anymore. The natural thing to do would be to abandon the LAN architecture and use the WAN part, but that would require us to think in a new way.
What people have done instead is to try to wire switches together to extend the size of a LAN. This simplifies routing, because it is built automatically, to some extent, by neighbour discovery (IPv6 has a smarter neighbour discovery built in), but it places a burden on physical switches; and when we implement switching in software (SDN), the virtual wiring we would use to connect the switches together is made of the very stuff we are trying to virtualize. But worse than that, we need the WAN architecture to make a LAN workaround by tunnelling. What a mess.
A protocol for calling out in two locations at the same time, as if they were one, leads to SDN Heath-Robinson contraptions: L2 tunnels. These connect multiple `LAN' segments and hope for the broadcast messages to be heard through them. If that were not enough, we now really do need to renumber the hotel rooms on the other end of the tunnel too, because one building's namespace is not naturally coordinated with another building's namespace.
Hence things like NVGRE and VxLAN were born, to tunnel one local namespace into another, and patch the welded address spaces, hitching a ride on the routed WAN architecture, and trying hard to say `hey what are you looking at? Nothing to see here. Move along!'. Not a separation of concerns, but a mad conflation of dependencies to maintain the illusion of the same horse and cart. (Actually, these things would be more or less fine, if they just left out the L2 part.)
This might be a patch on an industry's current idiosyncratic habits, with some measure of backward compatibility, but this is no solution for the long-term. For one thing, it doesn't answer the more fundamental question: what intentional process is driving the need for these tunnels and namespaces in the first place? Doesn't that also require some kind of management? Why are the two not integrated?
I believe that, if we answer those questions, the argument for current L2 overlays has to evaporate in a puff of logic. From there one can migrate to a simpler approach, based on simple index indirection of L3 (anycast or DNS, like CDN, or something modern based on a true spatial coordinate system). With an index, all other issues of Network Function Virtualization (load balancing, firewalls, etc) can be handled by software libraries on end-points.
The great E-scape
Digging tunnels to connect walled off regions seems to be going too far, especially if the tunnels are being made and re-made on demand, rather than being fixed infrastructure. So it is surely time to revisit those assumptions.
What could be the alternative?
I suspect that, instead of tunnelling L2, we should be building indexed namespaces, on top of fixed L3 addresses, on a per-tenant basis. In a hub-spoke architecture, with point to point connections, MAC addresses should never have to be used, let alone learned and transported. And inside a hypervisor, it is virtually a hub-spoke architecture.
If you don't need broadcast messages anymore, you can eliminate the need to think about communication as a LAN/WAN problem. Everything is forwarded to its indexed location. Optimize that! The hub-spoke architecture makes them unnecessary, because the switch knows exactly what address each port is connected to. It doesn't need to ask them all. It doesn't matter whether we use the MAC address or the IP-postfix, because these are both unique to the switch. IPv6 combines the two in a simple way automatically.
Now everything simply becomes an end-to-end problem, as it was originally intended. Similarly, if you remove the pointless double addressing of the LAN between IP postfix addressing and MAC addresses, then you don't need ARP anymore, and peer discovery can assign link local addresses more directly. Now you can just use IP addresses in full, and the split between LAN and WAN is no longer relevant.
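In code, the contrast with flooding is stark: a hub-spoke switch can keep a per-port table and forward point-to-point without ever asking the whole room. A toy sketch, with invented addresses and port numbers:

```python
# In a hub-spoke topology the switch already knows which address sits on
# which port, so discovery-by-flooding is unnecessary.

port_table = {
    "10.0.0.1": 1,
    "10.0.0.2": 2,
    "10.0.0.3": 3,
}

def forward(dst_ip):
    """Point-to-point forwarding: look up the port; never flood."""
    port = port_table.get(dst_ip)
    if port is None:
        raise KeyError(f"no port known for {dst_ip}")   # still no flooding
    return port

print(forward("10.0.0.2"))   # 2
```

Whether the key is a MAC address or an IP postfix makes no difference to the lookup; both are unique within the switch.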
One of the features of IPv6 was to get rid of pointless broadcasting, and ARP along with it. It was designed without the bus model centre-stage. Its peer discovery mechanism was better suited to a hub-spoke architecture. As an addressing scheme, it is not perfect (it is still prefix and encapsulation, rather than a tuple), but it is not too bad (there are workarounds for this based on fast packet header analysis). The real problem is that network addressing itself is poorly designed, and the switches that are assumed to be the LAN domains have limited capacity. IP addresses seem to be modelled on the fixed size addresses used for memory access in chip-sets, rather than something with geo-spatial regularity. In the multi-dimensional space of a network, a tuple-space scheme would make more sense. This is where virtualization belongs.
Think of the room numbers in a hotel: (building, floor, room). This kind of flat tuple addressing makes much more sense than encapsulated and packed Russian doll headers that conceal location without handing over to a gateway. (The encapsulation header model prevents address components from being used commutatively during route computation, which in turn leads to duplication of effort and a proliferation of rewriting hacks).
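The advantage of a flat tuple address is that route computation can compare components independently, dimension by dimension, without unwrapping nested headers. A hypothetical sketch of such a comparison, using invented hotel-style coordinates:

```python
# Flat tuple addresses let routing reason about each dimension separately:
# building first, then floor, then room, with no header unwrapping.

def route_step(here, dst):
    """Compare tuple components in order; move along the first mismatch."""
    for axis, (h, d) in enumerate(zip(here, dst)):
        if h != d:
            return axis, d          # which dimension to move in, and where to
    return None                     # already there

print(route_step(("bldg-1", 3, 10), ("bldg-1", 5, 2)))   # (1, 5): change floor
print(route_step(("bldg-1", 5, 2), ("bldg-1", 5, 2)))    # None: arrived
```

With encapsulated Russian-doll headers, the inner components are invisible until a gateway strips a layer, so no such component-wise comparison is possible en route.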
IPv6 addresses, with their simple prefixes that hack a simple doublet (routable, non-routable), can still be good enough if we can index them with a virtual address indirection. This should probably remind us of directory services.
Putting L2 back into its doll, and embracing L3 namespaces
There are many issues to be addressed in the concept of SDN. One is a sane, self-service, and DevOps-aware user experience for workloads (which Socketplane has already demonstrated); another is a scale-free low-complexity solution on the back-end.
Today we loosely, and erroneously, associate North-South (datacentre external) traffic with the routed L3 part, and East-West (datacentre internal) traffic with the LAN L2 part. This is all wrong. LAN was not designed to scale indefinitely with structureless, random addressing, whereas the routable part was designed to scale by planned address aggregation. L3 is undeniably better suited to create high density fabrics for scaling E-W traffic too. It is the walkie-talkie we need.
VLANs, on the other hand, are definitely not a scaling mechanism (although we pretend they are), but they do teach the value of namespaces. They do not assist messages in going anywhere, they prevent messages from going somewhere. They are selective network ear-muffs, not loudhailers or walkie-talkies.
So why is SDN so focused on the non-routable, non-scalable part of the stack? Well, change is hard. We are creatures of habit. People are invested in their knowledge (or their ketchup), and I am ever surprised at how far one can push pure switching by brute force and performance alone. The real cost is a management cost. But it will fail eventually, and unlike IP it is not designed to recover.
Mapping L2 namespaces over a tunnel has got nothing at all to do with private L3 addressing. The only reason it makes an unwanted appearance at all is the forced reliance on ARP as a quasi neighbour-discovery mechanism. Also, the widespread L2 `flow thinking' seems to be a related cultural artefact, which encourages traffic balancing of network load by stream rather than by packet, diminishing the ability to perform efficient multiplexing, and makes routes more fragile. A gold rush to instant functional gratification rides roughshod over basic networking principles.
If we factor in the intentional aspects of why workloads are orchestrated, it makes a lot more sense to distribute this intentionality with a per-application dynamical directory service: an index.
Network addressing still needs to be based on geographically aggregated patterns in order to scale. Virtualization of addresses does not need to interfere with that, if we simply use a service that implements private namespaces on top of public IP addresses, there is no need even for tunnels (Russian doll encapsulation). This could all be hidden in a virtual network interface on individually addressable application containers (like Docker). Like offices or hotel rooms, containers in a stable infrastructure are not rebuilt completely, with new addresses. They have fixed locations and addresses, and are simply re-purposed on the fly, not torn down and rebuilt from scratch (short term tenancy, long term structure).
Indexing application namespaces just needs some tuples to be maintained with a per-application service.
(namespace, private IP address, public IP address)
Let's not muddle virtualization (overlays) with addressing, and end up hacking a needless protocol stack. Namespaces are a service. They cannot scale for all tenants combined; to attempt that would be madness. DNS, LDAP, NIS, etc, notwithstanding, what is needed is something that is dynamical and scales along with the application rather than the fixed infrastructure. A simple, small, replicated key-value database would do the trick.
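A minimal sketch of such a per-application namespace index shows how little machinery is needed. The namespaces and addresses here are invented for illustration, and a production version would of course be replicated rather than a single in-memory dict:

```python
# A per-application namespace index: a small key-value map of
# (namespace, private address) -> public address, replacing tunnels
# with simple indirection.

ns_index = {}

def bind(namespace, private_ip, public_ip):
    ns_index[(namespace, private_ip)] = public_ip

def resolve(namespace, private_ip):
    """Tenant-scoped lookup: same private address, different tenants."""
    return ns_index[(namespace, private_ip)]

# Two tenants reuse the same private address without conflict:
bind("app-red",  "10.0.0.5", "203.0.113.17")
bind("app-blue", "10.0.0.5", "198.51.100.4")

print(resolve("app-red", "10.0.0.5"))    # 203.0.113.17
print(resolve("app-blue", "10.0.0.5"))   # 198.51.100.4
```

Because the index is keyed by namespace, it scales with each application rather than with all tenants combined, which is exactly the separation the text argues for.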
Don't rebuild your foundations on the fly
Separation of timescales is the key to stable systems. Core infrastructure (like networking) should not be dynamic. Namespaces can be dynamic atop a slowly evolving platform. Two different needs are served by dependable infrastructure and volatile tenancy. We need to stop mixing them up out of simple thoughtlessness.
I see people arguing for a dynamic infrastructure. That is wrong thinking. Infrastructure and foundations are meant to be stable. What you build on top of that can be dynamic.
Per-application namespace virtualization for datacentres and beyond
Although the industry is currently fixated on cloud datacentres, the world is larger than that. Soon, we are going to be moving containers from datacentres to mobile devices and back, and the datacentre will grow to envelop our daily environments in a seamless way through embedded systems. We should not be planning networking based on purely symmetrical computational fabrics in walled gardens.
Focusing on tenant containers as the inspiration for network virtualization is a great idea though, because they are the sources (the promisers) of intentionality in a network, and thus they drive the semantics that infrastructure needs to promise in order to make tenancy efficient.
What virtualization properly needs is regular, indexable, addressed spaces, and an agent oriented understanding of all networking mechanisms. Indeed, I believe that a model of a semantic space is the correct way to virtualize networks, using Promise Theory principles and a per-application key-value coordination index. Container spaces like Kubernetes are already taking this approach for the end-points, and now is the time to do the same for the networking. There doesn't need to be an overlay of the network through encapsulation, as long as the channel is private, say, by encryption.
Encapsulation is a model for describing interfaces, not locations.
Most of all, the software industry needs to learn how to write distributed applications without relying on protocol overloading and technology-related details they should never have had to see in the first place. The explanation for all of this conceptual inertia is that applications have been written with state tied up in the networking itself.
Un-learning the network stack
Just because it's what we've learned, doesn't mean it's still the right way to do it.
Imagine roaming around hotels shouting out for friends. It happens in the movies, but you don't want it in your world. And flooding emulated buses is for mobile open-spaces, not hypervisors and point-to-point connected topologies. At hotels, neighbour discovery (with handshakes and introductions) takes place in the bar and lobby; later you exchange contact info and get a room. You really shouldn't go roaming the elevators, calling out in hotel hallways for your friends and family.
We shall eventually need a model of IP addressing that handles solid phase and gaseous phase addressing, for a mobile network experience: an atmosphere of mobile users around a fixed solid core. SDN over IPv4 is not going to get us there, I fear. Already, packet inspection is being used to get around encapsulation's limitations, i.e. the lack of a uniform tuple model. I would love to see that radical fix, but real-world change is more gradual, and messier than that.
Anyway, we should keep an eye on what is happening around containers now. This focus on intent (the network promises and their agencies), and the workloads, will be the catalyst (no pun intended) to bring SDN to the next level (layer, plane, oh god, whatever...).
Sat Jan 18 12:36:57 CET 2015
ACK: As always, I am grateful to Dinesh Dutt, Mike Dvorkin, Brent Salisbury, David Tucker, Madhu Venugopal, and John Willis for discussions.