The Open Fabric: Intelligent Network, Intelligent Operator

Maybe I’m old school, but I’m leery of black box networking, especially when critical services depend on my network infrastructure. I wore those shoes for 19 years, so this perspective has been etched into my instincts by real-world experience. I don’t think I’m alone in this thinking.

When bad things happen to network users, all eyes are on the physical network team, even when it’s not a physical network problem. The physical network is guilty until it can be proven otherwise. So it’s fair that physical network operators are skeptical of technology whose inner workings are unknowable. Waiting on your network vendor’s technical support team isn’t an option when the CIO is breathing down your neck. Especially if a mitigation exists that can be acted on immediately.

That said, there is indeed a limit to human ability. Our grasp of the network gets increasingly lossy as the number of data points grows. Moreover, the level of ability is inconsistent across individuals. Box-level operations leave it up to the operator to construct an accurate mental model of the network, with its existing and possible states, and this generally results in varying degrees of poor decision making.

To make up for this shortcoming, it’s well known by now that the next level in networking is centered around off-box intelligence and control, a theme that goes by the heading "SDN". However, some forms of off-boxing network intelligence create new problems, such as when the pilots are no longer able to fly the plane. This is bad news when the autopilot is glitching.

How safe is the intelligent network, if the network operators have been dumbed down? The public cloud folks know this spells disaster. So if they don’t buy into the notion of “smart network, dumb operator”, then why should anyone else with a network? In some industries, it doesn’t take much more than one extended disaster before the “out of business” sign is hanging on the door.

If your SDN was built by your company’s development team (like maybe at Google), your network operator is probably a network SRE who is very familiar with its code. They can look under the hood, and if they still need help, their technical support team (the SDN developers) works down the hallway.

On the other hand, black box SDN is fine and dandy until something fails in an unexpected way. I suspect that eventually it will — for example, when all the ace developers who built it have moved on to their next startup and are replaced by ordinary folks. You need to trust that these SDN products have an order of magnitude higher quality than the average network product, since the blast radius is much larger than a single box. But they too are created by mortals, working at someone else’s company. The reality is that when the network breaks unexpectedly, you are often left to your own devices.

So how is the rest of the world supposed to up-level their networking when the only choices seem to be black-box SDN or build-your-own SDN? (To be very clear, I’m talking primarily about the physical layers of the network.)

Let me tell you what I think.

IMO, a resilient network has an intelligent on-box "base" mode (OpenFlow does not qualify), which guarantees a basic level of network service availability at all times to critical endpoints. It implements a consistent distributed control and data plane. This mode should be based on knowable open communication standards, implemented by network nodes and programmed declaratively. Over that should be an off-box "enhanced" mode that builds on top of the base mode to arrive at specific optimization goals (self-driving automation, bandwidth efficiency, minimum latency, end-user experience, dynamic security, network analytics, etc). I believe this is how the intelligent network must be designed, and it is consistent with the wisdom in Juniper founder Pradeep Sindhu's statement "centralize what you can, distribute what you must."

If the enhanced mode system has an issue, turn it off and let the network return safely to the base mode state. Kind of like the human body — if I get knocked unconscious I’m really happy my autonomic nervous system keeps all the critical stuff ticking while I come back to my senses.
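To make the fallback idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how any real product works: the `FabricNode` class, its fields, and the prefix and next-hop names are all hypothetical. The point it tries to show is that base-mode state is always present on the box, enhanced-mode state is an overlay on top of it, and throwing the overlay away leaves forwarding intact.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FabricNode:
    """Hypothetical node holding base-mode state plus an optional overlay."""
    name: str
    # Base mode: next hops computed on-box by a standard distributed protocol.
    base_next_hop: Dict[str, str] = field(default_factory=dict)
    # Enhanced mode: optional overrides pushed by an off-box system.
    enhanced_next_hop: Dict[str, str] = field(default_factory=dict)
    enhanced_enabled: bool = False

    def push_override(self, prefix: str, next_hop: str) -> None:
        # Enhanced mode may only refine what base mode already reaches.
        if prefix not in self.base_next_hop:
            raise ValueError(f"{prefix} has no base-mode route")
        self.enhanced_next_hop[prefix] = next_hop

    def lookup(self, prefix: str) -> str:
        # Prefer the overlay when it is on; otherwise use base mode.
        if self.enhanced_enabled and prefix in self.enhanced_next_hop:
            return self.enhanced_next_hop[prefix]
        return self.base_next_hop[prefix]

    def disable_enhanced(self) -> None:
        # Turning the overlay off never touches base-mode state.
        self.enhanced_enabled = False
        self.enhanced_next_hop.clear()


node = FabricNode("leaf1", base_next_hop={"10.0.0.0/24": "spine1"})
node.push_override("10.0.0.0/24", "spine2")
node.enhanced_enabled = True
optimized = node.lookup("10.0.0.0/24")  # overlay answer while enhanced is on

node.disable_enhanced()
fallback = node.lookup("10.0.0.0/24")   # base-mode answer after switch-off
```

Note that `push_override` refuses a prefix the base mode cannot already reach. That is one way to express the idea that the enhanced mode builds on the base mode rather than replacing it.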

This also allows the enhanced mode system to focus on optimizations and evolve rapidly, while the base mode system focuses on resilience and might evolve less rapidly. However, the enhanced mode must be fully congruent with the chosen base mode. So a first order decision is in choosing the right base mode, since any enhanced mode is built on top of it.

You might be asking by now, what does this have to do with the title?

My thesis is that if you choose a knowable base mode system, it will naturally lead you to a knowable enhanced mode system. Said another way, a standards-based base mode system, which is knowable, leads to a knowable enhanced mode system, since the enhanced mode uses the building blocks provided to it by the base mode system. It's not rocket science.

The next important consideration is that the enhanced mode system is designed to provide full transparency to the operator into its functions and operations, so that the network operator can take control if and when needed (see article). The best systems will actively increase the understanding of human operators. This is how we get both the intelligent network and the intelligent operator. I think this is what pragmatic SDN for the masses would look like.
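As a rough sketch of what that transparency could look like, the Python below has an enhanced mode system record every choice it makes next to the base mode choice and a plain-language reason, and lets the operator pin a decision by hand. Everything here (`TransparentOptimizer`, `Decision`, the method names, the sample reason strings) is hypothetical and only meant to illustrate the idea, not to describe an existing system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Decision:
    """One recorded choice, kept so the operator can audit it later."""
    prefix: str
    base_choice: str
    chosen: str
    reason: str


class TransparentOptimizer:
    """Hypothetical enhanced-mode component that explains itself."""

    def __init__(self) -> None:
        self.log: List[Decision] = []
        self.pins: Dict[str, str] = {}

    def operator_pin(self, prefix: str, next_hop: str) -> None:
        # The operator takes control: pins win over any optimization.
        self.pins[prefix] = next_hop

    def decide(self, prefix: str, base_choice: str,
               candidate: str, reason: str) -> str:
        if prefix in self.pins:
            chosen, reason = self.pins[prefix], "pinned by operator"
        else:
            chosen = candidate
        self.log.append(Decision(prefix, base_choice, chosen, reason))
        return chosen

    def explain(self, prefix: str) -> List[str]:
        # Human-readable history of every decision made for a prefix.
        return [
            f"{d.prefix}: base={d.base_choice} chosen={d.chosen} ({d.reason})"
            for d in self.log if d.prefix == prefix
        ]


opt = TransparentOptimizer()
first = opt.decide("10.0.0.0/24", "spine1", "spine2", "lower latency")
opt.operator_pin("10.0.0.0/24", "spine1")
second = opt.decide("10.0.0.0/24", "spine1", "spine2", "lower latency")
history = opt.explain("10.0.0.0/24")
```

The design choice worth noticing is that the log always shows the base mode choice beside the enhanced one, so the operator can see exactly what the optimization changed and why.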

Wednesday, December 5, 2018
