tag:blogger.com,1999:blog-75038684453739163742024-02-21T06:53:53.260-08:00The Open FabricMy own opinions on networking. Make no assumptions.Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.comBlogger16125tag:blogger.com,1999:blog-7503868445373916374.post-40017254888626254552019-01-10T15:29:00.000-08:002019-02-14T14:07:06.991-08:00Blending into software infrastructure<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Neue'; color: #454545}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Neue'; color: #454545; min-height: 14.0px}
</style>
<br />
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Electronic networks existed long before electronic compute and storage. Early on, the network was simple wires and switchboards, and the endpoints were humans. Telegraphs turned taps into on/off current on the wire and back into audible clicks. Telephones turned voice into fluctuating current and back into voice.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Since then, the network has existed as a unique entity apart from the things it connected.<span class="Apple-converted-space"> </span>Until now. <span class="Apple-converted-space"> </span></span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Less than two decades ago most applications were built in vertical silos. Each application got its own servers, storage, database and so on. The only thing applications shared was the network. The network was the closest thing to a shared resource pool — the original “cloud”. With increasing digital transformation, other services were also pooled, such as storage and database. However, each application interfaced with these pooled resources and with other applications directly. Applications had little in common other than the pooled resources they shared.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">As more code was written, the value of pooling common software functions and best practices into a “soft” infrastructure layer became evident.<span class="Apple-converted-space"> </span>The role of this software infrastructure was to normalize the many disparate ways of doing common things in application software.<span class="Apple-converted-space"> </span>This meant application developers could focus on unique value and not boilerplate.<span class="Apple-converted-space"> </span>This also meant the software infrastructure could make the best use of underlying physical resources on behalf of the applications.<span class="Apple-converted-space"> </span>Storage was absorbed into software infrastructure, and even database.<span class="Apple-converted-space"> </span>Software infrastructure was necessary to achieve economies of scale in digital businesses.<span class="Apple-converted-space"> </span></span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Over the past few decades, emphasis has been on decentralization of network control. The network’s greatest mission was survivability, and the trade-off was other optimizations, such as for end-user experience. However, the challenges of massive scale have created the need for centralized algorithms to ensure good end-user experience, eliminate stranded capacity, improve time to recovery and speed up deployment. These are imperatives for achieving economies at hyperscale. For some problems, like survivability, cooperating peers are best, but for others, like bin packing, master-slave is better.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">So while the network may continue to implement last-resort survivability in a distributed way, those optimizations that require centralization are being, and should be, driven by the common software infrastructure layer in the most demanding environments.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">From day one, the network design team at Bloomberg reported into the software infrastructure org.<span class="Apple-converted-space"> </span>My boss Franko reported to Chuck Zegar, the godfather of Bloomberg’s software infrastructure (and Mike Bloomberg’s first employee).<span class="Apple-converted-space"> </span>Mike tasked Chuck to engineer Bloomberg’s networks.<span class="Apple-converted-space"> </span>This first-hand experience in such an organization has led me to believe that software infrastructure is also the ideal place in the org chart for the network engineering role to develop networks that best serve the business and end users.</span></div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-10705199217213239102018-12-24T15:38:00.002-08:002019-01-15T21:51:47.922-08:00Backend security done right<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Neue'; color: #454545}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Neue'; color: #454545; min-height: 14.0px}
</style>
<br />
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">I think of classical firewall-based security like a community with a common manned gate, where the homes inside the community don’t have locks on their doors. Strong locks on strong doors are better if you ask me.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">Traditional firewalls look for patterns in the packet header to determine what action to take on flows that match specified patterns. I'd equate this to the security guard at the gate allowing in folks based on how they look. If someone looks like a person who belongs to the community, then they're let in. In the same way, if a crafted packet header matches permit rules in the firewall, then it is allowed through.</span><br />
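To make the analogy concrete, here is a stripped-down sketch of what such header matching amounts to. The rule format and field names are purely illustrative, not any vendor's syntax:

```python
# Toy 5-tuple firewall: match packet-header fields against an ordered rule
# list and apply the first matching action.  Illustrative only.

RULES = [
    # (match-fields, action); an empty match dict matches every packet
    ({"dst": "10.0.0.5", "proto": "tcp", "dport": 443}, "permit"),
    ({}, "deny"),  # default deny
]

def evaluate(packet: dict) -> str:
    """Return the action of the first rule whose fields all match the packet."""
    for match, action in RULES:
        if all(packet.get(k) == v for k, v in match.items()):
            return action
    return "deny"
```

Note that a crafted packet carrying the right header fields is permitted just as readily as a legitimate one; the rule has no way to know who is really behind it.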
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">One might say that’s where app-based firewalls come in. They go deeper than just the packet header to identify the application that is being transported. But how do these firewalls know what application a transport session belongs to? Well, they do pretty much the same thing — look for patterns within transport sessions that hopefully identify unique applications. Which means that with a little more effort, hackers can embed patterns in illegitimate sessions that fool the firewall.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">I suppose there are even more sophisticated schemes to distinguish legitimate flows from illegitimate ones that may work reasonably for well-known apps. The problem is that the application you wrote (or changed) last week is not "well-known", so you’re back to the 5-tuple. </span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">This is among the problems with man-in-the-middle security: it's just not good enough to keep out the most skilled intruder for long. Break past the front gate and all the front doors are wide open. The dollars you expend for this kind of security aren't limited to the cost of the security contraption. They also include the limitations this security model imposes on application cost, velocity, scale and richness, and the amount of wasted capacity it leaves in your high-speed network fabric. Clos? Why bother.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">One might say that server-based micro-segmentation is the answer to this dilemma. But what significant difference does it make if we clone the security guard at the gate and put one in front of each door? The same pattern matching security means the same outcomes. Am I missing something?</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">I'm sure there are even more advanced firewalls, but they surely bring with them either significant operational inefficiencies or the risk of collateral damage (like shutting down legitimate apps). I'm not sure the more advanced companies use them.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">In my humble opinion, “zero trust” means that application endpoints must never trust anything in the middle to protect them from being compromised. Which means that each endpoint application container must know which clients are entitled to communicate with it, and allow only those clients to connect if they can present credentials that can be used to prove their true identity and integrity. Obviously a <a href="https://spiffe.io/" target="_blank">framework</a> that facilitates this model is important. Some implementations of this type of security include <a href="https://istio.io/" target="_blank">Istio</a>, <a href="https://cilium.io/" target="_blank">Cilium</a>, <a href="https://nanosec.io/" target="_blank">Nanosec</a> and <a href="https://www.aporeto.com/" target="_blank">Aporeto</a>, although not necessarily SPIFFE based.</span><br />
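To sketch what endpoint-enforced entitlement looks like in the simplest possible terms, here is a toy authorization check. The workload names follow the SPIFFE URI convention, but everything here (the names, the allowlist, the function) is invented for illustration; in a real deployment the peer identity would come from a certificate validated over mutual TLS, not from a plain string:

```python
# Hypothetical sketch: the application endpoint itself decides which verified
# workload identities may connect, instead of trusting a middlebox.

ALLOWED_CLIENTS = {
    "spiffe://example.org/billing/frontend",    # made-up workload IDs
    "spiffe://example.org/billing/reconciler",
}

def authorize(peer_spiffe_id: str) -> bool:
    """Admit a peer only if its verified identity is explicitly entitled.

    A real implementation would extract this ID from a client certificate
    validated via a SPIFFE-style workload identity framework.
    """
    return peer_spiffe_id in ALLOWED_CLIENTS
```

Malware that lands inside the data center presents no valid credential, so every door it knocks on stays shut, regardless of what its packets look like.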
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">Where backend applications universally adopt this security model, in-the-middle firewalls are not needed to ensure security and compliance in backend communication. Drop malware into the DC and it might as well be outside the gate. It doesn’t have keys to any of the doors, i.e. the credentials to communicate with secured applications. Firewalls can now focus on keeping the detritus out at the perimeter — D/DoS, antivirus, advanced threats, and the like. None of that requires managing thousands of marginally effective match-action rules, or accepting the trade-offs that come with blunt-force security.</span><br />
<div>
<br /></div>
</div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-20286042186274163552018-12-22T18:09:00.002-08:002018-12-26T12:49:09.172-08:00Death by a thousand scripts<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Neue'; color: #454545}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Neue'; color: #454545; min-height: 14.0px}
span.s1 {text-decoration: underline}
</style>
<br />
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Early on in my automation journey, I learned some basic lessons from experiencing what happens when multiple scripting newbies independently unleash their own notions of automation on the network.<span class="Apple-converted-space"> </span>I was one of them.<span class="Apple-converted-space"> </span></span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Our earliest automations were largely hacky shell scripts that spat out snippets of config which would then be shoved into routers by other scripts. We’re talking the mid-1990s. It started out fun and exciting, but with more vendors, hardware, use cases, issues, and so on, things went sideways.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">We ended up with a large pile of scripts with different owners, often doing similar things to the network in different ways. Each script was tailored for some narrow task, on specific vendor hardware, in a specific role. As the folks who wrote the scripts came and went, new scripts of different kinds showed up, while others became orphaned. Still other scripts ran in the shadows, known only to their creator. Chasing down script issues was a constant battle. It was quite the zoo.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">We learned quickly that with automation power must come automation maturity.<span class="Apple-converted-space"> </span></span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">The problem was that we were operating based on the needs at hand, without stepping back and building toward the bigger picture. With each need of the day came a new script, or a tweak to an existing one. Each had different logging, debugging, error handling, etc, if any at all. No common framework. Each script was more or less a snowflake. Heaven forbid we should make any change to the underlying network design. Eventually the scripts became the new burden, in place of the burden we set out to reduce with them.</span><br />
<span class="Apple-converted-space" style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Even seasoned developers assigned to network automation make newbie mistakes. The code is great, but the automation not so great. The main reason is that they lack the subject matter knowledge to deconstruct the network system accurately, and most likely they were not involved when the network was being designed.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">IMO, the best network automation comes from a class of networking people that I refer to as network developers — software developers that are networking subject matter experts.<span class="Apple-converted-space"> </span>These folks understand the importance of the network design being simple to automate.<span class="Apple-converted-space"> </span>They are core members of the network design team, not outsourced developers.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">In my case, I found that if my automation logic reflected the functional building blocks in the network design, I could capture the logic to handle a unique function within corresponding methods of a model-based automation library. For example, a method to generate a vendor-specific traffic filter from a vendor-independent model, and another method to bind and unbind it on an interface. Code to handle the traffic filter function was not duplicated elsewhere. Constructing a network service that is a composite of multiple functions was just a matter of creating a service method that invoked the relevant function methods.</span></div>
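A toy sketch of that shape of library, for illustration only: the vendor names and config dialects below are invented, and the real library's internals are not public. The point is that each network function lives in exactly one method, and a service is just a composition of function methods:

```python
# Illustrative model-based automation sketch (all names and syntax made up).

FILTER_TEMPLATES = {
    # vendor -> how one permit rule renders in that vendor's config dialect
    "vendorA": "permit {proto} from {src} to {dst}",
    "vendorB": "allow protocol {proto} source {src} destination {dst}",
}

def render_filter(vendor: str, rules: list) -> list:
    """Function method: vendor-specific filter from a vendor-independent model."""
    template = FILTER_TEMPLATES[vendor]
    return [template.format(**rule) for rule in rules]

def bind_filter(device_config: dict, interface: str, filter_name: str) -> dict:
    """Function method: attach a named filter to an interface (unbind = pop)."""
    device_config.setdefault(interface, {})["filter"] = filter_name
    return device_config

def provision_service(vendor: str, interface: str, rules: list) -> dict:
    """Service method: composed purely of the function methods above."""
    device_config = {"filters": {"svc-filter": render_filter(vendor, rules)}}
    bind_filter(device_config, interface, "svc-filter")
    return device_config
```

Supporting a new vendor means adding one template; supporting a new service means composing existing methods, not copying their logic.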
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">This approach ensured that my team focused on systematically enhancing the capabilities of a single automation library rather than replicating the same network manipulations in code over and over in different places.<span class="Apple-converted-space"> With the devops mindset, o</span>ften new device features were tested only <span class="s1"><u>after</u></span> incorporating them into the automation library, which meant features were tested with and for automation.<span class="Apple-converted-space"> </span>Multiple high value outcomes derived from a common coherent automation engine, and a very strong <a href="https://aldrinisaac.blogspot.com/2018/12/network-automation-as-network.html" target="_blank">network automation as network design</a> philosophy.</span></div>
<div class="p2">
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">There are certainly other equally good (or better) automation models. For example, <a href="https://github.com/Juniper/contrail-controller/tree/master/src/config/fabric-ansible" target="_blank">Contrail fabric automation</a> takes a role oriented approach.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Let me close before I go too far off on a tangent. The lesson here — don’t wing automation for too long. Aim for method, not madness.</span></div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-34375967063349917942018-12-07T12:05:00.001-08:002018-12-11T15:32:19.910-08:00Network automation as network design<span style="font-family: "verdana" , sans-serif;">A very large part of my professional career in network design was spent working on automation. A journey in automation that began in 1996 when I and a few colleagues engineered Bloomberg’s first global IP WAN, which evolved into the most recognized (and agile) WAN in the financial services industry. </span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">The automation behind that network started off very basic, and over the years evolved into a very lean and flexible model-based library core, with various programs (provisioning, health-checking, discovery, etc) built on top. This small automation library (less than 15K of OO code) drove a high-function multi-service network with support for 6+ different NOSes and 100+ unique packet-forwarding FRUs. It included service-layer functions such as VPNs with dynamic context-specific filter, QoS and route-policy construction, BGP peering (internal and external), inline VRF NAT, etc, and supported a variety of attachment interfaces such as LAG/MC-LAG and channelized Ethernet/SONET/SDH/PDH interfaces (down to fractional DS1s) with Ethernet, PPP, FR and ATM encapsulations. And, yes, even xSTP. It was quite unique in that I have yet to see some of the core concepts that made it so lean and flexible repeated elsewhere. We were among the earliest examples of a DevOps culture in networking.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span><span style="font-family: "verdana" , sans-serif;">Many lessons were learned over the years spent evolving and fine tuning that network automation engine. I hope to capture some of those learnings in future blog entries. In this blog entry, I want to share a perspective on the core foundation of any proper network automation — and that is network architecture and design.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">All great software systems start with great software architecture and design. A key objective of software architecture and design is to achieve a maximum set of high quality and scalable known and yet-to-be-known outcomes, with the least amount of complexity and resources. Applying paint to the network canvas without tracing an outline of the desired picture doesn’t always get you to the intended outcomes. Design top down, build bottom up, as they say.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">In networking, automation is a means to an end, which are the high quality services to be rendered. Once we know what network services we expect to deliver over our network, we can then identify the building-block functions that are required and their key attributes and relationships. From there we identify the right technologies to enable these functions such that they are synergistic, simple, scalable and resource efficient. The former is network architecture and the latter is network design.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span><span style="font-family: "verdana" , sans-serif;">It’s only after you arrive at a foundational network architecture that you should start the work on automation. After this, automation should coevolve with your network design. <u>Indeed, the network design must be automatable</u>. In this regard, one might even look at network automation as an aspect of network architecture and design, since its role is to realize the network design. Obviously this would mean the Bloomberg automation library implements the Bloomberg network design, and not yours (although it might have everything you need).</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span><span style="font-family: "verdana" , sans-serif;">IMO, great network automation is based on software design that tightly embraces the underlying network design, such that there are corresponding functional equivalents in the automation software for the functions in the network design. This is what I call model-based automation (as opposed to model-blind automation). </span><span style="font-family: "verdana" , sans-serif;">In this sense again, good network automation software design is inclusive of the network design. </span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">This last assertion is an example of what I spoke of in my <a href="http://aldrinisaac.blogspot.com/2018/12/intelligent-network-intelligent-operator.html" target="_blank">previous blog</a>, of how an enhanced mode system (here the network automation system) should be a natural extension of a base mode system (the network system), such that if the base mode system is knowable, then so is the enhanced mode system. Which, when done right, makes possible both an intelligent network as well as the intelligent operator. This assertion is backed by a very successful implementation on a large high function production network.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">In conclusion, what I've learned is that network architecture and design is king and, done right, network automation becomes a natural extension of it. Companies that do not have the resources or know-how to properly marry automation design and network design have a few hard lessons ahead of them in their automation journey. For some companies, it might make sense to consider network automation platforms such as Contrail's fabric automation, which incorporates standards-based network design that is built into a model-based <a href="https://github.com/Juniper/contrail-controller/tree/master/src/config/fabric-ansible" target="_blank">open-source</a> automation engine.</span>Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-41788526076529380702018-12-05T21:41:00.003-08:002018-12-17T16:40:38.859-08:00Intelligent Network, Intelligent Operator<span style="font-family: "trebuchet ms" , sans-serif;">Maybe I’m old school, but I’m leery of black box networking. Especially if critical services are dependent on my network infrastructure. I wore those shoes for 19 years, so this perspective has been etched into my instinct by real world experiences. I don’t think I’m alone in this thinking. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">When bad things happen to network users, all eyes are on the physical network team, even when it’s not a physical network problem. The physical network is guilty until it can be proven otherwise. So it’s fair that physical network operators are skeptical of technology whose inner workings are unknowable. Waiting on your network vendor’s technical support team isn’t an option when the CIO is breathing down your neck. Especially if a mitigation exists that can be acted on immediately. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">That said, there is indeed a limit to human ability. It becomes increasingly lossy as the number of data points grows. Moreover, the levels of ability are inconsistent across individuals. Box-level operation leaves it up to the operator to construct an accurate mental model of the network, with its existing and possible states, and this generally results in varying degrees of poor decision making. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">To make up for this shortcoming, it’s well known by now that the next level in networking is centered around off-box intelligence and control — this theme goes by the heading "SDN". However, some forms of off-boxing network intelligence create new problems, such as when the pilots are no longer able to fly the plane. This is bad news when the autopilot is glitching. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0gJAbFetL2n8fHsoQnNDG3W38mDVWVODVEt6ZvvGl5pZm1zUdwRF1WJO1WYU7h6SRFVhXp0pSbPjbKEz6RL3p9k4SYnLfe73H7b1QCh_Ren9BUKs-LJYqpOuTyv0UQ2xocwy3mbEUT9s/s1600/automation-and-pilot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1536" data-original-width="1600" height="307" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0gJAbFetL2n8fHsoQnNDG3W38mDVWVODVEt6ZvvGl5pZm1zUdwRF1WJO1WYU7h6SRFVhXp0pSbPjbKEz6RL3p9k4SYnLfe73H7b1QCh_Ren9BUKs-LJYqpOuTyv0UQ2xocwy3mbEUT9s/s320/automation-and-pilot.jpg" width="320" /></a></div>
<br />
<span style="font-family: "trebuchet ms" , sans-serif;">How safe is the intelligent network, if the network operators have been dumbed down? The public cloud folks know this spells disaster. So if they don’t buy into the notion of “smart network, dumb operator”, then why should anyone else with a network? In some industries, it doesn’t take much more than one extended disaster before the “out of business” sign is hanging on the door. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">If your SDN was built by your company’s development team (like maybe at Google), your network operator is probably a network SRE that is very familiar with its code. They can look under the hood, and if they still need help, their technical support team (the SDN developers) work down the hallway. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">On the other hand, black box SDN is fine and dandy until something fails in an unexpected way. I suspect that eventually it will — for example, when all the ace developers who built it have moved on to their next startup and are replaced by ordinary folks. You need to trust that these SDN products have an order of magnitude higher quality than the average network product, since the blast radius is much larger than a single box. But they too are created by mortals, working at someone else’s company. The reality is that when the network breaks unexpectedly, you are often left to your own devices. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">So how is the rest of the world supposed to up-level their networking when the only choices seem to be black-box SDN or build-your-own SDN? (To be very clear, I’m talking primarily about the physical layers of the network.)</span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">Let me tell you what I think. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span><span style="font-family: "trebuchet ms" , sans-serif;">IMO, a resilient network has an <u>intelligent</u> on-box "base" mode (OpenFlow does not qualify), which guarantees the basic level of network service availability at all times to critical endpoints. It implements a consistent distributed control and data plane. This mode should be based on knowable open communication standards, implemented by network nodes and programmed declaratively. Over that should be an off-box "enhanced" mode that builds on top of the base mode to arrive at specific optimization goals (self-driving automation, bandwidth efficiency, minimum latency, end-user experience, dynamic security, network analytics, etc). I believe this is how the intelligent network must be designed, and it is consistent with the wisdom in Juniper founder Pradeep Sindhu's statement "centralize what you can, distribute what you must." </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">If the enhanced mode system has an issue, turn it off and let the network return safely to the base mode state. Kind of like the human body — if I get knocked unconscious I’m really happy my autonomic nervous system keeps all the critical stuff ticking while I come back to my senses. </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">This also allows the enhanced mode system to focus on optimizations and evolve rapidly, while the base mode system focuses on resilience and might evolve less rapidly. However, the enhanced mode must be fully congruent with the chosen base mode. So a first order decision is in choosing the right base mode, since any enhanced mode is built on top of it.</span><br />
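The base/enhanced split above can be sketched in a few lines of code. This is a toy model under stated assumptions, not any vendor's implementation; all class and function names here are hypothetical, invented for illustration.

```python
# Sketch of the two-mode design described above: an off-box "enhanced"
# optimizer layered over an always-available distributed "base" mode.
# All names here are hypothetical, for illustration only.

class BaseMode:
    """Distributed, standards-based control plane: always on, evolves slowly."""
    def forwarding_state(self):
        return {"policy": "shortest-path", "source": "base"}

class EnhancedMode:
    """Off-box optimizer; may fail, and must be safe to turn off."""
    def __init__(self, healthy=True):
        self.healthy = healthy
    def forwarding_state(self):
        if not self.healthy:
            raise RuntimeError("enhanced mode unavailable")
        return {"policy": "optimized", "source": "enhanced"}

def effective_state(base, enhanced):
    """Prefer the optimizer, but fall back safely to base mode on any failure."""
    try:
        return enhanced.forwarding_state()
    except RuntimeError:
        return base.forwarding_state()

# With the optimizer down, the network returns to the base-mode state.
state = effective_state(BaseMode(), EnhancedMode(healthy=False))
```

The key design property the sketch captures is that the base mode never depends on the enhanced mode, so turning the optimizer off is always a safe operation.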
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">You might be asking by now, what does this have to do with the title? </span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">My thesis is that if you choose a knowable base mode system, it will naturally lead you to a knowable enhanced mode system. Said another way, a standards-based base mode system is knowable, and an enhanced mode system built from the building blocks it provides is therefore knowable too. It's not rocket science.</span><br />
<span style="font-family: "trebuchet ms" , sans-serif;"><br /></span>
<span style="font-family: "trebuchet ms" , sans-serif;">The next important consideration is that the enhanced mode system is designed to provide full transparency to the operator into its functions and operations, such that the network operator can take control if and when needed (see article). The best systems will actively increase the understanding of human operators. This is how we get both the intelligent network and the intelligent operator. I think this is what pragmatic SDN for the masses would look like.</span>Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-51297611138572406042018-11-21T08:35:00.000-08:002018-12-26T14:41:17.867-08:00I'm back, again.<span style="font-family: "verdana" , sans-serif;">It’s been over 3 years since I last shared my thoughts here. It’s been that long since I left an amazing 19 year journey at Bloomberg, at the helm of the team that developed the financial industry’s most exceptional IP network. A network that I took great pride in and gave so much of my life for. I am grateful to have had the opportunity to build what I did there and learn many things along the way.</span><br />
<span style="font-family: "verdana" , sans-serif;"> </span><br />
<span style="font-family: "verdana" , sans-serif;">Three years ago I decided to go from building mission-critical global networks to building network technologies for mission-critical networks with the team at Juniper Networks. Two very different worlds. It wasn’t easy, but I have evolved. For the record, my heart is still that of a network operator. 20+ years at the front lines doesn’t wash off in 3 years. Someday I hope to go back to being an operator, when I have exhausted my usefulness on this side of the line, or maybe before that. </span><br />
<span style="font-family: "verdana" , sans-serif;"> </span><br />
<span style="font-family: "verdana" , sans-serif;">One of the perks of being a customer is that I could say whatever was on my mind about networking tech. Not so as a vendor. So I chose to stay low and focus on trying to align what was on the truck to what I knew operators like me needed. I feel that I can be more open now that we’ve reached the bend. </span><br />
<span style="font-family: "verdana" , sans-serif;"> </span><br />
<span style="font-family: "verdana" , sans-serif;">I’ve had the opportunity to be in the middle of getting some important things unstuck at Juniper. The Juniper of 3, 2 and 1 year ago were quite different from the one today, marked by decreasing inertia. Over that time I lit the fire on three key pivots towards executing to the needs of DC operators. Specifically, the pivot to a strong DC focus in the NOS, the solidifying of our network design strategy for [multi-service] DC, and getting a proper DC fabric controller effort in motion. They’re all connected, and together complete the pieces of the DC fabric puzzle. Three years in.</span><br />
<span style="font-family: "verdana" , sans-serif;"><br /></span>
<span style="font-family: "verdana" , sans-serif;">My respect to all the Juniper friends that turned opportunities into realities. I'm a big believer that real change starts at the talent level and, with the right conditions, percolates up and across. Juniper has talent. Now with the right game in play and the right organizational alignment at Juniper, good things are set to happen.</span><br />
<span style="font-family: "verdana" , sans-serif;"> </span><br />
<span style="font-family: "verdana" , sans-serif;">They say ideas and talk are cheap without execution, so having influenced these out of a virtual existence into reality is my key result, along with their positive impact on Juniper. Driving big change isn’t new to me. I built the financial industry’s most venerated IP network from the ground up, and have been at the forefront of multiple key shifts in the networking industry, including recognizing the need for EVPN and bringing it into existence, which I write about <a href="https://aldrinisaac.blogspot.com/2013/10/the-bumpy-road-to-e-vpn.html" target="_blank">here</a> (the parts I can talk about publicly). </span><br />
<span style="font-family: "verdana" , sans-serif;"> </span><br />
<span style="font-family: "verdana" , sans-serif;">It’s almost impossible to get a market to move into a new vision with the push of a button. Smart companies start from where their customers are. The most successful tech companies are the ones that have the right vision, and can navigate their customers onto the right path toward that vision. The right path pays special attention to the “how”, and not just the “what” and “why”. The market recognizes a good thing when it sees it. The one that works best for it. For a vendor, the right path opens up to new opportunities. This has always been my focus. From the beginning, EVPN itself was meant to be a stepping stone in a path, not a destination -- "a bridge from here to there". As I like to say, value lies not in the right destination, but in the right journey. The destination always changes in tech, so the right vision today merely serves as a good vector today for the inevitably perpetual journey.</span><br />
<span style="font-family: "verdana" , sans-serif;"> </span><br />
<span style="font-family: "verdana" , sans-serif;">Now I hope to resume sharing my perspective here on topics such as the right stages for evolving the network operator role, EVPN in the DC, pragmatic SDN, and other random thoughts that I have the scars to talk about, along with the new insights I have gained over the past 3 years. </span><br />
<span style="font-family: "verdana" , sans-serif;"> </span><br />
<span style="font-family: "verdana" , sans-serif;">I’m not a prolific writer so don’t expect too much all at once. :-)</span><br />
<div>
<br /></div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com2tag:blogger.com,1999:blog-7503868445373916374.post-45223100960077138392015-05-15T08:44:00.002-07:002015-05-18T13:59:52.293-07:00Better than best effort — Rise of the IXP<span style="font-family: Trebuchet MS, sans-serif;">This is the second installation in my Better than Best Effort series. In <a href="http://aldrinisaac.blogspot.com/2015/05/better-than-best-effort-reliability-and.html" target="_blank">my last post</a> I talked about how the incentive models on the Internet keep ISPs from managing peering and backbone capacity in a way that supports reliable communication in the face of the ever growing volume of rich media content. If you haven't done so, please read that post before you read this one.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">It's clear that using an ISP for business communication comes with the perils associated with the "noisy neighbor" ebb and flow of consumer-related high-volume data movement. Riding pipes that are intentionally run hot to keep costs down is a business model that works for ISPs, but not for business users of the Internet. Even with business Internet service, customers may get a better grade of service within a portion of an ISP's network, but not when their data needs to traverse another ISP of which they are not a customer. There is no consistent experience, for anyone.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">However there is an evolving solution to avoid getting caught in the never ending battle between ISP and large consumer content. As the title of this blog gives away, the solution is called an Internet eXchange Point (IXP).</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">IXPs are where the different networks that make up the Internet most often come together. Without IXP, the Internet would be separate islands of networks. IXP are, in a sense, the glue of the Internet. From the early days of the Internet, IXP have been used to simplify connectivity between ISPs resulting in significant cost savings for them. Without an IXP, an ISP would need to run separate lines and dedicate network ports for each peer ISP with whom it connects. </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">However IXP and ISP are not distant relatives. They are in fact close cousins. Here's why. </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">Both ISP and IXP share two fundamental properties. The first is that they both have a fabric, and the second is that they both have "access" links used to connect to customers so they can intercommunicate over this fabric. The distinction between the two is in the nature of the access interfaces and the fabrics. ISP fabrics are intended to reach customers that are spread out over a wide geographic area. An IXP fabric on the other hand is fully housed within a single data center facility. In some cases an IXP fabric is simply a single Ethernet switch. ISP access links use technology needed to span across neighborhoods, while IXP access links are basically ordinary Ethernet cables that typically run only a few dozen meters. So essentially the distinction between the two is that an ISP is a WAN and an IXP is a LAN. </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">The bulk of the cost in a WAN is in the laying and maintenance of the wires over geographically long distances. Correspondingly the technology used at the ends of those wires is chosen based on the ability to wring out as much value out of those wires as possible. The cost of a WAN is significantly higher than a LAN with a comparable number of access links. ISP need to carefully manage costs which are much higher per byte. It is on account of the tradeoffs that ISP make in order to manage these costs that the Internet is often unpredictably unreliable.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">So how can IXP help? </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">Let's assume that most business begin to use IXP as meet-me points. Remember that the cost dynamics of operating an IXP are different than an ISP. At each IXP these business customers can peer with one another and their favorite ISPs for the cost of a LAN access link. </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2lfgUSXE8JrnSRSyFoT4ZRL1uOsWHBdhK3i-McVQlDgW9MLEITEwAk5XejsdKAxFIw81F1FjRFDRAMxdLFtSGj3ALfNZRqteIPhDj3vmJgH_EiilB_4TRaeBUWjg2jlP8mW4UHronB5M/s1600/Screen+Shot+2015-05-15+at+1.32.51+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2lfgUSXE8JrnSRSyFoT4ZRL1uOsWHBdhK3i-McVQlDgW9MLEITEwAk5XejsdKAxFIw81F1FjRFDRAMxdLFtSGj3ALfNZRqteIPhDj3vmJgH_EiilB_4TRaeBUWjg2jlP8mW4UHronB5M/s640/Screen+Shot+2015-05-15+at+1.32.51+AM.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span style="font-family: Trebuchet MS, sans-serif;">There are at least a few advantages of this over connecting to an ISP. Firstly, two business entities communicating via an IXP's [non-blocking] LAN are effectively not sharing the capacity between them with entities that are not communicating with either of them, which makes their experience far more predictable. This opens the door for them to save capital and operational costs by eliminating the private lines that they may currently have with these other business entities. Secondly, a business that is experiencing congestion to remote endpoints via an ISP can choose not to use it by [effectively] disabling BGP routing with that ISP. This differs from the standard access model, in which, if there are problems downstream of one of their ISPs, a business's control is limited to the far fewer ISPs from whom it has purchased access capacity. </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">The following illustrates a scenario where congestion at a peering point downstream of one of the ISPs used by a business affects its ability to reach other offices, partners or customers that are hosted on other ISPs.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVJQX4tl0hfQ106fG5otjdFJO41tuA1HoUvGgMscU2ibXgxAD9vj9i6NpotnEVtZ9812sSOlBARghUD3C-VFiGlwW37N1aIc1LLC1VxDX5C05NC-02QaIlXdZh28MV8CuS026jIILyiAI/s1600/Screen+Shot+2015-05-15+at+1.08.26+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Trebuchet MS, sans-serif;"></span></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho_HZyOJgCgG3QpzupZXdCDb72iOoD20O29xtc1mMQNG_fBbPr1R3Od0IPRYiZ7i6xCUkQdGQuNCTKYLUG2kI7FWJ9k-xrp1JVgM9S1eG2jSYqEFO1LCc4QkY7HAsH7keW6-GMNMOQArY/s1600/Screen+Shot+2015-05-15+at+1.40.05+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho_HZyOJgCgG3QpzupZXdCDb72iOoD20O29xtc1mMQNG_fBbPr1R3Od0IPRYiZ7i6xCUkQdGQuNCTKYLUG2kI7FWJ9k-xrp1JVgM9S1eG2jSYqEFO1LCc4QkY7HAsH7keW6-GMNMOQArY/s640/Screen+Shot+2015-05-15+at+1.40.05+AM.png" width="640" /></a></div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span id="docs-internal-guid-0afcdb16-55fa-fc51-26c4-5a74cba082da"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Trebuchet MS, sans-serif;">In the access model, since BGP cannot signal path quality, traffic is blindly steered over the path with the fewest intermediate networks rather than the path with the best performance. Buying extra access circuits <u>alone</u> to avoid Internet congestion is not a winnable game (more on this in the next part of this series).</span></span></span><br />
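A toy model of this behavior, with hypothetical AS paths and latency figures: BGP best-path selection (after local preference) compares AS-path length, and nothing in the decision process measures path quality. The `bgp_best_path` helper and the latency numbers below are illustrative, not real BGP code.

```python
# Toy model of BGP best-path selection by AS-path length only, with
# hypothetical paths and latencies. Real BGP has more tie-breakers,
# but none of them take measured path performance into account.

def bgp_best_path(paths):
    """Pick the path with the fewest intermediate networks (shortest AS path)."""
    return min(paths, key=lambda p: len(p["as_path"]))

# Two candidate routes to the same prefix: the shorter AS path crosses
# a congested peering link, the longer one is clean.
paths = [
    {"as_path": [65001, 65002], "latency_ms": 250},        # short but congested
    {"as_path": [65010, 65011, 65012], "latency_ms": 30},  # longer but clean
]

chosen = bgp_best_path(paths)  # BGP blindly prefers the congested route
```

The point of the sketch is that `latency_ms` never enters the comparison, which is exactly why buying more access circuits alone cannot steer around downstream congestion.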
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">The alternative approach using an IXP would look something like the following.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpQmOLwk-76gYJ1ZGqizS950xIQ5arGfAiIlPqxRIeLU1kdK8_0IUg4CuAwCyUzqctu2Jacphle8765jgx_197K23qiJDq4gVuk-mvbsi_qGYyw2mT4GQ4z500i5bGTNdatRIL2e4w4KA/s1600/Screen+Shot+2015-05-15+at+1.43.15+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpQmOLwk-76gYJ1ZGqizS950xIQ5arGfAiIlPqxRIeLU1kdK8_0IUg4CuAwCyUzqctu2Jacphle8765jgx_197K23qiJDq4gVuk-mvbsi_qGYyw2mT4GQ4z500i5bGTNdatRIL2e4w4KA/s640/Screen+Shot+2015-05-15+at+1.43.15+AM.png" width="640" /></a></div>
<br />
<span id="docs-internal-guid-21035db3-561a-7606-d86f-88d09ae4ebe5"><span style="font-family: 'Trebuchet MS'; vertical-align: baseline; white-space: pre-wrap;">This illustration shows how being at an IXP creates </span><span style="font-family: 'Trebuchet MS'; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">more direct access</span><span style="font-family: 'Trebuchet MS'; vertical-align: baseline; white-space: pre-wrap;"> to more endpoints at a better price point than buying access lines to numerous ISP. You can also see how peering with other business entities locally at an IXP can improve reliability, reduce costs and simplify business-to-business connectivity by combining it with Internet connectivity.</span></span><br />
<span style="font-family: 'Trebuchet MS'; vertical-align: baseline; white-space: pre-wrap;"><br /></span>
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;">There is an interesting trend occurring within the growing number of managed co-location data centers. Hosted within many of these co-location data centers are IXP. Some managed data center operators like Equinix even operate their own IXP at their data centers. These data centers are an ideal place for businesses to connect with one another through IXP without the downsides that come with using consumer-focused ISP. </span></span><br />
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;"><br /></span></span>
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;">This is not to say that the operational capabilities at all IXP are at a level needed to support large numbers of businesses. There is work to be done to scale peering in a manner that will give customers minimal configuration burden and maximal control.</span></span><br />
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;"><br /></span></span>
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;">There will even be a need for business-focused ISPs that connect business customers at one IXP to business customers connected to the Internet at another IXP. Although net neutrality prohibits the differentiated treatment of data over the Internet, it does not forbid an ISP or IXP from selecting the class of customer it chooses to serve. This is much like the difference between a freeway and a parkway. Parkways do not serve commercial traffic and so in a way they offer a differentiated service to non-commercial traffic.</span></span><br />
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;"><br /></span></span><span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;">As the Internet enables new SaaS and IaaS providers to find success by avoiding the high entrance cost of building a private service delivery network, more businesses are turning to the Internet to access their solution providers of choice. The old Internet connectivity model cannot reliably support the growing use of the Internet for business and so a better connectivity model is needed for a reliable Internet. New opportunities await.</span></span><br />
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;"><br /></span></span>
<span style="font-family: Trebuchet MS;"><span style="white-space: pre-wrap;">In upcoming posts I will discuss additional thoughts on further improving the reliability of communicating over the Internet.</span></span>Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-72896969967987537722015-05-13T06:02:00.005-07:002015-05-15T16:07:30.347-07:00Better than best effort — Reliability and the Internet.Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of connected users of the system. <br />
<br />
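Metcalfe's law can be made concrete with a quick back-of-envelope comparison; the user counts below are round illustrative figures, and the constant of proportionality is omitted.

```python
# Metcalfe's law: the value of a network grows with the square of the
# number of connected users. User counts are illustrative round numbers.

def metcalfe_value(users):
    return users * users  # value proportional to n^2 (constant factor omitted)

private_net = metcalfe_value(100_000)        # a very large private network
internet    = metcalfe_value(1_000_000_000)  # the Internet, order of magnitude

ratio = internet / private_net  # the Internet comes out ~10^8 times more valuable
```

Even with generous assumptions for the private network, the quadratic term makes the gap enormous, which is the sense in which the Internet's value is "immeasurably larger".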
Networks prior to the Internet were largely closed systems, and the cost of communicating was extraordinarily high. In those days, the free exchange of ideas at all levels was held back by cost. On the Internet, for a cost proportional to a desired amount of access bandwidth, one can communicate with a billion others. This has propelled human achievement forward over the last 20 years. By way of Metcalfe’s law, the Internet’s value is immeasurably larger than any private network ever will be.<br />
<br />
So why do large private service delivery networks still exist? <br />
<br />
The answer lies primarily in one word: reliability. What Metcalfe’s law doesn’t cover is the reliability of communication of connected users, and the implications of a lack of reliability on the value of services delivered. Although Internet reliability is improving, much like the highway system, it still faces certain challenges inherent with open and unbiased systems.<br />
<br />
On a well run private network, bandwidth and communications are regulated to deliver an optimal experience, and network issues are addressed more rapidly as all components are managed by a single operator. The Internet on the other hand is a network of networks wherein network operators do not have sufficient incentive to transport data for which they are not adequately compensated. <br />
<br />
Growing high-volume content services such as video streaming place unrelenting strain on the backbones and peering interfaces of Internet service providers. With network neutrality and the corresponding lack of QoS on the Internet, ISPs have to maintain significant backbone and peering capacity to ensure other communications continue to function in the presence of this high-volume traffic. However ISP operators have demonstrated that they are much more inclined to provide capacity to their direct customers than they are to those who are not on their network.<br />
<br />
Some large ISPs refuse to better manage their peering capacity yet they host a large number of end users. These end users are, in a qualitative sense, locked inside their ISP. This seems to be forcing large web and video content providers to buy capacity directly from the ISP of their mutual users in order to deliver content to them. Despite this latter trend, peering (and many backbone) links continue to be challenged. <br />
<br />
(Note: With some ISP there is a qualitative difference between business Internet service and consumer Internet service when it comes to backbone and peering capacity)<br />
<br />
For private enterprises that want to engage in business-to-business communications over the Internet, these “noisy neighbor” dynamics do not lend well to reliable service delivery and/or to cost management. It is cost prohibitive to buy Internet access from many ISPs for the purpose of B2B communications and, conversely, buying access from only a few ISPs puts the communication between two business entities on different ISPs at risk of being routed via a congested peering link. Unfortunately BGP does not take path quality into account when choosing paths.<br />
<br />
On the outside, it seems straightforward enough for businesses to continue to peer directly over private leased lines. However there is a trend that is putting pressure on this model. An increasing number of private businesses are leveraging an emerging landscape of SaaS and IaaS services. This is driving a general acceptance of the Internet as a primary medium for B2B communication. Private connectivity also comes with the baggage of added capital and operational costs. <br />
<br />
For many businesses, Internet-based B2B communication “as is” may be fine for SaaS services such as HR and billing where a temporary loss of service is survivable. But there is a class of services and communications that is too-important-to-fail for many businesses and even for larger ecosystems such as capital markets. Reliable infrastructure is a prerequisite for engaging in these services.<br />
<br />
The prevalent Internet access models fail to bring the Internet to a consistent level of reliability needed by many businesses. The support of B2B communication over the Internet needs to improve as more businesses adopt the Internet for their core business communication needs. Needless to say, I have my thoughts on how this should happen <a href="http://aldrinisaac.blogspot.com/2015/05/better-than-best-effort-rise-of-ixp.html" target="_blank">which I share on my next blog</a>.<br />
<br />
(Note: I have intentionally avoided DDoS and security related problems that also come with being on the Internet. I believe these can be better handled once the more fundamental problem with the plumbing is dealt with.)Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-12664925120202528382015-03-05T18:04:00.002-08:002015-03-09T20:13:16.944-07:00EVPN. The Essential Parts.In a <a href="http://aldrinisaac.blogspot.com/2013/10/the-bumpy-road-to-e-vpn.html" target="_blank">blog post back in October 2013</a> I said I would write about the essential parts of EVPN that make it a powerful foundation for data center network virtualization. Well just when you thought I'd fallen off the map, I'm back. :)<br />
<br />
After several years as an Internet draft, EVPN has finally emerged as <a href="https://t.co/TsnXV8uOWD" target="_blank">RFC7432</a>. To celebrate this occasion I created a presentation, <a href="http://goo.gl/lHHBiH" target="_blank">EVPN - The Essential Parts</a>. I hope that shedding more light on EVPN's internals will help make the decision to use (or to not use) EVPN easier for operators. If you are familiar with IEEE 802.1D (Bridging), IEEE 802.1Q (VLAN), IEEE 802.3ad (LAG), IETF <a href="https://tools.ietf.org/html/rfc4364" target="_blank">RFC4364</a> (IPVPN) and, to some degree, IETF <a href="https://tools.ietf.org/html/rfc6513" target="_blank">RFC6513</a> (NGMVPN) then understanding EVPN will come naturally to you. <br />
<br />
Use cases are intentionally left out of this presentation as I prefer the reader to creatively consider whether their own use cases can be supported with the basic features that I describe. The presentation also assumes that the reader has a decent understanding of overlay tunneling (MPLS, VXLAN, etc) since the use of tunneling for overlay network virtualization is not unique to EVPN.<br />
<br />
Let me know your thoughts below and I will try to expand/improve/correct this presentation or create other presentations to address them. You can also find me on Twitter at @aldrin_isaac.<br />
<br />
Here is the presentation again => <a href="http://goo.gl/lHHBiH" target="_blank">EVPN - The Essential Parts</a><br />
<br />
Update: I found this <a href="http://goo.gl/ZJNTVv" target="_blank">excellent presentation on EVPN</a> by Alastair Johnson that is a must read. I now have powerpoint envy :)Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com1tag:blogger.com,1999:blog-7503868445373916374.post-40697801138503187392013-11-05T21:15:00.001-08:002013-12-13T18:43:53.449-08:00Evolving the data center network core<div class="p1">
<div class="p1">
As complex functions that were historically shoehorned into the network core move out to the edge where they belong, data center core network operators can now focus on improving the experience for the different types of applications that share the network. Furthermore, with fewer vertically scaled systems giving way to many horizontally scaled systems, the economics of data center bandwidth and connectivity needs to change.</div>
<div class="p1">
<div class="p1">
<br />
I’ve jotted down some thoughts for improving the data center core network along the lines of adding bandwidth, managing congestion and keeping costs down.<br />
<br /></div>
</div>
<h3>
Solve bandwidth problems with more bandwidth</h3>
<div class="p2">
<span class="s1"></span></div>
<div class="p2">
<br /></div>
<div class="p2">
Adding bandwidth had been a challenge for the last several years owing to the Ethernet industry not being able to maintain the historical Ethernet uplink:downlink speedup of 10:1, and at the same time not bringing down the cost of Ethernet optics fast enough. Big web companies started to solve the uplink bandwidth speed problem in the same way they had solved the application scaling problem -- scale uplinks horizontally. In their approach, the role of traditional bridging is limited to the edge switch (if used at all), and load-balancing to the edge is done using simple IP ECMP across a “fabric” topology. The number of spine nodes on a spine-leaf fabric is constrained only by port counts and the number of simultaneous ECMP next hops supported by the hardware. By horizontally scaling uplinks, it became possible to create non-blocking or near non-blocking networks even when uplink port speeds are not the 10x of access they once were. Aggregating lower speed ports in this way also benefits from the ability to use lower-end switches at a lower cost.</div>
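The oversubscription arithmetic behind "horizontally scaled uplinks" is simple enough to sketch; the port counts and speeds below are hypothetical examples, not a recommendation for any particular switch.

```python
# Oversubscription arithmetic for a leaf switch in a spine-leaf fabric.
# Port counts and speeds are hypothetical, chosen for illustration.

def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of downlink to uplink capacity; 1.0 means non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 48 x 10GE server-facing ports with 12 x 40GE uplinks spread across
# spine switches: 480G down / 480G up, i.e. a non-blocking leaf.
ratio = oversubscription(48, 10, 12, 40)
```

Halving the uplinks (6 x 40GE) would give a 2:1 ratio, which is how operators trade cost against the "near non-blocking" designs mentioned above.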
<div class="p2">
<span class="s1"></span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">Using ECMP does come with its own demons. Flow-based hashing isn’t very good when the number of next hops isn’t large. This leads to imperfect load balancing which results in imperfect bandwidth utilization and nonuniform experience for flows in the presence of larger (elephant) flows. To address this issue, some large web companies look to use up to 64 ECMP paths to increase the efficiency of ECMP flow placement across the backbone. However even with the best ECMP flow placement, it's better that uplink port speeds are faster than downlink speeds to avoid the inevitable sharing of an uplink by elephant flows from multiple downstream ports.</span><br />
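A small simulation shows why flow-based hashing misbehaves with few next hops: every byte of a flow follows the link its hash picked, so two elephant flows can land on the same uplink. The flow mix and the stand-in "hash" (a seeded random placement) are invented for illustration.

```python
import random
from collections import defaultdict

# Toy ECMP: each flow is hashed once to a next hop, so all of its bytes
# follow a single uplink. Flow sizes and the hash are illustrative only.

def ecmp_link_loads(flows, num_links, seed=0):
    """Place each flow on one link and return the per-link byte totals."""
    rnd = random.Random(seed)
    placement = {}          # stable per-flow placement, like a 5-tuple hash
    loads = defaultdict(int)
    for flow_id, size in flows:
        if flow_id not in placement:
            placement[flow_id] = rnd.randrange(num_links)
        loads[placement[flow_id]] += size
    return [loads[i] for i in range(num_links)]

# Two elephants plus a hundred mice across only 4 uplinks: whichever
# link draws an elephant carries roughly half the total traffic.
flows = [("elephant-1", 1000), ("elephant-2", 1000)] + [
    (f"mouse-{i}", 1) for i in range(100)
]
loads = ecmp_link_loads(flows, num_links=4)
```

With more next hops (the 64-way ECMP mentioned above) the elephants are more likely to spread out, but nothing in per-flow hashing guarantees it.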
<span class="s1"><br /></span>
<span class="s1">Yet another approach goes back to the use of chassis-based switches with internal Clos fabrics -- most chassis backplanes do a better job of load balancing, in some cases by slicing packets into smaller chunks at the source card before sending them in parallel across multiple fabric links and reassembling the packet at the destination card.</span></div>
<div class="p1">
<br /></div>
<h3>
Managing Congestion</h3>
<div class="p1">
<br /></div>
<div class="p1">
Although the best way to solve a bandwidth problem is with more bandwidth, even with a non-blocking fabric, network congestion will still occur. An example is with events that trigger multiple (N) hosts to send data simultaneously to a single host. Assuming all hosts have the same link speed, the sum of the traffic rates of the senders is N times the speed of the receiver and will result in congestion at the receiver’s port. This problem is common in scale-out processing models such as MapReduce. In other cases, congestion is not a result of a distributed algorithm, but competing elephant flows that by nature attempt to consume as much bandwidth as the network will allow.</div>
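The incast arithmetic above is worth making explicit; the sender count and link speed below are example figures only.

```python
# Incast arithmetic: N hosts answering a scatter-gather request at the
# same instant overload the single receiver port. Figures are examples.

def incast_overload(num_senders, link_gbps):
    """Aggregate offered load (Gb/s) and the oversubscription factor."""
    offered = num_senders * link_gbps
    return offered, offered / link_gbps  # the factor is simply N

# 20 workers at 10 Gb/s each reply to one 10 Gb/s receiver:
offered, factor = incast_overload(num_senders=20, link_gbps=10)
# 200 Gb/s offered to a 10 Gb/s port -- 19/20ths must queue or drop.
```

No amount of fabric bandwidth fixes this, since the bottleneck is the receiver's own access link; it has to be handled with queuing and scheduling instead.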
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">Once the usual trick of applying as much bandwidth as the checkbook will allow has been performed, it’s time for some other tools to come into play -- class-based weighted queuing, bufferbloat eliminating AQM techniques and more flexible flow placement.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><i><b>Separate traffic types:</b></i></span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">The standard technique of managing fairness across traffic types -- placing them in different weighted queues -- is extremely powerful and sadly underutilized in the data center network. One reason for the underuse is the woefully deficient queuing in many merchant-silicon-based switches.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">The ideal switch would support a reasonable number of queues, each with sufficient, dedicated and capped depth. My preference is 8 egress queues per port, each able to hold approximately 20 jumbo frames on a 10GE link. Bursty applications may need a bit more depth for their queues, and applications with fewer flows can often get by with less. Proper use of queuing and scheduling ensures that bandwidth-hungry flows do not starve other flows. Using DSCP to classify traffic into traffic classes is fairly easy and can be done at the hosts before packets hit the network. It is also important to implement the same queuing discipline at the hosts (on Linux, using the ‘tc’ tool) as exists in the network, to ensure the same behavior end-to-end from source computer to destination computer. </span><br />
<span class="s1"><br /></span>
<span class="s1">One thing to watch out for is that on VoQ-based switches with deep buffers, the effective queue depth of an egress port servicing data from multiple ingress ports will</span>, in the worst case, be the sum of the ingress queue depths, which may actually be too much.</div>
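The sizing suggested above is easy to sanity-check. Assuming 9000-byte jumbo frames, 20 frames per queue works out to 180 KB, which a 10GE link drains in 144 µs -- a tolerable worst-case queuing delay bound:

```python
JUMBO_FRAME_BYTES = 9000  # assumed jumbo MTU

def drain_time_us(depth_bytes, link_gbps):
    """Worst-case delay a full queue adds while draining at line rate."""
    return depth_bytes * 8 / (link_gbps * 1e9) * 1e6

depth = 20 * JUMBO_FRAME_BYTES   # 180,000 bytes per queue
per_port = 8 * depth             # 1,440,000 bytes across 8 queues
assert abs(drain_time_us(depth, 10) - 144.0) < 1e-6
```

So the whole 8-queue recommendation costs under 1.5 MB of dedicated buffer per 10GE port, which is what makes the "sufficient, dedicated and capped" requirement realistic.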
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><i><b>Address bufferbloat:</b></i></span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">Once elephants are inside their own properly scheduled queue, they have little effect on mice in another queue. However, elephants in the same queue begin to affect each other negatively. The reason is that TCP by nature attempts to use as much bandwidth as the network will give it, and out-of-the-box TCP only knows it’s time to ease off the accelerator when packets start to drop. The problem is that in a buffered network those buffers must fill before packets drop, so a perpetually full buffer adds delay while absorbing no bursts -- effectively a network with no buffers. </span>When multiple elephant flows exhaust the same buffer pool in an attempt to find their bandwidth ceiling, their resulting performance is also less than the bandwidth actually available to them. <br />
<span class="s1"><br /></span>
<span class="s1">In an ideal world, elephant flows could discover their bandwidth ceiling without exhausting buffers. </span>The good news is that newer AQM techniques like CoDel and PIE, in combination with ECN-enabled TCP, work together to maximize TCP performance without needlessly exhausting network buffers. The bad news is that I don’t know of any switches that yet implement these bufferbloat management techniques. This is an area where hardware vendors have room to improve.</div>
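Since the switch hardware is missing, the behavior can only be illustrated in software. Below is a deliberately simplified Python sketch of CoDel's control law as described by Nichols and Jacobson -- a real implementation (and marking with ECN rather than dropping) involves more subtlety. The 5 ms target and 100 ms interval are CoDel's suggested defaults:

```python
import math

TARGET_MS = 5.0      # acceptable standing queue delay (CoDel default)
INTERVAL_MS = 100.0  # how long delay must persist before dropping

class CoDelSketch:
    """Greatly simplified sketch of CoDel's drop decision."""

    def __init__(self):
        self.first_above = None  # when sojourn time first exceeded target
        self.next_drop = None    # earliest time of the next allowed drop
        self.count = 0           # drops in the current dropping episode

    def should_drop(self, now_ms, sojourn_ms):
        if sojourn_ms < TARGET_MS:
            # Standing queue has drained; leave the dropping state.
            self.first_above = self.next_drop = None
            self.count = 0
            return False
        if self.first_above is None:
            # Delay just crossed the target; start the grace interval.
            self.first_above = now_ms
            return False
        if now_ms - self.first_above < INTERVAL_MS:
            return False
        # Persistent delay: drop now, then space subsequent drops
        # progressively closer together (interval / sqrt(count)).
        if self.next_drop is None or now_ms >= self.next_drop:
            self.count += 1
            self.next_drop = now_ms + INTERVAL_MS / math.sqrt(self.count)
            return True
        return False
```

The key property is that transient bursts (delay above target for less than one interval) are never punished; only a *standing* queue triggers drops, which is exactly the bufferbloat signature.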
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b><i>Class-aware load balancing:</i></b></span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">The idea behind class-aware load balancing across a leaf-spine fabric is to effectively create different transit rails for elephants and for mice. In class-aware load balancing, priorities are assigned to traffic classes on different transit links, such that traffic of a given class is forwarded only over the links with the highest priority for that class, with the ability to fall back to other links when necessary. Using class-aware load balancing, more links can be prioritized for elephants during low-mice windows and fewer during high-mice windows. </span>Other interesting possibilities also exist.</div>
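The selection rule can be sketched in a few lines of Python. The uplink names and the priority table below are hypothetical, purely to show the "highest priority wins, others are fallback" behavior:

```python
# Hypothetical per-uplink priorities for each traffic class; higher wins.
LINK_CLASS_PRIORITY = {
    "uplink1": {"elephant": 10, "mice": 5},
    "uplink2": {"elephant": 10, "mice": 5},
    "uplink3": {"elephant": 5,  "mice": 10},
    "uplink4": {"elephant": 5,  "mice": 10},
}

def links_for_class(traffic_class, up_links):
    """Return the live links with the highest priority for a class;
    lower-priority links serve as fallback when those are all down."""
    best = max(LINK_CLASS_PRIORITY[l][traffic_class] for l in up_links)
    return [l for l in up_links
            if LINK_CLASS_PRIORITY[l][traffic_class] == best]

all_up = ["uplink1", "uplink2", "uplink3", "uplink4"]
assert links_for_class("elephant", all_up) == ["uplink1", "uplink2"]
# If the elephant rails fail, elephants fall back to the mice rails.
assert links_for_class("elephant", ["uplink3", "uplink4"]) == ["uplink3", "uplink4"]
```

Shifting the split between elephant and mice rails over the day is then just a matter of rewriting the priority table.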
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b><i>Spread elephant traffic more evenly:</i></b></span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">Even after separating traffic types into different queues, applying bufferbloat-busting AQM, and using class-aware load balancing, there is still the matter of hot and cold spots created by flow-based load balancing of elephant flows. Ideally, every link in an ECMP link group would be evenly loaded. This could be achieved easily with per-packet load balancing (or hashing on the IP ID field), but given the varying sizes of packets, the resulting out-of-sequence packets can have a performance impact on the receiving computer. There are a few ways to tackle this -- (1) Enable per-packet load balancing only on the elephant classes. Here we trade receiver CPU for bandwidth efficiency, but only for receivers of elephant flows. We can reduce the impact on the receiver by using jumbo frames, and since elephant flows are generally far fewer than mice, the aggregate CPU impact across all nodes is modest. (2) Use adaptive load balancing on the elephant classes. In adaptive load balancing, the router samples traffic on its interfaces and selectively places flows on links to even out load. This consumes FIB space, but since elephant flows are few, spending some FIB space on better placement of elephant flows is worth it.</span></div>
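Option (2) can be sketched as a least-loaded placement decision. Real implementations sample actual link utilization in hardware, but the idea is the same; the uplink names and rates below are illustrative:

```python
def place_elephant(flow_rate_gbps, link_load_gbps):
    """Pin a newly detected elephant flow to the least-loaded uplink,
    consuming one FIB entry per pinned flow."""
    link = min(link_load_gbps, key=link_load_gbps.get)
    link_load_gbps[link] += flow_rate_gbps
    return link

load = {"uplink1": 6.0, "uplink2": 2.0, "uplink3": 4.0}
assert place_elephant(3.0, load) == "uplink2"  # least loaded at 2.0
assert place_elephant(3.0, load) == "uplink3"  # uplink2 is now at 5.0
```

Each pinned elephant occupies one FIB entry, which is why the "elephants are few" observation matters: the cost stays small while the hottest links cool down.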
<div class="p1">
<br />
Update: Another very promising way to spread elephant traffic more evenly across transit links is MPTCP (see <a href="http://www.youtube.com/watch?v=02nBaaIoFWU">http://www.youtube.com/watch?v=02nBaaIoFWU</a>). MPTCP can split an elephant connection across many subflows that are then hashed by routers and switches across multiple paths; MPTCP then shifts traffic between subflows to achieve the best throughput, moving data transmission from less performant subflows to more performant ones.<br />
<br />
<h3>
Keeping costs down</h3>
</div>
<div class="p2">
<span class="s1"></span></div>
<div class="p1">
<br /></div>
<div class="p1">
The dense connectivity requirement created by the shift from vertically scaled compute and storage to horizontally scaled approaches has put tremendous new cost pressure on the data center network. The relative cost of the network in a scale-out data center is rising dramatically as the cost of compute and storage falls, because the cost of network equipment is not falling in proportion. Network teams are left with the choice of building increasingly oversubscribed networks or dipping into the budget share intended for compute and storage. Neither is acceptable. <br />
<br />
The inability of OEMs to respond to this has fueled the recent success of merchant-silicon-based switches. The trade-off is their limited FIB scaling compared with OEM switches, which tend to offer better FIB scaling and QoS capability but at a higher cost point. </div>
<div class="p2">
<span class="s1"></span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">Adopting switches with reduced FIB capacity can make some folks nervous. Chances are, however, that some "little" changes in network design can make it possible to use these switches without sacrificing scalability. One way to reduce the need for high FIB capacity in a stub fabric is not to send external routes into the fabric, but only a default route; the stub fabric then needs to carry specific routes only for endpoint addresses inside the fabric. Addressing the stub fabric from a single large prefix, which is advertised out, also reduces the FIB load on the transit fabric and enables the use of commodity switches there as well. For multicast, using PIM bidir instead of PIM DM or SM/SSM will also significantly reduce the multicast FIB requirements. Using overlay network virtualization also dramatically reduces the need for large FIB tables when endpoint addresses must be mobile within or beyond the stub fabric -- but make sure your core switches can hash on the overlay payload headers or you will lose ECMP hashing entropy. [ Note: some UDP or STT-based overlay solutions manipulate the source port of the overlay header to improve hashing entropy on transit links. ]</span></div>
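The source-port trick in the note above is simple to sketch: hash the inner flow's 5-tuple into the outer UDP source port, so transit routers that hash only on outer headers still spread tunneled flows. The 49152-65535 ephemeral range is the conventional choice; the hash function here is illustrative, not what any particular encapsulation mandates:

```python
import zlib

def outer_source_port(src_ip, dst_ip, proto, sport, dport):
    """Derive the overlay header's UDP source port from the inner flow,
    restoring per-flow ECMP entropy that transit switches cannot see."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return 49152 + (zlib.crc32(key) % 16384)

port = outer_source_port("10.0.0.1", "10.0.1.1", 6, 40000, 80)
# Deterministic per inner flow, so all packets of a tunneled flow still
# take one path; different inner flows tend to get different ports.
assert 49152 <= port <= 65535
```

Because the outer 5-tuple now varies per inner flow, even core switches with no overlay awareness regain useful ECMP spreading.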
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">Besides the cost of custom OEM switches, OEMs have also fattened their pocketbooks on bloated optical transceiver prices. As network OEMs have directed optics spend into their coffers, customers have been forced to spend even more money on expensive switches with rich traffic management capabilities and tools in order to play whack-a-mole with congestion problems in the data center network. As customers have lost faith in their incumbent network equipment vendors, a market for commodity network equipment and optics has begun to evolve. The prices of optics have fallen so sharply that, if you try hard enough, it is possible to get high-quality 10GE transceivers for under $100 and 40GE for the low hundreds. Previously, populating a switch with optics cost a multiple of the price of the switch itself; now the optics can cost much less than the switch. Furthermore, with the emergence of silicon photonics I believe we will also see switches with on-board 10GE single-mode optics by 2016. With luck, we’ll see TX and RX over a single strand of single-mode fiber -- I believe this is what the market should be aiming for.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">As the core network gets less expensive, costs may refuse to come down for some data center network operators as they shift to other parts of the network. The licensing costs associated with software-based virtual network functions will be one of the main culprits. In the case of network virtualization, one way to avoid this cost is to leverage ToR-based overlays. In some cases, ToR-based network virtualization comes at no additional cost. If other network functions like routing, firewalling and load balancing are still performed on physical devices (and you like it that way), then "passive mode" MVRP between hypervisor and ToR switch, in combination with a ToR-based overlay, will enable high-performance autonomic network virtualization. The use of <a href="http://www.pistoncloud.com/2013/10/junipers-new-datacenter-architecture/" target="_blank">MVRP</a> as a UNI to autonomically trigger VLAN creation and attachment between hypervisor switch and ToR switch is already a working model, available on Juniper QFabric and OpenStack (courtesy of Piston Cloud and a sponsor). [ Note: ToR-based network virtualization does not preclude the use of other hypervisor-based network functions such as firewalling and load-balancing ]</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">All said, however, the biggest driver of cost is closed, vertically integrated solutions that create lock-in and deprive operators of choice. Open standards and open platforms give operators the freedom to choose network equipment by speed, latency, queuing, high availability, port density, and similar properties without being locked in to a single vendor. Lock-in leads to poor competition and ultimately to higher costs and slower innovation, as we have already witnessed.</span></div>
<div class="p2">
<br />
Wrapping up here, I’ve touched on some existing technologies and approaches, and some that aren’t yet available but are sorely needed to build a simple, robust and cost-effective modern data center core network. If you have your own thoughts on simple and elegant ways to build the data center network fabric of tomorrow, please share them in the comments.</div>
</div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com2tag:blogger.com,1999:blog-7503868445373916374.post-28380071275498811032013-10-22T14:29:00.001-07:002013-10-23T20:17:47.059-07:00The real Slim Shady<div class="p1">
<span class="s1">Historically when an application team needed compute and storage resources they would kick off a workflow that pulled in several teams to design, procure and deploy the required infrastructure (compute, storage & network). The whole process generally took a few months from request to delivery of the infrastructure. </span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">The reason for this onerous approach was really that application groups generally dictated their choice of compute technology. Since most applications scaled vertically, the systems and storage scaled likewise. When the application needed more horsepower, it was addressed with bigger, more powerful computers and faster storage technology. The hardware for the request was then staged, followed by a less-than-optimal migration to the new hardware. </span><br />
<span class="s1"><span class="s1"><br /></span></span>
<span class="s1"><span class="s1">The subtlety that gets lost regarding server virtualization is that a virtualization cluster is built on [near] identical hardware. The first machines to be virtualized were the ones whose compute and storage requirements could be met by the hardware the cluster was based on. These tended to be the applications that were not vertically scaled. The business-critical, vertically scaled applications continued to demand special treatment, driving the overall infrastructure deployment model used by the enterprise.</span></span></div>
<div class="p2">
<span class="s1"></span><br />
<div class="p1">
<span class="s1">The data center of the past is littered with technology of varying kinds. In such an environment, technology idiosyncrasies change faster than the ability to automate them -- hence the need for operators at consoles with a library of manuals. Yesterday’s data center network was correspondingly built to cater to this technology consumption model. </span>A significant part of the cost of procuring and operating the infrastructure was due to this diversity. Obviously, meaningful change would not be possible without addressing this fundamental problem.</div>
<br /></div>
<div class="p1">
<span class="s1">Large web operators had to solve the issue of horizontal scale-out several years ahead of the enterprise and essentially paved the way for the horizontal approach to application scaling. HPC had been using the scale-out model before the web, but the platform technology was not consumable by the common enterprise. As enterprises began to leverage web-driven technology as the platform for their applications, they gained its side benefits, one of which is horizontal scale-out. </span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">With the ability to scale horizontally it was now possible to break an application into smaller pieces that could run across smaller “commodity” compute hardware. Along with this came the ability to build homogeneous easily scaled compute pools that could meet the growth needs of horizontally scaling applications simply by adding more nodes to the pool. The infrastructure delivery model shifted from reactive application-driven custom dedicated infrastructure to a proactive capacity-driven infrastructure-pool model. In the latter model, capacity is added to the pool when it runs low. Applications are entitled to pool resources based on a “purchased” quota.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">When homogeneity was driven into the infrastructure, it became possible to build out the physical infrastructure in groups of units. Many companies are now consuming prefabricated racks with computers that are prewired to a top-of-rack switch, and even prefabricated containers. When the prefabricated rack arrives, it is taken to its designated spot on the computer room floor and power and network uplinks are connected. In some cases the rack brings itself online within minutes with the help of a provisioning station.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">As applications transitioned to horizontal scaling models and physical infrastructure could be built out in large homogeneous pools, some problems remained. In a perfect world, applications would be inherently secure and would be deployed to compute nodes based on availability of CPU and memory, without the need for virtualization of any kind. In this world, the network and server would be very simple. The reality is, on the one side, that application dependencies on shared libraries do not allow them to co-exist with an application that needs a different version of those libraries. This, among other things, forces the need for server virtualization. On the other side, since today’s applications are not inherently secure, they depend on the network to create virtual sandboxes and enforce rules within and between these sandboxes. Hence the need for network virtualization.</span></div>
<div class="p2">
<br />
<span class="s1"></span></div>
<div class="p1">
<span class="s1">Although server and network virtualization have the spotlight, the real revolution in the data center is simple, homogeneous, easily scalable physical resource pools and applications that can use them effectively. Let's not lose sight of that.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p2">
<br />
<span class="s1"></span></div>
<div class="p2">
<span class="s1"></span></div>
<div class="p2">
<span class="s1"></span></div>
<div class="p1">
<span class="s1">[Improvements in platform software will secure applications and allow them to co-exist on the same logical machine within logical containers, significantly reducing the need for virtualization technologies in many environments. <a href="http://www.linuxplumbersconf.org/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf"><span class="s2">This is already happening.</span></a>]</span></div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com0tag:blogger.com,1999:blog-7503868445373916374.post-87588317841158042013-10-13T21:23:00.001-07:002018-12-28T11:11:30.652-08:00The bumpy road to EVPNIn 2004 we were in the planning phase of building a new data center to replace one we had outgrown. The challenge was to build a network that continued to cater to a diverse range of data center applications and yet deliver significantly improved value.<br />
<br />
Each operational domain tends to have one or more optimization problem whose solution is less than optimal for another domain. In an environment where compute and storage equipment come in varying shapes and capabilities and with varying power and cooling demands, the data center space optimization problem does not line up with the power distribution and cooling problem, the switch and storage utilization problem, or the need to minimize shared risk for an application, to name a few.<br />
<br />
The reality of the time was that the application, backed by its business counterparts, generally had the last word -- good or bad. If an application group felt they needed a server that was as large as a commercial refrigerator and emitted enough heat to keep a small town warm, that's what they got, if they could produce the dollars for it. Application software and hardware infrastructure as a whole was the bastard of a hundred independent self-proclaimed project managers, and in the end someone else paid the piper.<br />
<br />
When it came to moving applications into the new data center, the first ask of the network was to allow application servers to retain their IP address. Eventually most applications moved away from a dependence on static IP addresses, but many continued to depend on middle boxes to manage all aspects of access control and security (among other things). The growing need for security-related middle-boxes combined with their operational model and costs continued to put pressure on the data center network to provide complex bandaids.<br />
<br />
A solid software infrastructure layer (aka PaaS) addresses most of the problems that firewalls, load-balancers and stretchy VLANs are used for, but this was unrealistic for most shops in 2005. Stretchy VLANs were needed to make it easier on adjacent operational domains -- space, power, cooling, security, and a thousand storage and software idiosyncrasies. And there was little alternative but to deliver stretchy VLANs using a fragile toolkit. With much of the structured cabling and chassis switches giving way to data center grade pizza box switches, STP was coming undone. [Funnily the conversation about making better software infrastructure continues to be overshadowed by a continued conversation about stretchy VLANs.]<br />
<br />
Around 2007 the network merchants who gave us spanning tree came around again peddling various flavors of TRILL and lossless Ethernet. We ended up on this evolutionary dead end for mainly misguided reasons. In my opinion, it was an unfortunate misstep that set the clock back on real progress. I have much to say on this topic but I might go off the deep end if I start.<br />
<br />
Prior to taking on the additional responsibility to develop our DC core networks, I was responsible for the development of our global WAN where we had had a great deal of success building scalable multi-service multi-tenant networks. The toolkit to build amazingly scalable, interoperable multi-vendor multi-tenancy already existed -- it did not need to be reinvented. So between 2005 and 2007 I sought out technology leaders from our primary network vendors, Juniper and Cisco, to see if they would be open to supporting an effort to create a routed Ethernet solution suitable for the data center based on the same framework as BGP/MPLS-based IPVPNs. I made no progress. <br />
<br />
It was around 2007 when Pradeep stopped in to share his vision for what became Juniper's QFabric. I shared with him my own vision for the data center -- to make the DC network a more natural extension of the WAN and based on common toolkit. Pradeep connected me to Quaizar Vohra and ultimately to Rahul Agrawal. Rahul and I discussed the requirements for a routed Ethernet for the data center based on MP-BGP and out of these conversations MAC-VPN was born. At about the same time Ali Sajassi at Cisco was exploring routed VPLS to address hard-to-solve problems with flood-and-learn VPLS, such as multi-active multi-homing. With pressure from yours truly to make MAC-VPN a collaborative industry effort, Juniper reached out to Cisco in 2010 and the union of <a href="https://tools.ietf.org/html/draft-raggarwa-mac-vpn-01" target="_blank">MAC-VPN</a> and <a href="https://tools.ietf.org/html/draft-sajassi-l2vpn-rvpls-bgp-01" target="_blank">R-VPLS</a> produced EVPN, a truly flexible and scalable foundation for Ethernet-based network virtualization for both data center and WAN. EVPN evolved farther with the contributions from great folks at Alcatel, Nuage, Ericsson, Verizon, Huawei, AT&T and others.<br />
<br />
<a href="http://tools.ietf.org/html/draft-ietf-l2vpn-evpn-04" target="_blank">EVPN</a> and a few key enhancement drafts (such as <a href="http://tools.ietf.org/html/draft-sd-l2vpn-evpn-overlay-01" target="_blank">draft-sd-l2vpn-evpn-overlay</a>, <a href="https://tools.ietf.org/html/draft-sajassi-l2vpn-evpn-inter-subnet-forwarding-02" target="_blank">draft-sajassi-l2vpn-evpn-inter-subnet-forwarding</a>, <a href="http://tools.ietf.org/html/draft-rabadan-l2vpn-evpn-prefix-advertisement-00" target="_blank">draft-rabadan-l2vpn-evpn-prefix-advertisement</a>) combine to form a powerful, open and simple solution for network virtualization in the data center. With the support added for VXLAN, EVPN builds on the momentum of VXLAN. IP-based transport tunnels solve a number of usability issues for data center operators including the ability to operate transparently over the top of a service provider network and optimizations such as multi-homing with "local bias" forwarding. The other enhancement drafts describe how EVPN can be used to natively and efficiently support distributed inter-subnet routing and service chaining, etc.<br />
<br />
In the context of SDN we speak of "network applications" that work on top of the controller to implement a network service. EVPN is a distributed network application, that works on top of MP-BGP. EVPN procedures are open and can be implemented by anyone with the benefit of interoperation with other compliant EVPN implementations (think federation). EVPN can also co-exist synergistically with other MP-BGP based network applications such as IPVPN, NG-MVPN and others. A few major vendors already have data center virtualization solutions that leverage EVPN.<br />
<div>
<br /></div>
I hope to produce a blog entry or so to describe the essential parts of EVPN that make it a powerful foundation for data center network virtualization. Stay tuned.Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com7tag:blogger.com,1999:blog-7503868445373916374.post-76188148257378041092013-07-04T20:22:00.000-07:002014-05-23T19:58:02.210-07:00Air as a service<span style="font-family: Verdana, sans-serif;">Have you ever wondered about air? We all share the same air. We know it's vitally important to us. If we safeguard it we all benefit and if we pollute it we all suffer. But we don't want to have to think about it every time we take a breath. That's the beauty of air. Elegantly simple and always there for you.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">Imagine air as a service (AaaS), one where you need to specify the volume of air, the quality of the air, etc., before you could have some to breathe. As much as some folks might be delighted by the possibility of capitalizing on that, it would not be the right consumption model for such a fundamental resource. If we had to spend time and resources worrying about the air we breathe, we'd have less time and resources to do other things, like make dinner.</span><br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://static.ddmcdn.com/gif/indoor-air-pollution-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://static.ddmcdn.com/gif/indoor-air-pollution-1.jpg" height="223" width="320" /></a></div>
<br />
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">Why does air as it is work so well for us? I think it's for these reasons, (1) there is enough of it to go around and (2) reasonable people (the majority) take measures to ensure that the air supply is not jeopardized.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">Network bandwidth and transport should be more like how we want air to be. The user of network bandwidth and transport (citizen) should not have to think about these elemental services of the network other than to be a conscientious user of this shared resource. The operator of the network (government) should ensure that the network is robust enough to meet these needs of network users. Furthermore the operator should protect network users from improper and unfair usage without making the lives of all network users difficult, or expecting users to know the inner workings of the network in order to use it.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">The past is littered with the carcasses of attempts by network vendors and operators to force network-level APIs and other complexity on the network user. Remember the ATM NIC? Those who forget the past are doomed to repeat its failures and fail to benefit from its successes.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">What the average network user wants is to get the elemental services of the network without effort, like breathing air. So don't make it complicated for the network user -- just make sure there's enough bandwidth to go around and get those packets to where they need to go.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<div style="text-align: justify;">
</div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com2tag:blogger.com,1999:blog-7503868445373916374.post-45880269563989830922013-06-24T21:21:00.000-07:002013-07-05T07:42:54.217-07:00Regarding scale-out network virtualization in the enterprise data center<span style="font-family: Verdana, sans-serif;">There's been quite a lot of discussion regarding the benefits of scale-out network virtualization. In this blog I present some additional thoughts to ponder regarding the value of network virtualization in the enterprise DC. As with any technology options, the question that enterprise network operators need to ask themselves regarding scale-out network virtualization is whether it is the right solution to the problems they need to address.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">To know whether scale-out network virtualization in the enterprise DC is the answer to the problem, we need to understand the problem in a holistic sense. Let's set aside our desire to recreate the networks of the past (vlans, subnets, etc, etc) in a new virtual space, and with an open mind ask ourselves some basic questions.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Clearly at a high level, enterprises wish to reduce costs and increase business agility. To reduce costs it's pretty obvious that enterprises need to maximize asset utilization. But what specific changes should enterprise IT bring about to maximize asset utilization and bring about safe business agility? This question ought to be answered in the context of the full sum of changes to the IT model necessary to gain all the benefits of the scale-out cloud.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Should agility come in the form of PaaS or stop short at IaaS? Should the individual machine matter? Should a service be tightly coupled with an instance of a machine, or rather should the service exist as data and application independent of any machine (physical or virtual)? </span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">In a scale-out cloud, the platform [software] infrastructure directs the spin up and down of application instances relative to demand. The platform infrastructure also spins up application capacity when capacity is lost due to physical failures. Furthermore, the platform software infrastructure ensures that services are only provided to authorized applications and users and secured as required by data policy. VLANs, subnets and IP addresses don't matter to scale-out cloud applications. Clearly network virtualization isn't a requirement for a well designed scale-out PaaS cloud. (Multi-tenant IaaS clouds do have a very real need for scale-out network virtualization)***</span><br />
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">So why does scale-out network virtualization matter to the "single-tenant" enterprise? Here's two reasons why I believe enterprises might believe they need it, and two reasons why I think maybe they don't need it for those reasons.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">
Network virtualization for VM migration.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The problem in enterprises is that a dynamic platform layer such as I describe above isn't quite achievable yet because, unlike the Google's of the world, enterprises generally purchase most of their software from third parties and have legacy software that does not conform to any common platform controls. Many of the applications that enterprises use maintain complex state in memory that if lost can be disruptive to critical business services. Hence, the closest an enterprise can do these days to attain cloud status, is IaaS -- i.e. take your physical machines and turn them into virtual machines. Given this dependence on third party applications, dynamic bursting and that sort of true cloudy capabilities aren't universally applicable in the back end of an enterprise DC.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The popularity of vmotion in the enterprise is testament to the current need to preserve the running state of applications. VM migration is primarily used for two reasons -- (1) to improve overall performance by non-disruptively redistributing running VMs to even out loads and (2) to non-disruptively move virtual machines away from physical equipment that is scheduled for maintenance. This is different from scale-out cloud applications where virtual machines would not be moved, but rather new service instances spun up and others spun down to address both cases.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">We all know that for vmotion to work, the migrated VM needs to retain the same IP and MAC address of it's prior self. Clearly if the VM migration were limited to only a subset of the available compute assets this will lead to underutilization and hence higher costs. If a VM should be migrated to any available compute node (assuming again that retaining IP and MAC is a requirement), the requirement would appear to be scale-out network virtualization.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">
Network virtualization for maintaining traditional security constructs.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">As I mentioned before, a scale-out PaaS cloud enforces application security beginning with a trusted registration process. Some platforms require the registration of schemas that application are then forced to conform to when communicating with another application. This isn't a practical option for consumers of off-the-shelf applications. But clearly, not enforcing some measure of expected behavior between endpoints isn't a very safe bet either.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The classical approach to network security has been to create subnets and place firewalls at the top of them. A driving reason for this design is that implementing security at the end-station wasn't considered very secure since an application vulnerability could allow malware on a compromised system to override local security. This drove the need for security to be imposed at the nearest point in the network that was less likely to be compromised rather than at the end station.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">When traditional firewall based security is coupled with traditional LANs, a virtual machine is limited to only the compute nodes that are physically connected to that LAN and so we end up with the underutilization of the available compute assets that are on other LANs. However if rather than traditional LANs, we instead use scale-out LAN virtualization, then the firewall (physical or virtual) could be wherever the firewall is, and the VMs that the firewall secures can be on any compute node. Nice.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">
So it seems we need scale-out network virtualization for vmotion and security...</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Actually we don't -- not if we don't have a need for non-IP protocols. Contrary to what some folks might believe, VM migration doesn't require that a VM remains in the same subnet -- it requires that a VM retains it's IP and MAC address which is easily done using <u>host-specific routes</u> (IPv4 /32 or IPv6 /128 routes). Using host-specific routing a VM migration would require that the new hypervisor advertise the VM's host route (initially with a lower metric) and the original hypervisor withdraw it when the VM is suspended.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">So now that we don't need a scale-out virtual LAN for vmotion, that leaves the matter of the firewall. The ideal place to implement security is at the north and south perimeters of your network. As I mentioned earlier security inside the host (the true south) can be defeated and hence the subnet firewall (the compromise). But with the advent of the hypervisor, there is now a trusted security enforcement point at the edge. We can now implement firewall security right at the vNIC of the VM (the "south" perimeter). When coupled with perimeter security at the points where communication lines connect your DC to the outside world (the "north" perimeter), you don't need scale-out virtual LANs to support traditional subnet firewalls either. It's debatable whether additional firewall security is required at intermediate points between these two <u>secured</u> perimeters -- my view is that they are not unless you have a low degree of confidence in your security operations. There is a tradeoff to intermediate security which comes at the expense of reduced bandwidth efficiency, increased complexity and higher costs to name a few.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The use of host-specific routing combined with firewall security applied at the vNIC is evident in LAN virtualization solutions that support distributed integrated routing and bridging capability (aka IRB or distributed inter-subnet routing). The only way to perform fully distributed shortest-path routing with free and flexible VM placement, is to use host-based routing. The dilemma then is where to impose firewall security -- at the vNIC of course!!</span><br />
<div>
<br /></div>
<span style="font-family: Verdana, sans-serif;">Although we don't absolutely need network virtualization for either VM migration or to support traditional subnet firewalls, there is one really good problem that overlay based networking helps with, and that is <u>scaling</u>. Merchant silicon and other lower priced switches don't support a lot of hardware forwarding entries. This means that your core network fabric might not have enough room in it's hardware forwarding tables to support a very large number of VMs. Overlays solve this issue by only requiring the core network fabric to support about as many forwarding entries as there are switches in the fabric (assuming one access subnet per leaf switch). However even in this case per my reasoning in the prior three paragraphs, for a single-tenant enterprise the overlay network should only need to support a single tenant instance and hence would be used for dealing with the scaling limitations of hardware and not for network virtualization. </span><br />
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">
Building a network-virtualization-free enterprise IaaS cloud.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">There's probably a couple of ways to achieve vmotion and segmentation without network virtualization and with very little bridging. Below is one way to do this using only IP. The following approach does not leverage overlays and so each fabric can only support as many hosts as the size of the hardware switch L3 forwarding table.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">(1) Build an IP-based core network fabric using switches that have adequate memory and compute power to process fabric-local host routes. The switch L3 forwarding table size reflects the number of VMs you intend to support in a single instance of this fabric design. Host routes should only be carried in BGP. You can use the classic BGP-IGP design or for a BGP-only fabric you might consider <a href="http://tools.ietf.org/html/draft-lapukhov-bgp-routing-large-dc-04">draft-lapukhov-bgp-routing-large-dc</a>. Assign the fabric a prefix that is large enough to provide addresses to all the VMs you expect your fabric to support. This fabric prefix will be advertised out of and a default route advertised in to the fabric for inter-fabric and other external communication.</span><br />
<span style="font-family: Verdana, sans-serif;">(2) Put a high performance virtual router in your hypervisor image that will get a network facing IP via DHCP and is scripted to automatically BGP peer with it's default gateway which will be the ToR switch. </span><span style="font-family: Verdana, sans-serif;">The ToR switch should be configured to accept incoming BGP connections from any vrouter that is on it's local subnet. </span><span style="font-family: Verdana, sans-serif;">The vrouter will advertise host routes of local VMs via BGP and for outbound forwarding will use it's default route to the ToR.</span><br />
<span style="font-family: Verdana, sans-serif;">(3) To bring up a VM on a hypervisor, your CMS should create an unnumbered interface on the vrouter, attach the vNIC of the VM to it and create a host route to the VM which should be advertised via BGP. The reverse should happen when the VM is removed. This concludes the forwarding aspect of the design.</span><br />
<span style="font-family: Verdana, sans-serif;">(4) This next step handles the firewall aspect of the design. Use a bump-in-the-wire firewall like Juniper's vGW to perform targeted class-based security at the vNIC level. If you prefer to apply ACLs on the VM facing port of the vrouter, then you should carve out prefixes for different roles from the fabric's assigned address space to make writing the ACLs a bit easier.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Newer hardware switches support 64K and higher L3 forwarding entries and also come with more than enough compute and memory to handle the task so it's reasonable to achieve upward of 32K VMs per fabric. Further scaling is achieved by having multiple of these fabrics (each with their own dedicated maskable address block) interconnected via an inter-connect fabric, however VM migration should be limited to a single fabric. But if you prefer to go with the overlay approach to achieve the greater scaling, replace BGP to the ToR with MP-BGP to two fabric route reflectors for VPNv4 route exchange. When provisioning a VM-facing interface on the vrouter you'll need to place it into a VRF and import/export a route target.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">I've left out some details for the network engineers among you to exercise your creativity (and avoid making this blog as long as my first one) -- why should Openstack hackers have all the fun? :)</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Btw, native multicast should work just fine using this approach. Best of all, you can't be locked in to a single vendor.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">If you believe you need scale-out overlay network virtualization consider using one that is based on an open standard such as <a href="http://tools.ietf.org/rfc/rfc4364.txt">IPVPN</a> or <a href="http://tools.ietf.org/id/draft-ietf-l2vpn-evpn-03.txt">E-VPN</a>. The latter does not require MPLS as some might believe and supports efficient inter-subnet routing via this <a href="http://tools.ietf.org/id/draft-sajassi-l2vpn-evpn-inter-subnet-forwarding-01.txt">draft</a> which I believe will be accepted as a standard eventually. Both support native multicast and both are or will be supported by at least three or more vendors eventually with full inter-operability. I'm hopeful that my friends in the centralized camp will some day see the value of also using and contributing to open control-plane and management-plane standards.</span></div>
Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com4tag:blogger.com,1999:blog-7503868445373916374.post-21372870865744943232013-06-09T20:25:00.000-07:002013-06-30T19:52:12.469-07:00Angry SDN hipsters.<span style="font-family: Verdana, sans-serif;">Some folks seem to get a little too hung up on one philosophy or another -- too blind to see good in any other form except the notions that have evolved in their mind. I'm hoping I'm not one of them. I do have opinions, but I believe they are rational.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The counter culture of networking waves the SDN banner. That acronym seems to belong to them. They don't know what it stands for yet, but one thing they seem to be sure of is that nothing good can come by allowing networking innovations to evolve or even to exist in their birthplace.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The way I see evolving the network fabric is through improving on the best of the past. Every profession I know from medicine, finance, law, mathematics, physics, you name it -- all of them are building their tomorrow on a mountain of past knowledge and experience. So I'm sure my feeling the same about the network doesn't make me outdated, just maybe not a fashionable SDN hipster.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj30d8jvFy4QaeGk-dLdjGpU16tFOxZElb7BikdHmJMEQe9QMEMPRJ2sbFiVbl3cihyphenhyphenhqLvXfRADkFxe-fL_na3Uz5E1I60XjncuC8CcL8d_ar5BNOBugLe_JWzjEZU2P4Elu_RTsx4CSE/s1600/hipsters.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Verdana, sans-serif;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj30d8jvFy4QaeGk-dLdjGpU16tFOxZElb7BikdHmJMEQe9QMEMPRJ2sbFiVbl3cihyphenhyphenhqLvXfRADkFxe-fL_na3Uz5E1I60XjncuC8CcL8d_ar5BNOBugLe_JWzjEZU2P4Elu_RTsx4CSE/s320/hipsters.jpg" width="320" /></span></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Some angry SDN hipsters say that the core network needs to be dumbed down. They must have had a "bad childhood," technically speaking. One too many Cisco 6500's stuffed with firewalls, load balancers and other things that didn't belong there. Maybe even a few with a computer or two crammed into them. I'm not sure I can feel sorry for you if that was your experience. Maybe you didn't realize that was a bad idea until it was too late. Maybe you were too naive and didn't know how to strike the right balance in your core network. Whatever it was, I can assure you that your experience isn't universal, and neither is your opinions about how tomorrow should or shouldn't be.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Those who couldn't figure out how to manage complexity yesterday won't be able to figure it out tomorrow. Tomorrow will come and soon become yesterday and they'll still be out there searching. Endlessly. Never realizing that the problem wasn't so much the network, it was them and the next big company that they put their trust in.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">I had a great experience building great networks. I stayed away from companies that didn't give me what I needed to get the job done right. The network was a heck of a lot easier to manage than computers in my day, and the technology has kept pace in almost every aspect. You see Amazon and Google aren't the only ones that can build great infrastructure. And some of us don't need help from VMWare thank you.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">So mister angry SDN hipster, do us all a favor and don't keep proposing to throw the baby out with the bath water. We know your pain and see your vision too, but ours might not be so narrow.</span>Aldrinhttp://www.blogger.com/profile/15493370358037866116noreply@blogger.com2tag:blogger.com,1999:blog-7503868445373916374.post-15302303244784129822013-06-03T10:58:00.002-07:002013-06-12T15:36:56.098-07:00Straight talk about the enterprise data center network.<br />
<div class="p1">
<div class="p1">
<span class="s1" style="font-family: Verdana, sans-serif;">Building a mission critical system requires the careful selection of constituent technologies and solutions based on their ability to support the goals of the overall system. We do this regardless of whether we subscribe to one approach or another of building such a system.</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
</div>
<span style="font-family: Verdana, sans-serif;">It is well known that the network technologies commonly used today to build a data center network have not adequately met the needs of applications and operators. However, what is being drowned out in the media storm around "SDN" is that many of the current day challenges in the data center network can be addressed within an existing and familiar toolkit. The vision of SDN should be beyond where today’s data center networks could have been yesterday.</span></div>
<div class="p2">
<span class="s1" style="font-family: Verdana, sans-serif;"></span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;">In this "treatise" I highlight what I believe has been lacking in todays data center core network toolkit as well as address certain misconceptions. </span><span style="font-family: Verdana, sans-serif;">I'll begin by listing key aspects of a robust network, followed by perspectives on each. </span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">A robust network should be evolved in all of the following aspects:</span></div>
<ol class="ol1">
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Modularity</u> - freedom to choose component solutions based on factors, such as bandwidth, latency, ports, cost, serviceability, etc., This generally requires the network to be solution and vendor-neutral as no single solution or vendor satisfies all requirements. Management and control-plane applications are not excluded from this requirement.</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Automation</u> - promotes the definition of robust network services, automated instantiation of these network services, full cycle management of network services and of other physical and logical network entities, and api-based integration into larger service offerings.</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Operations</u> - functional simplicity and transparency (not a complicated black box), ease of finding engineering and operations talent and ease of building or buying robust software to transparently operate it.</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Flexibility</u> - any port (physical or virtual) should support any network service (“tenancy”, service policy, etc). This property implies that the network can support multiple coexisting services while still meeting end user performance and other experience expectations.</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Scalability</u> - adding capacity (bandwidth, ports, etc) to the network should be trivial and achievable without incremental risk. </span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Availability</u> - through survivability, rapid self-healing and unsusceptibility.</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Connectivity</u> - in the form of a simple, robust and consistent way to federate network fabrics and internetwork. </span></li>
<li class="li1"><span style="font-family: Verdana, sans-serif;"><u>Cost</u> -- since inflated costs inhibit innovation.</span></li>
</ol>
<div>
<span style="font-family: Verdana, sans-serif;">This blog post is about the location-based connective fabric over which higher-layer network services and applications operate. I might talk about "chaining in" conversation-based network services (such as stateful firewalling, traffic monitoring, load-balancing, etc) another time.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div class="p1">
<h4>
<span style="font-family: Verdana, sans-serif;">On modularity.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Real modularity frees you to select components for your system that best meet it's requirements, without the constraint of unnecessary lock in. In today’s network, the part of the network that can make or break this form of modularity are generally the control-plane and data-plane protocols.</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span class="s1" style="font-family: Verdana, sans-serif;">Network protocols are like language. Proprietary protocols have a similar impact to networking as languages to civilization, which is that language silos hinder a connected world from forming. It took the global acceptance of English to enable globalization -- for the world to be more accessible and opportunities not constrained by language.</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span class="s1" style="font-family: Verdana, sans-serif;">The alphabet soup of proprietary and half-way control-plane protocols we’ve had forced on us has resulted in mind-bending complexity and has become a drag on the overall advancement of the network. Each vendor claims their proprietary protocol is better than the other guys, but we all know that proprietary protocols are primarily a tool to lock customers out of choice. </span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span class="s1" style="font-family: Verdana, sans-serif;">Based on evidence, it’s reasonable to believe that robust open protocols and consistent robust implementations of these would address the modularity requirement. We can see this success in the data-plane protocols of TCP/IP and basic Ethernet.</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><span class="s1">Many folks see the world of network standards as a pile of RFCs related to BGP, IGPs, and other network information exchange protocols. What they don’t see is that many RFCs actually describe </span><span class="s2"><u>network applications</u> (sound familiar?)</span><span class="s1">. For example, the IETF RFC 4364 (</span></span><span style="font-family: Verdana, sans-serif;">"BGP/MPLS IP Virtual Private Networks") </span><span style="font-family: Verdana, sans-serif;">describes the procedures and data exchange required to implement a network application used for creating IP-based virtual private networks. It describes how to do this in a distributed way using MPBGP for NLRI exchange (control-plane) and MPLS for data plane -- in other words it does not attempt to reinvent what it can readily use. Likewise there are RFCs that describe other applications such as Ethernet VPNs, pseudo-wire service, traffic engineering, etc. </span><br />
<span class="s1" style="font-family: Verdana, sans-serif;"><br /></span>
<span class="s1" style="font-family: Verdana, sans-serif;">Openflow extends standardization to the API of the data plane elements, but it is still only a single chapter in the modularity story. Putting a proprietary network application on top of Openflow is damaging to Openflow's goal of openness since a system is only as open as the least open part of it. </span><br />
<span style="font-family: Verdana, sans-serif;"><span class="s1"><br /></span>
<span class="s1">The modularity of any closed system eventually breaks down.</span></span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">On automation.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">Achieving good network automation has been more of a challenge for some operators than for others. If you haven't crossed this mountain then it's hard, but if you're over the top already then it's a lot easier. Based on my experience the challenges with automating today’s networks are concentrated in a couple of places.</span></div>
<ol class="ol1">
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Low-level configuration schema rather than service-oriented schema.</u> This is fairly obvious when you look at configuration schema of common network operating systems. To pull together a network service such as an IPVPN, many operators need a good "cookbook" that provides a "recipe" which describes how to combine a little bit of these statements with a little bit of those seemingly unrelated statements, and that also hopefully warns you of hidden dangers in often undocumented default settings. <br />
</span></li>
<li class="li1"><span style="font-family: Verdana, sans-serif;"><span class="s1"><u>Different configuration languages to describe the same service and service parameters.</u> The problem also extends to the presentation of status and performance information. There are very large multi-vendor networks today that seamlessly support L2 and L3 VPNs, QoS, and all that other good stuff across the network -- this is made possible by the use of common control and data plane protocols. However it is quite a challenge to build a provisioning library that has to speak in multiple tongues (ex: IOS, JunOS, etc), and keep up with changes in schema (or presentation). This problem highlights the </span><span class="s2">need for not only common data plane and control plane languages, but also for common management plane languages</span><span class="s1">.<br />
</span></span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Inability to independently manage one instance of a service from another.</u> An ideal implementation would allow configuration data related to one service instance to be tagged so that services can be easily retired or modified without affecting other services that may depend on the same underlying resources. Even where Openflow is involved, Openflow rules that are common to different service instances need to be managed in a manner that prevents the removal of these common rules when an instance of a service that shares that common rule is removed -- with Openflow, this forces the need for a central all-knowing controller to compile the vectors for the entire fabric and communicate the diffs to data plane elements.<br />
</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Lack of robust messaging toolkit to capture and forward meaningful event data.</u> Significant efficiencies can be achieved when network devices capture events and deliver them over a reliable messaging service to a provisioning or monitoring system (syslog does not qualify). For example [simplistically] if a switch could capture LLDP information regarding a neighbor and send that to a provisioning station along with information about itself then the provisioning system could autonomically configure the points of service without the need for an operator to drive the provisioning.<br />
</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;">I<u>nability to atomically commit </u>all the statements related to the creation, modification, deletion and rollback of a service instance as a single transaction. Some vendors have done a better job than others in this area.<br />
</span></li>
<li class="li1"><span class="s1" style="font-family: Verdana, sans-serif;"><u>Excessively long commit times</u> (on platforms where commits are supported). Commits should take milliseconds and not 10’s of seconds to be considered API quality.</span></li>
</ol>
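<span style="font-family: Verdana, sans-serif;">To make points (1) through (3) above concrete, here is a hypothetical sketch of what a service-oriented schema might look like: one record of intent, tagged with a service id so it can be retired atomically, rendered into more than one vendor dialect. The dialect strings and field names are invented for illustration, not real vendor syntax.</span>

```python
# Hypothetical service-oriented record (points 1-3): one statement of
# intent, tagged with an id, rendered into per-vendor configuration.
service = {
    "id": "svc-ipvpn-0042",          # tag enables atomic modify/retire
    "type": "ipvpn",
    "route_target": "65000:42",
    "sites": [{"device": "pe1", "port": "ge-0/0/1", "vlan": 100}],
}

def render(svc, dialect):
    """Render the same service intent into an invented vendor dialect."""
    site = svc["sites"][0]
    if dialect == "junos-like":
        return (f"routing-instances {svc['id']} {{ instance-type vrf; "
                f"interface {site['port']}.{site['vlan']}; "
                f"vrf-target target:{svc['route_target']}; }}")
    if dialect == "ios-like":
        return (f"vrf definition {svc['id']}\n"
                f" route-target both {svc['route_target']}")
    raise ValueError(f"unknown dialect: {dialect}")

print(render(service, "ios-like"))
```

<span style="font-family: Verdana, sans-serif;">The provisioning library speaks many tongues, but the schema it exposes northbound is a single one -- which is the management-plane commonality argued for above.</span>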
<div class="p1">
<span class="s1" style="font-family: Verdana, sans-serif;">Interestingly the existence of these issues gives impetus to the belief by some that Openflow will address them by effectively getting rid of them. Openflow is to today's CLI as Assembly Language is to Python. With all the proprietary extensions tacked on to the side of most implementations of Openflow the situation is hardly any better than crummy old CLI, except now you need a couple of great software developers with academic levels of network knowledge and experience (a rare breed) to engineer a network.</span><br />
<span class="s1" style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;"><span class="s1">On the other hand, for the sake of automation, buying a proprietary centralized network application that uses Openflow to speak to data plane elements isn't necessarily an ideal choice either. These proprietary network applications may support a northbound interface based on a common service schema (management plane) and issue Openflow-based directives to data plane elements, </span>but implement proprietary procedures internally -- a black box. </span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">On operations.</span></h4>
<span class="s1" style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;"><span class="s1">Relying on hidden procedures that sit at the heart of the connectivity of your enterprise isn't, in my opinion, a good thing. </span><span class="s1">Today's network professionals believe that control-plane and data-plane transparency are essential to running a mission critical network, since that transparency is what makes it possible to identify, contain and resolve issues. The management plane, on the other hand, is considered important for the rapid enablement of services but, in most cases, is less of a concern in relation to business continuity. </span>Some perspectives on SDN espouse fixing issues at the management plane at the expense of <a href="http://raddata.blogspot.com/2013/01/capping-routed-and-sdn-networks.html">disrupting</a> and obscuring the control plane. Indeed some new SDN products don’t even use Openflow.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Vendors and others who believe that enterprises are better off not understanding the glue that holds their systems together are mistaken. Banks need bankers, justice needs lawyers, the army needs generals and mission critical networks need network professionals. One could define APIs to drive any of these areas and have programmers write the software, but a lack of domain expertise invites failure. </span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">When I first read the TRILL specification I was pretty baffled. The network industry had already created most of the basic building blocks needed to put STP-based protocols out of their miserable existence, and yet it needed to invent another protocol that would bring us only halfway to where we needed the DC network to be. The first thing that crossed my mind when I came out of the sleep induced by reading that document was the challenge of managing all the different kinds of wheels created by endless reinvention. Trying to holistically run a system with wheels of different sizes and shapes, all trying to do effectively the same thing yet in different ways, is mind numbing and counter-productive. My reaction was to do something about this -- hence <a href="http://tools.ietf.org/id/draft-ietf-l2vpn-evpn-03.txt">E-VPN</a>, which enables scalable Ethernet DCVPN based on an existing, proven wheel.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Domain expertise is built on transparency, and success is proportional to how non-repetitive, minimal, structured and technically sound your choice of technologies is -- and you'll be better off if your technologists can take the reins when things are heading south.</span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">On flexibility.</span></h4>
</div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><span class="s1"><br /></span>
<span class="s1">Anyone who has built a large data center network in the past 15 years is familiar with the tradeoffs imposed by spanning tree protocols and Ethernet flooding. The tradeoff for a network that optimized for physical space (i.e. any server anywhere) was a larger fault domain, while the tradeoff for networks that optimized for high availability was restrictions on the placement of systems (or a spaghetti of patches). When the DC started to fill up, things got ugly. Things were not much better with regard to multi-tenancy at the IP layer of the data center network either.</span></span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span class="s1" style="font-family: Verdana, sans-serif;">As I mentioned before, some of the same vendors in the DC network space had already created the building blocks for constructing scalable and flexible networks for service providers. But they kept these technologies out of the hands of DC network operators -- the DC business made profits on volume while the WAN business made its dollars on premiums for feature-rich equipment. If network vendors introduced lower cost DC equipment with the same features as the WAN equipment they risked harming their premium WAN business. Often the vendor teams that engineered DC equipment were different from those that engineered WAN equipment and they did not always work well together, if at all.</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span class="s1" style="font-family: Verdana, sans-serif;">WAN technology such as MPLS was advertised as being too complex. Having built both large WAN and DC networks I'll admit that I've had a harder time building a good DC network than building a very large and flexible MPLS WAN. The DC network often “got me”, while the WAN technology was far more predictable. But instead of giving us flexible DCVPN based on robust, scalable and established technologies we instead were given proprietary flavors of TRILL for the DC. The DC network was essentially turf protected with walls made of substandard proprietary protocols. The good news is that all that is changing -- our vendors have known for quite some time that WAN technology is indeed good for the DC and artificial lines need not be drawn between the two.</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span><br /></span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><span class="s1">A flexible DC network allows any service to be enabled on any network port (physical or virtual). One could opt to achieve this flexibility using network software running in hypervisors under the control of centralized SDN controllers. This model might be fine for some environments where compute virtualization is pervasive and if good techniques to reduce risk are employed. Most enterprise DC environments on the other hand will continue to have “bare metal” servers which will be networked over physical ports. "Physicalization" remains strong and certainly PaaS in an enterprise DC does not necessarily require a conventional hypervisor. Some environments may even need to facilitate the extension of a virtual network to other local or remote devices where it will not be possible to impose a hypervisor. </span>Ideally the DC network would not need to be partitioned into parts that are interconnected by specialized gateway choke points. The goal of reducing network complexity and increasing flexibility can't be achieved without eliminating gateways and other service nodes where they only get in the way.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">On scalability.</span></h4>
<span class="s1" style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;"><span class="s1">The glue that holds together a network is the control-plane of the network -- its job is to efficiently and survivably distribute information about the position and context of destinations on a network and determine the best paths via which to reach them. </span>The larger the network, the more the details of the control plane matter. As I alluded to before, spanning tree protocols showed up on the stage with seams already unraveled (this is why smart service providers avoid them like the plague).</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span>
The choice of control plane matters significantly in the scaling of a network that needs to provide seamless end-to-end services. A good network control plane combined with its proper implementation enables the efficient and reliable distribution of forwarding information from the largest and most expensive equipment on the network to the smallest and cheapest ones. </span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">A good control plane takes a divide and conquer approach to scaling. As an example, given a set of possible paths to a destination, BGP speakers advertise to their peers only the best paths they have chosen based on their position in the network. This approach saves the sender from transmitting more data than necessary and the receiver from having to store and process it. Another scaling technique, used by some BGP-based network applications, is to have network nodes explicitly subscribe to only the routing information that is relevant to them. Scaling features are indeed available in good standards-based distributed control planes and are not unique to proprietary centralized ones.</span><br />
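The best-path behavior described above can be sketched in a few lines: a speaker holds several paths per destination but passes along only the one it selects. This sketch models just two steps of the real BGP decision process (highest LOCAL_PREF, then shortest AS path); the path attributes shown are a simplified subset, not a full RIB entry.

```python
# Simplified sketch of BGP best-path selection: a speaker holds several
# paths to a destination but advertises only the one it chooses. Only
# two steps of the real decision process are modeled here: highest
# LOCAL_PREF wins, with ties broken by shortest AS_PATH.

def best_path(paths):
    # min() over (-local_pref, AS-path length): higher preference first,
    # then the shorter AS path among equally preferred candidates.
    return min(paths, key=lambda p: (-p["local_pref"], len(p["as_path"])))

paths_to_dest = [
    {"next_hop": "10.0.0.1", "local_pref": 100, "as_path": [65001, 65010]},
    {"next_hop": "10.0.0.2", "local_pref": 200, "as_path": [65002, 65020, 65030]},
    {"next_hop": "10.0.0.3", "local_pref": 200, "as_path": [65003]},
]

# Peers receive one advertisement for this destination, not three.
advertised = best_path(paths_to_dest)
```

Sending one path instead of three is exactly the divide-and-conquer effect: each speaker's position in the network prunes the information its neighbors must carry.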
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">One could argue that a central controller can do a better job since it has full visibility into all the information in the system. However a central controller can only support a domain of limited size, more or less what an instance of an IGP can support. The benefits are therefore tempered when building a seamless network of significant scale, since doing so necessitates multiple controller domains that share summarized information with each other. As you step back, you begin to see why [scalable and open] network protocols matter.</span><br />
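The summarization step between controller domains mentioned above can be illustrated with Python's standard `ipaddress` module: before one domain shares reachability with a neighbor, it collapses its specific prefixes into covering aggregates. The prefixes are made-up examples.

```python
# Sketch of inter-domain summarization: a controller domain collapses its
# specific prefixes into aggregates before sharing reachability with a
# neighboring domain, reducing what the peer must store and process.
import ipaddress

domain_prefixes = [
    ipaddress.ip_network("10.1.0.0/24"),
    ipaddress.ip_network("10.1.1.0/24"),
    ipaddress.ip_network("10.1.2.0/24"),
    ipaddress.ip_network("10.1.3.0/24"),
]

# Four contiguous /24s advertise outward as a single /22.
summary = list(ipaddress.collapse_addresses(domain_prefixes))
```

The neighboring domain carries one route instead of four; the detail stays local to the domain that needs it.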
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">In addition to the network control plane, open standards have allowed further scaling in other ways, such as by enabling one service domain to be cleanly layered over another service provider domain without customer and service provider having to know the details of the layer above or below.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">There are other properties that tend to be inherent in scalable network technology and its implementations. Bandwidth scaling, for example, (1) avoids the need for traffic to move back and forth across the network as it makes its way between two endpoints, (2) avoids multiple copies of the same packet on any single link of the network (whether intra- or inter-subnet), and (3) makes it possible for different traffic types to share the same physical wires fairly, with minimal state and complexity. Scalable network technology is also fairly easy to understand and debug (but I said that already).</span><br />
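Property (2) above is easy to see with a little arithmetic: delivering one packet to N receivers by head-end (ingress) replication puts N copies on the source's uplink, while tree-based replication puts at most one copy on any link, duplicating only at fan-out points downstream. The numbers here are illustrative.

```python
# Worked example of property (2): delivering one multicast packet to N
# receivers. Head-end replication duplicates the packet at the source,
# so the first link carries N copies; tree-based replication duplicates
# at downstream fan-out points, so every link carries at most one copy.

def ingress_uplink_copies(receivers):
    # the source sends one unicast copy per receiver over its uplink
    return receivers

def tree_uplink_copies(receivers):
    # replication happens downstream, so the uplink sees a single copy
    return 1 if receivers > 0 else 0

n = 48
head_end = ingress_uplink_copies(n)   # 48 copies cross the source's uplink
tree = tree_uplink_copies(n)          # a single copy, replicated downstream
```

The gap widens linearly with fan-out, which is why this property matters for bandwidth scaling.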
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">On availability.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">In today's networks, each data plane node is co-resident with its own personal control-plane entity within the same sheet metal packaging, yet the communication between that control-plane entity and the coupled data plane entity is opaque. Some operators reduce the size of each box to reduce the unknowns related to the failure of any single box. The complexity inside the box doesn't matter so much, since the operator is able to limit the impact of its failure. What the operator can see is the control-plane dialog between boxes, which he understands. Now imagine that box is the size of your DC network, and the "inside" is still opaque. <span class="s1"></span></span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">In my experience the biggest risk to high availability is software. The quality of software tends to vary as the companies that produce it go through shifts and changes. Many operators have been on the receiving end of this dynamic lately. Even software that is intended to improve availability tends to be a factor in reducing it. To minimize the chance of getting hit hard by a software bug, many operators deploy new software into their network in phases (hopefully after first testing it in the lab). Any approach that requires an operator to cross his fingers and upgrade large chunks of the network, or god forbid the whole thing, is probably not suitable for a mission critical system. In my opinion, it is foolish to trust any vendor's software to be <a href="http://aws.amazon.com/message/65648/">bug free</a>. Building a mission critical system involves reducing software fault domains as much as it does reducing other single points of failure.</span><br />
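The phased-deployment practice described above can be sketched as a loop over growing batches that halts the moment a batch fails its health check, capping the fault domain at the batch that went bad. The batch sizes, device names and health check are all illustrative.

```python
# Sketch of a phased software rollout: new software is pushed to a small
# batch of devices first, and the rollout halts the moment a batch fails
# its health check, limiting the fault domain to that batch. Batch sizes
# and the health-check function are illustrative.

def phased_rollout(devices, batch_sizes, healthy):
    """Upgrade devices in growing batches; stop at the first unhealthy batch."""
    upgraded = []
    start = 0
    for size in batch_sizes:
        batch = devices[start:start + size]
        if not batch:
            break
        upgraded.extend(batch)          # this batch runs the new software
        if not all(healthy(d) for d in batch):
            return upgraded, False      # halt: damage capped at this batch
        start += size
    return upgraded, True

devices = [f"switch-{i}" for i in range(10)]
done, ok = phased_rollout(devices, [1, 3, 6], healthy=lambda d: True)
```

A bug caught in the first one-switch batch costs one switch, not the network; that asymmetry is the whole argument for phasing.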
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The most resilient networks would have two separate network "rails", with hosts that straddle both rails and host software that knows how to dynamically shift away from problems on either network. For the highest availability, the software running each rail would be different from the other, so that the two rails are not subject to the same bugs.</span><br />
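The dual-rail model above amounts to a small piece of host-side logic: probe both rails and steer traffic to a healthy one. This is a minimal sketch of that decision; the rail names and the health-probe results are hypothetical inputs.

```python
# Sketch of host-side dual-rail steering: a host straddles two
# independently built networks and shifts traffic away from a rail
# that fails its health probe. Rail names are illustrative.

def pick_rail(rail_health, preferred="rail-a"):
    """Return the preferred rail if healthy, else any healthy rail, else None."""
    if rail_health.get(preferred):
        return preferred
    for rail, is_healthy in rail_health.items():
        if is_healthy:
            return rail
    return None  # both rails down: nothing left to steer to

# Normal operation: both rails up, the host stays on its preferred rail.
active = pick_rail({"rail-a": True, "rail-b": True})
# rail-a fails its probe: the host shifts to rail-b without operator action.
fallback = pick_rail({"rail-a": False, "rail-b": True})
```

Because the shift decision lives on the host, neither rail needs to know about the other, which is what keeps the two software stacks independent.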
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">On connectivity.</span></h4>
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><span class="s1"></span>
<span class="s1"><br /></span>
<span class="s1">In order to scale out a network and reduce risk, we would most likely divide the network into smaller control-plane domains (let's call them fabrics) and then interconnect these fabrics. There are two main ways we could go about federating these fabrics to create a fabric of fabrics.</span></span><br />
<span style="font-family: Verdana, sans-serif;"><span class="s1"><br /></span>
<span class="s1">Gateways -- We might choose to interconnect the fabrics using gateways, but depending on the technology, gateways can damage the transparency between two communicating endpoints and/or create artificial choke points and burdensome complexity in the network. Gateways may be the way to go for public clouds, where tenants access their virtual networks, but in large enterprise data centers, fabric domains will more often be created for HA purposes and not for creating virtual network silos with high data-plane walls.</span></span><br />
<span style="font-family: Verdana, sans-serif;"><span class="s1"><br /></span>
<span class="s1">Seamless -- </span>In the ideal world, we should be able to create virtual networks that are not constrained to a single fabric. In order to achieve this, distributed controllers need to federate with each other using a common paradigm (let's say L2 VPN) and ideally using a common protocol language. This seems to take us back to the need for a solid distributed control plane protocol (we can't seem to get away from it!).</span><br />
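The seamless option above can be sketched as controllers that federate by exchanging reachability in one common format. The sketch below is loosely modeled on an L2 VPN MAC advertisement; the field names, fabric names and addresses are illustrative, not a wire format.

```python
# Sketch of fabric federation via a common advertisement format: when one
# fabric's controller learns a host, it announces a MAC route that every
# other fabric installs, so a virtual network can span fabrics without a
# gateway in the data path. Field names are illustrative, not a wire format.

def advertise(controller_tables, origin_fabric, vni, mac, nexthop):
    """One fabric announces a MAC; every other fabric installs the route."""
    route = {"vni": vni, "mac": mac, "nexthop": nexthop, "origin": origin_fabric}
    for fabric, table in controller_tables.items():
        if fabric != origin_fabric:
            table[(vni, mac)] = route

fabrics = {"fabric-1": {}, "fabric-2": {}}
advertise(fabrics, "fabric-1", 5001, "aa:bb:cc:00:00:01", "192.0.2.11")
# fabric-2 can now forward to the host directly, with no gateway hop.
```

Everything hinges on both controllers speaking the same route format, which is the "common protocol language" argument in the paragraph above.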
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">If seamless connectivity is not to be limited to a single fabric, then a good protocol will be the connectivity maker. That raises the question: why is a good protocol not good enough for the fabric itself? Hmm...</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">On cost.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">Health care in the United States is among the most expensive in the world, but if I were in need of serious medical attention I wouldn't want to be anywhere else. If health care were free, innovation would probably cease. Once patents expire and generics hit the market, access to life-saving medication becomes commoditized and more accessible. Similarly, when Broadcom brought Trident to the market and Intel came with Alta, building good data center networks became accessible to folks who weren't in big-margin businesses. However the commodity silicon vendors of the world aren't at the head of the innovation curve. Innovation isn't cheap, but it does need to be fair to both the innovator and the consumer.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The truth about network hardware cost is that it depends on what you need and where you are on the adoption curve. The other truth about cost is that customers have a role in driving down costs through the choices we make. As an example -- if I believe that routers should have 100GE coherent DWDM interface cards for DCI, then I put my dollars on the first vendor that brings a good implementation to market -- I resist alternatives, since settling defeats my goal of making the market. There may be a higher price to being a first mover, but once a competitive market is established prices will fall and we're all the better for it. Competition flourishes when sellers know they can be easily displaced -- hence, again, standards. The alternative is to spend your dollars on run-of-the-mill technology and cry about it.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">Fortunately, most data centers don't require the really fancy technology many vendors might have you believe they do (unless you love Fibre Channel). The problem is that no single vendor seems to have the right combination of technology to make an ideal network fabric. The network equipment we buy has a ton of features we probably didn't need (like DCB) yet pay dearly for, and not enough of the stuff we could really have used. In future blogs I hope to outline a small set of building-block technologies that I believe would enable simple, cheap and scalable data center fabrics. It just might be possible for DC network operators to have their cake and eat it too.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<br />
<h4>
<span style="font-family: Verdana, sans-serif;">In conclusion.</span></h4>
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">One of the greatest values of the SDN “gold rush” to the enterprise DC network is not necessarily the change of hands to a brand new way of networking, but how it has shone a light on the current dysfunction. SDN is giving network vendors a "run for the money" that will result in them closing long-standing gaps and bringing into the data center network the great standards-based technologies they've unnecessarily kept out of it.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">There is more to be done beyond using the fear of SDN to ensure that network vendors are making the best decisions on behalf of the rest of us -- decisions based on customer needs rather than on vendor politics and pure self-interest. The way standards bodies work today is driven more by the self-interest of each of its members than by what would be best for customers. On the other hand, it was technology buyers that drove the creation of standards bodies, because of how much worse things were prior to their existence. Costs can become unconstrained without competition, and once a closed solution is established, extracting it can be a long, costly and perilous journey.</span><br />
<div class="p1">
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;">Good standards bring structure to the picture, each good standard bringing one more foundation on which other systems can be efficiently built. When control and data plane are standards-based and properly layered then an Internet emerges. When the management plane is standardized it will reach more people faster.</span></div>
<div class="p2">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;">New and innovative network control applications indeed have value and create new out-of-the-box thinking that is good for the industry. On the other hand, that conversation should not be a reason for holding back progress along proven lines. </span><span style="font-family: Verdana, sans-serif;">Fixing the problems with today's data center networks should not require giving the boot to all of thirty years of packet network evolution. </span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span>
<span style="font-family: Verdana, sans-serif;">The market is agitating for network vendors to bring us solutions that work for us, not just for them. The message is loud and clear and the vendors that are listening will be successful. Vendors and customers that buy too much into the hype will lose big time. Vendors that don't respond to the challenge at hand will also find themselves heading towards the exit. </span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;">As much as it's been a bit tiring to search for gems of truth amidst the noise, it's also an exciting time for the data center network.</span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div class="p1">
<span style="font-family: Verdana, sans-serif;">My advice to buyers and implementors? Choose wisely.</span></div>
</div>
</div>
</div>