Friday, May 15, 2015

Better than best effort — Rise of the IXP

This is the second installment in my Better than Best Effort series.  In my last post I talked about how the incentive models on the Internet keep ISPs from managing peering and backbone capacity in a way that supports reliable communication in the face of the ever-growing volume of rich media content.  If you haven't done so, please read that post before you read this one.

It's clear that using an ISP for business communication comes with the perils associated with the "noisy neighbor" ebb and flow of high-volume consumer data movement.  Riding pipes that are intentionally run hot to keep costs down is a business model that works for ISPs, but not for business users of the Internet.  Even with business Internet service, customers may get a better grade of service within a portion of their ISP's network, but not when their data needs to traverse another ISP of which they are not a customer.  There is no consistent experience, for anyone.

However, there is an evolving solution for avoiding the never-ending battle between ISPs and large consumer content providers.  As the title of this post gives away, the solution is called an Internet eXchange Point (IXP).

IXPs are where the different networks that make up the Internet most often come together.  Without IXPs, the Internet would be separate islands of networks.  IXPs are, in a sense, the glue of the Internet.  From the early days of the Internet, IXPs have been used to simplify connectivity between ISPs, resulting in significant cost savings for them.  Without an IXP, an ISP would need to run separate lines and dedicate network ports for each peer ISP with which it connects.

However, IXPs and ISPs are not distant relatives.  They are in fact close cousins.  Here's why.

Both ISPs and IXPs share two fundamental properties.  The first is that they both have a fabric, and the second is that they both have "access" links used to connect customers so they can intercommunicate over this fabric.  The distinction between the two is in the nature of the access interfaces and the fabrics.  ISP fabrics are intended to reach customers spread out over a wide geographic area.  An IXP fabric, on the other hand, is fully housed within a single data center facility.  In some cases an IXP fabric is simply a single Ethernet switch.  ISP access links use technology needed to span neighborhoods, while IXP access links are basically ordinary Ethernet cables that run a few dozen meters on average.  So essentially the distinction between the two is that an ISP is a WAN and an IXP is a LAN.

The bulk of the cost of a WAN is in laying and maintaining wires over geographically long distances.  Correspondingly, the technology used at the ends of those wires is chosen for its ability to wring as much value out of those wires as possible.  The cost of a WAN is significantly higher than that of a LAN with a comparable number of access links, so ISPs must carefully manage costs that are much higher per byte.  It is on account of the tradeoffs that ISPs make to manage these costs that the Internet is often unpredictably unreliable.

So how can IXPs help?

Let's assume that most businesses begin to use IXPs as meet-me points.  Remember that the cost dynamics of operating an IXP are different from those of an ISP.  At each IXP these business customers can peer with one another and with their favorite ISPs for the cost of a LAN access link.



There are at least a few advantages to this over connecting via an ISP alone.  Firstly, two business entities communicating via an IXP's [non-blocking] LAN are effectively not sharing the capacity between them with entities that are not communicating with either of them, which makes their experience far more predictable.  This opens the door for them to save capital and operational costs by eliminating the private lines they may currently run to these other business entities.  Secondly, a business that is experiencing congestion to remote endpoints via an ISP can choose not to use that ISP by [effectively] disabling BGP routing with it.  This is different from the standard access model, in which, if there are problems downstream of one of their ISPs, a business's control is limited to the far fewer ISPs from whom it has purchased access capacity.

The following illustrates a scenario where congestion at a peering point downstream of one of the ISPs used by a business is affecting its ability to reach other offices, partners or customers that are hosted on other ISPs.



In the access model, since BGP cannot signal path quality, traffic is blindly steered over the path that traverses the fewest intermediate networks rather than the path with the best performance.  Buying extra access circuits alone to avoid Internet congestion is not a winnable game (more on this in the next part of this series).
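
To make this concrete, here is a hedged sketch in Python (the AS numbers and latency figures are invented for illustration) of how BGP-style best-path selection prefers the shortest AS path even when a longer path performs far better:

```python
# Hypothetical illustration: BGP prefers the shortest AS path, not the best-performing one.
# The AS numbers and latency figures below are invented for this example.

routes_to_partner = [
    {"via": "ISP-A", "as_path": [64500, 64510],        "measured_rtt_ms": 180},  # congested peering
    {"via": "ISP-B", "as_path": [64501, 64520, 64510], "measured_rtt_ms": 35},   # clean but longer
]

def bgp_best_path(routes):
    # Simplified BGP tie-breaking: the shortest AS path wins.
    # Measured performance is not an input to the decision.
    return min(routes, key=lambda r: len(r["as_path"]))

def performance_best_path(routes):
    # What a business actually wants: the path that performs best right now.
    return min(routes, key=lambda r: r["measured_rtt_ms"])

print("BGP chooses:        ", bgp_best_path(routes_to_partner)["via"])          # ISP-A (180 ms)
print("Performance chooses:", performance_best_path(routes_to_partner)["via"])  # ISP-B (35 ms)
```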

The alternative approach using an IXP would look something like the following.



This illustration shows how being at an IXP creates more direct access to more endpoints at a better price point than buying access lines to numerous ISPs.  You can also see how peering with other business entities locally at an IXP can improve reliability, reduce costs and simplify business-to-business connectivity by combining it with Internet connectivity.

There is an interesting trend occurring within the growing number of managed co-location data centers.  Many of these co-location data centers host IXPs, and some managed data center operators, such as Equinix, even operate their own IXPs at their facilities.  These data centers are an ideal place for businesses to connect with one another through IXPs without the downsides that come with using consumer-focused ISPs.

This is not to say that the operational capabilities at all IXPs are at the level needed to support large numbers of businesses.  There is work to be done to scale peering in a manner that gives customers minimal configuration burden and maximal control.

There will even be a need for business-focused ISPs that connect business customers at one IXP to business customers connected to the Internet at another IXP.  Although net neutrality prohibits the differentiated treatment of data over the Internet, it does not forbid an ISP or IXP from selecting the class of customer it chooses to serve.  This is much like the difference between a freeway and a parkway.  Parkways do not serve commercial traffic, and so in a way they offer a differentiated service to non-commercial traffic.

As the Internet enables new SaaS and IaaS providers to find success by avoiding the high entry cost of building a private service delivery network, more businesses are turning to the Internet to access their solution providers of choice.  The old Internet connectivity model cannot reliably support this growing business use of the Internet, so a better connectivity model is needed.  New opportunities await.

In upcoming posts I will discuss additional thoughts on further improving the reliability of communicating over the Internet.

Wednesday, May 13, 2015

Better than best effort — Reliability and the Internet.

Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of connected users of the system.

Networks prior to the Internet were largely closed systems, and the cost of communicating was extraordinarily high.  In those days, the free exchange of ideas at all levels was held back by cost.  On the Internet, for a cost proportional to a desired amount of access bandwidth, one can communicate with a billion others.  This has propelled human achievement forward over the last 20 years.  By way of Metcalfe's law, the Internet's value is immeasurably larger than that of any private network will ever be.

So why do large private service delivery networks still exist?

The answer lies primarily in one word: reliability.  What Metcalfe's law doesn't cover is the reliability of communication between connected users, and the implications of a lack of reliability for the value of the services delivered.  Although Internet reliability is improving, much like the highway system it still faces certain challenges inherent in open and unbiased systems.

On a well-run private network, bandwidth and communications are regulated to deliver an optimal experience, and network issues are addressed more rapidly because all components are managed by a single operator.  The Internet, on the other hand, is a network of networks in which network operators do not have sufficient incentive to transport data for which they are not adequately compensated.

Growing high-volume content services such as video streaming place unrelenting strain on the backbones and peering interfaces of Internet service providers.  With network neutrality and the corresponding lack of QoS on the Internet, ISPs have to maintain significant backbone and peering capacity to ensure that other communications continue to function in the presence of this high-volume traffic.  However, ISPs have demonstrated that they are much more inclined to provide capacity to their direct customers than to those who are not on their network.

Some large ISPs refuse to better manage their peering capacity even though they host a large number of end users.  These end users are, in a qualitative sense, locked inside their ISP.  This seems to be forcing large web and video content providers to buy capacity directly from the ISPs of their mutual users in order to deliver content to them.  Despite this latter trend, peering (and many backbone) links continue to be challenged.

(Note: With some ISPs there is a qualitative difference between business Internet service and consumer Internet service when it comes to backbone and peering capacity.)

For private enterprises that want to engage in business-to-business communications over the Internet, these "noisy neighbor" dynamics do not lend themselves to reliable service delivery or to cost management.  It is cost prohibitive to buy Internet access from many ISPs for the purpose of B2B communications and, conversely, buying access from only a few ISPs puts the communication between two business entities on different ISPs at risk of being routed via a congested peering link.  Unfortunately, BGP does not take path quality into account when choosing paths.

On the surface, it seems straightforward enough for businesses to continue to peer directly over private leased lines.  However, there is a trend putting pressure on this model.  An increasing number of private businesses are leveraging an emerging landscape of SaaS and IaaS services, which is driving a general acceptance of the Internet as a primary medium for B2B communication.  Private connectivity also comes with the baggage of added capital and operational costs.

For many businesses, Internet-based B2B communication "as is" may be fine for SaaS services such as HR and billing, where a temporary loss of service is survivable.  But there is a class of services and communications that is too important to fail for many businesses, and even for larger ecosystems such as capital markets.  Reliable infrastructure is a prerequisite for engaging in these services.

The prevalent Internet access models fail to bring the Internet to the consistent level of reliability that many businesses need.  The support for B2B communication over the Internet needs to improve as more businesses adopt the Internet for their core business communication needs.  Needless to say, I have my thoughts on how this should happen, which I share in my next post.

(Note: I have intentionally avoided the DDoS and security-related problems that also come with being on the Internet.  I believe these can be better handled once the more fundamental problem with the plumbing is dealt with.)

Thursday, March 5, 2015

EVPN. The Essential Parts.

In a blog post back in October 2013 I said I would write about the essential parts of EVPN that make it a powerful foundation for data center network virtualization.  Well just when you thought I'd fallen off the map, I'm back.  :)

After several years as an Internet draft, EVPN has finally emerged as RFC7432.  To celebrate this occasion I created a presentation, EVPN - The Essential Parts.  I hope that shedding more light on EVPN's internals will help make the decision to use (or to not use) EVPN easier for operators.  If you are familiar with IEEE 802.1D (Bridging), IEEE 802.1Q (VLAN), IEEE 802.3ad (LAG), IETF RFC4364 (IPVPN) and, to some degree, IETF RFC6513 (NGMVPN) then understanding EVPN will come naturally to you.

Use cases are intentionally left out of this presentation as I prefer the reader to creatively consider whether their own use cases can be supported with the basic features that I describe.  The presentation also assumes that the reader has a decent understanding of overlay tunneling (MPLS, VXLAN, etc) since the use of tunneling for overlay network virtualization is not unique to EVPN.

Let me know your thoughts below and I will try to expand/improve/correct this presentation or create other presentations to address them.  You can also find me on Twitter at @aldrin_isaac.

Here is the presentation again => EVPN - The Essential Parts

Update:  I found this excellent presentation on EVPN by Alastair Johnson that is a must-read.  I now have PowerPoint envy :)

Tuesday, November 5, 2013

Evolving the data center network core

As complex functions that were historically shoehorned into the network core move out to the edge where they belong, data center core network operators can now focus on improving the experience for the different types of applications that share the network.  Furthermore, as a few vertically scaled systems give way to many horizontally scaled ones, the economics of data center bandwidth and connectivity need to change.

I’ve jotted down some thoughts for improving the data center core network along the lines of adding bandwidth, managing congestion and keeping costs down.

Solve bandwidth problems with more bandwidth


Adding bandwidth has been a challenge for the last several years because the Ethernet industry has been unable to maintain the historical 10:1 uplink:downlink speedup while also failing to bring down the cost of Ethernet optics fast enough.  Big web companies started to solve the uplink bandwidth problem the same way they had solved the application scaling problem -- by scaling uplinks horizontally.  In their approach, the role of traditional bridging is limited to the edge switch (if used at all), and load balancing to the edge is done using simple IP ECMP across a "fabric" topology.  The number of spine nodes in a spine-leaf fabric is constrained only by port counts and the number of simultaneous ECMP next hops supported by the hardware.  By horizontally scaling uplinks, it became possible to create non-blocking or near non-blocking networks even when uplink port speeds are no longer 10x the access speed.  Aggregating lower-speed ports this way also makes it possible to use lower-end switches at a lower cost.
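
As a hedged back-of-the-envelope in Python (the port counts and speeds are assumptions, not a recommendation), the arithmetic behind horizontal uplink scaling looks like this:

```python
# Illustrative leaf-spine sizing arithmetic (assumed port counts and speeds).

def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Ratio of access bandwidth to uplink bandwidth on a leaf; 1.0 means non-blocking."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# A hypothetical 48 x 10GE leaf with 12 x 40GE uplinks, one uplink per spine:
# non-blocking even though the uplinks are only 4x the access speed, not 10x.
print(f"{oversubscription(48, 10, 12, 40):.2f}:1")   # 1.00:1

# With uplinks at the same speed as the access ports, you simply need more of them
# (and enough simultaneous ECMP next hops in hardware to actually use them all).
print(f"{oversubscription(48, 10, 16, 10):.2f}:1")   # 3.00:1 -- keep adding spines
```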

Using ECMP does come with its own demons.  Flow-based hashing isn't very good when the number of next hops isn't large, which leads to imperfect load balancing and, in turn, imperfect bandwidth utilization and a nonuniform experience for flows sharing links with larger (elephant) flows.  To address this, some large web companies use up to 64 ECMP paths to increase the efficiency of flow placement across the backbone.  However, even with the best ECMP flow placement, it's better for uplink port speeds to be faster than downlink speeds to avoid the inevitable sharing of an uplink by elephant flows from multiple downstream ports.
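
A toy simulation in Python (the flow counts and the crc32 stand-in hash are made up for illustration) shows why a handful of elephant flows hashed onto only a few next hops inevitably pile onto the same link, while more next hops merely improve the odds:

```python
# Toy simulation of hash-based ECMP flow placement (flow counts are made up).
import zlib

def worst_case_elephants(num_links, num_elephants=8):
    """Hash a handful of elephant flows onto ECMP links and report the worst pile-up."""
    counts = [0] * num_links
    for i in range(num_elephants):
        link = zlib.crc32(f"elephant-{i}".encode()) % num_links  # stand-in for a 5-tuple hash
        counts[link] += 1
    return max(counts)

for links in (4, 8, 16, 64):
    print(f"{links:>2} ECMP next hops: busiest link carries {worst_case_elephants(links)} elephant flow(s)")
```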

Yet another approach goes back to the use of chassis-based switches with internal Clos fabrics -- most chassis backplanes do a better job of load balancing, in some cases by slicing packets into smaller chunks at the source card, sending them in parallel across multiple fabric links, and reassembling the packet at the destination card.

Managing Congestion


Although the best way to solve a bandwidth problem is with more bandwidth, congestion will still occur even with a non-blocking fabric.  One example is an event that triggers multiple (N) hosts to send data simultaneously to a single host.  Assuming all hosts have the same link speed, the sum of the senders' traffic rates is N times the speed of the receiver and will result in congestion at the receiver's port.  This problem is common in scale-out processing models such as MapReduce.  In other cases, congestion is caused not by a distributed algorithm but by competing elephant flows that by nature attempt to consume as much bandwidth as the network will allow.
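
As a hedged back-of-the-envelope in Python (the sender count, link speed and burst size are assumptions, not measurements), the arithmetic shows how quickly such an incast event overruns the receiver's port:

```python
# Illustrative incast arithmetic (assumed numbers, not a measurement).

senders   = 20      # hosts answering the same request at once
link_gbps = 10      # every host, including the receiver, has a 10GE link
reply_kb  = 256     # each sender's burst, in kilobytes

offered_gbps = senders * link_gbps        # N times the receiver's link speed
excess_kb    = (senders - 1) * reply_kb   # data that cannot drain as fast as it arrives

print(f"offered load at the receiver's port: {offered_gbps} Gb/s into {link_gbps} Gb/s")
print(f"buffer needed to absorb one synchronized burst: ~{excess_kb} KB "
      f"(anything less and packets drop, forcing TCP to back off)")
```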

Once the usual trick of applying as much bandwidth as the checkbook will allow has been performed, it's time for some other tools to come into play -- class-based weighted queuing, bufferbloat-eliminating AQM techniques and more flexible flow placement.

Separate traffic types:

The standard technique of managing fairness across traffic types by placing them in different weighted queues is extremely powerful and sadly underutilized in the data center network.  One of the reasons for the underuse is woefully deficient queuing in many merchant-silicon-based switches.

The ideal switch would support a reasonable number of queues, each with sufficient, dedicated and capped depth.  My preference is 8 egress queues per port, each able to hold approximately 20 jumbo frames on a 10GE link.  Bursty applications may need a bit more for their queues, and applications with fewer flows can in many cases be afforded less.  Proper use of queuing and scheduling ensures that bandwidth-hungry flows do not starve other flows.  Using DSCP to classify traffic into traffic classes is fairly easy and can be done from the hosts before packets hit the network.  It is also important to implement the same queuing discipline at the hosts (on Linux using the 'tc' tool) as exists in the network to ensure consistent behavior end-to-end from source computer to destination computer.
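
As a hedged sketch of the host-side half in Python (the class names and DSCP code points are assumptions and must match whatever classification the switches and the host 'tc' configuration actually honor), marking a traffic class is as simple as setting the TOS byte on a socket before the first packet leaves the host:

```python
# Minimal sketch: classify traffic from the host by setting DSCP before packets hit the network.
# The class names and DSCP code points here are assumptions; they must agree with the
# classification and queuing policy actually configured on the switches and in 'tc'.
import socket

DSCP = {
    "bulk":        8,    # CS1 - elephant/bulk transfers
    "best-effort": 0,    # default
    "low-latency": 46,   # EF  - latency-sensitive mice
}

def connect_with_class(host, port, traffic_class):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The IP TOS byte carries the DSCP value in its upper six bits.
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP[traffic_class] << 2)
    s.connect((host, port))
    return s

# e.g. a replication job might use connect_with_class(peer, 9000, "bulk") so that the
# per-port weighted queues keep it from starving interactive traffic along the path.
```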

One thing to watch out for is that on VoQ-based switches with deep buffers, the effective queue depth of an egress port servicing data from multiple ingress ports will, in the worst case, be the sum of the ingress queue depths, which may actually be too much.

Address bufferbloat:

Once elephants are inside their own properly scheduled queue, they have little effect on mice in another queue.  However, elephants in the same queue begin to affect each other negatively.  The reason is that TCP by nature attempts to use as much bandwidth as the network will give it, and the way out-of-the-box TCP knows it's time to ease off the accelerator is when packets start to drop.  The problem is that in a network with buffers, those buffers have to fill before packets drop; the buffers then sit full and can no longer absorb bursts, behaving in effect like a network with no buffers.  When multiple elephant flows are exhausting the same buffer pool in an attempt to find their bandwidth ceiling, the resulting performance is also less than the actual bandwidth available to them.

In an ideal world, elephant flows would not have to exhaust buffers to discover their bandwidth ceiling.  The good news is that newer AQM techniques like CoDel and PIE, in combination with ECN-enabled TCP, work together to maximize TCP performance without needlessly exhausting network buffers.  The bad news is that I don't know of any switches that yet implement these bufferbloat management techniques.  This is an area where hardware vendors have room to improve.
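
To make the AQM idea concrete, here is a greatly simplified sketch of the CoDel logic in Python (real implementations live in kernels or switch silicon and include a drop-rate control law this toy omits): rather than waiting for the buffer to fill, it watches how long packets sit in the queue and marks or drops only when the standing delay persists.

```python
# Greatly simplified sketch of the CoDel idea: react to persistent queueing delay,
# not to a full buffer. Real CoDel adds a drop-rate control law that this toy omits.
import time

TARGET_S   = 0.005   # acceptable standing (sojourn) delay: 5 ms
INTERVAL_S = 0.100   # the delay must persist this long before we react: 100 ms

class TinyCodel:
    def __init__(self):
        self.above_since = None          # when sojourn time first exceeded TARGET_S

    def on_dequeue(self, enqueue_time):
        sojourn = time.monotonic() - enqueue_time
        if sojourn < TARGET_S:
            self.above_since = None      # queue is draining fine; reset
            return "forward"
        if self.above_since is None:
            self.above_since = time.monotonic()
        if time.monotonic() - self.above_since >= INTERVAL_S:
            return "mark-or-drop"        # tell TCP to ease off before buffers are exhausted
        return "forward"
```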

Class-aware load balancing:

The idea behind class-aware load balancing across a leaf-spine fabric is to effectively create different transit rails for elephants and for mice.  Priorities are assigned to traffic classes on different transit links such that traffic of a certain class is forwarded only over the links with the highest priority for that class, with the ability to fall back to other links when necessary.  Using class-aware load balancing, more links can be prioritized for elephants during low-mice windows and fewer during high-mice windows.  Other interesting possibilities also exist.
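
A hedged sketch of the selection logic in Python (the link names and priority table are invented for illustration): each class prefers the links where it holds the highest priority and falls back to the remaining links only when those are unavailable.

```python
# Illustrative class-aware link selection (link names and priorities are invented).
import zlib

# Per-class priority for each transit link; higher is preferred.
LINK_PRIORITY = {
    "spine-1": {"mice": 10, "elephant": 1},
    "spine-2": {"mice": 10, "elephant": 1},
    "spine-3": {"mice": 1,  "elephant": 10},   # the "elephant rail"
    "spine-4": {"mice": 1,  "elephant": 10},
}

def pick_link(flow_key, traffic_class, up_links):
    candidates = {link: LINK_PRIORITY[link][traffic_class] for link in up_links}
    best = max(candidates.values())
    rail = sorted(link for link, prio in candidates.items() if prio == best)
    return rail[zlib.crc32(flow_key.encode()) % len(rail)]   # hash within the preferred rail

print(pick_link("10.0.0.1>10.0.1.9:443", "elephant", up_links=list(LINK_PRIORITY)))
print(pick_link("10.0.0.1>10.0.1.9:443", "elephant", up_links=["spine-1", "spine-2"]))  # fallback
```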

Spread elephant traffic more evenly:

Even after separating traffic types into different queues, applying bufferbloat-busting AQM, and using class-aware load balancing, there is still the matter of hot and cold spots created by flow-based load balancing of elephant flows.  In the ideal situation, every link in an ECMP link group would be evenly loaded.  This could be achieved easily with per-packet load balancing (or hashing on the IP ID field), but given the varying size of packets, the resulting out-of-sequence packets can have a performance impact on the receiving computer.  There are a few ways to tackle this -- (1) Enable per-packet load balancing only on the elephant classes.  Here we trade receiver CPU for bandwidth efficiency, but only for receivers of elephant flows; we can reduce the impact on the receiver by using jumbo frames, and since elephant flows are generally fewer than mice, the CPU impact aggregated across all nodes is not that much.  (2) Use adaptive load balancing on the elephant class.  In adaptive load balancing, the router samples traffic on its interfaces and selectively places flows on links to even out load.  This generally consumes FIB space, but since elephant flows are fewer than mice, using some FIB space for improved load balancing of elephant flows is worth it.
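
Here is a hedged sketch of option (1) in Python (in practice this decision lives in the switch forwarding plane, not in software): only the elephant class gives up flow affinity and is sprayed per-packet across the ECMP group, while mice keep their per-flow hash and arrive in order.

```python
# Sketch: spray only the elephant class per-packet; keep flow hashing for everything else.
# Classes and link names are made up; in practice this lives in switch hardware, not Python.
import itertools
import zlib

class ClassAwareEcmp:
    def __init__(self, links):
        self.links = links
        self.spray = itertools.cycle(links)   # round-robin for the elephant class

    def next_hop(self, flow_key, traffic_class):
        if traffic_class == "elephant":
            return next(self.spray)           # per-packet: even load, but packets may reorder
        return self.links[zlib.crc32(flow_key.encode()) % len(self.links)]  # per-flow: in order

ecmp = ClassAwareEcmp(["spine-1", "spine-2", "spine-3", "spine-4"])
print([ecmp.next_hop("bulk-copy", "elephant") for _ in range(6)])   # walks across all links
print([ecmp.next_hop("web-flow", "mice") for _ in range(3)])        # pinned to a single link
```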

Update: Another very promising way to spread elephant traffic more evenly across transit links is MPTCP (see http://www.youtube.com/watch?v=02nBaaIoFWU).  MPTCP can split an elephant connection across many subflows that are then hashed by routers and switches across multiple paths -- MPTCP then shifts traffic across subflows to achieve the best throughput, moving data transmission from less performant subflows to more performant ones.

Keeping costs down


The dense connectivity requirement created by the shift from vertically scaled computing and storage to horizontally scaled approaches has put tremendous new cost pressure on the data center network.  The relative cost of the network in a scale-out data center is rising dramatically as the cost of compute and storage falls, because the cost of network equipment is not falling in proportion.  This challenges network teams, who are left with the choice of building increasingly oversubscribed networks or dipping into the budget share intended for compute and storage.  Neither of these is acceptable.

The inability of OEMs to respond to this has supported the recent success of merchant-silicon-based switches.  The challenge with merchant silicon is its limited FIB scaling compared to more expensive OEM switches, which tend to have better FIB scaling and QoS capability, but at a higher price point.

Adopting switches with reduced FIB capacity can make some folks nervous.  Chances are, however, that some "little" changes in network design can make it possible to use these switches without sacrificing scalability.  One example of how to reduce the need for high FIB capacity in a stub fabric is to not send external routes into the fabric but only send in a default route; the stub fabric then only needs to carry specific routes for endpoint addresses inside the fabric.  Addressing the stub fabric from a single large prefix that is advertised out also reduces the FIB load on the transit fabric, enabling the use of commodity switches there as well.  For multicast, using PIM-bidir instead of PIM DM or SM/SSM will also significantly reduce FIB requirements.  Using overlay network virtualization also results in a dramatic reduction in the need for large FIB tables when endpoint addresses need to be mobile within or beyond the stub fabric -- but make sure that your core switches can hash on overlay payload headers or you will lose ECMP hashing entropy.  [ Note: some UDP- or STT-based overlay solutions manipulate the source port of the overlay header to improve hashing entropy on transit links. ]
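
A hedged back-of-the-envelope in Python (the route counts and FIB size are rough, era-appropriate guesses) of why a default route plus the fabric's own routes fits comfortably in a commodity switch FIB while a full table does not:

```python
# Rough FIB arithmetic for a stub fabric (all counts are illustrative guesses).

full_internet_routes = 500_000   # order of magnitude for a full IPv4 table of the era
fabric_host_routes   = 4_000     # host routes for endpoints inside the stub fabric
commodity_fib_size   = 16_000    # typical merchant-silicon LPM capacity, give or take

with_full_table   = full_internet_routes + fabric_host_routes
with_default_only = 1 + fabric_host_routes        # 0.0.0.0/0 plus the internal routes

for label, needed in [("full table in the fabric", with_full_table),
                      ("default route only      ", with_default_only)]:
    verdict = "fits" if needed <= commodity_fib_size else "does not fit"
    print(f"{label}: {needed:,} routes ({verdict} in a {commodity_fib_size:,}-route FIB)")

# The transit fabric, in turn, only needs the stub fabric's single aggregate prefix.
```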

Besides the cost of custom OEM switches, OEMs have also fattened their pocketbooks on bloated optical transceiver prices.  As network OEMs have directed optics spend into their coffers, customers have been forced to spend even more money on expensive switches with rich traffic management capabilities and tools in order to play whack-a-mole with congestion-related problems in the data center network.  As customers have lost faith in their incumbent network equipment vendors, a market for commodity network equipment and optics has begun to evolve.  The prices of optics have fallen so sharply that, if you try hard enough, it is possible to get high-quality 10GE transceivers for under $100 and 40GE for the low hundreds.  It is now possible to populate a switch with optics for much less than the cost of the switch itself, whereas previously the optics cost a multiple of the switch.  Furthermore, with the emergence of silicon photonics I believe we will see switches with on-board 10GE single-mode optics by 2016.  With luck, we'll see TX and RX over a single strand of single-mode fiber -- I believe this is what the market should be aiming for.

As the core network gets less expensive, costs may refuse to come down for some data center network operators as they shift to other parts of the network.  The licensing costs associated with software-based virtual network functions will be one of the main culprits.  In the case of network virtualization, one way to avoid this cost is to leverage ToR-based overlays, which in some cases come at no additional cost.  If other network functions like routing, firewalling and load balancing are still performed on physical devices (and you like it that way), then "passive mode" MVRP between hypervisor and ToR switch, in combination with a ToR-based overlay, will enable high-performance autonomic network virtualization.  The use of MVRP as a UNI to autonomically trigger VLAN creation and attachment between the hypervisor switch and the ToR switch is already a working model, available on Juniper QFabric and OpenStack (courtesy of Piston Cloud and a sponsor). [ Note: ToR-based network virtualization does not preclude the use of other hypervisor-based network functions such as firewalling and load balancing. ]

All said, however, the biggest driver of cost is closed, vertically integrated solutions that create lock-in and hold operators back from choice.  Open standards and open platforms give operators the freedom to choose network equipment by speed, latency, queuing, high availability, port density and other such properties without being locked in to a single vendor.  Lock-in leads to poor competition and ultimately to higher costs and slower innovation, as we have already witnessed.

Wrapping up, I've touched on some existing technologies and approaches, and some that aren't yet available but are badly needed, to build a simple, robust and cost-effective modern data center core network.  If you have your own thoughts on simple and elegant ways to build the data center network fabric of tomorrow, please share them in the comments.

Tuesday, October 22, 2013

The real Slim Shady

Historically, when an application team needed compute and storage resources, they would kick off a workflow that pulled in several teams to design, procure and deploy the required infrastructure (compute, storage & network).  The whole process generally took a few months from request to delivery of the infrastructure.

The reason for this onerous approach was really that application groups generally dictated their choice of compute technology.  Since most applications scaled vertically, the systems and storage scaled likewise.  When an application needed more horsepower, it got bigger, more powerful computers and faster storage technology.  The hardware for the request was then staged, followed by a less-than-optimal migration to the new hardware.

The subtlety that gets lost regarding server virtualization is that a virtualization cluster is based on [near] identical hardware.  The first machines to be virtualized were the ones whose compute and storage requirements could be met by the hardware the cluster was based on; these tended to be the applications that were not vertically scaled.  The business-critical, vertically scaled applications continued to demand special treatment, driving the overall infrastructure deployment model used by the enterprise.

The data center of the past is littered with technology of varying kinds.  In such an environment, technology idiosyncrasies change faster than the ability to automate them -- hence the need for operators at consoles with a library of manuals.  Yesterday's data center network was correspondingly built to cater to this technology consumption model.  A significant part of the cost of procuring and operating the infrastructure was on account of this diversity.  Obviously, meaningful change would not be possible without addressing this fundamental problem.

Large web operators had to solve the issue of horizontal scale-out several years ahead of the enterprise and essentially paved the way for the horizontal approach to application scaling.  HPC had been using the scale-out model before the web, but the platform technology was not consumable by the common enterprise.  As enterprises began to leverage web-driven technology as the platform for their applications, they gained its side benefits, one of which is horizontal scale-out.

With the ability to scale horizontally, it became possible to break an application into smaller pieces that could run across smaller "commodity" compute hardware.  Along with this came the ability to build homogeneous, easily scaled compute pools that could meet the growth needs of horizontally scaling applications simply by adding more nodes to the pool.  The infrastructure delivery model shifted from reactive, application-driven, custom dedicated infrastructure to a proactive, capacity-driven infrastructure-pool model.  In the latter model, capacity is added to the pool when it runs low, and applications are entitled to pool resources based on a "purchased" quota.

Once homogeneity was driven into the infrastructure, it became possible to build out the physical infrastructure in groups of units.  Many companies now consume prefabricated racks with computers prewired to a top-of-rack switch, and even prefabricated containers.  When a prefabricated rack arrives, it is taken to its designated spot on the computer room floor, and power and network uplinks are connected.  In some cases the rack brings itself online within minutes with the help of a provisioning station.

As applications transitioned to horizontal scaling models and physical infrastructure could be built out in large homogeneous pools, some problems remained.  In a perfect world, applications would be inherently secure and be deployed to compute nodes based on availability of CPU and memory without the need for virtualization of any kind; in that world, the network and server would be very simple.  The reality is, on one side, that application dependencies on shared libraries do not allow them to co-exist with an application that needs a different version of those libraries.  This, among other things, forces the need for server virtualization.  On the other side, since today's applications are not inherently secure, they depend on the network to create virtual sandboxes and enforce rules within and between these sandboxes.  Hence the need for network virtualization.

Although server and network virtualization have the spotlight, the real revolution in the data center is simple, homogeneous, easily scalable physical resource pools and applications that can use them effectively.  Let's not lose sight of that.


[Improvements in platform software will secure applications and allow them to co-exist on the same logical machine within logical containers, significantly reducing the need for virtualization technologies in many environments.  This is already happening.]

Sunday, October 13, 2013

The bumpy road to EVPN

In 2004 we were in the planning phase of building a new data center to replace one we had outgrown.  The challenge was to build a network that would continue to cater to a diverse range of data center applications and yet deliver significantly improved value.

Each operational domain tends to have one or more optimization problems whose solution is less than optimal for another domain.  In an environment where compute and storage equipment come in varying shapes and capabilities and with varying power and cooling demands, the data center space optimization problem does not line up with the power distribution and cooling problem, the switch and storage utilization problem, or the need to minimize shared risk for an application, to name a few.

The reality of the time was that the application, backed by its business counterparts, generally had the last word -- good or bad.  If an application group felt they needed a server as large as a commercial refrigerator that emitted enough heat to keep a small town warm, that's what they got, provided they could produce the dollars for it.  Application software and hardware infrastructure as a whole was the bastard child of a hundred independent, self-proclaimed project managers, and in the end someone else paid the piper.

When it came to moving applications into the new data center, the first ask of the network was to allow application servers to retain their IP addresses.  Eventually most applications moved away from a dependence on static IP addresses, but many continued to depend on middle-boxes to manage all aspects of access control and security (among other things).  The growing need for security-related middle-boxes, combined with their operational model and costs, continued to put pressure on the data center network to provide complex band-aids.

A solid software infrastructure layer (aka PaaS) addresses most of the problems that firewalls, load balancers and stretchy VLANs are used for, but this was unrealistic for most shops in 2005.  Stretchy VLANs were needed to make it easier on adjacent operational domains -- space, power, cooling, security, and a thousand storage and software idiosyncrasies.  And there was little alternative but to deliver stretchy VLANs using a fragile toolkit.  With much of structured cabling and chassis switches giving way to data-center-grade pizza-box switches, STP was coming undone.  [Funnily enough, the conversation about building better software infrastructure continues to be overshadowed by an ongoing conversation about stretchy VLANs.]

Around 2007 the network merchants who gave us spanning tree came around again peddling various flavors of TRILL and lossless Ethernet.  We ended up on this evolutionary dead end for mainly misguided reasons.  In my opinion, it was an unfortunate misstep that set the clock back on real progress.  I have much to say on this topic, but I might go off the deep end if I start.

Prior to taking on the additional responsibility of developing our DC core networks, I was responsible for the development of our global WAN, where we had had a great deal of success building scalable multi-service, multi-tenant networks.  The toolkit for building amazingly scalable, interoperable, multi-vendor multi-tenancy already existed -- it did not need to be reinvented.  So between 2005 and 2007 I sought out technology leaders from our primary network vendors, Juniper and Cisco, to see if they would be open to supporting an effort to create a routed Ethernet solution suitable for the data center, based on the same framework as BGP/MPLS-based IPVPNs.  I made no progress.

It was around 2007 when Pradeep stopped in to share his vision for what became Juniper's QFabric.  I shared with him my own vision for the data center -- to make the DC network a more natural extension of the WAN, based on a common toolkit.  Pradeep connected me to Quaizar Vohra and ultimately to Rahul Agrawal.  Rahul and I discussed the requirements for a routed Ethernet for the data center based on MP-BGP, and out of these conversations MAC-VPN was born.  At about the same time, Ali Sajassi at Cisco was exploring routed VPLS to address hard-to-solve problems with flood-and-learn VPLS, such as multi-active multi-homing.  With pressure from yours truly to make MAC-VPN a collaborative industry effort, Juniper reached out to Cisco in 2010, and the union of MAC-VPN and R-VPLS produced EVPN, a truly flexible and scalable foundation for Ethernet-based network virtualization for both the data center and the WAN.  EVPN evolved further with contributions from great folks at Alcatel, Nuage, Verizon, AT&T, Huawei and others.

EVPN and a few key enhancement drafts (such as draft-sd-l2vpn-evpn-overlay, draft-sajassi-l2vpn-evpn-inter-subnet-forwarding and draft-rabadan-l2vpn-evpn-prefix-advertisement) combine to form a powerful, open and simple solution for network virtualization in the data center.  With support for VXLAN encapsulation, EVPN builds on VXLAN's momentum.  IP-based transport tunnels address a number of usability issues for data center operators, including the ability to operate transparently over the top of a service provider network, and enable optimizations such as multi-homing with "local bias" forwarding.  The other enhancement drafts describe how EVPN can natively and efficiently support distributed inter-subnet routing, service chaining and more.

In the context of SDN we speak of "network applications" that work on top of a controller to implement a network service.  EVPN is a distributed network application that works on top of MP-BGP.  EVPN procedures are open and can be implemented by anyone, with the benefit of interoperation with other compliant EVPN implementations (think federation).  EVPN can also co-exist synergistically with other MP-BGP based network applications such as IPVPN, NG-MVPN and others.  A few major vendors already have data center virtualization solutions that leverage EVPN.

I hope to produce a blog entry or so to describe the essential parts of EVPN that make it a powerful foundation for data center network virtualization.  Stay tuned.

Thursday, July 4, 2013

Air as a service

Have you ever wondered about air?  We all share the same air.  We know it's vitally important to us.  If we safeguard it we all benefit and if we pollute it we all suffer.  But we don't want to have to think about it every time we take a breath.  That's the beauty of air.  Elegantly simple and always there for you.

Imagine air as a service (AaaS), where you need to specify the volume of air, its quality, and so on before you could have some to breathe.  As much as some folks might delight in the possibility of capitalizing on that, it would not be the right consumption model for such a fundamental resource.  If we had to spend time and resources worrying about the air we breathe, we'd have less time and resources to do other things, like make dinner.



Why does air as it is work so well for us?  I think it's for these reasons: (1) there is enough of it to go around, and (2) reasonable people (the majority) take measures to ensure that the air supply is not jeopardized.

Network bandwidth and transport should be more like how we want air to be.  The user of network bandwidth and transport (citizen) should not have to think about these elemental services of the network other than to be a conscientious user of this shared resource.  The operator of the network (government) should ensure that the network is robust enough to meet these needs of network users.  Furthermore the operator should protect network users from improper and unfair usage without making the lives of all network users difficult, or expecting users to know the inner workings of the network in order to use it.

The past is littered with the carcasses of attempts by network vendors and operators to force network-level APIs and other complexity on the network user.  Remember the ATM NIC?  Those who forget the past are doomed to repeat its failures and fail to benefit from its successes.

What the average network user wants is to get the elemental services of the network without effort, like breathing air.  So don't make it complicated for the network user -- just make sure there's enough bandwidth to go around and get those packets to where they need to go.