Tuesday, November 5, 2013

Evolving the data center network core

As complex functions that were historically shoehorned into the network core move out to the edge where they belong, data center core network operators can now focus on improving the experience for the different types of applications that share the network.  Furthermore, with a few vertically scaled systems giving way to many horizontally scaled systems, the economics of data center bandwidth and connectivity need to change.

I’ve jotted down some thoughts for improving the data center core network along the lines of adding bandwidth, managing congestion and keeping costs down.

Solve bandwidth problems with more bandwidth

Adding bandwidth has been a challenge for the last several years, owing to the Ethernet industry not being able to maintain the historical Ethernet uplink:downlink speedup of 10:1 while also not bringing down the cost of Ethernet optics fast enough.  Big web companies started to solve the uplink bandwidth problem in the same way they had solved the application scaling problem -- scale uplinks horizontally.  In their approach, the role of traditional bridging is limited to the edge switch (if used at all), and load-balancing to the edge is done using simple IP ECMP across a “fabric” topology.  The number of spine nodes in a spine-leaf fabric is constrained only by port counts and the number of simultaneous ECMP next hops supported by the hardware.  By horizontally scaling uplinks, it became possible to create non-blocking or near non-blocking networks even when uplink port speeds are no longer 10x the access speed.  Aggregating lower speed ports in this way also makes it possible to use lower-end switches at a lower cost.
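
To make the "scale uplinks horizontally" point concrete, here is a small back-of-the-envelope sketch in Python.  The port counts and speeds are made-up examples, not figures from any particular switch; the function simply divides a leaf's total downlink capacity by its total uplink capacity.

```python
def oversubscription_ratio(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Downlink capacity divided by uplink capacity for one leaf switch.

    A result of 1.0 means the leaf is non-blocking; larger values mean
    the uplinks are oversubscribed by that factor.
    """
    uplink_capacity = uplinks * uplink_gbps
    if uplink_capacity <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return (downlinks * downlink_gbps) / uplink_capacity


# Hypothetical leaf with 48 x 10GE access ports.
print(oversubscription_ratio(48, 10, 4, 40))    # 4 x 40GE uplinks
print(oversubscription_ratio(48, 10, 12, 40))   # 12 x 40GE uplinks
```

With these assumed numbers, four uplinks leave the leaf 3:1 oversubscribed, while twelve uplinks (one per spine, for example) make it non-blocking without any single uplink being 10x the access speed.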

Using ECMP does come with its own demons.  Flow-based hashing isn’t very good when the number of next hops isn’t large.  This leads to imperfect load balancing, which results in imperfect bandwidth utilization and a nonuniform experience for flows in the presence of larger (elephant) flows.  To address this issue, some large web companies look to use up to 64 ECMP paths to increase the efficiency of ECMP flow placement across the backbone.  However, even with the best ECMP flow placement, it's better that uplink port speeds are faster than downlink speeds to avoid the inevitable sharing of an uplink by elephant flows from multiple downstream ports.
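
One rough way to see why more next hops help is a birthday-problem estimate.  The sketch below assumes an idealized hash that places every elephant flow on a path uniformly and independently, which real hardware hashes only approximate; the function name and the flow counts are illustrative.

```python
def collision_probability(elephants, paths):
    """P(at least two elephant flows share a path) under ideal uniform hashing."""
    if elephants > paths:
        return 1.0  # pigeonhole: some path must carry two elephants
    p_no_collision = 1.0
    for i in range(elephants):
        # The i-th elephant must avoid the i paths already taken.
        p_no_collision *= (paths - i) / paths
    return 1.0 - p_no_collision


for n_paths in (4, 16, 64):
    print(n_paths, round(collision_probability(4, n_paths), 3))
```

Under these assumptions the chance that some pair of four elephants lands on the same uplink falls steeply as the path count grows, but it never reaches zero, which is why faster uplinks still matter.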

Yet another approach goes back to the use of chassis-based switches with internal Clos fabrics -- most chassis backplanes do a better job of load balancing, in some cases by slicing packets into smaller chunks at the source card before sending them in parallel across multiple fabric links and reassembling the packet at the destination card.

Managing congestion

Although the best way to solve a bandwidth problem is with more bandwidth, even with a non-blocking fabric, network congestion will still occur.  An example is with events that trigger multiple (N) hosts to send data simultaneously to a single host.  Assuming all hosts have the same link speed, the sum of the traffic rates of the senders can reach N times the link speed of the receiver and will result in congestion at the receiver’s port.  This problem is common in scale-out processing models such as MapReduce.  In other cases, congestion is not a result of a distributed algorithm, but of competing elephant flows that by nature attempt to consume as much bandwidth as the network will allow.
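
The N-to-1 arithmetic can be sketched as follows.  This is a worst-case estimate that assumes every sender transmits at line rate with no TCP backoff, and the buffer size used is an arbitrary example rather than a figure for any real switch.

```python
def incast_buffer_fill_time_us(senders, link_gbps, buffer_bytes):
    """Microseconds until the receiver's egress buffer fills during incast."""
    offered_gbps = senders * link_gbps
    excess_gbps = offered_gbps - link_gbps   # what the receiver port cannot drain
    if excess_gbps <= 0:
        return float("inf")                  # a single sender never overruns the port
    excess_bytes_per_us = excess_gbps * 1e9 / 8 / 1e6
    return buffer_bytes / excess_bytes_per_us


# Hypothetical case: 10 senders, 10GE everywhere, 1 MB of egress buffer.
print(round(incast_buffer_fill_time_us(10, 10, 1_000_000), 1), "us")
```

Even a generous buffer absorbs such a burst for only a fraction of a millisecond, so extra fabric bandwidth alone does not remove congestion at the receiver's port.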

Once the usual trick of applying as much bandwidth as the checkbook will allow has been performed, it’s time for some other tools to come into play -- class-based weighted queuing, bufferbloat-eliminating AQM techniques and more flexible flow placement.

Separate traffic types:

The standard technique of managing fairness across traffic types by placing different traffic types in different weighted queues is extremely powerful and sadly underutilized in the data center network.  One of the reasons for the underuse is woefully deficient queuing in many merchant silicon based switches.  

The ideal switch would support a reasonable number of queues where each queue has sufficient, dedicated and capped depth.  My preference is 8 egress queues per port that can each hold approximately 20 jumbo frames on a 10GE link.  Bursty applications may need a bit more for their queues, and applications with fewer flows can in many cases get by with less.  Proper use of queueing and scheduling ensures that bandwidth-hungry traffic flows do not starve other flows.  Using DSCP to classify traffic into traffic classes is fairly easy to do and can be done from the hosts before packets hit the network.  It is also important to implement the same queuing discipline at the hosts (on Linux using the ‘tc’ tool) as exists in the network to ensure the same behavior end-to-end from source computer to destination computer.
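
As an illustration of marking DSCP from the host, the Python sketch below sets the IP TOS byte on a socket using the standard `IP_TOS` socket option.  The choice of AF11 for bulk traffic is purely an assumption for the example; which code point maps to which queue is a local policy decision, and the helper names are hypothetical.

```python
import socket


def dscp_to_tos(dscp):
    """DSCP occupies the upper six bits of the IP TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value (0-63)")
    return dscp << 2


def open_marked_udp_socket(dscp):
    """Return a UDP socket whose outgoing IPv4 packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


AF11 = 10  # assumed code point for the "elephant" class in this example

bulk = open_marked_udp_socket(AF11)
print(bulk.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))
bulk.close()
```

The switches (and the host's own ‘tc’ configuration) would then need matching classification rules so that the marked packets land in the intended queue.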

One thing to watch out for is that on VoQ based switches with deep buffers the effective queue depth of an egress port servicing data from multiple ingress ports will, in the worst case, be the sum of the ingress queue depths, which may actually be too much.
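
A quick calculation shows how large that worst case can get.  The sketch assumes each ingress port holds a full virtual output queue of 20 jumbo frames (9000 bytes each) for the same 10GE egress port; the port count is an arbitrary example.

```python
JUMBO_BYTES = 9000  # assumed jumbo frame size


def worst_case_voq_delay_ms(ingress_ports, frames_per_voq, egress_gbps,
                            frame_bytes=JUMBO_BYTES):
    """Time to drain every ingress VoQ aimed at a single egress port."""
    backlog_bytes = ingress_ports * frames_per_voq * frame_bytes
    return backlog_bytes * 8 / (egress_gbps * 1e9) * 1e3


print(worst_case_voq_delay_ms(1, 20, 10))    # a single 20-frame queue
print(worst_case_voq_delay_ms(32, 20, 10))   # 32 ingress ports, all backlogged
```

One 20-frame queue drains in well under a millisecond at 10GE, but 32 of them stacked behind the same egress port add several milliseconds of queuing delay.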

Address bufferbloat:

Once elephants are inside their own properly scheduled queue, they have little effect on mice in another queue.  However, elephants in the same queue begin to affect each other negatively.  The reason is that TCP by nature attempts to use as much bandwidth as the network will give it.  The way in which out-of-the-box TCP knows that it’s time to ease off of the accelerator is when packets start to drop.  The problem here is that in a network with buffers, these buffers have to get full before packets drop, which effectively leaves a network with no buffers.  When multiple elephant flows are exhausting the same buffer pool in an attempt to find their bandwidth ceiling, the resulting performance is also less than the actual bandwidth available to them.

In the ideal world, buffers would not be exhausted for elephant flows to discover their bandwidth ceiling.  The good news is that newer AQM techniques like CoDel and PIE in combination with ECN-enabled TCP work together to maximize TCP performance without needlessly exhausting network buffers.  The bad news is that I don’t know of any switches that yet implement these bufferbloat management techniques.  This is an area where hardware vendors have room to improve.
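
To give a feel for how such an AQM behaves, here is a heavily simplified sketch of the CoDel idea in Python.  It uses CoDel's published default target (5 ms) and interval (100 ms), but it is not a faithful implementation of the full algorithm; the class and method names are my own, and a real implementation would carry more state between marking episodes.

```python
import math

TARGET = 0.005     # acceptable standing queue delay, in seconds
INTERVAL = 0.100   # how long delay must stay above TARGET before acting


class SimpleCoDel:
    """Toy CoDel-style marker driven by per-packet sojourn time."""

    def __init__(self):
        self.first_above = None   # deadline after which sustained delay triggers marking
        self.marking = False
        self.count = 0
        self.next_mark = 0.0

    def on_dequeue(self, now, sojourn):
        """Return True if this packet should be ECN-marked (or dropped)."""
        if sojourn < TARGET:
            # Queue delay is fine again: leave the marking state.
            self.first_above = None
            self.marking = False
            return False
        if self.first_above is None:
            self.first_above = now + INTERVAL
            return False
        if not self.marking:
            if now < self.first_above:
                return False
            self.marking = True
            self.count = 0
        elif now < self.next_mark:
            return False
        # Mark, then schedule the next mark sooner: interval / sqrt(count).
        self.count += 1
        self.next_mark = now + INTERVAL / math.sqrt(self.count)
        return True


codel = SimpleCoDel()
marks = [ms for ms in range(400) if codel.on_dequeue(ms / 1000, 0.010)]
print(marks)
```

The point of the sketch is that the signal to slow down is driven by how long packets sit in the queue, not by the buffer being full, so an ECN-capable sender can back off while most of the buffer is still free.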

Class-aware load balancing:

The idea behind class-aware load balancing across a leaf-spine fabric is to effectively create different transit rails for elephants and for mice.  In class-aware load balancing, priorities are assigned to traffic classes on different transit links such that traffic of a certain class will be forwarded only over the links that have the highest priority for that class, with the ability to fall back to other links when necessary.  Using class-aware load balancing, more links can be prioritized for elephants during low-mice windows and fewer during high-mice windows.  Other interesting possibilities also exist.
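
The selection logic might look something like the Python sketch below.  The link names, the class names and the convention that a lower number means a more preferred link are all assumptions made for illustration; this is not how any specific switch exposes the feature.

```python
def eligible_links(traffic_class, priorities, usable):
    """Return the usable links sharing the best priority for a traffic class.

    priorities: {link: {traffic_class: priority}}, lower number = preferred.
    usable:     set of links that are currently up.
    """
    candidates = {
        link: per_class[traffic_class]
        for link, per_class in priorities.items()
        if link in usable and traffic_class in per_class
    }
    if not candidates:
        return []
    best = min(candidates.values())
    return sorted(link for link, prio in candidates.items() if prio == best)


# Hypothetical leaf with four uplinks: two rails for mice, two for elephants.
priorities = {
    "up1": {"mice": 1, "elephants": 2},
    "up2": {"mice": 1, "elephants": 2},
    "up3": {"mice": 2, "elephants": 1},
    "up4": {"mice": 2, "elephants": 1},
}

print(eligible_links("elephants", priorities, {"up1", "up2", "up3", "up4"}))
print(eligible_links("elephants", priorities, {"up1", "up2"}))  # fallback case
```

ECMP hashing would then run only across the returned set, and shifting the priorities over time is one way to give elephants more rails during low-mice windows.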

Spread elephant traffic more evenly:

Even after separating traffic types into different queues, applying bufferbloat-busting AQM, and class-aware load balancing, there is still the matter of hot and cold spots created by flow-based load-balancing of elephant flows.  In the ideal situation, every link in an ECMP link group would be evenly loaded.  This could be achieved easily with per-packet load balancing (or hashing on the IP ID field), but given the varying size of packets, the resulting out-of-sequence packets can have a performance impact on the receiving computer.  There are a few ways to tackle these issues -- (1) Enable per-packet load-balancing only on the elephant classes.  Here we trade off receiver CPU for bandwidth efficiency, but only for receivers of elephant flows.  We can reduce the impact on the receiver by using jumbo frames.  Additionally, since elephant flows are generally fewer than mice, the CPU impact aggregated across all nodes is not that much.  (2) Use adaptive load-balancing on the elephant class.  In adaptive load balancing, the router samples traffic on its interfaces and selectively places flows on links to even out load.  These flow placements generally consume FIB space, but since elephant flows are fewer than mice, using some FIB space for improved load balancing of elephant flows is worth it.
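
Option (2) can be pictured with a simple greedy placement: take the sampled elephant flows from largest to smallest and pin each one to whichever link is currently least loaded.  This is only a sketch of the general idea under the assumption that per-flow rates are already known; it is not the algorithm used by any particular router, and the flow names and rates are invented.

```python
def adaptive_place(flow_rates, n_links):
    """Greedily pin flows to links, largest flow first, least-loaded link first.

    flow_rates: {flow_id: measured rate in Gbps}
    Returns (placement, loads), where placement maps flow_id -> link index.
    """
    loads = [0.0] * n_links
    placement = {}
    for flow_id, rate in sorted(flow_rates.items(), key=lambda kv: -kv[1]):
        link = min(range(n_links), key=lambda i: loads[i])
        placement[flow_id] = link   # each pinned flow costs one FIB/flow-table entry
        loads[link] += rate
    return placement, loads


sampled = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}   # hypothetical elephants
placement, loads = adaptive_place(sampled, 3)
print(placement)
print(loads)
```

Each entry in the placement is state the switch has to hold, which is the FIB cost mentioned above; it stays small because only the elephant class is handled this way.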

Update: Another very promising way to spread elephant traffic more evenly across transit links is MPTCP (see http://www.youtube.com/watch?v=02nBaaIoFWU).  MPTCP can be used to split an elephant connection across many subflows that are then hashed by routers and switches across multiple paths -- MPTCP then shifts traffic across subflows so as to achieve the best throughput.  This is done by moving data transmission from less performant subflows to subflows that are more performant.

Keeping costs down

The dense connectivity requirement created by the shift from vertically scaled computing and storage to horizontally scaled approaches has put a tremendous new cost pressure on the data center network.  The relative cost of the network in a scale-out data center is rising dramatically as the cost of compute and storage falls -- this is because the cost of network equipment is not falling in proportion.  This challenges network teams as they are left with the choice of building increasingly oversubscribed networks or dipping into the share of the budget that is intended for compute and storage.  Neither of these is acceptable.

The inability of OEMs to respond to this has supported the recent success of merchant silicon based switches.  The challenge with using merchant silicon based switches is the limited FIB scaling on these switches as compared to more expensive OEM switches.  OEM switches tend to have better FIB scaling and QoS capability, but at a higher cost point. 

Adopting switches with reduced FIB capacity can make some folks nervous.  Chances are, however, that some "little" changes in network design can make it possible to use these switches without sacrificing scalability.   One example of how to reduce the need for high FIB capacity in a stub fabric is to not send external routes into the fabric but only send in a default route.  The stub fabric would only need to carry specific routes for endpoint addresses inside the fabric.  Addressing the stub fabric from a single large prefix which is advertised out also reduces the FIB load on the transit fabric, which enables the use of commodity switches also in the transit fabric.  For multicast, using PIM bidir instead of PIM DM or SM/SSM will also significantly reduce the FIB requirements for multicast.  Using overlay network virtualization also results in a dramatic reduction in the need for large FIB tables when endpoint addresses need to be mobile within or beyond the stub fabric -- but make sure that your core switches can hash on overlay payload headers or you will lose ECMP hashing entropy.  [ Note: some UDP or STT-based overlay solutions manipulate the source port of the overlay header to improve hashing entropy on transit links. ]
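
The effect of addressing a stub fabric from one large prefix can be shown with Python's standard `ipaddress` module.  The 10.20.0.0/16 block and the per-rack /24 layout below are hypothetical; the point is only that contiguous subnets collapse into a single advertisement.

```python
import ipaddress


def summarize(prefixes):
    """Collapse a list of prefixes into the smallest covering set of routes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Hypothetical stub fabric: 256 rack subnets carved from one /16.
rack_subnets = ["10.20.%d.0/24" % i for i in range(256)]
print(len(rack_subnets), "routes inside ->", summarize(rack_subnets))
```

Inside the stub fabric the switches still carry the specific rack routes, but the transit fabric only needs the one summary, and the stub fabric in turn only needs a default route pointing out.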

Besides the cost of custom OEM switches, OEMs have also fattened their pocketbooks on bloated optical transceiver prices.  As network OEMs have been directing optics spend into their coffers, customers have been forced to spend even more money buying expensive switches with rich traffic management capabilities and tools in order to play whack-a-mole with congestion-related problems in the data center network.  As customers have lost their faith in their incumbent network equipment vendors, a market for commodity network equipment and optics has begun to evolve.  The prices of optics have fallen so sharply that, if you try hard enough, it is possible to get high quality 10GE transceivers for under $100 and 40GE for low hundreds as well.  Now it’s possible to populate the switch for much less than the cost of the switch, whereas previously the cost of populating a switch with optics was multiples of the cost of the actual switch.  Furthermore, with the emergence of silicon photonics I believe we will also see switches with on-board 10GE single-mode optics by 2016.  With luck, we’ll see TX and RX over a single strand of single-mode fiber -- I believe this is what the market should be aiming for.

As the core network gets less expensive, costs may refuse to come down for some data center network operators as the costs shift to other parts of the network.  The licensing costs associated with software-based virtual network functions will be one of the main culprits.  In the case of network virtualization, one way to avoid the cost would be to leverage ToR-based overlays.  In some cases, ToR-based network virtualization comes at no additional cost.  If other network functions like routing, firewalling and load-balancing are still performed on physical devices (and you like it that way), then "passive mode" MVRP between hypervisor and ToR switch in combination with ToR-based overlay will enable high performance autonomic network virtualization.  The use of MVRP as a UNI to autonomically trigger VLAN creation and attachment between hypervisor switch and ToR switch is already a working model and available on Juniper QFabric and OpenStack (courtesy of Piston Cloud and a sponsor). [ Note: ToR-based network virtualization does not preclude the use of other hypervisor-based network functions such as firewalling and load-balancing ]

All said, however, the biggest driver of cost is closed, vertically integrated solutions that create lock-in and hold operators back from choice.  Open standards and open platforms give operators the freedom to choose network equipment by speed, latency, queuing, high availability, port density, and such properties without being locked in to a single vendor.  Lock-in leads to poor competition and ultimately to higher costs and slower innovation, as we have already witnessed.

Wrapping up here, I’ve touched on some existing technologies and approaches, and some that aren’t yet available but are badly needed to build a simple, robust and cost-effective modern data center core network.  If you have your own thoughts on simple and elegant ways to build the data center network fabric of tomorrow, please share in the comments.


  1. Nice write up again Aldrin. I enjoyed hearing your thoughts on the use of queues in the DC fabric. I think this has been ignored, due to limited hardware support and also perceptions of being complicated. I'd like to see more on the subject, particularly for separating mice/elephant flows, how to detect or signal these (ideally I think we want the application or network stack to signal the flow size in DSCP or similar), and how much granularity is needed in the mice/elephant separation to bring about improvements to things like flow completion time without also creating negative consequences (how many queues, what sizes should be assigned to what queues, should this be dynamic).

    Also interesting to hear your thoughts on moving to single strand fiber, given the current (rather annoying) growth in the use of parallel fiber in 40G and 100G implementations today. It looks like we may end up flip flopping between parallel to get to the next phase before a serial version is released, then parallel again for the next step in bandwidth, so on. It'd be nice to overcome this.

    1. Hey Kris,

      Thanks for the suggestions -- these are definitely the areas where networking needs to place an increased emphasis. I think dynamic queue sizing is tricky and a topic I'd like to hear more discussion on. Personally I'd prefer to signal flows to scale up or down using ECN with a CoDel or PIE AQM instead of dynamically resizing queues. FYI, some discussion on doing elephant detection in VM centric environments can be found in http://networkheresy.com/2013/11/01/of-mice-and-elephants.

      With silicon photonics, and 10/25 Gbps optical channels, I believe we need to expect the Ethernet industry to give us Ethernet that can be scaled in increments of 10 or 25 Gbps using bidi WDM over a single strand of single-mode fiber. Among other benefits, investments in cabling infrastructure would last longer. I don't think it's wishful thinking any more.