Monday, June 24, 2013

Regarding scale-out network virtualization in the enterprise data center

There's been quite a lot of discussion regarding the benefits of scale-out network virtualization.   In this blog I present some additional thoughts to ponder regarding the value of network virtualization in the enterprise DC.  As with any technology options, the question that enterprise network operators need to ask themselves regarding scale-out network virtualization is whether it is the right solution to the problems they need to address.

To know whether scale-out network virtualization in the enterprise DC is the answer to the problem, we need to understand the problem in a holistic sense.  Let's set aside our desire to recreate the networks of the past (vlans, subnets, etc, etc) in a new virtual space, and with an open mind ask ourselves some basic questions.

Clearly at a high level, enterprises wish to reduce costs and increase business agility.  To reduce costs it's pretty obvious that enterprises need to maximize asset utilization.   But what specific changes should enterprise IT bring about to maximize asset utilization and bring about safe business agility?  This question ought to be answered in the context of the full sum of changes to the IT model necessary to gain all the benefits of the scale-out cloud.

Should agility come in the form of PaaS or stop short at Iaas?  Should the individual machine matter?  Should a service be tightly coupled with an instance of a machine or rather should the service exist as data and application that is independent of a machine (physical or virtual)?

In a scale-out cloud, the platform [software] infrastructure directs the spin up and down of  application instances relative to demand.  The platform infrastructure also spins up application capacity when capacity is lost due to physical failures.  Furthermore, the platform software infrastructure ensures that services are only provided to authorized applications and users and secured as required by data policy.  VLANs, subnets and IP addresses don't matter to scale-out cloud applications.  Clearly network virtualization isn't a requirement for a well designed scale-out PaaS cloud.  (Multi-tenant IaaS clouds do have a very real need for scale-out network virtualization)***

So why does scale-out network virtualization matter to the "single-tenant" enterprise?  Here's two reasons why I believe enterprises might believe they need it, and two reasons why I think maybe they don't need it for those reasons.

Network virtualization for VM migration.

The problem in enterprises is that a dynamic platform layer such as I describe above isn't quite achievable yet because, unlike the Google's of the world, enterprises generally purchase most of their software from third parties and have legacy software that does not conform to any common platform controls.  Many of the applications that enterprises use maintain complex state in memory that if lost can be disruptive to critical business services.  Hence, the closest an enterprise can do these days to attain cloud status, is IaaS -- i.e. take your physical machines and turn them into virtual machines.  Given this dependence on third party applications, dynamic bursting and that sort of true cloudy capabilities aren't universally applicable in the back end of an enterprise DC.

The popularity of vmotion in the enterprise is testament to the current need to preserve the running state of applications.  VM migration is primarily used for two reasons -- (1) to improve overall performance by non-disruptively redistributing running VMs to even out loads and (2) to non-disruptively move virtual machines away from physical equipment that is scheduled for maintenance.  This is different from scale-out cloud applications where virtual machines would not be moved, but rather new service instances spun up and others spun down to address both cases.

We all know that for vmotion to work, the migrated VM needs to retain the same IP and MAC address of it's prior self.  Clearly if the VM migration were limited to only a subset of the available compute assets this will lead to underutilization and hence higher costs.  If a VM should be migrated to any available compute node (assuming again that retaining IP and MAC is a requirement), the requirement would appear to be scale-out network virtualization.

Network virtualization for maintaining traditional security constructs.

As I mentioned before, a scale-out PaaS cloud enforces application security beginning with a trusted registration process.  Some platforms require the registration of schemas that application are then forced to conform to when communicating with another application.  This isn't a practical option for consumers of off-the-shelf applications.  But clearly, not enforcing some measure of expected behavior between endpoints isn't a very safe bet either.

The classical approach to network security has been to create subnets and place firewalls at the top of them.  A driving reason for this design is that implementing security at the end-station wasn't considered very secure since an application vulnerability could allow malware on a compromised system to override local security.  This drove the need for security to be imposed at the nearest point in the network that was less likely to be compromised rather than at the end station.

When traditional firewall based security is coupled with traditional LANs, a virtual machine is limited to only the compute nodes that are physically connected to that LAN and so we end up with the underutilization of the available compute assets that are on other LANs.  However if rather than traditional LANs, we instead use scale-out LAN virtualization, then the firewall (physical or virtual) could be wherever the firewall is, and the VMs that the firewall secures can be on any compute node.  Nice.

So it seems we need scale-out network virtualization for vmotion and security...

Actually we don't -- not if we don't have a need for non-IP protocols.  Contrary to what some folks might believe, VM migration doesn't require that a VM remains in the same subnet -- it requires that a VM retains it's IP and MAC address which is easily done using host-specific routes (IPv4 /32 or IPv6 /128 routes).  Using host-specific routing a VM migration would require that the new hypervisor advertise the VM's host route (initially with a lower metric) and the original hypervisor withdraw it when the VM is suspended.

So now that we don't need a scale-out virtual LAN for vmotion, that leaves the matter of the firewall.  The ideal place to implement security is at the north and south perimeters of your network.  As I mentioned earlier security inside the host (the true south) can be defeated and hence the subnet firewall (the compromise).  But with the advent of the hypervisor, there is now a trusted security enforcement point at the edge.  We can now implement firewall security right at the vNIC of the VM (the "south" perimeter).  When coupled with perimeter security at the points where communication lines connect your DC to the outside world (the "north" perimeter), you don't need scale-out virtual LANs to support traditional subnet firewalls either.  It's debatable whether additional firewall security is required at intermediate points between these two secured perimeters -- my view is that they are not unless you have a low degree of confidence in your security operations.  There is a tradeoff to intermediate security which comes at the expense of reduced bandwidth efficiency, increased complexity and higher costs to name a few.

The use of host-specific routing combined with firewall security applied at the vNIC is evident in LAN virtualization solutions that support distributed integrated routing and bridging capability (aka IRB or distributed inter-subnet routing).  The only way to perform fully distributed shortest-path routing with free and flexible VM placement, is to use host-based routing.  The dilemma then is where to impose firewall security -- at the vNIC of course!!

Although we don't absolutely need network virtualization for either VM migration or to support traditional subnet firewalls, there is one really good problem that overlay based networking helps with, and that is scaling.  Merchant silicon and other lower priced switches don't support a lot of hardware forwarding entries.  This means that your core network fabric might not have enough room in it's hardware forwarding tables to support a very large number of VMs.  Overlays solve this issue by only requiring the core network fabric to support about as many forwarding entries as there are switches in the fabric (assuming one access subnet per leaf switch).  However even in this case per my reasoning in the prior three paragraphs, for a single-tenant enterprise the overlay network should only need to support a single tenant instance and hence would be used for dealing with the scaling limitations of hardware and not for network virtualization.  

Building a network-virtualization-free enterprise IaaS cloud.

There's probably a couple of ways to achieve vmotion and segmentation without network virtualization and with very little bridging.  Below is one way to do this using only IP.  The following approach does not leverage overlays and so each fabric can only support as many hosts as the size of the hardware switch L3 forwarding table.

(1) Build an IP-based core network fabric using switches that have adequate memory and compute power to process fabric-local host routes.  The switch L3 forwarding table size reflects the number of VMs you intend to support in a single instance of this fabric design.  Host routes should only be carried in BGP.  You can use the classic BGP-IGP design or for a BGP-only fabric you might consider draft-lapukhov-bgp-routing-large-dc.  Assign the fabric a prefix that is large enough to provide addresses to all the VMs you expect your fabric to support.  This fabric prefix will be advertised out of and a default route advertised in to the fabric for inter-fabric and other external communication.
(2) Put a high performance virtual router in your hypervisor image that will get a network facing IP via DHCP and is scripted to automatically BGP peer with it's default gateway which will be the ToR switch.  The ToR switch should be configured to accept incoming BGP connections from any vrouter that is on it's local subnet.  The vrouter will advertise host routes of local VMs via BGP and for outbound forwarding will use it's default route to the ToR.
(3) To bring up a VM on a hypervisor, your CMS should create an unnumbered interface on the vrouter, attach the vNIC of the VM to it and create a host route to the VM which should be advertised via BGP.  The reverse should happen when the VM is removed.  This concludes the forwarding aspect of the design.
(4) This next step handles the firewall aspect of the design.  Use a bump-in-the-wire firewall like Juniper's vGW to perform targeted class-based security at the vNIC level.  If you prefer to apply ACLs on the VM facing port of the vrouter, then you should carve out prefixes for different roles from the fabric's assigned address space to make writing the ACLs a bit easier.

Newer hardware switches support 64K and higher L3 forwarding entries and also come with more than enough compute and memory to handle the task so it's reasonable to achieve upward of 32K VMs per fabric.  Further scaling is achieved by having multiple of these fabrics (each with their own dedicated maskable address block) interconnected via an inter-connect fabric, however VM migration should be limited to a single fabric.  But if you prefer to go with the overlay approach to achieve the greater scaling, replace BGP to the ToR with MP-BGP to two fabric route reflectors for VPNv4 route exchange.  When provisioning a VM-facing interface on the vrouter you'll need to place it into a VRF and import/export a route target.

I've left out some details for the network engineers among you to exercise your creativity (and avoid making this blog as long as my first one) -- why should Openstack hackers have all the fun? :)

Btw, native multicast should work just fine using this approach. Best of all, you can't be locked in to a single vendor.

If you believe you need scale-out overlay network virtualization consider using one that is based on an open standard such as IPVPN or E-VPN.  The latter does not require MPLS as some might believe and supports efficient inter-subnet routing via this draft which I believe will be accepted as a standard eventually.   Both support native multicast and both are or will be supported by at least three or more vendors eventually with full inter-operability.  I'm hopeful that my friends in the centralized camp will some day see the value of also using and contributing to open control-plane and management-plane standards.


  1. E-VPN would make an elegant solution. Generally what comes out of a standards WG will be more flexible and extensible. But it does take longer to get there, and it may accumulate stuff you don't want along the way.

    Centralizing things isn't automatically bad. A BGP RR is centralized. A route-server in an IX is centralized. A TX Matrix is a centralized RE.

    The "centralized camp" in this case might be thinking: I know the address and VPN membership of all my VMs. This only changes when I create, move, or destroy a VM. What's going to be the simplest and easiest to troubleshoot approach to keep that state synchronized across my VM hosts?

    There are numerous answers that smart software engineers can come up with, and some of them are also very elegant. The big difference between the approaches as I see it is with how closely VM orchestration and VM networking is coupled. In the E-VPN scenario nothing stops you from having multiple orchestration systems right down to manual configuration and baremetal all interworking. Of course this can be done with the centralised approaches too, just by building a BGP process into the system to talk with other systems.

  2. Hey Kris, sorry about this late response and thanks for your comments. When I say "centralized camp", I'm referring to solutions where the network has complete dependance on a centralized system (includes controller clusters with strong consistency requirement) that compile low level match-action rules for the network elements. In the case of RR and some of the other centralized route server models, they either do simple route reflection (i.e. they are not the source of truth) or use stateless route policy to modify attributes of route update messages.

  3. The model you describe doesn't work very well because of memory limitations in the hardware devices. The solution that Microsoft deployed ( which drove the design that Petr put together in draft-lapukhov-bgp-routing-large-dc ) is intended to support a massive scale overlay network using NVGRE.

    The problem with E-VPN, MPLS and BGP is the strict finite limits on forwarding tables in hardware devices which have strict limits and exponential cost increases.

    For this reason, all the vendors are using software overlays, including Cisco and Juniper.

    The second reason is that virtualization network using overlays create a system that can console cheaper commodity devices.

    So, no, scaling as you propose doesn't seem to be the way forward. It's possible that hardware solutions will arrive in the years ahead but there are no plans in networking to use them in the next 5 years at least.

    1. Hey Greg, Thanks for your comments.

      What Microsoft is doing with their overlay extends to many thousands of hypervisors (millions of VMs) with a custom smart controller to manage the overlay. The simple approach that I outline does not require a "central" controller (safe from controller bugs) and fully supports robust "in fabric" multicast replication. Since the approach is modular it is very scalable but, as I mention, is limited to 10s of thousands of VMs per stub fabric and host addresses are not mobile beyond the stub fabric instance. For environments that do not depend on out-of-fabric address mobility, approaches such as this work very well for limited cost, risk and complexity.

      Btw, E-VPN and IPVPN reduce the need for large RAM and FIB using RT-Constrain. Additionally, some vendors further optimize by reactively pushing down forwarding vectors into the FIB when the vector is required (similar to Openflow controller reactive mode).