Thursday, January 10, 2019

Blending into software infrastructure


Electronic networks existed long before electronic compute and storage.  Early on, the network was simple wires and switchboards, and the endpoints were humans.  Telegraphs turned taps into on/off current on the wire and back into audible clicks.  Telephones turned voice into fluctuating current and back into voice.

Since then, the network has existed as a unique entity, apart from the things it connects.  Until now.

Less than two decades ago, most applications were built in vertical silos.  Each application got its own servers, storage, database and so on.  The only thing applications shared was the network.  The network was the closest thing to a shared resource pool — the original “cloud”.  With increasing digital transformation, other services were pooled as well, such as storage and database.  However, each application interfaced with these pooled resources and with other applications directly.  Applications had little in common beyond the pooled resources they shared.

As more code was written, the value of pooling common software functions and best practices into a “soft” infrastructure layer became evident.  The role of this software infrastructure was to normalize the many disparate ways of doing common things in application software.  This meant application developers could focus on unique value rather than boilerplate.  It also meant the software infrastructure could make the best use of underlying physical resources on behalf of the applications.  Storage was absorbed into software infrastructure, and eventually even the database.  Software infrastructure was necessary to achieve economies of scale in digital businesses.

Over the past few decades, the emphasis has been on decentralization of network control.  The network's greatest mission was survivability, and the trade-off was other optimizations, such as end-user experience.  However, the challenges of massive scale have created the need for centralized algorithms to ensure good end-user experience, eliminate stranded capacity, improve time to recovery and speed up deployment.  These are the imperatives for achieving economies at hyperscale.  For some problems, like survivability, cooperating peers are best; for others, like bin packing, master-slave is better.
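To make the bin-packing point concrete, here is a minimal sketch (flow names, link names and capacities are invented for illustration) of the kind of placement decision that is trivial for a central controller with a global view, but awkward for cooperating peers that each see only a slice of the picture.

```python
# Minimal sketch: centralized "first-fit decreasing" placement of flow demands
# onto link capacities.  A controller that sees all demands at once can pack
# them to avoid stranded capacity; distributed peers cannot easily do this.

def place_flows(flows_gbps, links_gbps):
    """Assign each flow to the first link with enough spare capacity."""
    spare = dict(links_gbps)                     # link -> remaining capacity
    placement = {}
    for flow, demand in sorted(flows_gbps.items(), key=lambda kv: -kv[1]):
        for link, free in spare.items():
            if free >= demand:
                placement[flow] = link
                spare[link] = free - demand
                break
        else:
            placement[flow] = None               # no fit: stranded demand
    return placement

print(place_flows({"f1": 40, "f2": 30, "f3": 25}, {"l1": 75, "l2": 50}))
# {'f1': 'l1', 'f2': 'l1', 'f3': 'l2'}
```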

So while the network may continue to implement last-resort survivability in a distributed way, the optimizations that require centralization should be, and in the most demanding environments already are, driven by the common software infrastructure layer.

From day one, the network design team at Bloomberg reported into the software infrastructure org.  My boss Franko reported to Chuck Zegar, the godfather of Bloomberg's software infrastructure (and Mike Bloomberg's first employee).  Mike tasked Chuck with engineering Bloomberg's networks.  This first-hand experience in such an organization has led me to believe that software infrastructure is also the ideal place in the org chart for the network engineering role, so that it can develop networks that best serve the business and end users.

Monday, December 24, 2018

Backend security done right


I think of classical firewall-based security as a gated community with a manned front gate, where the homes inside don't have locks on their doors.  Strong locks on strong doors are better, if you ask me.

Traditional firewalls look for patterns in the packet header to determine what action to take on flows that match specified patterns.  I'd equate this to the security guard at the gate letting folks in based on how they look.  If someone looks like a person who belongs in the community, they're let in.  In the same way, if a crafted packet header matches permit rules in the firewall, it is allowed through.
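As a toy illustration of that guard-at-the-gate behavior, here is a small sketch of 5-tuple match-action filtering.  The rule, addresses and ports are invented for the example; the point is that anything that merely looks right gets in.

```python
import ipaddress

# Toy sketch of 5-tuple match-action filtering: if the packet header matches
# a permit rule, the flow is let through, regardless of what is actually
# inside the session.

PERMIT_RULES = [
    # (src_prefix, dst_prefix, protocol, src_port, dst_port); None = any
    ("10.1.0.0/16", "10.2.0.0/16", "tcp", None, 443),
]

def permitted(src_ip, dst_ip, proto, sport, dport):
    for src_pfx, dst_pfx, r_proto, r_sport, r_dport in PERMIT_RULES:
        if (ipaddress.ip_address(src_ip) in ipaddress.ip_network(src_pfx)
                and ipaddress.ip_address(dst_ip) in ipaddress.ip_network(dst_pfx)
                and proto == r_proto
                and r_sport in (None, sport)
                and r_dport in (None, dport)):
            return True
    return False

# A crafted packet that merely *looks* like sanctioned traffic gets through.
print(permitted("10.1.5.5", "10.2.9.9", "tcp", 51515, 443))   # True
```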

One might say that’s where app-based firewalls come in.  They go deeper than just the packet header to identify the application that is being transported.  But how do these firewalls know what application a transport session belongs to?  Well, they do pretty much the same thing — look for patterns within transport sessions that hopefully identify unique applications.  Which means that with a little more effort, hackers can embed patterns in illegitimate sessions that fool the firewall.

I suppose there are even more sophisticated schemes to distinguish legitimate flows from illegitimate ones, and they may work reasonably well for well-known apps.  The problem is that the application you wrote (or changed) last week is not "well-known", so you're back to the 5-tuple.

This is among the problems with man-in-the-middle security -- these gatekeepers are just not good enough to keep out the most skilled intruder for long.  Break past the front gate and all the front doors are wide open.  The dollars you expend on this kind of security aren't limited to the cost of the security contraption.  They also include the limitations this security model imposes on application cost, velocity, scale and richness, and the amount of wasted capacity it leaves in your high-speed network fabric.  Clos? Why bother.

One might say that server-based micro-segmentation is the answer to this dilemma.  But what significant difference does it make if we clone the security guard at the gate and put one in front of each door?  The same pattern matching security means the same outcomes.  Am I missing something?

I'm sure there are even more advanced firewalls, but they surely bring with them either significant operational inefficiencies or the risk of collateral damage (like shutting down legitimate apps).  I'm not sure the more advanced companies use them.

In my humble opinion, “zero trust” means that application endpoints must never trust anything in the middle to protect them from being compromised.  Which means that each endpoint application container must know which clients are entitled to communicate with it, and must allow only those clients to connect, provided they can present credentials that prove their true identity and integrity.  Obviously, a framework that facilitates this model is important.  Some implementations of this type of security include Istio, Cilium, Nanosec and Aporeto, although not all of them are SPIFFE-based.
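As a rough sketch of the idea (not the API of any of the frameworks named above), here is what an endpoint that refuses to trust the middle might look like: it requires mutual TLS, then checks the verified client identity against its own entitlement list.  The file names, port, allow-list and SPIFFE-style identity are assumptions for illustration only.

```python
import socket
import ssl

# Hypothetical allow-list of client identities entitled to talk to this app.
ALLOWED_CLIENT_IDS = {"spiffe://example.org/billing-frontend"}

def client_identity(cert):
    """Pull a URI SAN (e.g. a SPIFFE-style ID) out of the verified client cert."""
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "URI":
            return value
    return None

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.verify_mode = ssl.CERT_REQUIRED                 # mutual TLS: client must prove itself
ctx.load_cert_chain("server.pem", "server.key")     # this workload's own identity (assumed files)
ctx.load_verify_locations("trust-bundle.pem")       # CA(s) this workload trusts (assumed file)

with socket.create_server(("0.0.0.0", 8443)) as srv:
    with ctx.wrap_socket(srv, server_side=True) as tls_srv:
        conn, _ = tls_srv.accept()                  # TLS handshake verifies the client cert
        ident = client_identity(conn.getpeercert())
        if ident not in ALLOWED_CLIENT_IDS:
            conn.close()                            # authenticated, but not entitled
        else:
            conn.sendall(b"hello, trusted client\n")
```

The point of the sketch is that the entitlement decision lives at the endpoint itself, keyed on a cryptographically verified identity, rather than on whatever a middlebox can infer from packet patterns.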

Where backend applications universally adopt this security model, in-the-middle firewalls are not needed to ensure security and compliance in backend communication.  Drop malware into the DC and it might as well be outside the gate.  It doesn’t have keys to any of the doors, i.e. the credentials to communicate with secured applications.  Firewalls can now focus on keeping the detritus out at the perimeter —  D/DoS, antivirus, advanced threats, and the like, which don’t involve the management of thousands of marginally effective match-action rules, and all the trade-offs that come with blunt-force security.

Saturday, December 22, 2018

Death by a thousand scripts


Early on in my automation journey, I learned some basic lessons from experiencing what happens when multiple scripting newbies independently unleash their own notions of automation on the network.  I was one of them. 

Our earliest automations were largely hacky shell scripts that spat out snippets of config, which would then be shoved into routers by other scripts.  We're talking the mid-1990s.  It started out fun and exciting, but with more vendors, hardware, use cases, issues, and so on, things went sideways.

We ended up with a large pile of scripts with different owners, often doing similar things to the network in different ways.  Each script was tailored to some narrow task, on specific vendor hardware, in a specific role.  As the folks who wrote the scripts came and went, new scripts of different kinds showed up, while others became orphaned.  Still other scripts ran in the shadows, known only to their creators.  Chasing down script issues was a constant battle.  It was quite the zoo.

We learned quickly that with automation power must come automation maturity. 

The problem was that we were operating based on the needs at hand, without stepping back and building in the bigger picture.  With each need of the day came a new script, or a tweak to an existing one.  Each with different logging, debugging, error handling, etc., if any at all.  No common framework.  Each script more or less a snowflake.  Heaven forbid we should make any change to the underlying network design.  Eventually the scripts became the new burden, in place of the one we had set out to reduce.

Even seasoned developers assigned to network automation make newbie mistakes.  The code is great, but the automation not so great.  The main reason is that they lack the subject matter knowledge to deconstruct the network system accurately, and most likely they were not involved when the network was being designed.

IMO, the best network automation comes from a class of networking people that I refer to as network developers — software developers that are networking subject matter experts.  These folks understand the importance of the network design being simple to automate.  They are core members of the network design team, not outsourced developers.

In my case, I found that if my automation logic reflected the functional building blocks in the network design, I could capture the logic to handle a unique function within corresponding methods of a model-based automation library.  For example, a method to generate a vendor-specific traffic filter from a vendor-independent model, and another method to bind and unbind it on an interface.  Code to handle the traffic filter function was not duplicated elsewhere.  Constructing a network service that is a composite of multiple functions was just a matter of creating a service method that invoked the relevant function methods.
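A heavily simplified sketch of that structure might look like the following.  The class, vendors, configuration syntax and example models are invented for illustration; this is the shape of the idea, not the actual library.

```python
# Sketch of a model-based automation library: one method per network function,
# and service methods that compose function methods.  All names are illustrative.

class FabricAutomation:
    def render_filter(self, vendor, flt):
        """Render a vendor-independent filter model into vendor-specific config."""
        terms = "\n".join(
            f"  allow {t['proto']} to port {t['port']}" for t in flt["terms"])
        if vendor == "vendor_a":
            return f"filter {flt['name']} {{\n{terms}\n}}"
        if vendor == "vendor_b":
            return f"access-list {flt['name']}\n{terms}"
        raise ValueError(f"unsupported vendor: {vendor}")

    def bind_filter(self, device, interface, flt):
        """Attach the rendered filter to an interface on a device."""
        self.push(device, self.render_filter(device["vendor"], flt))
        self.push(device, f"interface {interface} filter {flt['name']}")

    def unbind_filter(self, device, interface, flt):
        self.push(device, f"interface {interface} no filter {flt['name']}")

    def provision_vpn_attachment(self, device, interface, vrf, flt):
        """A service method composed purely of function methods."""
        self.push(device, f"interface {interface} vrf {vrf}")
        self.bind_filter(device, interface, flt)

    def push(self, device, config):
        print(f"[{device['name']}] {config}")    # stand-in for a real device session


# Example use with hypothetical device and filter models.
fa = FabricAutomation()
device = {"name": "edge1", "vendor": "vendor_a"}
flt = {"name": "cust-in", "terms": [{"proto": "tcp", "port": 443}]}
fa.provision_vpn_attachment(device, "ge-0/0/1", "CUST-A", flt)
```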

This approach ensured that my team focused on systematically enhancing the capabilities of a single automation library, rather than replicating the same network manipulations in code over and over in different places.  With the DevOps mindset, new device features were often tested only after incorporating them into the automation library, which meant features were tested with and for automation.  Multiple high-value outcomes derived from a common, coherent automation engine, and a very strong "network automation as network design" philosophy.

There are certainly other equally good (or better) automation models.  For example, Contrail fabric automation takes a role oriented approach.

Let me close before I go too far off on a tangent.  The lesson here — don’t wing automation for too long.  Aim for method, not madness.

Friday, December 7, 2018

Network automation as network design

A very large part of my professional career in network design was spent working on automation.  That journey began in 1996, when a few colleagues and I engineered Bloomberg's first global IP WAN, which evolved into the most recognized (and agile) WAN in the financial services industry.

The automation behind that network started off very basic and, over the years, evolved into a very lean and flexible model-based library core, with various programs (provisioning, health-checking, discovery, etc.) built on top.  This small automation library (less than 15K of OO code) drove a high-function, multi-service network with support for 6+ different NOSes and 100+ unique packet-forwarding FRUs.  That included service-layer functions such as VPNs with dynamic, context-specific filter, QoS and route-policy construction, BGP peering (internal and external), inline VRF NAT, etc., and support for a variety of attachment interfaces such as LAG/MC-LAG and channelized Ethernet/SONET/SDH/PDH interfaces (down to fractional DS1s) with Ethernet, PPP, FR and ATM encapsulations.  And, yes, even xSTP.  It was quite unique, in that I have yet to see some of the core concepts that made it so lean and flexible repeated elsewhere.  We were among the earliest examples of a DevOps culture in networking.

Many lessons were learned over the years spent evolving and fine tuning that network automation engine.  I hope to capture some of those learnings in future blog entries.  In this blog entry, I want to share a perspective on the core foundation of any proper network automation — and that is network architecture and design.

All great software systems start with great software architecture and design.  A key objective of software architecture and design is to achieve a maximum set of high quality and scalable known and yet-to-be-known outcomes, with the least amount of complexity and resources.  Applying paint to the network canvas without tracing an outline of the desired picture doesn’t always get you to the intended outcomes.  Design top down, build bottom up, as they say.

In networking, automation is a means to an end, and the end is the high-quality services to be rendered.  Once we know what network services we expect to deliver over our network, we can identify the building-block functions that are required, along with their key attributes and relationships.  From there, we identify the right technologies to enable these functions such that they are synergistic, simple, scalable and resource-efficient.  The former is network architecture and the latter is network design.
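One hypothetical way to make that architecture step concrete is to capture each building-block function and its key attributes in a small vendor-independent model, and then express a service as a composition of those functions.  The names and fields below are illustrative only.

```python
# Illustrative only: building-block functions as vendor-independent models,
# and a service defined as a composition of those functions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class BgpPeering:                 # one building-block function
    peer_asn: int
    peer_ip: str

@dataclass
class TrafficFilter:              # another building-block function
    name: str
    terms: List[dict] = field(default_factory=list)

@dataclass
class VpnService:                 # a service = a composition of functions
    name: str
    vrf: str
    peerings: List[BgpPeering] = field(default_factory=list)
    filters: List[TrafficFilter] = field(default_factory=list)

svc = VpnService(
    name="partner-vpn",
    vrf="PARTNER",
    peerings=[BgpPeering(peer_asn=65010, peer_ip="192.0.2.1")],
    filters=[TrafficFilter(name="partner-in", terms=[{"proto": "tcp", "port": 443}])],
)
```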

It's only after you arrive at a foundational network architecture that you should start the work on automation.  From then on, automation should coevolve with your network design.  Indeed, the network design must be automatable.  In this regard, one might even look at network automation as an aspect of network architecture and design, since its role is to realize the network design.  Obviously, this means the Bloomberg automation library implements the Bloomberg network design, and not yours (although it might have everything you need).

IMO, great network automation is based on software design that tightly embraces the underlying network design, such that there are corresponding functional equivalents in the automation software for the functions in the network design.  This is what I call model-based automation (as opposed to model-blind automation).  In this sense again, good network automation software design is inclusive of the network design.  

This last assertion is an example of what I spoke of in my previous blog post, of how an enhanced mode system (here the network automation system) should be a natural extension of a base mode system (the network system), such that if the base mode system is knowable, then so is the enhanced mode system.  Which, when done right, makes possible both an intelligent network and an intelligent operator.  This assertion is backed by a very successful implementation on a large, high-function production network.

In conclusion, what I've learned is that network architecture and design is king and, done right, network automation becomes a natural extension of it.  Companies that do not have the resources or know-how to properly marry automation design and network design have a few hard lessons ahead of them in their automation journey.  For some companies, it might make sense to consider network automation platforms such as Contrail's fabric automation, which incorporates standards-based network design that is built into a model-based open-source automation engine.

Wednesday, December 5, 2018

Intelligent Network, Intelligent Operator

Maybe I’m old school, but I’m leery of black box networking.  Especially if critical services are dependent on my network infrastructure.  I wore those shoes for 19 years, so this perspective has been etched into my instinct by real world experiences.  I don’t think I’m alone in this thinking.  

When bad things happen to network users, all eyes are on the physical network team, even when it’s not a physical network problem.  The physical network is guilty until it can be proven otherwise.  So it’s fair that physical network operators are skeptical of technology whose inner workings are unknowable.  Waiting on your network vendor’s technical support team isn’t an option when the CIO is breathing down your neck.  Especially if a mitigation exists that can be acted on immediately.  

That said, there is indeed a limit to human ability.  It becomes increasingly lossy as the number of data points grows.  Moreover, the levels of ability are inconsistent across individuals.  Box-level operations leave it up to the operator to construct an accurate mental model of the network, with its existing and possible states, and this generally results in varying degrees of poor decision making.

To make up for this shortcoming, it's well known by now that the next level in networking is centered around off-box intelligence and control — a theme that goes by the heading "SDN".  However, some forms of off-boxing network intelligence create new problems — such as when the pilots are no longer able to fly the plane.  This is bad news when the autopilot is glitching.



How safe is the intelligent network, if the network operators have been dumbed down?  The public cloud folks know this spells disaster.  So if they don’t buy into the notion of “smart network, dumb operator”, then why should anyone else with a network?  In some industries, it doesn’t take much more than one extended disaster before the “out of business” sign is hanging on the door. 

If your SDN was built by your company’s development team (like maybe at Google), your network operator is probably a network SRE that is very familiar with its code.  They can look under the hood, and if they still need help, their technical support team (the SDN developers) work down the hallway. 

On the other hand, black box SDN is fine and dandy until something fails in an unexpected way.  I suspect that eventually it will — for example, when all the ace developers who built it have moved on to their next startup and are replaced by ordinary folks.  You need to trust that these SDN products have an order of magnitude higher quality than the average network product, since the blast radius is much larger than a single box.  But they too are created by mortals, working at someone else’s company.  The reality is that when the network breaks unexpectedly, you are often left to your own devices. 

So how is the rest of the world supposed to up-level their networking when the only choices seem to be black-box SDN or build-your-own SDN?  (To be very clear, I’m talking primarily about the physical layers of the network.)

Let me tell you what I think.  

IMO, a resilient network has an intelligent on-box "base" mode (OpenFlow does not qualify), which guarantees a basic level of network service availability to critical endpoints at all times.  It implements a consistent distributed control and data plane.  This mode should be based on knowable, open communication standards, implemented by network nodes and programmed declaratively.  Over that should be an off-box "enhanced" mode that builds on top of the base mode to arrive at specific optimization goals (self-driving automation, bandwidth efficiency, minimum latency, end-user experience, dynamic security, network analytics, etc.).  I believe this is how the intelligent network must be designed, and it is consistent with the wisdom in Juniper founder Pradeep Sindhu's statement: "centralize what you can, distribute what you must."

If the enhanced mode system has an issue, turn it off and let the network return safely to the base mode state.  Kind of like the human body — if I get knocked unconscious I’m really happy my autonomic nervous system keeps all the critical stuff ticking while I come back to my senses.  

This also allows the enhanced mode system to focus on optimizations and evolve rapidly, while the base mode system focuses on resilience and might evolve less rapidly.  However, the enhanced mode must be fully congruent with the chosen base mode.  So a first-order decision is choosing the right base mode, since any enhanced mode is built on top of it.
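A minimal sketch of that fail-safe relationship might look like the following.  The controller loop, health checks and method names are assumptions for illustration, not any particular product: the enhanced mode layers optimized state on top of the base mode, and withdraws it whenever its own health cannot be verified, letting the distributed base mode carry the network.

```python
# Illustrative sketch: enhanced mode installs optimizations only while it can
# prove it is healthy; otherwise it withdraws them and the base mode (the
# distributed control plane) keeps the network up.

class BaseModeNetwork:
    """Stand-in for the real network; distributed IGP/BGP keeps it running."""
    def telemetry_fresh(self):
        return True                      # placeholder health signal
    def controller_quorum(self):
        return True                      # placeholder health signal
    def install_overrides(self, paths):
        print("enhanced mode: overrides installed", paths)
    def withdraw_overrides(self):
        print("enhanced mode: overrides withdrawn, base mode carries traffic")

class EnhancedMode:
    def __init__(self, network):
        self.network = network

    def healthy(self):
        # Placeholder checks: telemetry freshness, controller quorum, etc.
        return self.network.telemetry_fresh() and self.network.controller_quorum()

    def compute_optimized_paths(self):
        return {}                        # bandwidth/latency-aware optimization would go here

    def run_once(self):
        if self.healthy():
            self.network.install_overrides(self.compute_optimized_paths())
        else:
            self.network.withdraw_overrides()   # fail back to the base mode

EnhancedMode(BaseModeNetwork()).run_once()
```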

You might be asking by now, what does this have to do with the title?  

My thesis is that if you choose a knowable base mode system, it will naturally lead you to a knowable enhanced mode system.  Said another way, a standards-based base mode system, which is knowable, leads to a knowable enhanced mode system, since the enhanced mode uses the building blocks provided to it by the base mode.  It's not rocket science.

The next important consideration is that the enhanced mode system be designed to give the operator full transparency into its functions and operations, such that the network operator can take control if and when needed (see article).  The best systems will actively increase the understanding of human operators.  This is how we get both the intelligent network and the intelligent operator.  I think this is what pragmatic SDN for the masses would look like.

Wednesday, November 21, 2018

I'm back, again.

It’s been over 3 years since I last shared my thoughts here.  It’s been that long since I left an amazing 19 year journey at Bloomberg, at the helm of the team that developed the financial industry’s most exceptional IP network.  A network that I took great pride in and gave so much of my life for.  I am grateful to have had the opportunity to build what I did there and learn many things along the way.

Three years ago I decided to go from building mission-critical global networks to building network technologies for mission-critical networks with the team at Juniper Networks.  Two very different worlds.  It wasn’t easy, but I have evolved.  For the record, my heart is still that of a network operator.  20+ years at the front lines doesn’t wash off in 3 years.  Someday I hope to go back to being an operator, when I have exhausted my usefulness on this side of the line, or maybe before that. 

One of the perks of being a customer is that I could say whatever was on my mind about networking tech.  Not so as a vendor.  So I chose to stay low and focus on trying to align what was on the truck to what I knew operators like me needed.  I feel that I can be more open now that we’ve reached the bend. 

I've had the opportunity to be in the middle of getting some important things unstuck at Juniper.  The Juniper of 3, 2 and 1 year ago was quite different from the one today, with inertia decreasing along the way.  Over that time I lit the fire on three key pivots towards executing to the needs of DC operators: the pivot to a strong DC focus in the NOS, the solidifying of our network design strategy for [multi-service] DC, and getting a proper DC fabric controller effort in motion.  They're all connected, and together they complete the pieces of the DC fabric puzzle.  Three years in.

My respect to all the Juniper friends that turned opportunities into realities.  I'm a big believer that real change starts at the talent level and, with the right conditions, percolates up and across.  Juniper has talent.  Now with the right game in play and the right organizational alignment at Juniper, good things are set to happen.

They say ideas and talk are cheap without execution, so my key result is having influenced these from a virtual existence into reality, along with their positive impact on Juniper.  Driving big change isn't new to me.  I built the financial industry's most venerated IP network from the ground up, and I have been at the forefront of multiple key shifts in the networking industry, including understanding the need for EVPN and bringing it into existence, which I write about here (the parts I can talk about publicly).

It's almost impossible to get a market to move into a new vision with the push of a button.  Smart companies start from where their customers are.  The most successful tech companies are the ones that have the right vision, and can navigate their customers onto the right path toward that vision.  The right path pays special attention to the "how", and not just the "what" and "why".  The market recognizes a good thing when it sees it.  The one that works best for it.  For a vendor, the right path opens up new opportunities.  This has always been my focus.  From the beginning, EVPN itself was meant to be a stepping stone in a path, not a destination -- "a bridge from here to there".  As I like to say, value lies not in the right destination, but in the right journey.  The destination always changes in tech, so the right vision today merely serves as a good vector today for the inevitably perpetual journey.

Now I hope to resume sharing my perspective here on topics such as the right stages for evolving the network operator role, EVPN in the DC, pragmatic SDN, and other random thoughts that I have the scars to talk about, along with the new insights I have gained over the past 3 years.

I’m not a prolific writer so don’t expect too much all at once.  :-)

Friday, May 15, 2015

Better than best effort — Rise of the IXP

This is the second installment in my Better than Best Effort series.  In my last post I talked about how the incentive models on the Internet keep ISPs from managing peering and backbone capacity in a way that supports reliable communication in the face of the ever-growing volume of rich media content.  If you haven't done so, please read that post before you read this one.

It's clear that using an ISP for business communication comes with the perils associated with the "noisy neighbor" ebb and flow of consumer-related, high-volume data movement.  Riding pipes that are intentionally run hot to keep costs down is a business model that works for ISPs, but not for business users of the Internet.  Even with business Internet service, customers may get a better grade of service within a portion of an ISP's network, but not when their data needs to traverse another ISP of which they are not a customer.  There is no consistent experience, for anyone.

However, there is an evolving solution for avoiding the never-ending battle between ISPs and large consumer content providers.  As the title of this post gives away, the solution is called an Internet eXchange Point (IXP).

IXPs are where the different networks that make up the Internet most often come together.  Without IXPs, the Internet would be separate islands of networks.  IXPs are, in a sense, the glue of the Internet.  From the early days of the Internet, IXPs have been used to simplify connectivity between ISPs, resulting in significant cost savings for them.  Without an IXP, an ISP would need to run separate lines and dedicate network ports for each peer ISP with which it connects.

However, IXPs and ISPs are not distant relatives.  They are, in fact, close cousins.  Here's why.

Both ISPs and IXPs share two fundamental properties.  The first is that they both have a fabric, and the second is that they both have "access" links used to connect customers so they can intercommunicate over this fabric.  The distinction between the two is in the nature of the access interfaces and the fabrics.  ISP fabrics are intended to reach customers that are spread out over a wide geographic area.  An IXP fabric, on the other hand, is fully housed within a single data center facility.  In some cases an IXP fabric is simply a single Ethernet switch.  ISP access links use technology needed to span across neighborhoods, while IXP access links are basically ordinary Ethernet cables that typically run a few dozen meters.  So essentially the distinction between the two is that an ISP is a WAN and an IXP is a LAN.

The bulk of the cost in a WAN is in the laying and maintenance of the wires over geographically long distances.  Correspondingly, the technology used at the ends of those wires is chosen for its ability to wring as much value out of those wires as possible.  The cost of a WAN is significantly higher than that of a LAN with a comparable number of access links.  ISPs need to carefully manage costs, which are much higher per byte.  It is on account of the trade-offs that ISPs make in order to manage these costs that the Internet is often unpredictably unreliable.

So how can IXPs help?

Let's assume that most businesses begin to use IXPs as meet-me points.  Remember that the cost dynamics of operating an IXP are different from those of an ISP.  At each IXP, these business customers can peer with one another and with their favorite ISPs for the cost of a LAN access link.



There are at least a few advantages to this over connecting to an ISP.  Firstly, two business entities communicating via an IXP's [non-blocking] LAN are effectively not sharing the capacity between them with entities that are not communicating with either of them, which makes their experience far more predictable.  This opens the door for them to save capital and operational costs by eliminating the private lines that they may currently have with these other business entities.  Secondly, a business that is experiencing congestion to remote endpoints via an ISP can choose to stop using that ISP by [effectively] disabling BGP routing with it.  This is different from the standard access model used by businesses, in which, if there are problems downstream of one of their ISPs, their control is limited to the much smaller set of ISPs from whom they have purchased access capacity.

The following illustrates a scenario where congestion at a peering point downstream of one of the ISPs used by a business is affecting its ability to reach other offices, partners or customers that are hosted on other ISPs.



In the access model, since BGP cannot signal path quality, traffic is blindly steered over the path with the smallest number of intermediate networks rather than the path with the best performance.  Buying extra access circuits alone to avoid Internet congestion is not a winnable game (more on this in the next part of this series).
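A toy example of why this happens: BGP best-path selection compares attributes like AS_PATH length, and congestion simply isn't one of them.  The ISP names and AS numbers below are made up for illustration.

```python
# Toy illustration (not a real BGP implementation): of the advertised paths,
# standard best-path selection prefers the shortest AS_PATH, with no
# knowledge of congestion along the way.

paths = [
    {"via": "ISP-A", "as_path": [65001, 65100],        "congested": True},
    {"via": "ISP-B", "as_path": [65002, 65200, 65100], "congested": False},
]

best = min(paths, key=lambda p: len(p["as_path"]))
print(best["via"])   # "ISP-A" -- shortest AS_PATH wins; congestion is invisible to BGP
```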

The alternative approach using an IXP would look something like the following.



This illustration shows how being at an IXP creates more direct access to more endpoints at a better price point than buying access lines to numerous ISPs.  You can also see how peering with other business entities locally at an IXP can improve reliability, reduce costs and simplify business-to-business connectivity by combining it with Internet connectivity.

There is an interesting trend occurring within the growing number of managed co-location data centers.  Hosted within many of these co-location data centers are IXPs.  Some managed data center operators, like Equinix, even operate their own IXPs at their data centers.  These data centers are an ideal place for businesses to connect with one another through an IXP without the downsides that come with using consumer-focused ISPs.

This is not to say that the operational capabilities at all IXPs are at the level needed to support large numbers of businesses.  There is work to be done to scale peering in a manner that gives customers minimal configuration burden and maximal control.

There will even be a need for business-focused ISPs that connect business customers at one IXP to business customers connected to the Internet at another IXP.  Although net neutrality prohibits the differentiated treatment of data over the Internet, it does not forbid an ISP or IXP from selecting the class of customer it chooses to serve.  This is much like the difference between a freeway and a parkway.  Parkways do not serve commercial traffic, and so, in a way, they offer a differentiated service to non-commercial traffic.

As the Internet enables new SaaS and IaaS providers to find success by avoiding the high entrance cost of building a private service delivery network, more businesses are turning to the Internet to access their solution providers of choice. The old Internet connectivity model cannot reliably support the growing use of the Internet for business and so a better connectivity model is needed for a reliable Internet. New opportunities await.

In upcoming posts I will discuss additional thoughts on further improving the reliability of communicating over the Internet.