Monday, December 24, 2018

Backend security done right


I think of classical firewall-based security as a gated community with one manned gate, where the homes inside don’t have locks on their doors.  Strong locks on strong doors are better, if you ask me.

Traditional firewalls look for patterns in the packet header to determine what action to take on flows that match specified patterns.  I'd equate this to the security guard at the gate letting folks in based on how they look.  If someone looks like a person who belongs to the community, they're let in.  In the same way, if a crafted packet header matches permit rules in the firewall, it is allowed through.
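
To make that concrete, here is a minimal, hypothetical sketch (the rule and packet structures are made up) of a firewall deciding purely on 5-tuple header fields.  Since the sender controls every field being matched, a crafted packet that presents a permitted 5-tuple is indistinguishable from a legitimate one.

```python
# Hypothetical illustration: a "firewall" that matches only on the 5-tuple.
from dataclasses import dataclass

@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    protocol: str   # "tcp" / "udp"
    src_port: int
    dst_port: int
    payload: bytes  # never inspected below

# Permit rules: (source prefix, destination IP, protocol, destination port).
PERMIT_RULES = [
    ("10.1.", "10.2.0.5", "tcp", 443),   # "web clients" to the app server
]

def firewall_allows(pkt: Packet) -> bool:
    # The decision is based purely on header fields the sender controls.
    for src_prefix, dst_ip, proto, dport in PERMIT_RULES:
        if (pkt.src_ip.startswith(src_prefix)
                and pkt.dst_ip == dst_ip
                and pkt.protocol == proto
                and pkt.dst_port == dport):
            return True
    return False

# A legitimate flow and a crafted one with the same header both get through.
legit = Packet("10.1.0.7", "10.2.0.5", "tcp", 51000, 443, b"GET / HTTP/1.1")
crafted = Packet("10.1.0.7", "10.2.0.5", "tcp", 51001, 443, b"<exploit bytes>")
assert firewall_allows(legit) and firewall_allows(crafted)
```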

One might say that’s where app-based firewalls come in.  They go deeper than just the packet header to identify the application that is being transported.  But how do these firewalls know what application a transport session belongs to?  Well, they do pretty much the same thing — look for patterns within transport sessions that hopefully identify unique applications.  Which means that with a little more effort, hackers can embed patterns in illegitimate sessions that fool the firewall.

I suppose there are even more sophisticated schemes to distinguish legitimate flows from illegitimate ones that may work reasonably well for well-known apps.  The problem is that the application you wrote (or changed) last week is not "well-known", so you’re back to the 5-tuple.

This is among the problems with man-in-the-middle security -- it's just not good enough to keep out the most skilled intruders for long.  Break past the front gate and all the front doors are wide open.  The dollars you expend for this kind of security aren't limited to the cost of the security contraption.  They also include the limitations this security model imposes on application cost, velocity, scale and richness, and the amount of wasted capacity it leaves in your high-speed network fabric.  Clos? Why bother.

One might say that server-based micro-segmentation is the answer to this dilemma.  But what significant difference does it make if we clone the security guard at the gate and put one in front of each door?  The same pattern matching security means the same outcomes.  Am I missing something?

I'm sure there are even more advanced firewalls, but they surely bring with them either significant operational inefficiencies or the risk of collateral damage (like shutting down legitimate apps).  I'm not sure the more advanced companies use them.

In my humble opinion, “zero trust” means that application endpoints must never trust anything in the middle to protect them from being compromised.  Which means that each endpoint application container must know which clients are entitled to communicate with it, and allow a client to connect only if it presents credentials that prove its true identity and integrity.  Obviously a framework that facilitates this model is important.  Some implementations of this type of security include Istio, Cilium, Nanosec and Aporeto, although not all of them are SPIFFE-based.
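
As a rough illustration of what this looks like at an endpoint, here is a hypothetical Python sketch using mutual TLS: the workload demands a client certificate issued under a trust bundle it controls, extracts a SPIFFE-style URI identity from that certificate, and refuses any client not on its own allowlist.  The file paths and identities are placeholders, and real frameworks (SPIFFE/SPIRE, Istio, and the like) add certificate rotation, trust domains and richer policy on top.

```python
# Hypothetical sketch of an endpoint that trusts nothing in the middle:
# it requires mutual TLS and only accepts clients whose certificate
# identity is on its own allowlist.  Paths and identities are placeholders.
import socket
import ssl

ALLOWED_CLIENT_IDS = {
    "spiffe://example.org/payments-frontend",   # example SPIFFE-style IDs
    "spiffe://example.org/billing-batch",
}

def client_identity(cert):
    # Pull a URI SAN out of the validated peer certificate, if present.
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "URI":
            return value
    return None

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.verify_mode = ssl.CERT_REQUIRED                # clients must present a cert
ctx.load_cert_chain("server.crt", "server.key")    # this workload's own identity
ctx.load_verify_locations("trust-bundle.pem")      # CA(s) trusted to issue IDs

with socket.create_server(("0.0.0.0", 8443)) as listener:
    with ctx.wrap_socket(listener, server_side=True) as tls:
        conn, _addr = tls.accept()                 # TLS handshake happens here
        ident = client_identity(conn.getpeercert())
        if ident in ALLOWED_CLIENT_IDS:
            conn.sendall(b"hello, authorized client\n")
        else:
            conn.close()                           # not entitled: refuse, firewall or not
```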

Where backend applications universally adopt this security model, in-the-middle firewalls are not needed to ensure security and compliance in backend communication.  Drop malware into the DC and it might as well be outside the gate.  It doesn’t have keys to any of the doors, i.e. the credentials to communicate with secured applications.  Firewalls can now focus on keeping the detritus out at the perimeter -- D/DoS, antivirus, advanced threats, and the like -- none of which involves managing thousands of marginally effective match-action rules, or the trade-offs that come with blunt-force security.

Saturday, December 22, 2018

Death by a thousand scripts


Early on in my automation journey, I learned some basic lessons from experiencing what happens when multiple scripting newbies independently unleash their own notions of automation on the network.  I was one of them. 

Our earliest automations were largely hacky shell scripts that spat out snippets of config, which would then be shoved into routers by other scripts.  We’re talking mid-1990s.  It started out fun and exciting, but with more vendors, hardware, use cases, issues, and so on, things went sideways.

We ended up with a large pile of scripts with different owners, often doing similar things to the network in different ways.  Each script was tailored to some narrow task, on specific vendor hardware, in a specific role.  As the folks who wrote the scripts came and went, new scripts of different kinds showed up, while others became orphaned.  Still other scripts ran in the shadows, known only to their creators.  Chasing down script issues was a constant battle.  It was quite the zoo.

We learned quickly that with automation power must come automation maturity. 

The problem was that we were operating based on the needs at hand, without stepping back and building for the bigger picture.  With each need of the day came a new script or a tweak to an existing one.  Each with different logging, debugging, error handling, etc., if any at all.  No common framework.  Each script was more or less a snowflake.  Heaven forbid we should make any change to the underlying network design.  Eventually the scripts became the new burden, in place of the one we had set out to reduce with them.

Even seasoned developers assigned to network automation make newbie mistakes.  The code is great, but the automation not so great.  The main reason is that they lack the subject matter knowledge to deconstruct the network system accurately, and most likely they were not involved when the network was being designed.

IMO, the best network automation comes from a class of networking people that I refer to as network developers — software developers that are networking subject matter experts.  These folks understand the importance of the network design being simple to automate.  They are core members of the network design team, not outsourced developers.

In my case, I found that if my automation logic reflected the functional building blocks in the network design, I could capture the logic to handle a unique function within corresponding methods of a model-based automation library.  For example, a method to generate a vendor-specific traffic filter from a vendor-independent model, and another method to bind and unbind it on an interface.  Code to handle the traffic filter function was not duplicated anywhere else.  Constructing a network service that is a composite of multiple functions was just a matter of creating a service method that invoked the relevant function methods.
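
Here is a toy sketch of that shape, with hypothetical names and the vendor rendering reduced to rough stubs (the config snippets are illustrative, not exact vendor syntax): each network function lives in exactly one method of the library, and a service method simply composes function methods.

```python
# Hypothetical sketch of a model-based automation library.  The point is the
# shape: one method per building-block network function, and services that
# are compositions of function methods.  Vendor output is only approximate.
from dataclasses import dataclass, field

@dataclass
class FilterModel:
    """Vendor-independent description of a traffic filter."""
    name: str
    permit_prefixes: list = field(default_factory=list)

class AutomationLibrary:
    def __init__(self, vendor: str):
        self.vendor = vendor

    # --- function methods: one per building-block network function ---

    def render_traffic_filter(self, model: FilterModel) -> str:
        # The only place in the codebase where filter rendering logic lives.
        if self.vendor == "junos":
            terms = "\n".join(
                f"  term t{i} {{ from {{ source-address {p}; }} then accept; }}"
                for i, p in enumerate(model.permit_prefixes))
            return f"firewall filter {model.name} {{\n{terms}\n}}"
        if self.vendor == "ios":
            lines = "\n".join(f" permit ip {p} any" for p in model.permit_prefixes)
            return f"ip access-list extended {model.name}\n{lines}"
        raise ValueError(f"unsupported vendor {self.vendor}")

    def bind_filter(self, interface: str, model: FilterModel) -> str:
        # The only place where filter binding logic lives.
        if self.vendor == "junos":
            return (f"set interfaces {interface} unit 0 family inet "
                    f"filter input {model.name}")
        return f"interface {interface}\n ip access-group {model.name} in"

    # --- service method: a composite of function methods ---

    def provision_customer_edge(self, interface: str, prefixes: list) -> str:
        flt = FilterModel(name=f"{interface}-in", permit_prefixes=prefixes)
        return "\n".join([self.render_traffic_filter(flt),
                          self.bind_filter(interface, flt)])

lib = AutomationLibrary("junos")
print(lib.provision_customer_edge("ge-0/0/1", ["10.1.0.0/16"]))
```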

This approach ensured that my team focused on systematically enhancing the capabilities of a single automation library rather than replicating the same network manipulations in code over and over in different places.  With a DevOps mindset, new device features were often tested only after incorporating them into the automation library, which meant features were tested with and for automation.  Multiple high-value outcomes derived from a common, coherent automation engine, and from a very strong network-automation-as-network-design philosophy.

There are certainly other equally good (or better) automation models.  For example, Contrail fabric automation takes a role-oriented approach.

Let me close before I go too far off on a tangent.  The lesson here — don’t wing automation for too long.  Aim for method, not madness.

Friday, December 7, 2018

Network automation as network design

A very large part of my professional career in network design was spent working on automation.  That journey began in 1996, when a few colleagues and I engineered Bloomberg’s first global IP WAN, which evolved into the most recognized (and agile) WAN in the financial services industry.

The automation behind that network started off very basic, and over the years evolved into a very lean and flexible model-based library core, with various programs (provisioning, health-checking, discovery, etc.) built on top.  This small automation library (less than 15K of OO code) drove a high-function multi-service network with support for 6+ different NOSes and 100+ unique packet-forwarding FRUs.  That included service-layer functions such as VPNs with dynamic, context-specific filter, QoS and route-policy construction, BGP peering (internal and external), inline VRF NAT, and so on, plus support for a variety of attachment interfaces such as LAG/MC-LAG and channelized Ethernet/SONET/SDH/PDH interfaces (down to fractional DS1s) with Ethernet, PPP, FR and ATM encapsulations.  And, yes, even xSTP.  It was unique in that I have yet to see some of the core concepts that made it so lean and flexible repeated elsewhere.  We were among the earliest examples of a DevOps culture in networking.

Many lessons were learned over the years spent evolving and fine tuning that network automation engine.  I hope to capture some of those learnings in future blog entries.  In this blog entry, I want to share a perspective on the core foundation of any proper network automation — and that is network architecture and design.

All great software systems start with great software architecture and design.  A key objective of software architecture and design is to achieve the largest possible set of high-quality, scalable outcomes, both known and yet to be known, with the least complexity and resources.  Applying paint to the network canvas without tracing an outline of the desired picture doesn’t always get you to the intended outcome.  Design top down, build bottom up, as they say.

In networking, automation is a means to an end, the end being the high-quality services to be rendered.  Once we know what network services we expect to deliver over our network, we can identify the building-block functions that are required, along with their key attributes and relationships.  From there we identify the right technologies to enable these functions such that they are synergistic, simple, scalable and resource efficient.  The former is network architecture and the latter is network design.

It’s only after you arrive at a foundational network architecture that you should start the work on automation.  From there, automation should coevolve with your network design.  Indeed, the network design must be automatable.  In this regard, one might even look at network automation as an aspect of network architecture and design, since its role is to realize the network design.  Obviously this means the Bloomberg automation library implements the Bloomberg network design, and not yours (although it might have everything you need).

IMO, great network automation is based on software design that tightly embraces the underlying network design, such that there are corresponding functional equivalents in the automation software for the functions in the network design.  This is what I call model-based automation (as opposed to model-blind automation).  In this sense again, good network automation software design is inclusive of the network design.  

This last assertion is an example of what I spoke of in my previous blog: how an enhanced mode system (here, the network automation system) should be a natural extension of a base mode system (the network system), such that if the base mode system is knowable, then so is the enhanced mode system.  Which, when done right, makes possible both an intelligent network and an intelligent operator.  This assertion is backed by a very successful implementation on a large, high-function production network.

In conclusion, what I've learned is that network architecture and design is king and, done right, network automation becomes a natural extension of it.  Companies that do not have the resources or know-how to properly marry automation design and network design have a few hard lessons ahead of them in their automation journey.  For some companies, it might make sense to consider network automation platforms such as Contrail's fabric automation, which incorporates standards-based network design that is built into a model-based open-source automation engine.

Wednesday, December 5, 2018

Intelligent Network, Intelligent Operator

Maybe I’m old school, but I’m leery of black box networking.  Especially if critical services are dependent on my network infrastructure.  I wore those shoes for 19 years, so this perspective has been etched into my instinct by real world experiences.  I don’t think I’m alone in this thinking.  

When bad things happen to network users, all eyes are on the physical network team, even when it’s not a physical network problem.  The physical network is guilty until it can be proven otherwise.  So it’s fair that physical network operators are skeptical of technology whose inner workings are unknowable.  Waiting on your network vendor’s technical support team isn’t an option when the CIO is breathing down your neck.  Especially if a mitigation exists that can be acted on immediately.  

That said, there is indeed a limit to human ability.  It becomes increasingly lossy as the number of data points grows.  Moreover, the levels of ability are inconsistent across individuals.  Box-level operations leave it up to the operator to construct an accurate mental model of the network, with its existing and possible states, and this generally results in varying degrees of poor decision making.

To make up for this shortcoming, it’s well known by now that the next level in networking is centered around off-box intelligence and control — this theme goes by the heading "SDN".  However, some forms of off-boxing network intelligence create new problems — such as when the pilots are no longer able to fly the plane.  This is bad news when the autopilot is glitching.

How safe is the intelligent network, if the network operators have been dumbed down?  The public cloud folks know this spells disaster.  So if they don’t buy into the notion of “smart network, dumb operator”, then why should anyone else with a network?  In some industries, it doesn’t take much more than one extended disaster before the “out of business” sign is hanging on the door. 

If your SDN was built by your company’s development team (like maybe at Google), your network operator is probably a network SRE who is very familiar with its code.  They can look under the hood, and if they still need help, their technical support team (the SDN developers) works down the hallway.

On the other hand, black box SDN is fine and dandy until something fails in an unexpected way.  I suspect that eventually it will — for example, when all the ace developers who built it have moved on to their next startup and are replaced by ordinary folks.  You need to trust that these SDN products have an order of magnitude higher quality than the average network product, since the blast radius is much larger than a single box.  But they too are created by mortals, working at someone else’s company.  The reality is that when the network breaks unexpectedly, you are often left to your own devices. 

So how is the rest of the world supposed to up-level their networking when the only choices seem to be black-box SDN or build-your-own SDN?  (To be very clear, I’m talking primarily about the physical layers of the network.)

Let me tell you what I think.  

IMO, a resilient network has an intelligent on-box "base" mode (OpenFlow does not qualify), which guarantees a basic level of network service availability to critical endpoints at all times.  It implements a consistent distributed control and data plane.  This mode should be based on knowable, open communication standards, implemented by the network nodes and programmed declaratively.  Over that should be an off-box "enhanced" mode that builds on top of the base mode to achieve specific optimization goals (self-driving automation, bandwidth efficiency, minimum latency, end-user experience, dynamic security, network analytics, etc.).  I believe this is how the intelligent network must be designed, and it is consistent with the wisdom in Juniper founder Pradeep Sindhu's statement "centralize what you can, distribute what you must."

If the enhanced mode system has an issue, turn it off and let the network return safely to the base mode state.  Kind of like the human body — if I get knocked unconscious I’m really happy my autonomic nervous system keeps all the critical stuff ticking while I come back to my senses.  

This also allows the enhanced mode system to focus on optimizations and evolve rapidly, while the base mode system focuses on resilience and might evolve less rapidly.  However, the enhanced mode must be fully congruent with the chosen base mode.  So a first-order decision is choosing the right base mode, since any enhanced mode is built on top of it.
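
A hypothetical sketch of that division of labor, with both control planes reduced to stand-ins: the base mode always produces usable forwarding state, the enhanced mode only layers optimized overrides on top of it, and switching the enhanced mode off simply withdraws those overrides.

```python
# Hypothetical sketch of an enhanced mode layered over a base mode.
# The base mode (distributed, on-box) always yields usable routes; the
# enhanced mode (off-box) only overlays optimizations and can be withdrawn.

class BaseMode:
    """Stand-in for the on-box distributed control plane (e.g. an IGP)."""
    def routes(self):
        return {"10.0.0.0/8": "via shortest IGP path"}

class EnhancedMode:
    """Stand-in for an off-box optimizer (traffic engineering, analytics, ...)."""
    def __init__(self):
        self.enabled = True
    def overrides(self):
        if not self.enabled:
            return {}
        return {"10.0.0.0/8": "via latency-optimized engineered path"}

def effective_forwarding(base, enhanced):
    rib = dict(base.routes())          # the base mode is always there
    rib.update(enhanced.overrides())   # the enhanced mode only refines it
    return rib

base, enhanced = BaseMode(), EnhancedMode()
print(effective_forwarding(base, enhanced))   # optimized paths in use
enhanced.enabled = False                      # autopilot glitching?  switch it off
print(effective_forwarding(base, enhanced))   # falls back safely to the base mode
```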

You might be asking by now, what does this have to do with the title?  

My thesis is that if you choose a knowable base mode system, it will naturally lead you to a knowable enhanced mode system.  Said another way, a standards-based base mode system, which is knowable, leads to a knowable enhanced mode system, since the latter uses the building blocks provided to it by the base mode system.  It's not rocket science.

The next important consideration is that the enhanced mode system must be designed to give the operator full transparency into its functions and operations, such that the network operator can take control if and when needed (see article).  The best systems will actively increase the understanding of human operators.  This is how we get both the intelligent network and the intelligent operator.  I think this is what pragmatic SDN for the masses would look like.